Emacs can convert unibyte text to multibyte; it can also convert multibyte text to unibyte, though this conversion loses information. In general these conversions happen when inserting text into a buffer, or when putting text from several strings together in one string. You can also explicitly convert a string's contents to either representation.
Emacs chooses the representation for a string based on the text that it is constructed from. The general rule is to convert unibyte text to multibyte text when combining it with other multibyte text, because the multibyte representation is more general and can hold whatever characters the unibyte text has.
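The combining rule can be observed directly. The following is a minimal sketch that assumes a recent Emacs; the functions `string-to-multibyte` and `multibyte-string-p` are not mentioned in the text above, and `my-concat-representation-demo` is a hypothetical name used only for illustration.

```elisp
;; Sketch only: assumes a recent Emacs that provides
;; `string-to-multibyte' and `multibyte-string-p'.
(defun my-concat-representation-demo ()
  "Return a list (MULTI-P BOTH-P) describing two strings.
MULTI-P tells whether an explicitly converted string is multibyte;
BOTH-P tells whether concatenating it with an ASCII literal is."
  (let* ((plain "abc")                         ; ASCII-only literal
         (multi (string-to-multibyte "def"))   ; forced to multibyte
         (both (concat plain multi)))          ; combined string
    (list (multibyte-string-p multi)
          (multibyte-string-p both))))

(my-concat-representation-demo)
```

Both elements of the result should be `t`, since combining with multibyte text yields multibyte text.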
When inserting text into a buffer, Emacs converts the text to the
buffer's representation, as specified by enable-multibyte-characters
in that buffer. In particular, when you insert multibyte text into a
unibyte buffer, Emacs converts the text to unibyte, even though this
conversion cannot in general preserve all the characters that might be
in the multibyte text. The other natural alternative, to convert the
buffer contents to multibyte, is not acceptable because the buffer's
representation is a choice made by the user that cannot be overridden
automatically.
Converting unibyte text to multibyte text leaves ASCII characters
unchanged, and likewise the codes 128 through 159. It converts the
non-ASCII codes 160 through 255 by adding the value of
nonascii-insert-offset to each character code. By setting this
variable, you specify which character set the unibyte characters
correspond to (see section Character Sets). For example, if
nonascii-insert-offset is 2048, which is
(- (make-char 'latin-iso8859-1) 128), then the unibyte non-ASCII
characters correspond to Latin 1. If it is 2688, which is
(- (make-char 'greek-iso8859-7) 128), then they correspond to Greek
letters.
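The unibyte-to-multibyte rule is plain arithmetic on character codes, so it can be modeled without touching any buffer. This is a sketch of the rule as the text states it, not of Emacs internals; `my-unibyte-code-to-multibyte` is a hypothetical helper, and the offsets 2048 and 2688 are the ones quoted above.

```elisp
;; Hypothetical helper: models the rule stated in the text,
;; not the actual conversion code inside Emacs.
(defun my-unibyte-code-to-multibyte (code offset)
  "Return the multibyte code for unibyte CODE.
Codes 0 through 159 are returned unchanged; OFFSET is added to
codes 160 through 255."
  (if (and (>= code 160) (<= code 255))
      (+ code offset)
    code))

(my-unibyte-code-to-multibyte ?a 2048)   ; ASCII: stays 97
(my-unibyte-code-to-multibyte 140 2048)  ; 128..159: stays 140
(my-unibyte-code-to-multibyte 233 2048)  ; Latin 1 offset: 233 + 2048 = 2281
(my-unibyte-code-to-multibyte 233 2688)  ; Greek offset: 233 + 2688 = 2921
```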
Converting multibyte text to unibyte is simpler: it takes the
logical AND of each character code with 255. If
nonascii-insert-offset has a reasonable value, corresponding to the
beginning of some character set, this conversion is the inverse of the
other: converting unibyte text to multibyte and back to unibyte
reproduces the original unibyte text.
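The round trip can be checked on character codes alone. The sketch below models both directions as the text describes them and tests every code from 160 through 255; the function names are hypothetical. It checks only the Latin 1 offset 2048 quoted earlier; other offsets are outside what this sketch models.

```elisp
;; Hypothetical helpers: a model of the stated rules, not Emacs internals.
(defun my-multibyte-code-to-unibyte (code)
  "Return CODE masked to its low 8 bits, as the text describes."
  (logand code 255))

(defun my-round-trip-ok-p (offset)
  "Return t if every code in 160..255 survives the round trip.
Each code has OFFSET added and is then masked with 255."
  (let ((ok t)
        (code 160))
    (while (<= code 255)
      (unless (= code (my-multibyte-code-to-unibyte (+ code offset)))
        (setq ok nil))
      (setq code (1+ code)))
    ok))

(my-round-trip-ok-p 2048)   ; Latin 1 offset from the text
```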
The value of nonascii-insert-offset also applies when
self-insert-command inserts a character in the unibyte non-ASCII
range, 128 through 255. However, the function insert-char does not
perform this conversion.
The right value to use to select character set cs is
(- (make-char cs) 128). If the value of nonascii-insert-offset is
zero, then conversion actually uses the value for the Latin 1
character set, rather than zero.
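The offset formula and the zero fallback can be sketched as follows. The names are hypothetical, and the base codes 2176 and 2816 are assumptions derived from the offsets quoted in this section (2048 + 128 and 2688 + 128), not values read from a running Emacs.

```elisp
;; Assumed base code for Latin 1, implied by the offset 2048 in the text.
(defconst my-latin-1-base 2176
  "Assumed value of (make-char 'latin-iso8859-1): 2048 + 128.")

(defun my-offset-for-base (base)
  "Return the insert offset for a character set whose first code is BASE."
  (- base 128))

(defun my-effective-offset (offset)
  "Return the offset actually used for conversion.
A zero OFFSET falls back to the Latin 1 value, as the text states."
  (if (zerop offset)
      (my-offset-for-base my-latin-1-base)
    offset))

(my-offset-for-base 2816)   ; assumed Greek base: 2816 - 128 = 2688
(my-effective-offset 0)     ; falls back to 2048
(my-effective-offset 2688)  ; nonzero values are used as given
```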
Another variable provides a more general alternative to
nonascii-insert-offset. You can use it to specify independently how
to translate each code in the range of 128 through 255 into a
multibyte character. The value should be a vector, or nil. If this
is non-nil, it overrides nonascii-insert-offset.
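The precedence between the two mechanisms can be modeled as below. This is a sketch under stated assumptions: the table is taken to be a 256-element vector indexed by the unibyte code, which the text does not specify, and all function names and the target code 3000 are hypothetical.

```elisp
;; Assumption: the table is a 256-element vector indexed by unibyte code.
;; The text says only that the value is a vector or nil.
(defun my-sample-table ()
  "Return an identity table in which code 233 maps to 3000 (arbitrary)."
  (let ((table (make-vector 256 0))
        (i 0))
    (while (< i 256)
      (aset table i i)
      (setq i (1+ i)))
    (aset table 233 3000)
    table))

(defun my-convert-code (code table offset)
  "Translate unibyte CODE to a multibyte code.
A non-nil TABLE overrides OFFSET for codes 128 through 255."
  (cond ((< code 128) code)               ; ASCII is never translated
        (table (aref table code))         ; table takes precedence
        ((>= code 160) (+ code offset))   ; otherwise add the offset
        (t code)))                        ; 128..159 unchanged

(my-convert-code 233 (my-sample-table) 2048)  ; table wins: 3000
(my-convert-code 233 nil 2048)                ; no table: 2281
```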