The standard defines
- basic source character set
- basic execution character set and its wide char counterpart
It also defines 'execution character set' and its wide char counterpart as follows:
$2.2/3- "The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific."
Q1. I don't think I understand this completely, particularly the last statement. Any pointers on this aspect?
Further,
$3.9.1 – "Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set."
Q2. In 3.9.1, does the phrase 'basic character set' mean 'basic execution character set'?
Best Answer
You need to distinguish between the source character set, the execution character set, the wide execution character set, and their basic versions:
The basic source character set:
This character set has exactly 96 characters. They fit into 7 bits. Characters like `@` are not included.
The binary representations of the basic source characters can be completely arbitrary; there is no need for them to correspond to ASCII values.
The basic execution character set …
As stated, the basic execution character set contains all members of the basic source character set. It still doesn't include any other character like `@`. The basic execution character set can have a different binary representation.
As stated, the basic execution character set also contains representations for alert, backspace, carriage return, and a null character.
If the basic execution character set needed 11 bits per character (a purely hypothetical example), the char data type would have to be large enough to store 11 bits, but it may be larger.
… and the basic execution wide character set:
The basic execution wide character set is used for wide characters (wchar_t). It is basically the same as the basic execution character set but can have different binary representations as well.
The only fixed member is the null character, which needs to be a sequence of `0` bits.
Converting between basic character sets:
When a C++ source file is compiled, each character of the source character set is converted into the basic execution (wide) character set.
Example:
Since `string0` is a normal (narrow) character string, it will be converted to the basic execution character set, and `string1` will be converted to the basic execution wide character set.
Something about file encodings:
There are several kinds of file encodings. For example, `ASCII`, which is 7 bits long, and `Windows-1252`, which is 8 bits long (known as `ANSI`). `ASCII` doesn't contain non-English characters. `ANSI` contains some European characters like ä Ö ä Õ ø.
Newer file encodings like `UTF-8` or `UTF-32` can contain characters of any language. `UTF-8` characters are variable in length. `UTF-32` characters are 32 bits long.
File encoding requirements:
Most compilers offer a command-line switch to specify the file encoding of the source file.
A C++ source file needs to be encoded in a file encoding which has a representation of the basic source character set. For example: The file encoding of the source file needs to have a representation of the `;` character. If you can't type the character `;` within the encoding chosen as the encoding of the source file, that encoding is not suitable as a C++ source file encoding.
Non-basic character sets:
Characters not included in the basic source character set belong to the source character set. The source character set is equivalent to the file encoding.
For example: the `@` character is not included in the basic source character set, but it may be included in the source character set. The chosen file encoding of the input source file might contain a representation of `@`. If it doesn't contain a representation for `@`, you can't use the character `@` within strings.
Characters not included in the basic (wide) character set belong to the execution (wide) character set.
Remember that the compiler converts the characters from the source character set to the execution character set and the execution wide character set. Therefore there needs to be a way to convert these characters.
For example: If you specify `Windows-1252` as the encoding of the source character set and specify `ASCII` as the execution wide character set, there is no way to convert a string that contains characters outside ASCII, such as ä or ø. These characters cannot be represented in `ASCII`.
Specifying character sets:
Here are some examples of how to specify the character sets using gcc. The default values are included.
With UTF-8 and UTF-32 as the default encodings, C++ source files can contain strings with characters of any language. UTF-8 characters can be converted both ways without problems.
The extended character set:
Multibyte characters are longer than a normal character, that is, they occupy more than one char. They contain an escape sequence marking them as multibyte characters.
Multibyte characters are processed according to the locale set in the user's runtime environment. These multibyte characters are converted at runtime to the encoding set in the user's environment.