Introduction to Unicode
In the first chapter, I promised to point out any aspects of the C language that play an important role in Windows programming but that you may not have encountered in traditional text-mode programming. Wide character sets and Unicode are just such a topic.
Simply put, Unicode is an extension of the ASCII character set. Strict ASCII represents each character with 7 bits, and the characters commonly used on computers are 8 bits wide; Unicode uses a full 16-bit character set. This allows Unicode to represent all the characters, ideographs, and other symbols used in computer communication for every written language in the world. Unicode was initially intended as a supplement to ASCII and, eventually, as a replacement for it if possible. Considering that ASCII is the most dominant standard in computing, this is an ambitious goal indeed.
Unicode affects every part of the computer industry, but it has perhaps the greatest impact on operating systems and programming languages. In this regard, we are well on our way: Windows NT supports Unicode from the ground up. (Unfortunately, Windows 98 supports Unicode only in a few places.) The C programming language, following the lead of the ANSI standard, supports Unicode through its support of wide character sets. These topics are discussed in detail below.
Naturally, as programmers, we are left with much of the heavy lifting. I have tried to lighten the burden by making all the programs in this book "Unicode-ready." What that means will become clear as this chapter's discussion of Unicode proceeds.
A brief history of character sets
Although no one can be certain when humans began to speak, writing is about 6,000 years old. Early writing was, in fact, pictographic. Alphabets, in which each letter corresponds to a sound, appeared only about 3,000 years ago. Although the world's various written languages served people well, several nineteenth-century inventors saw a need for more. Between 1838 and 1854, Samuel F. B. Morse developed the telegraph, and with it a code in which each letter of the alphabet corresponds to a series of short and long pulses (dots and dashes). In Morse code the letters are not case-sensitive, but the numbers and punctuation marks have codes of their own.
Morse code was not the first instance of representing written language with something other than drawn or printed glyphs. Between 1821 and 1824, the young Louis Braille, inspired by a military system for reading messages at night, invented a code that uses raised dots on paper to help the blind read. Braille is effectively a 6-bit code that encodes characters, common letter combinations, common words, and punctuation. A special escape code indicates that the character codes that follow should be interpreted as uppercase; a special shift code indicates that the codes that follow should be interpreted as numbers.
Telex codes, including Baudot (named after a French engineer who died in 1903) and CCITT #2 (standardized in 1931), are 5-bit codes that include letters and numbers.
The American standard
The character codes used by early computers evolved from Hollerith cards (famously not to be folded, spindled, or mutilated), which were invented by Herman Hollerith and first used in the United States census of 1890. BCDIC (Binary-Coded Decimal Interchange Code), a 6-bit character code derived from the Hollerith codes, was gradually extended in the 1960s into the 8-bit EBCDIC. EBCDIC has remained the standard on IBM mainframes but has seen little use elsewhere.
The American Standard Code for Information Interchange (ASCII) had its origins in the late 1950s and was finalized in 1967. During its development there was much debate over whether characters should be 6, 7, or 8 bits long. For reliability reasons no shift characters could be used, which ruled out a 6-bit code, while the 8-bit option was rejected on grounds of cost (storage was still quite expensive at the time). The final code thus comprises 26 lowercase letters, 26 uppercase letters, 10 digits, 32 symbols, 33 control codes, and a space, for a total of 128 character codes. ASCII is now documented in ANSI X3.4-1986, "Coded Character Sets — 7-Bit American National Standard Code for Information Interchange (7-Bit ASCII)," published by the American National Standards Institute. The ASCII character set shown in Figure 2-1 is similar in format to the table in the ANSI document.
ASCII has many advantages. For example, the codes for the 26 letters are contiguous (which is not true of EBCDIC); uppercase and lowercase letters can be converted to each other by changing a single bit; and the codes for the 10 digits can easily be derived from the digit values themselves. (In BCDIC, the code for the character "0" came after the code for the character "9"!)
Most importantly, ASCII is an extremely reliable standard. In keyboards, video display adapters, system hardware, printers, font files, operating systems, and the Internet, no other standard is as prevalent as ASCII.
      0-   1-   2-   3-   4-   5-   6-   7-
-0    NUL  DLE  SP   0    @    P    `    p
-1    SOH  DC1  !    1    A    Q    a    q
-2    STX  DC2  "    2    B    R    b    r
-3    ETX  DC3  #    3    C    S    c    s
-4    EOT  DC4  $    4    D    T    d    t
-5    ENQ  NAK  %    5    E    U    e    u
-6    ACK  SYN  &    6    F    V    f    v
-7    BEL  ETB  '    7    G    W    g    w
-8    BS   CAN  (    8    H    X    h    x
-9    HT   EM   )    9    I    Y    i    y
-A    LF   SUB  *    :    J    Z    j    z
-B    VT   ESC  +    ;    K    [    k    {
-C    FF   FS   ,    <    L    \    l    |
-D    CR   GS   -    =    M    ]    m    }
-E    SO   RS   .    >    N    ^    n    ~
-F    SI   US   /    ?    O    _    o    DEL
Figure 2-1 The ASCII character set
The international aspect
The biggest problem with ASCII is indicated by the first word of the acronym. ASCII is truly an American standard, and it is not even adequate for other countries where English is spoken. Where, for example, is the British pound sign (£)?
English uses the Latin (or Roman) alphabet. Of the written languages that use the Latin alphabet, English is unusual in rarely requiring accent marks (or diacritics). Even for the few English words that traditionally carry diacritics, such as coöperation or résumé, spellings without the diacritics are perfectly acceptable.
However, in the countries north and south of the United States, and across the Atlantic, diacritics are common in languages that adapted the Latin alphabet to their own pronunciation needs. Travel east or south of Western Europe and you will encounter languages that do not use the Latin alphabet at all, such as Greek, Hebrew, Arabic, and Russian (which uses the Cyrillic alphabet). Go farther east still and you will find the pictographic han characters of China, a system also adopted in Japan and Korea.
The history of ASCII since 1967 is largely a history of attempts to overcome its limitations and make it more applicable to languages other than American English. In 1967, for example, the International Organization for Standardization (ISO) recommended a variant of ASCII in which the codes 0x40, 0x5B, 0x5C, 0x5D, 0x7B, 0x7C, and 0x7D were "reserved for national use," while the codes 0x5E, 0x60, and 0x7E were labeled as usable for diacritical marks when preceded by a backspace. This scheme is of little interest today, but it shows how people tried to accommodate different languages within the code.
Extended ASCII
By the time small computers were being developed, the 8-bit byte had become firmly established. Thus, if a byte is used to store a character, 128 additional characters can be invented to supplement ASCII. When the original IBM PC was introduced in 1981, a 256-character set was burned into the ROM of the video adapter, and it became an important part of the IBM standard.
The original IBM extended character set included some accented characters and a lowercase Greek alphabet (useful for mathematical notation), as well as some block- and line-drawing characters. Additional characters were also assigned to the code positions of the ASCII control characters, because most of those control characters were not used for display.
The IBM extended character set was burned into the ROMs of countless video adapters and printers and was used by many applications to spruce up their text-mode displays. However, this character set could not provide enough accented letters for all the Western European languages that use the Latin alphabet, and it was not well suited to Windows. Windows has no need for the graphics characters because it has a true graphics system.
In Windows 1.0 (released in November 1985), Microsoft did not entirely abandon the IBM extended character set but relegated it to secondary importance. Because it followed a draft ANSI and ISO standard, the native Windows character set was called the "ANSI character set." That draft eventually became ANSI/ISO 8859-1-1987, "American National Standard for Information Processing — 8-Bit Single-Byte Coded Graphic Character Sets — Part 1: Latin Alphabet No. 1," usually abbreviated "Latin 1."
The original version of the ANSI character set, as printed in the Windows 1.0 Programmer's Reference, is shown in Figure 2-2.
      0-   1-   2-   3-   4-   5-   6-   7-   8-   9-   A-   B-   C-   D-   E-   F-
-0    *    *    SP   0    @    P    `    p    *    *    NBSP °    À    Ð    à    ð
-1    *    *    !    1    A    Q    a    q    *    *    ¡    ±    Á    Ñ    á    ñ
-2    *    *    "    2    B    R    b    r    *    *    ¢    ²    Â    Ò    â    ò
-3    *    *    #    3    C    S    c    s    *    *    £    ³    Ã    Ó    ã    ó
-4    *    *    $    4    D    T    d    t    *    *    ¤    ´    Ä    Ô    ä    ô
-5    *    *    %    5    E    U    e    u    *    *    ¥    µ    Å    Õ    å    õ
-6    *    *    &    6    F    V    f    v    *    *    ¦    ¶    Æ    Ö    æ    ö
-7    *    *    '    7    G    W    g    w    *    *    §    ·    Ç    ×    ç    ÷
-8    *    *    (    8    H    X    h    x    *    *    ¨    ¸    È    Ø    è    ø
-9    *    *    )    9    I    Y    i    y    *    *    ©    ¹    É    Ù    é    ù
-A    *    *    *    :    J    Z    j    z    *    *    ª    º    Ê    Ú    ê    ú
-B    *    *    +    ;    K    [    k    {    *    *    «    »    Ë    Û    ë    û
-C    *    *    ,    <    L    \    l    |    *    *    ¬    ¼    Ì    Ü    ì    ü
-D    *    *    -    =    M    ]    m    }    *    *    SHY  ½    Í    Ý    í    ý
-E    *    *    .    >    N    ^    n    ~    *    *    ®    ¾    Î    Þ    î    þ
-F    *    *    /    ?    O    _    o    DEL  *    *    ¯    ¿    Ï    ß    ï    ÿ
* — Not applicable
Figure 2-2 The Windows ANSI character set (based on ANSI/ISO 8859-1)
An empty position indicates that no character is defined there. This is consistent with the final ANSI/ISO 8859-1 definition. ANSI/ISO 8859-1 specifies only graphic characters, not control characters, so DEL is not defined. In addition, the code 0xA0 is defined as a no-break space (meaning a space that a formatting program should not use to break a line), and the code 0xAD is a soft hyphen (meaning a hyphen not displayed unless it falls at the end of a line). ANSI/ISO 8859-1 also defines the code 0xD7 as a multiplication sign (×) and 0xF7 as a division sign (÷). Some fonts in Windows define some of the characters from 0x80 through 0x9F as well, but these are not part of the ANSI/ISO 8859-1 standard.
MS-DOS 3.3 (released in April 1987) introduced the concept of code pages to IBM PC users, a concept Windows also uses. A code page defines a mapping of codes to characters. The original IBM character set became known as code page 437, or "MS-DOS Latin US." Code page 850 is "MS-DOS Latin 1," in which some of the line-drawing characters are replaced by additional accented letters (it is not, however, the same as the Latin 1 ISO/ANSI standard shown in Figure 2-2). Other code pages were defined for other languages. The lowest 128 codes are always the same; the higher 128 codes depend on the language for which the code page is defined.
In MS-DOS, if a user specified a code page for the PC's keyboard, video adapter, and printer and then created, edited, and printed documents on that PC, everything was consistent and everything worked fine. But if the user tried to exchange documents with someone using a different code page, or changed the code page on the machine, problems arose: character codes became associated with the wrong characters. Applications could save code-page information along with their documents in an attempt to reduce these problems, but that strategy involved some work converting between code pages.
Although the code pages originally provided only additional Latin characters beyond the basic accented letters, the upper 128 characters of a code page eventually came to contain complete non-Latin alphabets, such as Hebrew, Greek, and Cyrillic. Naturally, such variety invites code-page chaos: if a few accented letters can be displayed incorrectly, whole passages of text can turn into unreadable gibberish.
All of this explains the proliferation of code pages, but it is worse still: the code pages differ across environments. Cyrillic MS-DOS code page 855 differs from Cyrillic Windows code page 1251 and from Cyrillic Macintosh code page 10007; the code page in each environment is a modification of that environment's standard character set. IBM's OS/2 also supports a variety of EBCDIC code pages.
But wait, it gets worse.
Double-byte character set
So far, we have been looking at character sets of 256 characters. But there are roughly 21,000 ideographs in use in China, Japan, and Korea. How can these languages be accommodated while still maintaining some compatibility with ASCII?
The solution (if that is the right word) is the double-byte character set (DBCS). A DBCS starts off with 256 codes, just like ASCII, and like any well-behaved code page, its lowest 128 codes are ASCII. However, some of the higher 128 codes are always followed by a second byte. The two bytes together (called a lead byte and a trail byte) define a single character, usually a complex ideograph.
Although Chinese, Japanese, and Korean share some of the same ideographs, the three languages are obviously different, and often the same ideograph represents three different things in the three languages. Windows supports four distinct double-byte character sets: code page 932 (Japanese), 936 (Simplified Chinese), 949 (Korean), and 950 (Traditional Chinese). DBCS is supported only in the versions of Windows produced for these countries.
The problem with a double-byte character set is not that characters are represented by 2 bytes. The problem is that some characters (in particular, the ASCII characters) are represented by 1 byte. This creates odd programming problems. For example, the number of characters in a string cannot be determined from the number of bytes in the string: the string must be parsed to determine its length, examining each byte to see whether it is the lead byte of a double-byte character. And given a pointer into the middle of a DBCS string, what is the address of the preceding character? The customary solution is to parse the string starting from the beginning!
The Unicode solution
The basic problem we face is that the world's written languages simply cannot be represented by 256 8-bit codes. The previous solutions, including code pages and DBCS, have proved insufficient and awkward. So what is the real solution?
As programmers, we have all experienced problems like this. If there are too many things to represent with 8-bit values, we try wider values, say 16-bit values. And that, amusingly enough, is exactly how Unicode came about. Instead of the confusion of multiple 256-character code mappings, or double-byte character sets in which some codes are 1 byte and some are 2 bytes, Unicode is a uniform 16-bit system that allows 65,536 characters to be represented. This is sufficient for all the characters and ideographs of all the written languages of the world, including a collection of mathematical, symbol, and currency characters.
It is important to understand the difference between Unicode and DBCS. Unicode uses (particularly in the context of the C programming language) a "wide character set": every character in Unicode is 16 bits wide rather than 8 bits wide, and an 8-bit value by itself has no meaning in Unicode. A double-byte character set, by contrast, still deals in 8-bit values: some bytes define characters by themselves, and some bytes indicate that, together with another byte, they define a single character.
Dealing with DBCS strings is very messy, but working with Unicode text is like working with ordinary text. You will probably be pleased to learn that the first 128 Unicode characters (16-bit codes 0x0000 through 0x007F) are the ASCII characters, and the next 128 Unicode characters (codes 0x0080 through 0x00FF) are the ISO 8859-1 extensions to ASCII. Characters in other sections of Unicode are likewise based on existing standards, for ease of conversion. The codes 0x0370 through 0x03FF are used for Greek letters, 0x0400 through 0x04FF for Cyrillic, 0x0530 through 0x058F for Armenian, and 0x0590 through 0x05FF for Hebrew. The ideographs of Chinese, Japanese, and Korean occupy codes 0x3000 through 0x9FFF.
The biggest advantage of Unicode is that there is only one character set here, with no ambiguity. Unicode is the result of cooperation among virtually every important company in the personal computer industry, and its codes correspond one to one with those of the ISO 10646-1 standard. The essential reference for Unicode is The Unicode Standard, Version 2.0 (Addison-Wesley, 1996). This is an extraordinary book, revealing the richness and diversity of the world's written languages in a way few other documents do. In addition, the book provides the rationale and details behind the development of Unicode.
Are there disadvantages to Unicode? Of course. Unicode strings occupy twice as much memory as ASCII strings. (Compression, however, can greatly reduce the disk space a file occupies.) But perhaps the worst disadvantage is that people are as yet relatively unaccustomed to working with Unicode. As programmers, that is our job.
Wide characters and C
To a C programmer, the idea of a 16-bit character can be downright disconcerting. That a char is the same width as a byte is one of the things we count on most. Few programmers are aware that ANSI/ISO 9899-1990, the "American National Standard Programming Language — C" (also known as ANSI C), supports character sets that require more than one byte per character, through a concept called wide characters. These wide characters coexist nicely with ordinary characters.
ANSI C also supports multibyte character sets, such as those supported by the Chinese, Japanese, and Korean versions of Windows. However, a multibyte character set is treated as a string of single-byte values, some of which alter the meaning of the bytes that follow; multibyte character sets mostly affect the C run-time library functions. Wide characters, by contrast, are uniformly wider than ordinary characters and involve some compile-time issues.
Wide characters are not necessarily Unicode; Unicode is just one possible wide character set. However, because this book focuses on Windows rather than on an abstract view of C implementations, I will tend to use wide characters and Unicode as synonyms.
Character data type
I assume we are all familiar with using the char data type in C programs to define and store characters and strings. But to understand how C handles wide characters, let's first review the ordinary character definitions that might appear in a Win32 program.
The following statement defines and initializes a variable that contains only one character:
char c = 'A';
The variable c requires 1 byte of storage and will be initialized with the hexadecimal value 0x41, which is the ASCII code for the letter A.
You can define a pointer to a string like this:
char * p;
Because Windows is a 32-bit operating system, the pointer variable p requires 4 bytes of storage. You can also initialize the pointer to point to a string:
char * p = "Hello!";
As before, the pointer variable p requires 4 bytes of storage. The string itself is stored in static memory and occupies 7 bytes — 6 bytes for the characters plus 1 byte for the terminating 0.
You can also define a character array like this:
char a[10];
In this case, the compiler reserves 10 bytes for the array. The expression sizeof(a) will return 10. If the array is a global variable (that is, a variable defined outside all functions), you can initialize the character array with the following statement:
char a[] = "Hello!";
If the array is instead defined as a local variable within a function, it must be defined with the static keyword, as follows:
static char a[] = "Hello!";
In either case, the string is stored in static program memory with a 0 appended at the end, requiring 7 bytes of storage.
Wide characters
Unicode, or wide characters, do not change the meaning of the char data type in C. A char continues to denote 1 byte of storage, and sizeof(char) continues to return 1. In theory, a byte in C can be wider than 8 bits, but for most of us a byte (and hence a char) is 8 bits wide.
Wide characters in C are based on the wchar_t data type, which is defined in several header files, including WCHAR.H, like this:
typedef unsigned short wchar_t;
Thus, the wchar_t data type is the same as an unsigned short integer: 16 bits wide.
To define a variable that contains wide characters, use the following statement:
wchar_t c = 'A';
The variable c is initialized with the 16-bit value 0x0041, which is the Unicode representation of the letter A. (However, because Intel microprocessors store multibyte values with the least-significant byte first, the bytes are actually stored in memory in the order 0x41, 0x00. Keep this in mind if you examine a memory dump of Unicode text.)
You can also define a pointer to a wide string:
wchar_t * p = L"Hello!";
Notice the capital letter L (for "long") immediately before the first quotation mark. This tells the compiler to store the string as wide characters — that is, with every character occupying 2 bytes. The pointer variable p requires 4 bytes as usual, but the string now requires 14 bytes: 2 bytes per character, plus 2 bytes for the terminating 0.
Similarly, you can define a wide character array using the following statement:
static wchar_t a[] = L"Hello!";
This string also requires 14 bytes of storage, and sizeof(a) returns 14. Indexing the array a retrieves individual characters: the value of a[1] is the wide character 'e', or 0x0065.
Although it looks more like a typographical quirk, the L before the first quotation mark is very important, and there must be no space between the two symbols. Only with the L does the compiler know to store the string with 2 bytes per character. Later, when we encounter wide strings in contexts other than variable definitions, you will again see the L before the first quotation mark. Fortunately, the C compiler usually issues a warning or an error message if you forget the L.
You can also use the L prefix before single-character literals to indicate that they should be interpreted as wide characters, as follows:
wchar_t c = L'A';
But this is usually unnecessary; the C compiler will zero-extend the character to make it wide anyway.
Wide-character library functions
We all know how to find the length of a string. For example, if we define a string pointer like this:

char * pc = "Hello!";
we can call

iLength = strlen(pc);

and the variable iLength will equal 6, the number of characters in the string.
Great! Now let's try to define a pointer to a wide character:
wchar_t * pw = L"Hello!";

and call strlen again:
iLength = strlen(pw);
Now the trouble begins. First, the C compiler displays a warning message, probably something like this:

'function' : incompatible types - from 'unsigned short *' to 'const char *'
This message means that the strlen function is declared as accepting a pointer to char but is being handed a pointer to unsigned short. You can still compile and run the program, but you will find that iLength equals 1. Why?
The six characters of the string "Hello!" have these 16-bit codes:

0x0048 0x0065 0x006C 0x006C 0x006F 0x0021

which an Intel processor stores in memory like this:

48 00 65 00 6C 00 6C 00 6F 00 21 00

The strlen function, trying to find the length of the string, counts the first byte as a character, but then assumes that the next byte, being 0, marks the end of the string.
This little exercise clearly illustrates the difference between the C language itself and the runtime library functions. The compiler converts the string l "hello!" It is interpreted as a set of short integer data with 16 bits, and stored in the wchar_t array. The compiler also handles array indexes and sizeof operators, so they all work properly, but runtime library functions (such as strlen) are added at link time. These functions assume that the string consists of single-byte characters. When encountering a wide string, the function will not perform as expected.
"Oh, what a bother!" you might say. "Now every C library function has to be rewritten to accept wide characters." Well, not every one — only the functions that take string arguments. And you don't have to rewrite them: they have already been rewritten.
The wide-character version of the strlen function is wcslen ("wide-character string length"), and it is declared both in STRING.H (where strlen is declared) and in WCHAR.H. The strlen function is declared like this:
size_t __cdecl strlen(const char *);
and the wcslen function like this:
size_t __cdecl wcslen(const wchar_t *);
So we now know that to find the length of a wide string, we can call

iLength = wcslen(pw);
The function returns 6, the number of characters in the string. Remember: when you switch to wide characters, the character length of a string does not change — only the byte length changes.
All the familiar C run-time library functions that take string arguments have wide-character versions. For example, wprintf is the wide-character version of printf. These functions are declared in WCHAR.H as well as in the header files where the standard versions are declared.
Maintaining a single source code
There are, of course, disadvantages to using Unicode. First and foremost, every string in your program occupies twice as much storage. In addition, you will observe that the functions in the wide-character run-time library are larger than the usual functions. For this reason, you might want to create two versions of your program — one with ASCII strings and the other with Unicode strings. The best solution is to maintain a single source code file that you can compile for either ASCII or Unicode.
Even for a short program, though, that is a bit of a hassle: the run-time library functions have different names, the characters must be defined differently, and then there is that L problem before string literals.
One answer is to use the TCHAR.H header file included with Microsoft Visual C++. This header file is not part of the ANSI C standard, so every function and macro definition in it is preceded by an underscore. TCHAR.H provides a set of alternative names (such as _tprintf and _tcslen) for the standard run-time library functions that require string arguments. These are sometimes called "generic" function names because they can refer to either the Unicode or the non-Unicode version of a function.
If an identifier named _UNICODE is defined when a program includes the TCHAR.H header file, _tcslen is defined as wcslen:
#define _tcslen wcslen
If _UNICODE is not defined, _tcslen is defined as strlen:
#define _tcslen strlen
TCHAR.H also solves the problem of the two character data types with a new data type named TCHAR. If the _UNICODE identifier is defined, TCHAR is wchar_t:
typedef wchar_t TCHAR;
Otherwise, TCHAR is char:
typedef char TCHAR;
Now let's deal with that L problem in string literals. If _UNI