Both UTF-8 and its subset ASCII were designed with computing in mind.
Here I'll describe some attributes of these encodings which make them good
candidates for representing strings in your programs.
Given the ubiquity of string processing, it's important to use well-designed encodings when possible. This results in code that is quicker to write, while also usually being faster and less buggy.
ASCII
The venerable 7-bit unibyte ASCII encoding has a bit layout that lends itself to elegant and simple routines. Consider the following attributes:
Simple ordering
Letters are ordered numerically to simplify comparison, which also means one can implement islower() for example with a simple range check:

    int islower (int c) { return c > 0x60 && c < 0x7B; }

This isn't the case for encodings like EBCDIC, which is more aligned with punch cards and so has numerical jumps between some adjacent letters. Also in ASCII, spaces come before letters and numbers, which trivializes sorting right-aligned numbers.
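To illustrate that sorting property, here is a minimal sketch of my own showing that space-padded, right-aligned numbers order correctly under a plain byte-wise comparison, since the pad character (0x20) compares less than every digit:

    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      /* Right aligned in a 3 char field: "  9" sorts before " 10",
         since the space (0x20) compares less than '1' (0x31).  */
      const char *a = "  9";
      const char *b = " 10";
      printf ("%s\n", strcmp (a, b) < 0 ? a : b);  /* prints "  9" */
      return 0;
    }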
Letter case
To change case, one just has to toggle a single bit (0x20), which is simple to implement in keyboard hardware and in software. For example toupper() could be implemented with something like:

    int toupper (int c) { return islower (c) ? c ^ 0x20 : c; }

Even greater efficiencies can be gained with an unconditional &0xDF when doing case-insensitive comparisons in restricted character sets. Multi-byte case changing is an altogether more complicated beast.
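As a sketch of that &0xDF trick (my own illustration, valid only when both strings are known to contain just ASCII letters), a case-insensitive equality check can mask the case bit away unconditionally:

    #include <stdbool.h>

    /* Compare ASCII letter strings, ignoring case.  Assumes both strings
       consist only of the letters A-Z / a-z, where clearing bit 0x20
       cannot collide with other characters.  */
    static bool
    letter_strings_equal (const char *a, const char *b)
    {
      while (*a && *b)
        if ((*a++ & 0xDF) != (*b++ & 0xDF))
          return false;
      return *a == *b;  /* equal only if both ended together */
    }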
Control characters
The control characters like ^M etc. are also within a range, and one can get a printable representation of them by again just toggling a bit:

    int toprint (int c) { return iscntrl (c) ? c ^ 0x40 : c; }
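For instance, a quick demonstration (a sketch of my own) of rendering control characters in the familiar caret notation using that toggle:

    #include <ctype.h>
    #include <stdio.h>

    /* Print a character, using caret notation like ^M for controls.  */
    static void
    put_visible (int c)
    {
      if (iscntrl (c))
        printf ("^%c", c ^ 0x40);  /* '\r' (0x0D) prints as ^M */
      else
        putchar (c);
    }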
Decimal digits
ASCII digits are actually BCD, so one can do arithmetic directly on ASCII strings without conversion to numeric types, which can be both efficient and immune to overflow. For example code using this property, see the getlimits::decimal_ascii_add routine.
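As a sketch of the idea (a simplified variant of my own, not the getlimits code itself), here is a decimal increment that works digit by digit on the ASCII bytes, so the value can exceed any native integer type:

    #include <string.h>

    /* Add 1 to the decimal ASCII number in BUF, in place, working
       directly on the '0'-'9' bytes.  BUF must have room for a carry
       digit, e.g. a leading '0': "0999" -> "1000".  */
    static void
    decimal_ascii_increment (char *buf)
    {
      size_t i = strlen (buf);
      while (i--)
        {
          if (buf[i] < '9')
            {
              buf[i]++;        /* no carry, done */
              return;
            }
          buf[i] = '0';        /* 9 + 1 = 10: write 0, carry left */
        }
    }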
UTF-8
Generality
UTF-8 can represent any Unicode character, unlike all unibyte encodings and some multi-byte encodings. To do this it uses between 1 and 4 bytes, structured as follows (the fixed marker bits are shown literally; the remaining positions, labelled 1-9, 0, A-K, carry the code point bits):

    unicode range | # chars | UTF-8 bit pattern
    00-7F         | 128     | 01234567
    0080-07FF     | 2K      | 11012345 1067890A
    0800-FFFF     | 63.5K   | 11101234 10567890 10ABCDEF
    010000-10FFFF | 1M      | 11110123 10456789 100ABCDE 10FGHIJK
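One useful consequence of this layout (a sketch of my own, assuming validated input) is that the length of a sequence can be read off the lead byte alone:

    /* Return the number of bytes in the UTF-8 sequence starting with
       lead byte C, or 0 if C is not a valid lead byte (e.g. a 10xxxxxx
       continuation byte).  Assumes otherwise well formed input.  */
    static int
    utf8_sequence_length (unsigned char c)
    {
      if (c < 0x80) return 1;   /* 0.......                 */
      if (c < 0xC0) return 0;   /* 10...... (continuation)  */
      if (c < 0xE0) return 2;   /* 110.....                 */
      if (c < 0xF0) return 3;   /* 1110....                 */
      if (c < 0xF8) return 4;   /* 11110...                 */
      return 0;                 /* never valid in UTF-8     */
    }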
Compatibility
UTF-8 is backwards compatible with ASCII, which matches the single-byte range above. Also, since the bytes of multi-byte UTF-8 sequences never overlap with the ASCII range, it's largely compatible with traditional C string processing functions like strcmp(), strchr(), strstr(), or even strtok() if the delimiters are in the ASCII range. There is more discussion of this in the "searching" section below.
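For example, here is a small sketch showing strtok() splitting a UTF-8 string on an ASCII delimiter; it is safe because continuation bytes are all >= 0x80 and so can never be mistaken for the ',' delimiter:

    #include <stdio.h>
    #include <string.h>

    int
    main (void)
    {
      /* UTF-8 text with multi-byte characters between ASCII commas.  */
      char buf[] = "caf\xC3\xA9,na\xC3\xAFve,r\xC3\xB4le";
      for (char *tok = strtok (buf, ","); tok; tok = strtok (NULL, ","))
        printf ("%s\n", tok);   /* café / naïve / rôle */
      return 0;
    }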
Popularity
Due to the increasing popularity of UTF-8, robust and efficient processing routines are available. Unfortunately these weren't defined or standardised when the encoding was, but libraries like libunistring and ICU can now be used. Another aspect of the growing popularity of UTF-8 is that data needs to be converted to and from other encodings less often, thus further increasing efficiency.
Buffering
Due to the self-synchronizing nature of UTF-8, it can be read in very efficiently. One can read a chunk into a buffer, and just look at the last 3 bytes to know how many to keep for appending to the start of the next buffer to process. An example of efficiently streaming UTF-8 (with node.js) demonstrates this technique.
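To make the technique concrete, here is a minimal sketch (my own illustration, in C rather than node.js) that inspects at most the last 3 bytes of a buffer for the start of an incomplete sequence, returning how many trailing bytes to carry over to the next read:

    #include <stddef.h>

    /* Return how many bytes at the end of BUF (of length LEN) form an
       incomplete UTF-8 sequence and so should be prepended to the next
       buffer.  At most the last 3 bytes ever need inspecting, since a
       sequence is at most 4 bytes long.  Assumes otherwise valid UTF-8.  */
    static size_t
    utf8_carry_over (const unsigned char *buf, size_t len)
    {
      size_t look = len < 3 ? len : 3;
      for (size_t i = 1; i <= look; i++)
        {
          unsigned char c = buf[len - i];
          if (c < 0x80)
            return 0;                   /* ASCII: nothing pending */
          if (c >= 0xC0)                /* lead byte found */
            {
              size_t need = c >= 0xF0 ? 4 : c >= 0xE0 ? 3 : 2;
              return i < need ? i : 0;  /* carry if too few bytes follow */
            }
          /* else 10xxxxxx continuation byte: keep scanning backwards */
        }
      return 0;  /* only reachable for a complete 4 byte sequence */
    }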
Searching
Due to the uniqueness of each character's byte sequence (the encoding of one character never occurs within that of another), one can very easily search, unlike in other multi-byte encodings like SJIS. Consider the common task of searching for a character in a string, which can be done for any encoding using something like the following:

    char *
    mbschr (char const *s, char const *c)
    {
      /* GB18030 is the most restrictive for the 0x30 optimization below.  */
      if (MB_CUR_MAX == 1 || (unsigned char) *c < 0x30)
        return strchr (s, *c);
      else if (in_utf8_locale)
        return (unsigned char) *c < 0x80 ? strchr (s, *c) : strstr (s, c);
      else
        return resort_to_multi_byte_iteration (s, c);
    }

Notice how only strchr() and strstr() are needed for UTF-8, which as well as being simple to code, will be well tuned for your platform. One could even optimize further using the well-designed attributes of UTF-8.
Summary
As demonstrated in the example above, it's worth special-casing UTF-8 processing in your program, as it's usually simple to do, and can greatly speed up processing when in UTF-8 locales. It may even be worthwhile in all locales to transform/validate strings to an internal UTF-8 representation, to simplify or speed up subsequent processing.
© Jul 30 2010