c++ - Check for invalid UTF8 -


I am converting from UTF-8 format to the actual value in hex. However, there are some invalid sequences of bytes that I need to catch. Is there a quick way to check whether a character doesn't belong in UTF-8 in C++?

Follow the tables in the Unicode Standard, chapter 3. (I used the Unicode 5.1.0 version of the chapter (p103); it was Table 3-7 on p94 of the Unicode 6.0.0 version, was on p95 in the Unicode 6.3 version, and is on p125 of the Unicode 8.0.0 version.)

Bytes 0xC0, 0xC1, and 0xF5..0xFF cannot appear in valid UTF-8. The valid sequences are documented; all others are invalid.
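As a quick first filter, the never-valid bytes can be rejected with a single predicate. This is only a minimal sketch: the function name `is_never_valid_utf8_byte` is made up for illustration, and passing this check does not make a sequence well-formed; it only catches bytes that are wrong in every position.

```cpp
#include <cstdint>

// Hypothetical helper (name chosen for this example).
// True for byte values that cannot occur anywhere in well-formed UTF-8:
// 0xC0, 0xC1 and 0xF5..0xFF.
constexpr bool is_never_valid_utf8_byte(std::uint8_t b)
{
    return b == 0xC0 || b == 0xC1 || b >= 0xF5;
}
```

A byte such as 0x80 passes this filter even though it is invalid as a first byte, so the per-position ranges still have to be checked.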

Table 3-7. Well-Formed UTF-8 Byte Sequences

Code Points          First Byte   Second Byte   Third Byte   Fourth Byte
U+0000..U+007F       00..7F
U+0080..U+07FF       C2..DF       80..BF
U+0800..U+0FFF       E0           A0..BF        80..BF
U+1000..U+CFFF       E1..EC       80..BF        80..BF
U+D000..U+D7FF       ED           80..9F        80..BF
U+E000..U+FFFF       EE..EF       80..BF        80..BF
U+10000..U+3FFFF     F0           90..BF        80..BF       80..BF
U+40000..U+FFFFF     F1..F3       80..BF        80..BF       80..BF
U+100000..U+10FFFF   F4           80..8F        80..BF       80..BF

Note that the irregularities are in the second byte, for certain ranges of values of the first byte. The third and fourth bytes, when needed, are consistent. Note also that not every code point within the ranges identified as valid has been allocated (and some are explicitly 'non-characters'), so there is more validation needed still.
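One way to turn the table into code is to pick the sequence length and the allowed second-byte range from the first byte, then require 80..BF for any remaining bytes. The following is a sketch under that reading of Table 3-7; the names `in_range` and `is_well_formed_utf8` are invented for this example, and it checks well-formedness only, not whether a code point is allocated or a non-character.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if b lies in the inclusive range [lo, hi].
static bool in_range(std::uint8_t b, std::uint8_t lo, std::uint8_t hi)
{
    return b >= lo && b <= hi;
}

// Hypothetical validator: true if data[0..len) contains only byte
// sequences listed in Table 3-7.
bool is_well_formed_utf8(const unsigned char *data, std::size_t len)
{
    std::size_t i = 0;
    while (i < len) {
        const std::uint8_t b0 = data[i];
        std::size_t extra = 0;              // bytes that follow the first byte
        std::uint8_t lo = 0x80, hi = 0xBF;  // allowed range for the second byte

        if (b0 <= 0x7F)                    { extra = 0; }
        else if (in_range(b0, 0xC2, 0xDF)) { extra = 1; }
        else if (b0 == 0xE0)               { extra = 2; lo = 0xA0; }
        else if (in_range(b0, 0xE1, 0xEC)) { extra = 2; }
        else if (b0 == 0xED)               { extra = 2; hi = 0x9F; }
        else if (in_range(b0, 0xEE, 0xEF)) { extra = 2; }
        else if (b0 == 0xF0)               { extra = 3; lo = 0x90; }
        else if (in_range(b0, 0xF1, 0xF3)) { extra = 3; }
        else if (b0 == 0xF4)               { extra = 3; hi = 0x8F; }
        else return false;  // 80..BF, C0, C1 or F5..FF as a first byte

        if (len - i - 1 < extra)
            return false;   // sequence is cut off by the end of the input
        if (extra >= 1 && !in_range(data[i + 1], lo, hi))
            return false;   // second byte outside the row's range
        for (std::size_t k = 2; k <= extra; ++k)
            if (!in_range(data[i + k], 0x80, 0xBF))
                return false;  // third and fourth bytes are always 80..BF
        i += extra + 1;
    }
    return true;
}
```

Called as `is_well_formed_utf8(buf, n)` on a buffer of `n` bytes, it stops at the first ill-formed sequence. Returning the offset of the failure instead of a bool would be a natural variation if the caller needs to report where the input went wrong.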

The code points U+D800..U+DBFF are the UTF-16 high surrogates and U+DC00..U+DFFF are the UTF-16 low surrogates; they cannot appear in valid UTF-8 (you encode values outside the BMP, the Basic Multilingual Plane, directly in UTF-8), which is why that range is marked invalid.
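To see why the ED row stops its second byte at 9F, it may help to apply the three-byte bit layout without any checks. In this illustrative sketch (both function names are made up), `naive_decode3` deliberately skips validation so the effect of the forbidden second bytes A0..BF is visible.

```cpp
#include <cstdint>

// Hypothetical helper: applies the 3-byte UTF-8 bit layout with no
// well-formedness checks at all (for illustration only).
constexpr std::uint32_t naive_decode3(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    return ((b0 & 0x0Fu) << 12) | ((b1 & 0x3Fu) << 6) | (b2 & 0x3Fu);
}

// Hypothetical helper: true for UTF-16 surrogate code points U+D800..U+DFFF.
constexpr bool is_surrogate(std::uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

// ED 9F BF is the last allowed sequence starting with ED: U+D7FF.
static_assert(naive_decode3(0xED, 0x9F, 0xBF) == 0xD7FF, "last code point before surrogates");
// ED A0 80 would land on the first surrogate, U+D800, so it is excluded.
static_assert(is_surrogate(naive_decode3(0xED, 0xA0, 0x80)), "ED A0 80 maps to a surrogate");
```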

The other excluded ranges (initial byte C0 or C1, or initial byte E0 followed by 80..9F, or initial byte F0 followed by 80..8F) are non-minimal encodings. For example, C0 80 would encode U+0000, but that is already encoded by 00, and UTF-8 defines the non-minimal encoding C0 80 as invalid. Also, the maximum Unicode code point is U+10FFFF; UTF-8 encodings starting from F4 90 upwards generate values that are out of range.
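The same point can be checked numerically by decoding the excluded sequences with the raw bit layout. This is a sketch for illustration only: `naive_decode2` and `naive_decode4` are invented names, and they intentionally perform no validation.

```cpp
#include <cstdint>

// Hypothetical helpers: apply the 2-byte and 4-byte UTF-8 bit layouts
// with no well-formedness checks (for illustration only).
constexpr std::uint32_t naive_decode2(std::uint8_t b0, std::uint8_t b1)
{
    return ((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu);
}

constexpr std::uint32_t naive_decode4(std::uint8_t b0, std::uint8_t b1,
                                      std::uint8_t b2, std::uint8_t b3)
{
    return ((b0 & 0x07u) << 18) | ((b1 & 0x3Fu) << 12) |
           ((b2 & 0x3Fu) << 6) | (b3 & 0x3Fu);
}

// C0 and C1 can only reach U+0000..U+007F, which one byte already covers.
static_assert(naive_decode2(0xC0, 0x80) == 0x00, "C0 80 is an overlong U+0000");
static_assert(naive_decode2(0xC1, 0xBF) == 0x7F, "C1 BF is an overlong U+007F");
// F0 80..8F can only reach values that fit in three bytes or fewer.
static_assert(naive_decode4(0xF0, 0x8F, 0xBF, 0xBF) == 0xFFFF, "overlong U+FFFF");
// F4 8F BF BF is the maximum; F4 90 80 80 is the first value past it.
static_assert(naive_decode4(0xF4, 0x8F, 0xBF, 0xBF) == 0x10FFFF, "maximum code point");
static_assert(naive_decode4(0xF4, 0x90, 0x80, 0x80) == 0x110000, "out of range");
```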

