c# - To which character encoding (Unicode version) set does a char object correspond?


To which Unicode character encoding does a char object correspond in:

  • C#

  • Java

  • JavaScript (I know there is no char type, but I'm assuming that the string type is still implemented as an array of Unicode characters)

In general, is there a common convention among programming languages to use a specific character encoding?

Update

  1. I have tried to clarify the question. The changes made are discussed in the comments below.
  2. Re: "What problem are you trying to solve?": I am interested in code generation of language-independent expressions, and the particular encoding of the file is relevant.

I'm not sure I'm answering the question, but let me make a few remarks that might shed some light.

At their core, general-purpose programming languages like the ones we are talking about (C, C++, C#, Java, PHP) do not have a notion of "text", merely of "data". Data consists of sequences of integral values (i.e. numbers). There is no inherent meaning behind those numbers.

The process of turning a stream of numbers into text is one of semantics, and it is left to the consumer to assign the relevant semantics to the data stream.

Warning: I will use the word "encoding", which unfortunately has multiple inequivalent meanings. The first meaning of "encoding" is an assignment of meaning to a number. The semantic interpretation of a number is called a "character". For example, in the ASCII encoding, 32 means "space" and 65 means "capital A". ASCII only assigns meanings to 128 numbers, so every ASCII character can conveniently be represented by a single 8-bit byte (with the top bit set to 0). There are many encodings that assign characters to 256 numbers, thus using one byte per character. In these fixed-width encodings, a text string has as many characters as it takes bytes to represent. There are also other encodings in which characters take a variable number of bytes to represent.
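As a small illustration of this first meaning (my own C# sketch, not part of the original answer): casting a char to an int exposes the number behind the character, and casting back recovers the character.

    using System;

    class AsciiDemo
    {
        static void Main()
        {
            // In the ASCII-compatible range, the number behind the character
            // is the familiar ASCII value.
            Console.WriteLine((int)' ');  // 32 -> "space"
            Console.WriteLine((int)'A');  // 65 -> "capital A"
            Console.WriteLine((char)65);  // A  -> the reverse assignment
        }
    }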

Now, Unicode is also an encoding, i.e. an assignment of meaning to numbers. On the first 128 numbers it is the same as ASCII, but it assigns meanings to (theoretically) 2^21 numbers. Because there are lots of meanings which aren't strictly "characters" in the sense of writing (such as zero-width joiners or diacritic modifiers), the term "codepoint" is preferred over "character". Nonetheless, any integral data type that is at least 21 bits wide can represent one codepoint. Typically one picks a 32-bit type, and this encoding, in which every element stands for one codepoint, is called UTF-32 or UCS-4.
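For illustration, a minimal C# sketch (my own example): char.ConvertFromUtf32 and char.ConvertToUtf32 move between a raw codepoint number and its string form, showing that a codepoint above the 16-bit range still fits comfortably in a 32-bit int.

    using System;

    class CodepointDemo
    {
        static void Main()
        {
            // A codepoint outside the 16-bit range; it still fits in an int.
            int codepoint = 0x1F600;                       // U+1F600, "grinning face"
            string s = char.ConvertFromUtf32(codepoint);
            Console.WriteLine(s.Length);                   // 2 (two UTF-16 code units)
            Console.WriteLine(char.ConvertToUtf32(s, 0));  // 128512 == 0x1F600
        }
    }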

Now we have the second meaning of "encoding": we can take a string of Unicode codepoints and transform it into a string of 8-bit or 16-bit values, thus further "encoding" the information. In this new, transformed form (called a "Unicode transformation format", or "UTF"), we have strings of 8-bit or 16-bit values (called "code units"), but each individual value does not in general correspond to anything meaningful -- it first has to be decoded back into a sequence of Unicode codepoints.
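A minimal C# sketch of this second meaning (my own illustration): System.Text.Encoding produces the code units of several transformation formats for the same codepoint.

    using System;
    using System.Text;

    class UtfDemo
    {
        static void Main()
        {
            string s = "é";  // the single codepoint U+00E9

            // The same codepoint, transformed into different code units.
            byte[] utf8  = Encoding.UTF8.GetBytes(s);     // C3-A9       (two 8-bit code units)
            byte[] utf16 = Encoding.Unicode.GetBytes(s);  // E9-00       (one 16-bit code unit, little-endian)
            byte[] utf32 = Encoding.UTF32.GetBytes(s);    // E9-00-00-00 (one 32-bit codepoint)

            Console.WriteLine(BitConverter.ToString(utf8));
            Console.WriteLine(BitConverter.ToString(utf16));
            Console.WriteLine(BitConverter.ToString(utf32));
        }
    }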

Thus, from a programming perspective, if you want to modify text (not bytes), you should store the text as a sequence of Unicode codepoints. Practically that means you need a 32-bit data type. The char data type in C and C++ is usually 8 bits wide (though that's only a minimum), while in C# and Java it is 16 bits wide. An 8-bit char could conceivably be used to store a transformed UTF-8 string, and a 16-bit char a transformed UTF-16 string, but in order to get at the raw, meaningful Unicode codepoints (and in particular at the length of the string in codepoints) you will have to perform decoding.
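To make the decoding point concrete, a small C# sketch of my own (assuming the string contains a character outside the Basic Multilingual Plane): Length counts 16-bit code units, so counting codepoints or text elements requires decoding.

    using System;
    using System.Globalization;

    class LengthDemo
    {
        static void Main()
        {
            string s = "a\U0001F600b";  // 'a', U+1F600, 'b' -- the emoji needs two code units

            Console.WriteLine(s.Length);                               // 4 UTF-16 code units
            Console.WriteLine(new StringInfo(s).LengthInTextElements); // 3 text elements
        }
    }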

Typically your text processing libraries will be able to do the decoding and encoding for you, so they will happily accept UTF-8 and UTF-16 strings (but at a price). If you want to spare yourself that indirection, store your strings as raw Unicode codepoints in a sufficiently wide type.
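As an illustration of storing raw codepoints (my own sketch, not from the original answer): decode the UTF-16 string once into an int array, after which the length and indexing are in codepoints rather than code units.

    using System;
    using System.Collections.Generic;

    class DecodeDemo
    {
        // Decode a UTF-16 string into raw 32-bit codepoints.
        static int[] ToCodepoints(string s)
        {
            var codepoints = new List<int>();
            for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
                codepoints.Add(char.ConvertToUtf32(s, i));
            return codepoints.ToArray();
        }

        static void Main()
        {
            int[] cps = ToCodepoints("a\U0001F600b");
            Console.WriteLine(cps.Length);              // 3 codepoints
            Console.WriteLine(string.Join(", ", cps));  // 97, 128512, 98
        }
    }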

