UTF-16 Character Encoding – Purpose and Benefits

character-encodingutfutf-16utf-32utf-8

I've never understood the point of UTF-16 encoding. If you need to be able to treat strings as random access (i.e. a code point is the same as a code unit) then you need UTF-32, since UTF-16 is still variable length. If you don't need this, then UTF-16 seems like a colossal waste of space compared to UTF-8. What are the advantages of UTF-16 over UTF-8 and UTF-32 and why do Windows and Java use it as their native encoding?

Best Answer

When Windows NT was designed UTF-16 didn't exist (NT 3.51 was born in 1993, while UTF-16 was born in 1996 with the Unicode 2.0 standard); there was instead UCS-2, which, at that time, was enough to hold every character available in Unicode, so the 1 code point = 1 code unit equivalence was actually true - no variable-length logic needed for strings.

They moved to UTF-16 later, to support the whole Unicode character set; however they couldn't move to UTF-8 or to UTF-32, because this would have broken binary compatibility in the API interface (among the other things).

As for Java, I'm not really sure; since it was released in ~1995 I suspect that UTF-16 was already in the air (even if it wasn't standardized yet), but I think that compatibility with NT-based operating systems may have played some role in their choice (continuous UTF-8 <-> UTF-16 conversions for every call to Windows APIs can introduce some slowdown).


Edit

Wikipedia explains that even for Java it went in the same way: it originally supported UCS-2, but moved to UTF-16 in J2SE 5.0.

So, in general when you see UTF-16 used in some API/Framework it is because it started as UCS-2 (to avoid complications in the string-management algorithms) but it moved to UTF-16 to support the code points outside the BMP, still maintaining the same code unit size.