“UTF-16” is an annoying term, as it has two meanings which are easily confused.
The first meaning is a series of 16-bit code units. Most of these correspond directly to the Unicode code point of the same number; characters outside the Basic Multilingual Plane (U+10000 upwards) are stored as two 16-bit code units, each taken from the surrogate range (U+D800 to U+DFFF).
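To make the surrogate mechanism concrete, here is a minimal sketch in Python (chosen only for illustration; the helper name to_surrogate_pair is made up for this example). It splits a code point above U+FFFF into its two 16-bit values and checks the result against Python's own UTF-16 encoder.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogate values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go in the low surrogate
    return high, low


high, low = to_surrogate_pair(0x1F600)
print(hex(high), hex(low))  # 0xd83d 0xde00

# Cross-check against the built-in big-endian UTF-16 encoder.
encoded = chr(0x1F600).encode("utf-16-be")
print(encoded == high.to_bytes(2, "big") + low.to_bytes(2, "big"))  # True
```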
Many languages use UTF-16 in this sense for internal storage purposes, including as a native string type. This is the usual source of phrases like “.NET (or Java) uses UTF-16 as its default encoding”. .NET accesses the elements of such a UTF-16 string 16 bits at a time (i.e., at the implementation level, as a uint16).
The next thing to consider is the encoding of such a UTF-16 string into linear bytes, for storage in a file or network stream. As always when you store larger numbers into bytes, there are two possible encodings: little-endian or big-endian. So you can use “UTF-16LE”, the little-endian encoding of UTF-16 into bytes, or “UTF-16BE”, the big-endian encoding.
(“UTF-16LE” is the more commonly used. Just to add to the confusion, Windows gives it the deeply misleading and ambiguous encoding name “Unicode”. In reality it is almost always better to use UTF-8 for file storage and network streams than either of UTF-16LE/BE.)
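The difference between the two byte orders is easy to see on a short string. The sketch below uses Python's built-in utf-16-le and utf-16-be codecs; the sample text is an arbitrary choice. It also shows why guessing is dangerous: decoding with the wrong byte order often raises no error and just produces different characters.

```python
text = "A\u20ac"  # 'A' (U+0041) and the euro sign (U+20AC), both in the BMP

little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(little.hex())  # 4100ac20
print(big.hex())     # 004120ac

# Decoding with the wrong byte order silently yields other characters.
wrong = little.decode("utf-16-be")
print([hex(ord(ch)) for ch in wrong])  # ['0x4100', '0xac20']
```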
But if you don't know whether a bunch of bytes contains “UTF-16LE” or “UTF-16BE”, you can use the trick of looking at the first code point to work it out. This code point, the Byte Order Mark (BOM, U+FEFF), is only valid when read one way around (byte-swapped it becomes U+FFFE, which is never a valid character), so you can't mistake one encoding for the other.
This approach, of not caring what byte order you have but using a BOM to signal it, is usually referred to under the encoding name... “UTF-16”.
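A minimal sketch of that BOM check, in Python for illustration. The function name sniff_utf16 is hypothetical; the BOM constants come from the standard codecs module. It deliberately ignores the fact that a UTF-32LE BOM starts with the same two bytes, so treat it as a sketch rather than a robust detector.

```python
import codecs


def sniff_utf16(data):
    """Return (codec_name, payload) based on a leading BOM, or raise."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return "utf-16-le", data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return "utf-16-be", data[2:]
    raise ValueError("no UTF-16 byte order mark found")


data = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
codec, payload = sniff_utf16(data)
print(codec, payload.decode(codec))  # utf-16-be hi
```

Python's own codec named "utf-16" behaves this way when decoding: data.decode("utf-16") reads the BOM, picks the byte order, and drops the BOM from the result.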
So, when someone says “UTF-16”, you can't tell whether they mean a sequence of 16-bit code units, or a sequence of bytes in unspecified order that will decode to one.
(“UTF-32” has the same problem.)
“If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16”
If that's the actual direct quote it is a lie. Constructing a StreamWriter without an encoding argument is explicitly specified to give you UTF-8.
Best Answer
If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.
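As a rough illustration of that property, the sketch below counts 16-bit code units in Python. Python's str is indexed by code point rather than by UTF-16 code unit, so the count is derived from the encoded bytes; utf16_code_units is a made-up helper name. The code unit count is what String.Length in .NET and String.length() in Java report.

```python
def utf16_code_units(text):
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


bmp_only = "caf\u00e9"       # four characters, all in the BMP
with_astral = "a\U0001F600"  # 'a' plus U+1F600, which is outside the BMP

# (code points, UTF-16 code units)
print(len(bmp_only), utf16_code_units(bmp_only))        # 4 4
print(len(with_astral), utf16_code_units(with_astral))  # 2 3
```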
Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!
Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.
The two disadvantages of UTF-16 are:
(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)
Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling on Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)
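The size trade-off between the three encodings can be checked directly. This is a small Python sketch with arbitrarily chosen sample characters; encoded_sizes is a hypothetical helper, and the explicit -le codec names are used so that no BOM is counted.

```python
def encoded_sizes(ch):
    """Byte length of one character in UTF-8, UTF-16 and UTF-32."""
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]
for ch in samples:
    print("U+%04X" % ord(ch), encoded_sizes(ch))
# U+0041 (1, 2, 4)
# U+00E9 (2, 2, 4)
# U+20AC (3, 2, 4)
# U+1F600 (4, 4, 4)
```

Only UTF-32 has the same width for every code point, which is what makes indexing to the nth code point trivial there; UTF-8 is the most compact for ASCII-heavy text.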