C# – Understanding ‘.NET Framework Uses UTF-16 Encoding Standard by Default’


My study guide (for the 70-536 exam) says this twice in the text and encoding chapter, which comes right after the I/O chapter.

All the examples so far are to do with simple file access using FileStream and StreamWriter.

It also says things like "If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16" and "Specify different encodings using Stream constructor overloads".

Never mind the fact that the actual overloads are on the StreamWriter class, but hey, whatever.

I am looking at StreamWriter right now in Reflector and I am certain I can see that the default is actually UTF8NoBOM.
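(In case it helps anyone else: you don't even need Reflector for this part. A minimal sketch that just constructs a StreamWriter with no encoding argument and asks it what it is using, nothing here but standard BCL calls:)

    using System;
    using System.IO;

    class DefaultEncodingCheck
    {
        static void Main()
        {
            using (var ms = new MemoryStream())
            using (var writer = new StreamWriter(ms)) // no encoding specified
            {
                // Reports the encoding the writer is actually using.
                Console.WriteLine(writer.Encoding.EncodingName);         // Unicode (UTF-8)
                Console.WriteLine(writer.Encoding.GetPreamble().Length); // 0, i.e. no BOM
            }
        }
    }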

But none of this is listed in the errata. It's an old book (I checked the errata of both editions), so if it was wrong I would have thought someone had picked up on it by now…

Makes me think maybe I didn't understand it.

So… any ideas what it is talking about? Some other place where there is a default?

It's just totally confused me.

Best Answer

“UTF-16” is an annoying term, as it has two meanings which are easily confused.

The first meaning is a series of 16-bit code units. Most of these correspond directly to the Unicode character of the same number; characters outside the Basic Multilingual Plane (U+10000 upwards) are stored as two 16-bit code units, a surrogate pair.

Many languages use UTF-16 in this sense for internal storage purposes, including as a native string type. This is the usual source of phrases like ".NET (or Java) uses UTF-16 as its default encoding". .NET accesses the elements of such a UTF-16 string 16 bits at a time (i.e., at the implementation level, as a 16-bit unsigned integer, which is exactly what a C# char is).
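To make that first meaning concrete, here is a minimal sketch; since a char is one UTF-16 code unit, a character outside the BMP occupies two of them:

    using System;

    class SurrogatePairs
    {
        static void Main()
        {
            string s = "\U0001D11E"; // MUSICAL SYMBOL G CLEF, U+1D11E, outside the BMP

            Console.WriteLine(s.Length);                   // 2 -> two 16-bit code units
            Console.WriteLine(char.IsHighSurrogate(s[0])); // True
            Console.WriteLine(char.IsLowSurrogate(s[1]));  // True

            // Recombine the surrogate pair into the real code point.
            Console.WriteLine(char.ConvertToUtf32(s, 0).ToString("X")); // 1D11E
        }
    }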

The next thing to consider is the encoding of such a UTF-16 string into linear bytes, for storage in a file or network stream. As always when you store larger numbers into bytes, there are two possible encodings: little-endian or big-endian. So you can use “UTF-16LE”, the little-endian encoding of UTF-16 into bytes, or “UTF-16BE”, the big-endian encoding.
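In .NET those two byte encodings are Encoding.Unicode (UTF-16LE) and Encoding.BigEndianUnicode (UTF-16BE). A quick sketch of the difference:

    using System;
    using System.Text;

    class ByteOrders
    {
        static void Main()
        {
            string s = "A"; // U+0041

            byte[] le = Encoding.Unicode.GetBytes(s);          // UTF-16LE
            byte[] be = Encoding.BigEndianUnicode.GetBytes(s); // UTF-16BE

            Console.WriteLine(BitConverter.ToString(le)); // 41-00
            Console.WriteLine(BitConverter.ToString(be)); // 00-41
        }
    }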

(“UTF-16LE” is the more commonly used. Just to add more confusion to the flames, Windows gives it the deeply misleading and ambiguous encoding name “Unicode”. In reality it is almost always better to use UTF-8 for file storage and network streams than either of UTF-16LE/BE.)

But if you don't know whether a bunch of bytes contains "UTF-16LE" or "UTF-16BE", you can use the trick of looking at the first code point to work it out. This code point, the Byte Order Mark (BOM, U+FEFF), is only valid when read one way around; byte-swapped it would be U+FFFE, which is guaranteed never to be a character, so you can't mistake one encoding for the other.

This approach, of not caring what byte order you have but using a BOM to signal it, is usually referred to under the encoding name... “UTF-16”.
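A sketch of both halves of that: UnicodeEncoding.GetPreamble() shows the two BOMs, and StreamReader (with BOM detection switched on, which is its default) uses one to work out the byte order for itself. Only standard BCL calls here.

    using System;
    using System.IO;
    using System.Text;

    class BomDemo
    {
        static void Main()
        {
            // GetPreamble() returns the BOM an encoding writes at the start of a stream.
            var le = new UnicodeEncoding(false, true); // little-endian, with BOM
            var be = new UnicodeEncoding(true, true);  // big-endian, with BOM

            Console.WriteLine(BitConverter.ToString(le.GetPreamble())); // FF-FE
            Console.WriteLine(BitConverter.ToString(be.GetPreamble())); // FE-FF

            // Write a big-endian stream (BOM first), then let StreamReader
            // work the byte order out from that BOM.
            var ms = new MemoryStream();
            byte[] bom = be.GetPreamble();
            byte[] body = be.GetBytes("hi");
            ms.Write(bom, 0, bom.Length);
            ms.Write(body, 0, body.Length);
            ms.Position = 0;

            using (var reader = new StreamReader(ms, Encoding.Unicode, true))
            {
                Console.WriteLine(reader.ReadToEnd()); // hi
                // The reader switched to big-endian after seeing FE FF:
                Console.WriteLine(reader.CurrentEncoding.Equals(Encoding.BigEndianUnicode)); // True
            }
        }
    }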

So, when someone says “UTF-16”, you can't tell whether they mean a sequence of short-int Unicode code points, or a sequence of bytes in unspecified order that will decode to one.

(“UTF-32” has the same problem.)

"If you don't know what encoding to use when you create a file, don't specify one and .NET will use UTF16"

If that's the actual direct quote, it is a lie. Constructing a StreamWriter without an encoding argument is explicitly specified to give you UTF-8.
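And you can watch it do exactly that. A minimal sketch (the temp file is just for illustration) that writes one non-ASCII character through a default-constructed StreamWriter and dumps the raw bytes:

    using System;
    using System.IO;

    class DefaultIsUtf8
    {
        static void Main()
        {
            string path = Path.GetTempFileName();

            using (var writer = new StreamWriter(path)) // no encoding specified
            {
                writer.Write("\u00E9"); // é, U+00E9
            }

            byte[] bytes = File.ReadAllBytes(path);
            Console.WriteLine(BitConverter.ToString(bytes)); // C3-A9 -> UTF-8, and no BOM

            File.Delete(path);
        }
    }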
