C# .NET – Why Use UTF-16 for Strings and UTF-8 for Files

.netc++stringutf-16utf-8

From here

Essentially, string uses the UTF-16 character encoding form

But when saving vs StreamWriter :

This constructor creates a StreamWriter with UTF-8 encoding without a
Byte-Order Mark (BOM),

I've seen this sample (broken link removed):

enter image description here

And it looks like utf8 is smaller for some strings while utf-16 is smaller in some other strings.

  • So why does .net use utf16 as default encoding for string and utf8 for saving files?

Thank you.

p.s. Ive already read the famous article

Best Answer

If you're happy ignoring surrogate pairs (or equivalently, the possibility of your app needing characters outside the Basic Multilingual Plane), UTF-16 has some nice properties, basically due to always requiring two bytes per code unit and representing all BMP characters in a single code unit each.

Consider the primitive type char. If we use UTF-8 as the in-memory representation and want to cope with all Unicode characters, how big should that be? It could be up to 4 bytes... which means we'd always have to allocate 4 bytes. At that point we might as well use UTF-32!

Of course, we could use UTF-32 as the char representation, but UTF-8 in the string representation, converting as we go.

The two disadvantages of UTF-16 are:

  • The number of code units per Unicode character is variable, because not all characters are in the BMP. Until emoji became popular, this didn't affect many apps in day-to-day use. These days, certainly for messaging apps and the like, developers using UTF-16 really need to know about surrogate pairs.
  • For plain ASCII (which a lot of text is, at least in the west) it takes twice the space of the equivalent UTF-8 encoded text.

(As a side note, I believe Windows uses UTF-16 for Unicode data, and it makes sense for .NET to follow suit for interop reasons. That just pushes the question on one step though.)

Given the problems of surrogate pairs, I suspect if a language/platform were being designed from scratch with no interop requirements (but basing its text handling in Unicode), UTF-16 wouldn't be the best choice. Either UTF-8 (if you want memory efficiency and don't mind some processing complexity in terms of getting to the nth character) or UTF-32 (the other way round) would be a better choice. (Even getting to the nth character has "issues" due to things like different normalization forms. Text is hard...)

Related Question