C# vs Java – Reasons to Prefer UTF-16 Over UTF-8

c++, java, unicode, utf-16, utf-8

Examining the attributes of UTF-16 and UTF-8, I can't find any reason to prefer UTF-16.

However, looking at Java and C#, it seems that strings and chars there default to UTF-16. I figured it might be for historical or performance reasons, but I couldn't find any information.
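For instance, a quick check (just a toy snippet of mine, not anything from the docs) shows that Java's `char` is a UTF-16 code unit, so `length()` counts code units rather than characters:

```java
public class Utf16CodeUnits {
    public static void main(String[] args) {
        // "Aé中😀" written with escapes; the emoji lies outside the BMP
        // and is stored as a surrogate pair (two code units).
        String s = "A\u00E9\u4E2D\uD83D\uDE00";

        System.out.println(s.length());                      // 5 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length())); // 4 actual code points
    }
}
```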

Does anyone know why these languages chose UTF-16? And is there any valid reason for me to do the same?

EDIT: Meanwhile I've also found this answer, which seems relevant and has some interesting links.

Best Answer

East Asian text typically requires less storage in UTF-16 (2 bytes is enough for 99% of East Asian characters) than in UTF-8 (which typically needs 3 bytes per character).

Of course, for Western languages, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of ASCII markup), it's much of a muchness.
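As a rough illustration (a toy comparison with arbitrary sample strings, not a benchmark), you can measure the encoded sizes directly in Java:

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    static void report(String label, String s) {
        int utf8  = s.getBytes(StandardCharsets.UTF_8).length;
        // UTF_16LE is used so the 2-byte BOM doesn't skew the count.
        int utf16 = s.getBytes(StandardCharsets.UTF_16LE).length;
        System.out.printf("%-6s UTF-8: %3d bytes, UTF-16: %3d bytes%n", label, utf8, utf16);
    }

    public static void main(String[] args) {
        report("ASCII", "Hello, world");                    // UTF-8 wins (12 vs 24)
        report("CJK",   "\u4F60\u597D\u4E16\u754C");        // "你好世界": UTF-16 wins (8 vs 12)
        report("HTML",  "<p>\u4F60\u597D\u4E16\u754C</p>"); // mixed: roughly a wash
    }
}
```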

Processing UTF-16 in user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in much the same way that combining characters do, so UTF-16 can usually be treated as a fixed-size encoding.
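For example (again just a sketch, with a made-up sample string): for BMP-only text a plain per-char loop already visits one character per code unit, and when a supplementary character does appear you step over the surrogate pair, much as you would skip past a combining character:

```java
public class Utf16Iteration {
    public static void main(String[] args) {
        // "中文😀": the first two characters are BMP, the emoji is a surrogate pair.
        String s = "\u4E2D\u6587\uD83D\uDE00";

        // Naive per-code-unit walk: fine for BMP text, but it splits the surrogate pair.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("code unit %d: U+%04X%n", i, (int) s.charAt(i));
        }

        // Code-point-aware walk: advances by 2 code units across a surrogate pair.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            System.out.printf("code point at %d: U+%04X%n", i, cp);
            i += Character.charCount(cp);
        }
    }
}
```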