Why Windows Uses UTF-16LE – Explanation

Tags: unicode, utf-8, windows

Whereas most of the Unix/POSIX/etc world uses UTF-8 for text representation, Windows uses UTF-16LE.
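To make the difference concrete, here is a minimal, platform-independent C sketch that prints the raw bytes of one character (U+20AC EURO SIGN, chosen only as an example) in each encoding:

    #include <stdio.h>

    int main(void) {
        /* U+20AC EURO SIGN in each encoding:
           UTF-8:    three bytes, no byte-order concerns
           UTF-16LE: one 16-bit code unit, least-significant byte first */
        const unsigned char utf8[]    = { 0xE2, 0x82, 0xAC };
        const unsigned char utf16le[] = { 0xAC, 0x20 };
        size_t i;

        printf("UTF-8:    ");
        for (i = 0; i < sizeof utf8; i++)    printf("%02X ", utf8[i]);
        printf("\nUTF-16LE: ");
        for (i = 0; i < sizeof utf16le; i++) printf("%02X ", utf16le[i]);
        printf("\n");
        return 0;
    }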

Why is that? Several people have said that the Windows APIs were written before UTF-8 (and even Unicode as we know it) existed (1, 2, 3), so UTF-16 (or, earlier still, UCS-2) was the best encoding available, and that converting the existing APIs to UTF-8 would be a ridiculous amount of work.

But are there any official sources for these two claims? The official MSDN page for Unicode makes it seem like UTF-16 may even be desirable (though I myself don't agree):

These functions use UTF-16 (wide character) encoding, which is the most common encoding of Unicode and the one used for native Unicode encoding on Windows operating systems.
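For context on what "wide character" means in practice: each text-taking Win32 function exists in an A (ANSI code page) variant and a W (UTF-16) variant. A minimal sketch, using MessageBox purely as an example pair:

    #include <windows.h>

    int main(void) {
        /* The W ("wide") variant takes UTF-16 strings (wchar_t, L"" literals).
           The A ("ANSI") variant takes strings in the process's ANSI code
           page and converts them to UTF-16 internally before doing the work. */
        MessageBoxW(NULL, L"Wide (UTF-16) text", L"MessageBoxW", MB_OK);
        MessageBoxA(NULL, "ANSI code-page text", "MessageBoxA", MB_OK);
        return 0;
    }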

Is there any official note (or an engineer who worked on the project) explaining the reasoning behind choosing UTF-16 and why Windows would/would not switch to UTF-8?

Disclaimer: I work for Microsoft.

Best Answer

Windows was one of the first operating systems to adopt Unicode. Back then, there was indeed no UTF-8 yet, and UCS-2 was the most common encoding used for Unicode, so Windows' initial Unicode support was based on UCS-2.

By the time Unicode outgrew UCS-2 and UTF-8 and UTF-16 became more popular, it was too late for Windows to change over to UTF-8 without breaking tons of existing code [1]. UTF-16, however, is backward compatible with UCS-2, so Microsoft was able to switch to UTF-16 with minimal effort and little to no changes to existing user code.
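To see why that switch was cheap: every UCS-2 code unit is already a valid UTF-16 code unit, so text in the Basic Multilingual Plane is bit-for-bit identical in both. Only code points above U+FFFF need new handling, via surrogate pairs. A minimal C sketch of the standard surrogate-pair calculation (the example code point is arbitrary):

    #include <stdio.h>

    int main(void) {
        /* BMP characters (U+0000..U+FFFF) are one 16-bit unit in both UCS-2
           and UTF-16, so old UCS-2 code keeps working. Code points above
           U+FFFF are encoded in UTF-16 as a surrogate pair: */
        unsigned long cp = 0x1F600;           /* U+1F600, outside the BMP */
        unsigned long v  = cp - 0x10000;
        unsigned hi = 0xD800 + (unsigned)(v >> 10);   /* high surrogate */
        unsigned lo = 0xDC00 + (unsigned)(v & 0x3FF); /* low surrogate  */
        printf("U+%05lX -> UTF-16 code units: 0x%04X 0x%04X\n", cp, hi, lo);
        return 0;
    }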

[1] And now, 20-odd years later, in Windows 10, Microsoft is only just beginning to really support UTF-8 at the Win32 API layer. That functionality is still experimental, has to be enabled manually by the user or on a per-application basis via app manifests, and typically requires changes to user code to take advantage of the UTF-8-enabled APIs rather than the UTF-16-based ones.
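As a sketch of what those "changes to user code" have traditionally looked like without the UTF-8 opt-in: a program carrying UTF-8 data converts it explicitly to UTF-16 before calling a W API, typically with MultiByteToWideChar and CP_UTF8:

    #include <windows.h>
    #include <wchar.h>

    int main(void) {
        const char *utf8 = "\xE2\x82\xAC";    /* U+20AC as UTF-8 bytes */

        /* First call: ask how many UTF-16 code units are needed
           (-1 means "input is NUL-terminated"; count includes the NUL). */
        int len = MultiByteToWideChar(CP_UTF8, 0, utf8, -1, NULL, 0);
        if (len == 0) return 1;

        wchar_t buf[16];                      /* ample for this sketch */
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, buf, len);

        wprintf(L"Converted to %d UTF-16 code units (incl. NUL)\n", len);
        return 0;
    }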