East Asian languages typically require less storage in UTF-16 (2 bytes is enough for 99% of East-Asian language characters) than UTF-8 (typically 3 bytes is required).
Of course, for Western lanagues, UTF-8 is usually smaller (1 byte instead of 2). For mixed files like HTML (where there's a lot of markup) it's much of a muchness.
Processing of UTF-16 for user-mode applications is slightly easier than processing UTF-8, because surrogate pairs behave in almost the same way that combining characters behave. So UTF-16 can usually be processed as a fixed-size encoding.
In the UTF-16 version, you get 14 bytes because of a marker inserted to distinguish between Big Endian (default) and Little Endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order marker added).
See http://www.unicode.org/faq/utf_bom.html#gen7
EDIT - Use this program to look into the actual bytes generated by different encodings:
public class Test {
public static void main(String args[]) throws Exception {
// bytes in the first argument, encoded using second argument
byte[] bs = args[0].getBytes(args[1]);
System.err.println(bs.length + " bytes:");
// print hex values of bytes and (if printable), the char itself
char[] hex = "0123456789ABCDEF".toCharArray();
for (int i=0; i<bs.length; i++) {
int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
System.err.print(hex[b>>4] + "" + hex[b&0xf]
+ ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
+ ( (i%4 == 3) ? "\n" : " "));
}
System.err.println();
}
}
For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up different), the output is:
$ javac Test.java && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo
And
$ javac Test.java && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.
And
$ javac Test.java && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo
Best Answer
I believe there are a lot of good articles about this around the Web, but here is a short summary.
Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.
Main UTF-8 pros:
Main UTF-8 cons:
Main UTF-16 pros:
char
as the primitive component of the string.Main UTF-16 cons:
In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.