Java Encoding – Does Java Use UTF-8 or UTF-16?

defaultencodingjavautf-16utf-8

I've already read the following posts:

Now consider the code given below:

public static void main(String[] args) {
    printCharacterDetails("最");
}

public static void printCharacterDetails(String character){
    System.out.println("Unicode Value for "+character+"="+Integer.toHexString(character.codePointAt(0)));
    byte[] bytes = character.getBytes();
    System.out.println("The UTF-8 Character="+character+"  | Default: Number of Bytes="+bytes.length);
    String stringUTF16 = new String(bytes, StandardCharsets.UTF_16);
    System.out.println("The corresponding UTF-16 Character="+stringUTF16+"  | UTF-16: Number of Bytes="+stringUTF16.getBytes().length);
    System.out.println("----------------------------------------------------------------------------------------");
}

When I tried to debug the line character.getBytes() in the code above, the debugger took me into the getBytes() method of String class and then subsequently into the static byte[] encode(char[] ca, int off, int len) method of StringCoding class. The first line of the encode method (String csn = Charset.defaultCharset().name();) returned "UTF-8" as the default encoding during the debugging. I expected it to be "UTF-16".

The output of the program is:

Unicode Value for 最=6700
The UTF-8 Character=最 | Default: Number of Bytes=3

The corresponding UTF-16 Character=� | UTF-16: Number of Bytes=6

When I converted it to UTF-16 explicitly in the program it took 6 bytes to represent the character. Shouldn't it use 2 or 4 bytes for UTF-16? Why 6 bytes were used?

Where am I going wrong in my understanding?
I use Ubuntu 14.04 and the locale command shows the following:

LANG=en_US.UTF-8

Does this mean that JVM decides which encoding to use on the basis of underlying OS or does it use UTF-16 only?
Please help me understand the concept.

Best Answer

Characters are a graphical entity which is part of human culture. When a computer needs to handle text, it uses a representation of those characters in bytes. The exact representation used is called an encoding.

There are many encodings that can represent the same character - either through the Unicode character set, or through other character sets like the various ISO-8859 encodings, or the JIS X 0208.

Internally, Java uses UTF-16. This means that each character can be represented by one or two sequences of two bytes. The character you were using, 最, has the code point U+6700 which is represented in UTF-16 as the byte 0x67 and the byte 0x00.

That's the internal encoding. You can't see it unless you dump your memory and look at the bytes in the dumped image.

But the method getBytes() does not return this internal representation. Its documentation says:

public byte[] getBytes()

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

The "platform's default charset" is what your locale variables say it is. That is, UTF-8. So it takes the UTF-16 internal representation, and converts that into a different representation - UTF-8.

Note that

new String(bytes, StandardCharsets.UTF_16);

does not "convert it to UTF-16 explicitly" as you assumed it does. This string constructor takes a sequence of bytes, which is supposed to be in the encoding that you have given in the second argument, and converts it to the UTF-16 representation of whatever characters those bytes represent in that encoding.

But you have given it a sequence of bytes encoded in UTF-8, and told it to interpret that as UTF-16. This is wrong, and you do not get the character - or the bytes - that you expect.

You can't tell Java how to internally store strings. It always stores them as UTF-16. The constructor String(byte[],Charset) tells Java to create a UTF-16 string from an array of bytes that is supposed to be in the given character set. The method getBytes(Charset) tells Java to give you a sequence of bytes that represent the string in the given encoding (charset). And the method getBytes() without an argument does the same - but uses your platform's default character set for the conversion.

So you misunderstood what getBytes() gives you. It's not the internal representation. You can't get that directly. only getBytes(StandardCharsets.UTF_16) will give you that, and only because you know that UTF-16 is the internal representation in Java. If a future version of Java decided to represent the characters in a different encoding, then getBytes(StandardCharsets.UTF_16) would not show you the internal representation.

Edit: in fact, Java 9 introduced just such a change in internal representation of strings, where, by default, strings whose characters all fall in the ISO-8859-1 range are internally represented in ISO-8859-1, whereas strings with at least one character outside that range are internally represented in UTF-16 as before. So indeed, getBytes(StandardCharsets.UTF_16) no longer returns the internal representation.

Related Solutions

Java – Difference Between UTF-8 and UTF-16

I believe there are a lot of good articles about this around the Web, but here is a short summary.

Both UTF-8 and UTF-16 are variable length encodings. However, in UTF-8 a character may occupy a minimum of 8 bits, while in UTF-16 character length starts with 16 bits.

Main UTF-8 pros:

Basic ASCII characters like digits, Latin characters with no accents, etc. occupy one byte which is identical to US-ASCII representation. This way all US-ASCII strings become valid UTF-8, which provides decent backwards compatibility in many cases.
No null bytes, which allows to use null-terminated strings, this introduces a great deal of backwards compatibility too.
UTF-8 is independent of byte order, so you don't have to worry about Big Endian / Little Endian issue.

Main UTF-8 cons:

Many common characters have different length, which slows indexing by codepoint and calculating a codepoint count terribly.
Even though byte order doesn't matter, sometimes UTF-8 still has BOM (byte order mark) which serves to notify that the text is encoded in UTF-8, and also breaks compatibility with ASCII software even if the text only contains ASCII characters. Microsoft software (like Notepad) especially likes to add BOM to UTF-8.

Main UTF-16 pros:

BMP (basic multilingual plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some codepoints outside BMP mandatory), most Japanese can be represented with 2 bytes. This speeds up indexing and calculating codepoint count in case the text does not contain supplementary characters.
Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, which means that the total length is still divisible by two and allows to use 16-bit char as the primitive component of the string.

Main UTF-16 cons:

Lots of null bytes in US-ASCII strings, which means no null-terminated strings and a lot of wasted memory.
Using it as a fixed-length encoding “mostly works” in many common scenarios (especially in US / EU / countries with Cyrillic alphabets / Israel / Arab countries / Iran and many others), often leading to broken support where it doesn't. This means the programmers have to be aware of surrogate pairs and handle them properly in cases where it matters!
It's variable length, so counting or indexing codepoints is costly, though less than UTF-8.

In general, UTF-16 is usually better for in-memory representation because BE/LE is irrelevant there (just use native order) and indexing is faster (just don't forget to handle surrogate pairs properly). UTF-8, on the other hand, is extremely good for text files and network protocols because there is no BE/LE issue and null-termination often comes in handy, as well as ASCII-compatibility.

Java UTF-16 – Understanding Character Encoding

In the UTF-16 version, you get 14 bytes because of a marker inserted to distinguish between Big Endian (default) and Little Endian. If you specify UTF-16LE you will get 12 bytes (little-endian, no byte-order marker added).

See http://www.unicode.org/faq/utf_bom.html#gen7

EDIT - Use this program to look into the actual bytes generated by different encodings:

public class Test {
    public static void main(String args[]) throws Exception {
        // bytes in the first argument, encoded using second argument
        byte[] bs = args[0].getBytes(args[1]);
        System.err.println(bs.length + " bytes:");

        // print hex values of bytes and (if printable), the char itself
        char[] hex = "0123456789ABCDEF".toCharArray();
        for (int i=0; i<bs.length; i++) {
            int b = (bs[i] < 0) ? bs[i] + 256 : bs[i];
            System.err.print(hex[b>>4] + "" + hex[b&0xf] 
                + ( ! Character.isISOControl((char)b) ? ""+(char)b : ".")
                + ( (i%4 == 3) ? "\n" : " "));
        }
        System.err.println();   
    }
}

For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up different), the output is:

$ javac Test.java  && java -cp . Test hello UTF-16
12 bytes:
FEþ FFÿ 00. 68h
00. 65e 00. 6Cl
00. 6Cl 00. 6Fo

And

$ javac Test.java  && java -cp . Test hello UTF-16LE
10 bytes:
68h 00. 65e 00.
6Cl 00. 6Cl 00.
6Fo 00.

And

$ javac Test.java  && java -cp . Test hello UTF-16BE
10 bytes:
00. 68h 00. 65e
00. 6Cl 00. 6Cl
00. 6Fo

Best Answer

Related Solutions

Java – Difference Between UTF-8 and UTF-16

Java UTF-16 – Understanding Character Encoding

Related Question