I am trying to understand character encoding in Java. Java stores characters in 16 bits using UTF-16 encoding. So when I convert a string containing 6 characters to bytes, I expect 12 bytes, but I get 6, as shown below. Is there a concept I am missing?
package learn.java;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        byte[] bt = str.getBytes();
        System.out.println("the length of character array is " + bt.length);
    }
}
Output: the length of character array is 6
As per @Darshan's suggestion, I tried getting the bytes with UTF-16 encoding, but the result is still not what I expected:
package learn.java;

public class CharacterTest {
    public static void main(String[] args) {
        String str = "Hadoop";
        try {
            byte[] bt = str.getBytes("UTF-16");
            System.out.println("the length of character array is " + bt.length);
        } catch (Exception e) {
            // Do not swallow the exception silently; report an unsupported charset name.
            e.printStackTrace();
        }
    }
}
Output: the length of character array is 14
Best Answer
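In the first version, getBytes() with no argument uses the platform's default charset, which is typically one that encodes each ASCII character in a single byte (such as UTF-8), hence 6 bytes. In the UTF-16 version, you get 14 bytes because a 2-byte byte-order mark (BOM) is inserted to distinguish between big-endian (the default) and little-endian: 2 bytes for the BOM plus 6 × 2 bytes for the characters. If you specify UTF-16LE, you get 12 bytes (little-endian, no byte-order mark added).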
See http://www.unicode.org/faq/utf_bom.html#gen7
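The byte counts above can be checked with a small sketch like the following. The class name Utf16Lengths and the helper byteCount are made up for illustration; only String.getBytes and Charset.forName come from the standard library.

```java
import java.nio.charset.Charset;

public class Utf16Lengths {
    // Hypothetical helper: number of bytes produced when s is encoded
    // with the charset of the given name.
    static int byteCount(String s, String charsetName) {
        return s.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        // "UTF-16" writes a 2-byte big-endian BOM before the characters.
        System.out.println("UTF-16   : " + byteCount(str, "UTF-16"));   // 14
        // The LE and BE variants fix the byte order, so no BOM is written.
        System.out.println("UTF-16LE : " + byteCount(str, "UTF-16LE")); // 12
        System.out.println("UTF-16BE : " + byteCount(str, "UTF-16BE")); // 12
        // UTF-8 needs one byte for each ASCII character.
        System.out.println("UTF-8    : " + byteCount(str, "UTF-8"));    // 6
    }
}
```

Naming the charset explicitly keeps the result independent of the JVM's default encoding.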
EDIT - Use this program to look into the actual bytes generated by different encodings:
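A minimal sketch of such a program, not necessarily identical to the original listing, could look like this. The class name EncodingBytes and the helper toHex are hypothetical, and the bytes are printed as hex so the result does not depend on how the console renders FE and FF.

```java
import java.nio.charset.Charset;

public class EncodingBytes {
    // Hypothetical helper: encode s with the named charset and render
    // each byte as two uppercase hex digits separated by spaces.
    static String toHex(String s, String charsetName) {
        byte[] bytes = s.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so bytes such as 0xFE are not shown as negative values.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String str = "Hadoop";
        String[] names = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE"};
        for (String name : names) {
            System.out.println(name + ": " + toHex(str, name));
        }
        // UTF-8:    48 61 64 6F 6F 70
        // UTF-16:   FE FF 00 48 00 61 00 64 00 6F 00 6F 00 70
        // UTF-16BE: 00 48 00 61 00 64 00 6F 00 6F 00 70
        // UTF-16LE: 48 00 61 00 64 00 6F 00 6F 00 70 00
    }
}
```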
For example, when running under UTF-8 (under other JVM default encodings, the characters for FE and FF would show up differently), the output is: