Basically:
- charset is the set of characters you can use
- encoding is the way these characters are stored in memory
People sometimes use charset to refer both to the character repertoire and the encoding scheme. The Unicode Standard charset has multiple encodings, e.g., UTF-8, UTF-16, UTF-32, UCS-4, UTF-EBCDIC, Punycode, and GB18030.
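A tiny sketch of that distinction in Java (the string literal is arbitrary, and the UTF-32 lookup assumes your JRE ships that charset, which standard JDKs do): the character ä is one and the same code point in the Unicode charset, but each encoding stores it as a different byte sequence.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetVsEncoding {
    public static void main(String[] args) {
        String s = "ä";  // one character: code point U+00E4 in the Unicode charset

        // The same character, stored differently by different encodings:
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));    // 2 bytes
        System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_16BE))); // 2 bytes
        System.out.println(Arrays.toString(s.getBytes(Charset.forName("UTF-32")))); // 4 bytes
    }
}
```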
Reasons are many, I believe, but the main point is: how many characters do you need to display (encode)? If you live in the US, for example, you could go pretty far with ASCII. But in many countries we need characters like ä, å, ü etc. (If SO were ASCII only, or you tried to read this text as ASCII-encoded text, you'd see some weird characters in the places of ä, å and ü.) Think also of China, Japan, Thailand and other "exotic" countries. Those weird figures on photos you may have seen around the world just might be letters, not pretty pictures.
As for the differences between the encoding types, you need to see their specifications. Here's something for UTF-8.
I'm not familiar with UTF-16. Here's some information about the differences.
Base64 is used when there is a need to encode binary data that needs to be stored and transferred over media that are designed to deal with textual data. If you've ever made some sort of email system with PHP, you've probably encountered Base64.
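A minimal sketch in Java using the standard java.util.Base64 class (the payload bytes here are made up):

```java
import java.util.Base64;

public class Base64Demo {
    public static void main(String[] args) {
        byte[] binary = {(byte) 0x89, 'P', 'N', 'G'};  // arbitrary binary data, e.g. the start of a PNG file

        // Encode the raw bytes into an ASCII-safe string that survives text-only media...
        String text = Base64.getEncoder().encodeToString(binary);
        System.out.println(text);

        // ...and turn that text back into the original bytes on the other end.
        byte[] roundTrip = Base64.getDecoder().decode(text);
        System.out.println(roundTrip.length);  // 4
    }
}
```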
In short: to support localization of a computer program's user interface into many different languages. (Programming languages still mainly consist of characters found in ASCII, although it's possible, for example in Java, to use non-ASCII characters in variable names, and the source code file is usually stored as something other than ASCII-encoded text, for example as UTF-8.)
In short vol.2: Always when different people are trying to solve some problem from a specific point of view (or even without a point of view if it's even possible), results may be quite different. Quote from Joel's unicode article (link below): "Because bytes have room for up to eight bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255."
Thanks to Joachim and tchrist for all the info and discussion. Here are two articles I just read. (Both links are on the page I linked to earlier.) I'd forgotten most of the stuff from Joel's article since I last read it a few years back. A good introduction to the subject, I hope. Mark Davis goes a little deeper.
Best Answer
ASCII is fundamental
Originally 1 character was always stored as 1 byte. A byte (8 bits) can distinguish 256 possible values. But in fact only the first 7 bits were used, so only 128 characters were defined. This set is known as the ASCII character set.
- 0x00-0x1F contain control codes (e.g. CR, LF, STX, ETX, EOT, BEL, ...)
- 0x20-0x40 contain numbers and punctuation
- 0x41-0x7F contain mostly alphabetic characters
- 0x80-0xFF: the 8th bit = undefined.

French, German and many other languages needed additional characters (e.g. à, é, ç, ô, ...) which were not available in the ASCII character set. So they used the 8th bit to define their characters. This is what is known as "extended ASCII".

The problem is that the additional 1 bit does not have enough capacity to cover all languages in the world, so each region has its own ASCII variant. There are many extended ASCII encodings (latin-1 being a very popular one). A small byte-level example follows below.

Popular question: "Is ASCII a character set or is it an encoding?" ASCII is a character set. However, in programming charset and encoding are widely used as synonyms. If I want to refer to an encoding that only contains the ASCII characters and nothing more (the 8th bit is always 0): that's US-ASCII.
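To make the "extended ASCII" problem concrete, here is a small sketch (assuming your JRE ships the ISO-8859-7 code page, which standard JDKs do): the very same byte with the 8th bit set means a different letter depending on which regional variant you decode it with.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ExtendedAsciiDemo {
    public static void main(String[] args) {
        byte[] data = {(byte) 0xE4};  // a single byte with the 8th bit set

        // Decoded with latin-1 (ISO-8859-1) this byte is the letter 'ä'...
        System.out.println(new String(data, StandardCharsets.ISO_8859_1));

        // ...but decoded with the Greek variant it is a different letter entirely.
        System.out.println(new String(data, Charset.forName("ISO-8859-7")));
    }
}
```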
Unicode goes one step further

Unicode is a great example of a character set - not an encoding. It uses the same characters as the ASCII standard, but it extends the list with additional characters, giving each character a code point in the format U+xxxx. It has the ambition to contain all characters (and popular icons) used in the entire world.

UTF-8, UTF-16 and UTF-32 are encodings that apply the Unicode character table. But they each have a slightly different way to encode those characters. UTF-8 will only use 1 byte when encoding an ASCII character, giving the same output as any other ASCII encoding. But for other characters, it will use the first bit (the high bit) of the first byte to indicate that a 2nd byte (or more) will follow.
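A quick sketch of that in Java: an ASCII character costs 1 byte in UTF-8 but 2 in UTF-16, while a non-ASCII character costs more than 1 byte in UTF-8.

```java
import java.nio.charset.StandardCharsets;

public class UtfLengths {
    public static void main(String[] args) {
        // ASCII character: 1 byte in UTF-8, identical to plain ASCII output
        System.out.println("A".getBytes(StandardCharsets.UTF_8).length);    // 1
        // The same character in UTF-16 always takes at least 2 bytes
        System.out.println("A".getBytes(StandardCharsets.UTF_16BE).length); // 2
        // A non-ASCII character: UTF-8 sets the high bit of the first byte and adds a continuation byte
        System.out.println("ä".getBytes(StandardCharsets.UTF_8).length);    // 2
    }
}
```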
GBK is an encoding which, just like UTF-8, uses multiple bytes. The principle is pretty much the same: the first byte follows the ASCII standard, so only 7 bits are used. But just like with UTF-8, the 8th bit can be used to indicate the presence of a 2nd byte, which it then uses to encode one of about 22,000 Chinese characters. The main difference is that it does not follow the Unicode character set; instead it uses a Chinese character set.
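A sketch of the difference in byte cost, assuming your JRE includes the GBK charset (standard JDK distributions do):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkVsUtf8 {
    public static void main(String[] args) {
        String zhong = "\u4E2D";  // the Chinese character 中

        // GBK stores a Chinese character in 2 bytes...
        System.out.println(zhong.getBytes(Charset.forName("GBK")).length);  // 2
        // ...UTF-8 needs 3 bytes for the same character, because it maps Unicode code points.
        System.out.println(zhong.getBytes(StandardCharsets.UTF_8).length);  // 3
    }
}
```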
Decoding data
When you encode your data, you use an encoding, but when you decode data, you will need to know what encoding was used, and use that same encoding to decode it.
Unfortunately, encodings aren't always declared or specified. It would have been ideal if all files contained a prefix to indicate what encoding their data was stored in. But still in many cases applications just have to assume or guess what encoding they should use. (e.g. they use the standard encoding of the operating system).
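A small sketch of what goes wrong when the guess is wrong: bytes written as UTF-8 but decoded as latin-1 turn ä into two garbage characters.

```java
import java.nio.charset.StandardCharsets;

public class WrongEncodingDemo {
    public static void main(String[] args) {
        byte[] stored = "ä".getBytes(StandardCharsets.UTF_8);  // encoded with UTF-8: 2 bytes

        // Decoding with the same encoding gives the original text back...
        System.out.println(new String(stored, StandardCharsets.UTF_8));      // ä
        // ...decoding with a wrongly guessed encoding produces mojibake.
        System.out.println(new String(stored, StandardCharsets.ISO_8859_1)); // Ã¤
    }
}
```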
There is still a lack of awareness about this; many developers don't even know what an encoding is.
Mime types
Mime types are sometimes confused with encodings. They are a useful way for the receiver to identify what kind of data is arriving. Here is an example of how the HTTP protocol defines its content type using a mime type declaration.
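A typical header looks like this (the concrete values are only an illustration):

```http
Content-Type: text/html; charset=utf-8
```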
And that's another great source of confusion. A mime type describes what kind of data a message contains (e.g. text/xml, image/png, ...). And in some cases it will additionally also describe how the data is encoded (i.e. charset=utf-8). 2 points of confusion:

- charset=utf-8 adds to the semantic confusion, because as explained earlier, UTF-8 is an encoding and not a character set. But as explained earlier, some people just use the 2 words interchangeably.
- For some mime types the parameter is redundant. For example, in the case of text/xml it would be pointless to declare an encoding (and a charset parameter would simply be ignored). Instead, XML parsers in general will read the first line of the file, looking for the <?xml encoding=... declaration (a sample prolog is shown a bit further down). If it's there, then they will reopen the file using that encoding.

The same problem exists when sending e-mails. An e-mail can contain an HTML message or just plain text. Also in that case mime types are used to define the type of the content.
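Coming back to the XML case above, the prolog a parser looks for typically looks like this (a hypothetical minimal document):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<note>ä, å and ü survive because the parser knows the encoding</note>
```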
But in summary, a mime type isn't always sufficient to solve the problem.
Data types in programming languages
In the case of Java (and many other programming languages), in addition to the dangers of encodings, there's also the complexity of casting bytes and integers to characters, because their content is stored in different ranges.

- the byte type in Java is signed (range: -128 to 127).
- the char type in Java is stored in 2 unsigned bytes (range: 0 to 65535).
- methods like InputStream.read() return an int in the range -1 to 255 (-1 signals the end of the stream).

If you know that your data only contains ASCII values, then with the proper skill you can parse your data from bytes to characters or wrap them immediately in Strings.
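A sketch of why those ranges bite: casting a byte with the 8th bit set straight to a char goes through a signed int and produces the wrong code point.

```java
public class ByteToCharPitfall {
    public static void main(String[] args) {
        byte b = (byte) 0xE4;            // 'ä' in latin-1, but as a signed Java byte its value is -28

        char wrong = (char) b;           // sign extension: becomes U+FFE4, not U+00E4
        char right = (char) (b & 0xFF);  // mask to 0-255 first, then the cast gives U+00E4 ('ä')

        System.out.println(Integer.toHexString(wrong)); // ffe4
        System.out.println(Integer.toHexString(right)); // e4
    }
}
```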
Shortcuts
The shortcut in Java is to use readers and writers and to specify the encoding when you instantiate them.
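A minimal sketch of that shortcut (the file name is made up):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class ReaderWriterDemo {
    public static void main(String[] args) throws IOException {
        // Write text with an explicitly chosen encoding...
        try (Writer out = new OutputStreamWriter(
                new FileOutputStream("some_file.txt"), StandardCharsets.UTF_8)) {
            out.write("ä, å and ü\n");
        }

        // ...and read it back by declaring the same encoding, instead of relying on the platform default.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(new FileInputStream("some_file.txt"), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```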
As explained earlier, for XML files it doesn't matter that much, because any decent DOM or JAXB marshaller will check for an encoding attribute.