Character Encoding – UTF-8 and ISO 8859-9 Differences Explained

character-encodingdatabaseisounicodeutf-8

I have been reading about UTF-8 and unicode for the last couple of days and when I thought I figured it all, I am confused when I read that UTF-8 and ISO 8859-9 are not compatible.

I have a database that stores data as UTF-8. I have a requirement from a customer to support various ISO 8859-x code pages (i.e. 8859-3, 8859-2, and also ISO 6937). My questions are:

  1. Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?

  2. I understand that unicode can support all characters and it is the way to go. However, my customer is an european entity that wants us to use ISO code pages. so my question is how can I support multiple client use cases using existing UTF-8 data? Since ISO 8859-x is not a subset of unicode, do I have to write code to send appropriate character set of ISO 8859-x depending on my use cases? Is that I need to do or there is more to it?

btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data. if so, how character set is applied? Do I have to write a code to return 8859-x response or all that's needed is to set an appropriate character set value in the response header?

Best Answer

Topic is pretty vast so let me simplify (a lot, even too much) and answer point by point.

Since my data ingest and database engine type is UTF-8, would it be correct to assume that I am using unicode?

Yes, you're using UNICODE and you're storing UNICODE characters (formally called code points) using UTF-8 encoding. Please note that UNICODE defines rules and sets of characters (even if same word is often used as synonym of UTF-16 encoding), the way you encode such characters in a byte stream is another thing.

... However, my customer is an european entity that wants us to use ISO code pages. so my question is how can I support multiple client use cases using existing UTF-8 data?

Of course if you store UNICODE characters (it doesn't matter with which encoding) then you can always convert them to a specific ASCII code page (or to any other encoding). OK this isn't formally always true (because UNICODE doesn't define every possible characters actually in use/used in the past) but I would ignore this point...

... Since ISO 8859-x is not a subset of unicode, do I have to write code to send appropriate character set of ISO 8859-x depending on my use cases?

All characters from ISO 8859 code pages are also available in UNICODE then (from this point of view) it's a subset. Of course encoded values are different so they need to be converted. If you know needed code page for each customer then you can always convert an UNICODE UTF-8 encoded text into an ASCII (with right code page) text.

Is that I need to do or there is more to it?

Just that. Code could be pretty short but you didn't tag your question with any language so I won't provide links/examples. Just for a rudimentary example take a look to this post.

Let me also say one important thing: if they want to consume your data in ASCII with their code page then you have to perform a conversion. If they can consume directly UTF-8 data (or you present them somehow in your own application) then you don't have to worry about code pages (that's why we're using UNICODE) because - no matters encoding - UNICODE character set contains all characters they may need.

btw, my understanding is that UTF-8 is merely an encoding algorithm to get a numeric value from binary data.

Not exactly. You have a table of characters, right? For example A. Now you have to store a numeric value that will be interpreted as A. In ASCII they arbitrary decided that 65 is the numeric value that represents that character. UNICODE is a long list of characters (and rules to combine them), UTF-X are arbitrary representations used to store them as numeric values.

if so, how character set is applied?

"Character set" is a pretty vague sentence. With UNICODE character set you mean all characters available with UNICODE. If you mean code page then (simplifying) it represents a subset of available character set. Imagine you have 8 bit ASCII (then up to 256 symbols), you can't accommodate all characters used in Europe, right? Code pages solve this problem, half of these symbols are always the same and the other half represent different characters according to code page (each "Country" will use a specific code page with its preferred characters).

For an introductory overview about this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets

Related Question