Character Encoding – Most Common Encoding for Each Language


I am developing a plain-text reader application. Sometimes the app can't automatically determine a file's encoding, so the user needs to select one from a list. If this list contains every supported encoding, it will be too long. I want to provide a simplified list that contains only the most common encodings for each language.

These are the mappings I know of:

  • Traditional Chinese: Big5
  • Simplified Chinese: GB18030
  • Japanese: Shift-JIS, EUC-JP
  • Russian: KOI8-R

If you know the most common encoding for any other language, please tell me.

Best Answer

On the web, UTF-8 is by far the most common encoding for all languages.

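That being said, here are the Windows XP locales grouped by default character encoding ("Language for non-Unicode programs"). Locales that appear under two code pages (az_AZ, sr_BA, sr_SP, uz_UZ) have both a Latin-script and a Cyrillic-script variant: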

  • Big5: zh_HK, zh_MO, zh_TW
  • GBK (≈GB2312): zh_CN, zh_SG
  • Windows-31J (≈Shift_JIS): ja_JP
  • windows-874 (≈TIS-620, ISO-8859-11): th_TH
  • windows-949 (≈EUC-KR): ko_KR
  • windows-1250: bs_BA, cs_CZ, hr_BA, hr_HR, hu_HU, pl_PL, ro_RO, sk_SK, sl_SI, sq_AL, sr_BA, sr_SP
  • windows-1251: az_AZ, be_BY, bg_BG, kk_KZ, ky_KG, mk_MK, mn_MN, ru_RU, sr_BA, sr_SP, tt_RU, uk_UA, uz_UZ
  • windows-1252 (≈ISO-8859-1): af_ZA, arn_CL, ca_ES, cy_GB, da_DK, de_AT, de_CH, de_DE, de_LI, de_LU, en_AU, en_BZ, en_CA, en_CB, en_GB, en_IE, en_JM, en_NZ, en_PH, en_TT, en_US, en_ZA, en_ZW, es_AR, es_BO, es_CL, es_CO, es_CR, es_DO, es_EC, es_ES, es_GT, es_HN, es_MX, es_NI, es_PA, es_PE, es_PR, es_PY, es_SV, es_UY, es_VE, eu_ES, fi_FI, fil_PH, fo_FO, fr_BE, fr_CA, fr_CH, fr_FR, fr_LU, fr_MC, fy_NL, ga_IE, gl_ES, id_ID, is_IS, it_CH, it_IT, iu_CA, iv_IV, lb_LU, moh_CA, ms_BN, ms_MY, nb_NO, nl_BE, nl_NL, nn_NO, ns_ZA, pt_BR, pt_PT, qu_BO, qu_EC, qu_PE, rm_CH, se_FI, se_NO, se_SE, sv_FI, sv_SE, sw_KE, tn_ZA, xh_ZA, zu_ZA
  • windows-1253: el_GR
  • windows-1254 (≈ISO-8859-9): az_AZ, tr_TR, uz_UZ
  • windows-1255: he_IL
  • windows-1256: ar_AE, ar_BH, ar_DZ, ar_EG, ar_IQ, ar_JO, ar_KW, ar_LB, ar_LY, ar_MA, ar_OM, ar_QA, ar_SA, ar_SY, ar_TN, ar_YE, fa_IR, ps_AF, ur_PK
  • windows-1257: et_EE, lt_LT, lv_LV
  • windows-1258: vi_VN
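
For the simplified list the question asks about, the grouping above can be collapsed into one entry per code page. The following is a minimal sketch, not a definitive implementation: the display labels and the names `LEGACY_ENCODINGS` and `encoding_menu` are made up for illustration, and the values are Python codec names that I am assuming as stand-ins for the Windows code pages (for example `cp932` for Windows-31J and `cp949` for windows-949).

```python
import codecs

# One legacy encoding per group, following the Windows XP list above.
# Labels are hypothetical UI strings; values are Python codec names.
LEGACY_ENCODINGS = {
    "Traditional Chinese": "big5",
    "Simplified Chinese": "gbk",
    "Japanese": "cp932",        # Windows-31J, close to Shift_JIS
    "Thai": "cp874",
    "Korean": "cp949",          # close to EUC-KR
    "Central European": "cp1250",
    "Cyrillic": "cp1251",
    "Western European": "cp1252",
    "Greek": "cp1253",
    "Turkish": "cp1254",
    "Hebrew": "cp1255",
    "Arabic": "cp1256",
    "Baltic": "cp1257",
    "Vietnamese": "cp1258",
}


def encoding_menu():
    """Return (label, codec name) pairs: UTF-8 first, then the legacy
    encodings sorted by label. Codecs missing from this Python build
    are skipped rather than offered to the user."""
    entries = [("Unicode (UTF-8)", "utf-8")] + sorted(LEGACY_ENCODINGS.items())
    menu = []
    for label, name in entries:
        try:
            codecs.lookup(name)  # raises LookupError for unknown codecs
        except LookupError:
            continue
        menu.append((label, name))
    return menu
```

Calling `encoding_menu()` gives a short list suitable for a drop-down. Encodings from the question that are not Windows defaults, such as EUC-JP or KOI8-R, could be added to the dictionary in the same way.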

And the most common encodings overall on the Web, as of October 30th, 2020:

  1. UTF-8 95.7%
  2. ISO-8859-1 1.8%
  3. Windows-1251 1.0%
  4. Windows-1252 0.4%
  5. GB2312 0.3%
  6. Shift JIS 0.2%
  7. GBK 0.1%
  8. EUC-KR 0.1%
  9. ISO-8859-9 0.1%
  10. Windows-1254 0.1%
  11. EUC-JP 0.1%
  12. Big5 0.1%
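
Since UTF-8 dominates, one reasonable approach is to try a strict UTF-8 decode first and fall back to legacy encodings only when that fails. The sketch below assumes this strategy. The function name `decode_with_fallback` and the default candidate order are illustrative choices, not something the statistics prescribe.

```python
def decode_with_fallback(data, candidates=("utf-8", "cp1251", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes
    `data` without error. Candidates are tried in the given order."""
    for name in candidates:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this last resort
    # never raises, although the resulting text may be wrong.
    return data.decode("latin-1"), "latin-1"
```

Strict UTF-8 is a good first test because text in a legacy encoding rarely happens to be valid UTF-8. The single-byte code pages accept almost any byte sequence, so the first one in the list will usually "succeed" even when it is the wrong choice. That is why the user-facing encoding list is still needed.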