Character Sets and Languages
Character Sets (Language Scripts)

You may see the following tag in web pages:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">

Other common charset values are us-ascii and windows-1252. All three agree on the first 128 characters; windows-1252 differs from iso-8859-1 only in the range 128-159, where it defines printable characters instead of control codes. Other character sets are listed below.

The first 128 characters are control characters and the standard letters, numbers, and special characters (punctuation, +, /, =, ...). The extended characters (128-255, hex 80-FF) include multinational characters and symbols such as cent "¢", pound "£" and copyright "©". The 8-bit character file here has the printable characters from 32-255.

The extended characters are standardized for web pages, but when viewed in native text editors they may appear differently in different operating systems, e.g. Windows, Macintosh OS, and UNIX.

Other HTML Coding for Character Sets

The most common standard is ANSI X3.4-1968, commonly called US-ASCII or simply ASCII. It is listed in RFC 1345 and matches the first 128 characters of ISO-8859-1 (Latin-1). The International Organization for Standardization (ISO) 8859 series and Microsoft code pages are other common standards.

Table from Encoding.CodePage Property, msdn.microsoft.com:

Name          CodePage  BodyName     HeaderName    WebName       Encoding.EncodingName
shift_jis     932       iso-2022-jp  iso-2022-jp   shift_jis     Japanese (Shift-JIS)
windows-1250  1250      iso-8859-2   windows-1250  windows-1250  Central European [1], Latin-2
windows-1251  1251      koi8-r       windows-1251  windows-1251  Cyrillic
Windows-1252  1252      iso-8859-1   Windows-1252  Windows-1252  Western European, Latin-1
windows-1253  1253      iso-8859-7   windows-1253  windows-1253  Greek
windows-1254  1254      iso-8859-9   windows-1254  windows-1254  Turkish
csISO2022JP   50221     iso-2022-jp  iso-2022-jp   csISO2022JP   Japanese (JIS-Allow 1 byte Kana)
iso-2022-kr   50225     iso-2022-kr  euc-kr        iso-2022-kr   Korean (ISO)

[1] Latin-2: Central or East European (Czech, Hungarian, Polish and Slovak)

US-ASCII (preferred MIME name): the U.S. national variant of ISO/IEC 646. Formally, the U.S. standard ANSI X3.4.
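How us-ascii, iso-8859-1 and windows-1252 relate can be checked byte by byte. The following is a minimal sketch, assuming Python's built-in codecs are an acceptable stand-in for what a browser does with these charset names. The helper name differing_bytes is made up for this illustration.

```python
def differing_bytes(charset_a, charset_b):
    """Return the byte values (0-255) that the two charsets decode differently."""
    diffs = []
    for value in range(256):
        raw = bytes([value])
        # errors="replace" maps undefined bytes to U+FFFD instead of raising
        char_a = raw.decode(charset_a, errors="replace")
        char_b = raw.decode(charset_b, errors="replace")
        if char_a != char_b:
            diffs.append(value)
    return diffs


# us-ascii defines only 0-127, so every extended byte differs from iso-8859-1
ascii_vs_latin1 = differing_bytes("us-ascii", "iso-8859-1")

# windows-1252 and iso-8859-1 disagree only in the 128-159 block
cp1252_vs_latin1 = differing_bytes("windows-1252", "iso-8859-1")

print(ascii_vs_latin1 == list(range(128, 256)))   # True
print(cp1252_vs_latin1 == list(range(128, 160)))  # True
```

Bytes 162, 163 and 169 (cent, pound, copyright) fall outside the 128-159 block, so they decode the same way in iso-8859-1 and windows-1252.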
ISO - Windows
ISO-8859-1 - 1252 - Latin-1 Western European (default), ANSI
ISO-8859-2 - 1250 - Latin-2 East European (Czech, Hungarian, Polish and Slovak)
ISO-8859-3 - Latin-3 South European
ISO-8859-4 - 1257 - Latin-4 North European (Baltic)
ISO-8859-10 - Latin-6 Nordic, replaces Latin-4 (Sámi, Inuit, Icelandic)
ISO-8859-5 - 1251 - Cyrillic (Azerbaijani, Bulgarian, Buryat, Byelorussian, Karakalpak, Kazakh, Khalkha, Kirghiz, Macedonian, Moldavian, Russian, Serbian, Tajik, Turkmen, Ukrainian and Uzbek languages)
ISO-8859-6 - 1256 - Arabic (Arabic, Farsi [Iran], Jawi, Kurdish, Pashto [Afghanistan], Persian, Sindhi and Urdu [Pakistan], Panjabi)
ISO-8859-7 - 1253 - Greek
ISO-8859-8 - 1255 - Hebrew
ISO-8859-9 - 1254 - Latin-5 Turkish
1258 - Viet Nam
ISO-8859-11 - 874 - Thai
ISO-2022-KR - 949 - Korean Extended Wansung
936 - PRC GBK (XGB)
950 - Chinese (Taiwan, Hong Kong)

Multi-byte codes
UTF-8 - ISO-10646 Unicode
Big5 - Chinese Traditional (Taiwan, Hong Kong)
EUC-TW - Chinese Traditional
GB2312 - Chinese Simplified (China mainland, Singapore and Malaysia)
GB18030 - the registered Internet name for the official character set of the People's Republic of China (PRC), superseding GB 2312
GB - (GuoBiao) Chinese Simplified
GBK - Chinese Simplified
HZ - Chinese Simplified
ISO-2022-GB - an emerging new international Internet standard for encoding Chinese text
EUC-JP - Japanese
EUC-JIS - Japanese
Shift_JIS - Japanese
ISO-2022-JP - Japanese
ISO-2022-JP-2 - Japanese
KSC5601 - Korean
EUC-KR - Korean KSC5601
KOI8-R - Cyrillic
KOI8-U - Cyrillic

Devanagari (Bhojpuri, Bihari, Hindi, Kashmiri, Konkani, Marathi, Nepali and Sanskrit. It is also used for writing Panjabi by Indians who are not Sikhs)
Gujarati (Indian State of Gujarat)
Gurmukhi (Panjabi [Pakistan and India]; Panjabi can also be written with Devanagari and Arabic)
Bengali (Bengali [Bangladesh])
Thai
Vietnamese

(Character sets recognized in Netscape 1.1 were us-ascii, iso-8859-1, iso-2022-jp, x-sjis, x-euc-jp, x-mac-roman)

Central European

Unicode

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. Unicode is required by modern standards such as XML, Java, ECMAScript (JavaScript), LDAP, CORBA 3.0, WML, etc., and is the official way to implement ISO/IEC 10646.
There are 8-, 16- and 32-bit Unicode transformation formats (UTF): UTF-8, UTF-16 and UTF-32.
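To make the difference between the three formats concrete, the sketch below counts how many bytes a single character occupies in each one. It assumes Python's standard codecs. The function name encoded_sizes and the four sample characters are chosen for illustration only.

```python
def encoded_sizes(char):
    """Return the number of bytes one character takes in each UTF."""
    # The "-be" variants fix the byte order, so no byte order mark is prepended
    return {utf: len(char.encode(utf)) for utf in ("utf-8", "utf-16-be", "utf-32-be")}


samples = [
    "A",           # U+0041, plain ASCII letter
    "\u00a3",      # U+00A3, pound sign
    "\u4e2d",      # U+4E2D, a CJK ideograph
    "\U0001f600",  # U+1F600, outside the 16-bit range
]

for char in samples:
    print(f"U+{ord(char):04X}", encoded_sizes(char))
```

UTF-8 uses one to four bytes per character, UTF-16 uses two or four, and UTF-32 always uses four.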
UTF-16 and UTF-32 are not byte-oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file. Some systems store data with the most significant byte (MSB) first (big-endian) and others with it last (little-endian). A byte order mark (BOM) consists of the character code U+FEFF at the beginning of a data stream, where it can be used as a signature defining the byte order and encoding form, primarily of unmarked plaintext files.

See: unicode.org; Alan Wood's Unicode Page; Reference.com (Table of Unicode characters, 128 to 999); Table of Unicode characters from 1 to 65535 at unicode.coeurlumiere.com.

Some of the Languages in the Unicode Character Database (UCD). See a larger list at Alan Wood's Unicode and Multilingual Support in HTML.
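The byte order mark described above can be recognized by comparing the first bytes of a stream against the encoded forms of U+FEFF. This is a minimal sketch using the BOM constants from Python's codecs module. The function name detect_bom is hypothetical, and the sketch only inspects the signature rather than validating the rest of the data.

```python
import codecs

# UTF-32 is checked before UTF-16 because the UTF-32 little-endian BOM
# (FF FE 00 00) starts with the UTF-16 little-endian BOM (FF FE).
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data):
    """Return the encoding form signalled by a leading BOM, or None."""
    for bom, name in BOM_TABLE:
        if data.startswith(bom):
            return name
    return None


text = "\ufeffhi"  # U+FEFF followed by two letters
print(detect_bom(text.encode("utf-16-be")))  # utf-16-be
print(detect_bom(text.encode("utf-16-le")))  # utf-16-le
print(detect_bom(b"hi"))                     # None
```

A result of None only means that no signature is present; the data may still be in any encoding.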
ISO/ANSI vs ECS characters

Characters 33-126 (letters, numbers and special characters, i.e. the standard keyboard characters) are the same for ANSI and ECS; the other characters are not. E.g. the British pound character is 156 in ECS and 163 in ANSI.

Microsoft

Microsoft has defined the Windows Glyph List 4 (WGL4) standard, which incorporates code pages 1250 (Eastern Europe), 1251 (Cyrillic), 1252 (US English = ANSI), 1253 (Greek) and 1254 (Turkish).

Language

<META HTTP-EQUIV="Content-Language" CONTENT="zh">
<HTML LANG="fr">
<BLOCKQUOTE LANG="fr">
<P LANG="fr">

Common values (see: ISO639A List): ar - Arabic, de - German, en - English, es - Spanish, fr - French, ga - Irish, gu - Gujarati, he - Hebrew, hi - Hindi, it - Italian, ja - Japanese, ko - Korean, pa - Punjabi, yi - Yiddish, zh - Chinese

They may also contain sub-tags, e.g.:
fr-CA - French Canadian
ar-EG - Egyptian Arabic
en-US - American English
zh-TW - Taiwanese Chinese
zh-Hant - Traditional Chinese
zh-Hans - Simplified Chinese

Fonts

English: Times, Helvetica and Courier (standard on Macintosh and UNIX); Times New Roman and Arial (standard on Windows)
Chinese Traditional: MingLiU (IE5), PMingLiU (Office 2000)
Chinese Simplified: MS Song (IE5), MS Hei (IE5), SimSun (Office 2000)
Japanese: MS Gothic, MS Mincho
Korean: Gulimche
Hebrew: Web Hebrew AD and Web Hebrew Monospace
Others: (See Alan Wood's list) FZNew XiuLi-Z11
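The ANSI versus ECS difference noted earlier on this page (pound sign at 156 versus 163) can be reproduced with a short check. This sketch rests on two assumptions that are not stated in the text: that "ECS" corresponds to IBM code page 437 (Python codec cp437) and that "ANSI" corresponds to windows-1252. The helper name code_in is made up for the example.

```python
POUND = "\u00a3"  # the British pound sign


def code_in(charset, char):
    """Return the single-byte code of char in the given 8-bit charset."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{char!r} is not a single byte in {charset}")
    return encoded[0]


# Assumption: cp437 stands in for ECS, windows-1252 for ANSI
print(code_in("cp437", POUND))         # 156
print(code_in("windows-1252", POUND))  # 163

# Characters 33-126 decode identically in both charsets
keyboard = bytes(range(33, 127))
print(keyboard.decode("cp437") == keyboard.decode("windows-1252"))  # True
```

Under the same assumptions, byte 163 read as cp437 gives "ú" rather than the pound sign, which is the kind of mismatch that appears when a file is opened with the wrong charset.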
See Also: Character set list at IANA, ASCII - ISO 8859-1 (Latin-1) Table with HTML Entity Names, Examples of Characters, Keystrokes and Glyphs, Unicode Standard, The Multilingual World Wide Web, Using National and Special Characters in HTML, Alan Wood's Demonstrations of Special Characters, HTML and JavaScript
Microsoft: Character sets, Encoding.CodePage Property (System.Text), INFO: XML Encoding and DOM Interface Methods
ISO 8859 Alphabet Soup
Terms

ANSI - American National Standards Institute
Big5 - Character set used for Traditional Chinese characters
CJK - Chinese/Japanese/Korean
ECS - IBM Extended Character Set
EUC - Extended UNIX Code
GB (GuoBiao) - Character set used for Simplified Chinese characters
ISO - International Organization for Standardization
UCD - Unicode Character Database
UTC - Unicode Technical Committee
UTF - Unicode transformation format