Special characters

As of Unicode version 17.0, there are 297,334 assigned characters with code points, covering 172 modern and historical scripts, as well as multiple symbol sets. As it is not practical to list all of these characters in a single page, this list is limited to a subset of the most important characters for English-language readers, with links to other pages which list the supplementary characters. Accordingly, this article lists the 1,062 characters in the Multilingual European Character Set 2 (MES-2) subset, and some additional related characters. (The term Unicode character was coined to categorise characters that do not also have ASCII code points.)

Character reference overview

HTML and XML provide ways to reference Unicode characters when the characters themselves either cannot or should not be used. A numeric character reference refers to a character by its Universal Character Set/Unicode code point, and a character entity reference refers to a character by a predefined name.

A numeric character reference uses the format

&#nnnn;

or

&#xhhhh;

where nnnn is the code point in decimal form, and hhhh is the code point in hexadecimal form. The x must be lowercase in XML documents. The nnnn or hhhh may be any number of digits and may include leading zeros. The hhhh may mix uppercase and lowercase, though uppercase is the usual style.
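As a minimal sketch of the two numeric forms, the following Python snippet builds decimal and hexadecimal references for a character and decodes them again with the standard library's html.unescape. The helper names to_decimal_ref and to_hex_ref are illustrative only and are not part of any standard.

```python
import html


def to_decimal_ref(ch: str) -> str:
    """Return the decimal numeric character reference for one character."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch: str) -> str:
    """Return the hexadecimal reference, with a lowercase x and uppercase digits."""
    return f"&#x{ord(ch):X};"


euro = "\u20ac"  # EURO SIGN, code point U+20AC (8364 in decimal)

print(to_decimal_ref(euro))  # &#8364;
print(to_hex_ref(euro))      # &#x20AC;

# Leading zeros and mixed-case hexadecimal digits decode to the same character.
for ref in ("&#8364;", "&#08364;", "&#x20AC;", "&#x20ac;", "&#x0020Ac;"):
    assert html.unescape(ref) == euro
```

The decoder used here follows HTML rules; an XML parser would additionally reject an uppercase X in the hexadecimal form.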

In contrast, a character entity reference refers to a character by the name of an entity which has the desired character as its replacement text. The entity must either be predefined (built into the markup language) or explicitly declared in a Document Type Definition (DTD). The format is the same as for any entity reference:

&name;

where name is the case-sensitive name of the entity. The semicolon is required.
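The name-based form can be illustrated with Python's html.entities tables, which hold the HTML entity names (XML itself predefines only amp, lt, gt, apos and quot). This is a sketch under that assumption; to_entity_ref is a hypothetical helper that falls back to a decimal numeric reference when no name is known.

```python
import html
from html.entities import codepoint2name


def to_entity_ref(ch: str) -> str:
    """Prefer a named HTML entity; fall back to a decimal numeric reference."""
    name = codepoint2name.get(ord(ch))
    return f"&{name};" if name is not None else f"&#{ord(ch)};"


print(to_entity_ref("\u00e9"))  # &eacute;  (lowercase e with acute)
print(to_entity_ref("\u00c9"))  # &Eacute;  (uppercase E with acute)
print(to_entity_ref("\u0416"))  # &#1046;   (no HTML 4 name for this Cyrillic letter)

# Entity names are case-sensitive: the two names decode to different characters.
assert html.unescape("&eacute;") == "\u00e9"
assert html.unescape("&Eacute;") == "\u00c9"
```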

Because numbers are harder for humans to remember than names, character entity references are most often written by humans, while numeric character references are most often produced by computer programs.

Control codes

65 characters, including DEL. All belong to the common script.
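The count of 65 can be checked with a short Python sketch that collects every code point whose general category is Cc (control). It assumes the Unicode database bundled with the Python interpreter in use; the variable names are illustrative.

```python
import sys
import unicodedata

# Collect every code point whose general category is "Cc" (control).
control = [
    cp for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)) == "Cc"
]

c0 = [cp for cp in control if cp <= 0x1F]          # C0 controls, U+0000..U+001F
c1 = [cp for cp in control if 0x80 <= cp <= 0x9F]  # C1 controls, U+0080..U+009F

print(len(control))     # 65
print(len(c0))          # 32
print(0x7F in control)  # True (DEL)
print(len(c1))          # 32
```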

Footnotes:

1 Control-C has typically been used as a "break" or "interrupt" key.

2 Control-D has been used to signal "end of file" for text typed in at the terminal on Unix / Linux systems. Windows, MS-DOS, and older minicomputers used Control-Z for this purpose.

3 Control-G is an artifact of the days when teletypes were in use. Important messages could be signaled by striking the bell on the teletype. This was carried over on PCs by generating a buzz sound.

4 Line feed is used for "end of line" in text files on Unix / Linux systems.

5 Carriage Return (accompanied by line feed, and thus usually written as 'CRLF') is used as "end of line" character by Windows, MS-DOS, and most minicomputers other than Unix- / Linux-based systems. Classic Mac OS and other vintage OS used CR only.

6 Control-O has been the "discard output" key. Output is not sent to the terminal, but discarded, until another Control-O is typed.

7 Control-Q has been used to tell a host computer to resume sending output after it was stopped by Control-S.

8 Control-S has been used to tell a host computer to postpone sending output to the terminal. Output is suspended until restarted by the Control-Q key.

9 Control-U was originally used by Digital Equipment Corporation computers to cancel the current line of typed-in text. Other manufacturers used Control-X for this purpose.

10 Control-X was commonly used to cancel a line of input typed in at the terminal.

11 Control-Z has commonly been used on minicomputers, Windows and MS-DOS systems to indicate "end of file" either on a terminal or in a text file. Unix / Linux systems use Control-D to indicate end-of-file at a terminal.
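The control keys named in these footnotes correspond to C0 control codes by a simple rule: on ASCII terminals, holding Control while typing a letter conventionally sends the letter's code with the upper bits cleared. The sketch below assumes that convention; control_code is a hypothetical helper name.

```python
def control_code(letter: str) -> int:
    """Return the C0 code sent by Control plus an ASCII letter (low five bits kept)."""
    return ord(letter.upper()) & 0x1F


# Keys mentioned in the footnotes, plus J and M for line feed and carriage return.
for key in "CDGJMOQSUXZ":
    print(f"Control-{key} -> U+{control_code(key):04X}")

# Line feed and carriage return are Control-J and Control-M respectively.
assert chr(control_code("J")) == "\n"
assert chr(control_code("M")) == "\r"
```

Under this rule Control-C is U+0003, Control-D is U+0004, Control-G is U+0007 (the bell), Control-Q and Control-S are U+0011 and U+0013, and Control-Z is U+001A.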

Latin script

The Unicode Standard classifies 1,492 characters as belonging to the Latin script.

Basic Latin

95 characters; the 52 alphabet characters belong to the Latin script. The remaining 43 belong to the common script. The 33 characters classified as ASCII Punctuation & Symbols are also sometimes referred to as ASCII special characters. Often only these characters (and not other Unicode punctuation) are what is meant when an organization says a password "requires punctuation marks".
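The breakdown of the block can be reproduced with a few lines of Python. This is only a sketch: has_ascii_special is a hypothetical password check, and whether the space character counts as "punctuation" depends on the policy in question. Note that Python's string.punctuation holds 32 characters because it excludes the space.

```python
import string

# The 95 printable Basic Latin characters, U+0020 (space) to U+007E (tilde).
printable = [chr(cp) for cp in range(0x20, 0x7F)]

letters = [c for c in printable if c in string.ascii_letters]
digits = [c for c in printable if c in string.digits]
specials = [c for c in printable if not c.isalnum()]  # includes the space

print(len(printable), len(letters), len(digits), len(specials))  # 95 52 10 33
print(len(string.punctuation))  # 32 (the same set without the space)


def has_ascii_special(password: str) -> bool:
    """Hypothetical check: does the password contain an ASCII punctuation mark?"""
    return any(c in string.punctuation for c in password)


print(has_ascii_special("correct horse"))   # False
print(has_ascii_special("correct-horse!"))  # True
```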

Latin-1 Supplement

96 characters; the 62 letters and two ordinal indicators belong to the Latin script. The remaining 32 belong to the common script.

Latin Extended-A

128 characters; all belong to the Latin script.

Latin Extended-B

208 characters; all belong to the Latin script; 33 in the MES-2 subset.

Latin Extended Additional

256 characters; all belong to the Latin script; 23 in the MES-2 subset.

Additional Latin Extended

Latin Extended-C (Unicode block)
Latin Extended-D (Unicode block)
Latin Extended-E (Unicode block)
Latin Extended-F (Unicode block)
Latin Extended-G (Unicode block)

Phonetic scripts

IPA Extensions

96 characters; all belong to the Latin script; three in the MES-2 subset.

Spacing modifier letters

80 characters; 15 in the MES-2 subset.

Phonetic Extensions

Phonetic Extensions (Unicode block)
Phonetic Extensions Supplement (Unicode block)

Combining marks

Greek and Coptic

144 code points; 135 assigned characters; 85 in the MES-2 subset.

Greek Extended

For polytonic orthography. 256 code points; 233 assigned characters, all in the MES-2 subset (#670–902).
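The distinction between code points and assigned characters in the two Greek blocks can be checked against the Unicode database shipped with Python. The sketch below treats a code point as assigned when its general category is not Cn; block_stats is a hypothetical helper, the block ranges U+0370 to U+03FF and U+1F00 to U+1FFF are taken from the Unicode code charts, and the counts assume a reasonably recent Python whose bundled database matches the figures above.

```python
import unicodedata


def block_stats(first: int, last: int) -> tuple[int, int]:
    """Return (code points, assigned characters) for an inclusive block range."""
    total = last - first + 1
    assigned = sum(
        1 for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) != "Cn"  # Cn means unassigned
    )
    return total, assigned


print(block_stats(0x0370, 0x03FF))  # Greek and Coptic: (144, 135)
print(block_stats(0x1F00, 0x1FFF))  # Greek Extended:   (256, 233)

# MES-2 positions 670 to 902 inclusive cover 233 characters.
print(902 - 670 + 1)  # 233
```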

Cyrillic

256 characters; 191 in the MES-2 subset.

Cyrillic supplements

Cyrillic Supplement (Unicode block)
Cyrillic Extended-A (Unicode block)
Cyrillic Extended-B (Unicode block)
Cyrillic Extended-C (Unicode block)
Cyrillic Extended-D (Unicode block)

Armenian

Semitic languages

Arabic

Hebrew

Syriac

Mandaic

Mandaic (Unicode block)

Samaritan

Samaritan (Unicode block)

Thaana

Brahmic (Indic) scripts

The range from U+0900 to U+0DFF includes Devanagari, Bengali script, Gurmukhi, Gujarati script, Odia alphabet, Tamil script, Telugu script, Kannada script, Malayalam script, and Sinhala script.

Devanagari

Bengali and Assamese

Gurmukhi

Gujarati

Oriya

Tamil

Telugu

Kannada

Malayalam

Sinhala

Other Brahmic scripts

Other Brahmic and Indic scripts in Unicode include:

Other South and Central Asian writing systems

Southeast Asian writing systems

Georgian

African scripts

Ge'ez/Ethiopic script

Other African scripts

American scripts

Unified Canadian Aboriginal Syllabics

Other American scripts

Mongolian

Unicode symbols

General Punctuation

112 code points; 111 assigned characters; 24 in the MES-2 subset.

Superscripts and Subscripts

Currency Symbols

Letterlike Symbols

Number Forms

Arrows

Miscellaneous Symbols and Arrows (Unicode block)
Supplemental Arrows-A (Unicode block)
Supplemental Arrows-B (Unicode block)
Supplemental Arrows-C (Unicode block)

Mathematical symbols

Supplemental Mathematical Operators (Unicode block)
Miscellaneous Mathematical Symbols-A (Unicode block)
Miscellaneous Mathematical Symbols-B (Unicode block)
Mathematical Alphanumeric Symbols (Unicode block)

Miscellaneous Technical

Control Pictures

Optical Character Recognition

Enclosed Alphanumerics

Box Drawing

Block Elements

Geometric Shapes

Symbols for Legacy Computing

Symbols for Legacy Computing Supplement

Miscellaneous Symbols

Miscellaneous Symbols Supplement

Dingbats

East Asian writing systems

CJK Symbols and Punctuation

Hiragana

Katakana

Kana Extended-A (Unicode block)
Kana Extended-B (Unicode block)
Kana Supplement (Unicode block)
Katakana Phonetic Extensions (Unicode block)
Small Kana Extension (Unicode block)

Bopomofo

Hangul Jamo and Compatibility Jamo

Kanbun

Enclosed CJK Letters and Months

CJK Compatibility

CJK Compatibility Forms

CJK Unified Ideographs


CJK Radicals

Other East Asian writing systems

Counting Rod Numerals (Unicode block)
Halfwidth and Fullwidth Forms (Unicode block)
Ideographic Description Characters (Unicode block)
Khitan Small Script (Unicode block)
Lisu (Unicode block)
Lisu Supplement (Unicode block)
Miao (Unicode block)
Modifier Tone Letters (Unicode block)
Nushu (Unicode block)
Nyiakeng Puachue Hmong (Unicode block)
Small Form Variants (Unicode block)
Tai Xuan Jing Symbols (Unicode block)
Tangut (Unicode block)
Tangut Components (Unicode block)
Tangut Components Supplement (Unicode block)
Tangut Supplement (Unicode block)
Vertical Forms (Unicode block)
Wancho (Unicode block)
Yi Syllables (Unicode block)
Yi Radicals (Unicode block)
Yijing Hexagram Symbols (Unicode block)

Alphabetic Presentation Forms

Ancient and historic scripts

Shavian

Notational systems

Emoji

Emoji in Unicode

Alchemical symbols

Game symbols

Mahjong Tiles

Domino Tiles

Playing Cards

Chess Symbols

Special areas and format characters

References

Unicode Character Code Charts, Unicode, Inc.
CWA 13873:2000 – Multilingual European Subsets in ISO/IEC 10646-1, CEN Workshop Agreement 13873
Multilingual European Character Set 2 (MES-2) Rationale, Markus Kuhn, 1998

External links

Official web site of the Unicode Consortium (English)