CJK Unified Ideographs

CJKV character in traditional and simplified Chinese, Korean, Vietnamese and Japanese forms

The Chinese, Japanese and Korean (CJK) scripts share a common background, and their shared characters are collectively known as CJK characters. During the process called Han unification, the common (shared) characters were identified and named CJK Unified Ideographs. As of Unicode 17.0, Unicode defines a total of 101,996 CJK Unified Ideographs. [1]

The term "ideographs" is a misnomer, as the Chinese script is not ideographic but rather logographic. [citation needed]

Until the early 20th century, Vietnam also used Chinese characters (Chữ Nôm), so sometimes the abbreviation CJKV is used.

Sources

The Ideographic Research Group (IRG) is responsible for developing extensions to the encoded repertoires of CJK unified ideographs. IRG processes proposals for new CJK unified ideographs submitted by its member bodies, and after undergoing several rounds of expert review, IRG submits a consolidated set of characters to ISO/IEC JTC 1/SC 2 Working Group 2 (WG2) and the Unicode Technical Committee (UTC) for consideration for inclusion in the ISO/IEC 10646 and Unicode standards. The following IRG member bodies have been involved in the standardization of CJK unified ideographs:

The ideographs submitted by the UTC and the United Kingdom are not specific to any particular region, but are characters which have been suggested for encoding by individual experts. The ideographs submitted by SAT are required for the SAT Daizōkyō text database.

The table below gives the numbers of encoded CJK unified ideographs for each IRG source for Unicode 17.0. [4] The total number of characters (267,742) far exceeds the number of encoded CJK unified ideographs (101,996) as many characters have more than one source.

CJK unified ideographs by source
Member            Character count
China             69,724
Hong Kong         17,654
Japan             52,560
North Korea       23,975
South Korea       21,358
Macau             344
Taiwan            59,570
United Kingdom    3,409
Vietnam           14,276
SAT               3,715
UTC               1,157
Total             267,742
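
The relationship between the per-source counts and the number of encoded characters can be checked with a few lines of Python. This is only an arithmetic sketch: the figures are copied from the table above, and the variable names are illustrative.

```python
# Per-source counts copied from the table above (Unicode 17.0).
SOURCE_COUNTS = {
    "China": 69_724,
    "Hong Kong": 17_654,
    "Japan": 52_560,
    "North Korea": 23_975,
    "South Korea": 21_358,
    "Macau": 344,
    "Taiwan": 59_570,
    "United Kingdom": 3_409,
    "Vietnam": 14_276,
    "SAT": 3_715,
    "UTC": 1_157,
}
ENCODED = 101_996  # encoded CJK unified ideographs

# Every source reference is counted once, so a character with
# three sources contributes three to this sum.
total_references = sum(SOURCE_COUNTS.values())
average_sources = total_references / ENCODED

print(total_references)  # 267742
print(f"about {average_sources:.1f} source references per character")
```

On these figures an encoded ideograph has, on average, between two and three source references, which is why the sum is so much larger than the number of encoded characters.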

UTC sources

The majority of characters submitted by the UTC to the IRG are derived from Unicode Technical Committee (UTC) documents. [5] Other sources include:

Ordering

The ordering of CJK Unified Ideographs within Unicode blocks (not counting characters added to a block later) was initially determined by consulting the following four dictionaries. Characters were arranged primarily in Kangxi Dictionary order. For a character not found in the Kangxi Dictionary, the other dictionaries were consulted, in the order listed, to determine which Kangxi Dictionary character it should follow. [6]

  1. Kangxi Dictionary
  2. Dai Kan-Wa Jiten
  3. Hanyu Da Zidian
  4. Dae Jaweon

This system is not used for more recently added Unicode blocks. The Ideographic Research Group no longer uses the Dae Jaweon, [7] nor the Dai Kan-Wa Jiten, [8] in its work. The Kangxi Dictionary and Hanyu Da Zidian are still used, [7] both in existing character source references [9] and as potential replacements for existing source references discovered to be erroneous. [10] Similarly, although a (real or virtual) Kangxi Dictionary index was previously provided as part of the submission data for UTC-source characters, this is no longer the case. [11] Instead, the stroke type of the first residual stroke (the first stroke that does not form part of the radical) is supplied with all submitted characters, and is used to order characters with the same radical and stroke count within the new Unicode block. [12]
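
The newer ordering rule can be pictured as a three-part sort key: radical, then stroke count, then first residual stroke. The sketch below is illustrative only. The `Candidate` type, the placeholder labels and all numeric values are hypothetical, and the first residual stroke is assumed to be coded as a small integer.

```python
from typing import NamedTuple, Tuple

class Candidate(NamedTuple):
    label: str             # placeholder identifier, not a real code point
    radical: int           # Kangxi radical number
    residual_strokes: int  # strokes that are not part of the radical
    first_stroke: int      # first residual stroke type (assumed integer code)

def ordering_key(c: Candidate) -> Tuple[int, int, int]:
    # Radical first, then stroke count, then first residual stroke.
    return (c.radical, c.residual_strokes, c.first_stroke)

# Made-up submissions, deliberately out of order.
candidates = [
    Candidate("C", 85, 4, 3),
    Candidate("A", 9, 6, 1),
    Candidate("D", 85, 4, 1),
    Candidate("B", 85, 2, 5),
]

ordered = sorted(candidates, key=ordering_key)
print([c.label for c in ordered])  # ['A', 'B', 'D', 'C']
```

Here "D" and "C" share a radical and a stroke count, so only the first residual stroke decides which comes first.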

CJK Unified Ideographs blocks

CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,992 basic Chinese characters in the range U+4E00 through U+9FFF. The block includes not only characters used in the Chinese writing system but also kanji used in the Japanese writing system, hanja in Korea, and chữ Nôm characters in Vietnamese. Many characters in this block are used in all of these writing systems, while others are used in only some of them.

This block is also known as the Unified Repertoire and Ordering (URO), especially when it needs to be differentiated from the other CJK Unified Ideographs blocks. [13]

The first 20,902 characters in the block are arranged according to the Kangxi Dictionary ordering of radicals. In this system the characters written with the fewest strokes are listed first. The remaining characters were added later, and so are not in radical order.
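
The figures above imply where the originally ordered portion of the block ends. The following arithmetic sketch assumes that the first 20,902 characters occupy the start of the block contiguously. The constant names are illustrative.

```python
URO_START = 0x4E00
URO_END = 0x9FFF
ORIGINAL_COUNT = 20_902  # characters arranged in Kangxi radical order

block_size = URO_END - URO_START + 1
last_original = URO_START + ORIGINAL_COUNT - 1
later_additions = block_size - ORIGINAL_COUNT

print(block_size)                # 20992
print(f"U+{last_original:04X}")  # U+9FA5
print(later_additions)           # 90
```

Under that assumption the radical-ordered portion ends at U+9FA5, and the remaining 90 code points hold the later additions.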

The block is the result of Han unification, [14] which was somewhat controversial within East Asia. [15] Because a character used in more than one of Chinese, Japanese and Korean was coded at a single location, and modern typographical conventions and handwriting curricula differ slightly between regions (not necessarily along language boundaries: Hong Kong and Taiwan, for example, both use Traditional Chinese but have slightly different local conventions), [16] the appearance of a given character can depend on the font being used. However, the URO applies the source separation rule: pairs of characters treated as distinct in a character set used as a source for the URO (for example JIS X 0208, as used in Shift JIS) remain separate characters in Unicode. [17]

Using variation selectors, it is possible to specify certain variant CJK ideographs within Unicode. [18] The Adobe-Japan1 character set, which has 14,684 ideographic variation sequences, [19] is an extreme example of the use of variation selectors. [20]
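
In text, an ideographic variation sequence is a base ideograph followed by one selector from the Variation Selectors Supplement (U+E0100 through U+E01EF). The sketch below checks only that structural shape. The function name is hypothetical and the sample pair is purely illustrative, and whether a particular sequence is actually registered has to be looked up in the Ideographic Variation Database.

```python
# Variation Selectors Supplement: VS17 (U+E0100) through VS256 (U+E01EF).
VS_SUPPLEMENT = range(0xE0100, 0xE01F0)

def looks_like_ivs(s: str) -> bool:
    """True if s is one non-selector character followed by one
    selector from the Variation Selectors Supplement.

    This is a shape check only: it does not verify that the base is
    an ideograph or that the sequence is registered.
    """
    if len(s) != 2:
        return False
    base, selector = ord(s[0]), ord(s[1])
    return base not in VS_SUPPLEMENT and selector in VS_SUPPLEMENT

sample = "\u845B\U000E0100"  # illustrative base + VS17
print(looks_like_ivs(sample))          # True
print(looks_like_ivs("\u845B"))        # False: no selector
print(looks_like_ivs("\u845B\uFE00"))  # False: U+FE00 is outside the supplement
```

A font that does not support the sequence simply shows its default glyph for the base character.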

Charts

4E00–62FF, 6300–77FF, 7800–8CFF, 8D00–9FFF.

Sources

Note: Most characters appear in multiple sources, so the sum of individual character counts (108,493) is far greater than the number of encoded characters (20,992). [21]

In Unicode 4.1, 14 HKSCS-2004 characters and 8 GB 18030 characters were assigned to code points U+9FA6 through U+9FBB. Since then, further characters have been added to this block for various reasons, all summarized in the version history section below.

CJK Unified Ideographs Extension A

The block named CJK Unified Ideographs Extension A (3400–4DBF) contains 6,592 additional characters in the range U+3400 through U+4DBF.

Charts

3400–4DBF.

Sources

Note: Most characters appear in more than one source, so the sum of individual character counts (23,997) is far greater than the number of encoded characters (6,592). [21]

CJK Unified Ideographs Extension B

The block named CJK Unified Ideographs Extension B (20000–2A6DF) contains 42,720 characters in the range U+20000 through U+2A6DF. These include most of the characters used in the Kangxi Dictionary that are not in the basic CJK Unified Ideographs block, as well as many Hán-Nôm characters that were formerly used to write Vietnamese.

Charts

20000–215FF, 21600–230FF, 23100–245FF, 24600–260FF, 26100–275FF, 27600–290FF, 29100–2A6DF.

Sources

Note: Many characters appear in more than one source, so the sum of individual character counts (100,887) is far greater than the number of encoded characters (42,720). [21]

CJK Unified Ideographs Extension C

The block named CJK Unified Ideographs Extension C (2A700–2B73F) contains 4,160 characters in the range U+2A700 through U+2B73F. It was initially added in Unicode 5.2 (2009).

Charts

2A700–2B73F.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,967) is greater than the number of encoded characters (4,160). [21]

CJK Unified Ideographs Extension D

The block named CJK Unified Ideographs Extension D (2B740–2B81F) contains 222 characters in the range U+2B740 through U+2B81D that were added in Unicode 6.0 (2010).

Charts

2B740–2B81F.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (260) is greater than the number of encoded characters (222). [21]

CJK Unified Ideographs Extension E

The block named CJK Unified Ideographs Extension E (2B820–2CEAF) contains 5,774 characters in the range U+2B820 through U+2CEAD. It was originally added in Unicode 8.0 (2015).

Charts

2B820–2CEAF.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (6,272) is greater than the number of encoded characters (5,774). [21]

CJK Unified Ideographs Extension F

The block named CJK Unified Ideographs Extension F (2CEB0–2EBEF) contains 7,473 characters in the range U+2CEB0 through U+2EBE0 that were added in Unicode 10.0 (2017). It includes more than 1,000 Sawndip characters for Zhuang.

Charts

2CEB0–2EBEF.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (8,015) is greater than the number of encoded characters (7,473). [21]

CJK Unified Ideographs Extension G

A block named CJK Unified Ideographs Extension G was added as part of Unicode 13.0 to the Tertiary Ideographic Plane in the range U+30000 through U+3134F, containing 4,939 characters. [24]

Charts

30000–3134F.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (5,239) is greater than the number of encoded characters (4,939). [21]

CJK Unified Ideographs Extension H

A block named CJK Unified Ideographs Extension H was added as part of Unicode 15.0 to the Tertiary Ideographic Plane in the range U+31350 through U+323AF, containing 4,192 characters. [25]

Charts

31350–323AF.

Sources

Note: Some characters appear in more than one source, so the sum of individual character counts (4,541) is greater than the number of encoded characters (4,192). [21]

CJK Unified Ideographs Extension I

A block named CJK Unified Ideographs Extension I was added as part of Unicode 15.1 to the Supplementary Ideographic Plane in the range U+2EBF0 through U+2EE5F, containing 622 characters. [26]

Charts

2EBF0–2EE5F.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (625) more than the number of encoded characters (622). [21]

CJK Unified Ideographs Extension J

A block named CJK Unified Ideographs Extension J was added as part of Unicode 17.0 to the Tertiary Ideographic Plane in the range U+323B0 through U+33479, containing 4,298 characters.

Charts

323B0–3347F.

Sources

Note: Some characters appear in more than one source, making the sum of individual character counts (4,406) more than the number of encoded characters (4,298). [21]

CJK Compatibility Ideographs

The block named CJK Compatibility Ideographs (F900–FAFF) was created to retain round-trip compatibility with other standards.

However, twelve characters in this block actually have the "Unified Ideograph" property: U+FA0E 﨎, U+FA0F 﨏, U+FA11 﨑, U+FA13 﨓, U+FA14 﨔, U+FA1F 﨟, U+FA21 﨡, U+FA23 﨣, U+FA24 﨤, U+FA27 﨧, U+FA28 﨨, and U+FA29 﨩. [1] None of the other characters in this and other "Compatibility" blocks relate to CJK unification.
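
Together with the per-block counts given in the sections above, these twelve characters account for the overall total of 101,996. The sketch below reproduces that tally and derives a simple membership test. It assumes that the encoded characters of each block are contiguous from the block's first code point. The names `BLOCKS`, `COMPAT_UNIFIED` and `is_unified_ideograph` are illustrative.

```python
# (name, first code point, number of encoded unified ideographs),
# as stated in the block sections of this article.
BLOCKS = [
    ("URO", 0x4E00, 20_992),
    ("Extension A", 0x3400, 6_592),
    ("Extension B", 0x20000, 42_720),
    ("Extension C", 0x2A700, 4_160),
    ("Extension D", 0x2B740, 222),
    ("Extension E", 0x2B820, 5_774),
    ("Extension F", 0x2CEB0, 7_473),
    ("Extension G", 0x30000, 4_939),
    ("Extension H", 0x31350, 4_192),
    ("Extension I", 0x2EBF0, 622),
    ("Extension J", 0x323B0, 4_298),
]

# The twelve unified ideographs in the CJK Compatibility Ideographs block.
COMPAT_UNIFIED = {
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
}

def is_unified_ideograph(cp: int) -> bool:
    # Assumes encoded characters run contiguously from each block start.
    if cp in COMPAT_UNIFIED:
        return True
    return any(start <= cp < start + count for _, start, count in BLOCKS)

total = sum(count for _, _, count in BLOCKS) + len(COMPAT_UNIFIED)
print(total)                         # 101996
print(is_unified_ideograph(0xFA0E))  # True
print(is_unified_ideograph(0xFA20))  # False
```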

While 龜 and 亀 are not considered unifiable, U+FA20 CJK COMPATIBILITY IDEOGRAPH-FA20 is considered a duplicate of U+8612 CJK UNIFIED IDEOGRAPH-8612.
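
The duplicate relationship is visible in the Unicode Character Database as a canonical decomposition, which normalization applies. The following sketch uses Python's standard `unicodedata` module and assumes the interpreter ships reasonably current Unicode data.

```python
import unicodedata

compat = "\uFA20"          # CJK COMPATIBILITY IDEOGRAPH-FA20
unified = "\u8612"         # CJK UNIFIED IDEOGRAPH-8612
unified_in_block = "\uFA0E"  # one of the twelve unified ideographs

# Normalization replaces the compatibility ideograph by its unified duplicate.
print(unicodedata.normalize("NFC", compat) == unified)  # True

# A unified ideograph in the same block has no decomposition mapping.
print(repr(unicodedata.decomposition(unified_in_block)))  # ''
```

One practical consequence is that normalized text cannot round-trip the compatibility ideograph, whereas the twelve unified ideographs in the block survive normalization unchanged.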

Charts

F900–FAFF.

Sources

Note: All characters appear in more than one source, so the sum of individual character counts (40) is greater than the number of encoded characters (12). [21]

Known issues

Disunification

U+4039

The character U+4039 (䀹) was a unification of two different characters (one with jiā 夾 phonetic and one with shǎn 㚒 phonetic) until Unicode 5.0. However, they were lexically different characters that should not have been unified; they have different pronunciations and different meanings.

The proposal to disunify U+4039 [27] was accepted for Unicode 5.1, which encoded a new character at U+9FC3 (鿃) to represent shǎn.

Three other glyphs in Extension B

In CJK Unified Ideographs Extension B, some characters were incorrectly unified with others. These characters include U+2017B (𠅻), U+204AF (𠒯) and U+24CB2 (𤲲). In the first two, glyphs from mainland Chinese and Vietnamese sources were wrongly unified. In the last, glyphs from mainland Chinese and Taiwanese sources were wrongly unified. [28]

The glyphs for U+2017B (𠅻) and U+204AF (𠒯) were corrected in version 10.0, and the erroneous UCS2003 source glyph for U+24CB2 (𤲲) was removed in version 13.0.

Unifiable variants and exact duplicates

Also in CJK Unified Ideographs Extension B, hundreds of glyph variants were encoded by mistake. [29] Additionally, an ISO/IEC JTC 1/SC 2 report has found that six exact duplicates (where the same character has inadvertently been encoded twice) and two semi-duplicates (where the CJK-B character represents a de facto disunification of two glyph forms unified in the corresponding BMP character) were encoded by mistake: [30]

Other CJK ideographs in Unicode, not Unified

Apart from the eleven blocks of "Unified Ideographs", Unicode has about a dozen more blocks containing CJK characters that are not unified ideographs. These are mainly CJK radicals, strokes, punctuation, marks, symbols and compatibility characters. Although some characters have (decomposable) counterparts in other blocks, their usage can differ. An example of such a character is U+3007 IDEOGRAPHIC NUMBER ZERO in the CJK Symbols and Punctuation block. Although it is not covered under "CJK Unified Ideographs", it is treated as a CJK character for all other intents and purposes. [31]
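
The status of U+3007 can be inspected with Python's standard `unicodedata` module. This is a small sketch rather than a full property lookup, since the module does not expose the Unified_Ideograph property, so the check against the basic block uses its range directly.

```python
import unicodedata

zero = "\u3007"

print(unicodedata.name(zero))      # IDEOGRAPHIC NUMBER ZERO
print(unicodedata.category(zero))  # Nl
print(unicodedata.numeric(zero))   # 0.0

# U+3007 lies outside the basic CJK Unified Ideographs block (U+4E00-U+9FFF).
in_uro = 0x4E00 <= ord(zero) <= 0x9FFF
print(in_uro)                      # False
```

The general category Nl (letter number) and the numeric value of zero reflect its use as a digit in running CJK text.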

Four blocks of compatibility characters are included for compatibility with legacy text handling systems and older character sets:

They include forms of characters for vertical text layout and rich text characters that Unicode recommends handling through other means. Therefore, their use is discouraged.

Font support

The blocks CJK Unified Ideographs and CJK Unified Ideographs Extension A, being parts of the Basic Multilingual Plane, are supported by the majority of the CJK fonts. However, Japanese and Korean fonts usually have fewer characters (about 13,000 and 8,000, respectively) than Chinese. Extensions B, C, D are supported by additional fonts MingLiU-ExtB, MingLiU_HKSCS-ExtB, PMingLiU-ExtB, SimSun-ExtB included in Microsoft Windows since Vista. [32]

Unicode version history

Notes

  1. Characters presumably intended for Singapore Chinese characters, but apparently an ad hoc collection rather than a Singapore national standard. [23]

References

  1. "Unicode PropList.txt". 2025-06-30. Retrieved 2025-09-11.
  2. IRG Convenor (2024-12-10). "IRG Experts List". ISO/IEC JTC1/SC2/WG2/IRG N2769.
  3. Lunde, Ken (2024-09-13). "US/Unicode Activity Report for IRG #63 Meeting" (PDF). ISO/IEC JTC1/SC2/WG2/IRG N2700.
  4. "Unihan_IRGSources.txt". 2025-07-24. Retrieved 2025-09-12.
  5. "UAX #45: U-source Ideographs". Unicode Consortium. 2025-07-24.
  6. "18.1.7. Han Ideograph Arrangement". The Unicode Standard: Core Specification. Version 16.0.0. Unicode Consortium.
  7. "3.3. Dictionary Indices". Unicode Han Database (Unihan). UAX #38. Three of the dictionary properties represent official IRG indices for the dictionaries used in the four dictionary sorting algorithm. Two (kIRGHanyuDaZidian and kIRGKangXi) are still being used by the IRG, but the other one (kIRGDaeJaweon) is not.
  8. Lunde, Ken (2022-09-01). "Proposal to remove/improve provisional Unihan database properties" (PDF). p. 6. UTC L2/22-188. In addition, the IRG no longer uses this dictionary for its ongoing work.
  9. "kIRG_GSource". Unicode Han Database (Unihan). UAX #38. GKX: Kangxi Dictionary ideographs (康熙字典) 9th edition (1958) including the addendum (康熙字典)補遺. GHZ: Hanyu Dazidian ideographs (漢語大字典).
  10. Lunde, Ken (2018-02-22). "Proposed kIRG_GSource Changes & Corrections" (PDF). UTC L2/18-065; ISO/IEC JTC1/SC2/WG2/IRG N2297.
  11. "2. Text File Data". U-Source Ideographs. Unicode Consortium. UAX #45. A KangXi dictionary index for the ideograph, as described in Unicode Standard Annex #38, "Unicode Han Database (Unihan)" [UAX38]. This field is no longer used and contains no data.
  12. Lunde, Ken (2024-09-30). "Proposal to remove FS (first residual stroke) value from submissions" (PDF). ISO/IEC JTC1/SC2/WG2/IRG N2713. This document proposes that the inclusion of first residual stroke (aka FS) values be removed from the submission requirements for new CJK Unified Ideographs […] The ISO/IEC 10646 Project Editor, when compiling an IRG working set into a new CJK Unified Ideographs extension block, uses the FS values to sort ideographs that share the same Radical-Stroke (Radical + SC) value.
  13. Lunde, Ken (2012-09-16). "URO". CJK Type Blog. Adobe Inc.
  14. The Unicode Standard 4.0, Appendix A - Han Unification History
  15. Topping, Suzanne. "The secret life of Unicode". Archived from the original on 2007-11-14. Retrieved 2010-05-12.
  16. Lu, Qin (2015-06-08). "The Proposed Hong Kong Character Set" (PDF). ISO/IEC JTC1/SC2/WG2/IRG N2074.
  17. "Chapter 11 - East Asian scripts", The Unicode standard, 4.0.
  18. "Ideographic Variation Database". 2022-09-13. Retrieved 2022-09-20.
  19. "IVD Stats". 2025-07-14. Retrieved 2025-09-12.
  20. PRI 108: Combined registration of the Adobe Japan1 collection and of sequences in that collection
  21. "Unihan_IRGSources.txt (from Unihan.zip)". 2025-07-24. Retrieved 2025-09-12.
  22. "UAX #38: Unicode Han Database (Unihan)". Unicode Consortium.
  23. Lunde, Ken (2009). "Chapter 3: Character Set Standards § Chinese Character Set Standards—Singapore". CJKV information processing (2nd ed.). Sebastopol, Calif.: O'Reilly Media, Inc. p. 130. ISBN 978-0-596-15611-4. OCLC 317878469. To what extent these 226 characters are ad-hoc, or codified by a Singapore national standard, is unknown, at least to me. My suspicion is that they are ad-hoc simply for the apparent lack of any Singapore national standard.
  24. "Unicode 13.0.0". 10 March 2020. Retrieved 10 March 2020.
  25. "Unicode 15.0.0". 13 September 2022. Retrieved 14 September 2022.
  26. "Unicode 15.1.0". 2023-09-12. Retrieved 2023-09-12.
  27. Andrew West and John Jenkins, proposal of disunification of U+4039
  28. Eiso Chan (陈永聪), Comments on four error glyphs on CJK Unified Ideographs Ext B & E.
  29. Taichi Kawabata. "IRGN1155 Possible Duplicates" (.zip). Retrieved 2019-06-22.
  30. Cook, Richard (6 October 2003). "Defect Report on Duplicate Encoded CJK Forms" (PDF). ISO/IEC JTC1/SC2/WG2. Retrieved 2025-08-21.
  31. GB/T 15835-2011《出版物上数字用法》. China Guojia Biaozhun. https://journals.usst.edu.cn/uploadfile/file/GBT%2015835-2011%E3%80%8A%E5%87%BA%E7%89%88%E7%89%A9%E4%B8%8A%E6%95%B0%E5%AD%97%E7%94%A8%E6%B3%95%E3%80%8B.pdf
  32. Lunde, Ken (2009). CJKV Information Processing. O'Reilly. pp. 633–634. ISBN 978-0-596-51447-1.