''Mojikyō'', also known by its full name ''Konjaku Mojikyō'', is a character encoding scheme created to provide a complete index of characters used in the Chinese, Japanese, Korean, Vietnamese chữ Nôm and other historical Chinese logographic writing systems. The Mojikyō Institute, which published the character set, also published computer software and TrueType computer fonts to accompany it. The Mojikyō Institute, chaired by Tadahisa Ishikawa, originally had its character set and related software and data redistributed on CD-ROMs sold in Kinokuniya stores. Conceptualized in 1996, the first version of the CD-ROM was released in July 1997. For a time, the Mojikyō Institute also offered a web subscription, termed "Mojikyō WEB", which had more up-to-date characters.

''Mojikyō'' encoded 174,975 characters. Among those, 150,366 characters (≈86%) belonged to the extended Chinese–Japanese–Korean–Vietnamese (CJKV) family (for Korean, this refers to Hanja; for Vietnamese, to chữ Nôm). Many of ''Mojikyō'''s characters are considered obsolete or obscure, and are not encoded by any other character set, including the most widely used international text encoding standard, Unicode.

''Mojikyō'' was originally a paid proprietary software product. In 2015, the Mojikyō Institute began to upload its latest releases to the Internet Archive as freeware, as a memorial to honor one of its developers, Tokio Furuya, who died that year. On 15 December 2018, version 4.0 was released. The next day, Ishikawa announced that without Furuya this would be the final release of ''Mojikyō''.

Premise

The ''Mojikyō'' encoding was created to provide a complete index of characters used in the Chinese, Japanese, and Korean writing systems and in Vietnamese chữ Nôm. It also encodes a large number of characters in ancient scripts, such as the oracle bone script, the seal script, and Sanskrit (Siddhaṃ). For many characters, it is the only character encoding to encode them, and its data is often used as a starting point for Unicode proposals.

However, ''Mojikyō'' has much looser standards than Unicode for encoding, which leads ''Mojikyō'' to have many encoded glyphs of dubious, or even unintentionally fictional, origin. As such, while many non-Unicode ''Mojikyō'' characters are suitable for addition to Unicode, not all can become Unicode characters, due to the differing standards of evidence required by each.

Composition

The ''Mojikyō'' fonts are TrueType fonts that come in a ZIP file and are each around 25 megabytes; the different fonts contain different numbers of characters. The ZIP file can be downloaded from the official website.

Also included is a Windows executable that implements a graphical character map, the "Mojikyō Character Map", also called the "Mojikyō Cmap". (The English name comes from the title of the window produced by running the executable; the Japanese name comes from the icon of the executable.) The Mojikyō Character Map allows users to browse through the fonts, and to copy and paste characters in lieu of typing them on the keyboard. Unlike the regular Windows character map, or for that matter KCharSelect, both of which support TrueType fonts, the Mojikyō Character Map displays the numbered encoding slot of the requested character, as can be seen in the screenshots on the official website. In order for the Mojikyō Character Map to work, all ''Mojikyō'' fonts must be installed into the system fonts directory.

Encoding

When referring to a character encoded in ''Mojikyō'', the format MXXXXXX is often used, similar to the U+XXXX format used for Unicode. A difference, however, is that ''Mojikyō'' numbers displayed this way are decimal, while Unicode's U+ notation is hexadecimal.

From the earliest days of Unicode, ''Mojikyō'' has both influenced, and been influenced by, the standard. Glyphs originating from ''Mojikyō'' first appear in a proposal to the Ideographic Rapporteur Group (IRG; renamed the Ideographic Research Group in 2019), which is responsible for maintaining all CJK blocks in Unicode, on 18 April 2002. In May 2007, ''Mojikyō'' played a minor role in an eventually successful series of proposals to encode the Tangut script in Unicode; ''Mojikyō'' already had within its encoding 6,000 Tangut characters by October 2002. (The history of the encoding of the Tangut script is quite complicated and spans many related proposals.)

The Unicode Standard's Unihan Database refers to ''Mojikyō'' as the "Japanese KOKUJI Collection", abbreviated "JK". For example, one ideograph has a J-Source equal to JK-66038. (J-Source is a column name in the Unihan database, where "J" is short for "Japanese glyph source"; the full name of the column is kIRG_JSource. Under Han unification, there are nine such sources; see §3.1 of UAX #38 for a complete list and more information.) All Unicode characters with a JK-prefixed J-Source originate from ''Mojikyō''. Other J-Source prefixes exist, such as J4, meaning the character originates from JIS X 0213:2004.

According to Ken Lunde, a subject matter expert in character encodings and East Asian languages, as of Unicode 13.0, 782 ideographs in Unicode originate from ''Mojikyō'', split somewhat evenly between two blocks: CJK Unified Ideographs Extension C, with 367, and CJK Unified Ideographs Extension E, with 415.

Not all Unicode characters with ''Mojikyō'' origins (JK-prefixed J-Sources) have the same representative glyph in the code chart as in the ''Mojikyō'' font, that is to say, a glyph made up of the same radicals in the same positions. Some characters had their shapes changed before final encoding, as investigation showed the shapes assigned by the Mojikyō Institute were wrong. Errors in large collections of ideographs are not uncommon. They occur even in well-funded, government-produced collections, such as the famous kanji from unknown sources in the Japanese Industrial Standards Committee's JIS X 0208 double-byte character encoding standard. All of these JIS X 0208 error kanji (ghost characters) have made their way into Unicode despite not being "real" kanji.
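The difference between the decimal M-number notation and the hexadecimal U+ notation, and the role of the JK prefix in a J-Source value, can be illustrated with a short Python sketch. This is a minimal illustration rather than any official tool: the function names are hypothetical, the six-digit zero-padded form is assumed from the MXXXXXX pattern described above, and the sample value `M012345` is arbitrary.

```python
import re

# Assumption: an M-number is "M" followed by exactly six decimal digits,
# matching the MXXXXXX pattern described in the text.
M_PATTERN = re.compile(r"^M(\d{6})$")
# A Unicode reference is "U+" followed by four to six hexadecimal digits.
U_PATTERN = re.compile(r"^U\+([0-9A-Fa-f]{4,6})$")


def parse_mojikyo(ref: str) -> int:
    """Read an M-number; its digits are decimal (base 10)."""
    match = M_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not an M-number: {ref!r}")
    return int(match.group(1), 10)


def parse_unicode(ref: str) -> int:
    """Read a U+ reference; its digits are hexadecimal (base 16)."""
    match = U_PATTERN.match(ref)
    if match is None:
        raise ValueError(f"not a U+ reference: {ref!r}")
    return int(match.group(1), 16)


def format_mojikyo(number: int) -> str:
    """Format a number as a zero-padded six-digit decimal M-number."""
    return f"M{number:06d}"


def format_unicode(code_point: int) -> str:
    """Format a code point as U+ followed by at least four hex digits."""
    return f"U+{code_point:04X}"


def split_j_source(value: str) -> tuple[str, str]:
    """Split a J-Source value such as 'JK-66038' into (prefix, identifier)."""
    prefix, _, identifier = value.partition("-")
    return prefix, identifier


def is_mojikyo_source(value: str) -> bool:
    """True when the J-Source prefix is JK, the prefix used for Mojikyō."""
    return split_j_source(value)[0] == "JK"
```

The same digit string means different numbers in the two notations: `parse_mojikyo("M012345")` yields 12345, whereas `parse_unicode("U+12345")` yields 74565.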

Blocks

''Mojikyō'' encoded 174,975 characters. Among those, 150,366 characters belonged to the extended CJKV family. Many of the encoded characters are considered obsolete or otherwise obscure, and are not encoded by any other character set, including the international standard, Unicode. Each character has a unique number, and the characters are organized into blocks. ''Mojikyō'' puts CJKV characters in different blocks according to their traditional ''Kangxi'' radical. Common radicals containing an especially high number of characters, such as Radicals 9 (人) and 162 (辵), are split further by stroke order, as the block list in the Mojikyō Character Map shows.

No unification

Unlike Unicode, ''Mojikyō'' purposely avoids Han unification; no attempt at compactness of the encoding is made, nor is there an attempt to keep all common characters below U+FFFF as there is in Unicode. Unicode, on the other hand, sorts its CJK characters into blocks based on how common they are: the most common are generally put into the Basic Multilingual Plane, while those that are rare or obscure are put into the Supplementary Planes.
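The U+FFFF boundary mentioned above is the last code point of the Basic Multilingual Plane. As a minimal sketch (the function names are hypothetical), the plane of a Unicode code point can be computed by discarding its low 16 bits, since each of the 17 planes holds 65,536 code points:

```python
MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16


def plane_of(code_point: int) -> int:
    """Return the Unicode plane (0 to 16) that contains the code point."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError(f"outside the Unicode code space: {code_point:#x}")
    # Each plane spans 0x10000 code points, so the plane number is
    # whatever remains after shifting away the low 16 bits.
    return code_point >> 16


def in_bmp(code_point: int) -> bool:
    """True for code points at or below U+FFFF (plane 0, the BMP)."""
    return plane_of(code_point) == 0
```

For instance, `in_bmp(0x4E00)` is true for the first code point of the basic CJK Unified Ideographs block, while `plane_of(0x2A700)` gives 2 for the first code point of CJK Unified Ideographs Extension C, which lies in a supplementary plane.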

License

''Mojikyō'' is proprietary software under a restrictive license. Originally, the Mojikyō Institute tried to prevent its character data from being used, and threatened those who published conversion tables to and from its character set. In July 2010, the Mojikyō Institute abandoned its legal efforts to stop at least one Japanese user from publishing conversion tables or converting characters encoded in ''Mojikyō'' to Unicode or other character sets.

Mere data, sometimes including the shapes of letters, are considered in many jurisdictions to be common property, as they do not meet the threshold of originality (see also fictitious entry and trap street). Due to this legacy, however, ''Mojikyō'' data was still disallowed by at least one other project as of 2020.

Collected writing systems

Living

* Chinese: Hanzi
* Japanese: Kanji, Kana (including Hentaigana)
* Latin alphabet with diacritics
* Cyrillic script with diacritics

Dead or obsolete

* Ancient Chinese
** Oracle bone script
** Seal script
* Taiwanese kana
* Sanskrit: Siddhaṃ
* Sui script
