Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1.
At the most recent measurement cited, 0.2% of all web pages used Shift JIS, a decline from 1.3% in July 2014. Shift JIS is the second-most popular character encoding for Japanese websites, used by 5.6% of sites in the .jp domain; UTF-8 is used by 94.4% of Japanese websites.
Description
Shift JIS is based on character sets defined within the JIS standards JIS X 0201 (for the single-byte characters) and JIS X 0208 (for the double-byte characters). The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII character set's backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.
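The byte ranges described above can be sketched as a small classifier. This is a minimal illustration rather than a decoder: the function name `classify_byte` is hypothetical, and the lead-byte ranges 0x81–0x9F and 0xE0–0xEF are an assumption that covers standard Shift JIS only (vendor extensions use further lead bytes).

```python
def classify_byte(b: int) -> str:
    """Classify a byte that starts a character in standard Shift JIS.

    Assumption: lead bytes are 0x81-0x9F and 0xE0-0xEF (no vendor extensions).
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        # ASCII-compatible range, except 0x5C (yen sign) and 0x7E (overline).
        return "single-byte (ASCII range)"
    if 0xA1 <= b <= 0xDF:
        return "single-byte (half-width katakana)"
    if 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xEF:
        return "lead byte of a double-byte character"
    return "unused in standard Shift JIS"


for value in (0x41, 0xB1, 0x83, 0xF5):
    print(hex(value), classify_byte(value))
```

The double-byte lead bytes sit on either side of the katakana block, which is the "shift" that gives the encoding its name.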
HTML written in Shift JIS can still be interpreted to some extent when it is incorrectly tagged as ASCII, and when the charset tag is at the top of the document itself. The characters that delimit HTML tags and fields (<, >, /, ", &, ;) are coded by the same single bytes as in ASCII, and those bytes never appear in two-byte sequences.
Shift JIS can be used in string literals in programming languages such as C, but two things must be taken into consideration. Firstly, the escape character 0x5C, normally the backslash, is the half-width yen sign (¥) in Shift JIS. A programmer who is aware of this can write printf("ハローワールド¥n"); (where ハローワールド is "Hello, world" and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte causes problems when it appears as the second byte of a two-byte character: it is interpreted as the start of an escape sequence, which corrupts the interpretation of the rest of the literal, unless it is followed by another 0x5C.
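The second problem can be made concrete with a short check. The sketch below uses Python's built-in `shift_jis` codec to find characters whose second byte is 0x5C; the helper name `has_0x5c_trail` and the sample characters are chosen for illustration only.

```python
def has_0x5c_trail(ch: str) -> bool:
    """Return True if ch encodes to two Shift JIS bytes and the second is 0x5C."""
    encoded = ch.encode("shift_jis")
    return len(encoded) == 2 and encoded[1] == 0x5C


samples = ["ソ", "表", "ア"]
risky = [ch for ch in samples if has_0x5c_trail(ch)]

for ch in samples:
    # Show the raw bytes so the trailing 0x5C is visible.
    print(ch, ch.encode("shift_jis").hex(" "), has_0x5c_trail(ch))
```

Characters such as ソ (0x83 0x5C) and 表 (0x95 0x5C) are the ones that a compiler unaware of Shift JIS would misread inside a string literal.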
Shift JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning that it supports half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte has its high bit set (0x80–0xFF); the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes of code words makes reliable Shift JIS detection difficult, because the same codes are used for ASCII characters. Because the same byte value can be either a first or a second byte, string searches are also difficult: a simple search can match the second byte of one character together with the first byte of the next, which is not a real character. String search algorithms must be tailor-made for Shift JIS.
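A boundary-aware search avoids such false matches by walking the byte string one whole character at a time. The following is a minimal sketch, not a production algorithm: `is_lead_byte` and `find_sjis` are hypothetical names, and the lead-byte range 0x81–0x9F / 0xE0–0xFC is an assumption chosen to also cover common extensions.

```python
def is_lead_byte(b: int) -> bool:
    # Assumed lead-byte ranges (standard Shift JIS plus common extensions).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def find_sjis(haystack: bytes, needle: bytes) -> int:
    """Return the index of the first match that starts on a character
    boundary, or -1 if there is none."""
    if not needle:
        return 0
    i = 0
    while i + len(needle) <= len(haystack):
        if haystack.startswith(needle, i):
            return i
        # Skip one whole character: two bytes after a lead byte, else one.
        if is_lead_byte(haystack[i]) and i + 1 < len(haystack):
            i += 2
        else:
            i += 1
    return -1


# Two characters, 0x83 0x82 (katakana "mo") then 0x82 0xA0 (hiragana "a").
text = b"\x83\x82\x82\xa0"
# 0x82 0x82 (full-width "b") does not occur in the text as a character.
needle = b"\x82\x82"

print(text.find(needle))        # naive byte search reports a false match
print(find_sjis(text, needle))  # boundary-aware search reports no match
```

The naive search pairs the second byte of the first character with the first byte of the second, which is exactly the failure described above.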
On the other hand, the competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a much cleaner and more direct conversion to and from JIS X 0208 code points, as all bytes with the high bit set are parts of a double-byte character and all codes in the ASCII range represent single-byte characters.
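For comparison, the EUC-JP conversion for JIS X 0208 characters amounts to setting the high bit of each JIS byte. The sketch below illustrates this for the two-byte JIS X 0208 part only; the function names are hypothetical.

```python
from typing import Tuple


def jis_to_eucjp(j1: int, j2: int) -> bytes:
    """Convert a two-byte JIS X 0208 code (each byte 0x21-0x7E) to EUC-JP."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS bytes must be in 0x21-0x7E")
    # Setting the high bit of both bytes is the whole transformation.
    return bytes([j1 | 0x80, j2 | 0x80])


def eucjp_to_jis(pair: bytes) -> Tuple[int, int]:
    """Inverse conversion: clear the high bit of both bytes."""
    if len(pair) != 2 or pair[0] < 0xA1 or pair[1] < 0xA1:
        raise ValueError("not a two-byte JIS X 0208 sequence in EUC-JP")
    return pair[0] & 0x7F, pair[1] & 0x7F


print(jis_to_eucjp(0x24, 0x22).hex(" "))  # JIS 0x2422 is hiragana "a"
```

Because no arithmetic beyond a bit mask is involved, every byte in an EUC-JP stream can be classified without looking at its neighbours.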
Unicode also avoids some of the disadvantages of Shift JIS. Unicode does not have ambiguous versions: new characters are assigned to unused places by a single organization, while private use areas are clearly designated, will never be used for standard characters, and are rarely needed due to the comprehensive nature of Unicode. Shift JIS extensions, by contrast, were developed by different companies working in parallel. UTF-8-encoded Unicode is backwards compatible with ASCII, including at 0x5C, and does not have the string search problem.
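For a double-byte JIS sequence $j_1 j_2$, the transformation to the corresponding Shift JIS bytes $s_1 s_2$ is:
: $s_1 = \begin{cases} \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \text{if } 33 \le j_1 \le 94 \\ \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \text{if } 95 \le j_1 \le 126 \end{cases}$
: $s_2 = \begin{cases} j_2 + 31 + \left\lfloor \frac{j_2}{96} \right\rfloor & \text{if } j_1 \text{ is odd} \\ j_2 + 126 & \text{if } j_1 \text{ is even} \end{cases}$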
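The JIS to Shift JIS transformation can be sketched in a few lines. The constants below (112, 176, 31, 126 and the division by 96 that skips 0x7F) follow the commonly documented form of the conversion and are stated here as an assumption; `jis_to_sjis` is a hypothetical name.

```python
def jis_to_sjis(j1: int, j2: int) -> bytes:
    """Convert a two-byte JIS X 0208 code (each byte 0x21-0x7E) to Shift JIS."""
    if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
        raise ValueError("JIS bytes must be in 0x21-0x7E")
    # Two JIS rows share one lead byte; rows above 94 jump past the
    # half-width katakana block to lead bytes starting at 0xE0.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2 == 1:
        # Odd rows use trail bytes 0x40-0x9E, skipping 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even rows use trail bytes 0x9F-0xFC.
        s2 = j2 + 126
    return bytes([s1, s2])


print(jis_to_sjis(0x21, 0x21).hex(" "))  # first cell of row 1
print(jis_to_sjis(0x25, 0x3D).hex(" "))  # katakana "so", trail byte 0x5C
```

The second example shows how an ordinary katakana character ends up with 0x5C as its second byte.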
Multiple versions
Many different versions of Shift JIS exist. There are two areas for expansion:
Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters here; these are really extensions to JIS X 0208 rather than to Shift JIS itself.
Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters.
Windows-932 / Windows-31J
The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis".
IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC) and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard.
Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash), and 0x7E to U+007E TILDE, following US-ASCII. However, most localised fonts on Windows display U+005C as a yen sign for compatibility.
It includes several extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition.
Windows codepage 932 is the version used in the W3C/WHATWG encoding standard used by HTML5, which includes the "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, and also treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content".
MacJapanese
The version of Shift JIS originating from the classic Mac OS (known as x-mac-japanese, Code page 10001 or MacJapanese) assigned the tilde to 0x7E (following US-ASCII, not JIS X 0201, which assigns the overline here), but the yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS). It also extended JIS X 0201 by assigning the backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the half-width horizontal ellipsis to 0xFF. It also added extended double-byte characters, including 53 vertical presentation forms in the range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in the Shift_JIS range 0x8540–0x886D. This variant was introduced in KanjiTalk version 7.
However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "PostScript" variant of MacJapanese, which included additional vertical presentation forms and a different set of extended special characters, based on the NEC special characters, some of which were only available in the printer versions of the fonts. Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms and did not include the special character extensions; this was subsequently changed. The typical variant used with KanjiTalk version 6 placed the vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13.
Shift_JISx0213 and Shift_JIS-2004
The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS.
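In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping codepoints.
: $s_1 = \begin{cases} \left\lfloor \frac{k + 257}{2} \right\rfloor & \text{if } m = 1 \text{ and } 1 \le k \le 62 \\ \left\lfloor \frac{k + 385}{2} \right\rfloor & \text{if } m = 1 \text{ and } 63 \le k \le 94 \\ \left\lfloor \frac{k + 479}{2} \right\rfloor - \left\lfloor \frac{k}{8} \right\rfloor \times 3 & \text{if } m = 2 \text{ and } k \in \{1, 3, 4, 5, 8, 12, 13, 14, 15\} \\ \left\lfloor \frac{k + 411}{2} \right\rfloor & \text{if } m = 2 \text{ and } 78 \le k \le 94 \end{cases}$
: $s_2 = \begin{cases} t + 63 + \left\lfloor \frac{t}{64} \right\rfloor & \text{if } k \text{ is odd} \\ t + 158 & \text{if } k \text{ is even} \end{cases}$
In the above, $s_1 s_2$ is a two-byte Shift_JIS-2004 sequence, $m$ is the ''men'' (plane) number (1 or 2), $k$ is the ''ku'' (row) number (1–94) and $t$ is the ''ten'' (cell) number (1–94). The ''ku'' and ''ten'' numbers are equivalent to $j_1 - \mathrm{0x20}$ and $j_2 - \mathrm{0x20}$ respectively, where $j_1 j_2$ is a two-byte JIS sequence referencing a given plane.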
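The plane, row and cell mapping can be sketched as follows. The constants and the set of plane 2 rows are taken from the commonly documented form of the Shift_JIS-2004 mapping and should be read as an assumption; `sjis2004_bytes` and `PLANE2_LOW_ROWS` are hypothetical names.

```python
# Plane 2 rows that are packed into lead bytes 0xF0-0xF4 (assumed set).
PLANE2_LOW_ROWS = {1, 3, 4, 5, 8, 12, 13, 14, 15}


def sjis2004_bytes(m: int, k: int, t: int) -> bytes:
    """Map plane m (1 or 2), row k (1-94) and cell t (1-94) to two bytes."""
    if not 1 <= t <= 94:
        raise ValueError("cell number must be in 1-94")
    if m == 1 and 1 <= k <= 62:
        s1 = (k + 257) // 2
    elif m == 1 and 63 <= k <= 94:
        s1 = (k + 385) // 2
    elif m == 2 and k in PLANE2_LOW_ROWS:
        s1 = (k + 479) // 2 - (k // 8) * 3
    elif m == 2 and 78 <= k <= 94:
        s1 = (k + 411) // 2
    else:
        raise ValueError("row is not allocated on this plane")
    # Odd rows take trail bytes 0x40-0x9E (skipping 0x7F); even rows 0x9F-0xFC.
    s2 = t + 63 + t // 64 if k % 2 == 1 else t + 158
    return bytes([s1, s2])


print(sjis2004_bytes(1, 1, 1).hex(" "))   # first cell of plane 1
print(sjis2004_bytes(2, 1, 1).hex(" "))   # first cell of plane 2
```

Plane 1 reproduces the ordinary Shift JIS layout, while plane 2 is packed into the lead bytes from 0xF0 upward.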
The same set of characters can be represented by EUC-JIS-2004, the EUC-JP based counterpart.
Some of the additions collide with popular Shift JIS extensions, including Windows codepage 932, which is used in web standards (see § Windows-932 / Windows-31J above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏…) to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈…). In addition, some of the characters map to Unicode characters beyond the BMP.
Other variants
The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in e-mail. KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4.
Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used.
One variant must be used when encoding Shift JIS in source code strings of C and similar programming languages. This variant doubles the byte 0x5C if it appears as the second byte of a two-byte character, but not if it appears as a single "¥" (ASCII: "\") character, because 0x5C is the beginning of an escape sequence. The best way of handling this is a special editor which encodes this way.
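The doubling rule can be sketched as a filter over already encoded bytes. This is an illustration under stated assumptions, not a description of any particular editor: `escape_for_c_literal` is a hypothetical name, and the lead-byte range 0x81–0x9F / 0xE0–0xFC is assumed.

```python
def is_lead_byte(b: int) -> bool:
    # Assumed lead-byte ranges (standard Shift JIS plus common extensions).
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def escape_for_c_literal(sjis: bytes) -> bytes:
    """Double 0x5C only where it is the second byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(sjis):
        b = sjis[i]
        if is_lead_byte(b) and i + 1 < len(sjis):
            trail = sjis[i + 1]
            out += bytes([b, trail])
            if trail == 0x5C:
                # The compiler consumes one 0x5C as an escape introducer,
                # so emit a second one to keep the character intact.
                out.append(0x5C)
            i += 2
        else:
            # Single-byte characters, including a lone 0x5C that starts
            # an intended escape sequence, are copied unchanged.
            out.append(b)
            i += 1
    return bytes(out)


# 0x83 0x5C (katakana "so") followed by the escape sequence 0x5C 'n'.
source = b"\x83\x5c\x5cn"
print(escape_for_c_literal(source).hex(" "))
```

Only the trail byte is doubled; the 0x5C that begins the intended escape sequence is left alone.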
Shift JIS byte map
As defined in JIS X 0208:1997
The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997).
With vendor or JIS X 0213 extensions
Some of the bytes which are not used for single-byte codes or initial bytes in standard Shift JIS are used by certain extensions, resulting in the layout detailed in the chart below.
See also
* Japanese language and computers
* Code page 932 (Microsoft Windows)
* Mojibake
* Shift JIS art
External links
* Shift-JIS Kanji Table, a table of the non-ASCII part of the codeset
* Microsoft's definition
* Forms of Shift-JIS in ICU (International Components for Unicode)
** ibm-942 (sjis78)
** ibm-943 (contains the \u00A5 ↔ \x5C mapping)
** Shift JIS (contains the \u005C ↔ \x5C mapping)