Shift JIS (Shift Japanese Industrial Standards, also SJIS, MIME name Shift_JIS, known as PCK in Solaris contexts) is a character encoding for the Japanese language, originally developed by a Japanese company called ASCII Corporation in conjunction with Microsoft and standardized as JIS X 0208 Appendix 1. Its share of the web has declined from 1.3% of all web pages in July 2014 to 0.2%. Shift JIS remains the second-most popular character encoding for Japanese websites, used by 5.6% of sites in the .jp domain, compared with 94.4% for UTF-8.


Description

Shift JIS is based on character sets defined within the JIS standards JIS X 0201 (for the single-byte characters) and JIS X 0208 (for the double-byte characters). The lead bytes for the double-byte characters are "shifted" around the 63 halfwidth katakana characters in the single-byte range 0xA1 to 0xDF. The single-byte characters 0x00 to 0x7F match the ASCII encoding, except for a yen sign (U+00A5) at 0x5C and an overline (U+203E) at 0x7E in place of the ASCII backslash and tilde respectively. The single-byte characters from 0xA1 to 0xDF map to the half-width katakana characters found in JIS X 0201.

HTML written in Shift JIS can still be interpreted to some extent when incorrectly tagged as ASCII, and when the charset tag is at the top of the document itself, since the important start and end of HTML tags and fields, <, >, /, ", &, ; are coded by the same single bytes as in ASCII. Those bytes all lie below 0x40, the lowest second-byte value, so they never appear inside two-byte sequences.

Shift JIS can be used in string literals in programming languages such as C, but two things must be taken into consideration. Firstly, the escape character 0x5C, normally the backslash, is the half-width yen sign (¥) in Shift JIS. A programmer who is aware of this can write printf("ハローワールド¥n"); (where ハローワールド is "Hello, world" and ¥n is an escape sequence), assuming the I/O system supports Shift JIS output. Secondly, the 0x5C byte causes problems when it appears as the second byte of a two-byte character, because it is interpreted as the start of an escape sequence, which corrupts the string unless it is followed by another 0x5C.

Shift JIS requires an 8-bit clean medium for transmission. It is fully backwards compatible with the legacy JIS X 0201 single-byte encoding, meaning that it supports half-width katakana and that any valid JIS X 0201 string is also a valid Shift JIS string. For two-byte characters, however, Shift JIS only guarantees that the first byte has the high bit set (0x80–0xFF); the second byte can be either high or low. The appearance of byte values 0x40–0x7E as second bytes makes reliable Shift JIS detection difficult, because the same codes are used for ASCII characters. Since the same byte value can be either a first or a second byte, string searches are also difficult: a simple byte search can match the second byte of one character together with the first byte of the next, which is not a real character. String search algorithms must be tailor-made for Shift JIS.

On the other hand, the competing 8-bit format EUC-JP, which does not support single-byte halfwidth katakana, allows for a much cleaner and more direct conversion to and from JIS X 0208 code points, as all bytes with the high bit set are parts of a double-byte character and all codes in the ASCII range represent single-byte characters.

Unicode also avoids some of the disadvantages of Shift JIS. Unicode does not have ambiguous versions: new characters are assigned to unused places by a single organization, while private use areas are clearly designated, will never be used for standard characters, and are rarely needed due to the comprehensive nature of Unicode. For Shift JIS, by contrast, companies extend the encoding in parallel. UTF-8-encoded Unicode is backwards compatible with ASCII, including at 0x5C, and does not have the string search problem.

For a double-byte JIS sequence j_1 j_2, the transformation to the corresponding Shift JIS bytes s_1 s_2 is:

:s_1 = \begin{cases} \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 112 & \mbox{if } 33 \le j_1 \le 94 \\ \left\lfloor \frac{j_1 + 1}{2} \right\rfloor + 176 & \mbox{if } 95 \le j_1 \le 126 \end{cases}

:s_2 = \begin{cases} j_2 + 31 + \left\lfloor \frac{j_2}{96} \right\rfloor & \mbox{if } j_1 \mbox{ is odd} \\ j_2 + 126 & \mbox{if } j_1 \mbox{ is even} \end{cases}
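The JIS-to-Shift-JIS transformation can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `jis_to_sjis` is invented for this example, and the sample code points (JIS 0x2121 for the ideographic space, JIS 0x3021 for the kanji 亜) were chosen only because their Shift JIS values are widely documented.

```python
def jis_to_sjis(j1: int, j2: int) -> tuple[int, int]:
    """Map a double-byte JIS X 0208 code (j1, j2) to Shift JIS bytes (s1, s2).

    Hypothetical helper written for illustration. Both inputs must lie in
    the 94x94 grid, i.e. 0x21..0x7E (33..126).
    """
    if not (33 <= j1 <= 126 and 33 <= j2 <= 126):
        raise ValueError("JIS bytes must be in the range 0x21-0x7E")
    # Two JIS rows share one lead byte; the larger offset jumps over
    # 0xA0-0xDF so the lead bytes do not collide with halfwidth katakana.
    s1 = (j1 + 1) // 2 + (112 if j1 <= 94 else 176)
    if j1 % 2 == 1:
        # Odd row: second bytes 0x40-0x9E, skipping 0x7F.
        s2 = j2 + 31 + j2 // 96
    else:
        # Even row: second bytes 0x9F-0xFC.
        s2 = j2 + 126
    return s1, s2


if __name__ == "__main__":
    for j1, j2 in [(0x21, 0x21), (0x30, 0x21)]:
        s1, s2 = jis_to_sjis(j1, j2)
        print(f"JIS {j1:02X}{j2:02X} -> Shift JIS {s1:02X}{s2:02X}")
```

The result can be cross-checked against Python's built-in `shift_jis` codec, for example with `bytes(jis_to_sjis(0x30, 0x21)).decode("shift_jis")`.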


Multiple versions

Many different versions of Shift JIS exist. There are two areas for expansion. Firstly, JIS X 0208 does not fill the whole 94×94 space encoded for it in Shift JIS, so there is room for more characters there; these are really extensions to JIS X 0208 rather than to Shift JIS itself. Secondly, Shift JIS has more encoding space than is needed for JIS X 0201 and JIS X 0208 (see § Shift JIS byte map below), and this space can be, and is, used for yet more characters.


Windows-932 / Windows-31J

The most popular extension is Windows code page 932 (a CCSID also used for IBM's extension to Shift JIS), which is registered with the IANA as "Windows-31J", separately from Shift JIS. This was popularized by Microsoft, although Microsoft itself does not recognize the Windows-31J name and instead calls that variation "shift_jis". IBM's code page 943 includes the same double-byte codes as Microsoft's code page 932, while IBM's code page 932 includes fewer extensions (excluding those which Microsoft incorporates from NEC) and retains the character order from the 1978 edition of JIS X 0208, rather than implementing the character variant swaps from the 1983 standard.

Windows-31J assigns 0x5C to U+005C REVERSE SOLIDUS (the backslash) and 0x7E to U+007E TILDE, following US-ASCII. However, most localised fonts on Windows display U+005C as a yen sign for compatibility. Windows-31J includes several extensions, namely "NEC special characters (Row 13), NEC selection of IBM extensions (Rows 89 to 92), and IBM extensions (Rows 115 to 119)", in addition to setting some encoding space aside for end user definition.

Windows code page 932 is the version used in the W3C/WHATWG encoding standard used by HTML5, which includes the "formerly proprietary extensions from IBM and NEC" from Windows-31J in its table for JIS X 0208, and also treats the label "shift_jis" interchangeably with "windows-31j" with the intent of being "compatible with deployed content".
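The practical effect of the NEC extensions can be seen by encoding a row 13 character under both variants. The sketch below assumes Python's standard codecs, where `shift_jis` follows JIS X 0208 and `cp932` implements the Microsoft variant. The helper name `encodable` and the choice of the circled digit ① (U+2460) as the sample character are made up for this example.

```python
def encodable(text: str, codec: str) -> bool:
    """Return True if every character of text has a mapping in codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    circled_one = "\u2460"  # an NEC special character (row 13)
    for codec in ("shift_jis", "cp932"):
        print(codec, encodable(circled_one, codec))
    # Show the two bytes that code page 932 assigns to the character.
    print(circled_one.encode("cp932").hex())
```

Content that was written under the Windows variant but labelled as plain Shift JIS can therefore fail to round-trip through a strict JIS X 0208 based codec.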


MacJapanese

The version of Shift JIS originating from the classic Mac OS (known as x-mac-japanese, Code page 10001 or MacJapanese) assigned the tilde to 0x7E (following US-ASCII, not JIS X 0201, which assigns the overline here), but the yen sign to 0x5C (as in JIS X 0201 and standard Shift JIS). It also extended the single-byte range by assigning the backslash to 0x80 (corresponding to 0x5C in US-ASCII), the non-breaking space to 0xA0, the copyright sign to 0xFD, the trademark symbol to 0xFE and the half-width horizontal ellipsis to 0xFF. It also added extended double-byte characters, including 53 vertical presentation forms in the range 0xEB41–0xED96, at 84 JIS rows down from their canonical forms, and 260 special characters in the Shift_JIS range 0x8540–0x886D. This variant was introduced in KanjiTalk version 7.

However, certain Mac OS typefaces used other variants. Sai Mincho and Chu Gothic use a "PostScript" variant of MacJapanese, which included additional vertical presentation forms and a different set of extended special characters, based on the NEC special characters, some of which were only available in the printer versions of the fonts. Older versions of Maru Gothic and Hon Mincho from System 7.1 encoded vertical presentation forms at 10 (not 84) JIS rows down from their canonical forms and did not include the special character extensions; this was subsequently changed. The typical variant used with KanjiTalk version 6 placed the vertical presentation forms 10 rows down, and also used the NEC extension layout for row 13.


Shift_JISx0213 and Shift_JIS-2004

The newer JIS X 0213 standard defines an extended variant of Shift_JIS referred to as Shift_JISx0213 (in a previous version of the standard) or Shift_JIS-2004. It is a superset of standard Shift JIS. In order to represent the allocated rows on both planes of JIS X 0213, Shift_JIS-2004 uses the following method of mapping codepoints.

:s_1 = \begin{cases} \left\lfloor \frac{k + 257}{2} \right\rfloor & \mbox{if } m = 1 \mbox{ and } 1 \le k \le 62 \\ \left\lfloor \frac{k + 385}{2} \right\rfloor & \mbox{if } m = 1 \mbox{ and } 63 \le k \le 94 \\ \left\lfloor \frac{k + 479}{2} \right\rfloor - \left\lfloor \frac{k}{8} \right\rfloor \times 3 & \mbox{if } m = 2 \mbox{ and } k = 1, 3, 4, 5, 8, 12, 13, 14, 15 \\ \left\lfloor \frac{k + 411}{2} \right\rfloor & \mbox{if } m = 2 \mbox{ and } 78 \le k \le 94 \end{cases}

:s_2 = \begin{cases} t + 63 & \mbox{if } k \mbox{ is odd and } 1 \le t \le 63 \\ t + 64 & \mbox{if } k \mbox{ is odd and } 64 \le t \le 94 \\ t + 158 & \mbox{if } k \mbox{ is even} \end{cases}

In the above, s_1 s_2 is a two-byte Shift_JIS-2004 sequence, m is the plane (''men'') number (1 or 2), k is the row (''ku'') number (1–94) and t is the cell (''ten'') number (1–94). The ''ku'' and ''ten'' numbers are equivalent to j_1 - 32 and j_2 - 32 respectively, where j_1 j_2 is a two-byte JIS sequence referencing a given plane.

The same set of characters can be represented by EUC-JIS-2004, the EUC-JP based counterpart. Some of the additions collide with popular Shift JIS extensions, including Windows code page 932, which is used in web standards (see above). For example, compare plane 1 row 89 in JIS X 0213 (beginning 硃, 硎, 硏…) to row 89 in the JIS X 0208 variant defined in web standards (beginning 纊, 褜, 鍈…). In addition, some of the characters map to Unicode characters beyond the BMP.
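The Shift_JIS-2004 mapping translates directly into code. The following Python sketch is illustrative only: the function name `kuten_to_sjis2004` and the constant `PLANE2_LOW_ROWS` are invented for this example, and the floor-division constants are the ones that reproduce the lead-byte layout described by the formulas (plane 1 on 0x81–0x9F and 0xE0–0xEF, plane 2 on 0xF0–0xFC).

```python
# Plane 2 rows that are packed into lead bytes 0xF0-0xF4 (hypothetical name).
PLANE2_LOW_ROWS = {1, 3, 4, 5, 8, 12, 13, 14, 15}


def kuten_to_sjis2004(m: int, k: int, t: int) -> tuple[int, int]:
    """Map plane m, row (ku) k, cell (ten) t to Shift_JIS-2004 bytes."""
    if not (1 <= k <= 94 and 1 <= t <= 94):
        raise ValueError("ku and ten must be in the range 1-94")
    if m == 1:
        s1 = (k + 257) // 2 if k <= 62 else (k + 385) // 2
    elif m == 2 and k in PLANE2_LOW_ROWS:
        s1 = (k + 479) // 2 - (k // 8) * 3
    elif m == 2 and 78 <= k <= 94:
        s1 = (k + 411) // 2
    else:
        raise ValueError("row is not allocated on this plane")
    if k % 2 == 1:
        # Odd row: second bytes 0x40-0x7E and 0x80-0x9E (0x7F is skipped).
        s2 = t + 63 if t <= 63 else t + 64
    else:
        # Even row: second bytes 0x9F-0xFC.
        s2 = t + 158
    return s1, s2


if __name__ == "__main__":
    for m, k, t in [(1, 1, 1), (1, 16, 1), (2, 1, 1), (2, 8, 1), (2, 94, 94)]:
        s1, s2 = kuten_to_sjis2004(m, k, t)
        print(f"plane {m}, ku {k}, ten {t} -> {s1:02X}{s2:02X}")
```

For plane 1 the function agrees with the plain Shift JIS transformation once ''ku'' and ''ten'' are converted with j_1 = k + 32 and j_2 = t + 32.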


Other variants

The space with lead bytes 0xF5 to 0xF9 (beyond the region used for JIS X 0208) is used by Japanese mobile phone operators for pictographs for use in e-mail. KDDI goes further and defines hundreds more in the space with lead bytes 0xF3 and 0xF4. Beyond even this, there have been numerous minor variations made on Shift JIS, with individual characters here and there altered. Most of these extensions and variants have no IANA registration, so there is much scope for confusion if the extensions are used.

A further variant must be used when embedding Shift JIS in source code string literals of C and similar programming languages. Because 0x5C begins an escape sequence, this variant doubles the byte 0x5C when it appears as the second byte of a two-byte character, but not when it appears as a single "¥" (ASCII: "\") character, where the escape is intended. The most convenient way of handling this is a special editor that encodes text this way.
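The doubling rule for source code strings can be sketched as a small byte-level pass. This Python example is a rough illustration, not an established tool: the names `is_lead_byte` and `escape_for_c_literal` are invented here, and the lead-byte test assumes the ranges 0x81–0x9F and 0xE0–0xFC so that vendor extensions are covered as well as standard Shift JIS.

```python
def is_lead_byte(b: int) -> bool:
    """True if b starts a two-byte character (assumed ranges, see lead-in)."""
    return 0x81 <= b <= 0x9F or 0xE0 <= b <= 0xFC


def escape_for_c_literal(sjis: bytes) -> bytes:
    """Double 0x5C only where it is the second byte of a two-byte character."""
    out = bytearray()
    i = 0
    while i < len(sjis):
        b = sjis[i]
        if is_lead_byte(b) and i + 1 < len(sjis):
            trail = sjis[i + 1]
            out += bytes((b, trail))
            if trail == 0x5C:
                # The compiler would otherwise read this byte as an escape.
                out.append(0x5C)
            i += 2
        else:
            # Single-byte character: an intentional escape stays untouched.
            out.append(b)
            i += 1
    return bytes(out)


if __name__ == "__main__":
    # The katakana ソ is 0x83 0x5C in Shift JIS, so its second byte is doubled.
    source = "ソ".encode("shift_jis") + b"\\n"
    print(escape_for_c_literal(source).hex(" "))
```

Walking the bytes from the start of the string is what makes the pass reliable, since a 0x5C byte cannot be classified as single or second byte without knowing where the preceding character began.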


Shift JIS byte map


As defined in JIS X 0208:1997

The chart below gives the detailed meaning of each byte in a stream encoded in standard Shift JIS (conforming to JIS X 0208:1997).


With vendor or JIS X 0213 extensions

Some of the bytes which are not used for single-byte codes or initial bytes in the standard layout are used by certain extensions, resulting in the layout detailed in the chart below.


See also

* Japanese language and computers
* Code page 932 (Microsoft Windows)
* Mojibake
* Shift JIS art


External links


* Shift-JIS Kanji Table: a table of the non-ASCII part of the codeset
* Microsoft's definition
* Forms of Shift-JIS in ICU (International Components for Unicode):
** ibm-942 (sjis78)
** ibm-943 (contains the \u00A5 ↔ \x5C mapping)
** Shift JIS (contains the \u005C ↔ \x5C mapping)