Unicode input is the insertion of a specific Unicode character on a computer by a user; it is a common way to input characters not directly supported by a physical keyboard. Unicode characters can be produced either by selecting them from a display or by typing a certain sequence of keys on a physical keyboard. In addition, a character produced by one of these methods in one web page or document can be copied into another. In contrast to ASCII's 128 code points, 95 of which are printable characters (and all of which Unicode contains), Unicode encodes well over a hundred thousand graphemes (characters) from almost all of the world's written languages, and many other signs and symbols besides. A Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout, which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.


Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), which contains modern scripts (including many Chinese and Japanese characters) and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emojis, playing cards and many CJK characters), have 5-digit codes.
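
The "U+" notation can be reproduced programmatically. The following is a minimal Python sketch; the function names are illustrative assumptions, not part of any standard API. It pads the hexadecimal code point to at least four digits and derives the plane number from the bits above the low 16.

```python
def format_code_point(char):
    """Return the conventional U+ notation for a single character."""
    cp = ord(char)
    # At least four hex digits; five or six appear only when needed.
    return f"U+{cp:04X}"


def plane_of(char):
    """Return the plane number (0 is the Basic Multilingual Plane)."""
    # Each plane holds 65,536 code points, so the plane is the value
    # of the code point above its low 16 bits.
    return ord(char) >> 16


print(format_code_point("\u00ae"), plane_of("\u00ae"))          # U+00AE 0
print(format_code_point("\U0001d310"), plane_of("\U0001d310"))  # U+1D310 1
```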


Availability

An application can display a character only if it can access a font which contains a glyph for the character (Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows", accessed September 13, 2012). Very few fonts have full Unicode coverage; most only contain the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback, and how thorough the Unicode coverage is, varies by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts. If an application does not have access to a glyph, the character will usually be shown as the font's ".notdef" glyph, which often appears as an empty box (nicknamed "tofu" because of its shape), a box with an X in it, or a box with a question mark in it.


Selection from a screen

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a ''screen-selection entry method''. Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters). On most Linux desktop environments, equivalent tools, such as gucharmap (GNOME) or kcharselect (KDE), are available. Generally these tools let the user copy the selected characters to the clipboard and then paste them into the document, rather than simulating direct typing. It is often practical to just find the desired character on the web or in another document, and copy and paste it from there.
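
Searching by character name, as character map tools do, can be approximated with Python's standard unicodedata module. This is only a sketch under stated assumptions: the function name search_by_name and the default BMP-only range are illustrative choices, and the results depend on the Unicode version bundled with the Python interpreter, not on any particular character map program.

```python
import unicodedata


def search_by_name(fragment, start=0x0000, end=0xFFFF):
    """List (notation, character, name) for names containing fragment.

    The default range covers only the Basic Multilingual Plane.
    """
    fragment = fragment.upper()
    hits = []
    for cp in range(start, end + 1):
        # Unassigned code points and surrogates have no name; use "".
        name = unicodedata.name(chr(cp), "")
        if fragment in name:
            hits.append((f"U+{cp:04X}", chr(cp), name))
    return hits


for notation, char, name in search_by_name("EURO SIGN"):
    print(notation, char, name)

# An exact name can also be resolved directly.
print(unicodedata.lookup("EURO SIGN"))
```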


Decimal input

Some programs running in Microsoft Windows, including recent versions of Word and WordPad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down. For example, the Euro sign has 20AC as its hexadecimal code point, which is 8364 in decimal, so Alt+8364 will produce the € symbol. Similarly, a double-struck character can be produced from its decimal code point. Decimal code points in the range 160–255 must be entered with a leading zero (so that the Windows code page is chosen), and furthermore the Windows code page must match Unicode in this range (CP1252 must be used). Entered with the leading zero, such a number yields the character at the corresponding code point; without the leading zero, the character produced depends on the OEM code page, such as Code page 437, and may be a different one. In programs in which Alt codes over 255 do not work, the character retrieved usually corresponds to the remainder when the number is divided by 256.

The text editor Vim allows characters to be specified by two-character mnemonics (confusingly called "digraphs" by Vim developers). The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, the command :digraph Gr 9881 associates "Gr" with ⚙ (U+2699). See below for use of decimal code points in HTML.
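
The arithmetic in this section can be checked with a short Python sketch. The function names are illustrative and not part of any Windows API. The value 247 is chosen here only as a sample from the 160–255 range, and Python's cp1252 and cp437 codecs stand in for the Windows and OEM code pages.

```python
def alt_code_for(char):
    """Decimal number to type for a character: its code point in base 10."""
    return ord(char)


def truncated_alt_code(number):
    """Value used by programs in which Alt codes over 255 do not work."""
    return number % 256


def decode_in_code_page(number, code_page):
    """Interpret a number from 0 to 255 as one byte in the given code page."""
    return bytes([number]).decode(code_page)


print(int("20AC", 16))                    # 8364
print(alt_code_for("\u20ac"))             # 8364
print(truncated_alt_code(8364))           # 172
print(decode_in_code_page(247, "cp1252"))  # the character at U+00F7
print(decode_in_code_page(247, "cp437"))   # a different character
```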


Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a ''Basic method'' whereby a ''beginning sequence'' is followed by the hexadecimal representation of the code point and the ''ending sequence''. Most modern systems have some method to emulate this, sometimes limited to four digits (thus only the Basic Multilingual Plane).


In Microsoft Windows

Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Vista, users needed to reboot for it to start working.) Unicode characters can then be entered by holding down Alt, typing + on the numeric keypad followed by the hexadecimal code, and then releasing Alt. This may not work for 5-digit hexadecimal codes. Some versions of Windows may require the digits 0-9 to be typed on the numeric keypad or require NumLock to be on.

In some applications (Word, WordPad and LibreOffice programs), Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or by the letters a–f, as they may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (a different shortcut applies in the French version) will produce '૱' (U+0AF1), but entering a0000f1 followed by Alt+X will produce 'añ' ('a' followed by character U+00F1). This facility enables Unicode characters to be entered in other applications: one can create a desired character in WordPad, for example, and then cut and paste it wherever desired.

AutoHotkey commands can specify Unicode characters in hex. For example, the command Send {U+2014} will insert an em dash in a text field in the active window.
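
The in-place replacement of a hexadecimal number to the left of the cursor can be modelled roughly as follows. This Python sketch is an assumption-laden illustration of the behaviour described above, not the actual algorithm of Word, WordPad or LibreOffice; the function name and the rule of shortening the run until it is a valid code point are hypothetical.

```python
import re

# A run of hexadecimal digits at the very end of the text.
HEX_RUN = re.compile(r"[0-9A-Fa-f]+$")


def convert_trailing_hex(text):
    """Replace the hex number at the end of text with its character.

    At most the last six hex digits are consumed, which is why a code
    shorter than six digits must not directly follow a digit or a-f.
    """
    match = HEX_RUN.search(text)
    if match is None:
        return text
    run = match.group()
    for length in range(min(6, len(run)), 0, -1):
        value = int(run[-length:], 16)
        if value <= 0x10FFFF:  # largest valid Unicode code point
            return text[:len(text) - length] + chr(value)
    return text


print(convert_trailing_hex("af1"))      # the single character U+0AF1
print(convert_trailing_hex("a0000f1"))  # 'a' followed by U+00F1
```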


In macOS

Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the ''Unicode Hex Input'' keyboard layout; in OS X 10.10 Yosemite, this can be added in Keyboard → Input Sources. Holding down the Option key, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the key ("Typing special and accented characters").
Characters outside the Basic Multilingual Plane (BMP) exceed the four-digit limit of the Unicode hex input mechanism but can be entered by using surrogate pairs: holding down the Option key while entering the first surrogate and then the second surrogate, then releasing the Option key.
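
The two surrogates for a character outside the BMP follow from the UTF-16 encoding rules. The Python sketch below computes them; the function name surrogate_pair is an illustrative assumption, and U+1D310 is used only because it appears as an example earlier in this article.

```python
def surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a non-BMP code point."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogates")
    offset = code_point - 0x10000      # a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D310)
print(f"{high:04X} {low:04X}")  # D834 DF10
```

With the Unicode Hex Input layout, these are the two four-digit values that would be typed in sequence.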


In X11 (Linux and other Unix variants including ChromeOS)

In many applications one or both of the following methods work to directly input Unicode characters:
* Holding Ctrl+Shift and typing U followed by the hex digits, then releasing Ctrl+Shift.
* Entering Ctrl+Shift+U, releasing, then typing the hex digits and pressing Enter (or Space or even, on some systems, pressing and releasing a modifier key).
This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function.


In platform-independent applications

* In Emacs, C-x 8 RET or M-x insert-char.
* In LibreOffice 5.1 onwards, the method described above for Windows works.
* In Opera versions that use the Presto layout engine (i.e. up to and including version 12.xx), entering the hexadecimal number of the desired symbol or character and then pressing a conversion shortcut (an alternative shortcut applies on macOS).
* In the Vim editor, in insert mode, the user first types Ctrl+V u (for code points up to 4 hex digits long; Ctrl+V U for longer ones), then types the hexadecimal number of the desired symbol or character, and it will be converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V; see Vim documentation: gui_w32.)
* In AutoCAD, \U+ followed by the hexadecimal code, or one of three shortcut codes.


HTML

In HTML and XML, character codes to be rendered as characters are prefixed by an ampersand and number sign (&#), and are followed by a semicolon (;). The code point can be either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may be represented by a named entity.

''Example:'' In HTML/XML, the copyright sign © (U+00A9) may be coded as:
* &#169; (decimal code point)
* &#xA9; (hexadecimal code point)
* &copy; (entity name)
This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
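
Numeric character references can be generated and decoded with a few lines of Python. In this sketch the function name numeric_references is an illustrative assumption, while html.unescape is the standard-library function that resolves both numeric references and HTML named entities.

```python
import html


def numeric_references(char):
    """Return the decimal and hexadecimal references for a character."""
    cp = ord(char)
    return f"&#{cp};", f"&#x{cp:X};"


print(numeric_references("\u00a9"))  # ('&#169;', '&#xA9;')

# All three spellings, plus one with leading zeros, decode identically.
for ref in ("&#169;", "&#xA9;", "&#x00A9;", "&copy;"):
    print(ref, "->", html.unescape(ref))
```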


See also

* ASCII
* Digraph (programming)
* AltGr key
* Compose key

