Unicode input is the insertion of a specific Unicode character on a computer by a user; it is a common way to input characters not directly supported by a physical keyboard. Unicode characters can be produced either by selecting them from a display or by typing a certain sequence of keys on a physical keyboard. In addition, a character produced by one of these methods in one web page or document can be copied into another.

In contrast to ASCII's 128-code-point character set (which it contains), Unicode encodes well over a hundred thousand graphemes (characters) from almost all of the world's written languages, and many other signs and symbols besides. A Unicode input system must provide for a large repertoire of characters, ideally all valid Unicode code points. This is different from a keyboard layout, which defines keys and their combinations only for a limited number of characters appropriate for a certain locale.


Unicode numbers

Unicode characters are distinguished by code points, which are conventionally represented by "U+" followed by four, five or six hexadecimal digits, for example U+00AE or U+1D310. Characters in the Basic Multilingual Plane (BMP), which contains modern scripts (including many Chinese and Japanese characters) and many symbols, have a 4-digit code. Historic scripts, but also many modern symbols and pictographs (such as emoticons, emojis, playing cards and many CJK characters), have 5-digit codes.
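The notation can be illustrated with a short Python sketch. The function name `describe_code_point` is only an illustrative choice; the plane number is obtained by dividing the code point by 65,536, the size of one plane.

```python
def describe_code_point(cp: int) -> str:
    """Return the conventional U+ notation plus the plane number."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    plane = cp >> 16  # each plane holds 65,536 (0x10000) code points
    # At least four hexadecimal digits; more are used only when needed.
    return f"U+{cp:04X} (plane {plane})"


print(describe_code_point(0xAE))     # U+00AE (plane 0)
print(describe_code_point(0x1D310))  # U+1D310 (plane 1)
```

Plane 0 is the BMP, so any code point that needs more than four digits lies outside it.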


Availability

An application can display a character only if it can access a font which contains a glyph for the character (Andrew Marcuse, "How to enter Unicode characters in Microsoft Windows", accessed September 13, 2012). Very few fonts have full Unicode coverage; most contain only the glyphs needed to support a few writing systems. However, most modern browsers and other text-processing applications are able to display multilingual content because they perform font substitution, automatically switching to a fallback font when necessary to display characters which are not supported in the current font. Which fonts are used for fallback, and how thorough the Unicode coverage is, vary by software and operating system; some software will search for a suitable glyph in all of the installed fonts, while other software searches only within certain fonts. If an application does not have access to a glyph, the character will usually be shown as the font's ".notdef" glyph, which often appears as an empty box (nicknamed "tofu" after its shape), a box with an X in it, or a box with a question mark in it.
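Font substitution can be pictured as a walk along a fallback chain. The sketch below is only a model: the font names and coverage sets in `FONTS` are invented for illustration, and a real renderer would read coverage from the font files themselves.

```python
# Hypothetical coverage tables: font name -> set of supported code points.
FONTS = {
    "BodySerif": set(range(0x20, 0x7F)) | {0xE9},
    "SymbolsFallback": {0x20AC, 0x2699},
}


def pick_font(char, preferred, fallbacks):
    """Return the first font in the chain that has a glyph for char, else None."""
    for name in [preferred, *fallbacks]:
        if ord(char) in FONTS.get(name, set()):
            return name
    return None  # a renderer would draw the .notdef glyph ("tofu") here


print(pick_font("A", "BodySerif", ["SymbolsFallback"]))  # BodySerif
print(pick_font("€", "BodySerif", ["SymbolsFallback"]))  # SymbolsFallback
```

Software that searches every installed font corresponds to a long `fallbacks` list; software that checks only certain fonts corresponds to a short one.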


Selection from a screen

Many systems provide a way to select Unicode characters visually. ISO/IEC 14755 refers to this as a "screen-selection entry method".

Microsoft Windows has provided a Unicode version of the Character Map program, appearing in the consumer edition since XP. This is limited to characters in the Basic Multilingual Plane (BMP). Characters are searchable by Unicode character name, and the table can be limited to a particular code block. More advanced third-party tools of the same type are also available (a notable freeware example is BabelMap, which supports all Unicode characters).

On most Linux desktop environments, equivalent tools such as gucharmap (GNOME) or kcharselect (KDE) are available.

Generally these tools let the user copy the selected characters to the clipboard and then paste them into the document, rather than inserting them as if they had been typed. It is often practical simply to find the desired character on the web or in another document, and to copy and paste it from there.
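Searching by character name within a block, as character-map tools do, can be approximated with Python's standard `unicodedata` module. This is a minimal sketch rather than how any particular tool is implemented; the function name `search_by_name` and the default range (the BMP) are illustrative choices.

```python
import unicodedata


def search_by_name(fragment, start=0x0000, end=0xFFFF):
    """Return (code point, name) pairs whose character name contains fragment."""
    fragment = fragment.upper()
    hits = []
    for cp in range(start, end + 1):
        name = unicodedata.name(chr(cp), "")  # "" for characters without a name
        if fragment in name:
            hits.append((cp, name))
    return hits


# Restrict the search to the Currency Symbols block (U+20A0 to U+20CF).
for cp, name in search_by_name("euro", 0x20A0, 0x20CF):
    print(f"U+{cp:04X} {chr(cp)} {name}")
```

The reverse direction, from an exact name to a character, is available as `unicodedata.lookup("EURO SIGN")`.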


Decimal input

Some programs running in Microsoft Windows, including recent versions of Word and WordPad, can produce characters from their Unicode code points expressed in decimal and entered on the numeric keypad with the Alt key held down. For example, the Euro sign has 20AC as its hexadecimal code point, which is 8364 in decimal, so Alt+8364 will produce the symbol. Similarly, holding Alt and typing the decimal code point of a double-struck character produces that character.

Decimal code points in the range 160 to 255 must be entered with a leading zero (so that the Windows code page is chosen), and the Windows code page must match Unicode in that range (CP1252 must be used). A number entered with the leading zero yields the character at that code point, but the character produced by the same number without the leading zero depends on the OEM code page, such as Code page 437, and may be a different character.

In programs in which Alt codes over 255 do not work, the character retrieved usually corresponds to the remainder when the number is divided by 256.

The text editor Vim allows characters to be specified by two-character mnemonics (confusingly called "digraphs" by Vim developers). The installed set can be augmented by custom mnemonics defined for arbitrary code points, specified in decimal. For example, as decimal 9881 is equal to hexadecimal 2699, the command :digraph Gr 9881 associates "Gr" with ⚙ (U+2699).

See the HTML section below for use of decimal code points in HTML.
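The arithmetic behind decimal Alt codes can be checked with a few lines of Python. The function `keypad_code` is a simplified model written for this illustration, not a description of the actual Windows implementation; it assumes Code page 437 as the OEM code page and CP1252 as the Windows code page.

```python
def keypad_code(number_text, oem_code_page="cp437", windows_code_page="cp1252"):
    """Simplified model of which character a decimal keypad code selects."""
    number = int(number_text)
    if number > 255:
        return chr(number)  # read as a decimal Unicode code point
    # A leading zero selects the Windows code page, otherwise the OEM one.
    page = windows_code_page if number_text.startswith("0") else oem_code_page
    return bytes([number]).decode(page)


print(0x20AC)               # 8364
print(keypad_code("8364"))  # €
print(keypad_code("0247"))  # ÷
print(keypad_code("247"))   # ≈
print(8364 % 256)           # 172
```

The last line shows the remainder that a program limited to codes up to 255 would typically act on. CP1252 leaves a few byte values unassigned, so this sketch raises an error for those.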


Hexadecimal input

Clause 5.1 of ISO/IEC 14755 describes a "Basic method" whereby a "beginning sequence" is followed by the hexadecimal representation of the code point and the "ending sequence". Most modern systems have some method to emulate this, sometimes limited to four digits (and thus only to the Basic Multilingual Plane).


In Microsoft Windows

Hexadecimal Unicode input can be enabled by adding a string type (REG_SZ) value called EnableHexNumpad to the registry key HKEY_CURRENT_USER\Control Panel\Input Method and assigning the value data 1 to it. Users will need to log off and back in after editing the registry for this input method to start working. (In versions earlier than Vista, users needed to reboot for it to start working.)

Unicode characters can then be entered by holding down Alt, typing + on the numeric keypad followed by the hexadecimal code, and then releasing Alt. This may not work for 5-digit hexadecimal codes. Some versions of Windows may require the digits 0-9 to be typed on the numeric keypad, or require NumLock to be on.

In some applications (Word, WordPad and LibreOffice programs), Alt+X will replace the hexadecimal number to the left of the cursor with the matching Unicode character. Unless it is six hexadecimal digits long, the code must not be preceded by any digit or by the letters a–f, as they may be treated as part of the code to be converted. For example, entering af1 followed by Alt+X (a different key combination applies in the French version) will produce '૱' (U+0AF1), but entering a0000f1 followed by Alt+X will produce 'añ' ('a' followed by character U+00F1).

This facility enables Unicode characters to be entered in other applications: one can create a desired character in WordPad, for example, and then cut and paste it wherever desired.

AutoHotkey commands can specify Unicode characters in hex. For example, the command Send {U+2014} will insert an em dash in a text field in the active window.
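The rule for replacing a hexadecimal number to the left of the cursor can be modelled in Python. This is a simplified sketch under the assumption that at most six trailing hexadecimal digits are consumed; it is not the actual logic of Word, WordPad or LibreOffice, and the function name is hypothetical.

```python
import re


def convert_hex_before_cursor(text):
    """Replace the hex run that ends the text (at most 6 digits) with its character."""
    match = re.search(r"[0-9A-Fa-f]{1,6}\Z", text)
    if match is None:
        return text
    code_point = int(match.group(), 16)
    if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return text  # not a valid Unicode scalar value; leave the text alone
    return text[:match.start()] + chr(code_point)


print(convert_hex_before_cursor("af1"))      # ૱
print(convert_hex_before_cursor("a0000f1"))  # añ
```

In the second call the six-digit limit stops the match before the leading "a", which is why padding the code to six digits protects a preceding letter from being absorbed.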


In macOS

Hex input of Unicode must be enabled. In Mac OS 8.5 and later, one can choose the "Unicode Hex Input" keyboard layout; in OS X (10.10) Yosemite, this can be added in Keyboard → Input Sources. Holding down the Option key, one types the four-digit hexadecimal Unicode code point and the equivalent character appears; one can then release the key (see "Typing special and accented characters").

Characters outside the BMP (the Basic Multilingual Plane) exceed the four-digit limit of the Unicode hex input mechanism, but can be entered by using surrogate pairs: holding down the Option key while entering the first surrogate and then the second surrogate, then releasing the Option key.
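The two four-digit values to type for a character outside the BMP follow from the UTF-16 surrogate formula. A minimal Python sketch, with `surrogate_pair` as an illustrative name:

```python
def surrogate_pair(cp):
    """Return the (high, low) UTF-16 surrogates for a supplementary code point."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have surrogate pairs")
    offset = cp - 0x10000             # a 20-bit value
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low


high, low = surrogate_pair(0x1D310)
print(f"{high:04X} {low:04X}")  # D834 DF10
```

For U+1D310, the example code point used earlier, the digits to enter would therefore be D834 followed by DF10.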


In X11 (Linux and other Unix variants including ChromeOS)

In many applications one or both of the following methods work to directly input Unicode characters:
* Holding Ctrl+Shift and typing u followed by the hex digits, then releasing Ctrl+Shift.
* Entering Ctrl+Shift+u, releasing, then typing the hex digits and pressing Enter (or Space or even, on some systems, pressing and releasing Shift or Ctrl).

This is supported by GTK and Qt applications, and possibly others. In ChromeOS, this is an operating system function.


In platform-independent applications

* In Emacs, C-x 8 RET or, equivalently, M-x insert-char.
* In LibreOffice 5.1 onwards, the Alt+X method described above for Windows works.
* In Opera versions that use the Presto layout engine (i.e. up to and including version 12.xx), entering the hexadecimal number of the desired symbol or character and then pressing a conversion shortcut (a different shortcut applies on macOS).
* In the Vim editor, in insert mode, the user first types Ctrl+V u (for code points up to 4 hex digits long; Ctrl+V U for longer ones), then types the hexadecimal number of the symbol or character desired, and it is converted into the symbol. (On Microsoft Windows, Ctrl+Q may be required instead of Ctrl+V; see Vim documentation: gui_w32.)
* In AutoCAD, \U+ followed by the hexadecimal code, or one of three shortcuts: %%c, %%d, %%p.


HTML

In HTML and XML, character codes to be rendered as characters are prefixed by an ampersand and a number sign (&#), and are followed by a semicolon (;). The code point can be given either in decimal or in hexadecimal; in the latter case it is preceded by an "x". Leading zeros may be omitted. A number of characters may also be represented by a named entity.

Example: In HTML/XML, the copyright sign © (U+00A9) may be coded as:
* &#169; (decimal code point)
* &#xA9; (hexadecimal code point)
* &copy; (entity name)

This works in many pieces of software that accept HTML markup, such as Thunderbird and Wikipedia editing.
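The equivalence of the three forms can be verified with Python's standard `html` module. The helper `numeric_references` is a hypothetical name used only in this sketch.

```python
import html


def numeric_references(char):
    """Return the decimal and hexadecimal character references for char."""
    cp = ord(char)
    return f"&#{cp};", f"&#x{cp:X};"


print(numeric_references("©"))  # ('&#169;', '&#xA9;')

# All of these decode to the same character; leading zeros are optional.
for ref in ("&#169;", "&#xA9;", "&#x00A9;", "&copy;"):
    print(ref, "->", html.unescape(ref))
```

Numeric references work for any code point, whereas named entities exist only for a fixed list of characters.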


See also

* ASCII
* Digraph (programming)
* AltGr key
* Compose key

