Soft hyphen

In computing and typesetting, a soft hyphen (ISO 8859: 0xAD, Unicode: U+00AD, HTML: &shy; or &#173; or &#xAD;) or syllable hyphen (EBCDIC: 0xCA), abbreviated SHY, is a code point reserved in some coded character sets for the purpose of breaking words across lines by inserting visible hyphens. Two alternative ways of using the soft hyphen character for this purpose have emerged, depending on whether the encoded text will be broken into lines by its recipient or has already been preformatted by its originator.


Text to be formatted by the recipient

The use of SHY characters in text that will be broken into lines by the recipient is the application context considered by the post-1999 HTML and Unicode specifications, as well as some word-processing file formats. In this context, the soft hyphen may also be called a discretionary hyphen or optional hyphen. It serves as an invisible marker that specifies a place in the text where a hyphenated break is allowed, without forcing a line break in an inconvenient place if the text is re-flowed. It becomes visible only after word wrapping, at the end of a line. The soft hyphen's Unicode semantics and HTML implementation are in many ways similar to those of Unicode's zero-width space, with the exception that the soft hyphen preserves the kerning of the characters on either side when it is not visible. The zero-width space does not, as it is considered a visible character even if not rendered, and thus has its own kerning metrics. To show the effect of a soft hyphen in HTML, the words of the following text have been separated with soft hyphens:
Margaret­Are­You­Grieving­Over­Goldengrove­Unleaving­Leaves­Like­The­Things­Of­Man­You­With­Your­Fresh­Thoughts­Care­For­Can­You­Ah­As­The­Heart­Grows­Older­It­Will­Come­To­Such­Sights­Colder­By­And­By­Nor­Spare­A­Sigh­Though­Worlds­Of­Wanwood­Leafmeal­Lie­And­Yet­You­Will­Weep­And­Know­Why­Now­No­Matter­Child­The­Name­Sorrows­Springs­Are­The­Same­Nor­Mouth­Had­No­Nor­Mind­Expressed­What­Heart­Heard­Of­Ghost­Guessed­It­Is­The­Blight­Man­Was­Born­For­It­Is­Margaret­You­Mourn­For
In web browsers that support soft hyphens, resizing the window re-breaks the above text only at word boundaries and inserts a hyphen at the end of each broken line.
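
The recipient-side behaviour described in this section can be illustrated with a minimal Python sketch. It is not how any particular browser works: the function name wrap_with_soft_hyphens and the greedy, character-counting strategy are assumptions made for illustration, and SHY is treated as the only break opportunity.

```python
SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def wrap_with_soft_hyphens(text, width):
    """Greedy line breaking that treats SHY as the only break opportunity.

    Illustrative only: real layout engines also break at spaces and
    measure glyph widths instead of counting characters.
    """
    lines = []
    current = ""
    for fragment in text.split(SHY):
        # One column is reserved for the hyphen that becomes visible
        # if the line is broken after this fragment.
        if current and len(current) + len(fragment) + 1 > width:
            lines.append(current + "-")  # the SHY turns into a visible hyphen
            current = fragment
        else:
            current += fragment  # the SHY stays invisible inside a line
    lines.append(current)
    return lines


sample = "Margaret\u00adAre\u00adYou\u00adGrieving"
narrow = wrap_with_soft_hyphens(sample, 12)
wide = wrap_with_soft_hyphens(sample, 80)
```

With a width of 12 columns the sample is split into "MargaretAre-" and "YouGrieving"; with a width of 80 it stays on one line and no hyphen appears.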


Text preformatted by the originator

The SHY character is also used in text where paragraphs have already been broken into lines, such as certain plain text files, text sent to VT100-style terminal emulators or printers, or pages represented in page description languages. This is the application context originally considered by the EBCDIC and ISO 8859-1 standards and implemented in many VT100 terminal emulators. Here, SHY is a visible hyphen that is usually visually indistinguishable from a regular hyphen, but has been inserted solely for the purpose of line breaking. Encoding it as a separate character distinguishes it from any regular hyphen that is part of the original spelling of the word. This distinction helps when already formatted text is re-used and the line breaks and soft hyphens inserted during word wrapping have to be removed to convert the text back into its unformatted form. For example, the copy or paste function of a terminal emulator can offer to replace line breaks with a space character and to remove any soft hyphens, including any immediately following whitespace characters. An example application that outputs soft hyphens for this reason is the groff text formatter, as used on many Unix/Linux systems to display man pages.
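
The conversion back to unformatted text described above can be sketched in Python as follows. The function name unformat is hypothetical, and the sketch assumes that every line break not preceded by a soft hyphen separates two whole words.

```python
import re

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def unformat(preformatted):
    """Undo word wrapping in text whose line-end hyphens are SHY characters."""
    # Remove each soft hyphen together with any whitespace that follows it
    # (including the line break), which rejoins the divided word.
    text = re.sub(SHY + r"\s*", "", preformatted)
    # The remaining line breaks separated whole words, so they become spaces.
    return text.replace("\n", " ")


wrapped = "The well-known hyphen\u00ad\nation rule keeps\nhard hyphens."
restored = unformat(wrapped)
```

The regular hyphen in "well-known" survives because it is a different character from the soft hyphen, which is exactly the distinction this usage relies on.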


Encodings and definitions

SHY characters in coded character sets, roughly in chronological order:
* EBCDIC placed a SHY character (known there as a "syllable hyphen") at position 202 (0xCA hexadecimal). IBM defined its purpose as a "hyphen used to divide a word at the end of a line [t]hat may be removed when a program adjusts lines."
* The German standard DIN 31626 specified a C1 control code set that defines 0x8D as an "Optional Syllabification Control (OSC)", a "print control character" for marking syllable boundaries in long words. This C1 control set was registered in 1979. (Note: this is not the same as the ISO/IEC 6429 C1 control code.)
* ISO 8859-1:1986 (Latin 1) inherited SHY from EBCDIC, but called it "soft hyphen", placed it at position 0xAD (hexadecimal), and stated its purpose as "for use when a line break has been established within a word". Other ISO 8859 parts placed it at the same position, with the exception of ISO 8859-11 (Latin/Thai), which lacks it.
* IBM code page 850 (an MS-DOS character set covering all ISO 8859-1 characters) placed it at position 240 = 0xF0.
* SGML's "Numeric and Special Graphic" (isonum) character entity set (ISO 8879:1986) includes &shy; for the ISO 8859-1 soft hyphen.
* Unicode 1.0 (1991) and ISO 10646 (1993) took the first 256 code positions from ISO 8859-1, resulting in SHY at Unicode code point U+00AD.
* HTML 2 (1995) incorporated the "&shy;" character entity from SGML, but explicitly discouraged its use.
* HTML 4 (1999) redefined the purpose of the character as marking a hyphenation opportunity, which only becomes visible as a hyphen at the end of a line after formatting.
* Unicode 4.0 (2002) changed the category of its SHY character from "Pd" (punctuation, dash) to "Cf" (other, format), thereby aligning its interpretation of the character with that of HTML 4.

Other commands for marking hyphenation opportunities in text formatting languages (similar to the HTML 4 and Unicode 4.0 interpretation of SHY):
* troff and groff: \%
* TeX and LaTeX: \-
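
Several of the positions listed above can be checked with the codecs in Python's standard library. The following is a small sketch rather than an authoritative table: cp037 is used here as one representative EBCDIC code page (an assumption, since EBCDIC has many variants), the helper name shy_byte is hypothetical, and the reported general category depends on the Unicode version bundled with the interpreter.

```python
import html
import unicodedata

SHY = "\u00ad"  # U+00AD SOFT HYPHEN


def shy_byte(codec):
    """Return the byte value that encodes SHY in `codec`, or None if absent."""
    try:
        return SHY.encode(codec)[0]
    except UnicodeEncodeError:
        return None


# cp037 stands in for EBCDIC; iso8859_11 is the Latin/Thai part.
positions = {
    codec: shy_byte(codec)
    for codec in ("cp037", "latin-1", "cp850", "iso8859_11")
}

# The named and numeric HTML character references all decode to U+00AD.
decoded = [html.unescape(ref) for ref in ("&shy;", "&#173;", "&#xAD;")]

# General category of the character in the bundled Unicode database.
category = unicodedata.category(SHY)
```

Here positions maps cp037 to 0xCA, latin-1 to 0xAD and cp850 to 0xF0, while iso8859_11 yields None because that part has no soft hyphen.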


Security issues

Soft hyphens, like other invisible characters, have been used to obscure malicious domains or URLs in
e-mail spam Email spam, also referred to as junk email, spam mail, or simply spam, is unsolicited messages sent in bulk by email (spamming). The name comes from a Monty Python sketch in which the name of the canned pork product Spam is ubiquitous, unavoida ...
.
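
One possible countermeasure, sketched here under stated assumptions, is to remove invisible format characters before comparing text against a list of known bad domains. The function names, the use of Unicode category "Cf" as the filter, and the example domain are all illustrative choices, not a description of any particular spam filter.

```python
import unicodedata


def strip_format_characters(text):
    """Drop every character in Unicode category Cf, such as the soft hyphen."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def matches_blocklist(url_text, blocked_domains):
    """Check for blocked domains after removing invisible format characters."""
    cleaned = strip_format_characters(url_text).lower()
    return any(domain in cleaned for domain in blocked_domains)


# Hypothetical obfuscated link: a soft hyphen hides inside the domain name.
hidden = "http://exa\u00admple.test/login"
found = matches_blocklist(hidden, {"example.test"})
```

A plain substring search on the raw text would miss "example.test" because of the embedded soft hyphen, whereas the cleaned text matches.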


See also

* Hard hyphen
* Non-breaking space
* Word divider
* Word joiner
* Zero-width space
* Word wrap

