
Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.


Applications

Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context (Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' 15: 287–333. doi:10.1006/csla.2001.0169). For example:

* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.
* "vi" could be pronounced as "vie," "vee," or "the sixth" depending on the surrounding words.

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be reduced to base forms by stemming or lemmatization (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
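
The search-oriented steps described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names and the tiny stop list are hypothetical, diacritics are removed by decomposing the text (Unicode NFD) and dropping combining marks, and stemming and spelling canonicalization are left out because they need language-specific resources.

```python
import unicodedata

# Hypothetical stop list for illustration; real systems use larger,
# task-specific lists.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def strip_diacritics(text: str) -> str:
    """Remove combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_for_search(text: str) -> list[str]:
    """Strip diacritics, fold case, split on whitespace, drop stop words."""
    folded = strip_diacritics(text).casefold()
    return [tok for tok in folded.split() if tok not in STOP_WORDS]


print(normalize_for_search("The Résumé of John"))  # ['resume', 'john']
```

Applying the same function to both the stored text and the query is what makes "resume" match "résumé" and "john" match "John"; normalizing only one side would not help.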


Techniques

For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions suffice. For example, the sed script sed -E "s/[[:space:]]+/ /g" ''inputfile'' would normalize runs of whitespace characters within each line into a single space. More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text (Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics'': 688–695. CiteSeerX 10.1.1.72.8138) and as a special case of machine translation (Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation." ''Proceedings of the International Multiconference on Computer Science and Information Technology'' 1: 51–56; Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation." ''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)'': 9–14).
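
To make the tokenize-and-tag idea concrete, the following Python sketch collapses whitespace with a regular expression, splits the text into tokens, assigns each token a tag, and expands tagged tokens into words. It is a toy, rule-based illustration and not the method of any of the cited papers: the tag names, the patterns, and every function name are assumptions made for this example, and numbers are only spelled out in English for 0 to 999.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


# Hypothetical tag set; each pattern must match a whole token.
TAG_PATTERNS = [
    ("CURRENCY", re.compile(r"\$(\d{1,3})")),
    ("NUMBER", re.compile(r"(\d{1,3})")),
]


def tag(token: str) -> tuple[str, str]:
    """Return (tag, token); tokens matching no pattern are tagged WORD."""
    for name, pattern in TAG_PATTERNS:
        if pattern.fullmatch(token):
            return name, token
    return "WORD", token


def expand(tag_name: str, token: str) -> str:
    """Rewrite a tagged token into the words to be read aloud."""
    if tag_name == "CURRENCY":
        amount = int(token[1:])
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    if tag_name == "NUMBER":
        return number_to_words(int(token))
    return token


def normalize(text: str) -> str:
    # Context-independent step: collapse runs of whitespace.
    text = re.sub(r"\s+", " ", text).strip()
    # Tokenize on spaces, tag each token, then expand according to its tag.
    return " ".join(expand(*tag(tok)) for tok in text.split(" "))


print(normalize("It  costs   $200 for 3 nights"))
# It costs two hundred dollars for three nights
```

Each token here is tagged in isolation, so the sketch cannot resolve cases that depend on surrounding words, and a token with attached punctuation such as "$200." is left unexpanded. Handling those cases is where the domain knowledge and more elaborate models mentioned above come in.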


Textual scholarship

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization, for example in the extension of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a ''diplomatic edition'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.


See also

* Automated paraphrasing
* Canonicalization
* Text simplification
* Unicode equivalence

