Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.

Applications

Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context (Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' 15; 287–333. doi:10.1006/csla.2001.0169). For example:

* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.
* "vi" could be pronounced as "vie," "vee," or "the sixth" depending on the surrounding words.

Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed or lemmatized (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
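The two search-oriented steps described above, removing diacritical marks and converting to a single case, can be sketched in Python with the standard library. This is a minimal illustration, not a complete search pipeline: the function name normalize_for_search is hypothetical, and the choice of Unicode NFKD decomposition followed by case folding is an assumption about how one might implement the idea.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Return a diacritic-free, case-folded key for matching (illustrative helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (the diacritics) and keep everything else.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


print(normalize_for_search("Résumé") == normalize_for_search("resume"))  # True
print(normalize_for_search("John") == normalize_for_search("john"))      # True
```

Applying the same function to both the stored text and the query is what makes the match work; normalizing only one side would not.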

Techniques

For simple, context-independent normalization, such as removing non-alphanumeric characters, regular expressions would suffice. For example, the sed script sed -E "s/[[:space:]]+/ /g" ''inputfile'' would normalize runs of whitespace characters into a single space. More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text (Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. CiteSeerX 10.1.1.72.8138) and as a special case of machine translation (Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation." ''Proceedings of the International Multiconference on Computer Science and Information Technology'' 1; 51–56; Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation." ''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)''; 9–14).
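The simple, context-independent case mentioned above can be sketched with Python's re module. The function names collapse_whitespace and strip_non_alphanumeric are hypothetical, and restricting "alphanumeric" to ASCII letters and digits is an assumption made to keep the example short.

```python
import re


def collapse_whitespace(text: str) -> str:
    # \s+ matches one or more whitespace characters (spaces, tabs, newlines).
    return re.sub(r"\s+", " ", text)


def strip_non_alphanumeric(text: str) -> str:
    # Keep ASCII letters, digits, and whitespace; drop everything else.
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


sample = "Hello,\t  world!\n\nIt's   2024."
print(collapse_whitespace(strip_non_alphanumeric(sample)))  # Hello world Its 2024
```

Neither function looks at the surrounding words, which is why this approach does not extend to cases such as deciding how "vi" should be read.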

Textual scholarship

In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the expansion of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a ''diplomatic edition'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.
