Text normalization is the process of transforming

text Text may refer to: Written word * Text (literary theory) In literary theory, a text is any object that can be "read", whether this object is a work of literature, a street sign, an arrangement of buildings on a city block, or styles of clothi ...

into a single

canonical form In mathematics and computer science, a canonical, normal, or standard form of a mathematical object is a standard way of presenting that object as a mathematical expression. Often, it is one which provides the simplest representation of an obje ...

that it might not have had before. Normalizing text before storing or processing it allows for

separation of concerns In computer science, separation of concerns (sometimes abbreviated as SoC) is a design principle for separating a computer program into distinct sections. Each section addresses a separate '' concern'', a set of information that affects the code o ...

, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.

Applications

Text normalization is frequently used when converting text to speech.

Number A number is a mathematical object used to count, measure, and label. The most basic examples are the natural numbers 1, 2, 3, 4, and so forth. Numbers can be represented in language with number words. More universally, individual numbers can ...

s, dates,

acronym An acronym is a type of abbreviation consisting of a phrase whose only pronounced elements are the initial letters or initial sounds of words inside that phrase. Acronyms are often spelled with the initial Letter (alphabet), letter of each wor ...

s, and

abbreviation An abbreviation () is a shortened form of a word or phrase, by any method including shortening (linguistics), shortening, contraction (grammar), contraction, initialism (which includes acronym), or crasis. An abbreviation may be a shortened for ...

s are non-standard "words" that need to be pronounced differently depending on context.Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' 15; 287–333. doibr>10.1006/csla.2001.0169
For example: * "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan. * "vi" could be pronounced as " vie," " vee," or " the sixth" depending on the surrounding words. Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single

case Case or CASE may refer to: Instances * Instantiation (disambiguation), a realization of a concept, theme, or design * Special case, an instance that differs in a certain way from others of the type Containers * Case (goods), a package of relate ...

. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have

stop word Stop words are the words in a stop list (or ''stoplist'' or ''negative dictionary'') which are filtered out ("stopped") before or after processing of natural language data (i.e. text) because they are deemed to have little semantic value or are ot ...

s removed.

Techniques

For simple, context-independent normalization, such as removing non-

alphanumeric Alphanumericals or alphanumeric characters are any collection of number characters and letters in a certain language. Sometimes such characters may be mistaken one for the other. Merriam-Webster suggests that the term "alphanumeric" may often ...

characters or diacritical marks,

regular expressions A regular expression (shortened as regex or regexp), sometimes referred to as rational expression, is a sequence of character (computing), characters that specifies a pattern matching, match pattern in string (computer science), text. Usually ...

would suffice. For example, the sed script sed ‑e "s/\s+/ /g" ''inputfile'' would normalize runs of

whitespace character A whitespace character is a character data element that represents white space when text is rendered for display by a computer. For example, a ''space'' character (, ASCII 32) represents blank space such as a word divider in a Western scrip ...

s into a single space. More complex normalization requires correspondingly complicated algorithms, including

domain knowledge Domain knowledge is knowledge of a specific discipline or field in contrast to general (or domain-independent) knowledge. The term is often used in reference to a more general discipline—for example, in describing a software engineer who has ge ...

of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of textZhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. doibr>10.1.1.72.8138
and as a special case of machine translation.Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006)
"Text Normalization as a Special Case of Machine Translation."
''Proceedings of the International Multiconference on Computer Science and Information Technology'' 1; 51–56.Mosquera, A.; Lloret, E.; Moreda, P. (2012)
"Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation"
''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)''; 9-14

Textual scholarship

In the field of

textual scholarship Textual scholarship (or textual studies) is an umbrella term for disciplines that deal with describing, transcribing, editing or annotating text (literary theory), texts and physical documents. Overview Textual research is mainly historically orie ...

and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization – for example in the extension of

scribal abbreviation Scribal abbreviations, or sigla (grammatical number, singular: siglum), are abbreviations used by ancient and medieval scribes writing in various languages, including Latin, Greek language, Greek, Old English and Old Norse. In modern Textua ...

s and the transliteration of the archaic

glyph A glyph ( ) is any kind of purposeful mark. In typography, a glyph is "the specific shape, design, or representation of a character". It is a particular graphical representation, in a particular typeface, of an element of written language. A ...

s typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a '' diplomatic edition'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements); and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.

References

{{Reflist Natural language processing

Applications

Techniques

Textual scholarship

See also

References