Text normalization is the process of transforming text into a single canonical form that it might not have had before. Normalizing text before storing or processing it allows for separation of concerns, since input is guaranteed to be consistent before operations are performed on it. Text normalization requires being aware of what type of text is to be normalized and how it is to be processed afterwards; there is no all-purpose normalization procedure.
Applications
Text normalization is frequently used when converting text to speech. Numbers, dates, acronyms, and abbreviations are non-standard "words" that need to be pronounced differently depending on context. [Sproat, R.; Black, A.; Chen, S.; Kumar, S.; Ostendorf, M.; Richards, C. (2001). "Normalization of non-standard words." ''Computer Speech and Language'' 15; 287–333. doi:10.1006/csla.2001.0169]
For example:
* "$200" would be pronounced as "two hundred dollars" in English, but as "lua selau tālā" in Samoan.
* "vi" could be pronounced as "vie," "vee," or "the sixth" depending on the surrounding words.
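The currency example above can be illustrated with a minimal Python sketch. It is not a real text-to-speech front end: the function names `number_to_words` and `expand_dollars` are hypothetical, only whole-dollar amounts from 0 to 999 are handled, and the output is English only.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in English (illustrative only)."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0-999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def expand_dollars(text: str) -> str:
    """Replace tokens such as "$200" with their spoken English form."""
    def repl(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"

    # Amounts with more than three digits do not match and are left unchanged.
    return re.sub(r"\$(\d{1,3})\b", repl, text)


print(expand_dollars("It costs $200."))
```

A production system would need a separate expansion rule set per language, which is why the same token yields different words in English and Samoan.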
Text can also be normalized for storing and searching in a database. For instance, if a search for "resume" is to match the word "résumé," then the text would be normalized by removing diacritical marks; and if "john" is to match "John", the text would be converted to a single case. To prepare text for searching, it might also be stemmed (e.g. converting "flew" and "flying" both into "fly"), canonicalized (e.g. consistently using American or British English spelling), or have stop words removed.
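As a rough sketch of the search case, diacritic removal and case conversion can be done with Python's standard `unicodedata` module. The function name `normalize_for_search` is hypothetical, and stripping every combining mark is an assumption that suits "résumé" but may be wrong for languages in which the accented letter is a distinct letter.

```python
import unicodedata


def normalize_for_search(text: str) -> str:
    """Strip diacritical marks and fold case so that variants compare equal."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop the combining marks and keep the base letters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # casefold() is a more aggressive lower() intended for caseless matching.
    return stripped.casefold()


print(normalize_for_search("Résumé") == normalize_for_search("resume"))
print(normalize_for_search("John") == normalize_for_search("john"))
```

Applying the same function to both the stored text and the query is what makes the two sides comparable.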
Techniques
For simple, context-independent normalization, such as removing non-alphanumeric characters or diacritical marks, regular expressions would suffice. For example, the sed script
sed -E "s/\s+/ /g" inputfile
would normalize runs of whitespace characters into a single space (the -E option selects extended regular expressions, in which + means "one or more"). More complex normalization requires correspondingly complicated algorithms, including domain knowledge of the language and vocabulary being normalized. Among other approaches, text normalization has been modeled as a problem of tokenizing and tagging streams of text [Zhu, C.; Tang, J.; Li, H.; Ng, H.; Zhao, T. (2007). "A Unified Tagging Approach to Text Normalization." ''Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics''; 688–695. doi:10.1.1.72.8138] and as a special case of machine translation. [Filip, G.; Krzysztof, J.; Agnieszka, W.; Mikołaj, W. (2006). "Text Normalization as a Special Case of Machine Translation." ''Proceedings of the International Multiconference on Computer Science and Information Technology'' 1; 51–56.] [Mosquera, A.; Lloret, E.; Moreda, P. (2012). "Towards Facilitating the Accessibility of Web 2.0 Texts through Text Normalisation." ''Proceedings of the LREC workshop: Natural Language Processing for Improving Textual Accessibility (NLP4ITA)''; 9–14.]
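The regular-expression approach to context-independent normalization described above can also be sketched in Python with the standard `re` module. The function names are hypothetical, and treating "alphanumeric" as ASCII letters and digits only is a simplifying assumption.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Replace every run of whitespace characters with a single space."""
    return re.sub(r"\s+", " ", text)


def strip_non_alphanumeric(text: str) -> str:
    """Remove characters that are neither ASCII letters, digits, nor whitespace."""
    return re.sub(r"[^A-Za-z0-9\s]", "", text)


print(collapse_whitespace("a \t\n b  c"))
print(strip_non_alphanumeric("Hello, world! 42"))
```

Both functions are pure string-to-string transformations, so they can be chained in whatever order a given application requires.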
Textual scholarship
In the field of textual scholarship and the editing of historic texts, the term "normalization" implies a degree of modernization and standardization, for example in the expansion of scribal abbreviations and the transliteration of the archaic glyphs typically found in manuscript and early printed sources. A ''normalized edition'' is therefore distinguished from a ''diplomatic edition'' (or ''semi-diplomatic edition''), in which some attempt is made to preserve these features. The aim is to strike an appropriate balance between, on the one hand, rigorous fidelity to the source text (including, for example, the preservation of enigmatic and ambiguous elements) and, on the other, producing a new text that will be comprehensible and accessible to the modern reader. The extent of normalization is therefore at the discretion of the editor, and will vary. Some editors, for example, choose to modernize archaic spellings and punctuation, but others do not.
See also
* Automated paraphrasing
* Canonicalization
* Text simplification
* Unicode equivalence