Truecasing, also called capitalization recovery, capitalization correction, or case restoration, is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. This commonly comes up due to the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages).
Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This covers most languages not written in the Latin, Greek, Cyrillic, or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.
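As a rough illustration of the point above, a pipeline could check whether a text contains any cased letters before attempting truecasing. The sketch below relies on Unicode general categories from Python's standard `unicodedata` module; the function name `has_case_distinction` is hypothetical, and the result for a given script depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

# Unicode general categories for cased letters:
# Lu = uppercase, Ll = lowercase, Lt = titlecase.
CASED_CATEGORIES = {"Lu", "Ll", "Lt"}


def has_case_distinction(text: str) -> bool:
    """Return True if `text` contains at least one cased letter.

    Hypothetical helper: if this returns False, there is nothing
    for a truecaser to restore and the step can be skipped.
    """
    return any(unicodedata.category(ch) in CASED_CATEGORIES for ch in text)


if __name__ == "__main__":
    for sample in ["hello world", "МОСКВА", "こんにちは", "שלום"]:
        print(sample, has_case_distinction(sample))
```

Checking categories rather than comparing `str.upper()` with `str.lower()` keeps the test independent of how case mappings are defined for a particular script.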
Techniques
* Neural network models that operate at the word level or the character level have been trained to recover capitalization with greater than 90% accuracy.
* Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
* Part-of-speech tagging can be used to identify proper nouns (such as Africa, Jupiter, Sarah, or Amazon), which must be capitalized. In some cases, the same word can be used as different parts of speech and is capitalized differently. For example, Xerox the company, as a noun, is capitalized, but to xerox a document, as a verb, is not. A xerox, as in the copy of a document, can be recognized by the presence of a determiner, which is not normally used with proper nouns.
* Named entity recognition can be used to identify proper nouns, which must be capitalized.
* A spell checker can be used to identify words that are always capitalized.
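Several of the rule-based techniques listed above can be combined in a small sketch. This is only an illustration under simplifying assumptions, not a production method: a regular expression stands in for a real sentence segmenter, a tiny hand-written lexicon stands in for a spell checker dictionary or named entity recognizer, and a look at the preceding word stands in for part-of-speech tagging. All names and word lists (`ALWAYS_CAPITALIZED`, `AMBIGUOUS`, `COMMON_USE_CUES`, `truecase`) are hypothetical.

```python
import re

# Hypothetical lexicon of words that are always capitalized
# (stand-in for a spell checker dictionary or an NER system).
ALWAYS_CAPITALIZED = {"africa": "Africa", "jupiter": "Jupiter", "sarah": "Sarah"}

# Hypothetical words whose case depends on how they are used.
AMBIGUOUS = {"xerox": "Xerox"}

# If one of these precedes an ambiguous word, treat it as a common noun
# (after a determiner) or a verb (after "to") and leave it lowercase.
# This is a crude stand-in for part-of-speech tagging.
COMMON_USE_CUES = {"a", "an", "the", "this", "that", "to"}

WORD = re.compile(r"\w")


def truecase(text: str) -> str:
    """Restore capitalization of `text` with a few naive rules."""
    # Split into alternating word and non-word runs so spacing is preserved.
    tokens = re.findall(r"\w+|\W+", text.lower())
    out = []
    sentence_start = True
    prev_word = None
    for tok in tokens:
        if not WORD.match(tok):
            # Naive sentence segmentation: ., ! or ? ends a sentence.
            if any(ch in ".!?" for ch in tok):
                sentence_start = True
            out.append(tok)
            continue
        word = tok
        if tok in ALWAYS_CAPITALIZED:
            word = ALWAYS_CAPITALIZED[tok]
        elif tok in AMBIGUOUS and prev_word not in COMMON_USE_CUES:
            word = AMBIGUOUS[tok]
        if sentence_start:
            # The first word of every sentence is capitalized.
            word = word[0].upper() + word[1:]
        out.append(word)
        prev_word = tok
        sentence_start = False
    return "".join(out)


if __name__ == "__main__":
    print(truecase("SARAH WORKS AT XEROX. SHE HAD TO XEROX A XEROX OF JUPITER."))
```

The preceding-word rule misfires on phrases such as "the Xerox machine", and the punctuation rule misfires on abbreviations, which is why real systems use trained taggers, segmenters, or neural models instead.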
Applications
Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation.
Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use
statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.
See also
* Sentence case
* Title case