Truecasing, also called capitalization recovery, capitalization correction, or case restoration, is the problem in natural language processing (NLP) of determining the proper capitalization of words where such information is unavailable. The problem commonly arises from the standard practice (in English and many other languages) of automatically capitalizing the first word of a sentence, which hides whether that word would be capitalized elsewhere. It can also arise in badly cased or noncased text (for example, all-lowercase or all-uppercase text messages). Truecasing is unnecessary in languages whose scripts do not have a distinction between uppercase and lowercase letters. This includes all languages not written in the Latin, Greek, Cyrillic, or Armenian alphabets, such as Japanese, Chinese, Thai, Hebrew, Arabic, Hindi, and Georgian.


Techniques

* Neural network models that operate at the word level or the character level have been trained to recover capitalization with greater than 90% accuracy.
* Sentence segmentation can be used to determine where sentences begin, to implement the rule that the first word of every sentence must be capitalized.
* Part-of-speech tagging can be used to identify proper nouns (such as Africa, Jupiter, Sarah, or Amazon), which must be capitalized. In some cases, the same word can be used as different parts of speech and is capitalized differently. For example, Xerox the company, as a noun, is capitalized, but to xerox a document, as a verb, is not capitalized. A xerox, as in the copy of a document, can be recognized by the presence of a determiner, which is not used for proper nouns.
* Named entity recognition can be used to identify proper nouns, which must be capitalized.
* A spell checker can be used to identify words that are always capitalized.
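
As a rough illustration, two of the techniques above (sentence segmentation and a dictionary of always-capitalized words) can be combined into a small rule-based truecaser. The following Python sketch is a toy example and not a reference implementation: the lexicon `ALWAYS_CAPITALIZED`, the regular expressions, and the function name `truecase` are all assumptions made for illustration, and the sentence splitter is deliberately naive (it breaks on abbreviations such as "Dr.").

```python
import re

# Hypothetical lexicon of words that are always capitalized, standing in
# for the dictionary a spell checker would supply.
ALWAYS_CAPITALIZED = {
    "africa": "Africa",
    "jupiter": "Jupiter",
    "sarah": "Sarah",
    "i": "I",
}

# Naive sentence boundary: whitespace that follows ".", "!" or "?".
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")
# A word is a run of ASCII letters and apostrophes (assumption for the sketch).
WORD = re.compile(r"[A-Za-z']+")


def _restore_word(match):
    """Return the lexicon casing for a word, or the word unchanged."""
    word = match.group(0)
    return ALWAYS_CAPITALIZED.get(word, word)


def truecase(text):
    """Lowercase the text, then restore case from the lexicon and
    capitalize the first letter of each detected sentence."""
    sentences = SENTENCE_BOUNDARY.split(text.lower().strip())
    restored = []
    for sentence in sentences:
        sentence = WORD.sub(_restore_word, sentence)
        # Sentence-initial rule: uppercase the first character only.
        sentence = sentence[:1].upper() + sentence[1:]
        restored.append(sentence)
    return " ".join(restored)


print(truecase("SARAH FLEW TO AFRICA. i saw jupiter last night!"))
# → Sarah flew to Africa. I saw Jupiter last night!
```

A sketch like this cannot handle words whose capitalization depends on context, such as Xerox versus xerox; that is where part-of-speech tagging, named entity recognition, or a trained model would be needed.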


Applications

Truecasing aids in other NLP tasks, such as named entity recognition (NER), automatic content extraction (ACE), and machine translation. Proper capitalization allows easier detection of proper nouns, which are the starting points of NER and ACE. Some translation systems use statistical machine learning techniques, which could make use of the information contained in capitalization to increase accuracy.


See also

* Sentence case
* Title case

