
Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems, library catalogs, and reference books.

Collation differs from *classification* in that, in classification, the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a sorting algorithm to arrange the items by class.

Formally speaking, a collation method typically defines a total order on a set of possible identifiers, called sort keys, which consequently produces a total preorder on the set of items of information (items with the same identifier are not placed in any defined order).

A collation algorithm such as the Unicode collation algorithm defines an order through the process of comparing two given character strings and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order.

The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a binary search algorithm or interpolation search; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).
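The lookup advantages described above can be illustrated with a minimal Python sketch. It assumes the list has already been collated in plain code-point order; the helper names `contains` and `with_prefix` are hypothetical and are not taken from any collation library.

```python
from bisect import bisect_left


def contains(sorted_items, target):
    """Binary search: report whether target occurs in an already sorted list."""
    i = bisect_left(sorted_items, target)
    return i < len(sorted_items) and sorted_items[i] == target


def with_prefix(sorted_items, prefix):
    """Return the contiguous run of items that begin with prefix."""
    lo = bisect_left(sorted_items, prefix)  # first item >= prefix
    hi = lo
    while hi < len(sorted_items) and sorted_items[hi].startswith(prefix):
        hi += 1
    return sorted_items[lo:hi]


words = sorted(["carbon", "cart", "carthorse", "apple", "zebra", "carp"])

print(contains(words, "carp"))     # present
print(contains(words, "care"))     # absent, confirmed without scanning the list
print(with_prefix(words, "cart"))  # items known only by their first letters
print(words[0], words[-1])         # first and last elements are immediate
```

Because items sharing a prefix are adjacent in a collated list, the prefix lookup needs only one binary search followed by a short scan.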

# Ordering

## Numerical and chronological

Strings representing numbers may be sorted based on the values of the numbers that they represent. For example, the strings "−4", "2.5", "10", "89", "30,000" are in numerical order. Note that pure application of this method may leave the relative order of some strings undetermined, since different strings can represent the same number (as with "2" and "2.0" or, when scientific notation is used, "2e3" and "2000"); strictly, it yields a total preorder rather than a total order on the strings. A similar approach may be taken with strings representing dates or other items that can be ordered chronologically or in some other natural fashion.
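As a rough sketch of numerical collation in Python, the key function below converts each string to a number before comparison. The function names and the cleaning rules (treating "," as a thousands separator and accepting the Unicode minus sign) are assumptions made for this example only.

```python
def numeric_key(text):
    """Map a numeric string to its value (illustrative cleaning rules only)."""
    cleaned = text.replace("\u2212", "-")  # Unicode minus sign to ASCII hyphen
    cleaned = cleaned.replace(",", "")     # assume "," is a thousands separator
    return float(cleaned)


def total_numeric_key(text):
    """Break ties between equal values by falling back to the string itself."""
    return (numeric_key(text), text)


values = ["10", "\u22124", "30,000", "2.5", "89"]
print(sorted(values, key=numeric_key))

# "2" and "2.0" have the same value, so numeric_key alone does not order them;
# Python's stable sort simply keeps them in their input order.
print(sorted(["2.0", "2"], key=numeric_key))
print(sorted(["2.0", "2"], key=total_numeric_key))
```

The second key function shows one way to turn the preorder into a total order: compare by value first, then by the literal string.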

## Alphabetical

Alphabetical order is the basis for many systems of collation where items of information are identified by strings consisting principally of letters from an alphabet. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question. (The system is not limited to alphabets in the strict technical sense; languages that use a syllabary or abugida, for example Cherokee, can use the same ordering principle provided there is a set ordering for the symbols used.)

To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
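The letter-by-letter comparison described above can be sketched in Python as follows. This is an illustration only: it assumes strings made up solely of the 26 lowercase letters of the basic Latin alphabet, and the names `ALPHABET`, `RANK` and `compare_alphabetical` are invented for the example.

```python
from functools import cmp_to_key

ALPHABET = "abcdefghijklmnopqrstuvwxyz"
RANK = {letter: position for position, letter in enumerate(ALPHABET)}


def compare_alphabetical(a, b):
    """Return -1 if a comes first, 1 if b comes first, 0 if they are equal."""
    for x, y in zip(a, b):
        if RANK[x] != RANK[y]:
            # The first position where the letters differ decides the order.
            return -1 if RANK[x] < RANK[y] else 1
    # One string ran out of letters: the shorter one comes first.
    if len(a) == len(b):
        return 0
    return -1 if len(a) < len(b) else 1


words = ["carthorse", "cart", "carbon", "apple"]
print(sorted(words, key=cmp_to_key(compare_alphabetical)))
```

Replacing `ALPHABET` with the standard ordering of another script's symbols would apply the same principle to that script.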
Capital letters are typically treated as equivalent to their corresponding lowercase letters. (For alternative treatments in computerized systems, see Automation, below.)

Certain limitations, complications, and special conventions may apply when alphabetical order is used:

* When strings contain spaces or other word dividers, the decision must be taken whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. For example, if the first approach is taken then "car park" will come after "carbon" and "carp" (as it would if it were written "carpark"), whereas in the second approach "car park" will come before those two words. The first rule is used in many (but not all) dictionaries, the second in telephone directories (so that "Wilson, Jim K" appears with other people named "Wilson, Jim" and not after "Wilson, Jimbo").
* Abbreviations may be treated as if they were spelt out in full. For example, names containing "St." (short for the English word *Saint*) are often ordered as if they were written out as "Saint". There is also a traditional convention in English that surnames beginning *Mc* and *M'* are listed as if those prefixes were written *Mac*.
* Strings that represent personal names will often be listed by alphabetical order of surname, even if the given name comes first. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way.
* Very common initial words, such as *The* in English, are often ignored for sorting purposes. So *The Shining* would be sorted as just "Shining" or "Shining, The".
* When some of the strings contain numerals (or other non-letter characters), various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example *1776* would be sorted as if spelled out "seventeen seventy-six", and *24 heures du Mans* as if spelled "vingt-quatre..." (French for "twenty-four"). When numerals or other symbols are used as special graphical forms of letters, as in *1337* for leet or *Se7en* for the movie title *Seven*, they may be sorted as if they were those letters.
* Languages have different conventions for treating modified letters and certain letter combinations. For example, in Spanish the letter *ñ* is treated as a basic letter following *n*, and the digraphs *ch* and *ll* were formerly (until 1994) treated as basic letters following *c* and *l*, although they are now alphabetized as two-letter combinations. A list of such conventions for various languages can be found in the article on alphabetical order.

In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German dictionaries and telephone directories use different approaches.
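Some of the conventions listed above can be expressed as key functions. The Python sketch below covers only two of them (the treatment of spaces and the *Mc*/*Mac* convention); the function names are hypothetical, the rules are deliberately simplified, and the surnames are sample data chosen for the example.

```python
def letter_by_letter_key(text):
    """Ignore spaces, so "car park" is compared as if written "carpark"."""
    return text.lower().replace(" ", "")


def word_by_word_key(text):
    """Compare word by word, so a space effectively precedes every letter."""
    return text.lower().split(" ")


def surname_key(name):
    """Treat a leading "Mc" as if it were written "Mac" (simplified rule)."""
    lowered = name.lower()
    if lowered.startswith("mc"):
        lowered = "mac" + lowered[2:]
    return lowered


entries = ["carp", "car park", "carbon"]
print(sorted(entries, key=letter_by_letter_key))  # "car park" comes last
print(sorted(entries, key=word_by_word_key))      # "car park" comes first

surnames = ["McDonald", "MacArthur", "Madden", "Mabry"]
print(sorted(surnames, key=surname_key))
```

Only the comparison key changes; each string is still displayed exactly as written.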

## Radical-and-stroke sorting

See also: Indexing of Chinese characters

Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the hanzi of Chinese and the kanji of Japanese, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 (meaning "mother") is sorted as a six-stroke character under the three-stroke primary radical 女.

The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph constitute separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word *Tōkyō* (東京) can be sorted as if it were spelled out in the Japanese characters of the hiragana syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters.

In addition, in Greater China, surname stroke ordering is a convention in some official documents where people's names are listed without hierarchy.

The radical-and-stroke system, or some similar pattern-matching and stroke-counting method, was traditionally the only practical method for constructing dictionaries that someone could use to look up a logograph whose pronunciation was unknown. With the advent of computers, dictionary programs are now available that allow one to handwrite a character using a mouse or stylus.
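A radical-and-stroke ordering can be sketched as a two-part sort key: the primary radical first, then the stroke count. The small table below is hand-entered for illustration only. It assumes the Kangxi numbering to fix the sequence of radicals (女 as radical 38, 木 as radical 75), which the text above does not specify, and the names `CHARACTER_TABLE` and `radical_stroke_key` are hypothetical.

```python
# character: (assumed Kangxi radical number, total stroke count)
CHARACTER_TABLE = {
    "女": (38, 3),
    "妈": (38, 6),
    "姐": (38, 8),
    "木": (75, 4),
    "林": (75, 8),
}


def radical_stroke_key(character):
    """Group by primary radical, then order by stroke count within the radical."""
    radical_number, strokes = CHARACTER_TABLE[character]
    return (radical_number, strokes)


characters = ["林", "姐", "木", "妈", "女"]
print(sorted(characters, key=radical_stroke_key))
```

A real dictionary would need a complete table and a convention for characters whose primary radical is ambiguous; this sketch handles neither.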

# Automation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation algorithm that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate (Richard F. Walters, *M Programming: A Comprehensive Guide*, Digital Press, 1997).

The simplest kind of automated collation is based on the numerical codes of the symbols in a character set, such as ASCII coding (or any of its supersets such as Unicode), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking, lexicographical ordering). So a computer program might treat the characters *a*, *b*, *C*, *d*, and *$* as being ordered *$*, *C*, *a*, *b*, *d* (the corresponding ASCII codes are *$* = 36, *a* = 97, *b* = 98, *C* = 67, and *d* = 100). Therefore, strings beginning with *C*, *M*, or *Z* would be sorted before strings with lower-case *a*, *b*, etc. This is sometimes called *ASCIIbetical order*. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion before comparison of ASCII values (often to uppercase, for historical reasons: early computers handled text only in uppercase, a practice that dates back to telegraph conventions).

In many collation algorithms, the comparison is based not on the numerical codes of the characters, but on a collating sequence (a sequence in which the characters are assumed to come for the purpose of collation), as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters, modified letters, digraphs, particular abbreviations, and so on, as mentioned above under alphabetical order, and in detail in the Alphabetical order article. Such algorithms are potentially quite complex, possibly requiring several passes through the text.

Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word *ökonomisch* comes between *offenbar* and *olfaktorisch*, while Turkish dictionaries treat *o* and *ö* as different letters, placing *oyun* before *öbür*.

A standard algorithm for collating any collection of strings composed of any standard Unicode symbols is the Unicode Collation Algorithm. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in the Common Locale Data Repository.
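The difference between code-based ordering, case conversion, and a collating sequence can be seen in a short Python sketch. The collating sequence below is a hand-written assumption covering only lowercase letters plus *ñ*; it is not the Unicode Collation Algorithm or a CLDR tailoring, and the names `SPANISH_SEQUENCE`, `SPANISH_RANK` and `spanish_key` are invented for the example.

```python
import unicodedata

chars = ["a", "b", "C", "d", "$"]

# Code-point ("ASCIIbetical") order: '$' (36) < 'C' (67) < 'a' (97) < 'b' < 'd'.
by_code_point = sorted(chars)

# Case conversion (here to uppercase) before comparing the codes.
case_converted = sorted(chars, key=str.upper)

# Hypothetical collating sequence in which "ñ" is a letter following "n".
SPANISH_SEQUENCE = unicodedata.normalize("NFC", "abcdefghijklmnñopqrstuvwxyz")
SPANISH_RANK = {ch: position for position, ch in enumerate(SPANISH_SEQUENCE)}


def spanish_key(word):
    """Rank each character by its position in the collating sequence."""
    normalized = unicodedata.normalize("NFC", word.lower())
    return [SPANISH_RANK[ch] for ch in normalized]


words = ["oso", "ñu", "nube", "nota"]
print(by_code_point)
print(case_converted)
print(sorted(words))                   # code-point order puts "ñu" last
print(sorted(words, key=spanish_key))  # "ñu" falls between "nube" and "oso"
```

Production code would normally rely on an existing implementation of the Unicode Collation Algorithm with the appropriate tailoring instead of a hand-built table like this one.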

## Sort keys

In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, *The Shining* might be sorted as *Shining, The* (see Alphabetical, above), but it may still be desired to display it as *The Shining*. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called *sort keys*.
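One way to store the two sets of strings is to keep a display string and a sort key side by side in each record. The Python sketch below assumes a hypothetical `Entry` record and sample titles; it is not drawn from any particular cataloguing system.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    display: str   # string shown to the user
    sort_key: str  # string used only for collation


entries = [
    Entry(display="The Shining", sort_key="Shining, The"),
    Entry(display="Misery", sort_key="Misery"),
    Entry(display="Carrie", sort_key="Carrie"),
]

# Collate on the sort key, then show the display string.
ordered = sorted(entries, key=lambda entry: entry.sort_key.casefold())
print([entry.display for entry in ordered])
```

Storing the key avoids recomputing it on every comparison and lets a cataloguer override it by hand where automatic rules would get it wrong.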

## Issues with numbers

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in Unicode. This can be extended to Roman numerals. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example, Microsoft Windows ...
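A common way to obtain this ordering is to split each string into alternating text and digit runs and compare the digit runs as integers. The Python sketch below is one such approach under that assumption; it handles only unsigned integers (no decimals or Roman numerals), and `natural_key` is a name chosen for the example.

```python
import re


def natural_key(text):
    """Split into text and digit runs; digit runs are compared as integers."""
    parts = re.split(r"(\d+)", text)
    # With one capturing group, re.split puts text at even indexes and
    # digit runs at odd indexes, so keys always compare like with like.
    return [int(part) if index % 2 else part.casefold()
            for index, part in enumerate(parts)]


labels = ["Figure 11a", "Figure 7b", "Figure 2"]
print(sorted(labels))                   # code-point order
print(sorted(labels, key=natural_key))  # embedded numbers in numerical order
```

Building the key costs an extra pass over every string, which is one reason this kind of sorting can be noticeably slower than a plain comparison.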