HOME

TheInfoList



OR:

Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or
alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
, or extensions and combinations thereof. Collation is a fundamental element of most office filing systems,
library catalog A library catalog (or library catalogue in British English) is a register of all bibliographic items found in a library or group of libraries, such as a network of libraries at several locations. A catalog for a group of libraries is also ...
s, and
reference book A reference work is a work, such as a paper, book or periodical (or their electronic equivalents), to which one can refer for information. The information is intended to be found quickly when needed. Such works are usually ''referred'' to f ...
s. Collation differs from ''
classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...
'' in that the classes themselves are not necessarily ordered. However, even if the order of the classes is irrelevant, the identifiers of the classes may be members of an ordered set, allowing a
sorting algorithm In computer science, a sorting algorithm is an algorithm that puts elements of a list into an order. The most frequently used orders are numerical order and lexicographical order, and either ascending or descending. Efficient sorting is importa ...
to arrange the items by class. Formally speaking, a collation method typically defines a
total order In mathematics, a total or linear order is a partial order in which any two elements are comparable. That is, a total order is a binary relation \leq on some set X, which satisfies the following for all a, b and c in X: # a \leq a ( reflexi ...
on a set of possible identifiers, called sort keys, which consequently produces a
total preorder In mathematics, especially order theory, a weak ordering is a mathematical formalization of the intuitive notion of a ranking of a set, some of whose members may be tied with each other. Weak orders are a generalization of totally ordered s ...
on the set of items of information (items with the same identifier are not placed in any defined order). A collation algorithm such as the
Unicode collation algorithm The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with ...
defines an order through the process of comparing two given
character string In computer programming, a string is traditionally a sequence of characters, either as a literal constant or as some kind of variable. The latter may allow its elements to be mutated and the length changed, or it may be fixed (after creation) ...
s and deciding which should come before the other. When an order has been defined in this way, a sorting algorithm can be used to put a list of any number of items into that order. The main advantage of collation is that it makes it fast and easy for a user to find an element in the list, or to confirm that it is absent from the list. In automatic systems this can be done using a
binary search algorithm In computer science, binary search, also known as half-interval search, logarithmic search, or binary chop, is a search algorithm that finds the position of a target value within a sorted array. Binary search compares the target value to the m ...
or
interpolation search Interpolation search is an algorithm for searching for a key in an array that has been ordered by numerical values assigned to the keys (''key values''). It was first described by W. W. Peterson in 1957. Interpolation search resembles the method ...
; manual searching may be performed using a roughly similar procedure, though this will often be done unconsciously. Other advantages are that one can easily find the first or last elements on the list (most likely to be useful in the case of numerically sorted data), or elements in a given range (useful again in the case of numerical data, and also with alphabetically ordered data when one may be sure of only the first few letters of the sought item or items).


Ordering


Numerical and chronological

Strings representing
number A number is a mathematical object used to count, measure, and label. The original examples are the natural numbers 1, 2, 3, 4, and so forth. Numbers can be represented in language with number words. More universally, individual numbers can ...
s may be sorted based on the values of the numbers that they represent. For example, "−4", "2.5", "10", "89", "30,000". Note that pure application of this method may provide only a partial ordering on the strings, since different strings can represent the same number (as with "2" and "2.0" or, when
scientific notation Scientific notation is a way of expressing numbers that are too large or too small (usually would result in a long string of digits) to be conveniently written in decimal form. It may be referred to as scientific form or standard index form, o ...
is used, "2e3" and "2000"). A similar approach may be taken with strings representing
dates Date or dates may refer to: *Date (fruit), the fruit of the date palm (''Phoenix dactylifera'') Social activity *Dating, a form of courtship involving social activity, with the aim of assessing a potential partner ** Group dating *Play date, a ...
or other items that can be ordered chronologically or in some other natural fashion.


Alphabetical

Alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
is the basis for many systems of collation where items of information are identified by strings consisting principally of
letters Letter, letters, or literature may refer to: Characters typeface * Letter (alphabet), a character representing one or more of the sounds used in speech; any of the symbols of an alphabet. * Letterform, the graphic form of a letter of the alphabe ...
from an
alphabet An alphabet is a standardized set of basic written graphemes (called letter (alphabet), letters) that represent the phonemes of certain spoken languages. Not all writing systems represent language in this way; in a syllabary, each character ...
. The ordering of the strings relies on the existence of a standard ordering for the letters of the alphabet in question. (The system is not limited to alphabets in the strict technical sense; languages that use a
syllabary In the linguistic study of written languages, a syllabary is a set of written symbols that represent the syllables or (more frequently) moras which make up words. A symbol in a syllabary, called a syllabogram, typically represents an (optiona ...
or
abugida An abugida (, from Ge'ez: ), sometimes known as alphasyllabary, neosyllabary or pseudo-alphabet, is a segmental writing system in which consonant-vowel sequences are written as units; each unit is based on a consonant letter, and vowel not ...
, for example Cherokee, can use the same ordering principle provided there is a set ordering for the symbols used.) To decide which of two strings comes first in alphabetical order, initially their first letters are compared. The string whose first letter appears earlier in the alphabet comes first in alphabetical order. If the first letters are the same, then the second letters are compared, and so on, until the order is decided. (If one string runs out of letters to compare, then it is deemed to come first; for example, "cart" comes before "carthorse".) The result of arranging a set of strings in alphabetical order is that words with the same first letter are grouped together, and within such a group words with the same first two letters are grouped together, and so on.
Capital letter Letter case is the distinction between the letters that are in larger uppercase or capitals (or more formally ''majuscule'') and smaller lowercase (or more formally ''minuscule'') in the written representation of certain languages. The writing ...
s are typically treated as equivalent to their corresponding lowercase letters. (For alternative treatments in computerized systems, see
Automated collation Collation is the assembly of written information into a standard order. Many systems of collation are based on numerical order or alphabetical order, or extensions and combinations thereof. Collation is a fundamental element of most office fili ...
, below.) Certain limitations, complications, and special conventions may apply when alphabetical order is used: *When strings contain spaces or other word dividers, the decision must be taken whether to ignore these dividers or to treat them as symbols preceding all other letters of the alphabet. For example, if the first approach is taken then "car park" will come after "carbon" and "carp" (as it would if it were written "carpark"), whereas in the second approach "car park" will come before those two words. The first rule is used in many (but not all)
dictionaries A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologies ...
, the second in
telephone directories A telephone directory, commonly called a telephone book, telephone address book, phonebook, or the white and yellow pages, is a listing of telephone subscribers in a geographical area or subscribers to services provided by the organization that ...
(so that Wilson, Jim K appears with other people named Wilson, Jim and not after Wilson, Jimbo). *Abbreviations may be treated as if they were spelt out in full. For example, names containing "St." (short for the English word ''
Saint In religious belief, a saint is a person who is recognized as having an exceptional degree of holiness, likeness, or closeness to God. However, the use of the term ''saint'' depends on the context and denomination. In Catholic, Eastern Ortho ...
'') are often ordered as if they were written out as "Saint". There is also a traditional convention in English that surnames beginning ''Mc'' and ''M are listed as if those prefixes were written ''Mac''. *Strings that represent personal names will often be listed by alphabetical order of surname, even if the
given name A given name (also known as a forename or first name) is the part of a personal name quoted in that identifies a person, potentially with a middle name as well, and differentiates that person from the other members of a group (typically a fa ...
comes first. For example, Juan Hernandes and Brian O'Leary should be sorted as "Hernandes, Juan" and "O'Leary, Brian" even if they are not written this way. *Very common initial words, such as ''The'' in English, are often ignored for sorting purposes. So '' The Shining'' would be sorted as just "Shining" or "Shining, The". *When some of the strings contain
numerals A numeral is a figure, symbol, or group of figures or symbols denoting a number. It may refer to: * Numeral system used in mathematics * Numeral (linguistics), a part of speech denoting numbers (e.g. ''one'' and ''first'' in English) * Numerical d ...
(or other non-letter characters), various approaches are possible. Sometimes such characters are treated as if they came before or after all the letters of the alphabet. Another method is for numbers to be sorted alphabetically as they would be spelled: for example ''
1776 Events January–February * January 1 – American Revolutionary War – Burning of Norfolk: The town of Norfolk, Virginia is destroyed, by the combined actions of the British Royal Navy and occupying Patriot forces. * January ...
'' would be sorted as if spelled out "seventeen seventy-six", and '' 24 heures du Mans'' as if spelled "vingt-quatre..." (French for "twenty-four"). When numerals or other symbols are used as special graphical forms of letters, as in ''1337'' for
leet Leet (or "1337"), also known as eleet or leetspeak, is a system of modified spellings used primarily on the Internet. It often uses character replacements in ways that play on the similarity of their glyphs via reflection or other resemblan ...
or ''Se7en'' for the movie title '' Seven'', they may be sorted as if they were those letters. *Languages have different conventions for treating
modified letter A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacriti ...
s and certain letter combinations. For example, in
Spanish Spanish might refer to: * Items from or related to Spain: **Spaniards are a nation and ethnic group indigenous to Spain **Spanish language, spoken in Spain and many Latin American countries **Spanish cuisine Other places * Spanish, Ontario, Can ...
the letter ''ñ'' is treated as a basic letter following ''n'', and the digraphs ''ch'' and ''ll'' were formerly (until 1994) treated as basic letters following ''c'' and ''l'', although they are now alphabetized as two-letter combinations. A list of such conventions for various languages can be found at . In several languages the rules have changed over time, and so older dictionaries may use a different order than modern ones. Furthermore, collation may depend on use. For example, German
dictionaries A dictionary is a listing of lexemes from the lexicon of one or more specific languages, often arranged alphabetically (or by radical and stroke for ideographic languages), which may include information on definitions, usage, etymologies ...
and
telephone directories A telephone directory, commonly called a telephone book, telephone address book, phonebook, or the white and yellow pages, is a listing of telephone subscribers in a geographical area or subscribers to services provided by the organization that ...
use different approaches.


Radical-and-stroke sorting

:''See also Indexing of Chinese characters'' Another form of collation is radical-and-stroke sorting, used for non-alphabetic writing systems such as the
hanzi Chinese characters () are logograms developed for the writing of Chinese. In addition, they have been adapted to write other East Asian languages, and remain a key component of the Japanese writing system where they are known as ''kanji' ...
of
Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of v ...
and the
kanji are the logographic Chinese characters taken from the Chinese script and used in the writing of Japanese. They were made a major part of the Japanese writing system during the time of Old Japanese and are still used, along with the subsequen ...
of
Japanese Japanese may refer to: * Something from or related to Japan, an island country in East Asia * Japanese language, spoken mainly in Japan * Japanese people, the ethnic group that identifies with Japan through ancestry or culture ** Japanese diaspor ...
, whose thousands of symbols defy ordering by convention. In this system, common components of characters are identified; these are called radicals in Chinese and logographic systems derived from Chinese. Characters are then grouped by their primary radical, then ordered by number of pen strokes within radicals. When there is no obvious radical or more than one radical, convention governs which is used for collation. For example, the Chinese character 妈 (meaning "mother") is sorted as a six-stroke character under the three-stroke primary radical 女. The radical-and-stroke system is cumbersome compared to an alphabetical system in which there are a few characters, all unambiguous. The choice of which components of a logograph comprise separate radicals and which radical is primary is not clear-cut. As a result, logographic languages often supplement radical-and-stroke ordering with alphabetic sorting of a phonetic conversion of the logographs. For example, the kanji word ''Tōkyō'' (東京) can be sorted as if it were spelled out in the Japanese characters of the
hiragana is a Japanese syllabary, part of the Japanese writing system, along with ''katakana'' as well as ''kanji''. It is a phonetic lettering system. The word ''hiragana'' literally means "flowing" or "simple" kana ("simple" originally as contras ...
syllabary as "to-u-ki-yo-u" (とうきょう), using the conventional sorting order for these characters. In addition, in Greater China,
surname stroke order The surname stroke order () is a system for the collation of Chinese surnames. It arose as an impartial method of categorization of the order in which names appear in official documentation or in ceremonial procedure without any line of hierarchy ...
ing is a convention in some official documents where people's names are listed without hierarchy.


Automation

When information is stored in digital systems, collation may become an automated process. It is then necessary to implement an appropriate collation
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
that allows the information to be sorted in a satisfactory manner for the application in question. Often the aim will be to achieve an alphabetical or numerical ordering that follows the standard criteria as described in the preceding sections. However, not all of these criteria are easy to automate.''M Programming: A Comprehensive Guide''
Richard F. Walters, Digital Press, 1997
The simplest kind of automated collation is based on the numerical codes of the symbols in a
character set Character encoding is the process of assigning numbers to graphical characters, especially the written characters of human language, allowing them to be stored, transmitted, and transformed using digital computers. The numerical values that ...
, such as
ASCII ASCII ( ), abbreviated from American Standard Code for Information Interchange, is a character encoding standard for electronic communication. ASCII codes represent text in computers, telecommunications equipment, and other devices. Because of ...
coding (or any of its
superset In mathematics, set ''A'' is a subset of a set ''B'' if all elements of ''A'' are also elements of ''B''; ''B'' is then a superset of ''A''. It is possible for ''A'' and ''B'' to be equal; if they are unequal, then ''A'' is a proper subset o ...
s such as
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
), with the symbols being ordered in increasing numerical order of their codes, and this ordering being extended to strings in accordance with the basic principles of alphabetical ordering (mathematically speaking,
lexicographical order In mathematics, the lexicographic or lexicographical order (also known as lexical order, or dictionary order) is a generalization of the alphabetical order of the dictionaries to sequences of ordered symbols or, more generally, of elements of a ...
ing). So a computer program might treat the characters ''a'', ''b'', ''C'', ''d'', and ''$'' as being ordered ''$'', ''C'', ''a'', ''b'', ''d'' (the corresponding ASCII codes are ''$'' = 36, ''a'' = 97, ''b'' = 98, ''C'' = 67, and ''d'' = 100). Therefore, strings beginning with ''C'', ''M'', or ''Z'' would be sorted before strings with lower-case ''a'', ''b'', etc. This is sometimes called '' ASCIIbetical order''. This deviates from the standard alphabetical order, particularly due to the ordering of capital letters before all lower-case ones (and possibly the treatment of spaces and other non-letter characters). It is therefore often applied with certain alterations, the most obvious being case conversion (often to uppercase, for historical reasonsHistorically, computers only handled text in uppercase (this dates back to
telegraph Telegraphy is the long-distance transmission of messages where the sender uses symbolic codes, known to the recipient, rather than a physical exchange of an object bearing the message. Thus flag semaphore is a method of telegraphy, whereas p ...
conventions).
) before comparison of ASCII values. In many collation algorithms, the comparison is based not on the numerical codes of the characters, but with reference to the collating sequence – a sequence in which the characters are assumed to come for the purpose of collation – as well as other ordering rules appropriate to the given application. This can serve to apply the correct conventions used for alphabetical ordering in the language in question, dealing properly with differently cased letters,
modified letter A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacriti ...
s, digraphs, particular abbreviations, and so on, as mentioned above under
Alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
, and in detail in the
Alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
article. Such algorithms are potentially quite complex, possibly requiring several passes through the text. Problems are nonetheless still common when the algorithm has to encompass more than one language. For example, in German dictionaries the word ''ökonomisch'' comes between ''offenbar'' and ''olfaktorisch'', while Turkish dictionaries treat ''o'' and ''ö'' as different letters, placing ''oyun'' before ''öbür''. A standard algorithm for collating any collection of strings composed of any standard
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
symbols is the
Unicode Collation Algorithm The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with ...
. This can be adapted to use the appropriate collation sequence for a given language by tailoring its default collation table. Several such tailorings are collected in Common Locale Data Repository.


Sort keys

In some applications, the strings by which items are collated may differ from the identifiers that are displayed. For example, ''The Shining'' might be sorted as ''Shining, The'' (see
Alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
above), but it may still be desired to display it as ''The Shining''. In this case two sets of strings can be stored, one for display purposes, and another for collation purposes. Strings used for collation in this way are called ''sort keys''.


Issues with numbers

Sometimes, it is desired to order text with embedded numbers using proper numerical order. For example, "Figure 7b" goes before "Figure 11a", even though '7' comes after '1' in
Unicode Unicode, formally The Unicode Standard,The formal version reference is is an information technology standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The standard, whic ...
. This can be extended to
Roman numeral Roman numerals are a numeral system that originated in ancient Rome and remained the usual way of writing numbers throughout Europe well into the Late Middle Ages. Numbers are written with combinations of letters from the Latin alphabet, ea ...
s. This behavior is not particularly difficult to produce as long as only integers are to be sorted, although it can slow down sorting significantly. For example,
Microsoft Windows Windows is a group of several proprietary graphical operating system families developed and marketed by Microsoft. Each family caters to a certain sector of the computing industry. For example, Windows NT for consumers, Windows Server for se ...
does this when sorting
file name A filename or file name is a name used to uniquely identify a computer file in a directory structure. Different file systems impose different restrictions on filename lengths. A filename may (depending on the file system) include: * name &ndas ...
s. Sorting decimals properly is a bit more difficult, because different locales use different symbols for a
decimal point A decimal separator is a symbol used to separate the integer part from the fractional part of a number written in decimal form (e.g., "." in 12.45). Different countries officially designate different symbols for use as the separator. The cho ...
, and sometimes the same character used as a
decimal point A decimal separator is a symbol used to separate the integer part from the fractional part of a number written in decimal form (e.g., "." in 12.45). Different countries officially designate different symbols for use as the separator. The cho ...
is also used as a separator, for example "Section 3.2.5". There is no universal answer for how to sort such strings; any rules are application dependent.


Labeling of ordered items

In some contexts, numbers and letters are used not so much as a basis for establishing an ordering, but as a means of labeling items that are already ordered. For example, pages, sections, chapters, and the like, as well as the items of lists, are frequently "numbered" in this way. Labeling series that may be used include ordinary
Arabic numerals Arabic numerals are the ten numerical digits: , , , , , , , , and . They are the most commonly used symbols to write decimal numbers. They are also used for writing numbers in other systems such as octal, and for writing identifiers such as ...
(1, 2, 3, ...),
Roman numerals Roman numerals are a numeral system that originated in ancient Rome and remained the usual way of writing numbers throughout Europe well into the Late Middle Ages. Numbers are written with combinations of letters from the Latin alphabet, eac ...
(I, II, III, ... or i, ii, iii, ...), or letters (A, B, C, ... or a, b, c, ...). (An alternative method for indicating list items, without numbering them, is to use a bulleted list.) When letters of an alphabet are used for this purpose of
enumeration An enumeration is a complete, ordered listing of all the items in a collection. The term is commonly used in mathematics and computer science to refer to a listing of all of the elements of a set. The precise requirements for an enumeration ...
, there are certain language-specific conventions as to which letters are used. For example, the
Russian Russian(s) refers to anything related to Russia, including: *Russians (, ''russkiye''), an ethnic group of the East Slavic peoples, primarily living in Russia and neighboring countries * Rossiyane (), Russian language term for all citizens and p ...
letters Ъ and Ь (which in writing are only used for modifying the preceding consonant), and usually also Ы, Й, and Ё, are omitted. Also in many languages that use extended
Latin script The Latin script, also known as Roman script, is an alphabetic writing system based on the letters of the classical Latin alphabet, derived from a form of the Greek alphabet which was in use in the ancient Greek city of Cumae, in southern Ita ...
, the
modified letter A diacritic (also diacritical mark, diacritical point, diacritical sign, or accent) is a glyph added to a letter or to a basic glyph. The term derives from the Ancient Greek (, "distinguishing"), from (, "to distinguish"). The word ''diacriti ...
s are often not used in enumeration.


See also

*
Alphabetical order Alphabetical order is a system whereby character strings are placed in order based on the position of the characters in the conventional ordering of an alphabet. It is one of the methods of collation. In mathematics, a lexicographical order is ...
* Asciibetical order *
Sorting Sorting refers to ordering data in an increasing or decreasing manner according to some linear relationship among the data items. # ordering: arranging items in a sequence ordered by some criterion; # categorizing: grouping items with similar pr ...
* Taxonomic sequence * Mac and Mc together *
Unicode equivalence Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with preexisting st ...
* Natural sort order


Notes


References


External links


Unicode Collation Algorithm
Unicode Technical Standard #10

* ttp://www.w3.org/TR/css3-lists Typographical collation for many languages as proposed in the List module of
Cascading Style Sheet Cascading Style Sheets (CSS) is a style sheet language used for describing the presentation of a document written in a markup language such as HTML or XML (including XML dialects such as SVG, MathML or XHTML). CSS is a cornerstone techno ...
s.
Collation Charts
Charts demonstrating language-specific sorting orders in various operating systems and DBMS
ICU Locale Explorer
{{Webarchive, url=https://web.archive.org/web/20080511185729/http://demo.icu-project.org/icu-bin/locexp?_=en_US&x=col , date=2008-05-11 : An online demonstration of sorting in different languages that uses the
Unicode Collation Algorithm The Unicode collation algorithm (UCA) is an algorithm defined in Unicode Technical Report #10, which is a customizable method to produce binary keys from strings representing text in any writing system and language that can be represented with ...
with
International Components for Unicode International Components for Unicode (ICU) is an open-source project of mature C/ C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environ ...