An IETF BCP 47 language tag is a standardized code used to identify human languages on the Internet. The tag structure has been standardized by the Internet Engineering Task Force (IETF) in ''Best Current Practice (BCP) 47''; the subtags are maintained by the ''IANA Language Subtag Registry''. To distinguish language variants for countries, regions, or writing systems (scripts), IETF language tags combine subtags from other standards such as ISO 639, ISO 15924, ISO 3166-1 and UN M.49. For example, the tag ''en'' stands for English; ''es-419'' for Latin American Spanish; ''rm-sursilv'' for Romansh Sursilvan; ''sr-Cyrl'' for Serbian written in Cyrillic script; ''nan-Hant-TW'' for Min Nan Chinese using traditional Han characters, as spoken in Taiwan; ''yue-Hant-HK'' for Cantonese using traditional Han characters, as spoken in Hong Kong; and ''gsw-u-sd-chzh'' for Zürich German. Language tags are used by computing standards such as HTTP, HTML, XML and PNG.

History

IETF language tags were first defined in RFC 1766, edited by Harald Tveit Alvestrand and published in March 1995. The tags used ISO 639 two-letter language codes and ISO 3166 two-letter country codes, and allowed registration of whole tags that included variant or script subtags of three to eight letters.

In January 2001, this was updated by RFC 3066, which added the use of ISO 639-2 three-letter codes, permitted subtags with digits, and adopted the concept of language ranges from HTTP/1.1 to help with matching of language tags.

The next revision of the specification came in September 2006 with the publication of RFC 4646 (the main part of the specification), edited by Addison Phillips and Mark Davis, and RFC 4647 (which deals with matching behaviour). RFC 4646 introduced a more structured format for language tags, added the use of ISO 15924 four-letter script codes and UN M.49 three-digit geographical region codes, and replaced the old registry of tags with a new registry of subtags. The small number of previously defined tags that did not conform to the new structure were grandfathered in order to maintain compatibility with RFC 3066.

The current version of the specification, RFC 5646, was published in September 2009. The main purpose of this revision was to incorporate three-letter codes from ISO 639-3 and ISO 639-5 into the Language Subtag Registry, in order to increase the interoperability between ISO 639 and BCP 47.

Syntax of language tags

Each language tag is composed of one or more "subtags" separated by hyphens (-). Each subtag is composed of basic Latin letters or digits only. With the exceptions of private-use language tags beginning with an ''x-'' prefix and grandfathered language tags (including those starting with an ''i-'' prefix and those previously registered in the old Language Tag Registry), subtags occur in the following order:

* A single ''primary language subtag'' based on a two-letter language code from ISO 639-1 (2002) or a three-letter code from ISO 639-2 (1998), ISO 639-3 (2007) or ISO 639-5 (2008), or registered through the BCP 47 process and composed of five to eight letters;
* Up to three optional ''extended language subtags'' composed of three letters each, separated by hyphens; (There is currently no extended language subtag registered in the Language Subtag Registry without an equivalent and preferred primary language subtag. This component of language tags is preserved for backwards compatibility and to allow for future parts of ISO 639.)
* An optional ''script subtag'', based on a four-letter script code from ISO 15924 (usually written in Title Case);
* An optional ''region subtag'' based on a two-letter country code from ISO 3166-1 alpha-2 (usually written in upper case), or a three-digit code from UN M.49 for geographical regions;
* Optional ''variant subtags'', separated by hyphens, each composed of five to eight letters, or of four characters starting with a digit; (Variant subtags are registered with IANA and not associated with any external standard.)
* Optional ''extension subtags'', separated by hyphens, each composed of a single character, with the exception of the letter ''x'', and a hyphen followed by one or more subtags of two to eight characters each, separated by hyphens;
* An optional ''private-use subtag'', composed of the letter ''x'' and a hyphen followed by subtags of one to eight characters each, separated by hyphens.

Subtags are not case-sensitive, but the specification recommends using the same case as in the Language Subtag Registry, where region subtags are UPPERCASE, script subtags are Title Case, and all other subtags are lowercase. This capitalization follows the recommendations of the underlying ISO standards.

Script and region subtags should preferably be omitted when they add no distinguishing information to a language tag. For example, ''es'' is preferred over ''es-Latn'', as Spanish is fully expected to be written in the Latin script; ''ja'' is preferred over ''ja-JP'', as Japanese ''as used in Japan'' does not differ markedly from Japanese as used elsewhere.

Not all linguistic regions can be represented with a valid region subtag: the subnational regional dialects of a primary language are registered as variant subtags. For example, the ''valencia'' variant subtag for the Valencian variant of Catalan is registered in the Language Subtag Registry with the prefix ''ca''. As this dialect is spoken almost exclusively in Spain, the region subtag ''ES'' can normally be omitted.

Furthermore, there are script subtags that do not refer to traditional scripts such as Latin, or even to scripts at all; these usually begin with a ''Z''. For example, ''Zsye'' refers to emojis, ''Zmth'' to mathematical notation, ''Zxxx'' to unwritten documents and ''Zyyy'' to undetermined scripts.

IETF language tags have been used as locale identifiers in many applications. It may be necessary for these applications to establish their own strategy for defining, encoding and matching locales if the strategy described in RFC 4647 is not adequate.

The use, interpretation and matching of IETF language tags is currently defined in RFC 5646 and RFC 4647. The Language Subtag Registry lists all currently valid public subtags. Private-use subtags are not included in the Registry as they are implementation-dependent and subject to private agreements between third parties using them. These private agreements are out of scope of BCP 47.
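The subtag order and the recommended capitalization described in this section can be sketched in Python. This is only an illustration under stated assumptions: the names `LANGTAG`, `parse` and `canonical_case` are hypothetical, the pattern checks well-formedness only (it does not consult the Language Subtag Registry), and grandfathered tags and tags consisting solely of private-use subtags are deliberately not handled.

```python
import re

# Hypothetical pattern for the regular tag form described above.
# It checks shape only; it does not verify that subtags are registered.
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3}|[a-z]{5,8})                    # primary language
    (?P<extlang>(?:-[a-z]{3}){0,3})                         # up to three extended language subtags
    (?:-(?P<script>[a-z]{4}))?                              # script (ISO 15924)
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # region (ISO 3166-1 or UN M.49)
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # variants
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)    # extensions (singleton other than x)
    (?:-(?P<private>x(?:-[a-z0-9]{1,8})+))?$                # private use
    """,
    re.VERBOSE | re.IGNORECASE,
)


def parse(tag):
    """Split a well-formed tag into its components, in recommended case."""
    m = LANGTAG.match(tag)
    if m is None:
        raise ValueError(f"not a well-formed language tag: {tag!r}")

    def split(group):
        return [s.lower() for s in m.group(group).split("-") if s]

    return {
        "language": m.group("language").lower(),
        "extlang": split("extlang"),
        "script": m.group("script").capitalize() if m.group("script") else None,
        "region": m.group("region").upper() if m.group("region") else None,
        "variants": split("variants"),
        "extensions": m.group("extensions").lower().lstrip("-"),
        "private": m.group("private").lower() if m.group("private") else None,
    }


def canonical_case(tag):
    """Re-assemble a tag with region UPPERCASE, script Title Case, rest lowercase."""
    p = parse(tag)
    parts = [p["language"], *p["extlang"]]
    for key in ("script", "region"):
        if p[key]:
            parts.append(p[key])
    parts += p["variants"]
    for key in ("extensions", "private"):
        if p[key]:
            parts.append(p[key])
    return "-".join(parts)


print(canonical_case("ZH-hant-tw"))
print(parse("rm-sursilv")["variants"])
```

A regular expression keeps the sketch short, but it cannot tell whether a subtag is actually registered; an application that needs validity rather than well-formedness would have to check each component against the Language Subtag Registry.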

List of common primary language subtags

The following is a list of some of the more commonly used primary language subtags. The list represents only a small subset (less than 2 percent) of primary language subtags; for full information, the Language Subtag Registry should be consulted directly.

Relation to other standards

Although some types of subtags are derived from ISO or UN core standards, they do not follow these standards absolutely, as this could lead to the meaning of language tags changing over time. In particular, a subtag derived from a code assigned by ISO 639, ISO 15924, ISO 3166, or UN M.49 remains a valid (though deprecated) subtag even if the code is withdrawn from the corresponding core standard. If the standard later assigns a new meaning to the withdrawn code, the corresponding subtag will still retain its old meaning. This stability was introduced in RFC 4646.

ISO 639-3 and ISO 639-1

RFC 4646 defined the concept of an "extended language subtag" (sometimes referred to as ''extlang''), although no such subtags were registered at that time. RFC 5645 and RFC 5646 added primary language subtags corresponding to ISO 639-3 codes for all languages that did not already exist in the Registry. In addition, codes for languages encompassed by certain macrolanguages were registered as extended language subtags. Sign languages were also registered as extlangs, with the prefix ''sgn''. These languages may be represented either with the subtag for the encompassed language alone (''cmn'' for Mandarin) or with a language-extlang combination (''zh-cmn''). The first option is preferred for most purposes. The second option is called "extlang form" and is new in RFC 5646.

Whole tags that were registered prior to RFC 4646 and are now classified as "grandfathered" or "redundant" (depending on whether they fit the new syntax) are deprecated in favor of the corresponding ISO 639-3–based language subtag, if one exists. To list a few examples, ''nan'' is preferred over ''zh-min-nan'' for Min Nan Chinese; ''hak'' is preferred over ''i-hak'' and ''zh-hakka'' for Hakka Chinese; and ''ase'' is preferred over ''sgn-US'' for American Sign Language.

Windows Vista and later versions of Microsoft Windows have RFC 4646 support.
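The preference rules in this section (encompassed-language subtag over extlang form, and ISO 639-3 based subtags over deprecated whole tags) might be applied as in the following Python sketch. The table `PREFERRED` and the function `prefer` are hypothetical names; the table contains only the examples given in the text, whereas a real implementation would read the preferred values from the Language Subtag Registry.

```python
# Deprecated whole tags mentioned in the text, mapped to their preferred form.
# Illustrative excerpt only; the authoritative data is in the registry.
PREFERRED = {
    "zh-min-nan": "nan",
    "i-hak": "hak",
    "zh-hakka": "hak",
    "sgn-us": "ase",
}


def prefer(tag):
    """Return the preferred form of a tag under the simplified rules above."""
    replacement = PREFERRED.get(tag.lower())
    if replacement is not None:
        return replacement
    subtags = tag.split("-")
    # In the regular syntax, a three-letter subtag directly after a two- or
    # three-letter primary language can only be an extended language subtag,
    # so "extlang form" is reduced by dropping the primary language.
    if (
        len(subtags) >= 2
        and len(subtags[0]) in (2, 3)
        and len(subtags[1]) == 3
        and subtags[1].isascii()
        and subtags[1].isalpha()
    ):
        return "-".join(subtags[1:])
    return tag


print(prefer("zh-cmn"))
print(prefer("zh-min-nan"))
```

The table lookup runs first because a grandfathered tag such as ''zh-min-nan'' would otherwise be misread as extlang form.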

ISO 639-5 and ISO 639-1/2

ISO 639-5 defines language collections with alpha-3 codes in a different way than they were initially encoded in ISO 639-2 (including one code already present in ISO 639-1, Bihari, coded inclusively as ''bh'' in ISO 639-1 and ''bih'' in ISO 639-2). Specifically, the language collections are now all defined in ISO 639-5 as inclusive, rather than some of them being defined exclusively. This means that language collections now have a broader scope than before, and in some cases encompass languages that were already encoded separately within ISO 639-2.

For example, the ISO 639-2 code ''afa'' was previously associated with the name "Afro-Asiatic (Other)", excluding languages such as Arabic that already had their own code. In ISO 639-5, this collection is named "Afro-Asiatic languages" and includes all such languages. ISO 639-2 changed the exclusive names in 2009 to match the inclusive ISO 639-5 names.

To avoid breaking implementations that may still depend on the older (exclusive) definition of these collections, ISO 639-5 defines a grouping type attribute for all collections that were already encoded in ISO 639-2 (no such grouping type is defined for the new collections added only in ISO 639-5).

BCP 47 defines a "Scope" property to identify subtags for language collections. However, it does not define any given collection as inclusive or exclusive, and does not use the ISO 639-5 grouping type attribute, although the description fields in the Language Subtag Registry for these subtags match the ISO 639-5 (inclusive) names. As a consequence, BCP 47 language tags that include a primary language subtag for a collection may be ambiguous as to whether the collection is intended to be inclusive or exclusive.

ISO 639-5 does not define precisely which languages are members of these collections; only the hierarchical classification of collections is defined, using the inclusive definition of these collections. Because of this, RFC 5646 does not recommend the use of subtags for language collections for most applications, although they are still preferred over subtags whose meaning is even less specific, such as "Multiple languages" and "Undetermined".

In contrast, the classification of individual languages within their macrolanguage is standardized, in both ISO 639-3 and the Language Subtag Registry.

ISO 15924, ISO/IEC 10646 and Unicode

Script subtags were first added to the Language Subtag Registry when RFC 4646 was published, from the list of codes defined in ISO 15924. They are encoded in the language tag after primary and extended language subtags, but before other types of subtag, including region and variant subtags.

Some primary language subtags are defined with a property named "Suppress-Script", which indicates the cases where a single script can usually be assumed by default for the language, even if it can be written with another script. When this is the case, it is preferable to omit the script subtag, to improve the likelihood of successful matching. A different script subtag can still be appended to make the distinction when necessary. For example, ''yi'' is preferred over ''yi-Hebr'' in most contexts, because the Hebrew script subtag is assumed for the Yiddish language.

As another example, ''zh-Hans-SG'' may be considered equivalent to ''zh-Hans'', because the region code is probably not significant; the written form of Chinese used in Singapore uses the same simplified Chinese characters as in other countries where Chinese is written. However, the script subtag is maintained because it is significant.

ISO 15924 includes some codes for script variants (for example, ''Hans'' and ''Hant'' for simplified and traditional forms of Chinese characters) that are unified within Unicode and ISO/IEC 10646. These script variants are most often encoded for bibliographic purposes, but are not always significant from a linguistic point of view (for example, the ''Latf'' and ''Latg'' script codes for the Fraktur and Gaelic variants of the Latin script, which are mostly encoded with regular Latin letters in Unicode and ISO/IEC 10646). They may occasionally be useful in language tags to expose orthographic or semantic differences, with different analysis of letters, diacritics, and digraphs/trigraphs as default grapheme clusters, or differences in letter casing rules.
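The effect of the "Suppress-Script" property can be illustrated with a short Python sketch. The names `SUPPRESS_SCRIPT` and `drop_default_script` are hypothetical, the table holds only two entries corresponding to the ''yi'' and ''es'' examples in this article, and the function assumes the script subtag directly follows the primary language (no extended language subtag in between).

```python
# Illustrative excerpt of Suppress-Script values; the full set is recorded
# per primary language subtag in the Language Subtag Registry.
SUPPRESS_SCRIPT = {
    "yi": "Hebr",
    "es": "Latn",
}


def drop_default_script(tag):
    """Omit the script subtag when it is the default script for the language."""
    subtags = tag.split("-")
    if len(subtags) >= 2:
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if default is not None and subtags[1].lower() == default.lower():
            del subtags[1]
    return "-".join(subtags)


print(drop_default_script("yi-Hebr"))
print(drop_default_script("yi-Latn"))
```

A script subtag that differs from the default, as in ''yi-Latn'', is left in place because it carries distinguishing information.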

ISO 3166-1 and UN M.49

Two-letter region subtags are based on codes assigned, or "exceptionally reserved", in ISO 3166-1. If the ISO 3166 Maintenance Agency were to reassign a code that had previously been assigned to a different country, the existing BCP 47 subtag corresponding to that code would retain its meaning, and a new region subtag based on UN M.49 would be registered for the new country. UN M.49 is also the source for numeric region subtags for geographical regions, such as ''005'' for South America. The UN M.49 codes for economic regions are not allowed.

Region subtags are used to specify the variety of a language "as used in" a particular region. They are appropriate when the variety is regional in nature, and can be captured adequately by identifying the countries involved, as when distinguishing British English (''en-GB'') from American English (''en-US''). When the difference is one of script or script variety, as for simplified versus traditional Chinese characters, it should be expressed with a script subtag instead of a region subtag; in this example, ''zh-Hans'' and ''zh-Hant'' should be used instead of ''zh-CN/zh-SG/zh-MY'' and ''zh-TW/zh-HK/zh-MO''.

When a distinct language subtag exists for a language that could be considered a regional variety, it is often preferable to use the more specific subtag instead of a language-region combination. For example, ''ar-DZ'' (Arabic as used in Algeria) may be better expressed as ''arq'' for Algerian Spoken Arabic.
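Two points from this section lend themselves to a small Python sketch: telling the two kinds of region subtag apart by shape, and rewriting the region-based Chinese tags listed above into their script-based equivalents. The names `region_kind`, `REGION_TO_SCRIPT` and `chinese_script_tag` are hypothetical, and the mapping covers only the six regions named in the text.

```python
def region_kind(subtag):
    """Classify a region subtag by its shape (not by registry lookup)."""
    if not subtag.isascii():
        return None
    if len(subtag) == 2 and subtag.isalpha():
        return "ISO 3166-1 alpha-2"
    if len(subtag) == 3 and subtag.isdigit():
        return "UN M.49"
    return None


# Region-to-script mapping for the Chinese examples given in the text.
REGION_TO_SCRIPT = {
    "CN": "Hans", "SG": "Hans", "MY": "Hans",
    "TW": "Hant", "HK": "Hant", "MO": "Hant",
}


def chinese_script_tag(tag):
    """Rewrite zh-<region> as zh-Hans or zh-Hant; leave other tags unchanged."""
    subtags = tag.split("-")
    if len(subtags) == 2 and subtags[0].lower() == "zh":
        script = REGION_TO_SCRIPT.get(subtags[1].upper())
        if script is not None:
            return f"zh-{script}"
    return tag


print(region_kind("005"))
print(chinese_script_tag("zh-TW"))
```

The rewrite discards the region on purpose, since the distinction being expressed is one of script; an application that also needs the region could append it after the script subtag.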

Adherence to core standards

Disagreements about language identification may extend to BCP 47 and to the core standards that inform it. For example, some speakers of Punjabi believe that the ISO 639-3 distinction between ''pan'' "Panjabi" and ''pnb'' "Western Panjabi" is spurious (i.e. they feel the two are the same language); that sub-varieties of the Arabic script should be encoded separately in ISO 15924 (as, for example, the Fraktur and Gaelic styles of the Latin script are); and that BCP 47 should reflect these views or overrule the core standards with regard to them.

BCP 47 delegates this type of judgment to the core standards, and does not attempt to overrule or supersede them. Variant subtags and (theoretically) primary language subtags may be registered individually, but not in a way that contradicts the core standards.

Extensions

''Extension subtags'' (not to be confused with ''extended language subtags'') allow additional information to be attached to a language tag that does not necessarily serve to identify a language. One use for extensions is to encode locale information, such as calendar and currency. Extension subtags are composed of multiple hyphen-separated character strings, starting with a single character (other than ''x''), called a ''singleton''. Each extension is described in its own IETF RFC, which identifies a Registration Authority to manage the data for that extension. IANA is responsible for allocating singletons. Two extensions have been assigned as of January 2014.
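Because every extension starts with a singleton, a tag can be split into its language-identifying part and its extensions by scanning for one-character subtags. The following Python sketch shows one way to do this; `split_extensions` is a hypothetical helper, and it treats the first subtag as part of the base tag so that a leading ''x'' or ''i'' is not mistaken for a singleton.

```python
def split_extensions(tag):
    """Return (base tag, {singleton: [subtags]}, private-use subtags)."""
    subtags = tag.split("-")
    base, extensions, private = [], {}, []
    current = None  # subtag list of the extension currently being read
    for i, subtag in enumerate(subtags):
        lowered = subtag.lower()
        if i > 0 and lowered == "x":
            # Everything after x belongs to the private-use sequence.
            private = [s.lower() for s in subtags[i + 1:]]
            break
        if i > 0 and len(subtag) == 1:
            current = extensions.setdefault(lowered, [])
        elif current is not None:
            current.append(lowered)
        else:
            base.append(subtag)
    return "-".join(base), extensions, private


print(split_extensions("he-IL-u-ca-hebrew-tz-jeruslm"))
print(split_extensions("en-a-bbb-x-a-ccc"))
```

Interpreting the subtags inside each extension is left to the RFC that defines it, which is why the sketch returns them as plain lists.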

Extension T (Transformed Content)

Extension T allows a language tag to include information on how the tagged data was transliterated, transcribed, or otherwise transformed. For example, the tag ''en-t-ja'' could be used for content in English that was translated from the original Japanese. Additional substrings could indicate that the translation was done mechanically, or in accordance with a published standard. Extension T is described in the informational RFC 6497, published in February 2012. The Registration Authority is the Unicode Consortium.

Extension U (Unicode Locale)

Extension U allows a wide variety of locale attributes found in the Common Locale Data Repository (CLDR) to be embedded in language tags. These attributes include country subdivisions, calendar and time zone data, collation order, currency, number system, and keyboard identification. Some examples include:

* ''gsw-u-sd-chzh'' represents Swiss German as used in the Canton of Zurich.
* ''ar-u-nu-latn'' represents Arabic-language content using Basic Latin digits (0 through 9) instead of Arabic-script digits (٠ through ٩).
* ''he-IL-u-ca-hebrew-tz-jeruslm'' represents Hebrew as spoken in Israel, using the traditional Hebrew calendar, and in the "Asia/Jerusalem" time zone as identified in the tz database.

Extension U is described in the informational RFC 6067, published in December 2010. The Registration Authority is the Unicode Consortium.
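The examples above follow a key/value pattern: within the ''u'' extension, a two-character key (such as ''ca'', ''tz'', ''nu'' or ''sd'') is followed by one or more longer subtags that form its value. A minimal Python sketch of reading these pairs follows; `unicode_locale_keywords` is a hypothetical name, any attribute subtags appearing before the first key are ignored, and the sketch assumes the tag contains no private-use sequence with its own ''u'' subtag.

```python
def unicode_locale_keywords(tag):
    """Return the key/value pairs of the U extension as a dict."""
    subtags = tag.lower().split("-")
    try:
        # Start searching at index 1 so the primary language is skipped.
        start = subtags.index("u", 1) + 1
    except ValueError:
        return {}  # no U extension present
    keywords = {}
    key = None
    for subtag in subtags[start:]:
        if len(subtag) == 1:
            break  # another singleton: the U extension has ended
        if len(subtag) == 2:
            key = subtag
            keywords[key] = []
        elif key is not None:
            keywords[key].append(subtag)
    return {k: "-".join(v) for k, v in keywords.items()}


print(unicode_locale_keywords("he-IL-u-ca-hebrew-tz-jeruslm"))
print(unicode_locale_keywords("ar-u-nu-latn"))
```

The values are returned exactly as they appear in the tag; mapping ''jeruslm'' to "Asia/Jerusalem", for instance, requires the CLDR data and is outside the scope of this sketch.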

External links

* BCP 47 Language Tags – current specification; contains two RFCs published separately at different dates, but concatenated in a single document:
** RFC 4647, Matching of Language Tags
** RFC 5646, Tags for Identifying Languages (also referencing the related informational RFC 5645, which complements the previous informational RFC 4645, as well as other individual registration forms published separately by others for each language added or modified in the Registry between these BCP 47 revisions)
* Language Subtag Registry – maintained by IANA
* Language Subtag Registry Search – find subtags and view entries in the Registry
* Language tags in HTML and XML – from the W3C
* Language Tags – from the IETF Language Tag Registry Update working group