International Components for Unicode (ICU) is an open-source project of mature C/C++ and Java libraries for Unicode support, software internationalization, and software globalization. ICU is widely portable to many operating systems and environments. It gives applications the same results on all platforms and between C, C++, and Java software. The ICU project is a technical committee of the Unicode Consortium and is sponsored, supported, and used by IBM and many other companies. ICU has been included as a standard component of Microsoft Windows since Windows 10 version 1703.
ICU provides the following services: Unicode text handling, full character properties, and character set conversions; Unicode regular expressions; full Unicode sets; character, word, and line boundaries; language-sensitive collation and searching; normalization, upper and lowercase conversion, and script transliterations; comprehensive locale data and resource bundle architecture via the Common Locale Data Repository (CLDR); multiple calendars and time zones; and rule-based formatting and parsing of dates, times, numbers, currencies, and messages. ICU historically provided a complex text layout service for Arabic, Hebrew, Indic, and Thai, but that service was deprecated in version 54 and completely removed in version 58 in favor of HarfBuzz.
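Several of the services listed above (collation, boundary analysis, normalization) have simpler counterparts in the JDK's own java.text package, so they can be tried without installing ICU. The following is a minimal sketch that uses only those JDK classes, not ICU4J itself; the class name TextServicesSketch and its helper methods are hypothetical, chosen for illustration only.

```java
import java.text.BreakIterator;
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; uses JDK classes only, not ICU4J.
class TextServicesSketch {

    // Language-sensitive collation: sort by the locale's rules
    // instead of by raw code point order.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        Collator collator = Collator.getInstance(locale);
        List<String> sorted = new ArrayList<>(words);
        sorted.sort(collator);
        return sorted;
    }

    // Word boundaries: cut the text at each boundary the iterator reports
    // and keep only the segments that start with a letter or digit.
    static List<String> words(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                out.add(segment);
            }
        }
        return out;
    }

    // Normalization: bring a string into Unicode Normalization Form C,
    // e.g. composing "e" + combining acute accent into a single code point.
    static String nfc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(sortForLocale(Arrays.asList("zebra", "Zulu", "apple"), Locale.ENGLISH));
        System.out.println(words("Hello, world!", Locale.ENGLISH));
        System.out.println(nfc("e\u0301").length());
    }
}
```

Sorting with a collator places "Zulu" after "zebra", whereas plain code point order would put the uppercase word first. ICU's own implementations of these services are driven by CLDR data and are generally more complete than the JDK classes used here.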
ICU provides more extensive internationalization facilities than the standard libraries for C and C++. ICU 75, released in April 2024, requires C++17 (up from C++11) or C11 (up from C99), depending on which language is used. ICU has historically used UTF-16, and for Java it still uses only UTF-16; for C/C++, UTF-8 is also supported, including the correct handling of "illegal UTF-8".
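The practical difference between UTF-16 and UTF-8 storage, and one common way of dealing with malformed UTF-8 input, can be illustrated with plain JDK calls. This is only a sketch: it does not use ICU, the class name EncodingSketch and its methods are hypothetical, and the replacement of malformed bytes with U+FFFD shown here is the JDK decoder's behavior, not a statement about ICU's API.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; uses JDK classes only, not ICU.
class EncodingSketch {

    // Number of UTF-16 code units, which is what String.length() reports.
    static int utf16Units(String s) {
        return s.length();
    }

    // Number of bytes the same text occupies when encoded as UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Decode bytes as UTF-8. This String constructor substitutes
    // U+FFFD (REPLACEMENT CHARACTER) for malformed input instead of failing.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', U+00E9 (e with acute), and U+1F600 (an emoji outside the BMP).
        String s = "a\u00e9\uD83D\uDE00";
        System.out.println(utf16Units(s));                    // code units
        System.out.println(utf8Bytes(s));                     // bytes
        System.out.println(s.codePointCount(0, s.length()));  // code points

        // "ok" followed by 0xFF, a byte that can never occur in valid UTF-8.
        byte[] malformed = { 0x6F, 0x6B, (byte) 0xFF };
        System.out.println(decodeLenient(malformed).endsWith("\uFFFD"));
    }
}
```

The same three code points take four UTF-16 code units but seven UTF-8 bytes, which is why code that indexes into strings has to know which encoding form it is working with.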
ICU 73.2 has significant changes for GB18030-2022 compliance support, i.e. for Chinese (the updated Chinese GB18030 Unicode Transformation Format standard is slightly incompatible with the previous version). It has "a modified character conversion table, mapping some GB18030 characters to Unicode characters that were encoded after GB18030-2005" and a number of other changes, such as improved Japanese and Korean short-text line breaking, and in "English, the name “Türkiye” is now used for the country instead of “Turkey” (the alternate spelling is also available in the data)."
ICU 74 "updates to Unicode 15.1, including new characters, emoji, security mechanisms, and corresponding APIs and implementations. ICU 74 and CLDR 44 are major releases, including a new version of Unicode and major locale data improvements." Among the many changes are person name formatting and improved language support, e.g. for Low German. There is also a new spoof checker API that follows Unicode 15.1.0 UTS #39: Unicode Security Mechanisms.
Older version details
ICU 72 updated to Unicode 15 (and ICU 74 to 15.1). "In many formatting patterns, ASCII spaces are replaced with Unicode spaces (e.g., a "thin space")." As of ICU 72, ICU4J requires Java 8, but "Most of the ICU 72 library code should still work with Java 7 / Android API level 21, but we no longer test with Java 7." ICU 71 added, among other things, phrase-based line breaking for Japanese (earlier methods did not work well for short Japanese text, such as in titles and headings) and support for Hindi written in Latin letters (hi_Latn), also referred to as "Hinglish". ICU 70 added, among other things, support for emoji properties of strings, and it can be built and used with C++20 compilers ("ICU operator==() and operator!=() functions now return bool instead of UBool, as an adjustment for incompatible changes in C++20"); as of that version, the minimum Windows version is Windows 7. ICU 67 handles the removal of Great Britain from the EU. ICU 64.2 added support for Unicode 12.1, i.e. the single new symbol for the current Japanese Reiwa era (support for it has also been backported to older ICU versions down to ICU 4.8.2). ICU 58 (with Unicode 9.0 support) is the last version to support older platforms such as Windows XP and Windows Vista. Support for AIX, Solaris and z/OS may also be limited in later versions (i.e. building depends on compiler support).
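The ICU 72 change that replaces ASCII spaces with Unicode spaces in formatting patterns can break code that compares formatted output against literals containing ordinary spaces. One possible workaround, sketched below with plain JDK calls, is to map every Unicode space separator back to an ASCII space before comparing. The class SpaceSketch and the method asciiSpaces are hypothetical names, and the sample string is hand-written for illustration, not actual ICU output.

```java
// Hypothetical helper; uses JDK classes only, not ICU.
class SpaceSketch {

    // Replace every code point in general category Zs (space separator),
    // such as U+2009 THIN SPACE, with an ASCII space.
    static String asciiSpaces(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp ->
            sb.appendCodePoint(
                Character.getType(cp) == Character.SPACE_SEPARATOR ? ' ' : cp));
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hand-written sample with a thin space between the time and "PM".
        String formatted = "3:45\u2009PM";
        System.out.println(formatted.equals("3:45 PM"));              // false
        System.out.println(asciiSpaces(formatted).equals("3:45 PM")); // true
    }
}
```

Normalizing spaces is suitable for comparisons and tests; text shown to users is usually better left as the formatter produced it.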
Origin and development
After Taligent became part of IBM in early 1996, Sun Microsystems decided that the new Java language should have better support for internationalization. Since Taligent had experience with such technologies and was close geographically, its Text and International group was asked to contribute the international classes to the Java Development Kit as part of the JDK 1.1 internationalization APIs. A large portion of this code still exists in the java.text and java.util packages. Further internationalization features were added with each later release of Java.
The Java internationalization classes were then ported to C++ and C as part of a library known as ICU4C ("ICU for C"). The ICU project also provides ICU4J ("ICU for Java"), which adds features not present in the standard Java libraries. ICU4C and ICU4J are very similar, though not identical; for example, ICU4C includes a Regular Expression API, while ICU4J does not. Both frameworks have been enhanced over time to support new facilities and new features of Unicode and the Common Locale Data Repository (CLDR).
ICU was released as an open-source project in 1999 under the name IBM Classes for Unicode. It was later renamed International Components for Unicode. In May 2016, the ICU project joined the Unicode Consortium as the technical committee ICU-TC, and the library sources are now distributed under the Unicode license.
MessageFormat
A part of ICU is the MessageFormat class, a formatting system that allows any number of arguments to control the plural form (plural, selectordinal) or a more general switch-case-style selection (select) for things like grammatical gender. These statements can be nested.
ICU MessageFormat was created by adding the plural and selection system to an identically-named system in
Java SE.
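The Java SE class that ICU MessageFormat builds on can be tried directly. The sketch below uses java.text.MessageFormat with its older "choice" argument type to pick between message variants by number; it does not use ICU, and the class name MessageFormatSketch and method describe are hypothetical. ICU's version expresses the same idea with plural and select arguments, for example a pattern such as {0, plural, one {There is # file} other {There are # files}}, which the Java SE class does not understand.

```java
import java.text.MessageFormat;
import java.util.Locale;

// Hypothetical demo class; uses the Java SE MessageFormat, not ICU's.
class MessageFormatSketch {

    // A "choice" argument selects a sub-message by numeric range:
    // 0 -> "are no files", 1 -> "is one file", above 1 -> "are N files".
    private static final String PATTERN =
        "There {0,choice,0#are no files|1#is one file|1<are {0,number,integer} files}.";

    static String describe(int count) {
        // The locale is fixed so that number formatting is predictable.
        MessageFormat format = new MessageFormat(PATTERN, Locale.ENGLISH);
        return format.format(new Object[] { count });
    }

    public static void main(String[] args) {
        System.out.println(describe(0));
        System.out.println(describe(1));
        System.out.println(describe(3));
    }
}
```

A choice argument only compares numbers against fixed limits, so it cannot express the plural rules of languages with more than two plural categories. That limitation is the kind of gap the plural and selection system added by ICU is meant to close.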
Alternatives
An alternative to using ICU directly from C++ is Boost.Locale, a C++ wrapper for ICU (which also allows other backends). The argument for using it rather than ICU directly is that ICU "is absolutely unfriendly to C++ developers. It ignores popular C++ idioms (the STL, RTTI, exceptions, etc), instead mostly mimicking the Java API."
Another claim, that ICU supports only UTF-16 (and should therefore be avoided), is no longer true, since ICU now also supports UTF-8 for C and C++.
See also
* Apple Advanced Typography
* Apple Type Services for Unicode Imaging
* gettext
* Graphite (smart font technology)
* NetRexx (ICU license)
* OpenType
* Pango
* Uconv
* Uniscribe
External links
* International Components for Unicode transliteration services
* Online ICU editor