Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it has server and command-line editions suitable for use from other programming languages.


History

The project originated as part of the Apache Nutch codebase, to provide content identification and extraction when crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting. In 2011, Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.


Features

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. It can also get text from images by using the OCR software Tesseract. While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI tool permit non-Java programs to access the Tika functionality.
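
As a rough illustration of how a program might reach the RESTful server, the sketch below builds HTTP requests with the JDK's own java.net.http client. It assumes a Tika server is already running at http://localhost:9998 and that it exposes the /tika (text extraction) and /detect/stream (type detection) endpoints; the class and method names are hypothetical and are not part of Tika itself. The same requests could be issued from any language with an HTTP client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; not part of the Tika distribution.
final class TikaServerSketch {
    // Assumption: a Tika server is listening on this address.
    static final URI DEFAULT_SERVER = URI.create("http://localhost:9998");

    /** Builds a PUT request asking the /tika endpoint for plain text. */
    static HttpRequest extractTextRequest(URI server, byte[] document) {
        return HttpRequest.newBuilder(server.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Builds a PUT request asking /detect/stream for the MIME type. */
    static HttpRequest detectTypeRequest(URI server, byte[] document) {
        return HttpRequest.newBuilder(server.resolve("/detect/stream"))
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    /** Sends a request and returns the response body as text. */
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        if (response.statusCode() != 200) {
            throw new IOException("Tika server returned HTTP " + response.statusCode());
        }
        return response.body();
    }

    public static void main(String[] args) {
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = extractTextRequest(DEFAULT_SERVER, sample);
        // Prints the request line only; no network traffic happens here.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

With a server running, passing the result of extractTextRequest or detectTypeRequest to send(...) would return the extracted text or the detected type. Request construction is kept separate from sending so that it can be inspected without a live server.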


Notable uses

Tika is used by financial institutions including the Fair Isaac Corporation (FICO) and Goldman Sachs, by NASA and academic researchers, and by major content management systems including Drupal and Alfresco, to analyze large amounts of content and to make it available in common formats using information retrieval techniques. On April 4, 2016, Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.


See also

* Magic number

