Apache Tika is a content detection and
analysis
Analysis ( : analyses) is the process of breaking a complex topic or substance into smaller parts in order to gain a better understanding of it. The technique has been applied in the study of mathematics and logic since before Aristotle (3 ...
framework, written in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
, stewarded at the
Apache Software Foundation. It detects and extracts metadata and text from over a thousand different
file type
A file format is a standard way that information is encoded for storage in a computer file. It specifies how bits are used to encode information in a digital storage medium. File formats may be either proprietary or free.
Some file format ...
s, and as well as providing a
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
library, has server and command-line editions suitable for use from other programming languages.
History
The project originated as part of the
Apache Nutch
Apache Nutch is a highly extensible and scalable open source web crawler software project.
Features
Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular archite ...
codebase, to provide content identification and extraction when
crawling. In 2007, it was separated out, to make it more extensible and usable by
content management systems
A content management system (CMS) is computer software used to manage the creation and modification of digital content (content management).''Managing Enterprise Content: A Unified Content Strategy''. Ann Rockley, Pamela Kostur, Steve Manning. New ...
, other
Web crawlers
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spid ...
, and information retrieval systems. The standalone Tika was founded by Jérôme Charron,
Chris Mattmann and Jukka Zitting. In 2011 Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.
Features
Tika provides capabilities for identification of more than 1400 file types from the
Internet Assigned Numbers Authority
The Internet Assigned Numbers Authority (IANA) is a standards organization that oversees global IP address allocation, autonomous system number allocation, root zone management in the Domain Name System (DNS), media types, and other Inte ...
taxonomy of
MIME
Multipurpose Internet Mail Extensions (MIME) is an Internet standard that extends the format of email messages to support text in character sets other than ASCII, as well as attachments of audio, video, images, and application programs. Message ...
types. For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.
It can also get text from images by using the
OCR software
Tesseract
In geometry, a tesseract is the four-dimensional analogue of the cube; the tesseract is to the cube as the cube is to the square. Just as the surface of the cube consists of six square faces, the hypersurface of the tesseract consists of e ...
.
While Tika is written in
Java
Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's mos ...
, it is widely used from other languages. The
RESTful server and
CLI Tool permit non-Java programs to access the Tika functionality.
Notable uses
Tika is used by financial institutions including the
Fair Isaac Corporation (FICO), Goldman Sachs,
NASA
The National Aeronautics and Space Administration (NASA ) is an independent agencies of the United States government, independent agency of the US federal government responsible for the civil List of government space agencies, space program ...
and academic researchers and by major content management systems including
Drupal, and
Alfresco (software) to analyze large amounts of content, and to make it available in common formats using information retrieval techniques.
On April 4, 2016
Forbes
''Forbes'' () is an American business magazine owned by Integrated Whale Media Investments and the Forbes family. Published eight times a year, it features articles on finance, industry, investing, and marketing topics. ''Forbes'' also r ...
published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that expose an international scandal involving world leaders storing money in offshore
shell corporation
A shell corporation is a company or corporation that exists only on paper and has no office and no employees, but may have a bank account or may hold passive investments or be the registered owner of assets, such as intellectual property, or s ...
s. The leaked documents and the project to analyze them is referred to as the
Panama Papers.
See also
*
Magic number
References
{{Apache Software Foundation
Tika
Java platform
Free software programmed in Java (programming language)
Java (programming language) libraries
Software using the Apache license