Table extraction is the process of recognizing and separating a

table Table may refer to: * Table (furniture), a piece of furniture with a flat surface and one or more legs * Table (landform), a flat area of land * Table (information), a data arrangement with rows and columns * Table (database), how the table data ...

from a large document, possibly also recognizing individual rows, columns or elements. It may be regarded as a special form of

information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...

. Table extractions from webpages can take advantage of the special

HTML element An HTML element is a type of HTML (HyperText Markup Language) document component, one of several types of HTML nodes (there are also text nodes, comment nodes and others). The first used version of HTML was written by Tim Berners-Lee in 1993 ...

s that exist for tables, e.g., the "table" tag, and programming libraries may implement table extraction from webpages. The

Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (pro ...

pandas Pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS) is a controversial hypothetical diagnosis for a subset of children with rapid onset of obsessive-compulsive disorder (OCD) or tic disorders. Sy ...

software library can extract tables from HTML webpages via its read_html() function. More challenging is table extraction from

PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...

s or

scanned images An image scanner—often abbreviated to just scanner—is a device that optically scans images, printed text, handwriting or an object and converts it to a digital image. Commonly used in offices are variations of the desktop ''flatbed scanner'' w ...

, where there usually is no table-specific machine readable markup. Systems that extract data from tables in scientific PDFs have been described.

Wikipedia Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system. Wikipedia is the largest and most-read refer ...

presents some of its information in tables, and, e.g., 3.5 million tables can be extracted from the

English Wikipedia The English Wikipedia is, along with the Simple English Wikipedia, one of two English-language editions of Wikipedia, an online encyclopedia. It was founded on January 15, 2001, as Wikipedia's first edition, and, as of , has the most arti ...

. Some of the tables have a specific format, e.g., the so-called

infobox An infobox is a digital or physical Table (information), table used to collect and present a subset of information about its subject, such as a document. It is a structured document containing a set of attribute–value pairs, and in Wikipedia r ...

es. Large-scale table extraction of Wikipedia infoboxes forms one of the sources for

DBpedia DBpedia (from "DB" for "database") is a project aiming to extract structured content from the information created in the Wikipedia project. This structured information is made available on the World Wide Web. DBpedia allows users to semantica ...

. Commercial web services for table extraction exist, e.g.,

Amazon Amazon most often refers to: * Amazons, a tribe of female warriors in Greek mythology * Amazon rainforest, a rainforest covering most of the Amazon basin * Amazon River, in South America * Amazon (company), an American multinational technology c ...

Textract,

Google's Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...

'' Document AI'', IBM Watson Discovery, and

Microsoft Microsoft Corporation is an American multinational technology corporation producing computer software, consumer electronics, personal computers, and related services headquartered at the Microsoft Redmond campus located in Redmond, Washing ...

Form Recognizer. Open source tools also exist, e.g., PDFFigures 2.0 that has been used in

Semantic Scholar Semantic Scholar is an artificial intelligence–powered research tool for scientific literature developed at the Allen Institute for AI and publicly released in November 2015. It uses advances in natural language processing to provide summaries ...

. In a comparison published in 2017, the researchers found the proprietary program ABBYY FineReader to yield the best PDF table extraction performance among six different tools evaluated.

References

{{Scholia, topic Natural language processing