General Architecture For Text Engineering
   HOME

TheInfoList



OR:

General Architecture for Text Engineering or GATE is a
Java Java (; id, Jawa, ; jv, ꦗꦮ; su, ) is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea to the north. With a population of 151.6 million people, Java is the world's List ...
suite of tools originally developed at the
University of Sheffield , mottoeng = To discover the causes of things , established = – University of SheffieldPredecessor institutions: – Sheffield Medical School – Firth College – Sheffield Technical School – University College of Sheffield , type = Pu ...
beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many
natural language processing Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to pro ...
tasks, including
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
in many languages. As of May 28, 2011, 881 people are on the gate-users mailing list at SourceForge.net, and 111,932 downloads from
SourceForge SourceForge is a web service that offers software consumers a centralized online location to control and manage open-source software projects and research business software. It provides source code repository hosting, bug tracking, mirrorin ...
are recorded since the project moved to SourceForge in 2005. The paper "GATE: A framework and graphical development environment for robust NLP tools and applications" has received over 2000 citations since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide, include "Building Search Applications: Lucene, LingPipe, and Gate", by Manu Konchady, and "Introduction to Linguistic Annotation and Text Analytics", by Graham Wilcock. GATE community and research has been involved in several European research projects including: Transitioning Applications to Ontologies, SEKT, NeOn, Media-Campaign, Musing, Service-Finder, LIRICS and KnowledgeWeb.


Features

GATE includes an
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
system called ANNIE (A Nearly-New Information Extraction System) which is a set of modules comprising a
tokenizer In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters (such as in a computer program or web page) into a sequence of ''lexical tokens'' ( strings with an assigned and thus identified ...
, a
gazetteer A gazetteer is a geographical index or directory used in conjunction with a map or atlas.Aurousseau, 61. It typically contains information concerning the geographical makeup, social statistics and physical features of a country, region, or co ...
, a sentence splitter, a
part of speech tagger In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammar, grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular parts of speech, part of speech, b ...
, a named entities transducer and a
coreference In linguistics, coreference, sometimes written co-reference, occurs when two or more expressions refer to the same person or thing; they have the same referent. For example, in ''Bill said Alice would arrive soon, and she did'', the words ''Alice'' ...
tagger. ANNIE can be used as-is to provide basic
information extraction Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most of the cases this activity concer ...
functionality, or provide a starting point for more specific tasks. Languages currently handled in GATE include
English English usually refers to: * English language * English people English may also refer to: Peoples, culture, and language * ''English'', an adjective for something of, from, or related to England ** English national ide ...
,
Chinese Chinese can refer to: * Something related to China * Chinese people, people of Chinese nationality, citizenship, and/or ethnicity **''Zhonghua minzu'', the supra-ethnic concept of the Chinese nation ** List of ethnic groups in China, people of ...
,
Arabic Arabic (, ' ; , ' or ) is a Semitic languages, Semitic language spoken primarily across the Arab world.Semitic languages: an international handbook / edited by Stefan Weninger; in collaboration with Geoffrey Khan, Michael P. Streck, Janet C ...
,
Bulgarian Bulgarian may refer to: * Something of, from, or related to the country of Bulgaria * Bulgarians, a South Slavic ethnic group * Bulgarian language, a Slavic language * Bulgarian alphabet * A citizen of Bulgaria, see Demographics of Bulgaria * Bul ...
, French,
German German(s) may refer to: * Germany (of or related to) ** Germania (historical use) * Germans, citizens of Germany, people of German ancestry, or native speakers of the German language ** For citizens of Germany, see also German nationality law **Ge ...
,
Hindi Hindi (Devanāgarī: or , ), or more precisely Modern Standard Hindi (Devanagari: ), is an Indo-Aryan language spoken chiefly in the Hindi Belt region encompassing parts of northern, central, eastern, and western India. Hindi has been de ...
,
Italian Italian(s) may refer to: * Anything of, from, or related to the people of Italy over the centuries ** Italians, an ethnic group or simply a citizen of the Italian Republic or Italian Kingdom ** Italian language, a Romance language *** Regional Ita ...
, Cebuano,
Romanian Romanian may refer to: *anything of, from, or related to the country and nation of Romania **Romanians, an ethnic group **Romanian language, a Romance language *** Romanian dialects, variants of the Romanian language ** Romanian cuisine, tradition ...
,
Russian Russian(s) refers to anything related to Russia, including: *Russians (, ''russkiye''), an ethnic group of the East Slavic peoples, primarily living in Russia and neighboring countries *Rossiyane (), Russian language term for all citizens and peo ...
,
Danish Danish may refer to: * Something of, from, or related to the country of Denmark People * A national or citizen of Denmark, also called a "Dane," see Demographics of Denmark * Culture of Denmark * Danish people or Danes, people with a Danish a ...
. Plugins are included for
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
with
Weka The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus '' Gallirallus''. Four subspecies are recogni ...
, RASP, MAXENT, SVM Light, as well as a
LIBSVM LIBSVM and LIBLINEAR are two popular open source machine learning libraries, both developed at the National Taiwan University and both written in C++ though with a C API. LIBSVM implements the Sequential minimal optimization (SMO) algorithm ...
integration and an in-house
perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...
implementation, for managing
ontologies In computer science and information science, an ontology encompasses a representation, formal naming, and definition of the categories, properties, and relations between the concepts, data, and entities that substantiate one, many, or all domains ...
like
WordNet WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into '' synsets'' with short definition ...
, for querying
search engines A search engine is a software system designed to carry out web searches. They search the World Wide Web in a systematic way for particular information specified in a textual web search query. The search results are generally presented in a ...
like
Google Google LLC () is an American multinational technology company focusing on search engine technology, online advertising, cloud computing, computer software, quantum computing, e-commerce, artificial intelligence, and consumer electronics. ...
or
Yahoo Yahoo! (, styled yahoo''!'' in its logo) is an American web services provider. It is headquartered in Sunnyvale, California and operated by the namesake company Yahoo! Inc. (2017–present), Yahoo Inc., which is 90% owned by investment funds ma ...
, for
part of speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definit ...
with
Brill Brill may refer to: Places * Brielle (sometimes "Den Briel"), a town in the western Netherlands * Brill, Buckinghamshire, a village in England * Brill, Cornwall, a small village to the west of Constantine, Cornwall, UK * Brill, Wisconsin, an un ...
or TreeTagger, and many more. Many external plugins are also available, for handling e.g. tweets. GATE accepts input in various formats, such as TXT,
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
,
XML Extensible Markup Language (XML) is a markup language and file format for storing, transmitting, and reconstructing arbitrary data. It defines a set of rules for encoding documents in a format that is both human-readable and machine-readable ...
,
Doc DOC, Doc, doc or DoC may refer to: In film and television * ''Doc'' (2001 TV series), a 2001–2004 PAX series * ''Doc'' (1975 TV series), a 1975–1976 CBS sitcom * "D.O.C." (''Lost''), a television episode * ''Doc'' (film), a 1971 Wester ...
,
PDF Portable Document Format (PDF), standardized as ISO 32000, is a file format developed by Adobe in 1992 to present documents, including text formatting and images, in a manner independent of application software, hardware, and operating systems. ...
documents, and Java Serial,
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
,
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as ...
,
Oracle An oracle is a person or agency considered to provide wise and insightful counsel or prophetic predictions, most notably including precognition of the future, inspired by deities. As such, it is a form of divination. Description The word '' ...
Databases with help of
RDBMS A relational database is a (most commonly digital) database based on the relational model of data, as proposed by E. F. Codd in 1970. A system used to maintain relational databases is a relational database management system (RDBMS). Many relation ...
storage over
JDBC Java Database Connectivity (JDBC) is an application programming interface (API) for the programming language Java, which defines how a client may access a database. It is a Java-based data access technology used for Java database connectivity. I ...
.
JAPE Jape is a synonym for a practical joke. Jape or JAPE may also refer to: * Jape (band), an Irish electronic/rock band * JAPE (linguistics), a transformation language widely used in natural language processing * JAPE, an automated pun generator * J ...
transducers are used within GATE to manipulate annotations on text. Documentation is provided in the GATE User Guide. A tutorial has also been written by Press Association Images.


GATE Developer

The screenshot shows the document viewer used to display a document and its annotations. In pink are hyperlink annotations from an
HTML The HyperText Markup Language or HTML is the standard markup language for documents designed to be displayed in a web browser. It can be assisted by technologies such as Cascading Style Sheets (CSS) and scripting languages such as JavaScri ...
file. The right list is the annotation sets list, and the bottom table is the annotation list. In the center is the annotation editor window.


GATE Mímir

GATE generates vast quantities of information including; natural language text, semantic annotations, and ontological information. Sometimes the data itself is the end product of an application but often the information would be more useful if it could be efficiently searched. GATE Mimir provides support for indexing and searching the linguistic and semantic information generated by such applications and allows for querying the information using arbitrary combinations of text, structural information, and
SPARQL SPARQL (pronounced "sparkle" , a recursive acronym for SPARQL Protocol and RDF Query Language) is an RDF query language—that is, a semantic query language for databases—able to retrieve and manipulate data stored in Resource Description F ...
.


See also

*
Unstructured Information Management Architecture UIMA ( ), short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deplo ...
(UIMA) *
OpenNLP The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text. It supports the most common NLP tasks, such as language detection, tokenization, sentence segmentation, part-of-speech tagging, named en ...
*
Pheme In Greek mythology, Pheme ( ; Greek: , ''Phēmē''; Roman equivalent: Fama), also known as Ossa in Homeric sources, was the personification of fame and renown, her favour being notability, her wrath being scandalous rumours. She was a daughte ...
, a major EU project managed by the GATE group on early detection of false information in social media


References


External links

* {{DEFAULTSORT:General Architecture For Text Engineering Data mining and machine learning software Free computer libraries Free science software Free software programmed in Java (programming language) Free integrated development environments Knowledge representation Natural language processing toolkits Ontology editors