Manually Annotated Sub-Corpus (MASC)
Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word (and growing) corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions. All of MASC includes manually validated annotations for logical structure (headings, sections, paragraphs, etc.), sentence boundaries, three different tokenizations with associated part of speech tags, shallow parse (noun and verb chunks), named entities (person, location, organization, date and time), and Penn Treebank syntax. The MASC project has produced or validated additional annotations for portions of the sub-corpus, including full-text annotation for FrameNet frame elements and a 100K+ sentence corpus with WordNet 3.1 sense tags, of which one-tenth are also annotated for FrameNet frame elements ...


American National Corpus
The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech, lemma, shallow parse, and named entities. The ANC is available from the Linguistic Data Consortium. A fifteen million word subset of the corpus, called the Open American National Corpus (OANC), is freely available with no restrictions on its use from the ANC website. The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. Using the freely provided transduction tool ANC2Go, the corpus and user-chosen annotations can be obtained in multiple formats, including CoNLL IOB format ...
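The CoNLL IOB output mentioned above is a simple column format: one token per line with its tags, and blank lines between sentences. The sketch below reads such a file and collapses B-/I- chunk tags into spans; the three-column layout (token, part of speech, chunk tag) is an assumption for illustration, since the exact columns ANC2Go emits depend on the annotations chosen.

```python
# Minimal sketch: reading a CoNLL-style IOB chunk file.
# Assumes whitespace-separated columns of token, POS tag, IOB chunk tag,
# with blank lines separating sentences; real ANC2Go output may differ.

def read_conll_iob(path):
    """Yield one sentence at a time as a list of (token, pos, chunk) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:                      # blank line ends a sentence
                if sentence:
                    yield sentence
                    sentence = []
                continue
            token, pos, chunk = line.split()[:3]   # assumes well-formed rows
            sentence.append((token, pos, chunk))
    if sentence:                              # flush a trailing sentence
        yield sentence

def iob_to_chunks(sentence):
    """Collapse B-/I- tags into (chunk_type, [tokens]) spans."""
    chunks, current = [], None
    for token, _pos, tag in sentence:
        if tag.startswith("B-"):
            current = (tag[2:], [token])
            chunks.append(current)
        elif tag.startswith("I-") and current and current[0] == tag[2:]:
            current[1].append(token)
        else:                                 # "O" or an inconsistent I- tag
            current = None
    return chunks
```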


FrameNet
FrameNet is a research and resource development project based at the International Computer Science Institute (ICSI) in Berkeley, California, which has produced an electronic resource based on a theory of meaning called frame semantics. The data that FrameNet has analyzed show that the sentence "John sold a car to Mary" essentially describes the same basic situation (semantic frame) as "Mary bought a car from John", just from a different perspective. A semantic frame is a conceptual structure describing an event, relation, or object along with its participants. The FrameNet lexical database contains over 1,200 semantic frames, 13,000 lexical units (a pairing of a word with a meaning; polysemous words are represented by several lexical units) and 202,000 example sentences. Charles J. Fillmore, who developed the theory of frame semantics that serves as the theoretical basis of FrameNet, founded the project in 1997 and continued to lead the effort until his death in 2014.
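The buy/sell example above can be reproduced with NLTK's FrameNet reader. This is a minimal sketch assuming the data has been fetched with nltk.download('framenet_v17'); the attribute names (lus, frame, FE) follow that reader and may differ in other FrameNet APIs.

```python
from nltk.corpus import framenet as fn

# Lexical units pair a word form with a meaning; 'buy.v' and 'sell.v' evoke
# the closely related Commerce_buy / Commerce_sell frames, which is the point
# of the "John sold a car to Mary" / "Mary bought a car from John" example.
for pattern in (r'^buy\.v$', r'^sell\.v$'):
    for lu in fn.lus(pattern):
        print(lu.name, '->', lu.frame.name)

# Inspect one frame: its definition and its frame elements (Buyer, Seller, Goods, ...).
frame = fn.frame('Commerce_buy')
print(frame.definition[:100], '...')
print(sorted(frame.FE.keys()))
```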


WordNet
WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into synsets with short definitions and usage examples. WordNet can thus be seen as a combination and extension of a dictionary and thesaurus. While it is accessible to human users via a web browser, its primary use is in automatic text analysis and artificial intelligence applications. WordNet was first created in the English language, and the English WordNet database and software tools have been released under a BSD-style license and are freely available for download from the WordNet website.
History and team members: WordNet was first created in English only in the Cognitive Science Laboratory of Princeton University under the direction of psychology professor George Armitage Miller, starting in 1985, and was later directed by Christiane Fellbaum. The project was ...
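A short sketch of the relations described above (synsets, synonyms, hypernyms/hyponyms, meronyms), using NLTK's interface to the Princeton English WordNet and assuming the data has been fetched with nltk.download('wordnet').

```python
from nltk.corpus import wordnet as wn

# Synsets group synonymous word senses and carry a short gloss.
for synset in wn.synsets('car')[:3]:
    print(synset.name(), '-', synset.definition())

# Hypernyms / hyponyms give the "is-a" hierarchy; lemmas list the synonyms.
car = wn.synset('car.n.01')
print([lemma.name() for lemma in car.lemmas()])        # synonyms in the synset
print([s.name() for s in car.hypernyms()])             # more general concepts
print([s.name() for s in car.hyponyms()[:5]])          # more specific concepts

# Meronyms: parts of a car.
print([s.name() for s in car.part_meronyms()[:5]])
```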


PropBank
PropBank is a corpus that is annotated with verbal propositions and their arguments: a "proposition bank". Although "PropBank" refers to a specific corpus produced by Martha Palmer et al., the term "propbank" is also coming to be used as a common noun referring to any corpus that has been annotated with propositions and their arguments. The PropBank project has played a role in recent research in natural language processing, and has been used in semantic role labelling.
Comparison: PropBank differs from FrameNet, the resource to which it is most frequently compared, in several ways. PropBank is a verb-oriented resource, while FrameNet is centered on the more abstract notion of frames, which generalizes descriptions across similar verbs (e.g. "describe" and "characterize") as well as nouns and other words (e.g. "description"). PropBank does not annotate events or states of affairs described using nouns. PropBank commits to annotating all verbs in a corpus, whereas the FrameNet ...
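Not the PropBank distribution itself, but a minimal sketch of what a proposition-bank entry boils down to: a verb roleset plus numbered arguments. The roleset id and argument labels follow PropBank conventions for sell.01 (ARG0 seller, ARG1 thing sold, ARG2 buyer); the example sentence and the data structure are illustrative assumptions.

```python
# Illustrative only: a PropBank-style proposition as a plain data structure.
from dataclasses import dataclass, field

@dataclass
class Proposition:
    roleset: str                                     # e.g. "sell.01": verb lemma + sense number
    arguments: dict = field(default_factory=dict)    # argument label -> text span

# "John sold a car to Mary" annotated verb-centrically, as a propbank would.
prop = Proposition(
    roleset="sell.01",
    arguments={
        "ARG0": "John",    # seller (agent)
        "ARG1": "a car",   # thing sold
        "ARG2": "Mary",    # buyer
    },
)

print(prop.roleset, prop.arguments)
```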


UIMA
UIMA, short for Unstructured Information Management Architecture, is an OASIS standard for content analytics, originally developed at IBM. It provides a component software architecture for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and integration with search technologies.
Structure: The UIMA architecture can be thought of in four dimensions:
1. It specifies component interfaces in an analytics pipeline.
2. It describes a set of design patterns.
3. It suggests two data representations: an in-memory representation of annotations for high-performance analytics and an XML representation of annotations for integration with remote web services.
4. It suggests development roles allowing tools to be used by users with diverse skills.
Implementations and uses: Apache UIMA, a reference implementation of UIMA, is maintained by the Apache Software Foundation. UIMA is used in a number of software projects ...
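The "in-memory representation of annotations" is the part easiest to picture in code. The toy store below is not the UIMA CAS API, just a sketch of the pattern it standardizes: analysis components add typed, stand-off annotations over character offsets of a shared document, and downstream components select them by type. All names here are illustrative.

```python
# Toy stand-off annotation store in the spirit of UIMA's in-memory CAS.
from dataclasses import dataclass

@dataclass
class Annotation:
    type: str        # e.g. "Token", "Sentence", "NamedEntity"
    begin: int       # character offsets into the shared document text
    end: int
    features: dict

class Document:
    def __init__(self, text):
        self.text = text
        self.annotations = []

    def add(self, type_, begin, end, **features):
        self.annotations.append(Annotation(type_, begin, end, features))

    def select(self, type_):
        return [a for a in self.annotations if a.type == type_]

# Two tiny "analysis engines" chained into a pipeline.
def whitespace_tokenizer(doc):
    pos = 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        doc.add("Token", start, start + len(word))
        pos = start + len(word)

def capitalized_entity_tagger(doc):
    for tok in doc.select("Token"):
        text = doc.text[tok.begin:tok.end]
        if text[0].isupper():
            doc.add("NamedEntity", tok.begin, tok.end, guess="Name")

doc = Document("John sold a car to Mary")
for engine in (whitespace_tokenizer, capitalized_entity_tagger):
    engine(doc)
print([(a.type, doc.text[a.begin:a.end]) for a in doc.select("NamedEntity")])
```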


General Architecture For Text Engineering
General Architecture for Text Engineering, or GATE, is a Java suite of tools originally developed at the University of Sheffield beginning in 1995 and now used worldwide by a wide community of scientists, companies, teachers and students for many natural language processing tasks, including information extraction in many languages. As of May 28, 2011, 881 people are on the gate-users mailing list at SourceForge.net, and 111,932 downloads from SourceForge have been recorded since the project moved to SourceForge in 2005. The paper "GATE: A framework and graphical development environment for robust NLP tools and applications" has received over 2,000 citations since publication (according to Google Scholar). Books covering the use of GATE, in addition to the GATE User Guide, include "Building Search Applications: Lucene, LingPipe, and Gate" by Manu Konchady, and "Introduction to Linguistic Annotation and Text Analytics" by Graham Wilcock. The GATE community and research effort have been involved in several ...


Linguistic Data Consortium
The Linguistic Data Consortium is an open consortium of universities, companies and government research laboratories. It creates, collects and distributes speech and text databases, lexicons, and other resources for linguistics research and development purposes. The University of Pennsylvania is the LDC's host institution. The LDC was founded in 1992 with a grant from the US Defense Advanced Research Projects Agency (DARPA), and is partly supported by grant IRI-9528587 from the Information and Intelligent Systems division of the National Science Foundation. The director of the LDC is Mark Liberman and the executive director is Christopher Cieri.
See also:
* Corpus linguistics
* Cross-Linguistic Linked Data (CLLD) – project coordinating over a dozen linguistics databases; hosted by the Max Planck Institute (Germany)
* European Language Resources Association (ELRA) – a Luxembourg- and France-based institute with a mission similar to the LDC's
* Language Grid – a platform for language resources ...




Natural Language Toolkit
The Natural Language Toolkit, or more commonly NLTK, is a suite of libraries and programs for symbolic and statistical natural language processing (NLP) for English, written in the Python programming language. It was developed by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania. NLTK includes graphical demonstrations and sample data. It is accompanied by a book that explains the underlying concepts behind the language processing tasks supported by the toolkit, plus a cookbook. NLTK is intended to support research and teaching in NLP and closely related areas, including empirical linguistics, cognitive science, artificial intelligence, information retrieval, and machine learning. NLTK has been used successfully as a teaching tool, as an individual study tool, and as a platform for prototyping and building research systems. NLTK is used in courses at 32 universities in the US and in 25 countries. NLTK supports ...
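A minimal sketch of the kind of pipeline the toolkit supports: tokenization, part-of-speech tagging and named-entity chunking. It assumes the relevant NLTK data packages (tokenizer models, tagger, chunker) have been installed beforehand via nltk.download().

```python
import nltk

# Requires the standard NLTK data packages for tokenizing, tagging and
# NE chunking to be downloaded first (e.g. via nltk.download()).
sentence = "John sold a car to Mary in Berkeley."
tokens = nltk.word_tokenize(sentence)   # ['John', 'sold', 'a', ...]
tagged = nltk.pos_tag(tokens)           # [('John', 'NNP'), ('sold', 'VBD'), ...]
tree = nltk.ne_chunk(tagged)            # shallow tree with PERSON/GPE chunks

print(tagged)
print(tree)
```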


English Corpora
English usually refers to:
* English language
* English people
English may also refer to:
Peoples, culture, and language
* English, an adjective for something of, from, or related to England
** English national identity, an identity and common culture
** English language in England, a variant of the English language spoken in England
* English languages (other)
* English studies, the study of English language and literature
* English, an Amish term for non-Amish, regardless of ethnicity
Individuals
* English (surname), a list of notable people with the surname English
* People with the given name
** English McConnell (1882–1928), Irish footballer
** English Fisher (1928–2011), American boxing coach
** English Gardner (b. 1992), American track and field sprinter
Places
United States
* English, Indiana, a town
* English, Kentucky, an unincorporated community
* English, Brazoria County, Texas, an unincorporated community
* English ...


Computational Linguistics
Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.
Sub-fields and related areas: Traditionally, computational linguistics emerged as an area of artificial intelligence performed by computer scientists who had specialized in the application of computers to the processing of a natural language. With the formation of the Association for Computational Linguistics (ACL) and the establishment of independent conference series, the field consolidated during the 1970s and 1980s. The Association for Computational Linguistics defines computational linguistics as: ...


Natural Language Processing
Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves. Challenges in natural language processing frequently involve speech recognition, natural-language understanding, and natural-language generation.
History: Natural language processing has its roots in the 1950s. Already in 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence ...