BasisTech is a software company specializing in applying artificial intelligence techniques to understanding documents and unstructured data written in different languages. It has headquarters in Somerville, Massachusetts, and offices in San Francisco, Washington, D.C., London, Tel Aviv, and Tokyo. Its legal name is Basis Technology Corp. The company was founded in 1995 by graduates of the Massachusetts Institute of Technology to use artificial intelligence techniques for natural language processing to help computer systems understand written human language.

Its software focuses on analyzing freeform text so that applications can better understand the meaning of the words. For example, the software can identify tokens, parts of speech, and lemmas. The tools can also identify different forms of names and phrases. A name such as Albert P. Jones can appear in many different ways: some texts will call him "Al Jones", others "Mr. Jones", and others "Albert Paul Jons". The software also performs entity extraction, that is, finding words in text that refer to people, places, and organizations, for uses such as due diligence, intelligence, and metadata tagging.

The company is best known for its Rosette product, which uses natural language processing techniques to improve information retrieval, text mining, search engines, and other applications. The tool is used to enable search engines to search in multiple languages and to match identities and dates. BasisTech software is also used by forensic analysts to search through files for words, tokens, phrases, or numbers that may be important to investigators. The company also provides software (Cyber Triage) that helps organizations respond to cyberattacks.
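The name-variation problem described above (Albert P. Jones appearing as "Al Jones", "Mr. Jones", or "Albert Paul Jons") can be illustrated with a minimal sketch. This is not BasisTech's method or the Rosette API. It is a toy matcher built on Python's standard `difflib`, and the honorific list, the nickname table, the 0.8 threshold, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical lookup tables, far smaller than any real system would use.
HONORIFICS = {"mr", "mrs", "ms", "dr"}
NICKNAMES = {"al": "albert", "bob": "robert", "bill": "william"}


def normalize(name):
    """Lowercase the name, strip punctuation and honorifics, expand nicknames."""
    tokens = [t.strip(".,").lower() for t in name.split()]
    tokens = [t for t in tokens if t and t not in HONORIFICS]
    return [NICKNAMES.get(t, t) for t in tokens]


def similar(a, b, threshold=0.8):
    """Fuzzy equality that tolerates small spelling differences."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def given_names_compatible(a, b):
    """Compare given names pairwise; a single letter is treated as an initial."""
    for x, y in zip(a, b):
        if len(x) == 1 or len(y) == 1:
            if x[0] != y[0]:
                return False
        elif not similar(x, y):
            return False
    return True


def may_be_same_person(name_a, name_b):
    """Return True when two name strings plausibly refer to the same person."""
    a, b = normalize(name_a), normalize(name_b)
    if not a or not b:
        return False
    # Assumption: the surname is the last token, which holds for the
    # examples here but not for names in general.
    if not similar(a[-1], b[-1]):
        return False
    return given_names_compatible(a[:-1], b[:-1])
```

With these tables, `may_be_same_person("Albert P. Jones", v)` is true for each of the three variants quoted above and false for an unrelated name such as "Alice Smith". A production matcher would also have to handle other scripts, name orders, and transliteration, none of which this sketch attempts.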

Rosette

Rosette comes as a cloud (public or on-premise) deployment or Java SDK. Rosette provides a variety of natural language processing tools for unstructured text: language identification, base linguistics, entity extraction, name matching, name translation, sentiment analysis, semantic similarity, relationship extraction, topic extraction, categorization, and Arabic chat translation. It can be integrated into applications to enhance financial compliance onboarding, communication surveillance compliance, social media monitoring, cyber threat intelligence, and customer feedback analysis.

The Rosette Linguistics Platform is composed of these modules:

* Rosette Language Identifier looks at the structural and statistical signature of the file to identify the language. The pre-configured software can recognize 55 different languages with 45 different encodings.
* Rosette Base Linguistics identifies the lemma or word stem after finding the tokens. Search is often faster and more accurate when words are grouped by their stem.
* Rosette Entity Extractor analyzes raw text and identifies the probable role that words and phrases play in the document, a key step that makes it possible for algorithms to distinguish between the various meanings that many words can have. Splitting the raw text into groups of words according to their role and then classifying their contribution to meaning is often called entity analysis. The Basis hybrid approach mixes statistical modeling with rules, regular expressions, and gazetteers (lists of special words that can be tuned to the language and text to be analyzed). The tool is designed to work directly with varied alphabets and multiple languages, an advantage because foreign words are often transliterated in multiple ways. It is believed to be the first commercially available tool for analyzing Arabic text.
* Rosette Name Translator transliterates names written in non-Latin alphabets, such as Arabic, into a consistent Latin form.
* Rosette Name Indexer enables simple search across name variations, either by plugging into open-source search engines or as a standalone service.
* Rosette Core Library for Unicode smooths the use of Unicode text.
* Rosette Chat Translator for Arabic converts words from the Arabic chat alphabet to Arabic.

Rosette is used both by United States government offices to support translation and by major Internet infrastructure firms such as search engines.
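The rule-based half of the hybrid approach described for the entity extractor (regular expressions plus gazetteers) can be sketched as follows. This is an illustration only, not Rosette's implementation or API: the statistical model is left out entirely, and the gazetteer contents, the patterns, the labels, and the function name `extract_entities` are all assumptions.

```python
import re

# Hypothetical gazetteers: tiny hand-made word lists for illustration.
GAZETTEERS = {
    "LOCATION": ["Somerville", "San Francisco", "Tel Aviv", "Tokyo"],
    "ORGANIZATION": ["Basis Technology Corp.", "BasisTech"],
}

# Hypothetical rules: a title followed by a capitalized word suggests a
# person, and a YYYY-MM-DD string suggests a date.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}


def extract_entities(text):
    """Return non-overlapping (start, end, label, surface) tuples in text order."""
    found = []
    # Gazetteer lookup: exact matches of listed names.
    for label, names in GAZETTEERS.items():
        for name in names:
            for m in re.finditer(re.escape(name), text):
                found.append((m.start(), m.end(), label, m.group()))
    # Rule lookup: regular expressions for entities that cannot be listed.
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            found.append((m.start(), m.end(), label, m.group()))
    # Sort by position, longest match first, then drop overlapping candidates.
    found.sort(key=lambda e: (e[0], -(e[1] - e[0])))
    result, last_end = [], 0
    for entity in found:
        if entity[0] >= last_end:
            result.append(entity)
            last_end = entity[1]
    return result
```

Calling `extract_entities("Dr. Jones joined BasisTech in Somerville on 2019-06-01.")` yields a person, an organization, a location, and a date, in that order. In a hybrid system, a statistical model would typically supply candidates that neither a list nor a pattern can anticipate, which is the part this sketch does not cover.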

Digital forensics

BasisTech develops open-source digital forensics tools, ''The Sleuth Kit'' and ''Autopsy'', to help identify and extract clues from data storage devices like hard disks or flash cards, as well as devices such as smartphones and iPods. The open-source licensing model allows them to be used as the foundation for larger projects, such as a Hadoop-based tool for massively parallel forensic analysis of very large data collections. The digital forensics tool set is used to analyze file systems, new media types, new file types, and file system metadata. The tools can search for particular patterns in the files, allowing them to target significant files or usage profiles. They can, for instance, look for common files using hash functions and deconstruct the data structures of important operating system log files. The tools are designed to be customizable through an open plugin architecture. Basis Technology helps manage a large and diverse community of developers who use the tools in investigations.
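The idea of looking for common files using hash functions can be sketched in a few lines. This is not code from The Sleuth Kit or Autopsy and does not use their interfaces. It assumes SHA-256 as the hash and a plain Python set as the database of known hashes, and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks to bound memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def split_known_files(root, known_hashes):
    """Partition the files under root into those whose hash is known and the rest."""
    known, unknown = [], []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            bucket = known if file_digest(path) in known_hashes else unknown
            bucket.append(path)
    return known, unknown
```

Files whose hashes appear in the known set (for example, unmodified operating system files) can then be set aside so that an investigator reviews only the remainder. The same lookup can be run in reverse to flag files that match a list of known-bad hashes.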

KonaSearch

In June 2019, BasisTech acquired KonaSearch, a startup that specializes in search for Salesforce.com and other office database repositories. Its search technology can automate the search step of business workflows.

External links

Official website

Rosette website

Cyber Triage website

Autopsy digital forensics website

KonaSearch website