Trigram Search
   HOME

TheInfoList



OR:

Trigram search is a method of
searching Searching or search may refer to: Computing technology * Search algorithm, including keyword search ** :Search algorithms * Search and optimization for problem solving in artificial intelligence * Search engine technology, software for findi ...
for text when the exact
syntax In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituency) ...
or spelling of the target object is not precisely known or when queries may be
regular expressions A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
. It finds objects which match the maximum number of three consecutive character strings (i.e.
trigrams Trigrams are a special case of the ''n''-gram, where ''n'' is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes. Frequency Context is ...
) in the entered search terms, which are generally near matches. Two strings with many shared trigrams can be expected to be very similar. Trigrams also allow for efficiently creating
indexes Index (or its plural form indices) may refer to: Arts, entertainment, and media Fictional entities * Index (''A Certain Magical Index''), a character in the light novel series ''A Certain Magical Index'' * The Index, an item on a Halo megastru ...
for searches that are regular expressions or match the text inexactly. Indexes can significantly accelerate searches. A threshold for number of trigram matches can be specified as a cutoff point, after which a result is no longer considered a match. Using trigrams for accelerating searches is a technique used in some systems for
code In communications and information processing, code is a system of rules to convert information—such as a letter, word, sound, image, or gesture—into another form, sometimes shortened or secret, for communication through a communication ...
searching, in situations in which queries that are regular expressions may be useful, in search engines such as
Elasticsearch Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-l ...
, as well as in databases such as
PostgreSQL PostgreSQL (, ), also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the In ...
.{{Cite web , date=2022-05-12 , title=F.33. pg_trgm , url=https://www.postgresql.org/docs/14/pgtrgm.html , access-date=2022-05-28 , website=PostgreSQL Documentation , language=en


Examples

Consider the string "alice". The trigrams of the string would be "ali", "lic", and "ice," not including spaces. Searching for this string in a database with a trigram-based index would involve finding which objects contain as many of the three trigrams as possible. As a concrete example of using trigram search to search for a regular expression query, consider searching for the string ab d, where the brackets denote that the third character in the string being searched for could be c or d. In this situation, one could query the index for objects that have the two trigrams abc and bce or the two trigrams abd and bde. Thus, finding this query would involve no string matching, and could just query the index directly, which can be faster in practice.


See also

*
Search engine indexing Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and ...
*
Approximate string matching In computer science, approximate string matching (often colloquially referred to as fuzzy string searching) is the technique of finding strings that match a pattern approximately (rather than exactly). The problem of approximate string matching i ...
*
Trigram Trigrams are a special case of the ''n''-gram, where ''n'' is 3. They are often used in natural language processing for performing statistical analysis of texts and in cryptography for control and use of ciphers and codes. Frequency Context is ...
*
N-gram In the fields of computational linguistics and probability, an ''n''-gram (sometimes also called Q-gram) is a contiguous sequence of ''n'' items from a given sample of text or speech. The items can be phonemes, syllables, letters, words or b ...
*
Regular expression A regular expression (shortened as regex or regexp; sometimes referred to as rational expression) is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or ...
*
Google Code Search Google Code Search was a free beta product from Google which debuted in Google Labs on October 5, 2006, allowing web users to search for open-source code on the Internet. Features included the ability to search using operators, namely , , , a ...


References

String matching algorithms Search algorithms