Trigram search

Last updated

Trigram search is a method of searching for text when the exact syntax or spelling of the target object is not precisely known [1] or when queries may be regular expressions. [2] It finds objects which match the maximum number of three consecutive character strings (i.e. trigrams) in the entered search terms, which are generally near matches. [3] Two strings with many shared trigrams can be expected to be very similar. [4] Trigrams also allow for efficiently creating search engine indexes for searches that are regular expressions or match the text inexactly. Indexes can significantly accelerate searches. [5] [6] A threshold for number of trigram matches can be specified as a cutoff point, after which a result is no longer considered a match. [4]

Contents

Using trigrams for accelerating searches is a technique used in some systems for code searching, in situations in which queries that are regular expressions may be useful, [5] [2] [7] in search engines such as Elasticsearch, [8] as well as in databases such as PostgreSQL. [4]

Examples

Consider the string "alice". The trigrams of the string would be "ali", "lic", and "ice," not including spaces. [5] Searching for this string in a database with a trigram-based index would involve finding which objects contain as many of the three trigrams as possible.

As a concrete example of using trigram search to search for a regular expression query, consider searching for the string ab[cd]e, where the brackets denote that the third character in the string being searched for could be c or d. In this situation, one could query the index for objects that have the two trigrams abc and bce or the two trigrams abd and bde. Thus, finding this query would involve no string matching, and could just query the index directly, which can be faster in practice. [2]

See also

Related Research Articles

<span class="mw-page-title-main">PostgreSQL</span> Free and open-source relational database management system

PostgreSQL, also known as Postgres, is a free and open-source relational database management system (RDBMS) emphasizing extensibility and SQL compliance. It was originally named POSTGRES, referring to its origins as a successor to the Ingres database developed at the University of California, Berkeley. In 1996, the project was renamed to PostgreSQL to reflect its support for SQL. After a review in 2007, the development team decided to keep the name PostgreSQL and the alias Postgres.

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

In computer science, string-searching algorithms, sometimes called string-matching algorithms, are an important class of string algorithms that try to find a place where one or several strings are found within a larger string or text.

Structured Query Language (SQL) is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS), or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.

grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p, which has the same effect. grep was originally developed for the Unix operating system, but later available for all Unix-like systems and some others such as OS-9.

agrep is an open-source approximate string matching program, developed by Udi Manber and Sun Wu between 1988 and 1991, for use with the Unix operating system. It was later ported to OS/2, DOS, and Windows.

Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for homophones to be encoded to the same representation so that they can be matched despite minor differences in spelling. The algorithm mainly encodes consonants; a vowel will not be encoded unless it is the first letter. Soundex is the most widely known of all phonetic algorithms Improvements to Soundex are the basis for many modern phonetic algorithms.

In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases.

The following tables compare general and technical information for a number of relational database management systems. Please see the individual products' articles for further information. Unless otherwise specified in footnotes, comparisons are based on the stable versions without any add-ons, extensions or external programs.

A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time said table is accessed. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.

<span class="mw-page-title-main">Null (SQL)</span> Marker used in SQL databases to indicate a value does not exist

In SQL, null or NULL is a special marker used to indicate that a data value does not exist in the database. Introduced by the creator of the relational database model, E. F. Codd, SQL null serves to fulfil the requirement that all true relational database management systems (RDBMS) support a representation of "missing information and inapplicable information". Codd also introduced the use of the lowercase Greek omega (ω) symbol to represent null in database theory. In SQL, NULL is a reserved word used to identify this marker.

<span class="mw-page-title-main">Approximate string matching</span> Finding strings that approximately match a pattern

In computer science, approximate string matching is the technique of finding strings that match a pattern approximately. The problem of approximate string matching is typically divided into two sub-problems: finding approximate substring matches inside a given string and finding dictionary strings that match the pattern approximately.

A WHERE clause in SQL specifies that a SQL Data Manipulation Language (DML) statement should only affect rows that meet specified criteria. The criteria are expressed in the form of predicates. WHERE clauses are not mandatory clauses of SQL DML statements, but can be used to limit the number of rows affected by a SQL DML statement or returned by a query. In brief SQL WHERE clause is used to extract only those results from a SQL statement, such as: SELECT, INSERT, UPDATE, or DELETE statement.

A spatial database is a general-purpose database that has been enhanced to include spatial data that represents objects defined in a geometric space, along with tools for querying and analyzing such data. Most spatial databases allow the representation of simple geometric objects such as points, lines and polygons. Some spatial databases handle more complex structures such as 3D objects, topological coverages, linear networks, and triangulated irregular networks (TINs). While typical databases have developed to manage various numeric and character types of data, such databases require additional functionality to process spatial data types efficiently, and developers have often added geometry or feature data types. The Open Geospatial Consortium (OGC) developed the Simple Features specification and sets standards for adding spatial functionality to database systems. The SQL/MM Spatial ISO/IEC standard is a part of the SQL/MM multimedia standard and extends the Simple Features standard with data types that support circular interpolations. Almost all current relational and object-relational database management systems now have spatial extensions, and some GIS software vendors have developed their own spatial extensions to database management systems.

In relational databases, a condition in a query is said to be sargable if the DBMS engine can take advantage of an index to speed up the execution of the query. The term is derived from a contraction of Search ARGument ABLE. It was first used by IBM researchers as a contraction of Search ARGument, and has come to mean simply "can be looked up by an index."

Google Code Search was a free beta product from Google which debuted in Google Labs on October 5, 2006, allowing web users to search for open-source code on the Internet. Features included the ability to search using operators, namely lang:, package:, license:, and file:.

Language Integrated Query is a Microsoft .NET Framework component that adds native data querying capabilities to .NET languages, originally released as a major part of .NET Framework 3.5 in 2007.

<span class="mw-page-title-main">Elasticsearch</span> Search engine

Elasticsearch is a search engine based on the Lucene library. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is dual-licensed under the source-available Server Side Public License and the Elastic license, while other parts fall under the proprietary (source-available) Elastic License. Official clients are available in Java, .NET (C#), PHP, Python, Ruby and many other languages. According to the DB-Engines ranking, Elasticsearch is the most popular enterprise search engine.

In computer science, the Commentz-Walter algorithm is a string searching algorithm invented by Beate Commentz-Walter. Like the Aho–Corasick string matching algorithm, it can search for multiple patterns at once. It combines ideas from Aho–Corasick with the fast matching of the Boyer–Moore string-search algorithm. For a text of length n and maximum pattern length of m, its worst-case running time is O(mn), though the average case is often much better.

In computer science, an algorithm for matching wildcards is useful in comparing text strings that may contain wildcard syntax. Common uses of these algorithms include command-line interfaces, e.g. the Bourne shell or Microsoft Windows command-line or text editor or file manager, as well as the interfaces for some search engines and databases. Wildcard matching is a subset of the problem of matching regular expressions and string matching in general.

References

  1. Hardarson, Omar (1997). "Interactive coding of economic activity using trigram search in BLAISE III" (PDF). International Blaise User Group. Note: This article discusses trigram search as a way of efficiently encoding certain kinds of economic data, and finds that this technique is particularly useful when the users of the system don't have a lot of context for the structure of the data.
  2. 1 2 3 Cox, Russ (January 2012). "Regular Expression Matching with a Trigram Index or How Google Code Search Worked".
  3. Adams, Elizabeth; Meltzer, Arnold (1 March 1993). "Trigrams as index element in full text retrieval: Observations and experimental results". Proceedings of the 1993 ACM conference on Computer science - CSC '93. pp. 433–439. doi:10.1145/170791.170891. ISBN   0897915585. S2CID   16701550.
  4. 1 2 3 "F.33. pg_trgm". PostgreSQL Documentation. 2022-05-12. Retrieved 2022-05-28.
  5. 1 2 3 "Fast Search Using PostgreSQL Trigram Text Indexes". GitLab. 2016-03-18. Retrieved 2022-05-28.
  6. Zobel, Justin; Moffat, Alistair; Sacks-Davis, Ron (1993). "Searching Large Lexicons for Partially Specified Terms using Compressed Inverted Files" (PDF). Conference on Very Large Databases (VLDB). Note: This research paper does not use the term "trigram search" but does seem to be the first instance in the literature of using n-grams as indices, and is cited in the Russ Cox article as the first instance of the structure of a reverse index based on trigrams. The paper also cited successful performance results from using this style of search.
  7. "Big Grep". resources.sei.cmu.edu. Retrieved 2022-06-12. Also called BigGrep. Uses n-grams (not always 3),
  8. "N-gram tokenizer | Elasticsearch Guide [8.2] | Elastic". www.elastic.co. Retrieved 2022-05-28.