Sphinx (search engine)

Sphinx is a full-text search engine that provides text search functionality to client applications.


Overview

Sphinx can be used either as a stand-alone server or as a storage engine ("SphinxSE") for the MySQL family of databases. When run as a stand-alone server, Sphinx operates like a DBMS and can communicate with MySQL, MariaDB, and PostgreSQL through their native protocols, or with any ODBC-compliant DBMS via ODBC. MariaDB, a fork of MySQL, is distributed with SphinxSE.


SphinxAPI

If Sphinx is run as a stand-alone server, an application can connect to it through SphinxAPI. Official implementations of the API are available for the PHP, Java, Perl, Ruby, and Python languages. Unofficial implementations for other languages, as well as various third-party plugins and modules, are also available. Other data sources can be indexed via a pipe in a custom XML format.
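
The XML pipe mentioned above can be sketched in Python. This is a minimal illustration, assuming the xmlpipe2 layout (a sphinx:docset wrapper containing a sphinx:schema and sphinx:document elements); the function name build_xmlpipe2 and the sample field and attribute names are hypothetical, and the element names should be checked against the documentation for the Sphinx version in use.

```python
from xml.sax.saxutils import escape, quoteattr


def build_xmlpipe2(documents, fields, attrs):
    """Return an xmlpipe2-style document stream as a string.

    documents: iterable of (doc_id, {name: value}) pairs.
    fields:    names of full-text fields.
    attrs:     mapping of attribute name -> attribute type string.
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
    ]
    # Declare the full-text fields and the attributes up front.
    for name in fields:
        lines.append(f"<sphinx:field name={quoteattr(name)}/>")
    for name, attr_type in attrs.items():
        lines.append(
            f"<sphinx:attr name={quoteattr(name)} type={quoteattr(attr_type)}/>"
        )
    lines.append("</sphinx:schema>")
    # Emit one element per document, escaping the text content.
    for doc_id, values in documents:
        lines.append(f'<sphinx:document id="{int(doc_id)}">')
        for name in list(fields) + list(attrs):
            if name in values:
                lines.append(f"<{name}>{escape(str(values[name]))}</{name}>")
        lines.append("</sphinx:document>")
    lines.append("</sphinx:docset>")
    return "\n".join(lines)


# Hypothetical sample data: one document with two fields and one attribute.
xml_stream = build_xmlpipe2(
    documents=[
        (1, {"title": "Shovels & spades", "content": "Garden tools",
             "published": 1012325463}),
    ],
    fields=["title", "content"],
    attrs={"published": "timestamp"},
)
print(xml_stream)
```

A script like this would typically write the stream to standard output so that the indexer can read it through a pipe.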


SphinxQL

The Sphinx search daemon supports the MySQL binary network protocol and can be accessed with the regular MySQL API and/or clients. Sphinx supports a subset of SQL known as SphinxQL. It supports standard querying of all index types with SELECT, modifying RealTime indexes with INSERT, REPLACE, and DELETE, and more.
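
As a rough illustration of the statements named above, the following Python sketch builds SphinxQL text together with its parameters. It is not a client: it only assembles strings. The helper names, the index names (products, rt_products), and the column name are hypothetical, and the %s placeholders assume the client-side parameter style used by common Python MySQL drivers.

```python
def _check_identifier(name):
    """Reject names that are not plain identifiers, since they are interpolated."""
    if not name.isidentifier():
        raise ValueError(f"unsafe identifier: {name!r}")
    return name


def build_select(index, query_text, limit=10):
    """Build a SELECT with a full-text MATCH() clause and its parameters."""
    _check_identifier(index)
    sql = f"SELECT id FROM {index} WHERE MATCH(%s) LIMIT {int(limit)}"
    return sql, (query_text,)


def build_replace(index, doc_id, columns):
    """Build a REPLACE statement for a RealTime index and its parameters."""
    _check_identifier(index)
    names = [_check_identifier(name) for name in columns]
    placeholders = ", ".join(["%s"] * (len(names) + 1))
    sql = f"REPLACE INTO {index} (id, {', '.join(names)}) VALUES ({placeholders})"
    return sql, (doc_id, *columns.values())


select_sql, select_params = build_select("products", "laptop")
replace_sql, replace_params = build_replace("rt_products", 7, {"title": "Shovel"})
print(select_sql, select_params)
print(replace_sql, replace_params)
```

The resulting pairs could be handed to the cursor of a MySQL client library connected to the search daemon; which drivers work with a given Sphinx version is not verified here.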


SphinxSE

Sphinx can also provide a special storage engine for MariaDB and MySQL databases. This allows MySQL and MariaDB to communicate with Sphinx's searchd to run queries and obtain results. Sphinx indices are treated like regular SQL tables. The SphinxSE storage engine is shipped with MariaDB.


Full-text fields and indexing

Sphinx is configured to examine a data set via its Indexer. The Indexer process creates a full-text index (a special data structure that enables quick keyword searches) from the given data or text. Full-text fields are the resulting content that is indexed by Sphinx; they can be searched quickly for keywords. Fields are named, and you can limit your searches to a single field (e.g. search through "title" only) or a subset of fields (e.g. "title" and "abstract" only). Sphinx's index format generally supports up to 256 fields. Note that the original data is not stored in the Sphinx index but is discarded during indexing; Sphinx assumes that you store those contents elsewhere.
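
The idea of a full-text index over named fields can be illustrated with a toy inverted index in Python. This is a teaching sketch only, not Sphinx's actual index format: the class name ToyFieldIndex, the whitespace tokenizer, and the sample documents are all assumptions made for the example.

```python
from collections import defaultdict


class ToyFieldIndex:
    """Toy inverted index: keyword -> field name -> set of document ids."""

    def __init__(self):
        self.postings = defaultdict(lambda: defaultdict(set))

    def add(self, doc_id, fields):
        # Only keywords are recorded; the original text is not kept.
        for field, text in fields.items():
            for keyword in text.lower().split():
                self.postings[keyword][field].add(doc_id)

    def search(self, keyword, fields=None):
        """Return sorted ids of documents containing keyword.

        If fields is given, only matches inside those fields count.
        """
        by_field = self.postings.get(keyword.lower(), {})
        hits = set()
        for field, doc_ids in by_field.items():
            if fields is None or field in fields:
                hits |= doc_ids
        return sorted(hits)


index = ToyFieldIndex()
index.add(1, {"title": "Sphinx search", "abstract": "A full-text engine"})
index.add(2, {"title": "Database engines", "abstract": "Sphinx as a storage engine"})

everywhere = index.search("sphinx")                    # all fields
title_only = index.search("sphinx", fields={"title"})  # limited to "title"
print(everywhere, title_only)
```

A lookup touches only the postings for one keyword, which is what makes keyword search fast compared with scanning every document.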


Attributes

Attributes are additional values associated with each document that can be used to perform additional filtering and sorting during search. Attributes are named, and attribute names are case insensitive. Attributes are not full-text indexed; they are stored in the index as is. Currently supported attribute types are:
* unsigned integers (1-bit to 32-bit wide);
* UNIX timestamps;
* floating point values (32-bit, IEEE 754 single precision);
* string ordinals (specially computed integers);
* strings (since 1.10-beta);
* JSON (since 2.1.1-beta);
* MVA, multi-value attributes (variable-length lists of 32-bit unsigned integers).
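
Filtering and sorting on attributes can be pictured as a post-processing step over the documents matched by the full-text stage. The Python sketch below is a conceptual model under that assumption, not Sphinx's implementation; the function name filter_and_sort and the sample attribute names are hypothetical.

```python
def filter_and_sort(matches, attributes, where, sort_by, descending=False):
    """Keep matched ids whose attributes satisfy `where`, then order them.

    matches:    document ids returned by the full-text stage.
    attributes: {doc_id: {attribute_name: value}}.
    where:      {attribute_name: (min_value, max_value)}, inclusive ranges.
    sort_by:    attribute name used as the sort key.
    """
    # Attribute names are case insensitive, so normalize them once.
    normalized = {
        doc_id: {name.lower(): value for name, value in attrs.items()}
        for doc_id, attrs in attributes.items()
    }
    kept = [
        doc_id
        for doc_id in matches
        if all(
            low <= normalized[doc_id][name.lower()] <= high
            for name, (low, high) in where.items()
        )
    ]
    return sorted(
        kept, key=lambda d: normalized[d][sort_by.lower()], reverse=descending
    )


attrs = {
    1: {"Price": 300, "published": 1700000000},
    2: {"price": 120, "published": 1600000000},
    3: {"price": 950, "published": 1650000000},
}
cheap_first = filter_and_sort(
    [1, 2, 3], attrs, where={"PRICE": (100, 500)}, sort_by="price"
)
newest_first = filter_and_sort(
    [1, 2, 3], attrs, where={}, sort_by="published", descending=True
)
print(cheap_first, newest_first)
```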


JSON attributes in Sphinx

Sphinx, like classic SQL databases, works with a so-called fixed schema, that is, a set of predefined attribute columns. These work well when most of the data stored actually has values, but mapping sparse data to static columns can be cumbersome. Assume, for example, that you are running a price comparison or auction site with many different product categories. Some attributes, such as the price or the vendor, apply to all goods. But for laptops you also need to store the weight, screen size, HDD type, RAM size, and so on, while for shovels you probably want to store the color, the handle length, and so on. This is manageable within a single category, but the distinct fields needed for all the goods across all categories are legion. A JSON attribute can be used to overcome this. Inside the JSON attribute you don't need a fixed structure: you can have various keys which may or may not be present in all documents. When you filter on one of these keys, Sphinx ignores documents that don't have the key in the JSON attribute and works only with those documents that have it.
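
The filtering behaviour described above can be modelled in a few lines of Python. This is a conceptual sketch, not Sphinx code: the function name filter_by_json_key, the catalog contents, and the key names are hypothetical examples following the laptop and shovel scenario.

```python
import json


def filter_by_json_key(documents, key, predicate):
    """Return ids of documents whose JSON attribute has `key` and passes `predicate`.

    documents: {doc_id: json_string}, one JSON attribute per document.
    """
    result = []
    for doc_id, raw in documents.items():
        props = json.loads(raw)
        if key not in props:
            continue  # documents without the key are ignored
        if predicate(props[key]):
            result.append(doc_id)
    return result


catalog = {
    1: '{"category": "laptop", "weight_kg": 1.4, "ram_gb": 16}',
    2: '{"category": "shovel", "color": "green", "handle_cm": 120}',
    3: '{"category": "laptop", "weight_kg": 2.3, "ram_gb": 8}',
}

light_laptops = filter_by_json_key(catalog, "weight_kg", lambda w: w < 2.0)
long_handles = filter_by_json_key(catalog, "handle_cm", lambda h: h >= 100)
print(light_laptops, long_handles)
```

The shovel never appears in a weight filter and the laptops never appear in a handle-length filter, even though no common schema lists either key.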


License

Up until version 3, Sphinx was dual-licensed: it was available under the GNU General Public License version 2, and proprietary licensing was available for use cases not covered by the terms of the GNU GPLv2. Since version 3, Sphinx has been proprietary, with a promise to release its source code in the future.


Sphinx use examples

* Craigslist.org
* Tradebit.com
* vBulletin.com
* MediaWiki extension
* Boardreader.com
* OMBE.com
* Limundo.com


Feature list

* Batch and incremental (soft real-time) full-text indexing.
* Support for non-text attributes (scalars, strings, sets, JSON).
* Direct indexing of SQL databases. Native support for MySQL, MariaDB, PostgreSQL, MSSQL, plus ODBC connectivity.
* XML document indexing support.
* Distributed searching support out-of-the-box.
* Integration via access APIs.
* SQL-like syntax support via MySQL protocol (since 0.9.9).
* Full-text searching syntax.
* Database-like result set processing.
* Relevance ranking utilizing additional factors besides standard BM25.
* Text processing support for SBCS and UTF-8 encodings, stopwords, indexing of words known not to appear in the database ("hitless"), stemming, word forms, tokenizing exceptions, and "blended characters" (dual-indexing as both a real character and a word separator).
* Supports UDF (since 2.0.1).


Performance and scalability

* Indexing speed of up to 10-15 MB/sec per core and HDD.
* Searching speed of over 500 queries/sec against a collection of 1,000,000 documents (1.2 GB) using a 2-core desktop system with 2 GB of RAM.
* The biggest known installation using Sphinx, Boardreader.com, indexes 16 billion documents.
* The busiest known installation, Craigslist, serves over 300,000,000 queries/day and more than 50 billion page views/month.
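
To put the figures above on a common scale, the short Python calculation below converts the daily query volume into an average per-second rate and estimates indexing time for the 1.2 GB collection. It assumes load spread evenly over the day, 1 GB = 1000 MB, and a single core; the function names are illustrative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day):
    """Average queries per second, assuming load is spread evenly over the day."""
    return queries_per_day / SECONDS_PER_DAY


def indexing_seconds(collection_mb, mb_per_sec):
    """Time to index a collection at a constant throughput on one core."""
    return collection_mb / mb_per_sec


craigslist_qps = average_qps(300_000_000)
slowest = indexing_seconds(1200, 10)  # 1.2 GB at 10 MB/sec
fastest = indexing_seconds(1200, 15)  # 1.2 GB at 15 MB/sec

print(f"average load: about {craigslist_qps:.0f} queries/sec")
print(f"indexing 1.2 GB: {fastest:.0f} to {slowest:.0f} seconds")
```

Real traffic is not uniform, so peak rates would be higher than this average.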


Fork

In 2017, key members of the original Sphinx team forked the project under the name Manticore, with the intention of fixing bugs and developing new features. Unlike Sphinx, Manticore continues to be released as open source under version 3 of the GPL.


See also

* List of information retrieval libraries


Further reading

* No more open-source (2017) https://sphinxsearch.com/blog/2017/07/24/sphinx-2017/


External links

* Official website: https://sphinxsearch.com
* SphinxSE in MariaDB KnowledgeBase
* https://manticoresearch.com/ - Manticore open-source fork site.