Jay Taylor's notes
About | Sphinx
Sphinx overview
Sphinx is an open source full-text search server, designed from the ground up with performance, relevance (aka search quality), and integration simplicity in mind. It's written in C++ and works on Linux (RedHat, Ubuntu, etc.), Windows, MacOS, Solaris, FreeBSD, and a few other systems.
Sphinx lets you either batch index and search data stored in an SQL database, NoSQL storage, or just files, quickly and easily, or index and search data on the fly, working with Sphinx pretty much as with a database server.
A variety of text processing features enable fine-tuning Sphinx for your particular application requirements, and a number of relevance functions ensure you can tweak search quality as well.
Searching via SphinxAPI is as simple as 3 lines of code, and querying via SphinxQL is even simpler, with search queries expressed in good old SQL.
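As a rough illustration of the SphinxQL point, the sketch below builds a minimal full-text query as a plain SQL string. The index name `products`, the helper names, and the sample keywords are hypothetical, and the quoting covers only backslashes and single quotes in the string literal (an assumption, not Sphinx's own escaping routine). Actually sending the statement would need a running Sphinx server and a MySQL client library, which is left out here.

```python
def quote_literal(text: str) -> str:
    """Quote text as a single-quoted SQL string literal (simplified, assumed rules)."""
    escaped = text.replace("\\", "\\\\").replace("'", "\\'")
    return "'" + escaped + "'"


def build_search(index: str, keywords: str, limit: int = 20) -> str:
    """Return a SphinxQL SELECT that full-text matches `keywords` in `index`."""
    # Index names cannot be passed as quoted literals, so only accept
    # plain identifier-like names here.
    if not index.isidentifier():
        raise ValueError(f"unexpected index name: {index!r}")
    return (
        f"SELECT * FROM {index} "
        f"WHERE MATCH({quote_literal(keywords)}) "
        f"LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # 'products' is a hypothetical index name used only for illustration.
    print(build_search("products", "wireless mouse", limit=5))
```

The resulting string is ordinary SQL text, so an application that already talks to MySQL can reuse its existing client code to run it against Sphinx.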
Sphinx clusters scale up to tens of billions of documents and hundreds of millions of search queries per day, powering top websites such as Craigslist, Living Social, MetaCafe, and Groupon. To view a complete list of known users, please visit our Powered-by page.
And last but not least, it's licensed under GPLv2.
Performance and scalability
- Indexing performance. Sphinx indexes up to 10-15 MB of text per second per single CPU core, that is 60+ MB/sec per server (on a dedicated indexing machine).
- Searching performance. Searching through the 1,000,000-document, 1.2 GB text collection that we use for everyday development and testing runs at 500+ queries/sec on a 2-core desktop machine with 2 GB of RAM.
- Scalability. The biggest known Sphinx cluster indexes 25+ billion documents, resulting in over 9 TB of data. The busiest known one is Craigslist, serving 300+ million search queries/day.
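The figures in this list can be put on a common footing with a little arithmetic. The sketch below assumes decimal units (1 GB = 1,000 MB, 1 TB = 10^12 bytes) and treats the quoted "N+" values as exact lower bounds, so the derived numbers are back-of-envelope averages rather than measurements. The function names are made up for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def average_qps(queries_per_day: float) -> float:
    """Average queries per second if the daily load were spread evenly."""
    return queries_per_day / SECONDS_PER_DAY


def indexing_seconds(collection_mb: float, mb_per_sec: float) -> float:
    """Time to index a collection at a given sustained throughput."""
    return collection_mb / mb_per_sec


def bytes_per_document(total_bytes: float, documents: float) -> float:
    """Average amount of data per indexed document."""
    return total_bytes / documents


if __name__ == "__main__":
    # 300 million queries/day, averaged over the whole day.
    print(f"average load: {average_qps(300_000_000):.0f} queries/sec")
    # 1.2 GB test collection on one core at the quoted 10-15 MB/sec.
    print(f"indexing time: {indexing_seconds(1200, 15):.0f}-"
          f"{indexing_seconds(1200, 10):.0f} sec")
    # 60 MB/sec per server divided by the per-core rate.
    print(f"cores implied: {60 / 15:.0f}-{60 / 10:.0f}")
    # 9 TB spread over 25 billion documents.
    print(f"data per document: {bytes_per_document(9e12, 25e9):.0f} bytes")
```

Real traffic peaks well above a daily average, so the per-second figure is a floor for capacity planning rather than a target.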
Key features
- Batch and Real-Time full-text indexes. Two index backends are available, supporting both efficient offline index construction and incremental on-the-fly index updates.
- Non-text attributes support. An arbitrary number of attributes (product IDs, company names, prices, etc.) can be stored in the index and used either just for retrieval (to avoid hitting the DB), or for efficient Sphinx-side search result set post-processing.
- SQL database indexing. Sphinx can directly access and index data stored in MySQL (all storage engines are supported), PostgreSQL, Oracle, Microsoft SQL Server, SQLite, Drizzle, and anything else that supports ODBC.
- Non-SQL storage indexing. Data can also be streamed to the batch indexer in a simple XML format called XMLpipe, or inserted directly into an incremental RT index.
- Easy application integration. Sphinx comes with three different APIs: SphinxAPI, SphinxSE, and SphinxQL. SphinxAPI is a native library available for Java, PHP, Python, Perl, C, and other languages. SphinxSE, a pluggable storage engine for MySQL, enables huge result sets to be shipped directly to the MySQL server for post-processing. SphinxQL lets the application query Sphinx using the standard MySQL client library and query syntax.
- Advanced full-text searching syntax. Our querying engine supports arbitrarily complex queries combining boolean operators, phrase, proximity, strict order, and quorum matching, field and position limits, exact keyword form matching, substring searches, etc.
- Rich database-like querying features. Sphinx does not limit you to just keyword searching. On top of the full-text search result set, you can compute arbitrary arithmetic expressions, add WHERE conditions, do ORDER BY and GROUP BY, use MIN/MAX/AVG/SUM aggregates, etc. Essentially, full-blown SQL SELECT is supported.
- Better relevance ranking. Unlike many other engines, Sphinx does not rely solely on 30-year-old statistical ranking that only considers keyword frequencies, nor does it limit you to it. By default, Sphinx additionally analyzes keyword proximity, and ranks closer phrase matches higher, with perfect matches ranked on top. Also, ranking is flexible: you can choose from a number of built-in relevance functions, tweak their weights by using expressions, or develop new ones.
- Flexible text processing. Sphinx indexing features include full support for SBCS and UTF-8 encodings (meaning that effectively all the world's languages are supported); stopword removal and optional hit position removal (hitless indexing); morphology and synonym processing through word forms dictionaries and stemmers; exceptions and blended characters; and many more.
- Distributed searching. Searches can be distributed across multiple machines, enabling horizontal scale-out and HA (High Availability).
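To make the query-related features in this list more concrete, the sketch below collects a few full-text expressions in Sphinx's extended query syntax and wraps one of them in a SphinxQL statement with a filter, grouping, an aggregate, and ordering. The operator spellings reflect the extended query syntax as commonly documented and should be checked against the reference manual for your version. The index `products`, the columns `title`, `price`, and `vendor_id`, the keywords, and the helper name are all hypothetical.

```python
# Sample expressions in the extended query syntax (keywords are made up).
QUERY_EXAMPLES = {
    "boolean": "wireless (mouse | keyboard) -refurbished",
    "phrase": '"wireless mouse"',
    "proximity": '"wireless mouse"~5',
    "quorum": '"fast quiet wireless optical mouse"/3',
    "strict order": "wireless << optical << mouse",
    "field limit": "@title wireless mouse",
    "position limit": "@title[10] wireless",
    "exact form": "=mice",
}


def select_with_grouping(index: str, match_expr: str) -> str:
    """Compose a SphinxQL SELECT over hypothetical columns.

    The full-text expression goes into MATCH(); the price filter, GROUP BY,
    MIN() aggregate, and ORDER BY are applied on top of the matched set.
    """
    # Simplified quoting for the string literal (assumed rules).
    literal = match_expr.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT vendor_id, COUNT(*) AS offers, MIN(price) AS best_price "
        f"FROM {index} "
        f"WHERE MATCH('{literal}') AND price < 100 "
        f"GROUP BY vendor_id "
        f"ORDER BY best_price ASC "
        f"LIMIT 10"
    )


if __name__ == "__main__":
    for name, expr in QUERY_EXAMPLES.items():
        print(f"{name:>14}: {expr}")
    print(select_with_grouping("products", QUERY_EXAMPLES["proximity"]))
```

Keeping the full-text expression separate from the SQL wrapper mirrors how the two feature sets combine: the expression selects the matching documents, and the surrounding SELECT post-processes them.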
License
The Sphinx Search server is dual-licensed, thus it can be either commercially licensed or freely available via the Downloads page if used in accordance with the terms of the GPL v.2.
For those interested in commercial licensing, typically needed for embedding Sphinx in non-GPL products (OEMs/ISVs), please refer to the Commercial Licensing page for additional information, or reach out to the Sphinx Licensing team directly via our Contact page.