Go to the first, previous, next, last section, table of contents.

12 MySQL Full-text Search

Since Version 3.23.23, MySQL has support for full-text indexing and searching. Full-text indexes in MySQL are an index of type FULLTEXT. FULLTEXT indexes can be created from VARCHAR and TEXT columns at CREATE TABLE time or added later with ALTER TABLE or CREATE INDEX. For large datasets, adding FULLTEXT index with ALTER TABLE (or CREATE INDEX) would be much faster than inserting rows into the empty table with a FULLTEXT index.

Full-text search is performed with the MATCH function.

mysql> CREATE TABLE t (a VARCHAR(200), b TEXT, FULLTEXT (a,b));
Query OK, 0 rows affected (0.00 sec)

    ->   ('MySQL has now support', 'for full-text search'),
    ->   ('Full-text indexes', 'are called collections'),
    ->   ('Only MyISAM tables','support collections'),
    ->   ('Function MATCH ... AGAINST()','is used to do a search'),
    ->   ('Full-text search in MySQL', 'implements vector space model');
Query OK, 5 rows affected (0.00 sec)
Records: 5  Duplicates: 0  Warnings: 0

| a                         | b                             |
| MySQL has now support     | for full-text search          |
| Full-text search in MySQL | implements vector-space-model |
2 rows in set (0.00 sec)

mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
| a                            | b                             | x      |
| MySQL has now support        | for full-text search          | 0.3834 |
| Full-text indexes            | are called collections        | 0.3834 |
| Only MyISAM tables           | support collections           | 0.7668 |
| Function MATCH ... AGAINST() | is used to do a search        |      0 |
| Full-text search in MySQL    | implements vector space model |      0 |
5 rows in set (0.00 sec)

The function MATCH matches a natural language query AGAINST a text collection (which is simply the columns that are covered by a FULLTEXT index). For every row in a table it returns relevance - a similarity measure between the text in that row (in the columns that are part of the collection) and the query. When it is used in a WHERE clause (see example above) the rows returned are automatically sorted with relevance decreasing. Relevance is a non-negative floating-point number. Zero relevance means no similarity. Relevance is computed based on the number of words in the row, the number of unique words in that row, the total number of words in the collection, and the number of documents (rows) that contain a particular word.

MySQL uses a very simple parser to split text into words. A ``word'' is any sequence of letters, numbers, `'', and `_'. Any ``word'' that is present in the stopword list or just too short (3 characters or less) is ignored.

Every correct word in the collection and in the query is weighted, according to its significance in the query or collection. This way, a word that is present in many documents will have lower weight (and may even have a zero weight), because it has lower semantic value in this particular collection. Otherwise, if the word is rare, it will receive a higher weight. The weights of the words are then combined to compute the relevance of the row.

Such a technique works best with large collections (in fact, it was carefully tuned this way). For very small tables, word distribution does not reflect adequately their semantical value, and this model may sometimes produce bizarre results.

For example, search for the word "search" will produce no results in the above example. Word "search" is present in more than half of rows, and as such, is effectively treated as a stopword (that is, with semantical value zero). It is, really, the desired behavior - a natural language query should not return every other row in 1GB table.

A word that matches half of rows in a table is less likely to locate relevant documents. In fact, it will most likely find plenty of irrelevant documents. We all know this happens far too often when we are trying to find something on the Internet with a search engine. It is with this reasoning that such rows have been assigned a low semantical value in a particular dataset.

12.1 Fine-tuning MySQL Full-text Search

Unfortunately, full-text search has no user-tunable parameters yet, although adding some is very high on the TODO. However, if you have a MySQL source distribution ( 「4.7 MySQL ソースディストリビューションのインストール」節参照.), you can somewhat alter the full-text search behavior.

Note that full-text search was carefully tuned for the best searching effectiveness. Modifying the default behavior will, in most cases, only make the search results worse. Do not alter the MySQL sources unless you know what you are doing!

12.2 New Features of Full-text Search to Appear in MySQL 4.0

This section includes a list of the fulltext features that are already implemented in the 4.0 tree. It explains More functions for full-text search entry of 「H.1 Things that should be in 4.0」節.

12.3 Full-text Search TODO

Go to the first, previous, next, last section, table of contents.