Tuesday, 23 January 2018

Which SQL query is better, MATCH AGAINST or LIKE?

To search the database for rows that contain both keywords "foo" AND "bar" in any of the columns "foo_desc" and "bar_desc", I would do something like:
SELECT * 
FROM t1 
WHERE MATCH (t1.foo_desc, t1.bar_desc) AGAINST ('+foo* +bar*' IN BOOLEAN MODE)
or
SELECT * 
FROM t1 
WHERE (CONCAT(t1.foo_desc, t1.bar_desc) LIKE '%foo%') AND (CONCAT(t1.foo_desc, t1.bar_desc) LIKE '%bar%')
I expect the downside of the second query is performance.
The upside is that the LIKE query finds 'xxfoo' where MATCH AGAINST does not.
Which one is preferred, or is there a better solution?

As of MySQL 5.6, InnoDB tables also support MATCH ... AGAINST.

The first is much better. On MyISAM tables it will use a full-text index on those columns. The second will do a full table scan, concatenating the columns on every row and then comparing the result.
LIKE is only efficient if you're doing it against:
  • a column (not a result of a function unless your particular database vendor supports functional indexes--Oracle, for example--and you're using them);
  • the start of the column (ie LIKE 'blah%' as opposed to LIKE '%blah%'); and
  • a column that's indexed.
If any one of those conditions is not true, the only way for the SQL engine to execute the query is a full table scan. That can be usable up to about 10-20 thousand rows; beyond that it quickly becomes unusable.
Note: One problem with MATCH on MySQL is that it seems to only match against whole words so a search for 'bla' won't match a column with a value of 'blah', but a search for 'bla*' will.
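A minimal sketch of the two index-friendly forms described in this answer. It assumes both columns live in t1 and that foo_desc is a TEXT column (or a VARCHAR of at least 50 characters); the index names ft_desc and idx_foo_desc are made up for illustration.

```sql
-- Assumption: t1 holds both foo_desc and bar_desc; index names are hypothetical.
ALTER TABLE t1 ADD FULLTEXT INDEX ft_desc (foo_desc, bar_desc);

-- Rows where a word starting with "foo" and a word starting with "bar"
-- both occur somewhere in the two columns.
-- The MATCH() column list is the same as the FULLTEXT index column list.
SELECT *
FROM t1
WHERE MATCH (foo_desc, bar_desc) AGAINST ('+foo* +bar*' IN BOOLEAN MODE);

-- A LIKE that can use an ordinary index: a bare column, with the pattern
-- anchored at the start of the value. The prefix length (50) is required
-- for TEXT columns and optional for VARCHAR.
CREATE INDEX idx_foo_desc ON t1 (foo_desc(50));

SELECT *
FROM t1
WHERE foo_desc LIKE 'foo%';
```

Running EXPLAIN on each SELECT is a quick way to check whether the intended index is actually chosen.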

MariaDB - Full-Text Index Overview

MariaDB has support for full-text indexing and searching:
  • A full-text index in MariaDB is an index of type FULLTEXT, and it allows more options when searching for portions of text from a field.
  • Full-text indexes can be used only with MyISAM and Aria tables, with InnoDB tables from MariaDB 10.0.5, and with Mroonga tables from MariaDB 10.0.15. They can be created only for CHAR, VARCHAR, or TEXT columns.
  • Partitioned tables cannot contain fulltext indexes, even if the storage engine supports them.
  • A FULLTEXT index definition can be given in the CREATE TABLE statement when a table is created, or added later using ALTER TABLE or CREATE INDEX.
  • For large data sets, it is much faster to load your data into a table that has no FULLTEXT index and then create the index after that, than to load data into a table that has an existing FULLTEXT index.
Full-text searching is performed using MATCH() ... AGAINST syntax. MATCH() takes a comma-separated list that names the columns to be searched. AGAINST takes a string to search for, and an optional modifier that indicates what type of search to perform. The search string must be a literal string, not a variable or a column name.
MATCH (col1,col2,...) AGAINST (expr [search_modifier])
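The three ways of defining a FULLTEXT index mentioned above can be sketched as follows. The table articles_demo and all index names are hypothetical and used only for illustration.

```sql
-- Hypothetical table and index names, for illustration only.

-- 1. In the CREATE TABLE statement:
CREATE TABLE articles_demo (
  id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  title VARCHAR(200),
  body TEXT,
  FULLTEXT KEY ft_title_body (title, body)
) ENGINE=MyISAM;

-- 2. Added later with ALTER TABLE:
ALTER TABLE articles_demo ADD FULLTEXT INDEX ft_body (body);

-- 3. Added later with CREATE INDEX:
CREATE FULLTEXT INDEX ft_title ON articles_demo (title);

-- Search the two columns covered by ft_title_body.
-- IN NATURAL LANGUAGE MODE is the default modifier and could be omitted.
SELECT id, title
FROM articles_demo
WHERE MATCH (title, body) AGAINST ('database' IN NATURAL LANGUAGE MODE);
```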

Excluded Results

  • Partial words are excluded.
  • Words less than 4 characters in length (3 or less) will not be stored in the fulltext index. This value can be adjusted by changing the ft_min_word_len system variable (or, for InnoDB, innodb_ft_min_token_size).
  • Words longer than 84 characters in length will also not be stored in the fulltext index. This value can be adjusted by changing the ft_max_word_len system variable (or, for InnoDB, innodb_ft_max_token_size).
  • Stopwords are a list of common words such as "once" or "then" that are not reflected in the search results unless IN BOOLEAN MODE is used. The stopword list for MyISAM/Aria tables and InnoDB tables can differ. See stopwords for details and a full list, as well as for details on how to change the default list.
  • For MyISAM/Aria fulltext indexes only, if a word appears in more than half the rows, it is also excluded from the results of a fulltext search.
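To see which word-length limits apply on a given server, the relevant system variables can be inspected. This is only a sketch; the values returned depend on the server configuration and version.

```sql
-- Word-length limits for MyISAM/Aria full-text indexes:
SHOW VARIABLES LIKE 'ft_min_word_len';
SHOW VARIABLES LIKE 'ft_max_word_len';

-- Token-size limits for InnoDB full-text indexes
-- (matches innodb_ft_min_token_size and innodb_ft_max_token_size):
SHOW VARIABLES LIKE 'innodb_ft_%_token_size';
```

These limits are applied when the index is built, so existing FULLTEXT indexes have to be rebuilt after a limit is changed before the new value affects search results.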

Relevance

MariaDB calculates a relevance for each result, based on a number of factors, including the number of words in the index, the number of unique words in a row, the total number of words in both the index and the result, and the weight of the word. In English, 'cool' will be weighted less than 'dandy', at least at present! The relevance can be returned as part of a query simply by using the MATCH function in the field list.

IN NATURAL LANGUAGE MODE

IN NATURAL LANGUAGE MODE is the default type of full-text search, and the keywords can be omitted. There are no special operators, and searches consist of one or more comma-separated keywords.
Searches are returned in descending order of relevance.

IN BOOLEAN MODE

Boolean search permits the use of a number of special operators:
Operator  Description
+         The word is mandatory in all rows returned.
-         The word cannot appear in any row returned.
<         The word that follows has a lower relevance than other words, although rows containing it will still match.
>         The word that follows has a higher relevance than other words.
()        Used to group words into subexpressions.
~         The word following contributes negatively to the relevance of the row (which is different to the '-' operator, which specifically excludes the word, or the '<' operator, which still causes the word to contribute positively to the relevance of the row).
*         The wildcard, indicating zero or more characters. It can only appear at the end of a word.
"         Anything enclosed in the double quotes is taken as a whole (so you can match phrases, for example).
Results are not returned in order of relevance, and the 50% limit does not apply. Stopwords and the minimum and maximum word lengths still apply as usual.
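Because boolean mode does not sort by relevance on its own, one common approach is to select the score and sort on it explicitly. The sketch below uses a hypothetical table ft_bool_demo; it is not part of the MariaDB documentation.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE ft_bool_demo (copy TEXT, FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft_bool_demo (copy) VALUES
  ('apple pie'), ('apple turnover'), ('apple strudel'), ('cherry strudel');

-- 'apple' is mandatory; 'turnover' raises and 'strudel' lowers the score.
-- The same MATCH ... AGAINST expression is used in the field list (to expose
-- the score) and in the WHERE clause (to filter), then sorted explicitly.
SELECT copy,
       MATCH(copy) AGAINST('+apple >turnover <strudel' IN BOOLEAN MODE) AS score
FROM ft_bool_demo
WHERE MATCH(copy) AGAINST('+apple >turnover <strudel' IN BOOLEAN MODE)
ORDER BY score DESC;
```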

WITH QUERY EXPANSION

A query expansion search is a modification of a natural language search. The search string is used to perform a regular natural language search. Then, words from the most relevant rows returned by the search are added to the search string and the search is done again. The query returns the rows from the second search. The IN NATURAL LANGUAGE MODE WITH QUERY EXPANSION or WITH QUERY EXPANSION modifier specifies a query expansion search. It can be useful when relying on implied knowledge within the data, for example that MariaDB is a database.

See Also

  • For simpler searches of a substring in text columns, see the LIKE operator.

Examples

Creating a table, and performing a basic search:
CREATE TABLE ft_myisam(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft_myisam(copy) VALUES ('Once upon a time'),
  ('There was a wicked witch'), ('Who ate everybody up');

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
+--------------------------+
| copy                     |
+--------------------------+
| There was a wicked witch |
+--------------------------+
Multiple words:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked,witch');
+---------------------------------+
| copy                            |
+---------------------------------+
| There was a wicked witch        |
+---------------------------------+
Since 'Once' is a stopword, no result is returned:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('Once');
Empty set (0.00 sec)
Inserting the word 'wicked' into more than half the rows excludes it from the results:
INSERT INTO ft_myisam(copy) VALUES ('Once upon a wicked time'),
  ('There was a wicked wicked witch'), ('Who ate everybody wicked up');

SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked');
Empty set (0.00 sec)
Using IN BOOLEAN MODE to overcome the 50% limitation:
SELECT * FROM ft_myisam WHERE MATCH(copy) AGAINST('wicked' IN BOOLEAN MODE);
+---------------------------------+
| copy                            |
+---------------------------------+
| There was a wicked witch        |
| Once upon a wicked time         |
| There was a wicked wicked witch |
| Who ate everybody wicked up     |
+---------------------------------+
Returning the relevance:
SELECT copy,MATCH(copy) AGAINST('witch') AS relevance 
  FROM ft_myisam WHERE MATCH(copy) AGAINST('witch');
+---------------------------------+--------------------+
| copy                            | relevance          |
+---------------------------------+--------------------+
| There was a wicked witch        | 0.6775632500648499 |
| There was a wicked wicked witch | 0.5031757950782776 |
+---------------------------------+--------------------+
WITH QUERY EXPANSION. In the following example, 'MariaDB' is always associated with the word 'database', so it is returned when query expansion is used, even though not explicitly requested.
CREATE TABLE ft2(copy TEXT,FULLTEXT(copy)) ENGINE=MyISAM;

INSERT INTO ft2(copy) VALUES
 ('MySQL vs MariaDB database'),
 ('Oracle vs MariaDB database'), 
 ('PostgreSQL vs MariaDB database'),
 ('MariaDB overview'),
 ('Foreign keys'),
 ('Primary keys'),
 ('Indexes'),
 ('Transactions'),
 ('Triggers');

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database');
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
+--------------------------------+
3 rows in set (0.00 sec)

SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('database' WITH QUERY EXPANSION);
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
| MariaDB overview               |
+--------------------------------+
4 rows in set (0.00 sec)
Partial word matching with IN BOOLEAN MODE:
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('Maria*' IN BOOLEAN MODE);
+--------------------------------+
| copy                           |
+--------------------------------+
| MySQL vs MariaDB database      |
| Oracle vs MariaDB database     |
| PostgreSQL vs MariaDB database |
| MariaDB overview               |
+--------------------------------+
Using boolean operators:
SELECT * FROM ft2 WHERE MATCH(copy) AGAINST('+MariaDB -database' 
  IN BOOLEAN MODE);
+------------------+
| copy             |
+------------------+
| MariaDB overview |
+------------------+

Boolean Full-Text Searches

MySQL can perform boolean full-text searches using the IN BOOLEAN MODE modifier. With this modifier, certain characters have special meaning at the beginning or end of words in the search string. In the following query, the + and - operators indicate that a word is required to be present or absent, respectively, for a match to occur. Thus, the query retrieves all the rows that contain the word MySQL but that do not contain the word YourSQL:
mysql> SELECT * FROM articles WHERE MATCH (title,body)
    -> AGAINST ('+MySQL -YourSQL' IN BOOLEAN MODE);
+----+-----------------------+-------------------------------------+
| id | title                 | body                                |
+----+-----------------------+-------------------------------------+
|  1 | MySQL Tutorial        | DBMS stands for DataBase ...        |
|  2 | How To Use MySQL Well | After you went through a ...        |
|  3 | Optimizing MySQL      | In this tutorial we will show ...   |
|  4 | 1001 MySQL Tricks     | 1. Never run mysqld as root. 2. ... |
|  6 | MySQL Security        | When configured properly, MySQL ... |
+----+-----------------------+-------------------------------------+
Note
In implementing this feature, MySQL uses what is sometimes referred to as implied Boolean logic, in which
  • + stands for AND
  • - stands for NOT
  • [no operator] implies OR
Boolean full-text searches have these characteristics:
  • They do not use the 50% threshold.
  • They do not automatically sort rows in order of decreasing relevance. You can see this from the preceding query result: The row with the highest relevance is the one that contains MySQL twice, but it is listed last, not first.
  • On MyISAM tables they can work even without a FULLTEXT index, although a search executed in this fashion would be quite slow.
  • The minimum and maximum word length full-text parameters apply.
  • The stopword list applies.
The boolean full-text search capability supports the following operators:
  • +
    A leading plus sign indicates that this word must be present in each row that is returned.
  • -
    A leading minus sign indicates that this word must not be present in any of the rows that are returned.
    Note: The - operator acts only to exclude rows that are otherwise matched by other search terms. Thus, a boolean-mode search that contains only terms preceded by - returns an empty result. It does not return all rows except those containing any of the excluded terms.
  • (no operator)
    By default (when neither + nor - is specified) the word is optional, but the rows that contain it are rated higher. This mimics the behavior of MATCH() ... AGAINST() without the IN BOOLEAN MODE modifier.
  • > <
    These two operators are used to change a word's contribution to the relevance value that is assigned to a row. The > operator increases the contribution and the < operator decreases it. See the example following this list.
  • ( )
    Parentheses group words into subexpressions. Parenthesized groups can be nested.
  • ~
    A leading tilde acts as a negation operator, causing the word's contribution to the row's relevance to be negative. This is useful for marking noise words. A row containing such a word is rated lower than others, but is not excluded altogether, as it would be with the - operator.
  • *
    The asterisk serves as the truncation (or wildcard) operator. Unlike the other operators, it should be appended to the word to be affected. Words match if they begin with the word preceding the * operator.
    If a word is specified with the truncation operator, it is not stripped from a boolean query, even if it is too short (as determined from the ft_min_word_len setting) or a stopword. This occurs because the word is not seen as too short or a stopword, but as a prefix that must be present in the document in the form of a word that begins with the prefix. Suppose that ft_min_word_len=4. Then a search for '+word +the*' will likely return fewer rows than a search for '+word +the':
    • The former query remains as is and requires both word and the* (a word starting with the) to be present in the document.
    • The latter query is transformed to +word (requiring only word to be present). the is both too short and a stopword, and either condition is enough to cause it to be ignored.
  • "
    A phrase that is enclosed within double quote (") characters matches only rows that contain the phrase literally, as it was typed. The full-text engine splits the phrase into words and performs a search in the FULLTEXT index for the words. Nonword characters need not be matched exactly: Phrase searching requires only that matches contain exactly the same words as the phrase and in the same order. For example, "test phrase" matches "test, phrase".
    If the phrase contains no words that are in the index, the result is empty. For example, if all words are either stopwords or shorter than the minimum length of indexed words, the result is empty.
The following examples demonstrate some search strings that use boolean full-text operators:
  • 'apple banana'
    Find rows that contain at least one of the two words.
  • '+apple +juice'
    Find rows that contain both words.
  • '+apple macintosh'
    Find rows that contain the word apple, but rank rows higher if they also contain macintosh.
  • '+apple -macintosh'
    Find rows that contain the word apple but not macintosh.
  • '+apple ~macintosh'
    Find rows that contain the word apple, but if the row also contains the word macintosh, rate it lower than if the row does not. This is softer than a search for '+apple -macintosh', for which the presence of macintosh causes the row not to be returned at all.
  • '+apple +(>turnover <strudel)'
    Find rows that contain the words apple and turnover, or apple and strudel (in any order), but rank apple turnover higher than apple strudel.
  • 'apple*'
    Find rows that contain words such as apple, apples, applesauce, or applet.
  • '"some words"'
    Find rows that contain the exact phrase some words (for example, rows that contain some words of wisdom but not some noise words). Note that the " characters that enclose the phrase are operator characters that delimit the phrase. They are not the quotation marks that enclose the search string itself.
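As a rough sketch, a few of these search strings can be placed into complete queries. The statements below assume the articles table (id, title, body) queried earlier in this section, with a FULLTEXT index on (title, body); the search terms themselves are chosen only for illustration.

```sql
-- Both words must be present:
SELECT id, title FROM articles
WHERE MATCH (title, body) AGAINST ('+MySQL +tutorial' IN BOOLEAN MODE);

-- Truncation: any word beginning with "optim":
SELECT id, title FROM articles
WHERE MATCH (title, body) AGAINST ('optim*' IN BOOLEAN MODE);

-- Exact phrase: the inner double quotes are the phrase operator,
-- the outer single quotes delimit the SQL string literal.
SELECT id, title FROM articles
WHERE MATCH (title, body) AGAINST ('"MySQL Security"' IN BOOLEAN MODE);

-- Required word with a down-weighted (but not excluded) word:
SELECT id, title FROM articles
WHERE MATCH (title, body) AGAINST ('+MySQL ~tricks' IN BOOLEAN MODE);
```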