MySQL FULLTEXT Searchingby Joe Stump
Have you ever wanted to search text stored in your database, but couldn't figure out how to do it efficiently? Are you lazy like me and don't enjoy maintaining reverse indexes, dictionaries, and word scores? You're in luck. The release of MySQL 4.0 has made searching text stored in databases available to the masses.
MySQL has had FULLTEXT
searching in one form or another since version
FULLTEXT indices in MySQL allow database
administrators and programmers to designate any character-based field
TEXT) as a
FULLTEXT index, which allows for complex text searching
against data stored in those fields.
This feature is not to be confused with the
function in MySQL.
LIKE works more along the lines of a
regular expression. On the other hand,
are fully indexed fields which support stopwords, boolean searches,
and relevancy ratings.
This article assumes you have a working installation of MySQL, a
good understanding of how MySQL works, and a basic understanding of
web programming (with PHP, Perl, or something similar). Further, this
article may not be of any interest to those who are already using
FULLTEXT indices in a production environment.
How it Works
The MySQL team has made it extremely easy to add
FULLTEXT searching to your tables. They are created much
PRIMARY KEY indices.
For the purpose of this article we are going to make a basic blog
table, put some data into it, and start searching. Before we get
ahead of ourselves we need to create some tables.
Creating the tables
-- The main blog table with our
FULLTEXTindex -- Nothing extreme here, but you get the idea CREATE TABLE blog_entries ( entryID INT(9) UNSIGNED NOT NULL DEFAULT '0' AUTO_INCREMENT, posted INT(11) UNSIGNED NOT NULL DEFAULT '0', categoryID TINYINT(2) UNSIGNED NOT NULL DEFAULT '0', title CHAR(255) NOT NULL DEFAULT '', entry TEXT NOT NULL DEFAULT '', PRIMARY KEY (entryID), KEY (posted), KEY (categoryID), FULLTEXT(title,entry) ); -- A category table so you can organize your posts and -- later do some fun searching based on such data. CREATE TABLE blog_categories ( categoryID TINYINT(2) UNSIGNED NOT NULL DEFAULT '0' AUTO_INCREMENT, name CHAR(75) NOT NULL DEFAULT '', PRIMARY KEY (categoryID) ); -- Add some data into your blog INSERT INTO blog_categories VALUES (1,'Personal'); INSERT INTO blog_categories VALUES (2,'Work'); INSERT INTO blog_categories VALUES (3,'Editorials'); INSERT INTO blog_entries VALUES (1, 1050942000, 1, 'About miester', 'I was born in michigan in 1980 in a small town called Adrian. My mother is named Sue, while my father is named Mike. They currently live in a small town called East Jordan. On April 27th, 2003 I will graduate from Eastern Michigan University with a degree in Computer Information Systems.'); INSERT INTO blog_entries VALUES (2, 1050942000, 2, 'Today at work', 'While I was at work today I was having some problems with the RAID array. It seems that we have a rogue cron script that is causing problems. When I find out more info I will post it here.'); INSERT INTO blog_entries VALUES (3, 1050942000, 1, 'Vacation!', 'After I graduate I am taking a 2 week vacation. On my agenda is a trip to Washington DC to see my girlfriend\'s sister as well as throwing a few discs at the local disc golf course.'); INSERT INTO blog_entries VALUES (4, 1050942000, 1, 'Vacation!', 'I have had a horrible cold for the last few days. Today I drank a revive vitamin water with 150% of my daily dose of vitamin C. That should help things out.');
Querying the Data
Now that we have data in our tables we can begin to query it. There
are some restrictions to
FULLTEXT searching, which are
covered below. You will want to read over the restrictions before you
FULLTEXT indices in a production environment. For now
we are going to do a simple query for the word
mysql> SELECT entryID,title -> FROM blog_entries -> WHERE MATCH (title,entry) AGAINST('mother'); +---------+---------------+ | entryID | title | +---------+---------------+ | 1 | About miester | +---------+---------------+ 1 row in set (0.00 sec)
There are a few things to note when querying
indices. First, MySQL automatically orders the results by their
relevancy rating. Second, queries that are longer than 20 characters
will not be sorted. Third, and most importantly, the fields in the
MATCH() should be identical to the fields listed in the
All other MySQL syntax works as you'd expect with a
FULLTEXT search, meaning you can further limit your
search terms. We could search blog entries based on posting date or
category. If you let your imagination wander you can think of all
sorts of ways to filter your data. Let's look for blog entries that
only appear in the
Personal category and match the term
mysql> SELECT E.entryID, E.title, C.name -> FROM blog_entries AS E, blog_categories AS C -> WHERE E.categoryID=C.categoryID AND -> MATCH (E.title, E.entry) AGAINST ('michigan') AND -> E.categoryID=1; +---------+---------------+----------+ | entryID | title | name | +---------+---------------+----------+ | 1 | About miester | Personal | +---------+---------------+----------+ 1 row in set (0.00 sec)
Note that we not only did a join but also filtered the results
based on the category. Another thing to note is that
FULLLTEXT indices are not case sensitive. If you would
like to use MySQL's relevancy rating in your code you can add the
MATCH() ... AGAINST() clause to your
statement as well.
mysql> SELECT E.entryID, E.title, C.name, -> MATCH (E.title, E.entry) AGAINST ('michigan') AS score -> FROM blog_entries AS E, blog_categories AS C -> WHERE E.categoryID=C.categoryID AND -> MATCH (E.title, E.entry) AGAINST ('michigan') AND -> E.categoryID=1; +---------+---------------+----------+-----------------+ | entryID | title | name | score | +---------+---------------+----------+-----------------+ | 1 | About miester | Personal | 1.2635315656662 | +---------+---------------+----------+-----------------+ 1 row in set (0.00 sec)
Probably the most anticipated feature in MySQL 4.0's
FULLTEXT is the ability to do boolean searches without
having to process the query strings. This means you can add
-s to your queries, along with a host
of other commands, and MySQL will interpret them for you.
mysql> SELECT E.entryID,E.title,C.name -> FROM blog_entries AS E, blog_categories AS C -> WHERE E.categoryID=C.categoryID AND -> MATCH (E.title,E.entry) AGAINST ('+vacation -washington' IN BOOLEAN MODE) AND -> E.categoryID=1; +---------+-----------+----------+ | entryID | title | name | +---------+-----------+----------+ | 4 | Vacation! | Personal | +---------+-----------+----------+ 1 row in set (0.00 sec)
We have two entries with the word
vacation in the
title, but since we removed
washington, entryID 4 does
not show up in the result. You can read all about
MODE on MySQL's
A few restrictions affect MySQL
indices. Some of the default behaviors of these restrictions can be
changed in your my.cnf or using the
FULLTEXTindices are NOT supported in InnoDB tables.
- MySQL requires that you have at least three rows of data in your result set before it will return any results.
- By default, if a search term appears in more than 50% of the rows then MySQL will not return any results.
- By default, your search query must be at least four characters long and may not exceed 254 characters.
- MySQL has a default
stopwordsfile that has a list of common words (i.e.,
has) which are not returned in your search. In other words, searching for
thewill return zero rows.
- According to MySQL's manual, the argument to
AGAINST()must be a constant string. In other words, you cannot search for values returned within the query.
FULLTEXT's Default Behavior
There are several ways to alter the default behavior of
FULLTEXT. MySQL has some tips for
fine tuning the
FULLTEXT search, but the details are
a little sparse. The most common problem is the four character minimum
word length on queries. Before we go over that, let's review the
variables associated with the
mysql> SHOW VARIABLES LIKE 'ft%'; +--------------------------+----------------+ | Variable_name | Value | +--------------------------+----------------+ | ft_boolean_syntax | + -><()~*:""&| | | ft_min_word_len | 2 | | ft_max_word_len | 254 | | ft_max_word_len_for_sort | 20 | | ft_stopword_file | (built-in) | +--------------------------+----------------+ 5 rows in set (0.00 sec)
The variable we wish to change is
ft_min_word_len. According to the manual we should be
able to change this via the
SET VARIABLE command, but, in
reality, this does not work. After asking the mailing list about this
problem I was told this had to be specified as a startup option. To
change the minimum query string to three characters, start the MySQL
server as follows.
$ cd /path/to/mysql $ ./bin/safe_mysqld --user=mysql -O ft_min_word_len=3 &
After you have restarted your MySQL server (don't forget to change
your startup scripts) you have to rebuild the
indices. The manual suggests the following command:
-- Replace tbl_name with the name of your table mysql> REPAIR TABLE tbl_name QUICK;
After you have rebuilt your indices, you should be able to search with query strings of three or more characters instead of the default four character limit.
Having fun with MySQL
I'll save the implementation details for a later article, but here
are some interesting ways in which you could use MySQL
FULLTEXT searching to finding data on your website more
A great way to add cross referencing to articles would be to store
a query (i.e.,
linux for a post about Debian GNU/Linux)
with each article. If an article had a query, PHP could then perform
FULLTEXT search, returning those results as "Related
Articles". Furthermore, you could use PHP to create an advanced
search script that allowed users to search the database based on
category, criteria, pricing, etc.
ispell is a great tool for open source developers to
use to make their applications more user friendly. By adding
ispell to your search engine, you can check the spelling
of each query, offering alternate queries if the query was spelled
wrong. Everyone's favorite search engine does something similar.
Looking towards the Future
According to the manual, the MySQL team has a lot they
still want to implement into
FULLTEXT searching. Here
is a brief overview of those enhancements:
- Proximity operators
- Making the stopword lists depend on the language of the data
The proximity operators will really make
searching impressive. This will allow you to do searches on words
based on how close together they are. For example, if you currently
searched for 'mysql search' you would get results even where
search appear at opposite ends of
the document. With proximity operators, the scoring algorithm gauges
how close together the words are. Documents where
search appeared directly next to one another would
score higher than documents where they were not close together.
a great way to make search engines smarter. This would allow MySQL to
search for words that share the same lexical root. For example,
running would return documents with
run as well as
As you can see, the possibilities of
almost endless. For those of us looking for an easy and powerful
solution to our searching woes, MySQL has come up with an answer.
Joe Stump is the Lead Architect for Digg where he spends his time partitioning data, creating internal services, and ensuring the code frameworks are in working order.
Return to ONLamp.com.