Building a Simple Search Engine with PHP

by Daniel Solin
10/24/2002

A little while ago, I was working on an intranet site for a mid-sized company. As the site grew in both size and popularity, the client asked me to extend it with a search feature. Since one of the rules of the intranet was that all application code had to be written in-house, using an existing open source engine was not an option.

Within a day, the engine was essentially complete, and the result turned out better than expected. With PHP, MySQL, and a few simple techniques, small projects like this are quite easy. This article presents a cut-down version of that search engine. I hope it encourages you to develop an engine that suits your particular needs, with exactly the features you want.

Database Design and Logic

We'll use MySQL as a database backend to store our search data. It's possible to shell out to Unix commands such as grep and find, but that would mean running the search engine on the machine hosting the files. As well, it would be more difficult to index pages served from a database. We'll tackle the database first.

The database for the search engine consists of three tables: page, word, and occurrence. page holds all indexed web pages, and word holds all of the words found on the indexed pages. The rows in occurrence correlate words with the pages they appear on; each row represents one occurrence of one particular word on one particular page. The SQL for creating these tables is shown below.

CREATE TABLE page (
   page_id int(10) unsigned NOT NULL auto_increment,
   page_url varchar(200) NOT NULL default '',
   PRIMARY KEY (page_id)
) TYPE=MyISAM;

CREATE TABLE word (
   word_id int(10) unsigned NOT NULL auto_increment,
   word_word varchar(50) NOT NULL default '',
   PRIMARY KEY (word_id)
) TYPE=MyISAM;

CREATE TABLE occurrence (
   occurrence_id int(10) unsigned NOT NULL auto_increment,
   word_id int(10) unsigned NOT NULL default '0',
   page_id int(10) unsigned NOT NULL default '0',
   PRIMARY KEY (occurrence_id)
) TYPE=MyISAM;

While page and word hold actual data, occurrence acts only as a reference table. By joining occurrence with page and word, we can determine which pages contain a given word, as well as how many times the word occurs on each of them.
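To give a concrete idea of that join, here is the kind of query the engine will run (essentially the same query appears later in search.php). It ranks pages by how often they mention a given word, here 'linux':

SELECT p.page_url, COUNT(*) AS occurrences
FROM page p, word w, occurrence o
WHERE p.page_id = o.page_id
  AND w.word_id = o.word_id
  AND w.word_word = 'linux'
GROUP BY p.page_id
ORDER BY occurrences DESC;

Before we can run anything like that, though, we need some data.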

Populating the Database

Okay, the database is created and we're ready to feed it some content. For this, we'll create a PHP script that takes a user-specified URL, reads the document representing the URL, extracts words from the page, and creates records in the database. Here's the program:

<?php
/*
* populate.php
*
* Script for populating the search database with words,
* pages, and word occurrences.
*/

/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
    or die("ERROR: Could not connect to database!");

mysql_select_db("test");

/* Define the URL that should be processed: */

$url = isset($_GET['url']) ? addslashes( $_GET['url'] ) : '';

if( !$url )
{
   die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
   $url = "http://$url";
}

/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);

if( $row )
{
   /* If yes, use the old page_id: */
   $page_id = $row['page_id'];
}
else
{
   /* If not, create one: */
   mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
   $page_id = mysql_insert_id();
}

/* Start parsing through the text, and build an index in the database: */
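/* (Note: reading a remote URL with fopen() requires allow_url_fopen
 *  to be enabled in php.ini.) */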
if( !($fd = fopen($url,"r")) )
   die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML tags and entities: */
   $buf = strip_tags($buf);
   $buf = preg_replace('/&\w+;/', ' ', $buf);

   /* Extract all words from the current line: */
   preg_match_all("/\b\w+\b/", $buf, $words);

   /* Loop through all matched words and insert them into the database: */
   foreach( $words[0] as $word )
   {
      /* Does the current word already have a record in the word-table? */
      $cur_word = addslashes( strtolower($word) );

      $result = mysql_query("SELECT word_id FROM word
                             WHERE word_word = '$cur_word'");
      $row = mysql_fetch_array($result);
      if( $row )
      {
         /* If yes, use the old word_id: */
         $word_id = $row['word_id'];
      }
      else
      {
         /* If not, create one: */
         mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
         $word_id = mysql_insert_id();
      }

      /* And finally, register the occurrence of the word: */
      mysql_query("INSERT INTO occurrence (word_id,page_id)
                   VALUES ($word_id,$page_id)");
      print "Indexing: $cur_word<br>";
   }
}

fclose($fd);

?>

Note the use of addslashes() when assigning values to $url and $cur_word. Since these variables are used within SQL queries, it's very important to make sure they're free of harmful data. Imagine if someone called this script with a URL of "; DROP TABLE foo;". Because addslashes() escapes the quote characters MySQL treats as special, someone of malicious bent would have to work much harder to exploit the database.
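As a quick illustration (this snippet is not part of the engine, just a demonstration of addslashes()), the hostile value above comes out with its quotes escaped, so MySQL treats it as plain text rather than as the end of a string:

<?php
/* Demonstration only: show what addslashes() does to a hostile value. */
$url = '"; DROP TABLE foo;';
print addslashes($url);   /* prints: \"; DROP TABLE foo; */
?>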

This script connects to the database, registers the page if it is not already in the page table, and then starts retrieving data. The preg_match_all() call extracts all words matching the regular expression from each line of the page. Each word must then be recorded in the word and occurrence tables.

When building the index, only three SQL INSERT statements actually matter. When a page is first indexed, it must be recorded:

INSERT INTO page (page_url) VALUES ("http://www.onlamp.com/");

The first occurrence of a word within the entire dataset must be recorded:

INSERT INTO word (word_word) VALUES ("linux");

Each occurrence of a word within a page must be recorded:

INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);

Let's now index a few pages. The seven sites that make up the O'Reilly Network are a good starting point. We'll call populate.php with each site's URL as the only argument, one at a time:

http://localhost/populate.php?url=http://www.macdevcenter.com/
http://localhost/populate.php?url=http://www.onjava.com/
http://localhost/populate.php?url=http://www.onlamp.com/
http://localhost/populate.php?url=http://www.openp2p.com/
http://localhost/populate.php?url=http://www.osdir.com/
http://localhost/populate.php?url=http://www.perl.com/
http://localhost/populate.php?url=http://www.xml.com/
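
If you'd rather not paste each of those URLs into a browser by hand, a throwaway script along these lines does the same job. This is only a sketch: the index_all.php name and the localhost path are assumptions, and it relies on allow_url_fopen being enabled.

<?php
/* index_all.php - hit populate.php once per site (sketch only). */
$sites = array(
   "http://www.macdevcenter.com/",
   "http://www.onjava.com/",
   "http://www.onlamp.com/",
   "http://www.openp2p.com/",
   "http://www.osdir.com/",
   "http://www.perl.com/",
   "http://www.xml.com/"
);

foreach( $sites as $site )
{
   /* Each request makes populate.php fetch and index one site: */
   print file_get_contents("http://localhost/populate.php?url=" . urlencode($site));
}
?>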

A quick investigation of the tables now should result in something like this:


mysql> SELECT * FROM page;
+---------+------------------------------+
| page_id | page_url                     |
+---------+------------------------------+
| 1       | http://www.macdevcenter.com/ |
| 2       | http://www.onjava.com/       |
| 3       | http://www.onlamp.com/       |
| 4       | http://www.openp2p.com/      |
| 5       | http://www.osdir.com/        |
| 6       | http://www.perl.com/         |
| 7       | http://www.xml.com/          |
+---------+------------------------------+
7 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM word;
+----------+
| COUNT(*) |
+----------+
| 2423     |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM occurrence;
+----------+
| COUNT(*) |
+----------+
| 20628    |
+----------+
1 row in set (0.00 sec)
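
Out of curiosity, a quick query (again, just a sanity check, not part of the engine itself) shows which words turned up most often across the seven pages:

SELECT w.word_word, COUNT(*) AS occurrences
FROM word w, occurrence o
WHERE w.word_id = o.word_id
GROUP BY w.word_id
ORDER BY occurrences DESC
LIMIT 10;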

The database now has seven indexed pages, resulting in 2,423 unique words and 20,628 occurrences. We're ready for the next level, the search interface.

The Search Interface

Of course, users will not be able to work with the MySQL database directly. Therefore, we'll create another PHP script that provides an HTML form to query the database. This works just like any other search engine. The user enters a word in a textbox, hits Enter, and receives a page of results linked to the appropriate pages. The result order depends on the number of times a keyword appears in each document. The search.php script is listed below.

<?php

/*
* search.php
*
* Script for searching a database populated with keywords by the
* populate.php script.
*/

print "<html><head><title>My Search Engine</title></head><body>\n";

if( !empty($_POST['keyword']) )
{
   /* Connect to the database: */
   mysql_pconnect("localhost","root","secret")
       or die("ERROR: Could not connect to database!");
   mysql_select_db("test");

   /* Get timestamp before executing the query: */
   $start_time = getmicrotime();

   /* Set $keyword and $results; addslashes() and an integer cast
    *  minimize the risk of executing unwanted SQL commands: */
   $keyword = addslashes( $_POST['keyword'] );
   $results = (int) $_POST['results'];
   if( $results < 1 )
      $results = 10;

   /* Execute the query that performs the actual search in the DB: */
   $result = mysql_query(" SELECT p.page_url AS url,
                           COUNT(*) AS occurrences 
                           FROM page p, word w, occurrence o
                           WHERE p.page_id = o.page_id AND
                           w.word_id = o.word_id AND
                           w.word_word = \"$keyword\"
                           GROUP BY p.page_id
                           ORDER BY occurrences DESC
                           LIMIT $results" );

   /* Get timestamp when the query is finished: */
   $end_time = getmicrotime();

   /* Present the search-results: */
   print "<h2>Search results for '".$_POST['keyword']."':</h2>\n";
   for( $i = 1; $row = mysql_fetch_array($result); $i++ )
   {
      print "$i. <a href='".$row['url']."'>".$row['url']."</a>\n";
      print "(occurrences: ".$row['occurrences'].")<br><br>\n";
   }

   /* Present how long it took to execute the query: */
   print "query executed in ".(substr($end_time-$start_time,0,5))." seconds.";
}
else
{
   /* If no keyword is defined, present the search page instead: */
   print "<form method='post'> Keyword: 
          <input type='text' size='20' name='keyword'>\n";
   print "Results: <select name='results'><option value='5'>5</option>\n";
   print "<option value='10'>10</option><option value='15'>15</option>\n";
   print "<option value='20'>20</option></select>\n";

   print "<input type='submit' value='Search'></form>\n";
}

print "</body></html>\n";

/* Simple function for retrieving the current timestamp in microseconds: */
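/* (PHP 5 and later can do the same in one step with microtime(true).) */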
function getmicrotime()
{
   list($usec, $sec) = explode(" ",microtime());
   return ((float)$usec + (float)$sec);
}

?>

The script may be called with or without a keyword. If one is submitted, the script searches the database for that word and also shows how long the query took to execute. Otherwise, it presents the search form instead. That page will resemble Figure 1.


Figure 1 - our simple search page

Let's search on the keyword linux. Our dataset produces results similar to Figure 2.


Figure 2 - the search results page

As expected, onlamp.com appears first on the results page, because the keyword linux occurs more often there than on the other sites. A search for java would probably put onjava.com at the top, and xml would most likely generate the most hits for xml.com. Also note that we've limited the output to the five most relevant pages.

Speeding Up the Database

As the bottom of the results page shows, the query took 0.393 seconds to execute. While this may not seem like an incredibly long time, it does represent quite a hit as the database grows. Fortunately, since we're using a database, there's a very simple solution.

CREATE INDEX word_word_ix ON word (word_word);

This will create an index in the word table on the word_word column. Since all of our searches start with this column, the database will find the appropriate pages much more quickly. To prove this point, we will search for the keyword linux again, to see if we gained any performance. See Figure 3.


Figure 3 - searching with an index

Nice. It took 0.028 seconds, a saving of 0.365 seconds per query, or roughly a 14-fold (1,400 percent) speedup. If this engine handled an average of 1,000 queries per hour, that works out to 0.365 s × 1,000 queries × 24 hours ≈ 8,760 seconds, or about 146 minutes of database time saved per day.
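You can also ask MySQL directly whether the new index is being used by prefixing the search query with EXPLAIN:

EXPLAIN SELECT p.page_url, COUNT(*) AS occurrences
FROM page p, word w, occurrence o
WHERE p.page_id = o.page_id
  AND w.word_id = o.word_id
  AND w.word_word = 'linux'
GROUP BY p.page_id
ORDER BY occurrences DESC;

The row for the word table should now list word_word_ix under possible_keys and key, instead of showing a full table scan.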

Summary

As shown in this article, a useful search engine can be built quite simply. Without much hassle, you could develop this concept further to handle multiple keywords, boolean operators, stop words, and other features found in many commercial search facilities. It would also be interesting to populate the database with a few hundred megabytes of data. Would the speed still be reasonable? Probably. One thing we can be sure of, however, is that for the intranet of a mid-sized company with just a few dozen searches per hour, this solution offers stunning performance with minimal setup.
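As one example of where you could take it, here is a sketch (not part of the original engine, but written against the same three tables) of a query that requires two keywords to both appear on a page, still ranking by total occurrences:

SELECT p.page_url, COUNT(*) AS occurrences
FROM page p, word w, occurrence o
WHERE p.page_id = o.page_id
  AND w.word_id = o.word_id
  AND w.word_word IN ('linux', 'apache')
GROUP BY p.page_id
HAVING COUNT(DISTINCT w.word_word) = 2
ORDER BY occurrences DESC;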

Whether you're planning to develop a large-scale commercial search engine or just playing around, http://www.robotstxt.org/wc/robots.html offers plenty of helpful and interesting reading on the topic. For example, it describes the standardized robots.txt file, which every Internet spider should consult to determine what it may and may not do on a given site. Please read and follow the rules if you don't control the sites you want to search.

I wish you good luck and look forward to getting a visit from your spider soon. :)

Daniel Solin is a freelance writer and Linux consultant whose specialty is GUI programming. His first book, SAMS Teach Yourself Qt Programming in 24 hours, was published in May, 2000.

