PHP DevCenter

Building a Simple Search Engine with PHP

The Search Interface



Of course, users will not be able to work with the MySQL database directly. Therefore, we'll create another PHP script that provides an HTML form to query the database. This works just like any other search engine. The user enters a word in a textbox, hits Enter, and receives a page of results linked to the appropriate pages. The result order depends on the number of times a keyword appears in each document. The search.php script is listed below.

<?php

/*
* search.php
*
* Script for searching a database populated with keywords by the
* populate.php script.
*/

print "<html><head><title>My Search Engine</title></head><body>\n";

if( !empty($_POST['keyword']) )
{
   /* Connect to the database: */
   mysql_pconnect("localhost","root","secret")
       or die("ERROR: Could not connect to database!");
   mysql_select_db("test");

   /* Get timestamp before executing the query: */
   $start_time = getmicrotime();

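   /* Set $keyword and $results. Escape the keyword for use inside the
    *  quoted SQL string, and force the result count to an integer,
    *  since the unquoted LIMIT value cannot be protected by escaping: */
   $keyword = mysql_real_escape_string( $_POST['keyword'] );
   $results = isset($_POST['results']) ? max( 1, (int) $_POST['results'] ) : 5;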

   /* Execute the query that performs the actual search in the DB: */
   $result = mysql_query(" SELECT p.page_url AS url,
                           COUNT(*) AS occurrences 
                           FROM page p, word w, occurrence o
                           WHERE p.page_id = o.page_id AND
                           w.word_id = o.word_id AND
                           w.word_word = \"$keyword\"
                           GROUP BY p.page_id
                           ORDER BY occurrences DESC
                           LIMIT $results" );

   /* Get timestamp when the query is finished: */
   $end_time = getmicrotime();

   /* Present the search results, escaping everything that is echoed
    * back into the HTML page: */
   print "<h2>Search results for '".htmlspecialchars($_POST['keyword'], ENT_QUOTES)."':</h2>\n";
   for( $i = 1; $row = mysql_fetch_array($result); $i++ )
   {
      $url = htmlspecialchars($row['url'], ENT_QUOTES);
      print "$i. <a href='$url'>$url</a>\n";
      print "(occurrences: ".$row['occurrences'].")<br><br>\n";
   }

   /* Present how long it took to execute the query: */
   print "query executed in ".number_format($end_time-$start_time, 3)." seconds.";
}
else
{
   /* If no keyword is defined, present the search page instead: */
   print "<form method='post'> Keyword: 
          <input type='text' size='20' name='keyword'>\n";
   print "Results: <select name='results'><option value='5'>5</option>\n";
   print "<option value='10'>10</option><option value='15'>15</option>\n";
   print "<option value='20'>20</option></select>\n";

   print "<input type='submit' value='Search'></form>\n";
}

print "</body></html>\n";

/* Simple function for retrieving the current timestamp in seconds,
 * with microsecond precision: */
function getmicrotime()
{
   return microtime(true);
}

?>
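
The listing above relies on the mysql_* functions, which were removed in PHP 7.0, and it builds the SQL string by hand. As a minimal sketch of an alternative, assuming the same page, word, and occurrence tables that the query joins, the search could go through PDO with bound parameters. The function name search_pages is hypothetical and not part of the original script.

```php
<?php
/*
 * Hypothetical helper (not part of the original search.php): runs the
 * same keyword query through PDO with bound parameters.
 */
function search_pages(PDO $db, string $keyword, int $limit = 5): array
{
    $stmt = $db->prepare(
        "SELECT p.page_url AS url, COUNT(*) AS occurrences
           FROM page p
           JOIN occurrence o ON o.page_id = p.page_id
           JOIN word w       ON w.word_id = o.word_id
          WHERE w.word_word = :keyword
          GROUP BY p.page_id, p.page_url
          ORDER BY occurrences DESC
          LIMIT :limit"
    );
    /* The keyword travels as data, never as SQL text: */
    $stmt->bindValue(':keyword', $keyword, PDO::PARAM_STR);
    /* LIMIT needs a real integer, so bind it as one: */
    $stmt->bindValue(':limit', max(1, $limit), PDO::PARAM_INT);
    $stmt->execute();
    /* Each row is an array with the keys 'url' and 'occurrences': */
    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

Assuming a connection such as new PDO("mysql:host=localhost;dbname=test", "root", "secret"), a call like search_pages($db, "linux", 5) returns rows that can be printed with the same loop as in the listing.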

The script may be called with or without the keyword argument. If it's defined, the script searches for that word in the database. It will also show the length of time it took to process the query. Otherwise, the script presents the search page instead. That page will resemble Figure 1.


Figure 1 - our simple search page

Let's search on the keyword linux. Our dataset produces results similar to Figure 2.


Figure 2 - the search results page

As expected, onlamp.com appears first on the result page because the keyword linux appears more frequently on this site than on the others. A search for java would probably put onjava.com at the top, and xml would most likely generate the most hits for xml.com. Also note that we've limited the results to the five highest-ranked pages.
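
The ranking comes from the COUNT(*), GROUP BY, ORDER BY, and LIMIT clauses of the query. A rough plain-PHP equivalent may make the logic easier to follow. The function rank_pages and the sample data are made up for illustration; $hits simply holds one page name per keyword occurrence.

```php
<?php
/*
 * Hypothetical illustration only: $hits holds one page name per
 * keyword occurrence, much like the occurrence table does for a word.
 */
function rank_pages(array $hits, int $limit = 5): array
{
    /* COUNT(*) ... GROUP BY page: occurrences per page */
    $counts = array_count_values($hits);
    /* ORDER BY occurrences DESC (arsort keeps the page keys) */
    arsort($counts);
    /* LIMIT $limit */
    return array_slice($counts, 0, $limit, true);
}

/* Made-up sample data, not the article's real dataset: */
$hits = array("onlamp.com", "xml.com", "onlamp.com",
              "onjava.com", "onlamp.com", "xml.com");

foreach (rank_pages($hits, 2) as $url => $occurrences) {
    print "$url (occurrences: $occurrences)\n";
}
/* prints:
 * onlamp.com (occurrences: 3)
 * xml.com (occurrences: 2)
 */
```

In the real script the database does this work, which is why the query only has to return the finished, sorted rows.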

Speeding Up the Database

As the bottom of the results page shows, the query took 0.393 seconds to execute. While this may not seem like a long time, the cost adds up as the database and the number of queries grow. Fortunately, since we're using a database, there's a very simple solution: add an index.

CREATE INDEX word_word_ix ON word (word_word);

This will create an index in the word table on the word_word column. Since all of our searches start with this column, the database will find the appropriate pages much more quickly. To prove this point, we will search for the keyword linux again, to see if we gained any performance. See Figure 3.


Figure 3 - searching with an index

Nice. It took 0.028 seconds, which is 0.365 seconds less than before and makes the query about 14 times faster. If this engine handled an average of 1,000 queries per hour, this would mean a savings of about 146 minutes of query time per day.
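
The arithmetic behind these figures can be checked in a few lines. This sketch uses only the two timings and the load quoted in the article; the variable names are made up, and the load is assumed to be constant around the clock.

```php
<?php
/* Timings quoted in the article (seconds per query): */
$before = 0.393;
$after  = 0.028;
/* Assumed load from the article: 1,000 queries per hour, 24 hours. */
$queries_per_day = 1000 * 24;

$saved_per_query = $before - $after;
$speedup         = $before / $after;
$saved_minutes   = $saved_per_query * $queries_per_day / 60;

printf("saved per query: %.3f s\n", $saved_per_query);    /* 0.365 s */
printf("speedup: %.1fx\n", $speedup);                     /* 14.0x */
printf("saved per day: %.0f minutes\n", $saved_minutes);  /* 146 minutes */
```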

Summary

As shown in this article, useful search engines can be built pretty simply. Without much hassle, you could develop this concept further to handle multiple keywords, boolean operators, stop words, and other features you find in many commercial search facilities. It would also be interesting to populate the database further with a few hundred megabytes of data. Would the speed still be reasonable? Probably. One thing we can be sure of, however, is that for an intranet of a mid-sized company with just a few dozen searches per hour, this solution can offer very good performance with minimal setup.
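
As a starting point for multiple keywords and stop words, the query string could be split into individual words before any of them reaches the database. The following is only a sketch: parse_query is a hypothetical helper, the stop-word list is a made-up sample, and lowercasing assumes the words were also stored in lowercase when the database was populated.

```php
<?php
/*
 * Hypothetical helper: split a query string into lowercase keywords
 * and drop stop words. The default stop-word list is a made-up sample.
 */
function parse_query(string $query, array $stop_words = array("a", "an", "and", "of", "the")): array
{
    /* Split on anything that is not a letter or digit: */
    $words = preg_split('/[^a-z0-9]+/', strtolower($query), -1, PREG_SPLIT_NO_EMPTY);
    /* Remove stop words and duplicates, then reindex from 0: */
    return array_values(array_unique(array_diff($words, $stop_words)));
}

print implode(", ", parse_query("The history of Linux and the Linux kernel")) . "\n";
/* prints: history, linux, kernel */
```

Each remaining keyword could then be run through the single-word query shown earlier, with the per-page counts added together.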

Whether you're planning to develop a large-scale commercial search engine, or are just playing around, http://www.robotstxt.org/wc/robots.html offers lots of helpful and interesting reading on this topic. For example, it describes the use of the standardized robots.txt file, which every Internet spider should use to determine what it can and can't do on a specific site. Please read and follow the rules if you don't control the sites you want to search.
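
To give an idea of what following robots.txt involves, here is a deliberately simplified sketch. The function is_path_allowed is hypothetical; it honours only the Disallow lines addressed to all robots (User-agent: *) and treats them as plain path prefixes. Allow lines, wildcards, and groups that name a specific spider are ignored, so a real spider would need a more complete parser.

```php
<?php
/*
 * Hypothetical, simplified check: honours only the "Disallow" lines
 * that apply to "User-agent: *" and compares them as path prefixes.
 */
function is_path_allowed(string $robots_txt, string $path): bool
{
    $applies = false;          /* inside a group addressed to "*"? */
    $prev_was_agent = false;   /* was the previous field User-agent? */
    foreach (preg_split('/\r\n|\r|\n/', $robots_txt) as $line) {
        /* Strip comments and surrounding whitespace: */
        $line = trim(preg_replace('/#.*/', '', $line));
        if ($line === '' || strpos($line, ':') === false) {
            continue;
        }
        list($field, $value) = array_map('trim', explode(':', $line, 2));
        $field = strtolower($field);
        if ($field === 'user-agent') {
            /* Consecutive User-agent lines share one group: */
            $applies = ($prev_was_agent && $applies) || $value === '*';
            $prev_was_agent = true;
            continue;
        }
        $prev_was_agent = false;
        /* An empty Disallow value means nothing is forbidden: */
        if ($field === 'disallow' && $applies && $value !== ''
            && strpos($path, $value) === 0) {
            return false;
        }
    }
    return true;
}

/* Made-up robots.txt content for illustration: */
$robots = "User-agent: *\nDisallow: /private/ # keep out\n";
var_dump(is_path_allowed($robots, "/index.html"));        /* bool(true) */
var_dump(is_path_allowed($robots, "/private/data.html")); /* bool(false) */
```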

I wish you good luck and look forward to getting a visit from your spider soon. :)

Daniel Solin is a freelance writer and Linux consultant whose specialty is GUI programming. His first book, SAMS Teach Yourself Qt Programming in 24 hours, was published in May, 2000.

