PHP DevCenter
oreilly.comSafari Books Online.Conferences.

advertisement


Building a Simple Search Engine with PHP
Pages: 1, 2, 3

Populating The Database



Okay, the database is created and we're ready to feed it some content. For this, we'll create a PHP script that takes a user-specified URL, reads the document representing the URL, extracts words from the page, and creates records in the database. Here's the program:

<?
/*
* populate.php
*
* Script for populating the search database with words,
* pages and word-occurences.
*/

/* Connect to the database: */
mysql_pconnect("localhost","root","secret")
    or die("ERROR: Could not connect to database!");

mysql_select_db("test");

/* Define the URL that should be processed: */

$url = addslashes( $_GET['url'] );

if( !$url )
{
   die( "You need to define a URL to process." );
}
else if( substr($url,0,7) != "http://" )
{
   $url = "http://$url";
}

/* Does this URL already have a record in the page-table? */
$result = mysql_query("SELECT page_id FROM page WHERE page_url = \"$url\"");
$row = mysql_fetch_array($result);

if( $row['page_id'] )
{
   /* If yes, use the old page_id: */
   $page_id = $row['page_id'];
}
else
{
   /* If not, create one: */
   mysql_query("INSERT INTO page (page_url) VALUES (\"$url\")");
   $page_id = mysql_insert_id();
}

/* Start parsing through the text, and build an index in the database: */
if( !($fd = fopen($url,"r")) )
   die( "Could not open URL!" );

while( $buf = fgets($fd,1024) )
{
   /* Remove whitespace from beginning and end of string: */
   $buf = trim($buf);

   /* Try to remove all HTML-tags: */
   $buf = strip_tags($buf);
   $buf = ereg_replace('/&\w;/', '', $buf);

   /* Extract all words matching the regexp from the current line: */
   preg_match_all("/(\b[\w+]+\b)/",$buf,$words);

   /* Loop through all words/occurrences and insert them into the database: */
   for( $i = 0; $words[$i]; $i++ )
   {
      for( $j = 0; $words[$i][$j]; $j++ )
      {
         /* Does the current word already have a record in the word-table? */
         $cur_word = addslashes( strtolower($words[$i][$j]) );

         $result = mysql_query("SELECT word_id FROM word 
                                WHERE word_word = '$cur_word'");
         $row = mysql_fetch_array($result);
         if( $row['word_id'] )
         {
            /* If yes, use the old word_id: */
            $word_id = $row['word_id'];
         }
         else
         {
            /* If not, create one: */
            mysql_query("INSERT INTO word (word_word) VALUES (\"$cur_word\")");
            $word_id = mysql_insert_id();
         }

         /* And finally, register the occurrence of the word: */
         mysql_query("INSERT INTO occurrence (word_id,page_id) 
                      VALUES ($word_id,$page_id)");
         print "Indexing: $cur_word<br>";
      }
   }
}

fclose($fd);

?>

Note the use of addslashes() when assigning values to $url and $cur_word. Since these variables are used within SQL queries, it's very important to make sure they're free from harmful data. Imagine if someone called this script with a URL of "; DROP TABLE FOO;. As addslashes() quotes database special characters, someone of malicious bent would have to go to much more work to exploit the database.

This script connects to the database, registering the page in the database if it does not exist, and starts to retrieve data. The preg_match_all() call extracts all words (matching the regular expression) from the page. Each word must be recorded in the occurrence and word table.

When building the index, only three SQL INSERT statements actually matter. When a page is first indexed, it must be recorded:

INSERT INTO page (page_url) VALUES ("http://www.onlamp.com/");

The first occurrence of a word within the entire dataset must be recorded:

INSERT INTO word (word_word) VALUES ("linux");

Each occurrence of a word within a page must be recorded:

INSERT INTO occurrence (word_id,page_id) VALUES ($word_id,$page_id);

Let's now index a few pages. The seven sites that makes up the O'Reilly Network is probably a good idea. We will call populate.php with the site URLs as the only argument, one at a time:

http://localhost/populate.php?url=http://www.macdevcenter.com/
http://localhost/populate.php?url=http://www.onjava.com/
http://localhost/populate.php?url=http://www.onlamp.com/
http://localhost/populate.php?url=http://www.openp2p.com/
http://localhost/populate.php?url=http://www.osdir.com/
http://localhost/populate.php?url=http://www.perl.com/
http://localhost/populate.php?url=http://www.xml.com/

A quick investigation of the tables now should result in something like this:


mysql> SELECT * FROM page;
+---------+------------------------------+
| page_id | page_url                     |
+---------+------------------------------+
| 1       | http://www.macdevcenter.com/ |
| 2       | http://www.onjava.com/       |
| 3       | http://www.onlamp.com/       |
| 4       | http://www.openp2p.com/      |
| 5       | http://www.osdir.com/        |
| 6       | http://www.perl.com/         |
| 7       | http://www.xml.com/          |
+---------+------------------------------+
7 rows in set (0.00 sec)

mysql> SELECT COUNT(*) FROM word;
+----------+
| COUNT(*) |
+----------+
| 2423     |
+----------+
1 row in set (0.00 sec)

mysql> SELECT COUNT(*) FROM occurrence;
+----------+
| COUNT(*) |
+----------+
| 20628    |
+----------+
1 row in set (0.00 sec)

The database now has seven indexed pages, resulting in 2,423 unique words and 20,628 occurrences. We're ready for the next level, the search interface.

Pages: 1, 2, 3

Next Pagearrow




Valuable Online Certification Training

Online Certification for Your Career
Earn a Certificate for Professional Development from the University of Illinois Office of Continuing Education upon completion of each online certificate program.

PHP/SQL Programming Certificate — The PHP/SQL Programming Certificate series is comprised of four courses covering beginning to advanced PHP programming, beginning to advanced database programming using the SQL language, database theory, and integrated Web 2.0 programming using PHP and SQL on the Unix/Linux mySQL platform.

Enroll today!


Sponsored by: