PHP DevCenter

PHP Search Engine Showdown

by Michael Douma
03/23/2007

Editor's note: Sphinx added to review on 03 April 2007.

It's a universal frustration. You just know that the piece of information you're looking for is somewhere on a site. You click one link, then another, and another. You go back to the home page and try a different branch of the site. After dozens of clicks, you still can't find the information you need. Then it's back to Google and on to another site. At last you find one with an internal search engine. You enter your search term, and voilà! The information you need pops up in less than a second.

If you want your visitors to have "voilà!" moments, consider incorporating an internal search engine into your web site. Search tools not only make your information easily accessible, but they also increase the time visitors spend on your site. An internal search engine may be a necessity if your site has more than 100 pages of content, if it is deeply hierarchical, or if its architecture is weak. If the purpose of your site is to provide in-depth information on a variety of specific topics, it's ineffective to force visitors to browse through your site to find the information they seek. Even if you have designed your site to bring users pleasure through browsing, it's still a good idea to give your visitors an effective option for finding something specific.

Hosted vs. Local

When selecting a search tool, you have two options: a hosted remote search engine or a local search service. If you have a hosted site (a site that is not on your server), you can take advantage of free or fee-based services provided by companies that host search engines on their servers. You simply have to register on their site and you're on your way. You can find some of these search tools at www.atomz.com, www.mondosoft.com, and www.picosearch.com.

Remote site search services offer several advantages. Your costs are significantly lower, as the software and maintenance are often free. Likewise, because index files are stored on the host's servers, you save disk space. There is also less likelihood of downtime, because keeping the search tool up and running is of paramount importance to the host company.

The primary disadvantages of remote site search services are that you have little control over the indexing process and that you can't change the code, add new features, or customize your search engine.

When you choose to incorporate a local search service, you install the search engine on your server and customize the tool yourself. The advantages of using the local approach are that you can ensure the privacy of your data, you can control the indexing process and search results, and that you have the freedom to implement new features.

The disadvantages of installing a local search engine are that indexing and maintenance are your responsibility, and that the index and installation files will use space on your hard drive. You may also incur costs associated with software acquisition, although free, open source software is available.

Getting Started

Integrating a search engine into your site is easy if you prepare the site correctly. You should consider several issues when setting up your site.

Physical issues

You need enough available disk space for the index and adequate processing power, and you must remember to update the index after every change you make to your site. You will need the appropriate PHP software installed on the server, as well as MySQL if you're using MySQL databases for your indexes. A few search engines can be configured through browser-based graphical interfaces. Others may require command-line root access on the server.

The pages

Make sure that the search results list appears the way you want it to, including the relevant page titles, meta descriptions, and text.

Page titles are the most important elements in your search results, so make sure they are relevant to the context of the page. Be very careful with spelling--you can allow no mistakes. Ensure that the page title always contains the most common keywords relating to the subject.

Some search tools display meta descriptions in the results list. If your engine uses descriptions, be sure they are accurate. For example, if you have a site about local food and you want to add a meta description for the Restaurants page, you should do something like this: <META NAME="description" CONTENT="List of restaurants in my area with available specialties, customer opinions and general info.">

Although major public search engines such as Google no longer use meta keywords (because deceptive webmasters used inaccurate keywords), keywords are very helpful for a local search engine, where you have control and there is no risk of abuse. Use keywords if you want your search engine to find the best results for search terms. Be sure to include any keywords you think are relevant to the context of the page. For the previous example, <META NAME="keywords" CONTENT="food, Washington DC, Pata Mia, Olive Garden, Italian, pasta">. If anyone searches for any of these terms, this page will appear highly ranked in the search results list.
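To make the role of these tags concrete, here is a minimal sketch of how an indexer might pull the title, meta description, and meta keywords out of a page. It is written in Python with only the standard library, purely for illustration; the names PageSummaryParser and summarize_page are hypothetical and do not come from any of the search tools discussed here.

```python
from html.parser import HTMLParser


class PageSummaryParser(HTMLParser):
    """Collects the <title> text and the description/keywords meta tags."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> works too.
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            attributes = dict(attrs)
            name = (attributes.get("name") or "").lower()
            if name in ("description", "keywords"):
                self.meta[name] = (attributes.get("content") or "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def summarize_page(html):
    """Return the fields a results list would typically show for one page."""
    parser = PageSummaryParser()
    parser.feed(html)
    parser.close()
    raw_keywords = parser.meta.get("keywords", "")
    return {
        "title": parser.title.strip(),
        "description": parser.meta.get("description", ""),
        "keywords": [k.strip() for k in raw_keywords.split(",") if k.strip()],
    }
```

Running summarize_page over every page before you install a search engine is also a quick way to spot pages with missing titles or empty descriptions.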

Headings are very important if you want the search engine to return good results. Many search tools use the headings to determine the ranking for a given page.
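As a rough illustration of why titles and headings matter, the following Python sketch scores a page by counting query-term matches and weighting them by where they occur. The weights, the field names, and the functions score_page and rank_pages are assumptions made for this example; real search tools use their own, usually more elaborate, ranking formulas.

```python
import re

# Hypothetical weights: a match in the title counts more than one in a heading,
# which in turn counts more than one in the body text.
FIELD_WEIGHTS = {"title": 5.0, "headings": 3.0, "body": 1.0}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score_page(page, query):
    """Sum the weighted number of query-term occurrences in each field."""
    terms = set(tokenize(query))
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = tokenize(page.get(field, ""))
        score += weight * sum(1 for word in words if word in terms)
    return score


def rank_pages(pages, query):
    """Return the URLs of matching pages, best score first."""
    scored = [(score_page(page, query), page["url"]) for page in pages]
    scored.sort(key=lambda item: -item[0])
    return [url for score, url in scored if score > 0]
```

With weights like these, a page that mentions "pasta" in its title and a heading outranks a page that only mentions it once in the body, which is the behavior the advice above is designed to exploit.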

Indexing

A search indexer goes through the pages and builds an index (usually a database) for easy searching, because searching the actual site is very slow. If the indexer accesses your web pages by talking to the web server, that is web-based "crawling" or "spidering." If it directly accesses the directory and file structure on your drive, that is file-system-based crawling. The indexer must be able to save its index files in a web server directory where the search engine can locate them when a user searches your site.
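File-system-based crawling can be sketched in a few lines of Python. The example below walks a document root and yields the URL path and file path of each page it would hand to the indexer. The list of indexable extensions and the function name crawl_document_root are assumptions for illustration, not the behavior of any particular search tool.

```python
import os

# Assumption: only these file types contain content worth indexing.
INDEXABLE_EXTENSIONS = {".html", ".htm", ".php"}


def crawl_document_root(document_root):
    """Yield (url_path, file_path) for every indexable file under document_root."""
    for directory, subdirectories, filenames in os.walk(document_root):
        subdirectories.sort()  # visit subdirectories in a predictable order
        for filename in sorted(filenames):
            extension = os.path.splitext(filename)[1].lower()
            if extension not in INDEXABLE_EXTENSIONS:
                continue
            file_path = os.path.join(directory, filename)
            relative = os.path.relpath(file_path, document_root)
            # Map the on-disk location to the path a visitor would request.
            yield "/" + relative.replace(os.sep, "/"), file_path
```

A web-based crawler would instead request each URL over HTTP and follow the links it finds, which lets it see dynamically generated pages at the cost of extra load on the web server.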

Usually, the search engine creates an "inverted index." This method makes a list of all the words found in the text you want to search. It uses (key, pointer) pairs to store the location of each word, in which the key is the word itself and the pointer is the position in the text where the word occurs. In other words, the method converts a text made up of words into a list of words that point back into the text; thus, an inverted index. This allows the search engine to find the desired pages much faster, because it is easier to look up a word in a database of words than to scan a text that probably contains many repeated words. The advantage of the database is that each word is indexed only once, together with all of the locations where it occurs in the text. Because of the inherent complexities of building an inverted index, crawling a site is often quite slow.
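A toy version of an inverted index fits in a short Python sketch. Here each key is a word and each pointer is a page identifier plus the word positions within that page. The in-memory dictionary stands in for the database a real engine would use, and the function names build_inverted_index and search are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(pages):
    """Map each word to {page_id: [positions of the word in that page]}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for position, word in enumerate(tokenize(text)):
            # Each word gets a single entry; repeats only add more pointers.
            index[word].setdefault(page_id, []).append(position)
    return index


def search(index, query):
    """Return the ids of pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    matches = set(index.get(words[0], {}))
    for word in words[1:]:
        matches &= set(index.get(word, {}))
    return sorted(matches)
```

Answering a query is now a handful of dictionary lookups and set intersections, regardless of how long the pages are. The expensive part, reading and tokenizing every page, happens once at indexing time.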

Many indexers update the index automatically. If yours does not, you must rebuild the index yourself after every modification to your site. Be sure your indexer never indexes private files, or your search engine will return private information in its results.
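Both concerns, keeping the index fresh and keeping private files out of it, can be sketched together in Python. The function below compares each file's modification time with the time recorded at the last indexing run and never descends into directories marked as private. The directory names in PRIVATE_DIRECTORIES and the function name files_to_reindex are assumptions for illustration. This sketch does not detect deleted pages, which a real indexer would also need to remove from its index.

```python
import os

# Hypothetical directory names that must never appear in search results.
PRIVATE_DIRECTORIES = {"private", "admin"}


def files_to_reindex(document_root, indexed_mtimes):
    """Return relative paths that are new or changed since the last indexing run.

    indexed_mtimes maps a relative path to the modification time that was
    recorded when the file was last indexed.
    """
    stale = []
    for directory, subdirectories, filenames in os.walk(document_root):
        # Prune private directories in place so os.walk never descends into them.
        subdirectories[:] = sorted(
            name for name in subdirectories if name not in PRIVATE_DIRECTORIES
        )
        for filename in sorted(filenames):
            file_path = os.path.join(directory, filename)
            relative = os.path.relpath(file_path, document_root)
            if os.path.getmtime(file_path) > indexed_mtimes.get(relative, 0.0):
                stale.append(relative)
    return stale
```

Run from a scheduled job, a check like this lets the indexer reprocess only the pages that changed instead of rebuilding the whole index after every edit.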

For more information, see the Inverted Index Language Shootout, and for a relevant example, see NIST's Inverted Index explanation.
