Editor's note: Sphinx added to review on 03 April 2007.
It's a universal frustration. You just know that the piece of information you're looking for is somewhere on a site. You click one link, then another, and another. You go back to the home page and try a different branch of the site. After dozens of clicks, you still can't find the information you need. Then it's back to Google and on to another site. At last you find one with an internal search engine. You enter your search term, and voilà!--the information you need pops up in less than a second.
If you want your visitors to have "voilà!" moments, consider incorporating an internal search engine into your web site. Search tools not only make your information easily accessible, but they also increase the time visitors spend on your site. An internal search engine may be a necessity if your site has more than 100 pages of content, if it is deeply hierarchical, or if its architecture is weak. If the purpose of your site is to provide in-depth information on a variety of specific topics, it's ineffective to force a visitor to browse through your site to find the information he seeks. Even if you have designed your site to bring users pleasure through browsing, it's still a good idea to give your visitor an effective option for finding something specific.
When selecting a search tool, you have two options: a hosted remote search engine or a local search service. If you have a hosted site (a site that is not on your server), you can take advantage of free or fee-based services provided by companies that host search engines on their servers. You simply have to register on their site and you're on your way. You can find some of these search tools at www.atomz.com, www.mondosoft.com, and www.picosearch.com.
Remote site search services offer several advantages. Your costs are significantly lower, as the software and maintenance are often free. Likewise, because index files are stored on the host's servers, you save disk space. There is also less likelihood of downtime, because keeping the search tool up and running is of paramount importance to the host company.
The primary disadvantages of remote site search services are that you have little control over the indexing process and that you can't change the code, add new features, or customize your search engine.
When you choose to incorporate a local search service, you install the search engine on your server and customize the tool yourself. The advantages of the local approach are that you can ensure the privacy of your data, control the indexing process and search results, and freely implement new features.
The disadvantages of installing a local search engine are that indexing and maintenance are your responsibility, and that the index and installation files will use space on your hard drive. You may also incur costs associated with software acquisition--although free, open source software is available.
Integrating a search engine into your site is easy if you prepare the site correctly. You should consider several issues when setting up your site.
You should have enough available disk space for the index, you need adequate processing power, and you must remember to update the index after every change you make to your site. You will also need the appropriate PHP software installed on the server, as well as MySQL if you're using MySQL databases for your indexes. A few search engines can be configured through browser-based graphical interfaces; others may require command-line root access on the server.
Make sure that the search results list appears the way you want it to, including the relevant page titles, meta descriptions, and text.
Page titles are the most important elements in your search results, so make sure they are relevant to the context of the page. Be very careful with spelling--you can allow no mistakes. Ensure that the page title always contains the most common keywords relating to the subject.
Some search tools display meta descriptions in the results list. If your engine uses descriptions, be sure they are accurate. For example, if you have a site about local food and you want to add a meta description for the Restaurants page, you should do something like this:
<META NAME="description" CONTENT="List of restaurants in my area with available specialties, customer opinions and general info.">
Although major public search engines such as Google no longer use meta keywords (because deceptive webmasters used inaccurate keywords), they are very helpful for a local search engine, where you have control and there is no risk of abuse. Use keywords if you want your search engine to return the best results for search terms. Be sure to include any keywords you think are relevant to the context of the page. For the previous example,
<META NAME="keywords" CONTENT="food, Washington DC, Pata Mia, Olive Garden, Italian, pasta, etc. ">. If anyone searches for any of these terms, this page will appear highly ranked in the search results list.
Headings are very important if you want the search engine to return good results. Many search tools use the headings to determine the ranking for a given page.
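Continuing the restaurant example, the page title and main heading might be marked up as follows (the wording here is purely illustrative):

```html
<HEAD>
<TITLE>Restaurants in My Area--Specialties and Customer Opinions</TITLE>
</HEAD>
<BODY>
<H1>Restaurants in My Area</H1>
</BODY>
```

Keeping the most common keywords in both the title and the top-level heading gives most ranking algorithms the signals they weight most heavily.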
A search indexer goes through the pages and builds an index (usually a database) for easy searching, because searching the actual site is very slow. If the engine accesses your web pages by speaking to the web server, that is web-based "crawling" or "spidering." If it directly accesses the directory and file structure on your drive, that is file-system-based crawling. The indexer must be able to save its files in a web server directory where the search engine can locate them when a user searches your site.
Usually, the search engine creates an "inverted index." This method makes a list of all the words found in the text you want to search, and uses (key, pointer) pairs to store the location of each word, in which the key is the word itself and the pointer is the position in the text where the word occurs. So the method converts a text made of words into a list of words in a text--thus, an inverted index. This allows the search engine to find the desired pages much faster, as it is easier to search through a database of words than through a text that will probably contain duplicate words. The advantage of the database is that you index a word only once and store every location where that word occurs in the text. Because of the inherent complexity of building an inverted index, crawling a site is often quite slow.
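The idea can be sketched in a few lines of Python (a toy illustration of the data structure, not any engine's actual code; the documents and function names are invented for the example):

```python
# Toy inverted index: maps each word to (document, position) pairs.
from collections import defaultdict

def build_index(docs):
    """docs: dict of doc_id -> text. Returns word -> list of (doc_id, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

def search(index, term):
    """Return doc_ids containing term, ranked by how often the term occurs."""
    counts = defaultdict(int)
    for doc_id, _ in index.get(term.lower(), []):
        counts[doc_id] += 1
    return sorted(counts, key=counts.get, reverse=True)

docs = {
    "restaurants.html": "Italian pasta restaurants serving fresh pasta in Washington",
    "recipes.html": "How to cook Italian pasta at home",
}
index = build_index(docs)
print(search(index, "pasta"))  # → ['restaurants.html', 'recipes.html']
```

Because each word is indexed only once, with a list of the places it occurs, a query never has to re-scan the full text of the site--which is exactly the speedup the inverted index provides.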
Although many indexers do this automatically, you must update the index after every modification to your site. Be sure your indexer never indexes private files. That would result in private information being returned by search engines.
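If your server supports cron, a scheduled job can handle routine re-indexing. This crontab fragment is only a sketch--the indexer script name and paths are placeholders that depend on which engine you install:

```
# Re-index the site nightly at 3:00 a.m. (script path is a placeholder)
0 3 * * * /usr/bin/php /var/www/search/indexer.php >> /var/log/search-index.log 2>&1
```

Logging the indexer's output, as shown, makes it easy to spot a crawl that hung or a private directory that was indexed by mistake.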
For more information, see the Inverted Index Language Shootout, and for a relevant example, see NIST's Inverted Index explanation.
If you're going to install a local search engine and are using PHP, you have several great PHP engines to consider. We took the leaders in the field, summarized their features (Table 1), tested them all, and found:
iSearch has an excellent range of options for the needs of nearly any site, yet its core functions are encoded and effectively unmodifiable. Also, in testing, the spider would trap itself in a loop or on an unreachable page every 20 minutes or so, making cron-based updates unreliable.
MnogoSearch is quite powerful and versatile, but unlike most of its PHP-minded competitors, it must be compiled before use and has the most substantial learning curve. It works out of the box with every major database, including SQLite, and comes with front ends for PHP, C, and Perl. A command-line interface performs all maintenance and indexing; once you have configured it correctly, it is also useful for automation. It has a wide variety of features, including searches of your site, FTP archive searches, news article and newspaper searches, and more.
PHPDig uses a MySQL database, building a glossary of the words from the pages you index. Search results display the pages ranked by keyword density. It is overflowing with features and plugins for any format of data, and has built-in index-scheduling routines. However, despite PHPDig's fame and clean code, this search engine is far from being one of the best available: its indexing speed is quite slow, especially in comparison with MnogoSearch or RiSearch.
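Keyword-density ranking of the kind PHPDig performs can be sketched as follows (a toy illustration of the idea, not PHPDig's actual code; the page names are invented):

```python
# Keyword density = occurrences of the term / total words on the page.
def keyword_density(text, term):
    words = text.lower().split()
    return words.count(term.lower()) / len(words) if words else 0.0

# Hypothetical pages, ranked for the term "pasta".
pages = {
    "pasta.html": "pasta pasta recipes and more pasta",
    "pizza.html": "pizza and pasta in one place",
}
ranked = sorted(pages, key=lambda p: keyword_density(pages[p], "pasta"), reverse=True)
print(ranked)  # → ['pasta.html', 'pizza.html']
```

A page where the term makes up half the words outranks one where it appears once, which is why density ranking favors focused, single-topic pages.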
RiSearch is powerful, with a very fast search script designed to work with hundreds of megabytes of text data. It uses no libraries or databases; it is Perl code with PHP front ends. For a file-based storage back end, RiSearch is surprisingly fast to search. However, this design hurts search result relevancy, which is poorer than that of the other options. It is therefore better for finding unique phrases, like names of species, than for searching concepts.
Sphider is PHP code that uses MySQL to index pages. It works for sites of up to 20,000 pages. It also works well as a tool for site analysis, such as finding broken links and gathering statistics about the site. It has an efficient back end and search algorithm, but its crawling methods function poorly.
Sphinx is a fast and capable full-text search engine, particularly suited for database content. It runs its own daemon (which you compile) and does not bundle any web crawlers. Features include high performance, good scalability, good search quality, and advanced sorting, filtering, and grouping.
TSEP's crawler causes a long delay when the data to index is extensive. This was a problem on one server with a time-out/keep-alive of 8/15, though adding ignore_user_abort() to the top of indexer.php bypasses it.
Table 1. Summary of leading PHP engines
| |iSearch|MnogoSearch|TSEP|PHPDig|Sphider|RiSearch|Sphinx|
|Database|MySQL, SQLite|Several|MySQL, SQLite|MySQL|MySQL|Flat files (text)|MySQL, PostgreSQL, Flat files|
|Support|Medium (forum)|Very good (discussion list, forum, and paid email support)|Medium (forum)|Poor (forum)|Medium (FAQ and forum)|Medium (forum)|Good|
|PHP 5 compatible|Yes|Yes (for the interface)|Yes (requires PHP 5)|Yes|Yes|Yes|Yes|
|Install package download|44K|2MB|1.5MB|273K|150K|128K|~300K|
|Access needed to install|Root|Root (need to compile)|Root|Root|Root|FTP|Shell (non-root)|
|Recommended file limit|High|Very high|High|High|High|Very high|Very high|
|Index speed|Very slow|~500 pages in 10 seconds|~500 pages in 14 seconds|Slow|Medium|~500 pages in 18 seconds|4-10 MB/sec|
Overall ranking represents the author's overall ranking of the engine, based on ease of use, power, spidering speed, and ranking relevancy.
Database lists the kind of database used for creating and storing the index.
Support refers to the customer support available for each engine and the channels through which you can ask questions about installing or using the tool.
Access needed to install indicates the access you need to have on the server in order to fully install your application and index your site.
Recommended file limit identifies the largest number of documents the search engine can handle while still running at full capacity.
Other PHP search engines, not included in the table but listed below, are available. We do not recommend these engines as highly.
SiteSearch is a PHP engine that uses a text file database to index the information on the site. It includes several useful features, such as indexing by meta tags and multiple word search. It has several add-ons, including multilanguage support and text database support.
Simple Web Search is a script that searches a SWISH-E index. It requires SWISH-E 1.x or 2.x and PHP 3.0.8 or newer on the system, and a web server supporting PHP 3.
IndexServer is a useful plugin package that performs a variety of tasks; after it indexes a web site, you can run further queries against the resulting index.
Xapian is only an indexing library, but the project also offers a web site search engine package that includes its Omega solution, which looks promising and has several interesting features. Xapian's PHP bindings use SWIG, so the indexer is not PHP 5 compatible. This is where BeebleX comes in: BeebleX is a search engine built on a PHP 5-compatible Xapian extension. For more information, see Marco Tabini's thoughts on BeebleX.
There is no ideal PHP search engine, but our overall impression was that Sphider and MnogoSearch are the best contenders. In general, Sphider returns more accurate hits, and MnogoSearch is easier to set up.
Sphinx is a relatively new contender, and shows good promise. Although Sphinx is little known and has few real-world installations so far, it is worth checking in on in the future, particularly if you don't need a web crawler. Xapian is a strong engine, with support for many programming languages, and an active community, but we found it difficult to set up in PHP.
If you want to know more about search engines, the following sites have plenty of descriptions, reviews, news, guides, how-tos, and technologies:
Michael Douma is an expert in user interface design and web-based interactive education.
Copyright © 2009 O'Reilly Media, Inc.