May 9, 2009
To effectively optimize for search engines and to better understand what's really happening, there is value in knowing how modern search algorithms work. This article will walk through the creation of a hypothetical search engine, and will show how this impacts search engine optimization.
Step One: Make a List of URLs and Crawl Them
Before anything else can happen, the engine needs a starting list of URLs to crawl. The most popular option is to load the URLs in the DMOZ database. These aren't the only sites that will be crawled: the crawler follows links, so the pages linked to by sites in the DMOZ directory are crawled as well. It certainly helps to be in DMOZ, especially if you don't have enough links from other sites to be sure that you'll be sufficiently crawled.
Now, a group of computers is set up to download all of the pages on the list. These are called the "crawlers." They also look at the links on those pages and crawl those URLs as well (in principle, the crawlers keep following links until their hard drives are full).
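The link-following loop can be sketched as a breadth-first traversal. This is only an illustration under stated assumptions: `FAKE_WEB` is a made-up in-memory stand-in for real pages, `fetch_links` is a hypothetical callback that returns the links found on a URL, and `max_pages` plays the role of the "full hard drive."

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs it links to.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}

def crawl(seed_urls, fetch_links, max_pages=1000):
    """Breadth-first crawl: visit the seeds, then every page they link to."""
    queue = deque(seed_urls)
    seen = set(seed_urls)          # URLs already queued, so nothing is crawled twice
    crawled = []
    while queue and len(crawled) < max_pages:
        url = queue.popleft()
        crawled.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return crawled

# Only one seed URL is supplied, yet all four pages end up crawled.
order = crawl(["http://a.example/"], lambda url: FAKE_WEB.get(url, []))
print(order)
```

Note that `http://d.example/` is never in the seed list; it is found only because another crawled page links to it, which is the situation described above for sites that are not in the directory.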
Step Two: Analyze the Pages
The crawlers now go through each page and look at its content.
First, the crawler makes a table with every unique word on the page. It gives "points" to each word based on how many times it's used on the page, and words in bold, in the title, in meta tags, or in headers are given extra points.
| Word | Points |
| shoes | 145 |
| athletic | 78 |
| sneakers | 34 |
| sandals | 12 |
| (etc.) |
This means that you should use your most important words more often in your text. However, using a word too often can get your page flagged as spam, which may cause the engine to drop your site from its database.
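As a rough sketch of this scoring step, the snippet below gives one point per occurrence in the body text and a bonus for words that also appear in an emphasized field. The `BONUS` weights and the field names are assumptions made up for illustration; real engines do not publish theirs.

```python
import re
from collections import Counter

# Hypothetical bonus weights per emphasized field (not from any real engine).
BONUS = {"title": 10, "header": 5, "bold": 2}

def tokenize(text):
    """Lowercase the text and split it into words."""
    return re.findall(r"[a-z0-9']+", text.lower())

def score_words(body, emphasized=None):
    """One point per occurrence in the body, plus a bonus per occurrence
    in each emphasized field (title, header, bold)."""
    points = Counter(tokenize(body))
    for field, text in (emphasized or {}).items():
        for word in tokenize(text):
            points[word] += BONUS.get(field, 0)
    return points

points = score_words(
    "Athletic shoes and running shoes. Shoes for every sport.",
    {"title": "Athletic Shoes", "bold": "running shoes"},
)
print(points.most_common(3))
# [('shoes', 15), ('athletic', 11), ('running', 3)]
```

Here "shoes" earns 3 points from the body, 10 from the title, and 2 from the bold text.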
It then converts each word's points into a percentage of the page's total points (the figures below imply a total of 2,500 points for this page):
| Word | Points | Percentage |
| shoes | 145 | 5.80% |
| athletic | 78 | 3.12% |
| sneakers | 34 | 1.36% |
| sandals | 12 | 0.48% |
| (etc.) |
Usually, the percentages are stored in the database rather than the raw points, though longer pages may be given a slight advantage later on. As a result, padding a page with unnecessary text that repeats one term will raise your percentage for that term, but will also lower the percentage for every other term.
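The arithmetic is simple enough to check directly. The sketch below reuses the four word scores from the table; the `(other words)` entry of 2,231 points is an assumed filler, chosen so the page total comes to the 2,500 points that the table's percentages imply.

```python
def to_percentages(points):
    """Convert raw word points into each word's share of the page total."""
    total = sum(points.values())
    if total == 0:
        return {}
    return {word: round(100 * p / total, 2) for word, p in points.items()}

# Scores from the table, plus an assumed filler entry for all remaining words.
page = {"shoes": 145, "athletic": 78, "sneakers": 34, "sandals": 12,
        "(other words)": 2231}
before = to_percentages(page)
print(before["shoes"], before["athletic"])   # 5.8 3.12

# Padding the page with 500 more points for "shoes" raises that term's share,
# but lowers the share of every other term.
padded = dict(page, shoes=page["shoes"] + 500)
after = to_percentages(padded)
print(after["shoes"], after["athletic"])     # 21.5 2.6
```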
More advanced engines will also cross-reference each word to other major words based on where they are relative to each other. (Words appearing next to each other are given more points here.) So, for example:
| Word | shoes | athletic | sneakers | sandals |
| shoes | - | 20 | 12 | 7 |
| athletic | 20 | - | 11 | 4 |
| sneakers | 12 | 11 | - | 5 |
| sandals | 7 | 4 | 5 | - |
As a result, the placement of words relative to each other does matter. This is why targeting phrases is usually better than targeting a variety of single words.
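One way to picture this cross-referencing is a proximity score: every time two major words appear within a few positions of each other, the pair earns points, with adjacent words earning the most. The `window` size and the scoring rule (`window - distance + 1`) are assumptions chosen for illustration, not a known engine formula.

```python
from collections import Counter

def cooccurrence(words, targets, window=3):
    """Score pairs of target words by closeness; adjacent words score highest."""
    scores = Counter()
    # Keep only the positions of the words we care about.
    positions = [(i, w) for i, w in enumerate(words) if w in targets]
    for a, (i, w1) in enumerate(positions):
        for j, w2 in positions[a + 1:]:
            distance = j - i
            if distance > window:
                break                      # later words are even farther away
            if w1 != w2:
                pair = tuple(sorted((w1, w2)))
                scores[pair] += window - distance + 1
    return scores

words = "athletic shoes and sneakers beat sandals for athletic shoes".split()
targets = {"shoes", "athletic", "sneakers", "sandals"}
scores = cooccurrence(words, targets)
print(scores[("athletic", "shoes")])      # 6
print(scores[("athletic", "sneakers")])   # 1
```

The phrase "athletic shoes" appears twice with the words side by side, so that pair scores far higher than pairs that merely share a page.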
Step Three: Calculate Link Popularity
The crawlers now take their lists of the URLs that each page links to and combine them. So for each page there is now a list of the links on it, as well as the text of each link. The list is then reversed, so that instead of showing the links on each page, it shows for each page the sites that link to it.
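Reversing the list is a small data-structure exercise. In this sketch the input maps each page to the `(target, link text)` pairs found on it, and the output maps each page to the `(source, link text)` pairs pointing at it. The URLs and link texts are invented examples.

```python
from collections import defaultdict

def invert_links(outlinks):
    """outlinks: page -> list of (target_url, link_text).
    Returns: target_url -> list of (source_url, link_text)."""
    inlinks = defaultdict(list)
    for source, links in outlinks.items():
        for target, text in links:
            inlinks[target].append((source, text))
    return dict(inlinks)

outlinks = {
    "http://a.example/": [("http://c.example/", "athletic shoes")],
    "http://b.example/": [("http://c.example/", "sneakers"),
                          ("http://a.example/", "shoe reviews")],
}
inlinks = invert_links(outlinks)
print(len(inlinks["http://c.example/"]))   # 2
```

Keeping the link text alongside each source is what later lets an engine treat the words other sites use in their links as a signal about the target page.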
Some search engines stop here and simply store the number of links pointing to a given page, but Google takes it a little further.
For every page in its database, Google assigns "points" based on how many links point to it, just like any other search engine. Then it recalculates the points for each page, but gives more weight to links from pages that earned a higher point value in the previous count. It then repeats the process about 100 times, each time making the points more accurate.
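The repeated recalculation can be sketched with the publicly documented PageRank iteration. Two things here go beyond what this article describes and should be read as assumptions: the damping factor of 0.85 (taken from the published PageRank formulation), and the choice to spread the points of a page with no outgoing links evenly across all pages. The four-page graph is made up.

```python
def link_points(outlinks, iterations=100, damping=0.85):
    """Iteratively score pages so that links from high-scoring pages count more."""
    pages = set(outlinks)
    for targets in outlinks.values():
        pages.update(targets)
    n = len(pages)
    points = {page: 1.0 / n for page in pages}       # first pass: everyone equal
    for _ in range(iterations):
        new_points = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = outlinks.get(page, [])
            if targets:
                # A page splits its current points among the pages it links to.
                share = damping * points[page] / len(targets)
                for target in targets:
                    new_points[target] += share
            else:
                # Assumption: a page with no links spreads its points evenly.
                share = damping * points[page] / n
                for other in pages:
                    new_points[other] += share
        points = new_points
    return points

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
scores = link_points(graph)
for page in sorted(scores, key=scores.get, reverse=True):
    print(page, round(scores[page], 3))
```

Page `c` comes out on top because three pages link to it, and page `a` comes second even though only one page links to it, because that one link comes from the highest-scoring page.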
Now, Google takes the point values, which could be extraordinarily large, and converts them to a PageRank, which is on a scale of 0 to 10. However, it does not simply convert, for example, 1,000 to 1 and 2,000 to 2. The scale is logarithmic, which means that each higher PageRank requires many times more points than the one before it.
WebmasterGoodies has an approximation of what the ranges most likely are: look at the first three columns. The actual ranges aren't available to the public, but the ranges on that site are believed to be fairly close. Obviously, a logarithmic scale makes a difference: PR1 requires 6 to 30 "points," while PR10 requires more than 25 million points.
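To see why 1,000 and 2,000 points do not land on different rungs, here is a minimal sketch of a logarithmic scale. The parameters are assumptions: `first_threshold=6` echoes the lower bound quoted for PR1, and `base=5` is simply the ratio between 6 and 30. Because the real ranges are not public, this sketch is not expected to match the higher bands quoted above.

```python
def toolbar_score(points, first_threshold=6, base=5, max_score=10):
    """Map raw points to a 0-10 score where each step needs `base` times more points."""
    score = 0
    threshold = first_threshold
    while score < max_score and points >= threshold:
        score += 1
        threshold *= base      # the next rung is `base` times higher
    return score

for points in (5, 6, 1000, 2000, 4000):
    print(points, toolbar_score(points))
# 5 0
# 6 1
# 1000 4
# 2000 4
# 4000 5
```

Doubling the points from 1,000 to 2,000 changes nothing, while reaching the next score takes five times as many points as the current one did.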
Now What?
Search engines put the databases into a specialized format, and then write the search software.
When a search is made, every site containing the relevant terms is pulled up. The ranking is based on a combination of the points for each relevant term, the site's link popularity (PageRank), and other smaller factors. Each engine weights these differently.
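A toy version of that final step might look like the following. The weights `term_weight` and `link_weight`, the page names, and all the numbers are invented for illustration; the "other smaller factors" are left out entirely.

```python
def rank(query_terms, term_percentages, link_scores,
         term_weight=1.0, link_weight=0.5):
    """Return URLs containing every query term, best combined score first."""
    results = []
    for url, percentages in term_percentages.items():
        if not all(term in percentages for term in query_terms):
            continue                                   # page lacks a query term
        relevance = sum(percentages[term] for term in query_terms)
        score = term_weight * relevance + link_weight * link_scores.get(url, 0)
        results.append((score, url))
    results.sort(reverse=True)
    return [url for score, url in results]

term_percentages = {
    "shoes-a": {"shoes": 5.8, "athletic": 3.12},
    "shoes-b": {"shoes": 4.0, "athletic": 2.0},
    "sandals": {"sandals": 6.0, "shoes": 1.0},
}
link_scores = {"shoes-a": 2, "shoes-b": 8, "sandals": 9}

print(rank(["athletic", "shoes"], term_percentages, link_scores))
# ['shoes-b', 'shoes-a']
print(rank(["athletic", "shoes"], term_percentages, link_scores, link_weight=0))
# ['shoes-a', 'shoes-b']
```

With link popularity counted, the better-linked page wins despite weaker on-page scores; with it switched off, the order flips. That is the sense in which each engine's weighting changes the results.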
You should now have a better understanding of what's happening under the hood of the search engine, and this should help in optimizing your pages as well.