Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers (3 page)

BOOK: Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
12.86Mb size Format: txt, pdf, ePub
ads

Now, remember why we introduced these in-page word locations: it was to solve the problem of how to do phrase queries efficiently. So let's see how to do a phrase query with this new index. We'll work with the same query as before, “cat sat”. The first steps are the same as with the old index: extract the locations of the individual words from the index, so for “cat” we get 1-2, 3-2, and for “sat” we get 1-3, 3-7. So far, so good: we know that the only possible hits for the phrase query “cat sat” can be on pages 1 and 3. But just like before, we are not yet sure whether that exact phrase occurs on those pages—it is possible that the two words do appear, but not next to each other in the correct order. Luckily, it is easy to check this from the location information. Let's concentrate on page 1 initially. From the index information, we know that “cat” appears at position 2 on page 1 (that's what the 1-2 means). And we know that “sat” appears at position 3 on page 1 (that's what the 1-3 means). But if “cat” is at position 2, and “sat” is at position 3, then we know “sat” appears immediately after “cat” (because 3 comes immediately after 2)—and so the entire phrase we are looking for, “cat sat,” must appear on this page beginning at position 2!

Top: Our three web pages with in-page word locations added. Bottom: A new index that includes both page numbers and in-page word locations.

I know I am laboring this point, but the reason for going through this example in excruciating detail is to understand
exactly
what information is used to arrive at this answer. Note that we have found a hit for the phrase “cat sat” by looking only at the index information (1-2, 3-2 for “cat,” and 1-3, 3-7 for “sat”), not at the original web pages themselves. This is crucial, because we only had to look at the two entries in the index, rather than reading through all of the pages that might be hits—and there could be literally millions of such pages in a real search engine performing a real phrase query. To summarize: including the in-page word locations in the index has allowed us to find a phrase query hit by looking at only a couple of lines in the index, rather than reading through a large number of web pages. This simple word-location trick is one of the keys to making search engines work!

Actually, we haven't even finished working through the “cat sat” example. We finished processing the information for page 1, but not for page 3. But the reasoning for page 3 is similar: we see that “cat” appears at location 2, and “sat” occurs at location 7, so they cannot possibly occur next to each other—because 7 is not immediately after 2. So we know that page 3 is
not
a hit for the phrase query “cat sat”, even though it
/s
a hit for the multiword query cat sat.

By the way, the word-location trick is important for more than just phrase queries. As one example, consider the problem of finding words that are near to each other. On some search engines, you can do this with the NEAR keyword in the query. In fact, the AltaVista search engine offered this facility from its early days and still does at the time of writing. As a specific example, suppose that on some particular search engine, the query cat NEAR dog finds pages in which the word “cat” occurs within five words of the word “dog.” How can we perform this query efficiently on our data set? Using word locations, it's easy. The index entry for “cat” is 1-2, 3-2, and the index entry for “dog” is 2-2, 3-6. So we see immediately that page 3 is the only possible hit. And on page 3, “cat” appears at location 2, and “dog” appears at location 6. So the distance between the two words is 6 – 2, which is 4. Therefore, “cat”
does
appear within five words of “dog,” and page 3 is a hit for the query cat NEAR dog. Again, note how efficiently we could perform this query: there was no need to read through the actual content of any web pages—instead, only two entries from the index were consulted.

It turns out that NEAR queries aren't very important to search engine users in practice. Almost no one uses NEAR queries, and most major search engines don't even support them. But despite this, the ability to perform NEAR queries is actually crucial to real-life search engines. This is because the search engines themselves are constantly performing NEAR queries behind the scenes. To understand why, we first have to take a look at one of the other major problems that confronts modern search engines: the problem of
ranking.

RANKING AND NEARNESS

So far, we've been concentrating on the matching phase: the problem of efficiently finding all of the hits for a given query. But as emphasized earlier, the second phase, “ranking,” is absolutely essential for a high-quality search engine: this is the phase that picks out the top few hits for display to the user.

Let's examine the concept of ranking a little more carefully. What does the “rank” of a page really depend on? The real question is not “Does this page
match
the query?” but rather “Is this page
relevant
to the query?” Computer scientists use the term “relevance” to describe how suitable or useful a given page is, in response to a particular query.

As a concrete example, suppose you are interested in what causes malaria, and you enter the query malaria cause into a search engine. To keep things simple, imagine there are only two hits for that query in the search engine—the two pages shown in the figure on the following page. Have a look at those pages now. It should be immediately clear to you, as a human, that page 1 is indeed about the causes of malaria, whereas page 2 seems to be the description of some military campaign which just happens, by coincidence, to use the words “cause” and “malaria.” So page 1 is undoubtedly more “relevant” to the query malaria cause than page 2. But computers are not humans, and there is no easy way for a computer to understand the topics of these two pages, so it might seem impossible for a search engine to rank these two hits correctly.

Top: Two example web pages that mention malaria.
Bottom: Part of the index built from the above two web pages.

However, there is, in fact, a very simple way to get the ranking right in this case. It turns out that pages where the query words occur
near
each other are more likely to be relevant than pages where the query words are far apart. In the malaria example, we see that the words “malaria” and “cause” occur within two words of each other in page 1, but are separated by 17 words in page 2. (And remember, the search engine can find this out efficiently by looking at just the index entries, without having to go back and look at the web pages themselves.) So although the computer doesn't really “understand” the topic of this query, it can
guess
that page 1 is more relevant than page 2, because the query words occur much closer on page 1 than on page 2.

To summarize: although humans don't use NEAR queries much, search engines use the information about nearness constantly to improve their rankings—and the reason they can do this efficiently is because they use the word-location trick.

An example set of web pages that each have a title and a body.

We already know that the Babylonians were using indexing 5000 years before search engines existed. It turns out that search engines did not invent the word-location trick either: this is a well-known technique that was used in other types of information retrieval before the internet arrived on the scene. However, in the next section we will learn about a new trick that does appear to have been invented by search engine designers: the
metaword trick.
The cunning use of this trick and various related ideas helped to catapult the AltaVista search engine to the top of the search industry in the late 1990s.

THE METAWORD TRICK

So far, we've been using extremely simple examples of web pages. As you probably know, most web pages have quite a lot of structure, including titles, headings, links, and images, whereas we have been treating web pages as just ordinary lists of words. We're now going to find out how search engines take account of the structure in web pages. But to keep things as simple as possible, we will introduce only one aspect of structuring: we will allow our pages to have a
title
at the top of the page, followed by the
body
of the page. The figure above shows our familiar three-page example with some titles added.

Actually, to analyze web page structure in the same way that search engines do, we need to know a little more about how web pages are written. Web pages are composed in a special language that allows web browsers to display them in a nicely formatted way. (The most common language for this purpose is called HTML, but the details of HTML are not important for this discussion.) The formatting instructions for headings, titles, links, images, and the like are written using special words called
metawords.
As an example, the metaword used to start the title of a web page might be , and the metaword for ending the title might be . Similarly, the body of the web page could be started with and ended with . Don't let the symbols “<” and “>” confuse you. They appear on most computer keyboards and are often known by their mathematical meanings as “less than” and “greater than.” But here, they have nothing whatsoever to do with math—they are just being used as convenient symbols to mark the metawords as different from regular words on a web page.

The same set of web pages as in the last figure, but shown as they might be
written
with metawords, rather than as they would be displayed in a web browser.

Take a look at the figure above, which displays exactly the same content as the previous figure, but now showing how the web pages were actually written, rather than how they would be displayed in a web browser. Most web browsers allow you to examine the raw content of a web page by choosing a menu option called “view source”—I recommend experimenting with this the next time you get a chance. (Note that the metawords used here, such as and , are fictitious, easily recognizable examples to aid our understanding. In real HTML, metawords are called
tags.
The tags for starting and ending titles in HTML are and —search for these tags after using the “view source” menu option.)

When building an index, it is a simple matter to include all of the metawords. No new tricks are needed: you just store the locations of the metawords in the same way as regular words. The figure on the next page shows the index built from the three web pages with metawords. Take a look at this figure and make sure you understand there is nothing mysterious going on here. For example, the entry for “mat” is 1-11, 2-11, which means that “mat” is the 11th word on page 1 and also the 11th word on page 2. The metawords work the same way, so the entry for “,” which is 1-4, 2-4, 3-4, means that “” is the fourth word in page 1, page 2, and page 3.

BOOK: Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers
12.86Mb size Format: txt, pdf, ePub
ads

Other books

The Village by the Sea by Anita Desai
Imago Bird by Nicholas Mosley
Ruffskin by Megan Derr
Jane Feather by Engagement at Beaufort Hall
The Belgravia Club by Fenton, Clarissa
Sexual Shift by Beverly Rae
Woodrose Mountain by Raeanne Thayne
The Secret Rescue by Cate Lineberry