Web Searching and Information Retrieval

The goal of general-purpose search engines is to index a sizeable portion of the Web, independently of topic and domain. Each engine consists of several components: a crawler, an indexer, and a query engine.

We can distinguish three architectures for Web searching:


 * Traditional (or centralized)
 * Metasearch
 * Distributed search

Metasearch Architecture
Metasearchers can provide access to the information in the hidden Web's text databases. A metasearcher performs three main tasks: database selection, query translation, and result merging. Given a query, database selection finds the best databases to evaluate the query, query translation translates the query into a suitable form for each database, and result merging retrieves and merges the results from the different databases. The database selection component is crucial for both query processing efficiency and effectiveness.

Database selection algorithms are traditionally based on pre-collected statistics that characterize each database's contents. These statistics, often called content summaries, usually include at least the document frequencies of the words that appear in the database. Metasearchers rely on databases to supply this content summary, but many Web-accessible text databases are completely autonomous and do not report any detailed metadata about their content. In such cases, only manually generated descriptions of the content are usable, which limits the scalability of this approach.
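The idea of ranking databases by their content summaries can be illustrated with a minimal sketch. The summaries, database names, and scoring rule below are hypothetical (real systems use more refined formulas such as CORI); the sketch simply ranks databases by the fraction of their documents containing each query term.

```python
# Hypothetical content summaries: per-database sizes and document frequencies.
summaries = {
    "db_medicine": {"size": 10000, "df": {"cancer": 4200, "protein": 900}},
    "db_cars":     {"size": 5000,  "df": {"engine": 3100, "cancer": 3}},
}

def select_databases(query_terms, summaries, k=1):
    """Rank databases by the fraction of their documents matching each term."""
    def score(summary):
        return sum(summary["df"].get(t, 0) / summary["size"] for t in query_terms)
    ranked = sorted(summaries, key=lambda db: score(summaries[db]), reverse=True)
    return ranked[:k]
```

For the query term "cancer", this scoring would direct the metasearcher to `db_medicine`, since 42% of its documents contain the term versus under 0.1% for `db_cars`.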

Distributed Search Architecture
Most major Web search engines are based on cluster architectures. There have been attempts to develop peer-to-peer (P2P) search engines, but these suffered from many problems. Most of the objects the crawlers retrieved were useless and discarded, and there was no coordination among the crawlers. A completely distributed and decentralized P2P crawler called Apoidea looked promising, being both self-managing and exploiting the geographic proximity of resources to its peers for a better and faster crawl. However, this system doesn't seem to be in use, and Web search engines are still based on cluster architectures.

Page Importance
Three approaches can be used to determine a page's importance, which is used to rank the pages resulting from a query: link-based, content-based, and anchor-based.

The best-known link-based technique used today is a variant of the PageRank algorithm implemented in the Google search engine. It determines a page's importance only from the topological structure of a directed graph associated with the Web. A page's rank depends on the ranks of all the pages pointing to it, with each rank divided by the number of out-links those pages have. This method is inadequate for pages with low in-degree.
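The rank-sharing rule described above can be sketched as a simple power iteration. This is a minimal illustration, not Google's implementation; the toy graph and the 0.85 damping factor are the conventional textbook choices.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page, outs in links.items():
            if outs:
                # each page shares its rank equally among its out-links
                share = damping * rank[page] / len(outs)
                for target in outs:
                    new[target] += share
            else:
                # dangling page: spread its rank uniformly over all pages
                for target in pages:
                    new[target] += damping * rank[page] / n
        rank = new
    return rank

# Toy graph: a links to b and c, b links to c, c links back to a.
ranks = pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})
```

Note that `c` ends up ranked above `b` even though both are reachable from `a`: `c` collects rank from two pages, while `b` receives only half of `a`'s rank.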

Another technique, Kleinberg's algorithm, also called HITS (Hypertext Induced Topic Search), runs at query time on a small subset of relevant documents rather than on the whole collection. It computes two scores per document. Authoritative pages relevant to the initial query have large in-degree: there is considerable overlap in the sets of pages that point to them. Hub pages link to multiple relevant authoritative pages. If a page is a good authority, many hubs will point to it.
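The mutual reinforcement between hubs and authorities can be sketched as an iterative computation over the query-time subgraph. This is a bare-bones illustration of the HITS update rules, with normalization to keep the scores bounded; the example graph is hypothetical.

```python
def hits(links, iterations=50):
    """links: dict mapping each page in the subgraph to its outgoing links."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # authority score: sum of hub scores of pages that link to it
        auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
                for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # hub score: sum of authority scores of the pages it links to
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Two hub pages both pointing at the same authority "a".
hub, auth = hits({"h1": ["a"], "h2": ["a"]})
```

As the text describes, `a` receives a high authority score because multiple hubs point to it, while `h1` and `h2` receive equal hub scores.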

The content-based approach computes the similarity score between a page and a predefined topic, much as in the vector model. A topic vector q is constructed from a sample of pages, and each Web page has its own vector p. The similarity between p and q is defined by the cosine similarity measure. This approach is susceptible to spam and ignores links.
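The cosine measure between the page vector p and the topic vector q can be sketched as follows. For brevity this toy version builds vectors from raw term counts rather than the weighted term vectors a real system would use.

```python
from collections import Counter
from math import sqrt

def cosine_similarity(page_text, topic_text):
    """Cosine of the angle between two bag-of-words term-count vectors."""
    p = Counter(page_text.lower().split())
    q = Counter(topic_text.lower().split())
    dot = sum(p[w] * q[w] for w in p)
    norm = sqrt(sum(c * c for c in p.values())) * sqrt(sum(c * c for c in q.values()))
    return dot / norm if norm else 0.0
```

A page with the same terms as the topic scores 1.0, and a page sharing no terms scores 0.0; the spam weakness mentioned above follows directly, since stuffing a page with topic terms inflates the score.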

Anchor text is the visible hyperlinked text on a page. The anchor-based approach defines a page's importance by pattern matching between the query vector and the URL's anchor text, the text around the anchor text (called the anchor window), and the URL's string value. This method works best for Web similarity-search tasks, but a poor choice of anchor window might lead to relevant documents being discarded.
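The matching against anchor text and the anchor window can be sketched as below. The scoring weights, the fixed character-based window, and the regex-based link extraction are all simplifying assumptions for illustration; a real system would parse HTML properly and also match against the URL string.

```python
import re

def anchor_score(html, query_terms, window=50):
    """Score a referring page for a query by matching terms against anchor
    text (weighted higher) and the surrounding anchor window."""
    score = 0
    for m in re.finditer(r'<a[^>]*>(.*?)</a>', html, re.IGNORECASE | re.DOTALL):
        anchor = m.group(1).lower()
        # anchor window: a fixed span of characters around the link
        start, end = max(0, m.start() - window), min(len(html), m.end() + window)
        context = html[start:end].lower()
        for term in query_terms:
            if term in anchor:
                score += 2   # match in the anchor text itself
            elif term in context:
                score += 1   # match only in the anchor window
    return score

page = 'See <a href="http://example.com">search engines</a> for ranking info.'
```

A term appearing in the anchor text itself ("search") scores higher than one found only in the surrounding window ("ranking"), and a window that is too narrow would miss the latter entirely, illustrating the discarding problem noted above.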