Information Retrieval

Information retrieval (IR) deals with the representation, storage, organization of, and access to information items. Relevance of a document is a subjective judgement and may be based on: The main criteria, however, is that an IR system should fulfill a users information need.
 * Subject/content of document
 * Staleness of information (is it recently published?)
 * Authoriativity of source (can the source be trusted?)
 * Does the information satisfy the intended use.

Basic Approaches
The simplest notion of relevance is that a string in the document is an exact match to the query, while a slightly less strict notion is that all (or a majority) of the words in the query appears frequently in the document, in any order. Problems with this approach is that it does not take synonyms into consideration; a search for "resturant" may not include documents with "café" appearing, or a search for "PRC" may not include documents with "China". It also doesn't take context into consideration, such as a query including "Apple" might return both documents about the fruit and the company.

IR System Compontents
An IR system consists of several components, which can be divided into two categories: text operations and indexing. Text operations forms index words, and consist of tokenization, stopword removal and stemming. Indexing constructs an inverted index of words to document pointers (mapping from keywords to document IDs).

Tokenizing
Tokenizing analyses text into a sequence of discrete tokens (or words). In some cases punctuations, numbers and case (Apple vs apple) can be a meaningful part of a token, but often this is not the case. Thus, the simplest approach to tokinization is to simply ignore all numbers and punctuation and use case-insensitive unbroken strings of alphabetic characters as tokens.

More careful approaches might separate special characters (! ? ; : " ' < > etc), while also being careful when handling punctionations (might be meaningful in an e.g. e-mail address or filename) and numbers (e.g. dates).

Stopwords
Stopwords are words that are filtered out when processing natural language text. Normally high-frequency words (e.g. "a", "the", "in", "to", "I", "he", "she", "it") are excluded, but it is not always done. Stopwords are language dependent, so determining what what should be considered a stopword can be a problem. For english, there are existing lists on the net (SMART's commonword list, WordNet).

Stemming
Stemming is to reduce words to their "root" form to recognize morphological variations. This means that words like "computer", "computational", "computation" are all reduced to the same token, "compute". Doing correct morphological analysis is language specific and can be very complex. Stemming blindly simply strips off known affixes in an iterative fashion.

Lemmatization
Lemmatization reduces variant form to base form, giving a direct impact on the size of the vocabulary (e.g. am, are, is are all reduced to be). The sentance "The boy's cars are different colors" would be reduced to "The boy car be different color". To do this, a list of grammatical rules and a list of irregular words are needed (latter to take care of cases like spoken, which should be reduced to speak). A practical way to implement this is to use WordNet's morphstr function.

IR Models
There are several different models for information retrieval.

Vector Model
In the vector model, a query is compared to the available documents, and partial matches are possible. The result is a ranked set of documents.

Distributed IR
In distributed IR there are two methods for data partitioning: document partitioning and term partitioning. When using document partitioning, the N documents are distributed across the P processors. An example is independentm heterogeneous search servers. In term partitioning, the t indexing items are distributed across the P processors.

Query Processing
The first step in processing a query is to select the collections to search through. This is done by determining which document collections that most likely contains relevant documents, characterization of resources (typically using word histogram), and resource ranking and selection, based on comparing the resource descriptions on a per query basis. Then the query is distributed to the selected collections, the query is evaluated (in parallel) at the collections, and the result is combined from the distributed collections into a final result.

One problem with merging the results into one final output is that there are multiple ranked lists of documents. One solution is to combine the ranked hit-lists using round robin interleaving, but a problem with this is that hits from irrelevant collections are equally important as hits from highly relevant collections. Another solution is to merge hit-lists based on relevance score, using a global term statistics to compute documents scores. A third solution is to re-rank documents based on the weights of the document collections.

Precision and Recall
Precision is the ability to retrieve top-ranked documents that are mostly relevant, and recall is the ability of the search to find all of the relevant items in the corpus. The recall is the number of relevant documents retrieved divided on the total number of documents, while precision is the number of relevant documents retrieved divided on the total number of documents retrieved. Note that knowing the total number of relevant documents on the Web is, in practice, not possible.

The recall and precision scores are often combined into a single measure, called the F-measure, which gives an average score of the systems information retrieval efficiency. To find the F-measure, one starts with a corpus (latin for 'body') of documents and then collects a set of queries for this corpus. The documents then have to be labeled manually (by human experts) to determine the relevant documents for each query. Thus, when testing the IR system on this corpus, the result set can be compared to the set of relevant documents found manually. Typically, binary relevance judgement is assumed. Naturally, this requires considerable human effort for large document/query corpora.

Difficulties in evaluating IR systems are that the effectiveness of the systems is related to the relevancy of retrieved items, and that relevancy is not typically binary, but continuous. Even if relevancy is binary, it may be a difficult judgement to make. Relevancy, from a human point of view, is subjective, situational, cognitive and dynamic; it depends on the user, relates to the users current needs, depends on the users perception and behaviour, and changes over time.