Document Retrieval Systems
– Document retrieval systems match text records against user queries
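– Consist of a database of documents, a classification algorithm that builds a full-text index, and a user interface for accessing the database
– Main tasks are finding documents relevant to a user query and evaluating the matching results so they can be sorted by relevance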
– Internet search engines are classical applications of document retrieval
– Range from simple Boolean systems to systems using statistical or natural language processing techniques
Indexing Schemata
– Two main classes of indexing schemata: form-based and content-based
– Form-based indexing addresses the exact syntactic properties of a text
– Content-based approach exploits semantic connections between documents and queries
– Most content-based systems use an inverted index algorithm
– Signature file is a technique that creates a quick filter for matching documents
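As a rough illustration of the inverted index mentioned above, the following Python sketch maps each term to the set of documents containing it and answers a simple Boolean AND query. The function names (tokenize, build_inverted_index, boolean_and) and the three sample records are assumptions made for this example, not part of any particular system.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text, split on whitespace, and strip basic punctuation."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [w for w in words if w]

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return the ids of documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample records, invented for illustration.
docs = {
    1: "Newspaper article about real estate prices",
    2: "Real estate records for the county",
    3: "A paragraph in a software manual",
}
index = build_inverted_index(docs)
print(boolean_and(index, "real estate"))  # {1, 2}
```

The lookup cost depends on the length of the posting sets for the query terms rather than on the size of the whole collection, which is the usual reason for preferring this structure over scanning every record.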
Form-based Indexing
– Addresses the exact syntactic properties of a text
– Text is generally unstructured and not necessarily in a natural language
– Used for processing large sets of chemical representations in molecular biology
– Suffix tree algorithm is an example of form-based indexing
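To make form-based matching concrete, the sketch below finds every exact occurrence of a pattern in a symbol string. It uses a suffix array, a simpler relative of the suffix tree, rather than a suffix tree itself, and builds it naively, so it is only suitable for small inputs. The function names and the sample sequence are assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes, ordered by suffix.

    Naive construction that sorts full suffix strings; fine for a demo,
    too slow for large collections.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        start = suffix_array[rank]
        return text[start:start + m]

    # Binary search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])

# Hypothetical symbol string, invented for illustration.
sequence = "GATTACAGATTACA"
suffix_array = build_suffix_array(sequence)
print(find_occurrences(sequence, suffix_array, "TTA"))  # [2, 9]
```

Nothing here depends on word boundaries or on the text being natural language, which is the point of the form-based approach: the match is purely syntactic.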
Content-based Indexing
– Exploits semantic connections between documents and queries
– Most content-based systems use an inverted index algorithm
– Signature file is a technique for creating a quick filter
– Signature files can beat inverted files in certain environments when their parameters are chosen properly
– Each file is reduced to a hash-coded version, its signature, which is matched against the query
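One common way to build such a filter is superimposed coding, sketched below: every word sets a few bits in a fixed-width bit string, a document's signature is the OR of its word signatures, and a document is a candidate only if it has every bit the query sets. The width of 64 bits, the three bits per word, and the use of SHA-256 are assumptions chosen for this example; the text above does not prescribe them.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bit positions hashed per word

def word_signature(word):
    """Hash a word to a few bit positions of the signature."""
    signature = 0
    for k in range(BITS_PER_WORD):
        digest = hashlib.sha256(f"{k}:{word}".encode("utf-8")).digest()
        position = int.from_bytes(digest[:4], "big") % SIGNATURE_BITS
        signature |= 1 << position
    return signature

def text_signature(text):
    """OR together the signatures of all words in the text."""
    signature = 0
    for word in text.lower().split():
        signature |= word_signature(word)
    return signature

def search(docs, signatures, query):
    """Filter by signature, then verify candidates against the real text."""
    query_signature = text_signature(query)
    query_words = set(query.lower().split())
    candidates = [
        doc_id for doc_id, signature in signatures.items()
        if (signature & query_signature) == query_signature
    ]
    # The filter can let through false matches, so each candidate is checked.
    return [d for d in candidates if query_words <= set(docs[d].lower().split())]

# Hypothetical sample records, invented for illustration.
docs = {
    1: "inverted index for text retrieval",
    2: "signature file as a quick filter",
    3: "suffix tree for exact matching",
}
signatures = {doc_id: text_signature(text) for doc_id, text in docs.items()}
print(search(docs, signatures, "quick filter"))  # [2]
```

The filter never rejects a true match, because a document containing the query words has all of their bits set. It may admit false ones, and how often that happens depends on the signature width and bits per word, which is where the parameter tuning comes in.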
Example: PubMed
– The PubMed form interface features a "related articles" search
– Related articles are found by comparing words from each document's title, abstract, and MeSH terms
– Uses a word-weighted algorithm for relevance ranking
– PubMed is a widely used document retrieval system
– Provides access to a vast collection of biomedical literature
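The text does not say how PubMed weights words, so the following sketch substitutes a generic TF-IDF weighting with cosine similarity to show the general idea of a word-weighted comparison across title, abstract, and MeSH terms. It is not PubMed's algorithm. The record layout, the function names, and the three sample records are all assumptions made for this example.

```python
import math
from collections import Counter

FIELDS = ("title", "abstract", "mesh")  # assumed record layout

def field_words(record):
    """Collect lowercase words from the title, abstract, and MeSH fields."""
    words = []
    for field in FIELDS:
        words.extend(record[field].lower().split())
    return words

def tfidf_vectors(records):
    """Weight each word by its count times a smoothed inverse document frequency."""
    counts = {rid: Counter(field_words(rec)) for rid, rec in records.items()}
    total = len(records)
    doc_freq = Counter()
    for word_counts in counts.values():
        doc_freq.update(word_counts.keys())
    return {
        rid: {
            word: tf * (math.log((1 + total) / (1 + doc_freq[word])) + 1.0)
            for word, tf in word_counts.items()
        }
        for rid, word_counts in counts.items()
    }

def cosine(a, b):
    """Cosine similarity between two sparse word-weight vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def related(records, target_id, top=2):
    """Return the ids of the records most similar to the target record."""
    vectors = tfidf_vectors(records)
    scores = [
        (cosine(vectors[target_id], vector), rid)
        for rid, vector in vectors.items() if rid != target_id
    ]
    return [rid for _, rid in sorted(scores, reverse=True)[:top]]

# Hypothetical records, invented for illustration.
records = {
    "a": {"title": "insulin resistance in type 2 diabetes",
          "abstract": "insulin signaling and glucose uptake",
          "mesh": "diabetes insulin"},
    "b": {"title": "glucose metabolism and insulin therapy",
          "abstract": "insulin dosing in diabetes patients",
          "mesh": "diabetes insulin glucose"},
    "c": {"title": "suffix trees for genome search",
          "abstract": "exact string matching in dna",
          "mesh": "genome algorithms"},
}
print(related(records, "a", top=1))  # ['b']
```

Words that appear in many records contribute less to the score than rare ones, so two records that share distinctive terms rank as more closely related than two that share only common words.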
Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.
Document retrieval is sometimes called text retrieval, or treated as a branch of it. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer. Text retrieval is a critical area of study today, since it is the basis of all internet search engines.