Glossary Term

Document retrieval

Document Retrieval Systems
- Document retrieval systems match text records against user queries.
- A system consists of a database of documents, a classification algorithm, and a user interface.
- The main tasks are finding documents relevant to a query and evaluating the matching results.
- Internet search engines are classical applications of document retrieval.
- Systems range from simple Boolean ones to those using statistical or natural language processing techniques.

Indexing Schemata
- There are two main classes of indexing schemata: form-based and content-based.

Form-based Indexing
- Addresses the exact syntactic properties of a text.
- The text is generally unstructured and not necessarily in a natural language.
- Can be used, for example, to process large sets of chemical representations in molecular biology.
- The suffix tree algorithm is an example of form-based indexing.

Content-based Indexing
- Exploits semantic connections between documents and queries.
- Most content-based systems use an inverted index algorithm.
- A signature file is a technique that creates a quick filter for matching documents.
- With well-chosen parameters, signature files can outperform inverted files in certain environments.
- Each file is represented by a hash-coded version (its signature) that is compared against the query.

Example: PubMed
- PubMed is a widely used document retrieval system that provides access to a vast collection of biomedical literature.
- Its form interface features a "related articles" search.
- The search compares words from each document's title, abstract, and MeSH terms.
- A word-weighted algorithm is used for relevance ranking.
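
The inverted index and signature file ideas mentioned under content-based indexing can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration, not a description of any particular system: the function names, the 64-bit signature width, the use of CRC32 as the hash, and the three sample documents. The inverted index answers a simple Boolean AND query exactly, while the signature filter only narrows the candidates and may let through false positives that still need to be checked.

```python
import re
import zlib
from collections import defaultdict

SIGNATURE_BITS = 64  # hypothetical signature width, chosen for illustration


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return ids of documents that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


def signature(text):
    """Hash each term to one bit and OR the bits into a single integer."""
    sig = 0
    for term in tokenize(text):
        # CRC32 is deterministic across runs, unlike the built-in str hash.
        sig |= 1 << (zlib.crc32(term.encode("utf-8")) % SIGNATURE_BITS)
    return sig


def signature_candidates(signatures, query):
    """Keep documents whose signature has every bit of the query signature.

    Different terms can share a bit, so this may return false positives,
    but it never drops a document that contains all query terms.
    """
    q = signature(query)
    return {doc_id for doc_id, s in signatures.items() if s & q == q}


docs = {
    1: "suffix trees index chemical strings",
    2: "inverted index for text retrieval",
    3: "signature files filter text documents",
}
index = build_inverted_index(docs)
signatures = {doc_id: signature(text) for doc_id, text in docs.items()}

print(boolean_and(index, "text index"))  # {2}
print(signature_candidates(signatures, "text index"))  # always includes 2
```

In this sketch the signature pass is a cheap pre-filter: a real lookup would verify each candidate against the document text or the index before returning it.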