Document retrieval

Document Retrieval Systems
– Document retrieval systems match free-text records against user queries
– Consist of a database of documents, a classification algorithm that builds a full-text index, and a user interface
– Main tasks are finding documents relevant to a query and ranking the matching results by relevance
– Internet search engines are classical applications of document retrieval
– Range from simple Boolean systems to systems using statistical or natural language processing techniques

Indexing Schemata
– Two main classes of indexing schemata: form-based and content-based
– Form-based indexing addresses the exact syntactic properties of a text
– Content-based approach exploits semantic connections between documents and queries
– Most content-based systems use an inverted index
– A signature file is an alternative technique that creates a quick filter for matching documents
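
As a rough illustration of the inverted index mentioned above, the sketch below maps each term to the set of documents containing it and answers a simple Boolean AND query. The function names (`tokenize`, `build_inverted_index`, `boolean_and`) and the three sample documents are made up for this example; real systems add stemming, stop-word handling, and compressed postings lists.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()") for w in text.lower().split())
    return [w for w in words if w]


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def boolean_and(index, query):
    """Return the sorted ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # .get() avoids creating empty entries for unknown terms.
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents, invented for illustration only.
docs = {
    1: "Newspaper articles about real estate prices",
    2: "Real estate records for the county",
    3: "Paragraphs in a user manual",
}
index = build_inverted_index(docs)
print(boolean_and(index, "real estate"))  # [1, 2]
```

The lookup touches only the postings of the query terms, which is why this structure scales better than scanning every document for every query.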

Form-based Indexing
– Addresses the exact syntactic properties of a text, comparable to substring matching in string searches
– Text is generally unstructured and not necessarily in a natural language
– Can be used, for example, to process large sets of chemical representations in molecular biology
– The suffix tree algorithm is an example of form-based indexing
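
To make the idea of form-based indexing concrete, here is a minimal sketch of exact substring lookup over an indexed text. It uses a suffix array rather than a suffix tree, because the array is a simpler structure that supports the same kind of query; the construction shown is the naive one and is only suitable for short strings. The function names and the sample sequence are assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order."""
    # Naive construction that compares whole suffixes; fine for a short demo.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions where pattern occurs in text."""
    m = len(pattern)
    # Binary search for the first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Binary search for the first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    # Every suffix in [start, lo) begins with the pattern.
    return sorted(suffix_array[start:lo])


# Hypothetical sequence, invented for illustration only.
sequence = "GATTACAGATTA"
suffix_array = build_suffix_array(sequence)
print(find_occurrences(sequence, suffix_array, "ATTA"))  # [1, 8]
```

Note that the match is purely syntactic: the index knows nothing about what the characters mean, which is what distinguishes this approach from content-based indexing.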

Content-based Indexing
– Exploits semantic connections between documents and queries
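– Most content-based systems use an inverted index
– A signature file is a technique for creating a quick filter (for example, a Bloom filter) that keeps every matching document and may let through a few non-matching ones
– Involves creating a signature, typically a hash-coded version, of each file for matching
– Usually inferior to inverted files, but can beat them in certain environments with proper parameters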
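
The signature file idea can be sketched as follows, assuming superimposed coding: each term is hashed to a few bit positions, and a document's signature is the bitwise OR of its term masks. The signature width, the number of bits per term, the function names, and the sample documents are all assumptions chosen for this example, not recommended parameters.

```python
import hashlib

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_TERM = 3    # assumed number of bit positions derived per term


def term_mask(term):
    """Hash a term to a bit mask with up to BITS_PER_TERM bits set."""
    digest = hashlib.sha256(term.lower().encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_TERM):
        # Derive each bit position from two bytes of the digest.
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask


def signature(text):
    """Superimpose (OR together) the masks of all terms in the text."""
    sig = 0
    for term in text.split():
        sig |= term_mask(term)
    return sig


def search(docs, signatures, query):
    """Return the sorted ids of documents that contain every query term."""
    terms = query.lower().split()
    query_sig = signature(query)
    # Filter step: keep documents whose signature covers every query bit.
    candidates = [doc_id for doc_id, sig in signatures.items()
                  if (sig & query_sig) == query_sig]
    # Post-processing step: check the actual text to discard false alarms.
    return sorted(doc_id for doc_id in candidates
                  if all(t in docs[doc_id].lower().split() for t in terms))


# Hypothetical sample documents, invented for illustration only.
docs = {
    1: "newspaper articles about real estate prices",
    2: "real estate records for the county",
    3: "paragraphs in a user manual",
}
signatures = {doc_id: signature(text) for doc_id, text in docs.items()}
print(search(docs, signatures, "estate records"))  # [2]
```

The filter can produce false positives when unrelated terms happen to set the same bits, but never false negatives, so the final text check only ever removes candidates.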

Example: PubMed
– The PubMed form interface features a "related articles" search
– Works by comparing words from the documents' titles, abstracts, and MeSH terms
– Uses a word-weighted algorithm for relevance ranking
– PubMed is a widely used document retrieval system
– Provides access to a vast collection of biomedical literature
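
The source does not describe PubMed's actual weighting scheme, so the sketch below substitutes a generic TF-IDF weighting with cosine similarity to show what a word-weighted comparison over title, abstract, and MeSH terms might look like. The record layout, the function names, and the three toy records are hypothetical and are not taken from PubMed.

```python
import math
from collections import Counter


def tokens(record):
    """Pool the words of a record's title, abstract, and MeSH terms."""
    parts = [record["title"], record["abstract"], " ".join(record["mesh"])]
    return " ".join(parts).lower().split()


def tfidf_vectors(records):
    """Return one {word: weight} vector per record id."""
    counts = {rid: Counter(tokens(rec)) for rid, rec in records.items()}
    n = len(records)
    # Number of records in which each word appears at least once.
    doc_freq = Counter(word for c in counts.values() for word in c)
    return {
        rid: {w: tf * math.log(n / doc_freq[w]) for w, tf in c.items()}
        for rid, c in counts.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse {word: weight} vectors."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def related(records, rid):
    """Rank the other records by similarity to record rid, best first."""
    vectors = tfidf_vectors(records)
    scores = {other: cosine(vectors[rid], vectors[other])
              for other in records if other != rid}
    return sorted(scores, key=scores.get, reverse=True)


# Toy records, invented for illustration only.
records = {
    "a": {"title": "Insulin therapy in diabetes",
          "abstract": "Insulin dosing for diabetes patients",
          "mesh": ["Insulin", "Diabetes Mellitus"]},
    "b": {"title": "Diabetes and insulin resistance",
          "abstract": "Mechanisms of insulin resistance",
          "mesh": ["Insulin Resistance", "Diabetes Mellitus"]},
    "c": {"title": "Suffix trees for genome search",
          "abstract": "Indexing genome sequences",
          "mesh": ["Genomics"]},
}
print(related(records, "a")[0])  # b
```

Words that are rare across the collection get higher weights, so two records sharing a distinctive term score as more related than two records sharing only common words.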

Document retrieval (Wikipedia)

Document retrieval is defined as the matching of some stated user query against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. User queries can range from multi-sentence full descriptions of an information need to a few words.

Document retrieval is sometimes referred to as, or as a branch of, text retrieval. Text retrieval is a branch of information retrieval where the information is stored primarily in the form of text. Text databases became decentralized thanks to the personal computer. Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines.
