Glossary Term
Document retrieval
Document Retrieval Systems
- Document retrieval systems match text records against user queries
- Consist of a database of documents, a classification algorithm that builds a full-text index, and a user interface for accessing the database
- Main tasks are finding documents relevant to a user query and evaluating the matches, sorting them by relevance
- Internet search engines are classical applications of document retrieval
- Range from simple Boolean systems to systems using statistical or natural language processing techniques
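As a rough illustration of the simple Boolean end of that range, the sketch below checks each document's word set against AND, OR, and NOT terms. The function names, the `must`/`should`/`must_not` parameters, and the sample documents are assumptions made for this example, not part of any particular system.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def boolean_search(docs, must=(), should=(), must_not=()):
    """Return sorted ids of documents satisfying a simple Boolean query.

    must:     every term has to occur (AND)
    should:   at least one term has to occur, if any are given (OR)
    must_not: no term may occur (NOT)
    """
    hits = []
    for doc_id, text in docs.items():
        words = tokenize(text)
        if not all(term in words for term in must):
            continue
        if should and not any(term in words for term in should):
            continue
        if any(term in words for term in must_not):
            continue
        hits.append(doc_id)
    return sorted(hits)

# Hypothetical sample collection.
docs = {
    1: "Suffix trees support fast substring search",
    2: "Inverted index structures power web search engines",
    3: "Signature files act as a quick filter",
}

print(boolean_search(docs, must=["search"]))                    # [1, 2]
print(boolean_search(docs, must=["search"], must_not=["web"]))  # [1]
```

A Boolean system of this kind only decides whether a document matches; it does not rank the matches, which is where the statistical and language-processing techniques come in.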
Indexing Schemata
- Two main classes of indexing schemata: form-based and content-based
- Form-based indexing addresses the exact syntactic properties of a text
- Content-based approach exploits semantic connections between documents and queries
- Most content-based systems use an inverted index algorithm
- A signature file is an alternative technique that creates a quick filter for candidate documents
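The inverted index mentioned in this list can be sketched in a few lines: each term maps to the set of documents that contain it, and a multi-term query intersects those sets. This is a minimal sketch; the names `build_inverted_index` and `lookup_all` and the sample documents are assumptions for illustration.

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(doc_id)
    return index

def lookup_all(index, terms):
    """Return sorted ids of documents that contain every query term."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

# Hypothetical sample collection.
docs = {
    1: "document retrieval with an inverted index",
    2: "signature file as a quick filter",
    3: "inverted index versus signature file",
}
index = build_inverted_index(docs)

print(lookup_all(index, ["inverted", "index"]))   # [1, 3]
print(lookup_all(index, ["signature", "index"]))  # [3]
```

The index is built once, so a query touches only the posting sets of its own terms instead of scanning every document.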
Form-based Indexing
- Addresses the exact syntactic properties of a text, comparable to substring matching in string searches
- Text is generally unstructured and not necessarily in a natural language
- Can be used, for example, to process large sets of chemical representations in molecular biology
- The suffix tree algorithm is an example of form-based indexing
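To make exact-substring indexing concrete, here is a minimal sketch that uses a suffix array, a simpler relative of the suffix tree, rather than a suffix tree itself. The naive construction, the function names, and the DNA-like sample string are assumptions chosen for brevity.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction that sorts the suffix strings directly; fine for a
    small illustration, too slow and memory-hungry for large inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, suffix_array, pattern):
    """Return sorted start positions where pattern occurs in text."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix at this rank.
        start = suffix_array[rank]
        return text[start:start + m]

    # Binary search for the first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])

# Hypothetical sequence, not natural-language text.
sequence = "ACGTACGGACG"
suffixes = build_suffix_array(sequence)

print(find_occurrences(sequence, suffixes, "ACG"))  # [0, 4, 8]
print(find_occurrences(sequence, suffixes, "GG"))   # [6]
```

Because the suffixes are sorted, all occurrences of a pattern sit in one contiguous run of the array, so two binary searches are enough to locate them.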
Content-based Indexing
- Exploits semantic connections between documents and queries
- Most content-based systems use an inverted index algorithm
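- A signature file is an alternative technique that creates a quick filter: it keeps every document that matches the query, possibly along with a few that do not
- Each file gets a signature, typically a hash-coded version of its contents, and a post-processing step discards the false matches
- Signature files are usually inferior to inverted files in speed, size, and functionality, but can beat them in certain environments with proper parameters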
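A signature file can be sketched with superimposed coding: each word hashes to a few bit positions, and the word masks are OR-ed into one signature per document. The signature width, the number of bits per word, the use of SHA-256, and all names below are assumptions picked for illustration, not tuned parameters.

```python
import hashlib
import re

SIGNATURE_BITS = 64  # assumed signature width
BITS_PER_WORD = 3    # assumed number of bit positions drawn per word

def words_of(text):
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def word_mask(word):
    """Hash a word to a bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    mask = 0
    for i in range(BITS_PER_WORD):
        # Two digest bytes decide each bit position.
        position = int.from_bytes(digest[2 * i:2 * i + 2], "big") % SIGNATURE_BITS
        mask |= 1 << position
    return mask

def signature(text):
    """Superimpose (OR together) the masks of all words in the text."""
    sig = 0
    for word in words_of(text):
        sig |= word_mask(word)
    return sig

def search(docs, signatures, query):
    """Filter by signature, then verify the candidates against the text."""
    query_terms = words_of(query)
    query_sig = signature(query)
    # Quick filter: every query bit must be set in the document signature.
    candidates = [d for d, sig in signatures.items() if (sig & query_sig) == query_sig]
    # Post-processing: discard false alarms caused by overlapping bits.
    return sorted(d for d in candidates if query_terms <= words_of(docs[d]))

# Hypothetical sample collection.
docs = {
    1: "Suffix trees support fast substring search",
    2: "Inverted index structures power web search engines",
    3: "Signature files act as a quick filter",
}
signatures = {doc_id: signature(text) for doc_id, text in docs.items()}

print(search(docs, signatures, "quick filter"))  # [3]
```

The filter never misses a true match, because a document's signature contains the bits of every word it holds; it can only let through extra candidates, which the verification step removes.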
Example: PubMed
- The PubMed form interface features a "related articles" search
- It works by comparing words from each document's title, abstract, and MeSH (Medical Subject Headings) terms
- The comparison uses a word-weighted algorithm to rank results by relevance
- PubMed is a widely used document retrieval system
- Provides access to a vast collection of biomedical literature
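The general idea of a word-weighted comparison across title, abstract, and MeSH terms can be sketched as follows. This is not PubMed's actual algorithm: the field weights, the inverse-document-frequency weighting, the function names, and the sample records are all assumptions made purely for illustration.

```python
import math
import re
from collections import Counter

# Assumed field weights; the real weighting is not described in this entry.
FIELD_WEIGHTS = {"title": 2.0, "abstract": 1.0, "mesh": 3.0}

def weighted_terms(record):
    """Return term weights for one record (field weight times term count)."""
    weights = Counter()
    for field, field_weight in FIELD_WEIGHTS.items():
        for term in re.findall(r"[a-z0-9]+", record.get(field, "").lower()):
            weights[term] += field_weight
    return weights

def related(records, source_id, top_n=3):
    """Rank the other records by weighted word overlap with the source."""
    vectors = {rid: weighted_terms(rec) for rid, rec in records.items()}
    total = len(records)
    # Number of records in which each term appears.
    doc_freq = Counter(term for vec in vectors.values() for term in vec)

    def idf(term):
        # Words found in fewer records count for more.
        return math.log((1 + total) / (1 + doc_freq[term])) + 1.0

    source = vectors[source_id]
    scores = {}
    for rid, vec in vectors.items():
        if rid == source_id:
            continue
        shared = source.keys() & vec.keys()
        scores[rid] = sum(idf(t) * source[t] * vec[t] for t in shared)
    return sorted(scores, key=lambda rid: (-scores[rid], rid))[:top_n]

# Hypothetical records, not real PubMed entries.
records = {
    "a": {"title": "Suffix trees for genome search",
          "abstract": "We index genome sequences",
          "mesh": "Genome Algorithms"},
    "b": {"title": "Genome sequence indexing",
          "abstract": "Fast genome search with suffix arrays",
          "mesh": "Genome Algorithms"},
    "c": {"title": "Clinical trial of aspirin",
          "abstract": "Aspirin reduces risk",
          "mesh": "Aspirin"},
}

print(related(records, "a"))  # ['b', 'c']
```

Record "b" ranks first because it shares several heavily weighted words with "a", while "c" shares none and scores zero.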