Full-text search
Full-text search and indexing
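- When the number of documents is large or the volume of search queries is substantial, full-text search is divided into two tasks: indexing and searching.
- The indexing stage scans the text of all documents and builds a list of search terms, called an index. The search stage then consults the index instead of the original text.
- Stop words, which are words too common to carry useful meaning (such as "the" and "and"), are ignored during indexing.
- Language-specific stemming records related word forms under a single index entry; for example, "drives", "drove", and "driven" may all be indexed under "drive".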
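The indexing and search stages described above can be sketched as a small inverted index. This is a minimal illustration, not how any particular search engine works: the stop-word list, the suffix-stripping `stem` function, and the sample documents are all hypothetical stand-ins (a real system would use a proper language-specific stemmer).

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; real lists are larger and language-specific.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "are", "to"}


def stem(word: str) -> str:
    """Crude suffix stripping, standing in for a real language-specific stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumeric characters, drop stop words, stem the rest."""
    words = "".join(c if c.isalnum() else " " for c in text.lower()).split()
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Indexing stage: map each stemmed term to the set of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return dict(index)


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Search stage: return the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The cat is searching the index",
    2: "Cats and dogs",
    3: "Searches of the indexes",
}
index = build_index(docs)
print(search(index, "searching cats"))     # {1}
print(sorted(search(index, "indexes")))    # [1, 3]
```

Because the query passes through the same `tokenize` function as the documents, "indexes" in the query matches "index" in document 1, which is the effect stemming is meant to have.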
Precision vs. recall tradeoff
- Recall measures the quantity of relevant results returned by a search (the fraction of all relevant documents that are retrieved), while precision measures the quality of the results (the fraction of retrieved documents that are relevant).
- A low-precision, low-recall search returns few of the relevant documents, along with many irrelevant ones.
- Full-text search systems typically offer options such as stop words to increase precision and stemming to increase recall.
- Controlled-vocabulary searching helps eliminate ambiguities and improve precision.
- There is a trade-off between precision and recall: increasing precision may lower recall and vice versa.
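The trade-off can be made concrete with a short calculation. The sketch below uses the standard set-based definitions of precision and recall; the document IDs and the two result sets are made up for illustration.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Return (precision, recall) for one search result set."""
    hits = len(retrieved & relevant)                      # relevant documents actually returned
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}              # hypothetical ground truth
narrow = {1, 2}                      # a strict query: everything returned is relevant
broad = {1, 2, 3, 4, 5, 6, 7, 8}     # a loose query: finds everything, plus noise

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The strict query has perfect precision but misses half the relevant documents; the loose query finds them all but half of what it returns is irrelevant.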
False-positive problem
- Full-text searching often retrieves irrelevant documents, called false positives.
- False positives are caused by the inherent ambiguity of natural language.
- Clustering techniques based on Bayesian algorithms can reduce false positives.
- Clustering categorizes documents based on relevant words, improving search results.
- This technique is extensively used in the e-discovery domain.
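As a rough illustration of how word-based categorization can separate the senses of an ambiguous term, the sketch below uses a small multinomial naive Bayes classifier with add-one smoothing. Note the substitution: this is a supervised classifier trained on labeled examples, used here as a simple stand-in for the Bayesian clustering techniques mentioned above, not a reproduction of them. The categories, training texts, and function names are all hypothetical.

```python
import math
from collections import Counter


def train(labeled_docs):
    """Count words per category and documents per category."""
    word_counts = {}
    doc_counts = Counter()
    for category, text in labeled_docs:
        doc_counts[category] += 1
        word_counts.setdefault(category, Counter()).update(text.lower().split())
    return word_counts, doc_counts


def classify(text, word_counts, doc_counts):
    """Return the category with the highest posterior log-probability."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        score = math.log(doc_counts[category] / total_docs)   # log prior
        total_words = sum(counts.values())
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical labeled examples for the ambiguous search term "jaguar".
training = [
    ("animal", "jaguar hunts prey in the rainforest"),
    ("animal", "the jaguar is a big cat"),
    ("car", "jaguar unveils new luxury sedan"),
    ("car", "the jaguar engine and sedan interior"),
]
word_counts, doc_counts = train(training)

results = ["jaguar cat in the rainforest", "jaguar sedan engine review"]
categories = [classify(r, word_counts, doc_counts) for r in results]
print(categories)  # ['animal', 'car']
```

A user searching for the animal could then be shown only results in the "animal" group, which removes the car-related false positives.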
Performance improvements and improved querying tools
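- The deficiencies of full-text searching are addressed in part by giving users querying tools that let them express their search questions more precisely.
- Keywords are a list of words, supplied by document creators or trained indexers, that describe the subject of the text, including synonyms. They improve recall, particularly when the list includes a search word that does not appear in the document text.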
- Field-restricted search limits searches to a specific field within a data record.
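- Boolean queries combine terms with operators such as AND, OR, and NOT. AND and NOT narrow the result set and can sharply increase precision, while OR broadens it and increases recall.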
- Phrase search matches documents containing a specified phrase.
- Concept search matches on multi-word concepts, for example by using compound term processing.
- Concordance search produces an alphabetical list of principal words with their context.
- Proximity search matches only documents in which two or more words occur within a specified number of words of each other.
- Regular-expression search employs a complex but powerful querying syntax for specifying retrieval conditions precisely.
- Fuzzy search looks for documents that match the given terms or close variations of them, for instance terms within a given edit distance.
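Several of the query types listed above can be sketched on top of a positional index, which records where each word occurs in each document. This is a minimal sketch under stated assumptions: the sample documents, the function names, and the exact proximity rule (word positions differing by at most `k`) are illustrative choices, and real query languages define these operators in their own ways.

```python
from collections import defaultdict

docs = {
    1: "wikipedia is a free online encyclopedia",
    2: "encarta was an online encyclopedia",
    3: "free software for searching text",
}

# Positional index: term -> {doc_id: [word positions]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        index[word].setdefault(doc_id, []).append(pos)


def docs_with(term):
    """Set of documents containing the term; Boolean operators map onto set operations."""
    return set(index.get(term, {}))


def phrase(words):
    """Phrase search: documents where the words occur consecutively and in order."""
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, []) for i, w in enumerate(words)):
                hits.add(doc_id)
    return hits


def within(a, b, k):
    """Proximity search: documents where a and b occur at most k positions apart."""
    hits = set()
    for doc_id in docs_with(a) & docs_with(b):
        if any(abs(pa - pb) <= k for pa in index[a][doc_id] for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def fuzzy(term, max_dist=1):
    """Fuzzy search: documents containing any indexed term within max_dist edits."""
    hits = set()
    for candidate in list(index):
        if edit_distance(term, candidate) <= max_dist:
            hits |= docs_with(candidate)
    return hits


# "encyclopedia" AND "online" NOT "encarta"
print(docs_with("encyclopedia") & docs_with("online") - docs_with("encarta"))  # {1}
print(sorted(phrase(["online", "encyclopedia"])))   # [1, 2]
print(within("wikipedia", "free", 3))               # {1}
print(sorted(fuzzy("encylopedia")))                 # [1, 2]
```

Storing positions rather than just document IDs costs more space, but it is what makes phrase and proximity queries answerable from the index alone.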
Software and references
- Thunderstone Software LLC
- Vespa
- Vivísimo
- In practice, it may be difficult to determine how a given search engine works.
- The search algorithms employed by web-search services are seldom fully disclosed.
- Capabilities of Full Text Search System (Archived from the original on December 23, 2010)
- Coles, Michael (2008). Pro Full-Text Search in SQL Server 2008 (1st ed.). Apress Publishing Company. ISBN 978-1-4302-1594-3.
- Yuwono, B.; Lee, D. L. (1996). Search and ranking algorithms for locating resources on the World Wide Web. 12th International Conference on Data Engineering (ICDE'96). p. 164.