Full-text search and indexing
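– When the number of documents is large or the volume of search queries is substantial, full-text search is divided into two tasks: indexing and searching.
– The indexing stage scans the text of all documents and builds a list of search terms, called an index; the search stage then consults only the index rather than the original text.
– Stop words, which are words such as "the" and "and" that are too common to be useful in searching, are ignored during indexing.
– Language-specific stemming records related word forms under a single index entry; for example, "drives", "drove", and "driven" are recorded under "drive".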
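The two stages can be sketched in a few lines of Python. This is a minimal illustration, not how any particular engine works: the stop-word list is a tiny made-up sample, `stem` is a toy suffix stripper standing in for a real language-specific stemmer, and the function names (`tokenize`, `build_index`, `search`) are chosen here for illustration only.

```python
import re
from collections import defaultdict

# Illustrative stop-word list (assumption); real lists are longer and language-specific.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def stem(word: str) -> str:
    """Toy suffix stripper standing in for a real language-specific stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text: str) -> list[str]:
    """Lowercase, split into words, drop stop words, and stem what remains."""
    words = re.findall(r"[a-z]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(docs: dict[int, str]) -> dict[str, set[int]]:
    """Indexing stage: map each index term to the ids of documents containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index: dict[str, set[int]], query: str) -> set[int]:
    """Search stage: consult only the index; return documents matching every term."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


docs = {
    1: "The engine indexes the documents",
    2: "Searching an indexed document is fast",
    3: "Stop words are ignored",
}
index = build_index(docs)
print(sorted(search(index, "indexing documents")))  # [1, 2]
```

The query "indexing documents" matches documents 1 and 2 even though neither contains the word "indexing", because "indexes", "indexed", and "indexing" all reduce to the same index entry.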
Precision vs. recall tradeoff
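– Recall measures the quantity of relevant results returned by a search (the share of all relevant documents that are retrieved), while precision measures the quality of the results (the share of retrieved documents that are relevant).
– A low-precision, low-recall search returns only a small number of the relevant documents, mixed with many irrelevant ones.
– Full-text search systems typically offer options such as stop words to increase precision and stemming to increase recall.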
– Controlled-vocabulary searching helps eliminate ambiguities and improve precision.
– There is a trade-off between precision and recall: increasing precision may lower recall and vice versa.
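The trade-off is easiest to see with numbers. The sketch below computes both measures for a single query from two sets of document ids; the ids and the "narrow" and "broad" result sets are invented for illustration.

```python
def precision_recall(retrieved: set[int], relevant: set[int]) -> tuple[float, float]:
    """Return (precision, recall) for one query.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {1, 2, 3, 4}            # documents judged relevant (made-up ids)
narrow = {1, 2}                    # strict query: few results, all relevant
broad = {1, 2, 3, 4, 5, 6, 7, 8}   # loose query: every relevant document plus noise

print(precision_recall(narrow, relevant))  # (1.0, 0.5)
print(precision_recall(broad, relevant))   # (0.5, 1.0)
```

The narrow query is perfectly precise but misses half of the relevant documents, while the broad query finds all of them at the cost of returning as many irrelevant results as relevant ones.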
False-positive problem
– Full-text searching often retrieves irrelevant documents, called false positives.
– False positives are caused by the inherent ambiguity of natural language.
– Clustering techniques based on Bayesian algorithms can reduce false positives.
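– Clustering categorizes documents by the occurrence of words relevant to each category; for an ambiguous search term such as "bank", documents about financial institutions can be separated from documents that use the word in other senses.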
– This technique is extensively used in the e-discovery domain.
Performance improvements and improved querying tools
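– The deficiencies of full-text searching are addressed in part by giving users tools that let them express their search questions more precisely.
– Keywords are subject-describing words, including synonyms, supplied by document creators or trained indexers; they improve recall, particularly when the keyword list includes a search word that does not appear in the document text.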
– Field-restricted search limits searches to a specific field within a data record.
– Boolean queries combine terms with operators such as AND, OR, and NOT; AND and NOT narrow the result set and can greatly increase precision, while OR broadens it.
– Phrase search matches documents containing a specified phrase.
– Concept search matches on multi-word concepts, for example by using compound term processing.
– Concordance search produces an alphabetical list of principal words with their context.
– Proximity search matches documents in which two or more words occur within a specified number of words of each other.
– Regular expression search uses a complex but powerful query syntax to state retrieval conditions precisely.
– Fuzzy search matches documents containing the given terms or close variations of them, for example terms within a small edit distance.
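Several of these query types can be illustrated by scanning a single document directly. The sketch below is a simplified illustration under stated assumptions: the function names are hypothetical, proximity is defined here as the word positions differing by at most `distance`, and fuzzy matching uses Levenshtein edit distance as one possible measure of variation. Production engines evaluate such queries against an index rather than raw text.

```python
import re


def words(text: str) -> list[str]:
    """Lowercase word tokens in document order."""
    return re.findall(r"[a-z]+", text.lower())


def boolean_match(text: str, must: set[str], must_not: set[str]) -> bool:
    """Boolean query: every word in `must` (AND) and no word in `must_not` (NOT)."""
    present = set(words(text))
    return must <= present and not (must_not & present)


def phrase_match(text: str, phrase: str) -> bool:
    """Phrase search: the words of `phrase` occur consecutively and in order."""
    tokens, target = words(text), words(phrase)
    n = len(target)
    return n > 0 and any(tokens[i:i + n] == target for i in range(len(tokens) - n + 1))


def within(text: str, first: str, second: str, distance: int) -> bool:
    """Proximity search: some occurrences of the two words are at most
    `distance` positions apart (definition assumed for this sketch)."""
    tokens = words(text)
    pos_a = [i for i, w in enumerate(tokens) if w == first]
    pos_b = [i for i, w in enumerate(tokens) if w == second]
    return any(abs(i - j) <= distance for i in pos_a for j in pos_b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with one rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def fuzzy_match(text: str, term: str, max_edits: int = 1) -> bool:
    """Fuzzy search: some word in the text is within `max_edits` of `term`."""
    return any(edit_distance(w, term) <= max_edits for w in words(text))


doc = "Wikipedia is a free online encyclopedia"
print(boolean_match(doc, {"encyclopedia", "online"}, {"encarta"}))  # True
print(phrase_match(doc, "free online"))                             # True
print(within(doc, "wikipedia", "free", 2))   # False: three positions apart
print(fuzzy_match(doc, "encyclopaedia"))     # True: one edit from "encyclopedia"
```

Regular expression search is not sketched separately because it maps directly onto a pattern-matching library such as Python's `re` module, which `words` already uses for tokenizing.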
Software and references
– Thunderstone Software LLC
– Vespa
– Vivísimo
– [Other software products for full-text indexing and searching]
– In practice, it may be difficult to determine how a given search engine works.
– The search algorithms employed by web-search services are seldom fully disclosed.
In text retrieval, full-text search refers to techniques for searching a single computer-stored document or a collection in a full-text database. Full-text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections, or bibliographical references).
In a full-text search, a search engine examines all of the words in every stored document as it tries to match search criteria (for example, text specified by a user). Full-text-searching techniques appeared in the 1960s (for example, IBM STAIRS in 1969) and became common in online bibliographic databases in the 1990s. Many websites and application programs (such as word processing software) provide full-text-search capabilities. Some web search engines, such as the former AltaVista, employ full-text-search techniques, while others index only a portion of the web pages examined by their indexing systems.