Glossary Term
Enterprise search
Phases of an enterprise search system
- Content goes through various phases from source repository to search results.
- Content awareness is the first phase, in which the search system learns about content through either a push or a pull model.
- Content processing and analysis is the second phase, where different formats are processed and normalized.
- Indexing is the third phase, where the processed text is stored in an optimized index.
- Query processing is the fourth phase, in which the user's query, made up of search terms and any navigational actions, is processed and matched against the index.
Content awareness
- Content awareness involves either a push or pull model.
- In the push model, new content is directly pushed to the search engine's APIs.
- The pull model gathers content using connectors like web crawlers or database connectors.
- Connectors typically poll the source at intervals to find new, updated, or deleted content.
- The push model is used when real-time indexing is important.
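The polling behaviour of a pull connector can be sketched as follows. This is a minimal illustration, not any particular product's connector: the `poll` and `fingerprint` functions and the in-memory `repository` dictionary are hypothetical stand-ins for a real source such as a file share or database, and change detection by content hash is one assumed strategy among several (timestamps or change logs are also common).

```python
import hashlib


def fingerprint(text: str) -> str:
    """Content hash used to detect that a document has changed."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def poll(source: dict[str, str], seen: dict[str, str]) -> dict[str, list[str]]:
    """Compare the source against the fingerprints recorded on the last poll.

    `source` maps document id -> current text (hypothetical repository).
    `seen` maps document id -> fingerprint and is updated in place.
    """
    changes: dict[str, list[str]] = {"new": [], "updated": [], "deleted": []}
    for doc_id, text in source.items():
        digest = fingerprint(text)
        if doc_id not in seen:
            changes["new"].append(doc_id)
        elif seen[doc_id] != digest:
            changes["updated"].append(doc_id)
        seen[doc_id] = digest
    # Anything remembered from the last poll but now missing was deleted.
    for doc_id in list(seen):
        if doc_id not in source:
            changes["deleted"].append(doc_id)
            del seen[doc_id]
    return changes


repository = {"a": "quarterly report", "b": "travel policy"}
state: dict[str, str] = {}

first = poll(repository, state)   # everything is new on the first poll

repository["a"] = "quarterly report v2"   # update
del repository["b"]                       # delete
repository["c"] = "onboarding guide"      # add

second = poll(repository, state)
print(first)
print(second)
```

In a push model the same three event types would instead be sent by the source system to the search engine as they happen, which is why push suits real-time indexing.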
Content processing and analysis
- Content from different sources may have various formats and document types.
- The content processing phase converts documents to plain text using filters.
- Normalization of content is often necessary to improve recall or precision.
- Normalization techniques include stemming, lemmatization, synonym expansion, and entity extraction.
- Tokenization is applied to split the content into basic matching units called tokens.
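As a rough sketch of the processing steps listed above, the following applies tokenization, a crude form of stemming, and synonym expansion to already-extracted plain text. Everything here is illustrative: the `SYNONYMS` table is invented, and the suffix-stripping `stem` function is a deliberately naive stand-in for a real stemmer or lemmatizer.

```python
import re

# Hypothetical synonym table, keyed by stemmed term.
SYNONYMS = {"laptop": ["notebook"]}


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token: str) -> str:
    """Strip one common suffix; a toy substitute for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text: str) -> list[str]:
    """Tokenize, stem, and expand synonyms, in that order."""
    terms: list[str] = []
    for token in tokenize(text):
        base = stem(token)
        terms.append(base)
        terms.extend(SYNONYMS.get(base, []))
    return terms


terms = analyze("Searching laptops")
print(terms)
```

Stemming and synonym expansion both trade precision for recall: more documents match a given query term, at the cost of some looser matches.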
Indexing
- The resulting text is stored in an index optimized for quick lookups.
- The index contains a dictionary of all unique words in the corpus.
- Information about ranking and term frequency is also stored in the index.
- The index is built for lookups and typically does not store the full text of the documents.
- Because lookups go through the dictionary instead of scanning every document, the index enables efficient retrieval of relevant documents.
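One common structure matching this description is an inverted index. The sketch below is a minimal in-memory version under simplifying assumptions: whitespace tokenization, no normalization beyond lowercasing, and a hypothetical `build_index` helper. It maps each unique term to the documents containing it, along with the term frequency in each.

```python
from collections import defaultdict


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each term to {document id: term frequency}."""
    index: defaultdict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {
    "d1": "search engine index",
    "d2": "index the index",
}
index = build_index(docs)

# The keys form the dictionary of unique terms in the corpus.
print(sorted(index))
# Each entry records which documents contain the term and how often.
print(index["index"])
```

Note that the structure holds terms and counts, not the original sentences, so looking up a term is a single dictionary access regardless of corpus size.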
Query processing and matching
- Users issue queries to the system; a query consists of search terms plus any navigational actions, such as faceting or paging.
- The processed query is compared to the stored index.
- The search system returns results referencing source documents that match the query.
- Some systems can present the document as it was indexed.
- Matching algorithms determine which documents satisfy the query, and ranking orders those results by relevance.
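The matching step can be illustrated with a small self-contained sketch. It assumes the query is processed the same way as the indexed content, and it scores documents by summed raw term frequency, which is a simplification chosen for brevity; production systems typically use weighting schemes such as TF-IDF or BM25. The `analyze`, `build_index`, and `search` names are hypothetical.

```python
from collections import Counter, defaultdict


def analyze(text: str) -> list[str]:
    """Shared processing for documents and queries (lowercase + split)."""
    return text.lower().split()


def build_index(docs: dict[str, str]) -> dict[str, dict[str, int]]:
    """Map each term to {document id: term frequency}."""
    index: defaultdict[str, dict[str, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


def search(index: dict[str, dict[str, int]], query: str) -> list[tuple[str, int]]:
    """Return (document id, score) pairs, best score first."""
    scores: Counter[str] = Counter()
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    # Sort by descending score, then by id so ties are deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


docs = {
    "d1": "expense policy for travel",
    "d2": "travel booking and travel insurance",
    "d3": "holiday calendar",
}
index = build_index(docs)
results = search(index, "Travel insurance")
print(results)
```

The results are references to source documents (their ids), not the documents themselves; a system that also stores the processed text could return that as well.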