
Search engine indexing

Indexing and Index Design Factors
- The purpose of storing an index is to optimize the speed and performance of finding relevant documents for a search query
- Without an index, the search engine would have to scan every document in the corpus, requiring considerable time and computing power
- An index over 10,000 documents can be queried within milliseconds, while a sequential scan of every word in 10,000 large documents could take hours
- Additional computer storage is required to store the index
- Time saved during information retrieval is traded off against the time required for each update to take place
- Merge factors: how data enters the index, and whether multiple indexers can work asynchronously
- Storage techniques: how to store the index data, whether compressed or filtered
- Index size: the amount of computer storage required to support the index
- Lookup speed: how quickly a word can be found in the inverted index
- Maintenance: how the index is maintained over time, including dealing with index corruption and bad hardware

Index Data Structures
- Suffix tree: structured like a tree, supports linear-time lookup; used for searching patterns in DNA sequences
- Inverted index: stores a list of occurrences of each atomic search criterion
- Citation index: stores citations or hyperlinks between documents to support citation analysis
- n-gram index: stores sequences of data of length n to support other types of retrieval or text mining
- Document-term matrix: used in latent semantic analysis; stores the occurrences of words in documents in a two-dimensional sparse matrix

Challenges in Parallelism and Inverted Indices
- The management of serial computing processes is a major challenge in search engine design
- Race conditions and coherence faults are common because competing tasks share the index
- Distributed storage and processing magnify the challenge: search engines may involve distributed computing to scale with larger amounts of indexed information
- Synchronization and maintaining a fully parallel architecture become more difficult as the system grows
- An inverted index is used to quickly locate the documents containing the words in a search query; it stores a list of the documents containing each word
- A purely Boolean index determines which documents match a query but does not rank them
- Adding position information enables searching for phrases, and term frequency helps in ranking relevance
- The inverted index is a sparse matrix and is typically implemented as a form of hash table or binary tree

Document Parsing, Tokenization, and Language Recognition
- Document parsing breaks apart the components (words) of a document; the words found are called tokens
- In search engine indexing, tokenization is commonly referred to as parsing
- Tokenization involves multiple technologies, and implementations are commonly kept as corporate secrets
- Natural language processing is continuously researched and improved
- Word boundary ambiguity poses a challenge in tokenization
- Language ambiguity affects ranking and the collection of additional information
- Diverse file formats must be handled correctly for tokenization
- Faulty storage can degrade index quality or indexer performance
- Multilingual indexing requires language-specific logic and parsers
- Computers do not automatically recognize the words and sentences in a document; tokenization requires programming the computer to identify tokens
- Tokens can have characteristics such as case, language, position, and length
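The inverted index described above can be sketched in a few lines of Python. This is an illustrative toy, not any particular engine's implementation: the regex tokenizer and the sample documents are assumptions for the demo. Storing token positions per document is what supports phrase queries, and the length of each position list supplies a term frequency for ranking:

```python
import re
from collections import defaultdict

def tokenize(text):
    """Naive tokenizer: lowercase, split on non-word characters.
    Real engines use far more elaborate, language-aware tokenization."""
    return re.findall(r"\w+", text.lower())

def build_inverted_index(docs):
    """Map each token to {doc_id: [positions]}.

    Positions enable phrase search; len(positions) gives the
    term frequency used for relevance ranking."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(pos)
    return index

def search(index, query):
    """Boolean AND query: documents containing every query term."""
    postings = [set(index.get(t, {})) for t in tokenize(query)]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "The quick brown fox",
    2: "The lazy dog and the quick fox",
}
index = build_inverted_index(docs)
print(sorted(search(index, "quick fox")))  # -> [1, 2]
```

Using a Python dict here mirrors the text's observation that an inverted index is commonly realized as a hash table: lookup of a word's posting list is a single hash probe rather than a scan of the corpus.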
- Parsers can identify entities such as email addresses, phone numbers, and URLs
- Specialized programs such as YACC or Lex are used for parsing
- Language recognition categorizes the language of a document; it is an initial step in tokenization when supporting multiple languages
- Language recognition is language-dependent and involves ongoing research
- Automated language recognition uses techniques such as language recognition charts
- Stemming and part-of-speech tagging are further language-dependent steps

Format Analysis, Compression, and HTML Priority System
- Format analysis is the process of analyzing different file formats; it is also known as structure analysis, format parsing, and text normalization
- Various file formats pose challenges because of their proprietary nature or lack of documentation
- Common, well-documented file formats include HTML, ASCII text files, PDF, PostScript, and XML
- Dealing with different formats can involve using commercial parsing tools or writing custom parsers
- Some search engines support inspection of files stored in compressed or encrypted formats; commonly supported compressed formats include ZIP, RAR, CAB, Gzip, and BZIP
- When working with compressed formats, the indexer decompresses the document before indexing; this step may yield multiple files, each of which must be indexed separately
- Indexing compressed formats can improve search quality and index coverage
- Section recognition is the identification of the major parts of a document
- Not all documents read like well-organized books with chapters and pages; newsletters and corporate reports often contain erroneous content and side sections
- Content displayed in different areas of the rendered view may be stored sequentially in the raw markup
- Section analysis may require implementing the rendering logic of each document format and indexing that representation
- HTML tags play a role in assigning indexing priority; weighting markup such as 'strong' and 'link' differently from body text can improve relevance
- Search engines such as Google and Bing take such markup into account when assessing relevance
- The order of priority assigned to HTML tags affects search engine indexing, and proper recognition and use of HTML tags improves search results
- Meta tag indexing categorizes web content and plays an important role in organizing it
- Some documents contain embedded meta information such as author, keywords, description, and language
- Earlier search engine technologies indexed only the keywords in the meta tags for the forward index; meta tags were originally designed to be easily indexed without requiring tokenization
- Full-text indexing became more established as computer hardware capabilities improved
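The tag-priority idea above can be illustrated with Python's standard-library HTML parser. The weight table here is a made-up assumption for the sketch; real engines keep their tag-priority schemes secret. Each text token is weighted by the highest-priority tag currently open around it:

```python
from html.parser import HTMLParser

# Illustrative weights only; actual search engines do not publish theirs.
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "strong": 2.0, "a": 1.5}

class WeightedTokenizer(HTMLParser):
    """Collect (token, weight) pairs, weighting each token by the
    highest-priority tag enclosing it."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.tokens = []  # (word, weight) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.stack),
                     default=1.0)
        for word in data.lower().split():
            self.tokens.append((word, weight))

parser = WeightedTokenizer()
parser.feed("<title>Indexing</title>"
            "<p>Plain text with <strong>emphasis</strong></p>")
print(parser.tokens)
```

An indexer could sum these weights per term when scoring documents, so a word appearing in a title or 'strong' element contributes more to relevance than the same word in body text.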