Indexing and Index Design Factors
– The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query
– Without an index, the search engine would have to scan every document in the corpus, requiring considerable time and computing power
– An index of 10,000 documents can be queried within milliseconds, whereas a sequential scan of every word in 10,000 large documents could take hours
– Additional computer storage is required to store the index
– The time saved during information retrieval is traded off against the time required for an update to take place
– Merge factors: how data enters the index and if multiple indexers can work asynchronously
– Storage techniques: how to store the index data, whether compressed or filtered
– Index size: amount of computer storage required to support the index
– Lookup speed: how quickly a word can be found in the inverted index
– Maintenance: how the index is maintained over time, including dealing with index corruption and bad hardware
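The storage-technique and index-size factors above often come down to how posting lists are compressed. A minimal sketch of one classic approach, gap (delta) encoding combined with variable-byte compression of a sorted document-ID list; the function names are hypothetical, and real indexers layer further codecs on top:

```python
def encode_postings(doc_ids):
    """Compress a sorted list of document IDs with gap + variable-byte encoding."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap = doc_id - prev          # store the difference, not the raw ID
        prev = doc_id
        while True:                  # variable-byte: 7 data bits per byte,
            byte = gap & 0x7F        # high bit marks "more bytes follow"
            gap >>= 7
            if gap:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                break
    return bytes(out)

def decode_postings(data):
    """Invert encode_postings: rebuild the original sorted doc-ID list."""
    doc_ids, prev, gap, shift = [], 0, 0, 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:          # high bit clear: this gap is complete
            prev += gap
            doc_ids.append(prev)
            gap, shift = 0, 0
    return doc_ids
```

Because gaps between consecutive IDs are small for frequent words, most entries fit in one byte instead of four or eight, shrinking the index at the cost of a decode step at lookup time, which is exactly the trade-off the factors above describe.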
Index Data Structures
– Suffix tree: structured like a tree, supports linear time lookup, used for searching patterns in DNA sequences
– Inverted index: stores a list of occurrences of each atomic search criterion
– Citation index: stores citations or hyperlinks between documents to support citation analysis
– n-gram index: stores sequences of n items of data (such as characters or words) to support other types of retrieval or text mining
– Document-term matrix: used in latent semantic analysis, stores occurrences of words in documents in a two-dimensional sparse matrix
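To make the n-gram entry concrete, here is a minimal sketch of a character n-gram index: each trigram maps to the set of words containing it, which supports substring-style lookup. The function names and the `$` boundary markers are illustrative choices, not a standard; note that intersecting gram sets yields candidates, which a production system would verify against the actual text:

```python
from collections import defaultdict

def build_ngram_index(words, n=3):
    """Map each character n-gram to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        padded = f"${word}$"              # boundary markers so prefixes/suffixes index too
        for i in range(len(padded) - n + 1):
            index[padded[i:i + n]].add(word)
    return index

def lookup_substring(index, fragment, n=3):
    """Find candidate words containing `fragment` by intersecting gram sets."""
    grams = [fragment[i:i + n] for i in range(len(fragment) - n + 1)]
    if not grams:
        return set()                      # fragment shorter than n: no grams to look up
    sets = [index.get(g, set()) for g in grams]
    return set.intersection(*sets)
```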
Challenges in Parallelism and Inverted Indices
– Managing concurrent computing processes is a major challenge in search engine design
– Competing tasks create many opportunities for race conditions and coherence faults
– Distributed storage and distributed processing magnify the challenge
– Search engines may involve distributed computing to scale with larger amounts of indexed information
– Synchronization and maintaining a fully parallel architecture become more difficult
– Inverted index is used to quickly locate documents containing words in a search query
– Stores a list of documents containing each word
– A basic inverted index is a Boolean index: it determines which documents match a query but does not rank them
– Position information enables searching for phrases and frequency helps in ranking relevance
– The inverted index is sparse, since not every word appears in every document, and it can be implemented as a hash table or a binary tree
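The points above can be sketched as a small positional inverted index: storing positions enables phrase queries, and the per-document position counts give the term frequencies used for ranking. This is a minimal illustration with hypothetical function names, assuming whitespace tokenization:

```python
from collections import defaultdict

def build_positional_index(docs):
    """Build word -> {doc_id: [positions]} from a dict of doc_id -> text.
    Position lists enable phrase search; their lengths are term frequencies."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def phrase_search(index, phrase):
    """Return doc_ids where the words of `phrase` appear consecutively."""
    words = phrase.lower().split()
    postings = [index.get(w, {}) for w in words]
    candidates = set(postings[0])         # docs containing the first word
    for p in postings[1:]:
        candidates &= set(p)              # Boolean AND over the posting lists
    hits = set()
    for doc_id in candidates:
        for start in postings[0][doc_id]:
            # a hit needs word i at position start + i for every phrase word
            if all(start + i in postings[i][doc_id] for i in range(1, len(words))):
                hits.add(doc_id)
                break
    return hits
```

Dropping the position lists and keeping only the document sets reduces this to the pure Boolean index described above.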
Document Parsing, Tokenization, and Language Recognition
– Document parsing breaks apart the components (words) of a document
– The words found are called tokens
– Tokenization is commonly referred to as parsing in search engine indexing
– Tokenization involves multiple technologies, the implementation of which is often kept as a corporate secret
– Natural language processing is continuously researched and improved
– Word boundary ambiguity poses a challenge in tokenization
– Language ambiguity affects ranking and additional information collection
– Diverse file formats require correct handling for tokenization
– Faulty storage can degrade index quality or indexer performance
– Multilingual indexing requires language-specific logic and parsers
– Computers do not automatically recognize words and sentences in a document
– Tokenization requires programming the computer to identify tokens
– Tokens can have characteristics like case, language, position, length, etc.
– Parsers can identify entities like email addresses, phone numbers, and URLs
– Specialized programs like YACC or Lex are used for parsing
– Language recognition categorizes the language of a document
– It is an initial step in tokenization for supporting multiple languages
– Language recognition is language-dependent and involves ongoing research
– Automated language recognition uses techniques like language recognition charts
– Stemming and part of speech tagging are language-dependent steps
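A minimal tokenizer sketch tying the points above together: the regular expression tries entity patterns (emails, URLs) before plain words so they survive as single tokens, and each token is emitted with the kinds of attributes an indexer records (type, position, length, case). The pattern and attribute names are illustrative, far simpler than what production lexers built with tools like Lex handle:

```python
import re

# Order matters: email and URL alternatives are tried before plain words,
# so "admin@example.com" is one token rather than three.
TOKEN_RE = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)"
    r"|(?P<url>https?://\S+)"
    r"|(?P<word>\w+)"
)

def tokenize(text):
    """Emit tokens with attributes such as entity type, position, length, case."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        tokens.append({
            "text": match.group(),
            "type": match.lastgroup,       # which named alternative matched
            "start": match.start(),        # character position in the document
            "length": len(match.group()),
            "is_upper": match.group().isupper(),
        })
    return tokens
```

Language-dependent steps such as stemming or part-of-speech tagging would run after this stage, once the document's language has been recognized.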
Format Analysis, Compression, and HTML Priority System
– Format analysis is the process of analyzing different file formats
– It is also known as structure analysis, format parsing, and text normalization
– Various file formats pose challenges due to their proprietary nature or lack of documentation
– Common well-documented file formats include HTML, ASCII text files, PDF, PostScript, and XML
– Dealing with different formats can involve using commercial parsing tools or writing custom parsers
– Some search engines support inspection of files stored in compressed or encrypted formats
– Commonly supported compressed file formats include ZIP, RAR, CAB, Gzip, and BZIP
– When working with compressed formats, the indexer decompresses the document before indexing
– This step may result in multiple files that need to be indexed separately
– Indexing compressed formats can improve search quality and index coverage
– Section recognition is the identification of major parts of a document
– Not all documents read like well-organized books with chapters and pages
– Newsletters and corporate reports often contain extraneous content and side-sections that do not hold the primary material
– Content displayed in different areas of the view may be stored sequentially in the raw markup
– Section analysis may require implementing the rendering logic of each document, essentially an abstract representation of it, and indexing that representation instead
– HTML tags play a role in signaling priority for indexing
– Assigning priority to emphasis markup such as 'strong' tags and link anchor text can improve relevance ordering
– Search engines such as Google and Bing guard against treating large or emphasized text as automatically relevant, since markup is easily manipulated
– The order of priority assigned to HTML tags affects how content is indexed
– Proper recognition and weighting of HTML tags improve search results
– Meta tag indexing categorizes web content and plays an important role in organizing it
– Specific documents contain embedded meta information such as author, keywords, description, and language
– Earlier search engine technologies only indexed keywords in meta tags for the forward index
– Full-text indexing became more established as computer hardware capabilities improved
– Meta tags were initially designed to be easily indexed without requiring tokenization
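The HTML-specific points above can be sketched with the standard library's `html.parser`: the parser collects meta name/content pairs directly (no tokenization needed, as noted above) and weights body words by their enclosing tag. The `TAG_WEIGHTS` values are purely hypothetical; real engines tune and bound such boosts to resist markup spam:

```python
from html.parser import HTMLParser

# Hypothetical priorities: tags that usually signal importance get a boost.
TAG_WEIGHTS = {"title": 10, "h1": 5, "strong": 2, "b": 2}
VOID_TAGS = {"meta", "br", "img", "hr", "link", "input"}   # never closed, never pushed

class IndexingParser(HTMLParser):
    """Collect meta tags and per-word weights from raw HTML."""
    def __init__(self):
        super().__init__()
        self.meta = {}
        self.weights = {}            # word -> accumulated weight
        self.stack = []              # currently open tags

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")
        elif tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:        # pop back to the match (tolerates sloppy HTML)
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if "script" in self.stack or "style" in self.stack:
            return                   # code and styles are not document content
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.stack] or [1])
        for word in data.lower().split():
            self.weights[word] = self.weights.get(word, 0) + weight
```

Feeding a page through `IndexingParser` yields the meta fields (author, keywords, description) alongside tag-weighted term counts, the two inputs the section above describes.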
Overview
Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.
Popular search engines focus on the full-text indexing of online, natural-language documents, though media types such as pictures, video, audio, and graphics are also searchable.
Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the required time and processing costs, while agent-based search engines index in real time.