Glossary Term
Search engine indexing
Indexing and Index Design Factors
- The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query
- Without an index, the search engine would have to scan every document in the corpus, requiring considerable time and computing power
- An index of 10,000 documents can be queried within milliseconds, while a sequential scan of every word in 10,000 large documents could take hours
- Additional computer storage is required to store the index
- The time saved during information retrieval is traded off against the time required for an update to take place
- Merge factors: how data enters the index, and whether multiple indexers can work asynchronously
- Storage techniques: how to store the index data, whether compressed or filtered
- Index size: amount of computer storage required to support the index
- Lookup speed: how quickly a word can be found in the inverted index
- Maintenance: how the index is maintained over time, including dealing with index corruption and bad hardware
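The core trade-off above (extra storage and update cost in exchange for fast lookup) can be sketched in Python; the corpus and query terms are invented for illustration:

```python
# A minimal sketch of the speed/storage trade-off: the index is extra
# storage (a dict), but lookup becomes a hash probe instead of a scan.
# The corpus below is invented for illustration.

corpus = {
    1: "the quick brown fox",
    2: "the lazy dog",
    3: "a quick dog",
}

def sequential_scan(term):
    """Without an index: inspect every word of every document."""
    return [doc_id for doc_id, text in corpus.items() if term in text.split()]

# Build the index once; every corpus update must repeat this work
# (the update cost mentioned above).
index = {}
for doc_id, text in corpus.items():
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

def indexed_lookup(term):
    """With an index: a single dictionary probe."""
    return sorted(index.get(term, set()))

print(indexed_lookup("quick"))   # [1, 3]
print(sequential_scan("quick"))  # [1, 3]
```

Both functions return the same result; only the amount of work per query differs.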
Index Data Structures
- Suffix tree: structured like a tree, supports linear time lookup, used for searching patterns in DNA sequences
- Inverted index: stores a list of occurrences of each atomic search criterion
- Citation index: stores citations or hyperlinks between documents to support citation analysis
- n-gram index: stores sequences of characters of a given length to support other types of retrieval or text mining
- Document-term matrix: used in latent semantic analysis, stores occurrences of words in documents in a two-dimensional sparse matrix
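As a sketch of one of these structures, the following builds a character trigram index (an n-gram index with n = 3) over a toy corpus; the documents and function names are invented:

```python
def ngrams(text, n=3):
    """Character n-grams of fixed length n (here n=3, a trigram index)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

docs = {1: "indexing", 2: "index design"}  # toy corpus (invented)

# trigram -> set of documents containing it
trigram_index = {}
for doc_id, text in docs.items():
    for gram in ngrams(text):
        trigram_index.setdefault(gram, set()).add(doc_id)

def candidates(query):
    """Substring-style retrieval: candidates share all trigrams of the query."""
    sets = [trigram_index.get(g, set()) for g in ngrams(query)]
    return set.intersection(*sets) if sets else set()

print(sorted(candidates("index")))  # [1, 2]
```

This style of index supports retrieval that an ordinary word-level inverted index cannot, such as partial-word matching.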
Challenges in Parallelism and Inverted Indices
- Management of serial computing processes is a major challenge in search engine design
- Race conditions and coherence faults are common due to competing tasks
- Distributed storage and processing magnify the challenge
- Search engines may involve distributed computing to scale with larger amounts of indexed information
- Synchronization and maintaining a fully parallel architecture become more difficult
- Inverted index is used to quickly locate documents containing words in a search query
- Stores a list of the documents containing each word
- Without position or frequency information it is a purely Boolean index: it determines which documents match a query but does not rank them
- Position information enables searching for phrases and frequency helps in ranking relevance
- Inverted index is a sparse matrix and can be considered a form of a hash table or binary tree
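A minimal inverted index with positions and frequencies can be sketched as follows; the toy documents and helper names are invented:

```python
from collections import defaultdict

docs = {
    1: "search engine indexing speeds up search",
    2: "indexing an engine",
}

# postings: word -> {doc_id: [positions]}
postings = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, word in enumerate(text.split()):
        postings[word].setdefault(doc_id, []).append(pos)

def boolean_and(*terms):
    """Boolean retrieval: documents containing all terms, unranked."""
    return set.intersection(*[set(postings[t]) for t in terms])

def phrase(w1, w2):
    """Position information enables phrase search: w2 directly after w1."""
    hits = set()
    for doc_id in boolean_and(w1, w2):
        p1, p2 = postings[w1][doc_id], postings[w2][doc_id]
        if any(p + 1 in p2 for p in p1):
            hits.add(doc_id)
    return hits

def rank(term):
    """Frequency helps rank relevance: more occurrences, higher rank."""
    return sorted(postings[term], key=lambda d: -len(postings[term][d]))

print(phrase("engine", "indexing"))  # {1}
```

Both documents contain "engine" and "indexing", but only document 1 contains them as an adjacent phrase, which the Boolean index alone could not distinguish.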
Document Parsing, Tokenization, and Language Recognition
- Document parsing breaks apart the components (words) of a document
- The words found are called tokens
- Tokenization is commonly referred to as parsing in search engine indexing
- Tokenization involves multiple technologies, whose implementations are commonly kept as corporate secrets
- Natural language processing is continuously researched and improved
- Word boundary ambiguity poses a challenge in tokenization
- Language ambiguity affects ranking and additional information collection
- Diverse file formats require correct handling for tokenization
- Faulty storage can degrade index quality or indexer performance
- Multilingual indexing requires language-specific logic and parsers
- Computers do not automatically recognize words and sentences in a document
- Tokenization requires programming the computer to identify tokens
- Tokens can have characteristics like case, language, position, length, etc.
- Parsers can identify entities like email addresses, phone numbers, and URLs
- Specialized programs like YACC or Lex are used for parsing
- Language recognition categorizes the language of a document
- It is an initial step in tokenization for supporting multiple languages
- Language recognition is language-dependent and involves ongoing research
- Automated language recognition uses techniques like language recognition charts
- Stemming and part of speech tagging are language-dependent steps
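A toy tokenizer illustrating these points (token characteristics plus simple entity recognition for email addresses and URLs) might look like this; the regular expressions and token fields are simplifications, not a production design:

```python
import re

# Hypothetical patterns: a coarse token pattern plus entity classifiers.
TOKEN = re.compile(r"[A-Za-z0-9@.:/+-]+")
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")
URL = re.compile(r"^https?://")

def tokenize(text):
    """Emit token records with characteristics: kind, case, position, length."""
    tokens = []
    for pos, match in enumerate(TOKEN.finditer(text)):
        lexeme = match.group()
        if EMAIL.match(lexeme):
            kind = "email"
        elif URL.match(lexeme):
            kind = "url"
        else:
            kind = "word"
        tokens.append({
            "lexeme": lexeme,
            "kind": kind,
            "position": pos,          # token position in the stream
            "offset": match.start(),  # character offset in the document
            "case": "upper" if lexeme[0].isupper() else "lower",
            "length": len(lexeme),
        })
    return tokens

toks = tokenize("Contact admin@example.com or see https://example.com")
print([(t["lexeme"], t["kind"]) for t in toks])
```

A real indexer would add language-dependent steps (stemming, part-of-speech tagging) after this stage, once the document's language has been recognized.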
Format Analysis, Compression, and HTML Priority System
- Format analysis is the process of analyzing different file formats
- It is also known as structure analysis, format parsing, and text normalization
- Various file formats pose challenges due to their proprietary nature or lack of documentation
- Common well-documented file formats include HTML, ASCII text files, PDF, PostScript, and XML
- Dealing with different formats can involve using commercial parsing tools or writing custom parsers
- Some search engines support inspection of files stored in compressed or encrypted formats
- Commonly supported compressed file formats include ZIP, RAR, CAB, Gzip, and BZIP
- When working with compressed formats, the indexer decompresses the document before indexing
- This step may result in multiple files that need to be indexed separately
- Indexing compressed formats can improve search quality and index coverage
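The decompress-then-index step can be sketched with Python's standard `zipfile` module; the archive and its member files are invented, and each member becomes a separate document to index:

```python
import io
import zipfile

# Build an in-memory ZIP standing in for a compressed document found
# during crawling (file names and contents are invented).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("report.txt", "quarterly indexing report")
    zf.writestr("notes.txt", "search engine notes")

# The indexer decompresses the archive first; each member file is then
# indexed as a separate document.
buf.seek(0)
documents = {}
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        documents[name] = zf.read(name).decode("utf-8")

print(sorted(documents))  # ['notes.txt', 'report.txt']
```

One compressed file thus yields multiple indexable documents, which is how indexing compressed formats improves coverage.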
- Section recognition is the identification of major parts of a document
- Not all documents read like well-organized books with chapters and pages
- Newsletters and corporate reports often contain erroneous content and side-sections that lack primary material
- Content displayed in different areas of the view may be stored sequentially in the raw markup
- Section analysis requires implementing the rendering logic of each document and indexing the representation
- HTML tags play a role in organizing priority for indexing
- Assigning priority to markup labels such as 'strong' and 'link' can optimize relevance ordering
- Indexers such as Google and Bing guard against treating text as relevant solely because of strong markup emphasis
- The order of priority assigned to HTML tags affects search engine indexing
- Proper recognition and utilization of HTML tags improve search results
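One hypothetical way to weight terms by the HTML tags enclosing them is sketched below; the tag weights are invented for illustration, since real engines keep their weighting schemes private:

```python
from html.parser import HTMLParser

# Invented tag weights: emphasized/structural markup counts for more.
WEIGHTS = {"title": 3.0, "h1": 2.0, "strong": 1.5}

class WeightedExtractor(HTMLParser):
    """Accumulate a per-word score based on the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.terms = {}   # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            self.stack.remove(tag)  # tolerate slightly malformed HTML

    def handle_data(self, data):
        weight = max([WEIGHTS.get(t, 1.0) for t in self.stack], default=1.0)
        for word in data.split():
            self.terms[word] = self.terms.get(word, 0.0) + weight

p = WeightedExtractor()
p.feed("<title>indexing</title><p>indexing is <strong>fast</strong></p>")
print(p.terms)  # {'indexing': 4.0, 'is': 1.0, 'fast': 1.5}
```

The word in the title accumulates more weight than the same word in body text, which a ranking stage could then exploit.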
- Meta tag indexing categorizes web content and plays an important role in organizing it
- Specific documents contain embedded meta information such as author, keywords, description, and language
- Earlier Internet search engine technologies indexed only the keywords in meta tags for the forward index; the full document was not parsed
- Full-text indexing became more established as computer hardware capabilities improved
- Meta tags were initially designed to be easily indexed without requiring tokenization
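Extracting embedded meta information without tokenizing the document body can be sketched with the standard-library `HTMLParser`; the meta values below are invented placeholders:

```python
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collect name/content pairs from <meta> tags; no body parsing needed."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta[d["name"]] = d["content"]

p = MetaExtractor()
p.feed('<head><meta name="keywords" content="indexing, search">'
       '<meta name="author" content="J. Doe"></head>')
print(p.meta)
```

This mirrors how early engines could index a document from its meta tags alone, before full-text indexing became practical.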