Search engine indexing


Indexing and Index Design Factors
– The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query
– Without an index, the search engine would scan every document in the corpus, requiring considerable time and computing power
– An index of 10,000 documents can be queried within milliseconds, while a sequential scan of every word in 10,000 large documents could take hours
– Additional computer storage is required to store the index
– The time saved during information retrieval is traded off against the time required to update the index
– Merge factors: how data enters the index and whether multiple indexers can work asynchronously
– Storage techniques: how to store the index data, whether compressed or filtered
– Index size: amount of computer storage required to support the index
– Lookup speed: how quickly a word can be found in the inverted index
– Maintenance: how the index is maintained over time, including dealing with index corruption and bad hardware
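
The merge factor above can be sketched in a few lines: two indexers each build a partial index for their own batch of documents, and a merge step combines the results. The function names `build_partial_index` and `merge_indexes`, the sample documents, and the word-to-sorted-list layout are assumptions made for illustration, not a description of any particular search engine.

```python
from collections import defaultdict


def build_partial_index(docs):
    """Build a word -> sorted list of doc ids mapping for one batch of documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def merge_indexes(a, b):
    """Merge two partial indexes; posting lists are unioned and kept sorted."""
    merged = {}
    for word in a.keys() | b.keys():
        merged[word] = sorted(set(a.get(word, [])) | set(b.get(word, [])))
    return merged


# Two batches that could have been indexed by separate workers.
batch_1 = {1: "the cow says moo", 2: "the cat and the hat"}
batch_2 = {3: "the dish ran away with the spoon"}

index = merge_indexes(build_partial_index(batch_1), build_partial_index(batch_2))
print(index["the"])  # [1, 2, 3]
print(index["cat"])  # [2]
```

Keeping each posting list sorted is one possible design choice; it makes later merges and intersections straightforward at the cost of a sort during the merge.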

Index Data Structures
Suffix tree: structured like a tree, supports linear time lookup, used for searching patterns in DNA sequences
Inverted index: stores a list of occurrences of each atomic search criterion
Citation index: stores citations or hyperlinks between documents to support citation analysis
n-gram index: stores sequences of n consecutive items of data to support other types of retrieval or text mining
Document-term matrix: used in latent semantic analysis, stores occurrences of words in documents in a two-dimensional sparse matrix
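
As a rough illustration of two of these structures, the sketch below builds a character n-gram index and a sparse document-term matrix. The helper names, the choice of character trigrams, and the nested-dictionary layout are assumptions for the example only; real systems use more compact representations.

```python
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return every contiguous character sequence of length n in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_ngram_index(docs, n=3):
    """Map each n-gram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in char_ngrams(text.lower(), n):
            index[gram].add(doc_id)
    return index


def build_document_term_matrix(docs):
    """Sparse matrix as {doc_id: {term: count}}; absent terms are implicit zeros."""
    return {doc_id: dict(Counter(text.lower().split())) for doc_id, text in docs.items()}


docs = {1: "index design", 2: "inverted index"}
ngram_index = build_ngram_index(docs)
matrix = build_document_term_matrix(docs)

print(sorted(ngram_index["ind"]))  # [1, 2]
print(matrix[2].get("index", 0))   # 1
print(matrix[2].get("design", 0))  # 0
```

Only non-zero counts are stored in the matrix, which is what makes it sparse: a term that never occurs in a document takes no space.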

Challenges in Parallelism and Inverted Indices
– Management of serial computing processes is a major challenge in search engine design
– Race conditions and coherent faults are common due to competing tasks
– Distributed storage and processing magnify the challenge
– Search engines may involve distributed computing to scale with larger amounts of indexed information
– Synchronization and maintaining a fully parallel architecture become more difficult
An inverted index is used to quickly locate documents containing the words in a search query
– Stores a list of documents containing each word
– In its simplest form it is a Boolean index: it determines which documents match a query but does not rank them
– Storing position information enables phrase searching, and storing frequency helps rank relevance
The inverted index is a sparse matrix, since not every word appears in every document; it can be considered a form of hash table and in some cases takes the form of a binary tree
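
The points above can be sketched as a small positional inverted index. This is a minimal example under stated assumptions: whitespace tokenization, an in-memory dictionary (a hash table), and hypothetical function names. It shows a Boolean AND query, a phrase query that relies on positions, and a frequency lookup.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map word -> {doc_id: [positions]} so phrases and frequencies can be derived."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, words):
    """Return ids of documents containing every word (no ranking)."""
    postings = [set(index.get(w, {})) for w in words]
    return set.intersection(*postings) if postings else set()


def phrase_search(index, phrase):
    """Return ids of documents where the words occur consecutively."""
    words = phrase.lower().split()
    hits = set()
    for doc_id in boolean_and(index, words):
        starts = index[words[0]][doc_id]
        if any(all(p + i in index[w][doc_id] for i, w in enumerate(words)) for p in starts):
            hits.add(doc_id)
    return hits


docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}
index = build_positional_index(docs)

print(boolean_and(index, ["the", "cat"]))  # {2}
print(phrase_search(index, "the hat"))     # {2}
print(phrase_search(index, "cat the"))     # set()
print(len(index["the"][3]))                # 2 occurrences of "the" in document 3
```

Dropping the position lists and keeping only the document ids would reduce this to the purely Boolean index described above.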

Document Parsing, Tokenization, and Language Recognition
– Document parsing breaks apart the components (words) of a document
– The words found are called tokens
– Tokenization is commonly referred to as parsing in search engine indexing
– Tokenization involves multiple technologies, whose implementations are commonly kept as corporate secrets
Natural language processing is continuously researched and improved
– Word boundary ambiguity poses a challenge in tokenization
Language ambiguity: to rank matching documents properly, many search engines collect additional information about each word, such as its language or part of speech
– Diverse file formats require correct handling for tokenization
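– Faulty storage, such as documents that do not obey proper file protocol or that contain mis-encoded binary characters, can degrade index quality or indexer performance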
– Multilingual indexing requires language-specific logic and parsers
– Computers do not automatically recognize words and sentences in a document
– Tokenization requires programming the computer to identify tokens
– Tokens can have characteristics like case, language, position, length, etc.
– Parsers can identify entities like email addresses, phone numbers, and URLs
– Specialized programs like YACC or Lex are used for parsing
Language recognition categorizes the language of a document
– It is an initial step in tokenization for supporting multiple languages
Automated language recognition is the subject of ongoing research in natural language processing
– Identifying the language a word belongs to may involve a language recognition chart
Later steps such as stemming and part-of-speech tagging are language-dependent
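
A minimal tokenizer sketch follows, illustrating how a program can be told what counts as a token and which characteristics to record. It uses a regular expression as a lightweight stand-in for lexer generators such as Lex or YACC. The patterns, the `Token` fields, and the sample sentence are assumptions chosen for illustration; production tokenizers handle far more cases, including languages without whitespace word boundaries.

```python
import re
from dataclasses import dataclass

# Order matters: entity patterns are tried before the plain-word pattern.
TOKEN_PATTERN = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+(?:\.[\w-]+)+)"
    r"|(?P<url>https?://\S+)"
    r"|(?P<word>\w+(?:'\w+)?)"
)


@dataclass
class Token:
    text: str       # normalized (lowercased) form
    kind: str       # "email", "url", or "word"
    position: int   # ordinal position of the token in the document
    offset: int     # character offset where the token starts
    is_upper: bool  # whether the original text was all upper case


def tokenize(text):
    """Split text into tokens, recording a few characteristics of each."""
    tokens = []
    for position, match in enumerate(TOKEN_PATTERN.finditer(text)):
        raw = match.group()
        tokens.append(Token(raw.lower(), match.lastgroup, position, match.start(), raw.isupper()))
    return tokens


sample = "Contact Bob at bob@example.com or see https://example.com/docs for the FAQ."
tokens = tokenize(sample)
for token in tokens:
    print(token.position, token.kind, token.text)
```

Recording the position and offset alongside each token is what later allows an index to support phrase search and result highlighting.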

Format Analysis, Compression, and HTML Priority System
– Format analysis is the identification and handling of the formatting content embedded in documents, which controls how they are rendered
– It is also known as structure analysis, format parsing, and text normalization
– Various file formats pose challenges due to their proprietary nature or lack of documentation
– Common well-documented file formats include HTML, ASCII text files, PDF, PostScript, and XML
– Dealing with different formats can involve using commercial parsing tools or writing custom parsers
– Some search engines support inspection of files stored in compressed or encrypted formats
– Commonly supported compressed file formats include ZIP, RAR, CAB, Gzip, and BZIP
– When working with compressed formats, the indexer decompresses the document before indexing
– This step may result in multiple files that need to be indexed separately
– Indexing compressed formats can improve search quality and index coverage
– Section recognition is the identification of major parts of a document
– Not all documents read like well-organized books with chapters and pages
– Many web documents, such as newsletters and corporate reports, contain erroneous content and side-sections that hold no primary material
– Content displayed in different areas of the view may be stored sequentially in the raw markup
– Section analysis requires implementing the rendering logic of each document and indexing the representation
HTML tags play a role in organizing priority for indexing
– Text inside tags such as ‘strong’ and ‘link’ can be given lower or higher priority, although emphasis at the start of a text is not always a sign of relevance
– Some indexers, such as those of Google and Bing, take care that large blocks of text are not treated as relevant merely because of their markup
– The order of priority for HTML tags affects search engine indexing
– Proper recognition and utilization of HTML tags improve search results
– Meta tag indexing categorizes web content and plays an important role in organizing it
– Specific documents contain embedded meta information such as author, keywords, description, and language
– Earlier search engine technologies only indexed keywords in meta tags for the forward index
– Full-text indexing became more established as computer hardware capabilities improved
– Meta tags were initially designed to be easily indexed without requiring tokenization
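
To make tag stripping, tag priority, and meta tag indexing concrete, here is a small sketch built on Python's standard `html.parser` module. The weights in `TAG_WEIGHTS`, the class name `IndexingParser`, and the sample page are hypothetical; how real search engines weight tags is not specified here.

```python
from html.parser import HTMLParser

# Hypothetical weights: text inside these tags counts more than body text.
TAG_WEIGHTS = {"title": 5, "h1": 4, "strong": 2, "a": 2}
SKIPPED_TAGS = {"script", "style"}  # formatting content that is not indexed


class IndexingParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags
        self.meta = {}     # name -> content taken from <meta> tags
        self.weights = {}  # word -> accumulated weight

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name, content = attrs.get("name"), attrs.get("content")
            if name and content:
                self.meta[name.lower()] = content
            return  # meta is a void element, so nothing is pushed
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate sloppy markup.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if SKIPPED_TAGS & set(self.stack):
            return
        weight = max([TAG_WEIGHTS.get(t, 1) for t in self.stack], default=1)
        for word in data.lower().split():
            self.weights[word] = self.weights.get(word, 0) + weight


page = """<html><head><title>Search indexing</title>
<meta name="author" content="Example Author">
<meta name="keywords" content="index, search">
<style>p { color: red; }</style></head>
<body><p>An <strong>inverted</strong> index speeds up search.</p></body></html>"""

parser = IndexingParser()
parser.feed(page)
parser.close()

print(parser.meta["author"])       # Example Author
print(parser.weights["search"])    # 5 (from the title)
print(parser.weights["inverted"])  # 2 (inside <strong>)
```

The meta values are stored without tokenization, while visible text is split into words and weighted by its enclosing tags; punctuation handling is left out to keep the sketch short.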

Search engine indexing is the collecting, parsing, and storing of data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is web indexing.

Popular search engines focus on the full-text indexing of online, natural language documents. Media types such as pictures, video, audio, and graphics are also searchable.

Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval because of the time and processing costs involved, while agent-based search engines index in real time.
