History and Concepts
– In January 1953, Hans Peter Luhn wrote an internal IBM memorandum that used hashing with chaining.
– Gene Amdahl, Elaine M. McGraw, Nathaniel Rochester, and Arthur Samuel of IBM Research implemented hashing for the IBM 701 assembler.
– A. D. Linh, building on Luhn's memorandum, proposed open addressing.
– W. Wesley Peterson coined the term ‘open addressing’ in his 1957 article on searching in large files.
– Arnold Dumey discussed the idea of using remainder modulo a prime as a hash function.
– The load factor α is defined as n/k: the number of occupied entries n divided by the number of buckets k.
– The performance of the hash table deteriorates as the load factor increases.
– Implementations therefore keep the load factor below a chosen constant.
– When the load factor reaches that maximum, the table is resized or rehashed (a load-factor sketch follows this list).
– Separate chaining hash tables have no fixed point beyond which resizing is absolutely needed.
– A hash function maps the universe of keys to array indices or slots within the table.
– Conventional implementations of hash functions are based on the integer universe assumption.
– A perfect hash function is an injective function: it maps every key to a distinct slot, so no collisions occur.
– Perfect hash functions can be created if all the keys are known ahead of time.
– Hash functions can be evaluated empirically using statistical tests.
– Hashing schemes based on the integer universe assumption include hashing by division, multiplication, universal hashing, dynamic perfect hashing, and static perfect hashing.
– Hashing by division is the most commonly used scheme (division and multiplication are both sketched after this list).
– Hashing by multiplication computes the hash from the fractional part of the key times a real-valued constant A, with 0 < A < 1.
– Donald Knuth suggests a constant derived from the golden ratio, A = (√5 − 1)/2 ≈ 0.6180.
– The choice of hash function depends on the desired distribution and table size.
– Uniform distribution of hash values is a fundamental requirement.
– Non-uniform distribution increases collisions and their resolution cost.
– Uniformity can be evaluated using statistical tests such as Pearson's chi-squared test.
– The distribution needs to be uniform for table sizes that occur in the application.
– Dynamic resizing with exact doubling and halving requires uniformity for power of two table sizes.
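A minimal Python sketch of the load-factor bookkeeping described above; the 0.75 threshold is an illustrative assumption, since real implementations pick their own maximum:

```python
def load_factor(num_entries: int, num_buckets: int) -> float:
    """Load factor = occupied entries / number of buckets."""
    return num_entries / num_buckets

# Illustrative threshold; actual implementations choose their own constant.
MAX_LOAD_FACTOR = 0.75

def needs_resize(num_entries: int, num_buckets: int) -> bool:
    """True once the load factor reaches the maximum, triggering a resize/rehash."""
    return load_factor(num_entries, num_buckets) >= MAX_LOAD_FACTOR
```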
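Hashing by division and Knuth's multiplicative scheme can be sketched as follows, assuming integer keys in line with the integer universe assumption above:

```python
import math

# Knuth's suggested constant, derived from the golden ratio: (sqrt(5) - 1) / 2.
KNUTH_A = (math.sqrt(5) - 1) / 2  # ~0.6180339887

def hash_division(key: int, num_buckets: int) -> int:
    """Hashing by division: remainder modulo the table size (often a prime)."""
    return key % num_buckets

def hash_multiplication(key: int, num_buckets: int) -> int:
    """Hashing by multiplication: scale the fractional part of key * A."""
    fractional = (key * KNUTH_A) % 1.0
    return int(num_buckets * fractional)
```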
Collision Resolution Techniques
– Hashing uses a hash function to transform a search key into an array index.
– The ideal case is that no two search keys hash to the same array index.
– Collision resolution is the second part of the algorithm.
– Two common methods for collision resolution are separate chaining and open addressing.
– Collision resolution is necessary because it is impossible to guarantee no collisions for unseen data.
– Separate chaining builds a linked list for each array index.
– Items that collide at an index are chained together in that index's list.
– Chained-Hash-Insert, Chained-Hash-Search, and Chained-Hash-Delete are the basic operations of separate chaining (see the chaining sketch after this list).
– Keeping each chain ordered by key speeds up unsuccessful searches, which can stop as soon as a larger key is reached.
– Other data structures like self-balancing binary search trees can be used for separate chaining.
– Open addressing stores every entry record in the bucket array itself.
– Collisions are resolved by probing: examining alternative slots in a defined sequence.
– Linear probing, quadratic probing, and double hashing are common probe sequences in open addressing (a linear-probing sketch follows this list).
– Open addressing may be slower than separate chaining when the load factor approaches 1.
– Linear probing can lead to better CPU cache utilization because successive probes touch adjacent slots (locality of reference).
– Separate chaining with linked lists tends not to be cache-conscious because the chain nodes lack spatial locality.
– Cache-conscious variants of separate chaining store each chain in a more cache-friendly dynamic array.
– Better locality of reference hides memory latency and reduces access time.
– Array-based separate chaining is more performant than the standard linked list method under heavy load.
– Coalesced hashing is a hybrid of separate chaining and open addressing.
– Cuckoo hashing guarantees constant worst-case lookup time and amortized constant time for insertions.
– Hopscotch hashing combines elements of cuckoo hashing, linear probing, and chaining.
– Coalesced hashing is suited for fixed memory allocation.
– Hopscotch hashing provides high throughput in concurrent settings and is suitable for resizable concurrent hash tables.
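A minimal Python sketch of separate chaining; dynamic arrays stand in for linked lists (as in the cache-conscious variant), the bucket count and modular index are illustrative, and the method names echo the Chained-Hash-* operations above:

```python
class ChainedHashTable:
    """Separate chaining: one chain of (key, value) pairs per bucket."""

    def __init__(self, num_buckets: int = 16):
        self.buckets = [[] for _ in range(num_buckets)]

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def insert(self, key, value) -> None:           # Chained-Hash-Insert
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                chain[i] = (key, value)             # overwrite existing key
                return
        chain.append((key, value))

    def search(self, key):                          # Chained-Hash-Search
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return None

    def delete(self, key) -> None:                  # Chained-Hash-Delete
        i = self._index(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]
```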
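And a sketch of open addressing with linear probing; deletion is omitted because a real implementation needs tombstones or back-shifting to keep probe chains intact:

```python
class LinearProbingTable:
    """Open addressing: entries live in the bucket array itself, and
    collisions are resolved by probing successive slots."""

    _EMPTY = object()  # sentinel marking a never-used slot

    def __init__(self, capacity: int = 16):
        self.keys = [self._EMPTY] * capacity
        self.values = [None] * capacity

    def _probe(self, key):
        # Linear probe sequence: h(k), h(k)+1, h(k)+2, ... modulo the size.
        start = hash(key) % len(self.keys)
        for offset in range(len(self.keys)):
            yield (start + offset) % len(self.keys)

    def insert(self, key, value) -> None:
        for i in self._probe(key):
            if self.keys[i] is self._EMPTY or self.keys[i] == key:
                self.keys[i], self.values[i] = key, value
                return
        raise RuntimeError("table full; a real table resizes before this point")

    def search(self, key):
        for i in self._probe(key):
            if self.keys[i] is self._EMPTY:
                return None              # an empty slot ends the probe chain
            if self.keys[i] == key:
                return self.values[i]
        return None
```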
Hash Table Implementation
– Hash tables use a hash function to map keys to indices in an array.
– They provide constant-time average-case performance for insertion, deletion, and search operations.
– Hash collisions occur when two keys map to the same index.
– Collision resolution techniques include chaining and open addressing.
– Chaining resolves collisions by storing all items that hash to the same index in a linked list.
– Open addressing resolves collisions by finding alternative positions within the hash table.
– Linear probing is a common open addressing technique that checks the next position in the table.
– Quadratic probing uses a quadratic function to search for the next position.
– Robin Hood hashing favors displacing items with longer probe sequence lengths.
– Dynamic resizing is necessary when the number of entries in a hash table grows.
– Resizing maintains the performance of lookup and insertion operations.
– Items are rehashed into the buckets of the new hash table.
– Resizing may be performed to avoid excessive memory usage.
– Resizing can be computationally expensive, especially with all-at-once rehashing.
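A sketch of all-at-once rehashing in Python, assuming the chained layout from the earlier sketch; the full O(n) pass over every entry is what makes occasional resizes expensive:

```python
def rehash_all(buckets: list[list[tuple]], new_size: int) -> list[list[tuple]]:
    """Move every entry into a freshly allocated table of new_size buckets.

    Each key must be rehashed because its index depends on the bucket count.
    """
    new_buckets: list[list[tuple]] = [[] for _ in range(new_size)]
    for chain in buckets:
        for key, value in chain:
            new_buckets[hash(key) % new_size].append((key, value))
    return new_buckets
```

Growing typically doubles the bucket count and shrinking halves it, matching the exact doubling and halving mentioned earlier.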
Applications of Hash Tables
– Hash tables are commonly used to implement associative arrays and in-memory tables.
– They are used in database indexing, although B-trees are more popular in this context.
– Hash tables can be used to implement caches, speeding up access to data stored in slower media.
– They are also used in set data structures for testing membership of a value.
– Transposition tables are complex hash tables used in game-playing programs to store information about sections of the game tree that have already been searched.
– Hash tables are widely used in databases and caching systems.
– They provide efficient lookup operations in symbol tables and dictionaries.
– Hash tables are used in spell checkers and spell correction algorithms.
– They are employed in hash-based data structures like bloom filters and hash maps.
– Cryptocurrencies and blockchain technologies utilize hash functions.
Hash Table Implementations in Programming Languages
– Many programming languages provide hash table functionality, either as built-in associative arrays or as standard library modules.
In computing, a hash table, also known as a hash map, is a data structure that implements an associative array, also called a dictionary, which is an abstract data type that maps keys to values. A hash table uses a hash function to compute an index, also called a hash code, into an array of buckets or slots, from which the desired value can be found. During lookup, the key is hashed and the resulting hash indicates where the corresponding value is stored.
| Hash table | |
| --- | --- |
| Type | Unordered associative array |
| Invented | 1953 |
Ideally, the hash function will assign each key to a unique bucket, but most hash table designs employ an imperfect hash function, which might cause hash collisions where the hash function generates the same index for more than one key. Such collisions are typically accommodated in some way.
In a well-dimensioned hash table, the average time complexity for each lookup is independent of the number of elements stored in the table. Many hash table designs also allow arbitrary insertions and deletions of key–value pairs, at amortized constant average cost per operation.
Hashing is an example of a space-time tradeoff. If memory is infinite, the entire key can be used directly as an index to locate its value with a single memory access. On the other hand, if infinite time is available, values can be stored without regard for their keys, and a binary search or linear search can be used to retrieve the element.
In many situations, hash tables turn out to be on average more efficient than search trees or any other table lookup structure. For this reason, they are widely used in many kinds of computer software, particularly for associative arrays, database indexing, caches, and sets.