Hashing is a technique used for storing and retrieving information as quickly as possible. It allows performing optimal searches and is useful in implementing hash tables.
As part of Hashing a string is transformed into a shorter, fixed-length value(or key) which represents the original string. In general, when we store strings in a database and then have to search those strings, character by character comparison is carried out on each string until a match is found, this is a very cumbersome process.
But what if we generate a unique 4 digit number for each string using some conversion function and then whenever we are asked to search for any string, we can convert the string to be searched into the 4 digit number and then look for that number. This way, the search will be very fast.
Hashing provides a way to perform insert, delete and search operation in
O(1) average time, which although can reach
O(n) in the worst cases. When we use hashing, we use a function to generate a hash code for the data to be inserted and then whenever we have to search that data, we again use the same function to generate the unique hash code and can easily find the data.
Let's take an example to understand the need for hashing. Let's say we have a table in our database in which we store names of students like:
If the table has 1 million such name entries, then whenever we have to search for any name, names are searched by matching characters one by one which is time-consuming. But what if we create a function which can produce a unique 7 digit number for every name. For example,
Now, if someone has to search for 'Tony Stark', rather than searching strings, we can convert the name into numeric 7 digit code and can then look for that code which would be multifold faster than string search.
The table which stores the keys and their respective hashed value is termed as a hash table, and the function which is used to generate the hash keys is called a hash function.
Hashing is not only useful for searching and indexing but this technique is also used for encryption.
It is used to transform a string into a hash key. Ideally, the hash function should map each string to a unique hash key, but it is indeed very difficult to achieve it practically.
One way to create such a perfect hash function is to have the size of the hash table so large that it can accommodate all the hash keys in it. But then there will be no point of calling it hashing, and memory will be wasted too as most of the keys aren't required frequently.
Our goal is to choose a hash function that quickly calculates a hash value, evenly distributes the keys into the hash table and causes a minimum number of collisions.
The condition in which for two records the same hash key is generated.
Finding an alternate location for the hashed key is called collision resolution. There are a number of such techniques:
1. Direct Chaining: Array of linked list implementation.
2. Open Addressing: Array-based implementation.
When two or more string hash to the same location(hash key), we can append them into a singly linked list (called chain) pointed by the hash key.
It is called open hashing because it uses extra memory to resolve collision.
Search the hash table sequentially starting from the original hash location. If we reach the end of the table, wrap up from last to the first location using the following function :
Rehash(datastring) = (id + stepsize) % TableSize
The major disadvantage of linear probing is the clustering of keys together in a consecutive pattern. This results in one part being very dense, while other parts having few items. It causes long probe searches.
Start from the original hash location, and check for locations id + 12, id + 22, id + 32,.. and so on.
Wrap around the table if the end is reached.
The function is:
Rehash(key) = (id + k2) % TableSize
Clustering may also occur in the case of quadratic probing.
The interval between probes is calculated using another hash function, hence the name double hashing.
It reduces clustering effectively.
Rehash(key) = (id + k * ID) % tableSize
where ID is the value calculated by the first hash function.