Hash Function and Map

An explanation of hash maps and hash tables, and how hash functions are used in cryptography.



Today we discuss...

Hash functions and mapping in operating systems.

{ Please note: The content of this article is organized like a research paper, so there is a lot of material for deep knowledge, such as how hash functions work, their use in cryptography, hash tables, and more. So read it carefully. }

A hash function takes data as input, performs a numeric operation on that data, and returns a numeric value. The numeric value can then be used as an index into a table to quickly retrieve the data. Whereas searching for a data item through a list of size n can require up to O(n) comparisons in the worst case, using a hash function to retrieve data from a table can be as good as O(1), depending on implementation details. Because of this performance, hash functions are used extensively in operating systems.


One potential difficulty with hash functions is that two inputs can result in the same output value; that is, they can link to the same table location. We can accommodate this hash collision by having a linked list at that table location that contains all of the items with the same hash value.



Hash mapping table





One use of a hash function is to implement a hash map, which associates [key:value] pairs using a hash function. 

 For example, we can map the key "operating" to the value "system".

 Once the mapping is established, we can apply the hash function to the key to obtain the value from the hash map. For example, suppose that a username is mapped to a password. Password authentication then proceeds as follows: a user enters a user name and password. The hash function is applied to the user name, which is then used to retrieve the password. The retrieved password is then compared with the password entered by the user for authentication.
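To make the password example concrete, here is a minimal sketch using Java's standard java.util.HashMap; the map name passwordTable and the sample credentials are made up for illustration (and real systems would store hashed passwords rather than plain text):

    import java.util.HashMap;
    import java.util.Map;

    public class PasswordCheck {
        public static void main(String[] args) {
            // The hash map associates [key:value] pairs, e.g. "operating" -> "system".
            Map<String, String> passwordTable = new HashMap<>();
            passwordTable.put("operating", "system");   // sample username -> password
            passwordTable.put("alice", "s3cret");

            // Authentication: hash the entered user name, retrieve the stored
            // password, and compare it with the password the user typed in.
            String enteredUser = "alice";
            String enteredPassword = "s3cret";
            String storedPassword = passwordTable.get(enteredUser); // hash lookup, O(1) on average
            boolean authenticated = storedPassword != null
                    && storedPassword.equals(enteredPassword);
            System.out.println("authenticated = " + authenticated); // prints: authenticated = true
        }
    }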


In hashing, large keys are converted into small keys by using hash functions. The values are then stored in a data structure called a hash table. The idea of hashing is to distribute entries (key/value pairs) uniformly across an array. Each element is assigned a key (a converted key). By using that key, you can access the element in O(1) time. Using the key, the algorithm (hash function) computes an index that suggests where an entry can be found or inserted.

Hashing is implemented in two steps:

  •  An element is converted into an integer by using a hash function. That integer can be used as an index to store the original element in the hash table.
  • The element is stored in the hash table, where it can be quickly retrieved using the hashed key.

    hash = hashfunc(key)
    index = hash % array_size

In this method, the hash is independent of the array size and it is then reduced to an index (a number between 0 and array_size − 1) by using the modulo operator (%).
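To make these two steps concrete, here is a hedged sketch in Java; the character-summing hashfunc and the array size of 13 are made-up choices for illustration, not part of the original example:

    public class SimpleHashing {
        static final int ARRAY_SIZE = 13;   // size of the backing array (number of buckets)

        // A hypothetical hash function: sum of the character codes of the key.
        static int hashfunc(String key) {
            int hash = 0;
            for (char c : key.toCharArray()) {
                hash += c;
            }
            return hash;
        }

        public static void main(String[] args) {
            String key = "operating";
            int hash = hashfunc(key);        // step 1: element -> integer
            int index = hash % ARRAY_SIZE;   // step 2: reduce to 0 .. ARRAY_SIZE - 1
            System.out.println("hash = " + hash + ", index = " + index);
        }
    }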


Simple hash functions

The following functions map a single integer key (k) to a small integer bucket value h(k). Here m is the size of the hash table (number of buckets).

Division method (Cormen). Choose m to be a prime that isn't close to a power of 2. Then h(k) = k mod m. This works badly for many types of patterns in the input data.

Knuth Variant on Division. h(k) = k(k+3) mod m. Supposedly works much better than the raw division method.

Multiplication Method (Cormen). Choose m to be a power of 2. Let A be some random-looking real number; Knuth suggests A = (sqrt(5) - 1)/2 ≈ 0.618.

 Then do the following:

    s = k * A
    x = fractional part of s
    h(k) = floor(m * x)


This seems to be the method that the theoreticians like.

To do this quickly with integer arithmetic, let w be the number of bits in a machine word (e.g., 32) and suppose m = 2^p. Then compute:

    s = floor(A * 2^w)
    x = k * s             // computed in a w-bit word, i.e., modulo 2^w
    h(k) = x >> (w - p)   // right-shift x by (w - p) bits to extract the p most significant bits of x
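As a hedged illustration of the three methods above, here is a small Java sketch; the table sizes (13 for the division methods, m = 2^3 for the multiplication method) and the sample key are arbitrary choices for the example:

    public class SimpleHashFunctions {
        // Division method: m should be a prime not too close to a power of 2.
        static int divisionHash(int k, int m) {
            return Math.floorMod(k, m);                          // h(k) = k mod m
        }

        // Knuth variant on division (computed in long to avoid overflow).
        static int knuthDivisionHash(int k, int m) {
            return (int) Math.floorMod((long) k * (k + 3), (long) m);  // h(k) = k(k+3) mod m
        }

        // Multiplication method: m must be a power of 2, A = (sqrt(5) - 1)/2.
        static int multiplicationHash(int k, int m) {
            double A = (Math.sqrt(5) - 1) / 2;                   // ~0.6180339887
            double s = k * A;
            double x = s - Math.floor(s);                        // fractional part of s
            return (int) Math.floor(m * x);                      // h(k) = floor(m * x)
        }

        // Integer-arithmetic version: w = 32 bits, m = 2^p.
        static int multiplicationHashInt(int k, int p) {
            final int w = 32;
            long s = (long) Math.floor(((Math.sqrt(5) - 1) / 2) * Math.pow(2, w)); // floor(A * 2^w)
            long x = (k * s) & 0xFFFFFFFFL;                      // k*s, kept to the low w bits
            return (int) (x >>> (w - p));                        // p most significant bits of x
        }

        public static void main(String[] args) {
            int k = 123456;
            System.out.println(divisionHash(k, 13));
            System.out.println(knuthDivisionHash(k, 13));
            System.out.println(multiplicationHash(k, 8));        // m = 2^3
            System.out.println(multiplicationHashInt(k, 3));     // p = 3, so m = 2^3
        }
    }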


Hash tables

Suppose we want a data structure to implement either a mutable set of elements (with operations like contains, add, and remove that take an element as an argument) or a mutable map from keys to values (with operations like get, put, and remove that take a key as an argument).

A mutable map is also known as an associative array. We've now seen a few data structures that could be used for both of these implementation tasks. We consider the problem of implementing sets and maps together because most data structures that can implement a set can also implement a map. A set of key–value pairs can act as a map, as long as the way we compare key–value pairs is to compare the keys. Alternatively, we can view the transformation from a set to a map as starting with a data structure that implements a set of keys and then adding an associated value to each data structure node that stores a key. Here are the data structures we've seen so far, with the asymptotic complexities for each of their key operations:




    Data structure    lookup (contains/get)    add/put    remove
    Array             O(n)                     O(1)       O(n)
    Function          O(1)                     O(n)       N/A
    Linked list       O(n)                     O(1)       O(n)
    Search tree       O(lg n)                  O(lg n)    O(lg n)


A hash table chains elements into linked lists to give good performance: the drawback of hashing is that two different inputs can map to the same value, and the linked list at that location is how the colliding elements are found and stored in memory. There is some more detail on how hashing works in practice below. Many variations on hash tables have been developed. We'll explore the most common ones, building up in steps.




 Step 1:

 Direct address tables.

  While arrays make a slow data structure when you don't know what index to look at, all of their operations are very fast when you do. This is the insight behind the direct address table. Suppose that for each element that we want to store in the data structure, we can determine a unique integer index in the range {0, 1, 2, ..., m-1}.

That is, we need an injective function that maps elements (or keys) to integers in the range. Then we can use the indices produced by the function to decide at which index to store the elements in an array of size m. For example, suppose we are maintaining a collection of objects representing houses on the same street. We can use the street address as the index into a direct address table. Not every possible street address will be used, so some array entries will be empty. This is not a problem as long as there are not too many empty entries. However, it is often hard to come up with an injective function that does not require many empty entries.
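A minimal sketch of a direct address table for the street example; the House record and the upper bound of 500 house numbers are made up for illustration:

    public class DirectAddressTable {
        record House(int number, String owner) {}

        // One array slot per possible street address; many slots stay empty.
        private final House[] slots;

        DirectAddressTable(int maxAddress) {
            slots = new House[maxAddress + 1];
        }

        void put(House h)             { slots[h.number()] = h; }     // O(1)
        House get(int streetNumber)   { return slots[streetNumber]; } // O(1)
        void remove(int streetNumber) { slots[streetNumber] = null; } // O(1)

        public static void main(String[] args) {
            DirectAddressTable street = new DirectAddressTable(500);
            street.put(new House(221, "Holmes"));
            System.out.println(street.get(221));   // House[number=221, owner=Holmes]
            System.out.println(street.get(42));    // null (empty slot)
        }
    }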

 For example: 

 Suppose that instead we are maintaining a collection of employees whom we want to look up by social security number. Using the social security number as the index into a direct address table means we need an array of a billion elements (social security numbers have nine digits), almost all of which are likely to be unused. Even assuming our computer has enough memory to store such a sparse array, it will be a waste of memory. Furthermore, on most computer hardware, the use of caches means that accesses to large arrays are actually significantly slower than accesses to small arrays—sometimes, two orders of magnitude slower!

Step 2:

 Hashing

 Instead of requiring that the key be mapped to an index without any collisions, we allow collisions in which two keys map to the same array index. To avoid having many collisions, this mapping is performed by a hash function that maps the key in a reproducible but "random" way to a hash, that is, a legal array index. If the hash function is good, collisions occur as if completely at random.

Suppose that we are using an array with 13 entries and our keys are social security numbers, expressed as long values. Then we might use modular hashing, in which the array index is computed as key % 13. This is not a very random hash function, but is likely to be good enough unless there is an adversary purposely trying to produce collisions.


  Step 3:

Collision resolution

 Open hashing (chaining):  There are two main ideas for how to deal with collisions. The best way is usually chaining: each array entry corresponds to a bucket containing a mutable set of elements. (Confusingly, this approach is also known as closed addressing or open hashing.) Typically, the bucket is implemented as a linked list, so each array entry (if nonempty) contains a pointer to the head of the linked list. To check whether an element is in the hash table, the key is first hashed to find the correct bucket to look in. Then, the linked list is scanned to see if the desired element is present. If the linked list is short, this scan is very quick. An element is added or removed by hashing it to find the correct bucket. Then, the bucket is checked to see if the element is there, and finally the element is added or removed appropriately from the bucket in the usual way for linked lists.
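Here is a minimal sketch of a chained hash table of long keys in Java; the class name, the fixed 13 buckets, and the modular hash are assumptions for illustration (a real implementation would also resize, as discussed below):

    import java.util.LinkedList;

    // A minimal chained (open-hashing) set of long keys, e.g. social security numbers.
    public class ChainedHashSet {
        private static final int BUCKETS = 13;

        @SuppressWarnings("unchecked")
        private final LinkedList<Long>[] table = new LinkedList[BUCKETS];

        private int indexFor(long key) {
            return (int) Math.floorMod(key, (long) BUCKETS);   // hash the key to a bucket
        }

        public boolean contains(long key) {
            LinkedList<Long> bucket = table[indexFor(key)];
            return bucket != null && bucket.contains(key);     // scan the (short) chain
        }

        public void add(long key) {
            int i = indexFor(key);
            if (table[i] == null) table[i] = new LinkedList<>();
            if (!table[i].contains(key)) table[i].add(key);
        }

        public void remove(long key) {
            LinkedList<Long> bucket = table[indexFor(key)];
            if (bucket != null) bucket.remove(Long.valueOf(key));
        }

        public static void main(String[] args) {
            ChainedHashSet set = new ChainedHashSet();
            set.add(123_45_6789L);
            set.add(123_45_6789L + 13);                        // collides: same bucket, same chain
            System.out.println(set.contains(123_45_6789L));    // true
            set.remove(123_45_6789L);
            System.out.println(set.contains(123_45_6789L));    // false
        }
    }

As long as the chains stay short, contains, add, and remove all run in O(1) expected time.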

 Closed hashing (probing)

 Another approach to collision resolution that is worth knowing about is probing. (Confusingly, this technique is also known as open addressing or closed hashing.) Rather than put colliding elements in a linked list, all elements are stored in the array itself. When adding a new element to the hash table creates a collision, the hash table finds somewhere else in the array to put it. The simple way to find an empty index is to search ahead through the array indices with a fixed stride (often 1), looking for an unused entry; this linear probing strategy tends to produce a lot of clustering of elements in the table, leading to bad performance. A better strategy is to use a second hash function to compute the probing interval; this strategy is called double hashing. Regardless of how probing is implemented, however, the time required to search for or add an element grows rapidly as the hash table fills up. By contrast, the performance of chaining degrades more gracefully, and chaining is usually faster than probing even when the hash table is not nearly full. Therefore chaining is usually preferred over probing. A recently popular variant of closed hashing is Cuckoo hashing, in which two hash functions are used. Each element is stored at one of the two locations computed by these hash functions, so at most two table locations must be consulted in order to determine whether the element is present. If both possible locations are occupied, the newly added element displaces the element that was there, and this element is then re-added to the table. In general, a chain of displacements occurs. 
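And a minimal sketch of closed hashing with linear probing (stride 1); deletion and resizing are left out to keep the example short, and the fixed table size of 16 is an arbitrary assumption:

    // A minimal closed-hashing (linear-probing) set of non-null string keys.
    // The table must stay partly empty, or add() would loop forever.
    public class ProbingHashSet {
        private final String[] table = new String[16];

        private int indexFor(String key) {
            return Math.floorMod(key.hashCode(), table.length);
        }

        public void add(String key) {
            int i = indexFor(key);
            while (table[i] != null && !table[i].equals(key)) {
                i = (i + 1) % table.length;                 // linear probe: stride 1
            }
            table[i] = key;
        }

        public boolean contains(String key) {
            int i = indexFor(key);
            while (table[i] != null) {
                if (table[i].equals(key)) return true;
                i = (i + 1) % table.length;                 // keep probing until an empty slot
            }
            return false;
        }

        public static void main(String[] args) {
            ProbingHashSet set = new ProbingHashSet();
            set.add("operating");
            set.add("system");
            System.out.println(set.contains("operating")); // true
            System.out.println(set.contains("kernel"));    // false
        }
    }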

 Performance of hash tables

 Suppose we are using a chained hash table with m buckets, and the number of elements in the hash table is n. Then the average number of elements per bucket is n/m, which is called the load factor of the hash table, denoted α. When an element that is not in the hash table is searched for, the expected length of the linked list traversed is α. Since there is always the initial (constant) cost of hashing, the cost of hash table operations with a good hash function is, 

on average, O(1 + α). 

If we can ensure that the load factor α never exceeds some fixed value αmax, then all operations will be 

                  O(1 + αmax) = O(1).

 In practice, we will get the best performance out of hash tables when α is within a narrow range, from approximately 1/2 to 2. 

● If α is less than 1/2, the bucket array is becoming sparse and a smaller array is likely to give better performance.

 ● If α is greater than 2, the cost of traversing the linked lists limits performance.

 One way to hit the desired range for α is to allocate a bucket array of just the right size for the number of elements that are being added to it. In general, however, it's hard to know ahead of time what this size will be, and in any case, the number of elements in the hash table may need to change over time.

Step 4:

 Resizable arrays

 Since we can't predict how big to make the bucket array ahead of time, why not dynamically adjust its size? We can use a resizable array data structure to achieve this. Instead of representing the hash table as a bucket array, we introduce a header object that contains a pointer to the current bucket array and also keeps track of the number of elements in the hash table. Whenever adding an element would cause α to exceed αmax, the hash table generates a new bucket array whose size is a multiple of the original size. Typically, the new bucket array is twice the size of the current one. Then, all of the elements must be rehashed into the new bucket array. This means a change of hash function; typically, hash functions are designed so they take the array size m as a parameter, so this parameter just needs to be changed.
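A hedged sketch of this resizing scheme, building on the chained table above; ALPHA_MAX, the starting size of 8, and the other names are illustrative assumptions:

    import java.util.LinkedList;

    // A minimal resizable chained hash set of long keys: the bucket array doubles
    // whenever the load factor n/m would exceed ALPHA_MAX, and every element is rehashed.
    public class ResizableHashSet {
        private static final double ALPHA_MAX = 2.0;    // fixed bound on the load factor

        private LinkedList<Long>[] buckets = newBucketArray(8);
        private int size = 0;                           // number of elements n

        @SuppressWarnings("unchecked")
        private static LinkedList<Long>[] newBucketArray(int m) {
            return new LinkedList[m];
        }

        private static int indexFor(long key, int m) {
            return (int) Math.floorMod(key, (long) m);  // the hash takes the array size m as a parameter
        }

        public void add(long key) {
            if (size + 1 > ALPHA_MAX * buckets.length) resize();
            int i = indexFor(key, buckets.length);
            if (buckets[i] == null) buckets[i] = new LinkedList<>();
            if (!buckets[i].contains(key)) {
                buckets[i].add(key);
                size++;
            }
        }

        public boolean contains(long key) {
            LinkedList<Long> b = buckets[indexFor(key, buckets.length)];
            return b != null && b.contains(key);
        }

        private void resize() {
            LinkedList<Long>[] old = buckets;
            buckets = newBucketArray(old.length * 2);   // double the bucket array
            for (LinkedList<Long> bucket : old) {
                if (bucket == null) continue;
                for (long key : bucket) {               // rehash every element with the new m
                    int i = indexFor(key, buckets.length);
                    if (buckets[i] == null) buckets[i] = new LinkedList<>();
                    buckets[i].add(key);
                }
            }
        }

        public static void main(String[] args) {
            ResizableHashSet set = new ResizableHashSet();
            for (long k = 0; k < 100; k++) set.add(k);  // triggers several doublings
            System.out.println(set.contains(42L));      // prints: true
        }
    }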

 Amortized complexity

 Since some add() operations cause all the elements to be rehashed, the cost of each such operation is O(n) in the number of elements. For a large hash table, this may take enough time that it causes problems for the program. Perhaps surprisingly, however, the cost per operation, averaged over a whole sequence of operations, is still O(1).

 In particular, any sequence of n operations on the hash table always takes O(n) time, or O(1) per operation. Therefore we say that the amortized asymptotic complexity of hash table operations is O(1).

 To see why this is true, consider a hash table with αmax = 1. The most expensive sequence of n operations we can do is a series of n add() calls where n = 2^j, meaning that the hash table resizes on the very last call to add().

The cost of the operations can be measured in the number of uses of the hash function. There are n initial hashes when elements are added. The hash table is resized whenever it hits a power of two in size, so the extra hashes caused by resizing are

         1 + 2 + 4 + 8 + ... + 2^j.

This sum is bounded by 2 * 2^j = 2n, so the total number of hashes is less than 3n, which is O(n).

 Notice that it is crucial that the array size grows geometrically (doubling). It may be tempting to grow the array by a fixed increment (e.g., 100 elements at a time), but this causes n elements to be rehashed O(n) times on average, resulting in O(n^2) total insertion time, or an amortized complexity of O(n).
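A small illustrative simulation (not from the article) that just counts rehashes under the two growth policies makes the difference visible:

    // Counts how many times elements are rehashed while inserting n keys,
    // comparing geometric doubling with a fixed-increment growth policy.
    public class RehashCount {
        static long rehashesWithDoubling(int n) {
            long rehashes = 0;
            int capacity = 1;
            for (int size = 1; size <= n; size++) {
                if (size > capacity) {            // table is full: double and rehash everything
                    rehashes += size - 1;
                    capacity *= 2;
                }
            }
            return rehashes;
        }

        static long rehashesWithFixedIncrement(int n, int increment) {
            long rehashes = 0;
            int capacity = increment;
            for (int size = 1; size <= n; size++) {
                if (size > capacity) {            // table is full: add `increment` slots and rehash
                    rehashes += size - 1;
                    capacity += increment;
                }
            }
            return rehashes;
        }

        public static void main(String[] args) {
            int n = 1 << 20;                      // about a million insertions
            System.out.println("doubling:        " + rehashesWithDoubling(n));
            System.out.println("fixed increment: " + rehashesWithFixedIncrement(n, 100));
        }
    }

Running it shows the doubling policy performs roughly n extra hashes in total, while the fixed-increment policy performs on the order of n^2/200.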


Code:

A hash table implementation usually splits hashing into two functions: a client-supplied hash code function h_client on keys, and an implementation-side function h_impl that converts a hash code into a bucket index. The actual hash function used by the table is the composition of these two: h_client is applied first, and h_impl then reduces the result to a legal bucket index.

 



To see what goes wrong, suppose our hash code function on objects is the memory address of the objects, as in Java. This is the usual choice. And suppose that our implementation hash function is like the one in SML/NJ; it takes the hash code modulo the number of buckets, where the number of buckets is always a power of two. This is also the usual implementation-side choice. But memory addresses are typically equal to zero modulo 16, so at most 1/16 of the buckets will be used, and the performance of the hash table will be 16 times slower than one might expect.  
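A small sketch of the composition and the wasted-bucket effect; the numbers (64 buckets, address-like hash codes that are multiples of 16) follow the example in the paragraph above, and the method names are made up:

    import java.util.HashSet;
    import java.util.Set;

    public class ComposedHash {
        static final int BUCKETS = 64;                 // a power of two, as in many implementations

        // Implementation-side hash: reduce a hash code to a bucket index.
        static int hImpl(int hashCode) {
            return Math.floorMod(hashCode, BUCKETS);
        }

        // Client-side hash code: here, something address-like that is always a multiple of 16.
        static int hClient(int objectId) {
            return objectId * 16;
        }

        public static void main(String[] args) {
            Set<Integer> usedBuckets = new HashSet<>();
            for (int id = 0; id < 1000; id++) {
                usedBuckets.add(hImpl(hClient(id)));   // the actual hash: hImpl applied after hClient
            }
            // Only 64/16 = 4 of the 64 buckets are ever used.
            System.out.println("buckets used: " + usedBuckets.size() + " of " + BUCKETS);
        }
    }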



Message transfer through a hashing algorithm

[Image 2]
This is called hashing—a technique often used to secure passwords (among other things). Instead of keeping your secret, “dog”, in plain text for everyone to see, I’ll store the ugly 32-character code (the code is commonly called a hash). 

 Message Digest Algorithm 5 

 Message Digest Algorithm 5 (MD5) is a cryptographic hash algorithm that can be used to create a 128-bit hash value from an arbitrary-length input. Although insecurities have been identified in MD5, it is still widely used. MD5 is most commonly used to verify the integrity of files. However, it is also used in other security protocols and applications such as SSH, SSL, and IPSec. Some applications strengthen the MD5 algorithm by adding a salt value to the plaintext or by applying the hash function multiple times.
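For illustration, here is how an MD5 digest can be computed in Java with the standard java.security.MessageDigest API and printed as the familiar 32-character hex string:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class Md5Demo {
        // Returns the MD5 digest of the input as a 32-character hex string.
        static String md5Hex(String input) throws NoSuchAlgorithmException {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8)); // 128 bits = 16 bytes
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            // Instead of storing "dog" in plain text, store its 32-character hash.
            System.out.println(md5Hex("dog"));
        }
    }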

 Examining Checksum and Hashes: 

 There's one central problem with relying on file attributes to determine if the files have been changed: File attributes are easy to fake. It's dead simple to set the file to any size, date, and time you want. Most applications won't bother to do this, but sometimes viruses, Trojans, or root kits do something like this to hide. One way around this trick is to use checksum or cryptographic hash algorithms on the files and store the results.

 Checksums, such as a cyclic redundancy check (CRC), are also pretty easy to fake if the attacker or attacking program knows which checksum algorithm is being used to check files, so it is recommended that you use a cryptographically strong hash algorithm instead. The essential property of a hash algorithm that we're interested in is that the chances of two files hashing to the same value are vanishingly small, so it is not feasible for an attacker to produce a different file that hashes to the same value. Hash values are typically 128 or 160 bits long, so they are much smaller than the typical file they summarize.

Hash map and hash function concepts are widely used in operating systems to manage data processing, so if your research belongs to this field, this information will help you a lot.

 God bless you for your research. Amen...

✍✍✍✍🎗🎗

About the author

D Shwari
I'm a professor at National University's Department of Computer Science. My main areas are data science and data analysis, along with project management for many computer science-related sectors. My next project is on AI with deep learning.
