Advantages:
Hash tables offer very fast insertion and lookup, close to constant time on average.
Disadvantages:
Hash tables are usually based on arrays, which are difficult to expand once created.
There is also no simple way to traverse the items in the table in a particular order (for example, from smallest to largest).
To sum up: if you don't need to traverse the data in order and you can predict the data volume in advance, then hash tables are unmatched in speed and ease of use.
1. Use the hash function to convert the search key into an array index.
2. Handle hash collisions.
If the key is k, its value is stored at location f(k), so the desired record can be retrieved directly, without comparisons. This correspondence f is called a hash function, and a table built on this idea is a hash table.
If every key in the key set is equally likely to map to any address in the address set, the hash function is called a uniform hash function: keys obtain a "random address" through the hash function, which reduces collisions.
A hash function makes access to data faster and more efficient: data elements can be located directly through the hash function.
A good hash function should usually consider the following factors:
1. The calculation is simple, so the conversion is fast.
2. The addresses computed from the keys are evenly distributed over the address space, minimizing collisions.
1. Direct addressing method
Take the key itself, or a linear function of the key, as the hash address: H(key) = key, or H(key) = a * key + b (where a and b are integers). Such a hash function is also called a self-function. If the address H(key) is already occupied, search forward from that position until an empty location is found, and place the element there.
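A minimal sketch of this idea (the constants a and b, the table size, and the sample key are illustrative assumptions, not from the original):

```python
# Direct addressing: H(key) = a * key + b, with a linear scan forward
# when the computed address is already occupied (as described above).
a, b = 1, 0                 # illustrative constants: H(key) = key
table = [None] * 100        # assumes all addresses fall in [0, 100)

def h(key):
    return (a * key + b) % len(table)

def insert(key, value):
    index = h(key)
    while table[index] is not None:      # occupied: try the next slot
        index = (index + 1) % len(table)
    table[index] = (key, value)

insert(42, "answer")
print(table[42])            # (42, 'answer')
```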
2. Digital analysis method
Digital analysis means studying the digit patterns of the keys and selecting the digit positions that are most evenly distributed to construct hash addresses with a low collision probability.
3. Mid-square method
Square the key and take the middle digits of the result as the hash address. The principle is that squaring amplifies differences, and the middle digits of the square depend on every digit of the original number, so different keys are unlikely to produce the same hash value and the resulting addresses are fairly uniform. This method suits keys in which some digit positions repeat frequently.
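A short sketch of the mid-square method (the digit widths and the sample key are illustrative assumptions):

```python
# Mid-square method: square the key and take the middle digits of the
# square as the hash address.
def mid_square(key, num_digits=3):
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(0, mid - num_digits // 2)
    return int(squared[start:start + num_digits])

print(mid_square(4567))     # 4567^2 = 20857489 -> middle digits 574
```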
4. Folding method
Folding divides the key into several parts with the same number of digits (the last part may be shorter) and takes the superposition sum of these parts (note: discarding any carry) as the hash address.
Superposition comes in two forms: shift superposition and boundary superposition. Shift superposition aligns the lowest digits of the parts and adds them; boundary superposition folds the parts back and forth along the dividing lines, like folding paper, and then aligns and adds them.
This method is suitable for keys with many digits, as in the sketch below.
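A sketch of shift superposition (the part width, table size, and sample key are illustrative assumptions):

```python
# Folding method, shift superposition: split the key into equal-width
# parts, align their low digits, add them, and drop the carry.
def fold(key, part_len=3, table_digits=3):
    digits = str(key)
    parts = [int(digits[i:i + part_len])
             for i in range(0, len(digits), part_len)]
    return sum(parts) % (10 ** table_digits)   # modulus drops the carry

print(fold(5864224857))     # 586 + 422 + 485 + 7 = 1500 -> address 500
```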
5. Random number method
Take a random function value of the key as the hash address, i.e. H(key) = random(key), where random is a random function; this is usually used when key lengths differ.
6. Division and remainder method
Take the remainder of the key divided by some number p not greater than the hash-table length m as the hash address: H(key) = key mod p, p ≤ m. In practice p is usually chosen to be a prime number close to m.
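A sketch of the division-and-remainder method (the table length and keys are illustrative); note how two keys can land on the same address, which leads directly to the collision problem discussed next:

```python
# Division and remainder method: H(key) = key mod p, p <= m.
m = 13                      # table length
p = 13                      # modulus, here a prime equal to m

def h(key):
    return key % p

print(h(27), h(40))         # 1 1 -> two keys colliding at address 1
```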
Different keys may produce the same hash address: k1 ≠ k2 but f(k1) = f(k2). This phenomenon is called a collision. For a given hash function, keys with the same function value are called synonyms.
A well-constructed hash function can reduce collisions, but in general they cannot be avoided entirely, so collision resolution is the other key issue of hashing. Collisions arise both when building a hash table and when searching it, and the resolution method must be the same in both cases.
Taking the creation of a hash table as the example, the collision-resolution methods are explained below.
1. Open addressing method
This method is also called the rehash method. The basic idea: when the hash address p = H(key) of key collides, another hash address p1 is generated based on p; if p1 still collides, p2 is generated, and so on, until a non-colliding hash address pi is found and the element is stored there. The general form of the rehash function is: Hi = (H(key) + di) % m, i = 1, 2, …, m−1, where H(key) is the hash function, m is the table length, di is an increment sequence, and i is the probe number. Different increment sequences give different rehash methods; the main ones are:
(1) Linear probing rehash
di = 1, 2, 3, …, m−1
The characteristic of this method: when a collision occurs, the following cells of the table are examined in sequence until an empty cell is found or the whole table has been searched.
(2) Quadratic probing rehash
di = 1², −1², 2², −2², …, k², −k² (k ≤ m/2)
The characteristic of this method: when a collision occurs, it probes positions on both sides of the original address in jumps, which is more flexible.
(3) Pseudo-random probing rehash
di = a pseudo-random number sequence.
The advantage of linear probing rehash is that as long as the table is not full, a non-colliding hash address can always be found, whereas quadratic probing and pseudo-random probing cannot guarantee this. Linear probing, however, easily produces "secondary clustering": resolving collisions between synonyms causes non-synonyms to collide as well.
In fact, besides the methods above, there are many variants of open addressing; they differ only in how di is defined. (For example, double hashing: di = i * h2(key).)
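A minimal sketch of open addressing with linear probing, Hi = (H(key) + i) % m; the table size and the use of Python's built-in hash are illustrative assumptions:

```python
# Open addressing with linear probing: on a collision, step to the
# next slot modulo the table length until a free slot is found.
class LinearProbingTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [None] * m          # (key, value) pairs or None

    def _h(self, key):
        return hash(key) % self.m

    def put(self, key, value):
        i = self._h(key)
        for step in range(self.m):       # probe at most m slots
            j = (i + step) % self.m
            if self.slots[j] is None or self.slots[j][0] == key:
                self.slots[j] = (key, value)
                return
        raise RuntimeError("hash table is full")

    def get(self, key):
        i = self._h(key)
        for step in range(self.m):
            j = (i + step) % self.m
            if self.slots[j] is None:    # empty slot ends the probe chain
                return None
            if self.slots[j][0] == key:
                return self.slots[j][1]
        return None

t = LinearProbingTable()
t.put("apple", 1); t.put("pear", 2)
print(t.get("apple"))                    # 1
```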
2. Rehashing method (multiple hash functions)
This method constructs several different hash functions at the same time: Hi = RHi(key), i = 1, 2, …, n.
When the hash address H1 = RH1(key) collides, H2 = RH2(key) is computed instead, and so on, until the collision disappears. This method is not prone to clustering, but it increases computation time.
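A sketch of rehashing with several hash functions; the particular functions below are arbitrary illustrations, not standard choices:

```python
# Rehashing: when RH1(key) collides, try RH2(key), then RH3(key), ...
m = 11
hash_functions = [
    lambda k: k % m,                 # RH1
    lambda k: (k // 7) % m,          # RH2
    lambda k: (k * 31 + 5) % m,      # RH3
]

table = [None] * m

def insert(key):
    for rh in hash_functions:
        addr = rh(key)
        if table[addr] is None:
            table[addr] = key
            return addr
    raise RuntimeError("all hash functions collided")

print(insert(22), insert(33))  # 22 % 11 == 33 % 11 == 0, so 33 is rehashed
```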
3. Chain address method (zipper method)
The basic idea: all elements with the same hash address are linked into a singly linked list, called a synonym chain, and the head pointers of these lists are stored in the hash table (an array). Search, insertion, and deletion are therefore carried out mainly within the synonym chain. If the chosen table length is m, the hash table is defined as a pointer array T[0..m−1] of m head pointers; every node whose hash address is i is inserted into the singly linked list headed by T[i]. Each component of T is initialized to a null pointer. The chain address method suits frequent insertion and deletion.
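A minimal sketch of the chain address method; Python lists stand in for the singly linked synonym chains:

```python
# Chain address (zipper) method: T[0..m-1] holds the synonym chains.
class ChainedHashTable:
    def __init__(self, m=11):
        self.buckets = [[] for _ in range(m)]   # T[0..m-1]

    def _h(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._h(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                        # update existing key
                bucket[i] = (key, value)
                return
        bucket.append((key, value))             # insert new synonym

    def get(self, key):
        for k, v in self.buckets[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):                      # deletion is easy here
        i = self._h(key)
        self.buckets[i] = [(k, v) for k, v in self.buckets[i] if k != key]

t = ChainedHashTable()
t.put("a", 1); t.put("b", 2)
print(t.get("a"), t.get("b"))                   # 1 2
```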
Advantages of the zipper method
Compared with the open addressing method, the zipper method has the following advantages:
(1) The zipper method handles collisions simply and has no accumulation phenomenon, that is, non-synonyms never collide, so the average search length is shorter;
(2) Since the node space of each linked list is allocated dynamically, the zipper method is better suited to cases where the table length cannot be determined before the table is built;
(3) To reduce collisions, open addressing requires a small fill factor α, which wastes a lot of space when the number of nodes is large. In the zipper method, α ≥ 1 can be chosen in theory, and when nodes are large the pointer field added by the zipper method is negligible, so space is saved. (The fill factor of a hash table is defined as: α = number of elements filled in the table / length of the hash table.)
Note: the default load (fill) factor of Java's HashMap is 0.75.
(4) In a hash table built with the zipper method, deleting a node is easy: just remove the corresponding node from its linked list. In a table built with open addressing, however, a deleted node's slot cannot simply be emptied, because that would cut off the probe path to synonyms stored later in the table: in all the open addressing methods, an empty address unit means "the element is not here, stop searching". Therefore, in a table that resolves collisions by open addressing, deleted nodes can only be marked as deleted, not actually removed.
Disadvantages of the zipper method
The disadvantage of the zipper method is that the pointers require extra space. When nodes are small, open addressing saves space, and the saved pointer space can be used to enlarge the table, which lowers the fill factor, which in turn reduces collisions in open addressing and improves the average search speed.
4. Set up a public overflow area.
The basic idea: split the hash table into a basic table and an overflow table; any element that collides in the basic table is placed in the overflow table (so elements are stored in two separate tables).
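A sketch of the public-overflow-area scheme (table length and keys are illustrative):

```python
# Public overflow area: colliding elements go to a separate overflow
# table, which is searched sequentially.
m = 11
basic = [None] * m   # basic table
overflow = []        # public overflow table

def insert(key):
    addr = key % m
    if basic[addr] is None:
        basic[addr] = key
    else:
        overflow.append(key)     # any collision goes to the overflow area

def find(key):
    addr = key % m
    if basic[addr] == key:
        return ("basic", addr)
    if key in overflow:
        return ("overflow", overflow.index(key))
    return None

insert(22); insert(33)           # both hash to address 0
print(find(33))                  # ('overflow', 0)
```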
Searching a hash table is essentially the same process as building it. Some keys are found directly at the address produced by the hash function; others collide at that address and must be located using the collision-resolution method. With any of the methods introduced above, resolving a collision during search is still a process of comparing the given value with stored keys, so the measure of hash-table search efficiency is still the average search length.
During a search, the number of key comparisons depends on the number of collisions: few collisions mean high search efficiency, many collisions mean low search efficiency. Therefore, the factors that affect the number of collisions also affect search efficiency.
There are three factors that affect the number of collisions:
1. Whether the hash function is uniform;
2. The method used to resolve collisions;
3. The fill factor of the hash table.
Fill factor of the hash table
Defined as: α = number of elements filled in the table / length of the hash table.
α indicates how full the hash table is. Since the table length is a constant, α is proportional to the number of elements in the table: the larger α is, the more elements the table holds and the more likely collisions are; the smaller α is, the fewer elements it holds and the less likely collisions are.
In fact, the average search length of a hash table is a function of the fill factor α, though the function differs between collision-resolution methods.
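For reference (standard approximations under the uniform-hashing assumption, not derived in this article): with linear probing, a successful search costs about ½(1 + 1/(1 − α)) comparisons, and with chaining about 1 + α/2. Both grow sharply as α approaches 1, which is why implementations such as HashMap resize the table well before it fills.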
The hash algorithms discussed below are not the hash-table algorithms of a university data-structures course; here "hash algorithm" refers to the cryptographic kind, a cornerstone of cryptography. Having covered the basic definition of hashing, we cannot fail to mention some famous hash algorithms: MD5 and SHA-1 are among the most widely used, and both were designed on the basis of MD4. (Note that both are now considered cryptographically broken and should not be relied on where collision resistance matters.)
The application of hash algorithms in information security is mainly reflected in the following three aspects:
(1) File verification
We are familiar with parity checks and CRC. Neither can resist deliberate data tampering: they detect accidental channel errors during transmission to some extent, but they cannot prevent malicious modification of the data.
The "digital fingerprint" feature of md5 hash algorithm makes it the most widely used file integrity checksum algorithm at present, and many Unix systems provide commands to calculate MD5 checksum.
(2) Digital signature
Hash algorithms are also an important part of modern cryptography. Because asymmetric algorithms are slow, one-way hash functions play an important role in digital-signature protocols: signing the hash value, also called a "digital digest", can be regarded as equivalent to signing the file itself, and such a protocol has other benefits as well.
(3) Authentication protocol
The following authentication protocol is also called challenge-response authentication: it is a simple and secure method when the transmission channel can be eavesdropped on but not tampered with.
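A minimal sketch of the idea using a keyed hash: the server sends a random challenge, and the client proves knowledge of the shared secret by returning a hash computed over it, never the secret itself. The use of HMAC-SHA256 here is an illustrative choice, not the specific protocol from the original text:

```python
# Challenge-response authentication sketch with a keyed hash.
import hmac, hashlib, secrets

shared_secret = b"assumed pre-shared key"      # illustrative secret

# Server side: issue a fresh random challenge.
challenge = secrets.token_bytes(16)

# Client side: respond with a keyed hash of the challenge.
response = hmac.new(shared_secret, challenge, hashlib.sha256).digest()

# Server side: recompute and compare in constant time.
expected = hmac.new(shared_secret, challenge, hashlib.sha256).digest()
print(hmac.compare_digest(response, expected))  # True
```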
A DHT (distributed hash table) is mainly used for distributed caching; it solves the problems caused by dynamically adding and removing nodes in a distributed storage structure. For example, to store data on specific nodes in a distributed storage system, an ordinary hash such as key % N (where key is the data's key and N is the number of machine nodes) maps data to machines; but whenever a machine joins or leaves the cluster, every mapping becomes invalid. For persistent storage this forces a data migration; for a distributed cache it invalidates the other caches.
Four properties for judging the quality of such a hash algorithm:
1. Balance: the hash results should be distributed across all buffers as evenly as possible, so that all buffer space is utilized.
2. Monotonicity: if some content has already been assigned to buffers by the hash and a new buffer is then added to the system, the hash result should guarantee that previously assigned content maps only to its original buffer or to the new buffer, never to a different buffer from the old set.
3. Spread (dispersion): in a distributed environment, a terminal may see only a subset of all buffers. When terminals map content to buffers, different terminals may see different buffer ranges and thus produce inconsistent results: the same content gets mapped to different buffers by different terminals. This should obviously be avoided, because storing the same content in multiple buffers reduces the storage efficiency of the system. Dispersion is defined as the severity of this situation; a good hash algorithm should keep dispersion as low as possible.
4. Load: load is the dispersion problem viewed from the other direction. Just as different terminals can map the same content to different buffers, a particular buffer can be mapped to different content by different users. Like dispersion, this should be avoided, so a good hash algorithm should keep the load on the buffers as low as possible.
In a distributed cluster, adding and removing machines, and automatically dropping machines that fail, are the most basic functions of cluster management. With the commonly used hash-modulo algorithm, much of the existing data cannot be located after machines are added or removed, which badly violates the monotonicity principle. The following explains how the consistent hashing algorithm is designed.
The ketama algorithm used by SpyMemcached works as follows:
Map the data into a large hash space arranged as a ring, as shown in the figure. To store a data item, first compute its hash value, which corresponds to a position on the ring (for example, the position of k1 in the figure), then walk clockwise to the first machine node encountered, say node B, and store k1 on node B.
If node B goes down, the data on node B will fall to node C, as shown in the following figure:
In this way, only node C is affected; the data of the other nodes A and D is untouched. However, this can cause an "avalanche": because node C takes over node B's data, node C's load rises and it too may go down, and so on, until the whole cluster fails.
Therefore, the concept of "virtual nodes" is introduced: imagine many virtual nodes placed around the ring. To store a data item, walk clockwise around the ring to the first virtual node; each virtual node is associated with a real node, as shown in the following figure:
In the figure, A1, A2, B1, B2, C1, C2, D1, and D2 are all virtual nodes. Machine A stores the data of A1 and A2, and machine B stores the data of B1 and B2. Because the virtual nodes are numerous and evenly distributed around the ring, the avalanche problem no longer occurs.
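A compact sketch of consistent hashing with virtual nodes in the spirit of ketama; the node names, replica count, and use of MD5 for ring positions are illustrative assumptions:

```python
# Consistent hashing with virtual nodes: each real node is placed on
# the ring many times; a key is served by the first point clockwise.
import hashlib
from bisect import bisect, insort

class ConsistentHash:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []                   # sorted list of (hash, real node)
        for node in nodes:
            self.add_node(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add_node(self, node):
        for i in range(self.replicas):   # place the virtual nodes
            insort(self.ring, (self._hash(f"{node}#{i}"), node))

    def remove_node(self, node):         # only this node's keys remap
        self.ring = [(h, n) for h, n in self.ring if n != node]

    def get_node(self, key):
        if not self.ring:
            return None
        i = bisect(self.ring, (self._hash(key),))  # first point clockwise
        return self.ring[i % len(self.ring)][1]    # wrap around the ring

ring = ConsistentHash(["A", "B", "C", "D"])
print(ring.get_node("k1"))  # e.g. 'B'; which node depends on the hashes
```

When a node is removed, only the keys that mapped to its virtual nodes move to the next points clockwise, and because those points belong to many different real nodes, the load spreads out instead of avalanching onto one neighbor.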