HashMap and HashSet are two important members of the Java Collection Framework. HashMap is a common implementation class of the Map interface, and HashSet is a common implementation class of the Set interface. Although the interface specifications implemented by HashMap and HashSet are different, their underlying Hash storage mechanisms are exactly the same, and even HashSet itself is implemented by HashMap.
Analyze the Hash storage mechanism through the source code of HashMap and HashSet
In fact, there are many similarities between HashSet and HashMap. For HashSet, the system uses the Hash algorithm to determine The storage location of the set elements, which ensures that the set elements can be stored and retrieved quickly; for HashMap, the system key-value is processed as a whole, and the system always calculates the storage location of the key-value based on the Hash algorithm, which ensures It can quickly store and retrieve Map key-value pairs.
Before introducing collection storage, one thing needs to be pointed out: although collections claim to store Java objects, they do not actually put Java objects into Set collections, but only retain the contents of these objects in Set collections. For reference. That is to say: a Java collection is actually a collection of multiple reference variables that point to actual Java objects.
Collections and references
Just like arrays of reference types, when we put a Java object into an array, we do not actually put the Java object into the array, we just put it into the array. The reference to the object is put into the array, and each array element is a reference variable.
Storage implementation of HashMap
When the program tries to put multiple key-values ??into HashMap, take the following code snippet as an example:
Java code
HashMaplt; String, Doublegt; map = new HashMaplt; String, Doublegt; ();
map.put("中文", 80.0);
map .put("Math", 89.0);
map.put("English", 78.2);
HashMap uses a so-called "Hash algorithm" to determine each element storage location.
When the program executes map.put("中文", 80.0);, the system will call the hashCode() method of "中文" to get its hashCode value - every Java object has a hashCode() method , its hashCode value can be obtained through this method. After getting the hashCode value of this object, the system will determine the storage location of the element based on the hashCode value.
We can look at the source code of the put(K key, V value) method of the HashMap class:
Java code
public V put(K key, V value)
{
// If key is null, call the putForNullKey method for processing
if (key == null)
return putForNullKey(value);
// Calculate the Hash value according to the keyCode of the key
int hash = hash(key.hashCode());
// Search the index of the specified hash value in the corresponding table
int i = indexFor(hash, table.length);
// If the Entry at index i is not null, loop through Continuously traverse the next element of e element
for (Entrylt; K, Vgt; e = table[i]; e != null; e = e.next)
{
Object k;
// Find the specified key that is equal to the key that needs to be put in (the hash value is the same
// Return true through equals comparison) p>
if (e.hash == hash amp; amp; ((k = e.key) == key
|| key.equals(k)))
{
V oldValue = e.value;
e.value = value;
e.recordAccess(this);
return oldValue;
}
}
// If the Entry at index i is null, it means there is no Entry here
modCount;
// Add key and value to i index
addEntry(hash, key, value, i);
return null;
}
The above program uses an important internal interface: Map.Entry. Each Map.Entry is actually a key-value pair. As can be seen from the above program: when the system decides to store the key-value pairs in the HashMap, it does not consider the value in the Entry at all, but only calculates and determines the storage location of each Entry based on the key. This also illustrates the previous conclusion: we can completely regard the value in the Map collection as an accessory to the key. After the system determines the storage location of the key, the value can be stored there.
The above method provides a method to calculate the Hash code based on the return value of hashCode(): hash(). This method is a pure mathematical calculation, and the method is as follows:
Java Code
static int hash(int h)
{
h ^= (h gt; gt; gt; 20) ^ (h gt; gt; gt; 12);
return h ^ (h gt; gt; gt; 7) ^ (h gt; gt; gt; 4);
}
For any given object, as long as its hashCode() return value is the same, the Hash code value calculated by the program calling the hash(int h) method is always the same. Next, the program will call the indexFor(int h, int length) method to calculate which index of the table array the object should be saved at. The code of the indexFor(int h, int length) method is as follows:
Java code
static int indexFor(int h, int length)
{
return h amp; (length-1);
}
This method is very clever, it is always obtained by h amp; (table.length -1) The storage location of the object - and the length of the underlying array of HashMap is always 2 to the nth power. Please refer to the introduction of the HashMap constructor later.
When length is always a multiple of 2, h amp; (length-1) will be a very clever design: assuming h=5, length=16, then h amp; length - 1 will get 5; if h=6, length=16, then h & length - 1 will get 6...if h=15, length=16, then h & length - 1 will get 15; but when h=16, When length=16, then h amp; length - 1 will get 0; when h=17, when length=16, then h amp; length - 1 will get 1... This ensures that the calculated index value is always Located within the index of the table array.
As can be seen from the source code of the put method above, when the program attempts to put a key-value pair into a HashMap, the program first determines the storage location of the Entry based on the hashCode() return value of the key. : If the hashCode() return values ??of two Entry keys are the same, then their storage locations are the same. If the keys of these two Entries return true through equals comparison, the value of the newly added Entry will overwrite the value of the original Entry in the collection, but the key will not be overwritten. If the keys of these two Entries return false through equals comparison, the newly added Entry will form an Entry chain with the original Entry in the collection, and the newly added Entry is located at the head of the Entry chain - continue to see the description of the addEntry() method for specific instructions. .
When a key-value pair is added to a HashMap, the storage location of the key-value pair (that is, the Entry object) is determined by the hashCode() return value of its key. When the hashCode() return value of the key of two Entry objects is the same, the key will use eqauls() to compare the values ??to determine whether to adopt the overwriting behavior (return true) or generate an Entry chain (return false).
The above program also calls addEntry(hash, key, value, i); code, where addEntry is a package access method provided by HashMap. This method is only used to add a key-value pair. . The following is the code of this method:
Java code
void addEntry(int hash, K key, V value, int bucketIndex)
{
// Get the Entry at the specified bucketIndex index
Entrylt; K, Vgt; e = table[bucketIndex]; // ①
// Put the newly created Entry in Enter the bucketIndex index, and let the new Entry point to the original Entry
table[bucketIndex] = new Entrylt; K, Vgt; (hash, key, value, e);
// If the number of key-value pairs in the Map exceeds the limit
if (size gt; = threshold)
// Expand the length of the table object to 2 times.
resize(2 * table.length); // ②
}
The code of the above method is very simple, but it contains a very elegant design : The system always puts the newly added Entry object into the bucketIndex index of the table array - if there is already an Entry object at the bucketIndex index, then the newly added Entry object points to the original Entry object (generating an Entry chain). If there is no Entry object at the bucketIndex index, that is, the e variable in the code No. 1 of the above program is null, that is, the newly placed Entry object points to null, that is, no Entry chain is generated.
JDK source code
A src.zip compressed file can be found in the JDK installation directory, which contains all source files of the Java basic class library. As long as the reader is interested in learning, he or she can open this compressed file at any time to read the source code of the Java class library, which is very helpful to improve the reader's programming ability. It should be pointed out that the source code contained in src.zip does not contain Chinese comments like the above. These comments were added by the author myself.
Performance options of Hash algorithm
As can be seen from the above code, when the same bucket stores the Entry chain, the newly placed Entry is always located in the bucket, and the earliest The Entry placed in the bucket is at the end of the Entry chain.
There are two other variables in the above program:
* size: This variable stores the number of key-value pairs contained in the HashMap.
* threshold: This variable contains the limit of key-value pairs that HashMap can accommodate. Its value is equal to the capacity of HashMap multiplied by the load factor.
As can be seen from code ② in the above program, when size >= threshold, HashMap will automatically call the resize method to expand the capacity of HashMap. Each time it is expanded, the capacity of HashMap is doubled.
The table used in the above program is actually an ordinary array. Each array has a fixed length. The length of this array is the capacity of HashMap. HashMap contains the following constructors:
* HashMap(): Construct a HashMap with an initial capacity of 16 and a load factor of 0.75.
* HashMap(int initialCapacity): Construct a HashMap with an initial capacity of initialCapacity and a load factor of 0.75.
* HashMap(int initialCapacity, float loadFactor): Create a HashMap with the specified initial capacity and specified load factor.
When creating a HashMap, the system will automatically create a table array to save the Entry in the HashMap. The following is the code of a constructor in the HashMap:
Java code
//Create a HashMap with specified initial capacity and load factor
public HashMap(int initialCapacity, float loadFactor)
{
// The initial capacity cannot be Is a negative number
if (initialCapacity lt; 0)
throw new IllegalArgumentException(
"Illegal initial capacity: "
initialCapacity);
// If the initial capacity is greater than the maximum capacity, let the capacity be displayed
if (initialCapacity gt; MAXIMUM_CAPACITY)
initialCapacity = MAXIMUM_CAPACITY;
// The load factor must be greater than 0
if (loadFactor lt;= 0 || Float.isNaN(loadFactor))
throw new IllegalArgumentException(
loadFactor);
// Calculate the smallest 2 to the nth power value that is greater than initialCapacity.
int capacity = 1;
while (capacity lt; initialCapacity)
capacity lt; lt; = 1;
this. loadFactor = loadFactor;
// Set the capacity limit equal to capacity * load factor
threshold = (int)(capacity * loadFactor);
// Initialize table Array
table = new Entry[capacity]; // ①
init();
}
Bold font in the above code The subcode contains a concise code implementation: find the smallest nth value of 2 that is greater than initialCapacity, and use it as the actual capacity of the HashMap (stored by the capacity variable). For example, given the initialCapacity is 10, the actual capacity of the HashMap is 16.
You can see in the code of program No. 1: The essence of table is an array, an array with a length of capacity.
For HashMap and its subclasses, they use the Hash algorithm to determine the storage location of elements in the collection. When the system starts to initialize the HashMap, the system will create an Entry array with a length of capacity. The location where elements can be stored in this array is called a "bucket". Each bucket has its designated index, and the system can use it according to its index. Quickly access elements stored in the bucket.
At any time, each "bucket" of HashMap only stores one element (that is, one Entry), because the Entry object can contain a reference variable (that is, the last parameter of the Entry constructor) for Points to the next Entry, so the possible situation is: there is only one Entry in the HashMap bucket, but this Entry points to another Entry - this forms an Entry chain. As shown in Figure 1:
Figure 1. Storage diagram of HashMap
Reading implementation of HashMap
When the Entry stored in each bucket of HashMap is just Single Entry - that is, when the Entry chain is not generated through a pointer, the HashMap at this time has the best performance: when the program retrieves the corresponding value through the key, the system only needs to first calculate the hashCode() return value of the key, and then based on the key The hashCode return value finds the index of the key in the table array, then takes out the Entry at the index, and finally returns the value corresponding to the key.
Look at the get(K key) method code of the HashMap class:
Java code
public V get(Object key)
{
// If key is null, call getForNullKey to get the corresponding value
if (key == null)
return getForNullKey();
// According to this The hashCode value of key calculates its hash code
int hash = hash(key.hashCode());
// Directly take out the value at the specified index in the table array,
p>for (Entrylt; K, Vgt; e = table[indexFor(hash, table.length)];
e != null;
// Search for the Entry The next Entr of the chain
e = e.next) // ①
{
Object k;
// If The key of this Entry is the same as the searched key
if (e.hash == hash amp; amp; ((k = e.key) == key
|| key. equals(k)))
return e.value;
}
return null;
}
As can be seen from the above code, if there is only one Entry in each bucket of HashMap, HashMap can quickly retrieve the Entry in the bucket based on the index; in the event of a "Hash conflict", the Entry stored in a single bucket It is not an Entry, but an Entry chain. The system can only traverse each Entry in order until it finds the Entry it wants to search - if the Entry to be searched happens to be at the end of the Entry chain (the Entry is the earliest one placed in the bucket), the system must loop to the end to find the element.
To sum up, HashMap treats key-value as a whole at the bottom layer, and this whole is an Entry object. The bottom layer of HashMap uses an Entry[] array to save all key-value pairs. When an Entry object needs to be stored, its storage location will be determined based on the Hash algorithm; when an Entry needs to be retrieved, its storage will also be found based on the Hash algorithm. Position, directly take out the Entry. It can be seen that the reason why HashMap can quickly store and retrieve the Entries it contains is completely similar to what our mothers taught us since childhood in real life: different things should be placed in different locations so that they can be found quickly when needed.
When creating a HashMap, there is a default load factor (load factor), and its default value is 0.75. This is a trade-off between time and space costs: increasing the load factor can reduce the size of the Hash table (that is, that Entry array), but it will increase the time cost of querying data, and query is the most frequent operation (both the get() and put() methods of HashMap use query); reducing the load factor will Improve the performance of data query, but will increase the memory space occupied by the Hash table.
After mastering the above knowledge, we can appropriately adjust the value of load factor according to actual needs when creating a HashMap; if the program is more concerned about space overhead and memory is tight, the load factor can be appropriately increased; if the program You are more concerned about time overhead. If the memory is relatively ample, you can appropriately reduce the load factor. Normally, programmers do not need to change the value of the load factor.
If you know from the beginning that HashMap will save multiple key-value pairs, you can use a larger initialization capacity when creating it. If the number of Entries in the HashMap will never exceed the limit capacity (capacity * load factor ), HashMap does not need to call the resize() method to reallocate the table array, thus ensuring better performance. Of course, setting the initial capacity too high at the beginning may waste space (the system needs to create an Entry array with a length of capacity), so the initial capacity setting also needs to be treated carefully when creating a HashMap.