Saturday, November 12, 2011

Java HashMap implementation, load factor, synchronization etc.,

HashMap is one of the data structures used very commonly in Java & I've always had my curiosity on what exactly goes behind the implementation of the same. On the surface, it looks fairly straightforward. It has to be using some simple hashing technique (of course, with collision resolution). But what bothered me was where exactly do the keys reside! It should have occurred even if I had put in a little more thought about how the collision resolution can be done. Basically, the values with the same hash code will end up in the same bucket & they have to be accommodated using a linear linked list. If this is the case, how do we retrieve a value for a given key (among the possible candidates within that bucket)? Simple, just store the key along with the value so that you can compare & return the value for the correct key. It also helped my confusion on where do the keys reside so that an iterator kind of function can pick all the keys present in the hash map.

Another important thing to note about HashMap (or Collections in general) is about their synchronization. The 'Collections' framework offers a bunch of static methods that will return a synchronized versions of the collection passed to them. But to what extent does this synchronization help? The doc says that it just guarantees individual operations to be synchronized & any complex operations using them should be synchronized manually. Note that the internal synchronization uses the object itself as the lock & so, we need to use the same as well.

An interesting point here is, if we don't use the synchronized version of the collection, what are the side effects? The hint again lies in the collision resolution. Assume that when we are about to do a put, the load factor limit of the hash map is reached. When this happens, Java tries to relocate the hash map to another location with twice the size of the buckets currently used. Also, note that the linear linked list used for collision resolution is actually inserting the elements at the front of the list rather than the end (so as to avoid tail traversing for insertions). Because of this, when the rehashing at the new location happens, the elements in the linked list are now inserted in the opposite order. So, given all these, when two threads try to do the relocation of the hash map, it can potentially lead to an infinite loop (basically one thread can change the next pointer of an element in the list to point to the previous element & the other thread changes the next pointer of the previous element to point to the element changed by the first thread). To avoid this tricky problem, synchronization is definitely needed on the collection if multiple threads are manipulating it.

No comments:

Post a Comment