IC 312 Fall 2022 / Notes



Unit 7: Hash tables

1 Hash Tables

Hashtables are one of the simplest and most effective ideas in computer science. They show up absolutely everywhere, because they work SO well.

Suppose our Map was going to receive 100 pieces of data, and the keys would be the integers 0-99. How would you store them? You’d just put them in an array, right? The runtime of insert, get, and remove would all be O(1). That’d be wonderful!
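To make that concrete, here is a minimal sketch of that direct-array idea in Java (the class name and the use of String values are just for illustration):

    // Keys are the integers 0-99, so the key itself is the array index.
    public class DirectArrayMap {
        private final String[] values = new String[100];

        public void insert(int key, String value) { values[key] = value; } // O(1)
        public String get(int key)                { return values[key]; }  // O(1)
        public void remove(int key)               { values[key] = null; }  // O(1)
    }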

Of course, this is unrealistic for a couple reasons. First, we don’t always know how many pieces of data we have coming in, and we don’t always have integral keys. And, even if we do have integral keys, odds are they’re not all small, positive numbers (large numbers would require too large an array to be worth it).

An O(1) insert, get, and remove is quite the prize, so let’s not give up so quickly. Let’s look at these problems one at a time.

  • My Keys Aren’t Integers!

    Well… make them integers. Consider Strings, for example. We know that a char is actually just an integer. So, maybe, just add up the characters in the string? That gives us an int! Job done!

    We can imagine coming up with a similar function for any type of key. This function, unsurprisingly, is called a hash function, and it takes in a key and returns an int.

  • My ints Are Really Big (or are negative)!

    Decide on a size for your array. Then, using absolute value and the modulo operator (%), you can make your ints as small and positive as you need.

    To be specific, if \(m\) is the size of your hashtable array, you can take any integer \(x\) and compute \({\rm abs}(x) \bmod m\). That will always give you an integer between 0 and \(m-1\): exactly the range of valid array indices! (There’s a small sketch of this hash-then-compress pipeline right after this list.)

  • What if two keys hash to the same integer?

    Well, yes, that is a problem, because it means that more than one thing is supposed to be stored in one slot in the array. This situation is called a collision, and it’s going to be the trickiest part (still not that tricky) about making a good hashtable. There are two parts: first you want to choose a good hash function to avoid having very many collisions, and then you set up your hashtable so that it can handle any collisions that do happen.
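Here is a minimal sketch of both steps above in Java: summing the characters of a String to get an int, then using abs and % to squeeze that int into a valid array index. The class and method names are made up for illustration.

    // A toy hash-and-compress pipeline for String keys.
    public class ToyHash {
        // Step 1: a (not very good) hash function: just add up the characters.
        static int naiveHash(String key) {
            int sum = 0;
            for (int i = 0; i < key.length(); i++) {
                sum += key.charAt(i);     // a char is really just a small integer
            }
            return sum;
        }

        // Step 2: compress any int into the range [0, m-1] of valid array indices.
        // (Careful in real code: Math.abs(Integer.MIN_VALUE) is still negative.)
        static int compress(int hash, int m) {
            return Math.abs(hash) % m;
        }

        // e.g. compress(naiveHash("hello"), 13) gives an index between 0 and 12.
    }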

2 What Makes a Good Hash Function?

If you took SY110, then you’ve seen hash functions before, in the form of cryptographic hash functions like MD5 or SHA-1. Remember the Rubik’s Cube? You learned that you could go from a string to a number (maybe it was hexadecimal, but it was a number nonetheless), but it would be extremely difficult to go from the number to the string that resulted in that number. The words “hash function” just meant “a function that takes you from a string to a number;” the cryptographic part is what put this extra one-way-function requirement on us so that it can’t be reversed.

In data structures, a hash function still has to do the job of going from any type of key to an integer, and the hash function will work best if it “scrambles” up the key in some way. But we don’t care about our hash functions being one-way: there’s absolutely no harm in someone being able to predict which key might hash to some integer.

So is any hash function as good as any other hash function? Well, no. First of all, a hash function has to be deterministic, which is to say, every time we call our hash function on the same key, it has to return the same number. Second, we want our hash function to minimize collisions. Suppose a hash function, given a random String, is more likely to return numbers between 10 and 20 than any other values. That would be a bad hash function: there would be a huge number of collisions at indices 10 through 20, while other parts of the hash table sat underused!

Writing good hash functions is an active area of research for computer scientists and number theorists, so we won’t become experts at doing this. But there are a couple of general principles that are used to minimize collisions:

  • The hash function should produce any integer in its range, with equal probability. (Like we just said.)

  • The hash function should depend somehow on the entire key. For example, if you are hashing a string, you had better make it depend on every character in the string!

  • The hash function should be somehow “strange”, not conforming to any patterns that might occur in the data itself. For example, if you are storing prices at a convenience store, many of them will end in 99 cents. If you are storing the displayed names of USNA students, many of them will start with MIDN. A great hash function can take in a very non-random set of keys and produce a bunch of integer hashes that look totally random. This point is closely related to the first bullet about equal probability values.

Java’s Object class defines a hashCode() method, but it’s not very good, because at the Object class level, we don’t know anything about the object at all. This means it is typically based on something like the address the object is stored at. That’s problematic. Suppose we have stored a String “hello” in our hashtable, and now want to find it by calling get("hello"). The String we stored and the String we’re searching with are potentially two different objects at different addresses, so they might get different hash codes, and this get wouldn’t work.

Fortunately, for Strings, Java has fixed this problem by overriding this method. Check out the Javadoc! If you’re implementing a hashtable for Strings, you should feel free to use this. Another cool hashCode() method to check out is the one for Java’s List interface. Yes, lists can be keys in a map!
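To see what a better string hash looks like, here is a sketch in the style of Java’s String.hashCode(), which is documented to compute \(s[0]\cdot 31^{n-1} + s[1]\cdot 31^{n-2} + \cdots + s[n-1]\). Every character contributes, and position matters, so "ab" and "ba" hash differently (unlike the add-up-the-characters idea from earlier). The class name here is just illustrative.

    // A polynomial string hash, in the style of java.lang.String's hashCode().
    public class PolyHash {
        static int hash(String key) {
            int h = 0;
            for (int i = 0; i < key.length(); i++) {
                h = 31 * h + key.charAt(i);  // position matters; integer overflow is fine here
            }
            return h;
        }
    }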

3 Handling Collisions

Despite our best attempts, our hash function has produced a collision: two different keys, after being hashed, absolute-value’d, and modulo’d, have landed at the same array index. Even with the very best hash function, this is still going to happen unless we make the array ridiculously huge, as a consequence of the Birthday Problem. Just as there are lots and lots of ways to make self-balancing trees, there are lots and lots of ways to resolve collisions. We’re going to talk about two of the most common: Open Addressing and Separate Chaining.

  • Open Addressing

    Array full at the index where you’d like to insert something? Move to the next one. A get, then, starts at the index the key hashes to and moves along linearly until it either finds the key or hits an empty slot (which means the key isn’t there). That’s called “linear probing”, and it’s fast and simple, but it can cause long clusters of filled slots to build up in the hash table. Many other, more sophisticated approaches to open addressing have been proposed which eliminate the clustering issue. (There’s a small sketch of linear probing after this list.)

    No matter how sophisticated, though, the problem remains: if the array isn’t big enough, the array fills up, and now insert() and get() become O(n) operations. With a big enough array, with enough empty space, though, this works well.

  • Separate Chaining

    In this approach, we have an array of Linked Lists (or Trees, or something). Upon hashing to a certain index, you add the element to the data structure holding all the keys that have hashed to that same index. get() just has to find the key in that data structure.

    Linked lists work really well for separate chaining, because your insert() operation can just insert the new element at the front of the list. That guarantees that insert(), at least, is always \(O(1)\) time. (A sketch of separate chaining follows this list.)
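Here is a minimal sketch of separate chaining, where each array slot holds the front of a singly linked list. The class and method names are my own, and resizing and remove() are left out; notice that insert() just pushes onto the front of the chain, which is what makes it \(O(1)\).

    // Separate chaining: an array of linked-list "chains", one per index.
    public class ChainingTable {
        private static class Node {
            String key, value;
            Node next;
            Node(String key, String value, Node next) {
                this.key = key; this.value = value; this.next = next;
            }
        }

        private Node[] chains;                  // not final: see resizing, below

        public ChainingTable(int m) { chains = new Node[m]; }

        private int index(String key) {
            return Math.abs(key.hashCode()) % chains.length;
        }

        public void insert(String key, String value) {
            int i = index(key);
            chains[i] = new Node(key, value, chains[i]);   // push onto front: O(1)
        }

        public String get(String key) {
            for (Node cur = chains[index(key)]; cur != null; cur = cur.next) {
                if (cur.key.equals(key)) return cur.value; // walk the chain at this index
            }
            return null;                                   // not found
        }
    }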
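Here is an equally minimal sketch of linear probing, the open-addressing strategy from the first bullet: if a slot is taken by some other key, just step forward (wrapping around) until you find the key or an empty slot. Again the names are mine, remove() and resizing are omitted, and it quietly assumes the array never completely fills up.

    // Linear probing: one flat pair of arrays, no chains.
    public class ProbingTable {
        private final String[] keys;
        private final String[] vals;

        public ProbingTable(int m) {
            keys = new String[m];
            vals = new String[m];
        }

        private int index(String key) {
            return Math.abs(key.hashCode()) % keys.length;
        }

        public void insert(String key, String value) {
            int i = index(key);
            while (keys[i] != null && !keys[i].equals(key)) {
                i = (i + 1) % keys.length;    // slot taken by someone else: probe the next one
            }
            keys[i] = key;                    // empty slot, or updating an existing key
            vals[i] = value;
        }

        public String get(String key) {
            int i = index(key);
            while (keys[i] != null) {         // an empty slot means the key isn't here
                if (keys[i].equals(key)) return vals[i];
                i = (i + 1) % keys.length;
            }
            return null;
        }
    }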

4 Runtime of Map functions with Hashtables

Assume there are \(n\) keys, and the hashtable array size is \(m\). Also, assume that we have a good hash function, such that inserted elements are likely to be evenly distributed across the array. This means that on average, there are \(\frac{n}{m}\) elements that have been hashed to each index; this number is known as the load factor. In separate chaining, this means there is an \(\frac{n}{m}\)-long linked list hanging from each index; with linear probing, there is an \(O(\frac{n}{m})\)-long cluster of adjacent, filled indices.

Let’s first consider separate chaining, and think about insert(k,v). You calculate the index, then add your key-value pair to the front of the linked list at that index. None of that depends on the number of elements in the table! This makes it \(O(1)\). However, get(k) requires you to calculate the index, then search through the \(\frac{n}{m}\)-long linked list hanging from that index. If \(m\) stays fixed while \(n\) grows, that’s \(O(n)\), which is a bummer.

To avoid this, we assign our hashtable a maximum-allowed load factor (in Java’s HashMap, for example, it’s .75). When \(n\) increases so that the load factor exceeds this value, the array is doubled, and everything is rehashed into the new array (see why the rehashing is necessary?). This keeps it so that the load factor is always less than some constant number, and does not increase with \(n\) (or at least, not for long). The doubling of the array means that this happens less and less often as \(n\) increases. This makes the insert(k,v) into an amortized \(O(1)\) operation. get(k), meanwhile, turns into an average-time \(O(1)\) operation, as the load factor never increases beyond some number. \(O(1)\) and \(O(1)\)! That’s really good!
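Here is what that threshold-and-rehash step might look like, grafted onto the ChainingTable sketch from above (insert() replaces the earlier version): keep a count n, and when n/m passes the maximum load factor (0.75 here, matching Java’s HashMap default), double the array and re-insert everything. The helper names are mine.

    // Added inside the ChainingTable sketch: track n, resize when n/m gets too big.
    private int n = 0;                                  // number of key-value pairs stored
    private static final double MAX_LOAD = 0.75;

    public void insert(String key, String value) {
        int i = index(key);
        chains[i] = new Node(key, value, chains[i]);    // still an O(1) push...
        n++;
        if ((double) n / chains.length > MAX_LOAD) {
            rehash(2 * chains.length);                  // ...with an occasional rebuild (amortized O(1))
        }
    }

    private void rehash(int newSize) {
        Node[] old = chains;
        chains = new Node[newSize];                     // bigger array, so m changes...
        for (Node chain : old) {
            for (Node cur = chain; cur != null; cur = cur.next) {
                int i = index(cur.key);                 // ...which is why every key must be re-hashed
                chains[i] = new Node(cur.key, cur.value, chains[i]);
            }
        }
    }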

Now let’s consider linear probing. Here, the logic is actually exactly the same, except that rather than search through a linked list, you search through contiguous filled indices. Again, the insert is amortized \(O(1)\), and get is \(O(1)\)!

Holy crap! Everything is some flavor of O(1)! Why the hell did we just spend this long on Trees? Two reasons. First, in the worst case, everything hashes to the same index, and get and remove are actually O(n). Balanced trees have superior worst-case performance, which is really important for some kinds of applications.

Second, the exact hash functions that are needed to really “spread out” the keys and make the average-case \(O(1)\) lookup a reality are a subject of ongoing research and debate (outside the scope of this course). For small but “reasonable” values of \(n\), it may be that a lightweight \(O(\log n)\) tree lookup is actually faster than computing an expensive (but \(O(1)\)) hash function. But also, there are other ADTs…

5 Ordered Map

Ordered Maps are just an extension of Maps, with a few extra methods (sketched as a Java interface after this list):

  • first() - returns the element with the smallest key.

  • last() - returns the element with the largest key.

  • next(k) - returns the element with the smallest key larger than k.

  • prev(k) - returns the element with the largest key smaller than k.
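As a Java interface, the Ordered Map ADT might be sketched like this (the interface and type-parameter names are mine, not from any particular library):

    // An Ordered Map: everything a Map has, plus order-based queries on keys.
    public interface OrderedMap<K extends Comparable<K>, V> {
        V    get(K key);
        void insert(K key, V value);

        V first();          // value associated with the smallest key
        V last();           // value associated with the largest key
        V next(K k);        // value for the smallest key larger than k
        V prev(K k);        // value for the largest key smaller than k
    }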

Let’s first look at the naive implementations:

                   Unsorted Linked List   Sorted Array
get(key)           O(n)                   O(log n)
insert(key,value)  O(1)                   O(n)
first()            O(n)                   O(1)
last()             O(n)                   O(1)
next(k)            O(n)                   O(log n)
prev(k)            O(n)                   O(log n)

Caution: if you’re thinking that first and last could be \(O(1)\) for a linked list if we just kept track of the smallest and largest keys as we add them, think again! The problem is that you can also have removals (by setting a key’s value to null), and once you remove the largest thing, you’re stuck: finding the new largest key takes \(O(n)\) all over again.

6 Trees

Obviously, we’re going to assume that we’re using some flavor of our self-balancing trees: AVL, 2-3-4, Red-Black, or something else. These different flavors have slightly different heights, but these heights are all O(log n).

So, insert, get, and remove are easy - they’re all O(log n). Let’s consider our new methods.

Hopefully, you can write first() - it’s just findMin(root). How long does this take? In the worst case, O(log n). last() is the same thing, just findMax.

To calculate next(k) (or prev(k)), we first have to find k, which is O(log n). If the node containing k has a right child, next() is found by findMin(n.getRight()). If it doesn’t, it’s the node where we most recently went left on the way down to k. In either case, that’s an additional O(log n) in the worst case, making next and prev also O(log n).
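Here is a sketch of first() and next(k) for a plain BST with int keys (ignoring the balancing bookkeeping). Rather than finding k and then walking back up, this version folds the two O(log n) steps into a single pass down the tree, remembering the most recent node where it went left. The class and names are my own.

    // A bare-bones (unbalanced) BST with int keys; values omitted to keep it short.
    public class BST {
        static class Node {
            int key;
            Node left, right;
        }

        // first(): keep going left. O(height), so O(log n) in a balanced tree.
        // (Assumes a non-empty tree.)
        static Node findMin(Node root) {
            Node cur = root;
            while (cur.left != null) cur = cur.left;
            return cur;
        }

        // next(k): the node with the smallest key strictly larger than k.
        static Node next(Node root, int k) {
            Node lastLeft = null;            // most recent node where we went left
            Node cur = root;
            while (cur != null) {
                if (k < cur.key) {
                    lastLeft = cur;          // cur.key > k: a candidate; try to find a smaller one
                    cur = cur.left;
                } else {
                    cur = cur.right;         // cur.key <= k: the successor must be to the right
                }
            }
            return lastLeft;                 // null means no key in the tree is larger than k
        }
    }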

7 Hash Tables

Remember that if we have a well-built hash table, on average, we have O(1) performance in insert, get, and remove. Worst-case, we have O(n) for get and remove, while insert can still be O(1).

How long does first() or last() take? Unfortunately, because we have a good hash function that scatters hashes all over our array, we have NO clues as to where the smallest or largest keys might be. So, there’s no better way to do this than just iterate through and look at everything. O(n).

Similarly, for next(k) and prev(k), we have no guarantee that two keys that are close in value will be hashed close to each other. So, again, we have to look at everything: O(n).

8 Performance Overview

                   Balanced Tree   Hash Table
get(key)           O(log n)        O(1) average, O(n) worst-case
insert(key,value)  O(log n)        amortized O(1)
first()            O(log n)        O(n)
last()             O(log n)        O(n)
next(k)            O(log n)        O(n)
prev(k)            O(log n)        O(n)