Unit 8: Sets
1 Set ADT
The Set ADT implements the mathematical notion of a set, which you should be familiar with from discrete math. There is a universe of items, and groupings of those items into named entities called sets. Since there are no duplicates in the universe, there can be no duplicate entries in a set. In the set of fruits, apple is either in the set or not. In the set of even integers, 2 is either in the set or not.
There are six core operations on sets:

- add(e): Adds the element e to the set if it isn't already in there.
- remove(e): Drops element e from the set if it's in there. If e is not in the set, nothing happens.
- contains(e): Returns true if the element is in the set, false if it is not.
- size(): Returns the number of unique items in the set.
- union(s1, s2): Creates a new set with every item present in s1 or in s2.
- intersection(s1, s2): Creates a new set with every item present in both s1 and s2.
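These rules can be sanity-checked with Java's built-in HashSet (a small demonstration, not part of the ADT definition itself):

```java
import java.util.HashSet;
import java.util.Set;

public class SetSemantics {
    public static void main(String[] args) {
        // No duplicates: adding "apple" twice leaves exactly one copy.
        Set<String> fruits = new HashSet<>();
        fruits.add("apple");
        fruits.add("apple");
        assert fruits.size() == 1;
        assert fruits.contains("apple");

        // Removing an absent element is a no-op, not an error.
        fruits.remove("banana");
        assert fruits.size() == 1;
    }
}
```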
2 Relationship to Map
You should notice a strong similarity to an ADT we have already studied extensively: Map. The first four set operations can easily be accomplished with a Map, where the keys are the set elements, and the values are booleans (or just nulls).
- Set.add(e): call Map.put(e, true)
- Set.remove(e): call Map.remove(e)
- Set.contains(e): call Map.containsKey(e)
- Set.size(): call Map.size()
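As a sketch, here is how those four delegations could look using a java.util.HashMap underneath (the class name MapBackedSet is made up for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: a Set whose first four operations delegate to a Map,
// with the set elements as keys and throwaway boolean values.
public class MapBackedSet<E> {
    private final Map<E, Boolean> map = new HashMap<>();

    public void add(E e)         { map.put(e, true); }       // Set.add -> Map.put
    public void remove(E e)      { map.remove(e); }          // Set.remove -> Map.remove
    public boolean contains(E e) { return map.containsKey(e); } // Set.contains -> Map.containsKey
    public int size()            { return map.size(); }      // Set.size -> Map.size
}
```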
So, using the data structures we already know about for maps, we can implement these set operations too with the following costs:
 | add(e) | remove(e) | contains(e) | size() |
---|---|---|---|---|
Unsorted linked list | \(O(1)\) | \(O(n)\) | \(O(n)\) | \(O(n^2)\) |
Sorted array | \(O(n)\) | \(O(n)\) | \(O(\log n)\) | \(O(1)\) |
Balanced search tree | \(O(\log n)\) | \(O(\log n)\) | \(O(\log n)\) | \(O(1)\) |
Hash table (open addr.) | \(O(1)\) avg | \(O(1)\) avg | \(O(1)\) avg | \(O(1)\) |
A few details here might not be obvious from the table above:

- size() for an unsorted linked list: Here, we have optimized the cost of add(e) by just inserting the new element at the front. But then, the same element may have been added multiple times and we wouldn't know it. Therefore the size() method has to remove duplicates from the list before returning the list size, and that operation takes \(O(n^2)\) time.
- remove(e) for an open addressing hash table: As we discussed briefly in the previous unit, we can't just remove the item completely, because that may leave a "hole" in the hash table that will make other lookup operations incorrect. So when removing an item, it's important to leave a special "removed" value in its place (not the same as a null!).
In terms of Big-O this is all correct, but in practice we don’t just use a Map to implement a Set because it’s wasteful memory-wise to store all the values when they aren’t needed. So, for example, if an AVL tree is used for a Set, each node will just store the key and not the value; or if an array is used, each entry in the array will just be a set element and not a key-value pair.
3 Union and Intersection
This leaves us with two extremely common and important Set methods, which don’t have an equivalent in the Map ADT: union and intersection.
Until now, it’s been quite trivial to talk about the meaning of \(n\). Here, though, to analyze an implementation’s run time, there are two Sets as input. To account for this, let’s define \(n\) as the size of the first Set, and \(m\) as the size of the second. This will allow us to analyze the Big-O of union/intersection as a function of both \(n\) and \(m\).
3.1 Generic Union
Here’s a generic way to compute the union of two sets, using only the other operations we know so far:
union = new Set();
for (Key k : set1) {
    union.add(k);
}
for (Key k : set2) {
    union.add(k);
}
To analyze the total cost, we see there are \(n+m\) calls to add(e), on a set whose size is at most \(n+m\). Plugging into the table above gives us the following costs depending on the data structure:
Unsorted LL | Sorted Arr | Balanced Tree | Hashtable |
---|---|---|---|
\(O(n+m)\) | \(O((n+m)^2)\) | \(O((n+m)\log(n+m))\) | \(O(n+m)\) avg |
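Using java.util.HashSet as the backing implementation, the generic union above might be written as:

```java
import java.util.HashSet;
import java.util.Set;

public class GenericUnion {
    // Generic union: add every element of both input sets to a new set.
    // With a hash table backing, this is O(n+m) on average.
    public static <E> Set<E> union(Set<E> set1, Set<E> set2) {
        Set<E> union = new HashSet<>();
        for (E k : set1) {
            union.add(k);
        }
        for (E k : set2) {
            union.add(k);  // duplicates are absorbed by the set
        }
        return union;
    }
}
```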
3.2 Generic Intersection
We can also implement the intersection operation using only the other set methods:
intersect = new Set();
for (Key k : set1) {
    if (set2.contains(k)) {
        intersect.add(k);
    }
}
Here we have just one loop, which runs \(n\) times, and each time through calls contains(e) on a size-\(m\) set as well as add(e) on a set of size at most \(\min(n,m)\).
Unlike with the generic union operation, we have an interesting choice here: should we choose set1 to be the smaller set, or set2? It will make the difference between either:

- If set1 is larger: checking contains(e) many times on a small set; or
- If set2 is larger: checking contains(e) few times on a large set.
Which approach will be faster? If you think about it, we can see that for every data structure, contains(e) is \(O(n)\) or faster. This means that we would rather call it fewer times on a larger set than the other way around.
So let's assume that set2 is larger, i.e., \(n \le m\). (If not, we could just swap the variable names without actually modifying the original sets, and get the same result. Mathematicians would say this is "without loss of generality".)
Unsorted LL | Sorted Arr | Balanced Tree | Hashtable |
---|---|---|---|
\(O(nm)\) | \(O(n\log m + n^2)\) | \(O(n\log m)\) | \(O(n)\) avg |
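A sketch of the generic intersection in Java, with the "without loss of generality" swap built in so the loop always runs over the smaller set:

```java
import java.util.HashSet;
import java.util.Set;

public class GenericIntersection {
    // Generic intersection: loop over one set, keep the elements
    // that the other set also contains.
    public static <E> Set<E> intersection(Set<E> set1, Set<E> set2) {
        // Swap the references (not the sets themselves) so that we
        // iterate over the smaller set and call contains on the larger.
        if (set1.size() > set2.size()) {
            Set<E> tmp = set1;
            set1 = set2;
            set2 = tmp;
        }
        Set<E> intersect = new HashSet<>();
        for (E k : set1) {
            if (set2.contains(k)) {
                intersect.add(k);
            }
        }
        return intersect;
    }
}
```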
4 Less Naive Ideas
In algorithm development, we refer to one of the first things you would think of as being “naive.” Sometimes the naive approach is the best one! Sometimes it isn’t. Those of you who take algorithms class will learn lots of strategies for identifying tricks that help speed up naive algorithms.
The above are all naive approaches. Is there anything we can do to improve them?
4.1 Unsorted linked list union
Generically, to get the union of two unsorted linked lists, we use two for loops to go through each list and add everything to a new linked list.
But if we are willing to alter the first list, and assuming we have both head and tail pointers stored in the list, then all we really need to do is attach the head of the second set to the tail of the first set, which is \(O(1)\).
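A minimal sketch of this destructive splice (the LinkedSet and Node names are hypothetical), assuming the list stores both head and tail pointers and tolerates duplicates until size() is called, as described earlier:

```java
// Hypothetical unsorted-linked-list set with head and tail pointers.
public class LinkedSet<E> {
    static class Node<T> {
        T data;
        Node<T> next;
        Node(T data) { this.data = data; }
    }

    Node<E> head, tail;
    int length;  // counts nodes, which may include duplicates

    // O(1) add: insert at the front without checking for duplicates.
    void add(E e) {
        Node<E> n = new Node<>(e);
        n.next = head;
        head = n;
        if (tail == null) {
            tail = n;
        }
        length++;
    }

    // O(1) destructive union: attach other's head to this list's tail.
    // The other set is emptied in the process.
    void unionInPlace(LinkedSet<E> other) {
        if (other.head == null) {
            return;  // nothing to splice in
        }
        if (head == null) {
            head = other.head;
        } else {
            tail.next = other.head;
        }
        tail = other.tail;
        length += other.length;
        other.head = null;
        other.tail = null;
        other.length = 0;
    }
}
```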
4.2 Sorted array merge
In the generic approaches above for sorted arrays, we get basically quadratic time for union or intersection. But if you think about it, this is not taking advantage of the fact that the arrays are both sorted! If we go through both arrays simultaneously and in order, we can actually get the intersection or the union quickly.
Here is the pseudocode for “going through both arrays simultaneously and in order”. This algorithm actually shows up in a few different contexts and is usually called merge. Imagine some fictional civilized highway where everyone takes turns fairly as two lanes come together.
i = 0
j = 0
while (i < set1.size() && j < set2.size()) {
    if (set1[i] == set2[j]) {
        // set1[i] is in the INTERSECTION (and UNION)
        i += 1;
        j += 1;
    } else if (set1[i] < set2[j]) {
        // set1[i] is in the UNION only
        i += 1;
    } else if (set1[i] > set2[j]) {
        // set2[j] is in the UNION only
        j += 1;
    }
}
if (i < set1.size()) {
    // all leftovers in set1[i...] are in the UNION only
} else if (j < set2.size()) {
    // all leftovers in set2[j...] are in the UNION only
}
How long does this take? Well, it’s a little difficult to see, but you should notice that at each step, either \(i\) or \(j\) increases, or both. Since they both start at 0 and go up to at most \(n\) and \(m\) respectively, the total number of iterations through the while loop is at most \(n+m\).
Notice furthermore that the elements of the union or intersection are also discovered in sorted order, so the new set gets sorted “for free” as we go.
This is great news! It means that union and intersection on sorted arrays both cost \(O(n+m)\), linear time.
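Following the merge pseudocode above, a runnable Java sketch (using int arrays for the inputs and an ArrayList for the output) might look like:

```java
import java.util.ArrayList;
import java.util.List;

public class SortedMerge {
    // Union of two sorted, duplicate-free arrays in one O(n+m) pass.
    // The output comes out in sorted order "for free".
    public static List<Integer> union(int[] set1, int[] set2) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < set1.length && j < set2.length) {
            if (set1[i] == set2[j]) {        // in both: take once
                out.add(set1[i]);
                i++;
                j++;
            } else if (set1[i] < set2[j]) {  // only in set1 (so far)
                out.add(set1[i]);
                i++;
            } else {                         // only in set2 (so far)
                out.add(set2[j]);
                j++;
            }
        }
        while (i < set1.length) out.add(set1[i++]);  // leftovers from set1
        while (j < set2.length) out.add(set2[j++]);  // leftovers from set2
        return out;
    }

    // Intersection with the same two-index walk: keep only the matches.
    public static List<Integer> intersection(int[] set1, int[] set2) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < set1.length && j < set2.length) {
            if (set1[i] == set2[j]) {
                out.add(set1[i]);
                i++;
                j++;
            } else if (set1[i] < set2[j]) {
                i++;
            } else {
                j++;
            }
        }
        return out;
    }
}
```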
4.3 Balanced search tree merge
We saw above that union and intersection on a balanced search tree set cost (respectively) \(O((n+m)\log(n+m))\) and \(O(n\log m)\) time. Can we do better?
The first thing to notice is that because balanced search trees can do a sorted in-order traversal in linear time, we can actually use the sorted array merging strategy above to get an array of the union (or intersection) in \(O(n+m)\) time.
The second thing to notice is that, if you have a sorted array of elements and want to make a new balanced search tree for the union or intersection set, you don’t have to insert them one at a time. Instead, you can just build a perfectly balanced tree from the sorted array you already have!
For example, to build an AVL tree from a sorted array, you start by inserting the median element as the root, then recursively create the left subtree from the first half of the array, and the right subtree from the second half of the array (not including the median, of course). This tree will be perfectly balanced already, so there’s no need to do any rotations!
The pseudocode looks like this:
Node buildTree(Element[] sortedArr) {
    return buildTreeHelper(sortedArr, 0, sortedArr.length);
}

Node buildTreeHelper(Element[] sortedArr, int start, int end) {
    if (start >= end) {
        // empty array; base case
        return null;
    }
    int median = (start + end) / 2;
    Node leftSubtree = buildTreeHelper(sortedArr, start, median);
    Node rightSubtree = buildTreeHelper(sortedArr, median+1, end);
    return new Node(sortedArr[median], leftSubtree, rightSubtree);
}
We can write a recurrence to analyze this algorithm for a size-\(n\) array:
\[T(n) = \begin{cases} 1,& n = 0 \\ 1 + 2T(\frac{n}{2}),& n \ge 1 \end{cases}\]
Solving this looks like
\[T(n) = 1 + 2 + 4 + 8 + \cdots + n < 2n,\]
which is \(O(n)\). Linear time! This means that both union and intersection on a balanced search tree set can be done as fast as they can be on sorted arrays.
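A runnable Java version of this build (with a hypothetical int-keyed Node class) could look like the following; the height helper lets you confirm the resulting tree is perfectly balanced:

```java
public class BuildTree {
    static class Node {
        int key;
        Node left, right;
        Node(int key, Node left, Node right) {
            this.key = key;
            this.left = left;
            this.right = right;
        }
    }

    // Build a perfectly balanced BST from a sorted array in O(n) time:
    // the median becomes the root, and each half becomes a subtree.
    static Node buildTree(int[] sortedArr) {
        return buildTreeHelper(sortedArr, 0, sortedArr.length);
    }

    static Node buildTreeHelper(int[] sortedArr, int start, int end) {
        if (start >= end) {
            return null;  // empty range; base case
        }
        int median = (start + end) / 2;
        Node leftSubtree = buildTreeHelper(sortedArr, start, median);
        Node rightSubtree = buildTreeHelper(sortedArr, median + 1, end);
        return new Node(sortedArr[median], leftSubtree, rightSubtree);
    }

    // Height in nodes: 0 for an empty tree.
    static int height(Node n) {
        return n == null ? 0 : 1 + Math.max(height(n.left), height(n.right));
    }
}
```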
5 Summary
Putting this all together, we get the following costs for our set operations using the best ways discussed above, assuming that \(n\le m\).
union | intersection | |
---|---|---|
Unsorted linked list | \(O(1)\) destructively | \(O(nm)\) |
Sorted array | \(O(n+m)\) | \(O(n+m)\) |
Balanced search tree | \(O(n+m)\) | \(O(n+m)\) |
Hash table (open addr.) | \(O(n+m)\) avg | \(O(n)\) avg |
In Java, there is a Set interface, but it doesn't have union or intersection methods. Instead, the equivalent to union is set1.addAll(set2), which modifies set1 so that it contains the union, and the equivalent to intersection is set1.retainAll(set2), which modifies set1 so that it contains the intersection. If you want to do these without destroying set1 in the process, you would just create a copy of set1 first by using the constructor.
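For example, non-destructive union and intersection using the copy constructor plus addAll/retainAll might look like this (the wrapper class name is made up for illustration):

```java
import java.util.HashSet;
import java.util.Set;

public class JavaSetOps {
    // Non-destructive union: copy s1 first, then add everything from s2.
    public static Set<Integer> union(Set<Integer> s1, Set<Integer> s2) {
        Set<Integer> result = new HashSet<>(s1);  // copy so s1 survives
        result.addAll(s2);
        return result;
    }

    // Non-destructive intersection: copy s1, keep only what s2 also has.
    public static Set<Integer> intersection(Set<Integer> s1, Set<Integer> s2) {
        Set<Integer> result = new HashSet<>(s1);
        result.retainAll(s2);
        return result;
    }
}
```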