This is the archived website of SY 301 from the Fall 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Unit 4: ADTs and LSQs

Credit: Gavin Taylor for the original version of these notes.

In this unit, we are going to learn about Abstract Data Types (ADTs) and our first three ADTs: Lists, Stacks, and Queues - i.e., LSQs. Besides being the first data structures we learn, these are also really important data structures that get used over and over again in many different computing problems. They will come up in future parts of this class as well as future classes that you take for your CS or IT degree.

Beginning of Class 10

1 Our first ADT: List

Abstract Data Types

Allegedly, this course is about Data Structures. In actuality, it is about Abstract Data Types (ADTs) and their implementation with Data Structures. An ADT is a description of an interface, and the data structures are how we make them work. For instance, last semester, we built a Queue. A Queue is an ADT; the linked list is the data structure.

Lists

Our first ADT is a List. A List is a container that holds data in some order, and can perform the following methods:

  1. get(i) - return the data at the i-th place in the list.
  2. set(i,e) - return the i-th piece of data, and also replace it with e.
  3. add(i,e) - insert e in the list at location i. All data with locations greater than i are shifted to a location one greater.
  4. remove(i) - remove and return the data at i. Shift all data at higher indices down.
  5. size() - return the number of elements in the list.

This is all an ADT is! Any way of making this work, no matter how stupid or slow, still means it is a list. However, implementing it with different data structures gives us different run times.

So how would we implement this? Perhaps, arrays or linked lists. Presumably, you could implement all four of those methods with both, and analyze the run times.

Iterators

Assume we have a list a_list, and we have the following code:

1
2
for i in range(n):
    print(a_list.get(i));

What is the runtime if our list is implemented with an array? \(O(n)\), right? What if it's implemented with a LL? Well, each get is O(n), and we've got n of them, so it's \(O(n^2)\). That's stupid! This should clearly be O(n).

We can get this to be linear-time by using a generator in Python (which is actually a special type of iterator), that allows us to keep track of the last element looked at and return the next one. In this way, we can jump from element to element, doing a \(O(1)\) update \(n\) times. We're not going to spend much time implementing these, but suffice to say they exist. Just so you'll believe me, an example of a generator for a Linked List is below:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
class Node:
    def __init__(self, data):
        self.data = data
        self.nextNode = None
 
class LinkedList:
    def __init__(self):
        self.head = None
        self.tail = None
 
    def __iter__(self):
        cur = self.head
        while cur is not None:
            yield cur.data # The "yield" is what makes this a generator function
            cur = cur.nextNode
 
    def add_front(self, data):
        new_head = Node(data)
        new_head.nextNode = self.head
        self.head = new_head
        if self.tail is None:
            self.tail = new_head
        self.head = new_head
    
    # ... other linked list stuff here ...
 
if __name__ == "__main__":
    # add 1..20 to a linked list and print it out (in reverse!)
    lst = LinkedList()
    for i in range(1, 21):
        lst.add_front(i)
    # this part will use the __iter__ method!
    for x in lst:
        print(x)

A note about size

You can pretty easily imagine how size() is implemented for an array-based list - you just get the length of the array. But for a linked list, we would have to iterate over the whole list, taking \(O(n)\), right?

Well, all that work could be easily avoided by storing one extra little piece of data in the Linked List class - namely, an int for the current size of the list. This value would be updated on each add or remove operation, and returned by each a call to size().

There is something subtle but important about why this is okay: An ADT implementation is allowed to store extra data besides whatever information the ADT itself specifies. In the case of a List, the ADT just specifies that you are storing an ordered sequence of elements. It says nothing about storing the size. But your data structure is allowed to store the size, in order to make the size() operation more efficient! The key is actually that the total storage is still the same in terms of big-O, since \(n+1 \in O(n)\).

So here's what we have so far:

ArrayLinked List
get(i)O(1)O(n)
set(i,e)O(1)O(n)
add(i,e)O(n)O(n)
remove(i)O(n)O(n)
for all i, get(i)O(n)O(n)
size()O(1)O(1)

Beginning of Class 11

A better Array implementation of List?

Remember what linked lists are faster at than arrays? That's right, adding to the beginning or the end of the list! With linked lists this is \(O(1)\) but arrays always take \(O(n)\) because the entire array has to be resized and copied. Well, there's not much we can do about adding to the beginning of an array, but it is possible to make adding to the end of an array-based list faster.

Let's use the scheme that every time we run out of room on our array, we double the size of the array. Will we have lots of unused indices? Yup. But, if our biggest number of elements in the list is n, the most memory we'll use is 2n. And, while that might hurt, at least that's still O(n). (Yup! We can (and do) use Big-O in reference to memory, too!)

So, OK, we only resize every time we actually fill this too-big array. Isn't the worst case still O(n)? Well, sort of. For the first time, our "number of operations as a function of n" function isn't smoothly increasing; instead, it's spiky. This is because almost all of the time, adding to the end is O(1). However, every now and then, it's O(n). See:

If you drew a line threw those little red "outlier" dots, it'd be a straight line; \(O(n)\). But almost all the dots are right there at the bottom, \(O(1)\), not increasing at all even with thousands of elements in the array.

What we'd like to say is that those \(O(n)\) dots don't ultimately matter, and the real cost is \(O(1)\). Amortized analysis is what lets us say this and still be mathematically correct.

Pretend you have a business. Every 10 years, you have to buy a $100,000 machine. You have two ways you could consider this; either, every ten years you lose massively, having to pay $100,000, or every year you put aside $10,000 to be used later. This is called amortizing the cost. We can do the same thing here.

Let's assume assigning one element to an array is 1 operation. Again, if we double the size of the array every time we need more space, starting with an array of size one, we would get the following series of costs when adding 17 elements to an initially-empty array:

1 2 3 1 5 1 1 1 9 1 1 1 1 1 1 1 17

The reason is that when the size reaches \(1, 2, 3, 5, 9, 17\), which is 1 more than each power of 2, we have to allocate a new array and copy everything over. So those are the "outlier", expensive additions. But all the other additions are cheap; they just involve copying a single element.

But we can amortize these expensive additions over the previous "cheap" ones! Check it out:

Size 1 2 3 5 9 17
Actual costs 1 1,2 1,2,3 1,2,3,1,5 1,2,3,1,5,1,1,1,9 1,2,3,1,5,1,1,1,9,1,1,1,1,1,1,1,17
Amortized 1 3,0 3,3,0 3,3,3,3,0 3,3,3,3,3,3,3,3,0 3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,0

Noticing the theme? No matter how big our stack becomes, the amortized cost of a single push never gets bigger than 3. So, the cost of a push, even with resizing the array, is O(1).

Another way of putting this is that the sum total cost of a series of \(n\) additions onto the end of an initially-empty array is \(O(n)\). In fact, if we take the same graph as above and plot the running sum instead of the individual cost of each operation, the fact that this line is \(O(n)\) becomes fairly obvious:

As you may already know, this is in fact exactly how Java's ArrayList class works. Using similar ideas they can also remove from the end of an ArrayList in \(O(1)\) amortized time, and guarantee that the space usage is kept at \(O(n)\). In fact, the previous graphs came from timing Java's own ArrayList!

There is one additional problem with the doubling-size array implementation of a List: the size of the array no longer matches the size of the list! So just like with linked lists, we have to store an extra int for the "actual" size of the list, which will always be ≤ the size of the array. Again, this keeps the storage as \(O(n)\) and so it's okay.

2 Stacks

Stacks

Our second ADT is the Stack. We're pretty familiar with Stacks already, so this might not be so hard. Stacks work like a stack of plates: you only add plates onto the top of the stack, and you only remove stacks from the top of the stack. This makes them a LIFO (last-in-first-out).

The methods of a stack are:

  • push(e) - Add element e to the top of the stack.
  • pop() - Take the top element off the top of the stack, and return it.
  • peek() - Return the top element of the stack, without altering the stack.

Let's consider an example. On the left column of this table, I will have some commands. In the middle, I'll put what is returned, if anything. On the right, I'll put the state of the stack, with the top of the stack on the left.

CommandReturned ValueStack
push(3)-3
push(5)-5 3
pop()53
peek()33
push(4)-4 3
peek()44 3
pop()43
pop()3

Implementation with Different Data Structures

Implementation with Linked Lists should seem relatively easy to you; if "first" refers to the top element of the stack, then push() is just an add-to-front, pop() is a remove-from-front, and peek is just return first.getData(); (modulo error checking for an empty stack, of course).

This means that for stacks implemented by linked lists, all three methods can be run in O(1) time. Great!

What about arrays? Well, if we used a "dumb" array, each push and pop would cost \(O(n)\) because of the need to copy the entire contents each time. But as we saw above with ArrayList, we can be smart about this and leave some empty space in the array, doubling it when necessary. You could technically also shrink the array (halving when necessary) in pop(), but that makes everything more cumbersome, so we'll assume the resizings only happen during push().

Here's how the two implementations stack up (get it?):

Linked ListArray
push()O(1)O(1) amortized
pop()O(1)O(1)
peek()O(1)O(1)

Usefulness of Stacks

OK, so stacks are fast! What can we do with them?

  1. Reverse stuff - pushing everything on, and then popping it all off puts it in reverse order.
  2. Maze solving - every time you come to a junction, push it on the stack. When your decision doesn't work out, pop to the last junction, and go the other way. If there are no more possible directions, pop again.
  3. For more complex uses of stacks, stay tuned...

Beginning of Class 12

3 Queues

Queues

We know Queues well from IC211. If a Stack is a LIFO (Last-In First-Out), then a Queue is a FIFO (First-In First-Out). In other words, things come out in exactly the same order they went in.

Think about a line for the checkout counter at the grocery store (after all, "queue" is British for "line"). People can only join the line at the end, and only leave the line at the front. In between, they never switch order.

The methods a Queue must have are:

  • enqueue(e) - Add an element to the back of the Queue
  • dequeue() - Remove and return the front element of the Queue
  • peek() - Return, but do not remove, the front element of the Queue

Implementation

Implementation with a Linked List is fairly straightforward. Enqueue is just an add-to-back, which is O(1) if we have a pointer to the last link in the Linked List. Dequeue is just a remove-from-front, which, again, is O(1). peek is Dequeue but easier, so it is also O(1).

Implementation with an array requires a bit more bookkeeping. Every time we dequeue, we don't want to shift everything down so that the first element remains at the front of the array, that would be expensive and silly. So, we keep two values: front, and size. Front the index of the first element of the queue, and size indicates the number of elements in the queue. As elements are enqueued and dequeued, the queue can wrap around the end of the array, back to the beginning of the array if appropriate. If the array fills up, we double the size of the array, as with the stack. So, enqueue is an AMORTIZED O(1), just like push. dequeue and peek are also trivially O(1).

So, we resize our array when size==array length. Aside from that, how do we do find the index of the first open space, when we're allowed to wrap around? Well, if our array was infinitely long, or no wrapping was required, the first open index would be at front+size. Since we're wrapping, it's at (front+size)%array.length. After inserting your element, you then have to increment size.

To dequeue, we want to indicate that the element currently at index "front" is now eligible for a new element. So, we increment front (wrapping back around to 0 if need be), and decrement size.