IC 312 Fall 2022 / Projects


This is the archived website of IC 312 from the Fall 2022 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Project 2: Pills

  • Due by: 23:59 on Thursday, November 3
  • Submit as proj2

Overview

Pharmacies have to have a stock of opiates on hand for filling prescriptions. The DEA tracks every purchase of opiate pills made by pharmacies to replenish their stock. Recently, due to the legal efforts of the Washington Post and the Charleston Gazette-Mail, this database of opiate purchases was made public, and was published for general download and analysis. The database covers 178.5 million purchases of oxycodone and hydrocodone pills from 2006-2012, for a total of more than 76 billion pills. You can read more about this dataset here.

As policymakers confront America’s opiate problem, this dataset provides important clues to where the problem exists. You can see a list of local news stories from around the country based on this dataset here.

We will write programs to analyze this data and attempt to answer two questions: which pharmacies are buying the greatest number of opioid pills, and which zip codes have the highest per-capita purchases.

And of course, you will need to write some data structures to make this happen! Here is how the project will break down:

  • Part 1 (30%): Write an implementation of the Map interface using a balanced search tree, in a file TreeMap.java.

  • Part 2 (15%): Use your Map to write a main() method in Pharmacies.java that will read in an arcos file and report the total number of pills purchased by each pharmacy, in alphabetical order.

  • Part 3 (25%): Use a heap to implement the methods in the TopK.java class, which allows you to efficiently list out the k largest items inserted, for any value of k.

  • Part 4 (30%): Use everything you’ve developed so far to write the Zips.java program, which reads in arcos data, zip code data, and a number k, and reports the k zip codes with the highest per-capita pill purchases.

  • Bonus (up to 5%): If you get everything working perfectly, it still will probably take at least 10 seconds or more to run Zips on the larger files. Your challenge is to make this faster, using better data structures and any Java tricks you can think of.

Grading and collaboration

For consideration of full credit, besides getting the code working perfectly, you must also:

  • Include a completed README.txt file in your submission
  • Use a balanced search tree for your TreeMap and a heap for your TopK, both of which work perfectly and with the correct big-O running times.
  • Not use anything from java.util except List, ArrayList, and possibly Iterator.
  • Use good variable names, spacing, design, and appropriate documentation. See the coding style rubric for this class.

Review the late policy for projects. The grade penalty for (any fraction of) \(k\) days late is \(5^{k-1}\).

Review the course policy on collaboration for projects. No outside help is allowed from any human other than your instructor and the MGSP leader. This includes both giving and receiving assistance to your classmates or other Midshipmen.

Your instructors and MGSP leaders are happy to help you — please reach out if you get stuck or have any questions.

Data and starter code

Some of the data files for this project are unusually large (around 800MB). So, you have to download a tarball from this link: proj2.tar.gz, and then extract it by running

tar -xzvf proj2.tar.gz

Data (tsv files)

The main data is in two files, arcos-dmv.tsv and arcos-annapolis.tsv. The first one contains about 9 million drug purchases from DC, Maryland, Virginia, and West Virginia, while the second one focuses only on the roughly 60,000 purchases in Annapolis.

Each line of these arcos files is a purchase made by a buyer. The first columns BUYER_NAME, which is the name of the company (e.g. pharmacy) making the purchase. The second column, BUYER_ADDL_CO_INFO, is often simply the string null, but sometimes contains things like store numbers which may be unique when the BUYER_NAME isn’t. For the purposes of this project, when we refer to the “name” of the pharmacy, we mean:

  • If the BUYER_ADDL_CO_INFO is null, then the name is just the BUYER_NAME
  • Otherwise, the name is the BUYER_NAME and the BUYER_ADDL_CO_INFO, separated by a single space.

The sixth column, BUYER_ZIP, has the zip code of where the pharmacy is located. The seventh column, DOSAGE_UNIT, is the number of pills purchased.

Of course, there’s lots of other interesting columns as well, those are just the ones we’ll use in this project. You can see descriptions of the rest of the columns here.

The directory you downloaded also contains zip code population info from the 2010 U.S. Census. As before, we give you a larger file for bigger testing (uszips.tsv), and a smaller one to help you as you test and develop your code (mdzips.tsv). Just like the arcos files, these are tab-separated files with a single header line.

README

You need to complete and answer all questions in the README.txt file before submitting.

Java files

The directory you downloaded contains 5 normal Java files:

  • Map.java is an interface that you must implement with a balanced search tree (Part 1). You must not modify this file.

  • TsvReader.java is a class provided for you to conveniently read tsv files. All you need to do is change one line (look for the “TODO”) to add your Map implementation so that each line of the tsv file can be returned as a Map from column names to corresponding values on that line.

  • Pharmacies.java is the program you need to complete for Part 2, to list total pill counts by pharmacy.

  • TopK.java is the class you need to change for Part 3 so that it’s more efficient using a heap. Note, this is already complete as-is and works perfectly correctly, but with a very inefficient implementation. So fix it!

  • Zips.java is the program you need to complete for Part 4, to list the highest pills-per-population ratios by zip code.

Testing and sample runs

As usual, you are expected to thoroughly test your own code before submitting.

To help you get started, we are giving you two “sanity test” checks for parts 1 and 3 respectively, in TreeMapSanityTest.java and TopKSanityTest.java.

Look back at HW7 for a refresher on how to run (and modify!) JUnit tests. Remember these are only a few basic tests — you are strongly encouraged to add your own!

We are also providing two sample outputs for Parts 2 and 4 respectively, in pharmacies-annapolis.txt and zips-dmv-50.txt. For example, to check your code on part 2, you can run

java Pharmacies arcos-annapolis.tsv | diff - pharmacies-annapolis.txt

If you don’t see any output, then that means everything was a match!

(This is running your code and piping the output (|) to the diff command. The diff command compares two files, and the special command line argument - to diff tells it to use standard input, which is piped from java, and comparing to the sample output file we are providing you.)

Requirements and tips for each part

Part 1: TreeMap

Implement the Map interface in a new file TreeMap.java, using some kind of balanced search tree such as an AVL tree or 2-3-4 tree.

The two most important methods are get(k) and put(k,v), both of which must have worst-case \(O(\log n)\) running time. The size() method should be O(1).

The keys() method produces a list of all keys in the map, in any order. For your balanced search tree, you probably want to do an in-order traversal so the keys come out in alphabetical order (this is useful for the next part!). The required running time is O(n).

Notice that you are allowed to use the built-in ArrayList now in order to return the list from the keys() method. You can use this because you’ve “earned it” by doing the previous homeworks and projects where we implemented this ourselves!

Part 2: Pharmacies

Complete the main method in Pharmacies.java so that it reads transactions from the given arcos file and adds up the total pill counts according to each pharmacy name.

In the starter code we give you, the logic is already there to either read the filename from the command line or from the terminal. You shouldn’t need to change that part.

To actually read in the arcos file, you will want to use the TsvReader class we provide for you. Because that class implements Iterable, you can put it in a for-each loop, like this:

TsvReader tr = new TsvReader(somefile);
for (Map<String,String> aLine : tr) {
  String buyer = aLine.get("BUYER_NAME");
  System.out.println(buyer);
}

Notice that each line of the TsvReader is returned as a Map (which should be the map you made in part 1!) to associate each column header such as BUYER_NAME or DOSAGE_UNIT to the corresponding part of that line.

To complete the Pharmacies program, you will want to create a Map (obviously!) to associate pharmacy names with their total pill counts. Note that even for the smaller Annapolis-only test file, there are about 60,000 transactions but only 64 different pharmacies. So it’s important that your TreeMap is efficient for doing lots and lots of get and set operations!

The folder you downloaded contains a file pharmacies-annapolis.txt that you can use to check your output on the arcos-annapolis.tsv dataset.

On the larger arcos-dmv.tsv dataset, even a great implementation will take a few seconds to run because it’s a really big file! But if you implemented things well, it should finish under half a minute. The first few lines of output look like this:

$ java Pharmacies arcos-dmv.tsv | head
  431200 4 S PHARMACY INC DBA MACPHAIL PHARMACY AT UPPER CHESAPEAKE
  498500 4 S PHARMACY INC. MACPHAIL PHARMACY
   47900 A & M MEDICAL SUPPLY & PHARMAC
   14700 A+ CARE PHARMACY
      30 AARON, MAUREEN M MD
      50 ABASSI, RASHEED, ADEDAPO. MD
     100 ABBASI, MUSTAFA AHMED MD
      50 ABDELLA, MUKEMIL F MD
      20 ABDO, SUZAN PRIME MD LLC
    2200 ABEL, DAVID LYNN

Note: We will test your code on more than just the sample datasets you’ve been given! You may assume that we will only run your code with valid, properly-formatted tsv files, but cannot assume that they will always have the same number of lines, or same pharmacy names, or same ordering, as the files in the sample datasets.

Part 3: TopK

Everything you need to do is laid out explicitly in the commends of the TopK.java we give you to start with. Basically, there are just two methods: adding a single element, and removing the k largest elements. Importantly (to make it easier for you!), the getTop() method is guaranteed to only be called once.

There are essentially two good ways to do this with a heap:

  • Maintain a small min-heap of only the k largest elements so far. Each time a new item is added after the first k, you call removeMin() to maintain only the k largest in the heap.

  • Build a single large max-heap at the end. With this idea, you let the elements be added in arbitrary order, and then only when getTop() is called, you heapify the array and then use removeMax() calls to get the k largest items.

Which way you do it is up to you! But you must use an array-based heap, the running time for add() must be \(O(\log n)\) or better, and the big-O running time for getTop() must be better than the naive \(O(nk)\) method in the starter code.

The provided JUnit testing file TopKSanityTest.java should help you get started, but you will want to add plenty of your own tests as well.

Part 4: Zips

This is where you put it all together and get some real data analysis! The goal is to identify zip codes with the highest per-person pill sales.

In order to do this, you will have to read two different tsv files: an arcos file with all the transaction data, which you will use to get the total pills sold in each county; and a zips tsv file that has zip codes, city names, and populations.

(Note, the arcos file also has city names, but they aren’t consistent. For this part of the project, only rely on the zip codes from the arcos files, and use the city names from the uszips and mdzips files.)

Your program should read data from both tsv files into multiple separate instances of your Map, again using the TsvReader class like in part 2. These files are pretty large so you definitely don’t want to read them twice — use your data structure instead!

Next, you need to combine both sources of information to get the pills/population ratios, and put those into an instance of your TopK class from Part 3. To combine the maps together, the best way is to iterate over the keys() of one map, and then perform lookups into both maps inside your loop to get the information you need.

Finally, using your heap-based TopK class, your program should print out the largest pills per capita ratios, along with the corresponding city, state, and zip code.

Testing your code on the small Annapolis dataset just has 3 zip codes. Here is what that should look like:

$ java Zips arcos-annapolis.tsv mdzips.tsv 10
  447.34 Annapolis, MD 21401
   72.03 Annapolis, MD 21403
    6.12 Annapolis, MD 21409

The larger dataset will take a little bit to process, but not more than a minute if you have good data structure implementations. You can check the top 50 zip codes output from the provided file zips-dmv-50.txt.

Note: We will test your code with different input files besides the two arcos and two zipcode files provided to you. Your can assume that the input files provided will always exist and be in the correct format, but cannot assume anything about how many lines they have, how the lines are ordered, what particular zip codes are included, etc.

Bonus: FastZips

For up to 5% bonus, write a new program FastZips.java (without breaking anything you already wrote!!) that runs the same as the Zips program but finishes faster. You might consider:

  • Using a different Map data structure that is more efficient
  • Tweaking your TopK class to make it run faster.
  • Integrating the reading and processing better so you don’t have to make so many maps and heaps.
  • Using faster reading from the file (hint: I hear the java.nio package has some fast stuff).

Do not expect to get any bonus credit unless your code already works perfectly for the 4 main parts. No one wants their code to be faster if it’s not correct!