Project 2: Pills
- Due by: 23:59 on Thursday, November 3
- Submit as proj2
Overview
Pharmacies have to have a stock of opiates on hand for filling prescriptions. The DEA tracks every purchase of opiate pills made by pharmacies to replenish their stock. Recently, due to the legal efforts of the Washington Post and the Charleston Gazette-Mail, this database of opiate purchases was made public, and was published for general download and analysis. The database covers 178.5 million purchases of oxycodone and hydrocodone pills from 2006-2012, for a total of more than 76 billion pills. You can read more about this dataset here.
As policymakers confront America’s opiate problem, this dataset provides important clues to where the problem exists. You can see a list of local news stories from around the country based on this dataset here.
We will write programs to analyze this data and attempt to answer two questions: which pharmacies are buying the greatest number of opioid pills, and which zip codes have the highest per-capita purchases.
And of course, you will need to write some data structures to make this happen! Here is how the project will break down:
Part 1 (30%): Write an implementation of the
Map
interface using a balanced search tree, in a fileTreeMap.java
.Part 2 (15%): Use your Map to write a main() method in
Pharmacies.java
that will read in an arcos file and report the total number of pills purchased by each pharmacy, in alphabetical order.Part 3 (25%): Use a heap to implement the methods in the
TopK.java
class, which allows you to efficiently list out the k largest items inserted, for any value of k.Part 4 (30%): Use everything you’ve developed so far to write the
Zips.java
program, which reads in arcos data, zip code data, and a number k, and reports the k zip codes with the highest per-capita pill purchases.Bonus (up to 5%): If you get everything working perfectly, it still will probably take at least 10 seconds or more to run
Zips
on the larger files. Your challenge is to make this faster, using better data structures and any Java tricks you can think of.
Grading and collaboration
For consideration of full credit, besides getting the code working perfectly, you must also:
- Include a completed
README.txt
file in your submission - Use a balanced search tree for your
TreeMap
and a heap for yourTopK
, both of which work perfectly and with the correct big-O running times. - Not use anything from
java.util
exceptList
,ArrayList
, and possiblyIterator
. - Use good variable names, spacing, design, and appropriate documentation. See the coding style rubric for this class.
Review the late policy for projects. The grade penalty for (any fraction of) \(k\) days late is \(5^{k-1}\).
Review the course policy on collaboration for projects. No outside help is allowed from any human other than your instructor and the MGSP leader. This includes both giving and receiving assistance to your classmates or other Midshipmen.
Your instructors and MGSP leaders are happy to help you — please reach out if you get stuck or have any questions.
Data and starter code
Some of the data files for this project are unusually large (around 800MB). So, you have to download a tarball from this link: proj2.tar.gz, and then extract it by running
tar -xzvf proj2.tar.gz
Data (tsv files)
The main data is in two files, arcos-dmv.tsv
and
arcos-annapolis.tsv
. The first one contains about 9 million drug
purchases from DC, Maryland, Virginia, and West Virginia, while the
second one focuses only on the roughly 60,000 purchases in Annapolis.
Each line of these arcos files is a purchase made by a
buyer. The first columns BUYER_NAME
, which is the name of the company
(e.g. pharmacy) making the purchase. The second column,
BUYER_ADDL_CO_INFO
, is often simply the string null
, but sometimes
contains things like store numbers which may be unique when the
BUYER_NAME
isn’t. For the purposes of this project, when we refer to
the “name” of the pharmacy, we mean:
- If the
BUYER_ADDL_CO_INFO
isnull
, then the name is just theBUYER_NAME
- Otherwise, the name is the
BUYER_NAME
and theBUYER_ADDL_CO_INFO
, separated by a single space.
The sixth column, BUYER_ZIP
, has the zip code of where the pharmacy is
located. The seventh column, DOSAGE_UNIT
, is the number of pills
purchased.
Of course, there’s lots of other interesting columns as well, those are just the ones we’ll use in this project. You can see descriptions of the rest of the columns here.
The directory you downloaded also contains zip code population info from
the 2010 U.S. Census. As before, we give you a larger file for bigger
testing (uszips.tsv
), and a smaller one to help you as you test and
develop your code (mdzips.tsv
). Just like the arcos files, these are
tab-separated files with a single header line.
README
You need to complete and answer all questions in the README.txt file before submitting.
Java files
The directory you downloaded contains 5 normal Java files:
Map.java is an interface that you must implement with a balanced search tree (Part 1). You must not modify this file.
TsvReader.java is a class provided for you to conveniently read tsv files. All you need to do is change one line (look for the “TODO”) to add your Map implementation so that each line of the tsv file can be returned as a Map from column names to corresponding values on that line.
Pharmacies.java is the program you need to complete for Part 2, to list total pill counts by pharmacy.
TopK.java is the class you need to change for Part 3 so that it’s more efficient using a heap. Note, this is already complete as-is and works perfectly correctly, but with a very inefficient implementation. So fix it!
Zips.java is the program you need to complete for Part 4, to list the highest pills-per-population ratios by zip code.
Testing and sample runs
As usual, you are expected to thoroughly test your own code before submitting.
To help you get started, we are giving you two “sanity test” checks for parts 1 and 3 respectively, in TreeMapSanityTest.java and TopKSanityTest.java.
Look back at HW7 for a refresher on how to run (and modify!) JUnit tests. Remember these are only a few basic tests — you are strongly encouraged to add your own!
We are also providing two sample outputs for Parts 2 and 4 respectively,
in pharmacies-annapolis.txt
and zips-dmv-50.txt
. For example, to
check your code on part 2, you can run
java Pharmacies arcos-annapolis.tsv | diff - pharmacies-annapolis.txt
If you don’t see any output, then that means everything was a match!
(This is running your code and piping the output (|
) to the diff
command. The diff
command compares two files, and the special command
line argument -
to diff
tells it to use standard input, which is
piped from java
, and comparing to the sample output file we are
providing you.)
Requirements and tips for each part
Part 1: TreeMap
Implement the Map
interface in a new file TreeMap.java
, using some
kind of balanced search tree such as an AVL tree or 2-3-4 tree.
The two most important methods are get(k)
and put(k,v)
, both of
which must have worst-case \(O(\log n)\) running time. The size()
method
should be O(1).
The keys()
method produces a list of all keys in the map, in any
order. For your balanced search tree, you probably want to do an
in-order traversal so the keys come out in alphabetical order (this is
useful for the next part!). The required running time is O(n).
Notice that you are allowed to use the built-in
ArrayList
now in order to return the
list from the keys()
method. You can use this because you’ve “earned
it” by doing the previous homeworks and projects where we implemented
this ourselves!
Part 2: Pharmacies
Complete the main method in Pharmacies.java
so that it reads
transactions from the given arcos file and adds up the total pill counts
according to each pharmacy name.
In the starter code we give you, the logic is already there to either read the filename from the command line or from the terminal. You shouldn’t need to change that part.
To actually read in the arcos file, you will want to use the TsvReader
class we provide for you. Because that class implements Iterable
, you
can put it in a for-each loop, like this:
TsvReader tr = new TsvReader(somefile);
for (Map<String,String> aLine : tr) {
String buyer = aLine.get("BUYER_NAME");
System.out.println(buyer);
}
Notice that each line of the TsvReader
is returned as a Map (which
should be the map you made in part 1!) to associate each column header
such as BUYER_NAME
or DOSAGE_UNIT
to the corresponding part of that
line.
To complete the Pharmacies
program, you will want to create a Map
(obviously!) to associate pharmacy names with their total pill counts.
Note that even for the smaller Annapolis-only test file, there are about
60,000 transactions but only 64 different pharmacies. So it’s important
that your TreeMap
is efficient for doing lots and lots of get
and
set
operations!
The folder you downloaded contains a file pharmacies-annapolis.txt
that you can use to check your output on the arcos-annapolis.tsv
dataset.
On the larger arcos-dmv.tsv
dataset, even a great implementation will
take a few seconds to run because it’s a really big file! But if you
implemented things well, it should finish under half a minute. The first
few lines of output look like this:
$ java Pharmacies arcos-dmv.tsv | head
431200 4 S PHARMACY INC DBA MACPHAIL PHARMACY AT UPPER CHESAPEAKE
498500 4 S PHARMACY INC. MACPHAIL PHARMACY
47900 A & M MEDICAL SUPPLY & PHARMAC
14700 A+ CARE PHARMACY
30 AARON, MAUREEN M MD
50 ABASSI, RASHEED, ADEDAPO. MD
100 ABBASI, MUSTAFA AHMED MD
50 ABDELLA, MUKEMIL F MD
20 ABDO, SUZAN PRIME MD LLC
2200 ABEL, DAVID LYNN
Note: We will test your code on more than just the sample datasets you’ve been given! You may assume that we will only run your code with valid, properly-formatted tsv files, but cannot assume that they will always have the same number of lines, or same pharmacy names, or same ordering, as the files in the sample datasets.
Part 3: TopK
Everything you need to do is laid out explicitly in the commends of the
TopK.java we give you to start with. Basically,
there are just two methods: adding a single element, and removing the k
largest elements. Importantly (to make it easier for you!), the
getTop()
method is guaranteed to only be called once.
There are essentially two good ways to do this with a heap:
Maintain a small min-heap of only the k largest elements so far. Each time a new item is added after the first k, you call
removeMin()
to maintain only the k largest in the heap.Build a single large max-heap at the end. With this idea, you let the elements be added in arbitrary order, and then only when
getTop()
is called, you heapify the array and then useremoveMax()
calls to get the k largest items.
Which way you do it is up to you! But you must use an array-based heap,
the running time for add()
must be \(O(\log n)\) or better,
and the big-O running time for getTop()
must be better than the naive \(O(nk)\) method
in the starter code.
The provided JUnit testing file TopKSanityTest.java
should help you
get started, but you will want to add plenty of your own tests as well.
Part 4: Zips
This is where you put it all together and get some real data analysis! The goal is to identify zip codes with the highest per-person pill sales.
In order to do this, you will have to read two different tsv files: an arcos file with all the transaction data, which you will use to get the total pills sold in each county; and a zips tsv file that has zip codes, city names, and populations.
(Note, the arcos file also has city names, but they aren’t consistent. For this part of the project, only rely on the zip codes from the arcos files, and use the city names from the uszips and mdzips files.)
Your program should read data from both tsv files into multiple
separate instances of your Map, again using the TsvReader
class like in part 2.
These files are pretty large so you definitely don’t want to read them
twice — use your data structure instead!
Next, you need to combine both sources of information to get the
pills/population ratios, and put those into an instance of your TopK
class from Part 3. To combine the maps together, the best way is to
iterate over the keys()
of one map, and then perform lookups into both
maps inside your loop to get the information you need.
Finally, using your heap-based TopK
class, your program should print
out the largest pills per capita ratios, along with the corresponding
city, state, and zip code.
Testing your code on the small Annapolis dataset just has 3 zip codes. Here is what that should look like:
$ java Zips arcos-annapolis.tsv mdzips.tsv 10
447.34 Annapolis, MD 21401
72.03 Annapolis, MD 21403
6.12 Annapolis, MD 21409
The larger dataset will take a little bit to process, but not more than
a minute if you have good data structure implementations. You can check
the top 50 zip codes output from the provided file zips-dmv-50.txt
.
Note: We will test your code with different input files besides the two arcos and two zipcode files provided to you. Your can assume that the input files provided will always exist and be in the correct format, but cannot assume anything about how many lines they have, how the lines are ordered, what particular zip codes are included, etc.
Bonus: FastZips
For up to 5% bonus, write a new program FastZips.java
(without
breaking anything you already wrote!!) that runs the same as the Zips
program but finishes faster. You might consider:
- Using a different Map data structure that is more efficient
- Tweaking your
TopK
class to make it run faster. - Integrating the reading and processing better so you don’t have to make so many maps and heaps.
- Using faster reading from the file (hint: I hear the
java.nio
package has some fast stuff).
Do not expect to get any bonus credit unless your code already works perfectly for the 4 main parts. No one wants their code to be faster if it’s not correct!