This is the archived website of SY 301 from the Fall 2016 semester. Feel free to browse around; you may also find more recent offerings at my teaching page.

Lab 6: Text analysis

Due before 23:59 Wednesday, October 5
Submit using the submit program as 301 lab 06.

1 Text Analysis with BSTs

Communities in America (unfortunately) have a long history of banning controversial books. For example, this tarball contains the full text of several often-banned books which now lie in the public domain. (Note: If you download this file and then run tar xzvf books.tgz, the book contents will be extracted into a directory called "books".)

Books still get banned, but our online lives have caused our public speech to be both much more plentiful, and much more closely watched. For example, here is a list of words and phrases, revealed by a Freedom of Information Act request in 2011, which was used by the Department of Homeland Security to flag suspicious online text. I wonder which of our frequently-banned books most use words and phrases from this list?

How the lab works

This lab proceeds in multiple parts, and the first parts are by far the most important and the things you absolutely have to complete. Make sure you follow the specifications so your programs work correctly. You should turn in a solution for every portion you complete. In other words, if you complete part 4, I should get from you a bst.py, a one_word.py, a phrases.py, AND a search_all.py.

2 Part 1: a BST to store the bad words

We're going to need a way of identifying if a phrase from our book appears on the list. We could load up a list with the phrases, and then use if bookphrase in badWordList:, but that would be $O(n)$ for each search - it will take too long with so many words in the list and in each book! Instead, let's take advantage of the fact that strings are comparable (meaning < and > works, based on alphabetical order), in order to arrange our phrases into a Binary Search Tree.

Create a file called bst.py, which contains a Node class and a TreeSet class. The Node class should just contain a constructor, which accepts only the data for the Node. The TreeSet class should have:

a constructor (which takes no arguments other than self)
a method called insert(self,phrase), which adds phrase to the tree, and
a method called __contains__(self,phrase), which returns true or false, based on if phrase appears in the tree.

If you need helper methods to make these work, that is fine, but these should be the only public methods.

A note on contains

Some of you probably recognize that the double-underscores means the __contains__ method has a special purpose, and you'd be right. This method overrides Python's in operator. By this, I mean that if you have an object myTreeSet, and you call someString in myTreeSet, it will call your __contains__ method just like you wrote myTreeSet.__contains__(someString). We'll see more operator overloading in the next lab.

3 Part 2: Check individual words from a given book

Create a program called one_word.py which imports your classes from bst.py, and fills a tree with each line from a file called badwords.txt (you should probably download the one linked above!). Many of these lines are multiple words long; that is fine, add the whole phrase as a single data point to the tree.

Your program should then take a file name from the command-line arguments, and search for each individual word from the book in the tree of bad words; if it appears, it should print the word. At the end, it should print the number of words that appeared.

For example:

$ python3 one_word.py books/huckleberryFinn.txt
gang
pork
exercise
sick
shooting
(...lots more...)
who
facility
help
Count: 362

Doing this will require that you strip all punctuation from your words from the books, and make them lower-case. Google can probably help you figure out how to do that!

(In case it's helpful, here is a list of all the punctuation characters that appear in the sample book files: !"#$%&'()*+,-./:;?@[]_)

4 Part 3: Check entire phrases

Create a program called phrases.py which does all of part 2, as well as searching for phrases. The longest phrases in the list of bad words are five words long.

For example:

$ python3 phrases.py books/ulysses.txt
shooting
china
who
swine
who
(...more...)
foot and mouth
(...more...)
secret service
(...more...)
who
facility
help
Count: 1289

5 Part 4: Check multiple books

The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!

Create a program called rank.py which displays the number of offending phrases in each of several command-line arguments. The filenames on the command-line should be ordered according to the "offense ratio", which is the number of bad phrases divided by the total number of words in the book. List the most offensive book first.

For example:

$ python3 rank.py books/*.txt
books/uncleTomsCabin.txt: 936/191115
books/ulysses.txt: 1289/272816
books/throughTheLookingGlass.txt: 104/33924
books/huckleberryFinn.txt: 366/120070
books/originOfSpecies.txt: 265/160071

1 Text Analysis with BSTs

How the lab works

2 Part 1: a BST to store the bad words

A note on __contains__

3 Part 2: Check individual words from a given book

4 Part 3: Check entire phrases

5 Part 4: Check multiple books

A note on contains