Lab 6: Text analysis
- Due before 23:59 Wednesday, October 5
- Submit using the submit program
as
301 lab 06
.
1 Text Analysis with BSTs
Communities in America (unfortunately) have a long history of banning
controversial books. For example, this tarball
contains the full text of several often-banned books which now lie in the
public domain. (Note: If you download this file and then run
tar xzvf books.tgz
, the book contents will be extracted
into a directory called "books".)
Books still get banned, but our online lives have caused our public speech to be both much more plentiful, and much more closely watched. For example, here is a list of words and phrases, revealed by a Freedom of Information Act request in 2011, which was used by the Department of Homeland Security to flag suspicious online text. I wonder which of our frequently-banned books most use words and phrases from this list?
How the lab works
This lab proceeds in multiple parts, and the first parts are by far the most important and the things you absolutely have to complete. Make sure you follow the specifications so your programs work correctly. You should turn in a solution for every portion you complete. In other words, if you complete part 4, I should get from you a bst.py, a one_word.py, a phrases.py, AND a search_all.py.
2 Part 1: a BST to store the bad words
We're going to need a way of identifying if a phrase from our book appears
on the list. We could load up a list with the phrases, and then use
if bookphrase in badWordList:
, but that would be \(O(n)\) for each
search - it will take too long with so many words in the list and in each book!
Instead, let's take advantage of the fact that strings are comparable
(meaning < and > works, based on alphabetical order), in order to
arrange our phrases into a Binary Search Tree.
bst.py
, which
contains a Node
class and a TreeSet
class. The Node class
should just contain a constructor, which accepts only the data for the Node.
The TreeSet class should have:
- a constructor (which takes no arguments other than
self
) - a method
called
insert(self,phrase)
, which addsphrase
to the tree, and - a method called
__contains__(self,phrase)
, which returns true or false, based on ifphrase
appears in the tree.
If you need helper methods to make these work, that is fine, but these should be the only public methods.
A note on __contains__
Some of you probably recognize that the double-underscores means
the __contains__
method has a special purpose, and you'd be right.
This method
overrides Python's in
operator. By this, I mean that if you have an object
myTreeSet
, and you call someString in myTreeSet
, it will
call your __contains__ method just like you wrote
myTreeSet.__contains__(someString)
.
We'll see more operator overloading in the next lab.
3 Part 2: Check individual words from a given book
Create a program called one_word.py
which imports your classes from bst.py
, and fills a tree with each
line from a file called badwords.txt
(you should probably download the one linked above!).
Many of these lines are
multiple words long; that is fine, add the whole phrase as a single data point
to the tree.
Your program should then take a file name from the command-line arguments, and search for each individual word from the book in the tree of bad words; if it appears, it should print the word. At the end, it should print the number of words that appeared.
For example:
$ python3 one_word.py books/huckleberryFinn.txt gang pork exercise sick shooting (...lots more...) who facility help Count: 362
Doing this will require that you strip all punctuation from your words from the books, and make them lower-case. Google can probably help you figure out how to do that!
(In case it's helpful, here is a list of all the punctuation characters that
appear in the sample book files: !"#$%&'()*+,-./:;?@[]_
)
4 Part 3: Check entire phrases
Create a program calledphrases.py
which does all of part 2, as well as searching for phrases. The longest
phrases in the list of bad words are five words long.
For example:
$ python3 phrases.py books/ulysses.txt shooting china who swine who (...more...) foot and mouth (...more...) secret service (...more...) who facility help Count: 1289
5 Part 4: Check multiple books
The rest of this lab is optional for your enrichment and enjoyment. Be sure to work on this only after getting everything else perfect. You should still submit this work, but make sure it doesn't break any of the parts you already had working correctly!
Create a program called rank.py
which displays the number of offending phrases in each of several command-line
arguments. The filenames on the command-line should be ordered according to the
"offense ratio", which is the number of bad phrases divided by the total number of words
in the book. List the most offensive book first.
For example:
$ python3 rank.py books/*.txt books/uncleTomsCabin.txt: 936/191115 books/ulysses.txt: 1289/272816 books/throughTheLookingGlass.txt: 104/33924 books/huckleberryFinn.txt: 366/120070 books/originOfSpecies.txt: 265/160071