hw24: DNA searching
- Due before the beginning of class on Friday, March 31
The Data
A DNA sequence is written as a (very) long string of A, G, T, C representing the different nucleobases that can occur.
Download and unzip the file dna.txt.gz which contains a DNA sequence of length 10 million. You can download directly on the command line with:
wget "https://roche.work/courses/s23sd212/hw/parallel/dna.txt.gz"
After downloading, unzip to create a file dna.txt
by running:
gunzip dna.txt.gz
Sequence matching
It is a very common and important task to look for a specific gene or mutation within a larger sequence. For example, there are some known genes that indicate a significantly increased risk for certain cancers. If you submit your own DNA to various commercial and medical providers, they will search for these known genes and tell you if you are at increased risk for those cancers.
But getting a 100% perfect match is too restrictive — there may be some small differences in one or two letters, which still indicate an overall “match”. So the question really becomes, how close some short sequence to matching part of the larger one.
For the purposes of this homework, the matching works by computing the smallest number of different letters in all the possible positions where the search string can occur in the larger sequence.
For example, imagine the entire DNA sequence is:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
and we are searching for this pattern:
CCGTA
If we try the pattern at position 7 in the sequence, it looks like this and there are two mismatches as indicated:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
CCGTA
^ ^
2 mismatches
But later on at position 18, there is only one mismatching letter:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
CCGTA
^
1 mismatch
It turns out that is the fewest number of mismatches at any position, so the best match here is a score of “1”.
Your task
Write a program genesearch.py
which uses the dna.txt
provided,
inputs a search string from the user, and reports the best matching
score for any position of that search string in the larger sequence.
Here are two example runs:
roche@ubuntu$
python3 genesearch.py
enter search pattern:
ACACATTCTTTTCTATAGTC
4
roche@ubuntu$
python3 genesearch.py
enter search pattern:
GATCCGTAGGTAGTGCCCCACA
1
Of course, we want you to use multiprocessing to make the search more efficient! Your program should use at least 4 and at most 10 parallel processes to perform the computation.
Starter code
Fortunately, you don’t have to start from scratch!
This is a fully working genesearch.py program which you should copy/paste and start with.
Your actual task is just to modify it so that it uses multiprocessing to split up the work of searching through the large sequence.
Hints
Review the examples from the end of our notes on multiprocessing using SimpleQueue
The full
dna.txt
file has exactly 10 million letters. You need to divide this into smaller sub-ranges to search.Because we are using multiprocessing and need to communicate between the children and parent processes, you should use a
SimpleQueue
which is created in the parent and sent as a function argument to the children processes.Each process will search only part of the total dna sequence, should
.put()
into the shared SimpleQueue a single number representing the smallest number of mismatches found in that part.The parent process will
get()
each of these numbers from the SimpleQueue and report the smallest one.
Submit command
To submit files for this homework, run one of these commands:
submit -c=sd212 -p=hw24 genesearch.py
club -csd212 -phw24 genesearch.py