hw18: DNA searching
- Due before the beginning of class on Wednesday, March 25
The Data
A DNA sequence is written as a (very) long string of A, G, T, C representing the different nucleobases that can occur.
Run git pull in your sd212 directory.
You should see a new folder for this HW with two files:
dna.txtis a DNA sequence of length 10 milliongenesearch.pyis a Python program that you will modify and submit
Sequence matching
It is a very common and important task to look for a specific gene or mutation within a larger sequence. For example, there are some known genes that indicate a significantly increased risk for certain cancers. If you submit your own DNA to various commercial and medical providers, they will search for these known genes and tell you if you are at increased risk for those cancers.
But getting a 100% perfect match is too restrictive — there may be some small differences in one or two letters, which still indicate an overall “match”. So the question really becomes, how close some short sequence to matching part of the larger one.
For the purposes of this homework, the matching works by computing the smallest number of different letters in all the possible positions where the search string can occur in the larger sequence.
For example, imagine the entire DNA sequence is:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
and we are searching for this pattern:
CCGTA
If we try the pattern at position 7 in the sequence, it looks like this and there are two mismatches as indicated:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
CCGTA
^ ^
2 mismatches
But later on at position 18, there is only one mismatching letter:
TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
CCGTA
^
1 mismatch
It turns out that is the fewest number of mismatches at any position, so the best match here is a score of “1”.
Provided code
The genesearch.py
program which you have uses the dna.txt provided,
inputs a search string from the user, and reports the best matching
score for any position of that search string in the larger sequence.
Here are two example runs:
roche@ubuntu$python3 genesearch.pyenter search pattern:ACACATTCTTTTCTATAGTC4
roche@ubuntu$python3 genesearch.pyenter search pattern:GATCCGTAGGTAGTGCCCCACA1
Your task
You need to modify the genesearch.py program to make it run faster using multiprocessing. Your program should use at least 4 and at most 10 parallel processes to perform the computation.
Hints
Review the examples in using
concurrent.futures.ProcessPoolExecutorthat we went through in class, especially the prime number counting example.The full
dna.txtfile has exactly 10 million letters. You need to divide this into smaller sub-ranges to search.Recall the main steps that you need to do in ANY program to use multiprocessing with the ProcessPoolExecutor:
- Write a function to do what each task will do.
- Add a with statement to launch the worker processes
- Write a loop or something to start each task by calling
submitand passing in your function name as well as the arguments for each task - Write a separate loop to get the results of each task by
calling
.result()on each task - Combine the results in the way that makes sense for this problem, to get the final answer.
Submit command
To submit files for this homework, run one of these commands:
submit -c=sd212 -p=hw18 genesearch.py
club -csd212 -phw18 genesearch.py