hw18: DNA searching

Due before the beginning of class on Wednesday, March 25

The Data
Sequence matching
Provided code
Your task
Hints
Submit command

The Data

A DNA sequence is written as a (very) long string of A, G, T, C representing the different nucleobases that can occur.

Run git pull in your sd212 directory.

You should see a new folder for this HW with two files:

dna.txt is a DNA sequence of length 10 million
genesearch.py is a Python program that you will modify and submit

Sequence matching

It is a very common and important task to look for a specific gene or mutation within a larger sequence. For example, there are some known genes that indicate a significantly increased risk for certain cancers. If you submit your own DNA to various commercial and medical providers, they will search for these known genes and tell you if you are at increased risk for those cancers.

But getting a 100% perfect match is too restrictive — there may be some small differences in one or two letters, which still indicate an overall “match”. So the question really becomes, how close some short sequence to matching part of the larger one.

For the purposes of this homework, the matching works by computing the smallest number of different letters in all the possible positions where the search string can occur in the larger sequence.

For example, imagine the entire DNA sequence is:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC

and we are searching for this pattern:

CCGTA

If we try the pattern at position 7 in the sequence, it looks like this and there are two mismatches as indicated:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
       CCGTA
        ^  ^
       2 mismatches

But later on at position 18, there is only one mismatching letter:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
                            CCGTA
                            ^
                            1 mismatch

It turns out that is the fewest number of mismatches at any position, so the best match here is a score of “1”.

Provided code

The genesearch.py program which you have uses the dna.txt provided, inputs a search string from the user, and reports the best matching score for any position of that search string in the larger sequence.

Here are two example runs:

roche@ubuntu$ python3 genesearch.py
enter search pattern: ACACATTCTTTTCTATAGTC
4

roche@ubuntu$ python3 genesearch.py
enter search pattern: GATCCGTAGGTAGTGCCCCACA
1

Your task

You need to modify the genesearch.py program to make it run faster using multiprocessing. Your program should use at least 4 and at most 10 parallel processes to perform the computation.

Hints

Review the examples in using concurrent.futures.ProcessPoolExecutor that we went through in class, especially the prime number counting example.
The full dna.txt file has exactly 10 million letters. You need to divide this into smaller sub-ranges to search.
Recall the main steps that you need to do in ANY program to use multiprocessing with the ProcessPoolExecutor:
1. Write a function to do what each task will do.
2. Add a with statement to launch the worker processes
3. Write a loop or something to start each task by calling submit and passing in your function name as well as the arguments for each task
4. Write a separate loop to get the results of each task by calling .result() on each task
5. Combine the results in the way that makes sense for this problem, to get the final answer.

Submit command

To submit files for this homework, run one of these commands:

submit -c=sd212 -p=hw18 genesearch.py
club -csd212 -phw18 genesearch.py