hw26: DNA searching

Due before the beginning of class on Wednesday, April 3

The Data
Sequence matching
Your task
Starter code
Hints
Submit command

The Data

A DNA sequence is written as a (very) long string of A, G, T, C representing the different nucleobases that can occur.

Download and unzip the file dna.txt.gz which contains a DNA sequence of length 10 million. You can download directly on the command line with:

wget "https://roche.work/courses/s24sd212/hw/parallel/dna.txt.gz"

After downloading, unzip to create a file dna.txt by running:

gunzip dna.txt.gz

Sequence matching

It is a very common and important task to look for a specific gene or mutation within a larger sequence. For example, there are some known genes that indicate a significantly increased risk for certain cancers. If you submit your own DNA to various commercial and medical providers, they will search for these known genes and tell you if you are at increased risk for those cancers.

But getting a 100% perfect match is too restrictive — there may be some small differences in one or two letters, which still indicate an overall “match”. So the question really becomes, how close some short sequence to matching part of the larger one.

For the purposes of this homework, the matching works by computing the smallest number of different letters in all the possible positions where the search string can occur in the larger sequence.

For example, imagine the entire DNA sequence is:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC

and we are searching for this pattern:

CCGTA

If we try the pattern at position 7 in the sequence, it looks like this and there are two mismatches as indicated:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
       CCGTA
        ^  ^
       2 mismatches

But later on at position 18, there is only one mismatching letter:

TTAAAGCCAGTTCTCCATCGGTAAGCTCACGTATAACCGC
                            CCGTA
                            ^
                            1 mismatch

It turns out that is the fewest number of mismatches at any position, so the best match here is a score of “1”.

Your task

Write a program genesearch.py which uses the dna.txt provided, inputs a search string from the user, and reports the best matching score for any position of that search string in the larger sequence.

Here are two example runs:

roche@ubuntu$ python3 genesearch.py
enter search pattern: ACACATTCTTTTCTATAGTC
4

roche@ubuntu$ python3 genesearch.py
enter search pattern: GATCCGTAGGTAGTGCCCCACA
1

Of course, we want you to use multiprocessing to make the search more efficient! Your program should use at least 4 and at most 10 parallel processes to perform the computation.

Starter code

Fortunately, you don’t have to start from scratch!

This is a fully working genesearch.py program which you should copy/paste and start with.

Your actual task is just to modify it so that it uses multiprocessing to split up the work of searching through the large sequence.

Hints

Review the examples from the end of our notes on multiprocessing using SimpleQueue
The full dna.txt file has exactly 10 million letters. You need to divide this into smaller sub-ranges to search.
Because we are using multiprocessing and need to communicate between the children and parent processes, you should use a SimpleQueue which is created in the parent and sent as a function argument to the children processes.
Each process will search only part of the total dna sequence, should .put() into the shared SimpleQueue a single number representing the smallest number of mismatches found in that part.
The parent process will get() each of these numbers from the SimpleQueue and report the smallest one.

Submit command

To submit files for this homework, run one of these commands:

submit -c=sd212 -p=hw26 genesearch.py
club -csd212 -phw26 genesearch.py