|
CPSC 320: Analysis of Algorithms
|
Spring 2026
|
Programming project: DNA sequence alignment
Due Wednesday, April 22*
A DNA (or genetic) sequence is a string defined over the
alphabet $\{ A, C, G, T \}$ (where $A$, $C$, $G$ and $T$ stand for the four
nucleotide bases of a DNA, namely, adenosine, cytidine,
guanosine and thymidine, respectively). In molecular biology,
it is understood that such a sequence encodes instructions for building
protein molecules that form your cells and, by extension, your body (for more
on DNA sequencing and its importance, see, e.g.,
NHGRI's educational resources).
For this programming project, with another student as a pair, you are to
implement a program to determine the similarity of two given DNA sequences by
finding the “best” alignment (which will be defined below) of the
two.
Edit distance.
In this project, we will measure the similarity of two DNA sequences by
their edit distance, which is defined by the total penalty costs for
gaps and mismatches that arise when two sequences are aligned
together. The penalty cost for any single gap is usually defined by a
constant $\delta$, whereas the penalty cost for a mismatch may be different
for each mismatched pair. For the alphabet $\{ A, C, G, T \}$, the penalty
costs for mismatches may be defined by a similarity matrix such as
A C G T
A 0 1 3 0
C 1 0 1 2
G 3 1 0 4
T 0 2 4 0
In this matrix, the penalty cost for the mismatch $(A, C)$ is $1$, whereas the
penalty cost for the mismatch $(A, G)$ is $3$. In general, a similarity
matrix is usually symmetric in the sense that the penalty cost for
$(X, Y)$ is equal to that for $(Y, X)$ for any letters $X$ and $Y$. Also, its
diagonal entries are usually $0$, indicating that there is no penalty for
matching pairs. For example, using this matrix with the gap penalty
$\delta = 2$, two possible alignments of the sequences AAAGTCTGAC and
AACGTTTAC are:
| Sequence 1: |
A A A G T C T G A C |
A A A G T C T G A C |
| Sequence 2: |
A A C G T - T T A C |
A A C G T T T - A C |
| Penalty: |
0 0 1 0 0 2 0 4 0 0 |
0 0 1 0 0 2 0 2 0 0 |
The best alignment of two sequences is an alignment that yields the
minimum edit distance (i.e., an alignment that minimizes the total penalty for
gaps and mismatches). In this example, the second alignment in fact yields
the minimum edit distance of $5$.
1. Algorithm design
First, for given DNA sequences $s$ and $t$ of lengths $m$ and $n$,
respectively, both defined over the alphabet $\{ A, C, G, T \}$, design an
$O(mn)$-time algorithm to find the best alignment of $s$ and $t$ and justify
its correctness and running time. You may assume that the penalty cost for
any single gap is a constant $\delta$ and the penalty costs for mismatches
are specified in a similarity matrix.
2. Implementation
Now, implement your algorithm in Java. Your program must be a command-line
Java application and able to read input from any text file in the following
format:
a value of $\delta$, a $4 \times 4$ similarity matrix, followed by two
sequences.
For example, if your Java application is named DNAAlign, and input
is given in a file named input.txt, then it should run in command
line with the command:
$ java DNAAlign input.txt
Here is an example of an input file for the example given above:
2
0 1 3 0
1 0 1 2
3 1 0 4
0 2 4 0
AAAGTCTGAC
AACGTTTAC
Assume that a similarity matrix is indexed in the order of $A$, $C$, $G$, $T$
for both rows and columns. You may also assume that such a matrix is
symmetric, its diagonal entries are $0$, and the remaining entries are in the
range of $0$ through $9$. Your program must output in command line in the
following format:
The best alignment is
A A A G T C T G A C
A A C G T T T - A C
0 0 1 0 0 2 0 2 0 0
with the minimum edit distance of 5.
Simplified version for less credit.
If you run into difficulty with a similarity matrix, you may simplify the
problem in the following way to earn up to 80% of the full credit: Assume
that the penalty cost for any single gap is $2$, the penalty costs for
mismatches are all $1$, and there is no penalty for matching pairs
(thus, input has just two sequences).
3. Compatibility and APIs
Your implementation must be well-designed, well-documented and
compatible with Java Platform, Standard Edition (Java SE) 8
available in the GNU/Linux system in the department. You are free and
expected to use any of the core APIs,
but you must write everything else on your own. In your documentation,
clearly state exactly what APIs you are using.
4. Documentation
Javadoc.
Starting with a file header with your and your partner's names, document
every method in every source file following the Javadoc formant.
Javadoc is a documentation production program that comes with the standard
JDK. This program takes a Java source file that has been commented using
Javadoc tags and produces an HTML document. For more details about
Javadoc, see theJavadoc Tool Home Page.
README.
You must also create a README file, in plain text, explaining
how one can compile and execute your program. It should begin with your and
your partner's names. In this file, you should also provide a brief overview
of your implementation, including how you implemented your algorithm using
what data structures and what Java classes.
5. What to hand in
Archive all .java files and a README file in a single JAR
file named with your and your partner's initials (e.g., am-tm.jar).
Before you submit, make sure to test and confirm that your JAR file has all
required files. A late resubmission for any missing file will result in a
late penalty. For a tutorial on JAR files, see
Lesson: Packaging Programs in JAR Files.
While this is a pair project, each student must submit the JAR file to this Moodle site:
https://moodle.trincoll.edu/mod/assign/view.php?id=628524
Also, each student must handwrite and submit a solution to Part 1 in paper.
Plagiarism and academic dishonesty.
Remember, under any circumstance, you must not copy part or all of another's
work and present it as your own. For more details, read
our course policy on plagiarism and
academic dishonesty.
*If you will be giving a presentation on Monday, April 20, or Wednesday,
April 22, the deadline is Monday, April 27.