CPSC 320: Analysis of Algorithms Spring 2026

Programming project: DNA sequence alignment

Due Wednesday, April 22*

A DNA (or genetic) sequence is a string defined over the alphabet $\{ A, C, G, T \}$ (where $A$, $C$, $G$ and $T$ stand for the four nucleotide bases of a DNA, namely, adenosine, cytidine, guanosine and thymidine, respectively). In molecular biology, it is understood that such a sequence encodes instructions for building protein molecules that form your cells and, by extension, your body (for more on DNA sequencing and its importance, see, e.g., NHGRI's educational resources). For this programming project, with another student as a pair, you are to implement a program to determine the similarity of two given DNA sequences by finding the “best” alignment (which will be defined below) of the two.

Edit distance. In this project, we will measure the similarity of two DNA sequences by their edit distance, which is defined by the total penalty costs for gaps and mismatches that arise when two sequences are aligned together. The penalty cost for any single gap is usually defined by a constant $\delta$, whereas the penalty cost for a mismatch may be different for each mismatched pair. For the alphabet $\{ A, C, G, T \}$, the penalty costs for mismatches may be defined by a similarity matrix such as
  A C G T
A 0 1 3 0
C 1 0 1 2
G 3 1 0 4
T 0 2 4 0
In this matrix, the penalty cost for the mismatch $(A, C)$ is $1$, whereas the penalty cost for the mismatch $(A, G)$ is $3$. In general, a similarity matrix is usually symmetric in the sense that the penalty cost for $(X, Y)$ is equal to that for $(Y, X)$ for any letters $X$ and $Y$. Also, its diagonal entries are usually $0$, indicating that there is no penalty for matching pairs. For example, using this matrix with the gap penalty $\delta = 2$, two possible alignments of the sequences AAAGTCTGAC and AACGTTTAC are:
Sequence 1: A A A G T C T G A C A A A G T C T G A C
Sequence 2: A A C G T - T T A C A A C G T T T - A C
Penalty: 0 0 1 0 0 2 0 4 0 0 0 0 1 0 0 2 0 2 0 0
The best alignment of two sequences is an alignment that yields the minimum edit distance (i.e., an alignment that minimizes the total penalty for gaps and mismatches). In this example, the second alignment in fact yields the minimum edit distance of $5$.

1. Algorithm design

First, for given DNA sequences $s$ and $t$ of lengths $m$ and $n$, respectively, both defined over the alphabet $\{ A, C, G, T \}$, design an $O(mn)$-time algorithm to find the best alignment of $s$ and $t$ and justify its correctness and running time. You may assume that the penalty cost for any single gap is a constant $\delta$ and the penalty costs for mismatches are specified in a similarity matrix.

2. Implementation

Now, implement your algorithm in Java. Your program must be a command-line Java application and able to read input from any text file in the following format: a value of $\delta$, a $4 \times 4$ similarity matrix, followed by two sequences. For example, if your Java application is named DNAAlign, and input is given in a file named input.txt, then it should run in command line with the command:
$ java DNAAlign input.txt
Here is an example of an input file for the example given above:
2
0 1 3 0
1 0 1 2
3 1 0 4
0 2 4 0
AAAGTCTGAC
AACGTTTAC
Assume that a similarity matrix is indexed in the order of $A$, $C$, $G$, $T$ for both rows and columns. You may also assume that such a matrix is symmetric, its diagonal entries are $0$, and the remaining entries are in the range of $0$ through $9$. Your program must output in command line in the following format:
The best alignment is

A A A G T C T G A C
A A C G T T T - A C
0 0 1 0 0 2 0 2 0 0

with the minimum edit distance of 5.
Simplified version for less credit. If you run into difficulty with a similarity matrix, you may simplify the problem in the following way to earn up to 80% of the full credit: Assume that the penalty cost for any single gap is $2$, the penalty costs for mismatches are all $1$, and there is no penalty for matching pairs (thus, input has just two sequences).

3. Compatibility and APIs

Your implementation must be well-designed, well-documented and compatible with Java Platform, Standard Edition (Java SE) 8 available in the GNU/Linux system in the department. You are free and expected to use any of the core APIs, but you must write everything else on your own. In your documentation, clearly state exactly what APIs you are using.

4. Documentation

Javadoc. Starting with a file header with your and your partner's names, document every method in every source file following the Javadoc formant. Javadoc is a documentation production program that comes with the standard JDK. This program takes a Java source file that has been commented using Javadoc tags and produces an HTML document. For more details about Javadoc, see theJavadoc Tool Home Page.

README. You must also create a README file, in plain text, explaining how one can compile and execute your program. It should begin with your and your partner's names. In this file, you should also provide a brief overview of your implementation, including how you implemented your algorithm using what data structures and what Java classes.

5. What to hand in

Archive all .java files and a README file in a single JAR file named with your and your partner's initials (e.g., am-tm.jar). Before you submit, make sure to test and confirm that your JAR file has all required files. A late resubmission for any missing file will result in a late penalty. For a tutorial on JAR files, see Lesson: Packaging Programs in JAR Files. While this is a pair project, each student must submit the JAR file to this Moodle site:
https://moodle.trincoll.edu/mod/assign/view.php?id=628524
Also, each student must handwrite and submit a solution to Part 1 in paper.

Plagiarism and academic dishonesty. Remember, under any circumstance, you must not copy part or all of another's work and present it as your own. For more details, read our course policy on plagiarism and academic dishonesty.

*If you will be giving a presentation on Monday, April 20, or Wednesday, April 22, the deadline is Monday, April 27.


* CPSC 320 home page
Valid HTML 4.01!