CPSC-115 Fall 2008
Project 3 - 100 points
Professor Heidi Ellis

Algorithm due 1:00 p.m. Wednesday October 8, 2008 (20 points)
Program due 11:00 a.m. Thursday October 16, 2008 (80 points)

Deliverables:

You must provide the following:

This project will help you gain skills using nested loops, string functions, and file I/O. This project will be completed in the following pairs:

Teammate 1                  Teammate 2
Kristen Anderson            Corazon Irizarry
Jin Feng Liu                Ryan Ersland
Nick Dragu                  John Wilsterman
Jake Elder                  Greg Vaughan
Catherine Doyle             Jesse Vazquez
Chelsea Bainbridge-Donner   Jeff Young

This project will take significant thought, so plan ahead. Develop in a very incremental manner and save copies of working versions of your code at various steps in the process. You will want to leave yourself time to ask questions of the instructor or the TA.

Overview

Proteins are large organic molecules constructed out of chains of 20 different types of amino acids. Each protein chain folds into a 3-D structure which determines the functionality of the molecule. Proteins perform critical functions of life, including acting as enzymes and hormones; maintaining the shape of cells, muscles, tendons, and ligaments; defending the body against antigens (as antibodies); storing amino acids; carrying molecules from one place to another in the body; and aiding in movement. The sequence of amino acids determines the shape of the protein. The table below shows the 20 amino acids and their identifiers:

Amino Acid Name   Symbol   Amino Acid Name   Symbol
Alanine           A        Leucine           L
Arginine          R        Lysine            K
Asparagine        N        Methionine        M
Aspartic acid     D        Phenylalanine     F
Cysteine          C        Proline           P
Glutamic acid     E        Serine            S
Glutamine         Q        Threonine         T
Glycine           G        Tryptophan        W
Histidine         H        Tyrosine          Y
Isoleucine        I        Valine            V

The structure of a protein can be represented as a string of these upper-case characters, where each character represents one amino acid in the protein's sequence. For example:
ADQLTEEQIAEFKEAFSLFDKDGDGTITTKELGTVMRSLGQNPTEAELQDMI
NEVDADGNGTIDFPEFLTMMARKMKDTDSEEEIREAFRVFDKDGNGYISAAEL
RHVMTNLGEKLTDEEVDEMIREADIDGDGQVNYEEFVQMMTAK
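As a rough illustration of this representation, the sketch below checks that a string uses only the 20 symbols from the table. The handout does not name a programming language, so Python is an assumption here, and `AMINO_ACIDS` and `is_valid_sequence` are hypothetical names chosen for this example.

```python
# The 20 one-letter amino acid symbols from the table above.
AMINO_ACIDS = set("ARNDCEQGHILKMFPSTWYV")


def is_valid_sequence(sequence):
    """Return True if sequence is non-empty and uses only the 20 symbols."""
    return len(sequence) > 0 and all(ch in AMINO_ACIDS for ch in sequence)


print(is_valid_sequence("ADQLTEEQIAEFKEAF"))  # True
print(is_valid_sequence("ADQ LTE"))           # False: contains a blank
```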
Scientists are interested in how proteins are similar because similar proteins may have similar functions. One simple way to measure the similarity of two proteins is to count the positions at which both sequences have the same amino acid. Consider the snippets of two protein sequences shown below:
Protein 1:  AEKEAFSQVNEEF
Protein 2:  QVKNAYMGEGEEPFM
Proteins 1 and 2 share four amino acids in common (K, A, E, E). In addition, the common amino acids fall into three subsequences (K, A, EE). The two proteins share four out of a maximum of fifteen possible amino acids, where fifteen is the length of the longer sequence. Therefore the similarity is 4/15 = 26.6666666666667%.
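The calculation for the two snippets can be sketched as follows. This is a minimal illustration, not a required design: Python is an assumption, and the function names `count_matches` and `percent_similarity` are hypothetical.

```python
def count_matches(protein1, protein2):
    """Count positions where both sequences hold the same amino acid."""
    # zip stops at the end of the shorter sequence, so the extra
    # positions of the longer one can never count as matches.
    return sum(1 for a, b in zip(protein1, protein2) if a == b)


def percent_similarity(protein1, protein2):
    """Matches divided by the length of the longer sequence, as a percentage."""
    longest = max(len(protein1), len(protein2))
    if longest == 0:
        return 0.0  # two empty sequences: avoid dividing by zero
    return count_matches(protein1, protein2) / longest * 100


p1 = "AEKEAFSQVNEEF"
p2 = "QVKNAYMGEGEEPFM"
print(count_matches(p1, p2))                 # 4
print(round(percent_similarity(p1, p2), 2))  # 26.67
```

Dividing by the longer length means that unmatched trailing amino acids lower the score, which agrees with the 4/15 figure above.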

You must write a program to determine the similarity between protein sequences. The protein sequences are to be input from a file. The input file must contain a series of contiguous strings (i.e., strings with no blanks in them). The first sequence in the file is to be read in as the base protein. All other proteins in the file are to be compared to the base protein. Your program must:

Input: The input to your program is a text file that contains a series of protein representations. An example input file is:
EIREAFREEFVGTITTEIR
GDLLFSGNPTIKKEFSQLTIFSLQIAE
SLREAFREEFVGPNNMI
EMREAFLEEFQGTITLEIF
EIREAFREEFVGTITTEIR
The corresponding example output file looks like:
Base Protein:
EIREAFREEFVGTITTEIR
******************** Next Protein *****************
GDLLFSGNPTIKKEFSQLTIFSLQIAE
Shortest identical sequence: 
Length of shortest identical sequence: 0
Longest identical sequence: 
Length of longest identical sequence: 0
Number of matching sequences: 0
Number of matching amino acids: 0
Percentage match between the two strings: %0.0
******************** Next Protein *****************
SLREAFREEFVGPNNMI
Shortest identical sequence: REAFREEFVG
Length of shortest identical sequence: 10
Longest identical sequence: REAFREEFVG
Length of longest identical sequence: 10
Number of matching sequences: 1
Number of matching amino acids: 10
Percentage match between the two strings: %52.63157894736842
******************** Next Protein *****************
EMREAFLEEFQGTITLEIF
Shortest identical sequence: E
Length of shortest identical sequence: 1
Longest identical sequence: REAF
Length of longest identical sequence: 4
Number of matching sequences: 5
Number of matching amino acids: 14
Percentage match between the two strings: %73.68421052631578
******************** Next Protein *****************
EIREAFREEFVGTITTEIR
Shortest identical sequence: 
Length of shortest identical sequence: 0
Longest identical sequence: EIREAFREEFVGTITTEIR
Length of longest identical sequence: 19
Number of matching sequences: 1
Number of matching amino acids: 19
Percentage match between the two strings: %100.0
Another example input file is provided for you to test your program.
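
One possible way to organize the comparison is sketched below, under several assumptions: Python as the language, one sequence per line in the input file, and the hypothetical names `read_sequences`, `matching_runs`, `summarize`, and `report`. It follows the labels of the example output, with ties for the longest sequence going to the first run found. For a protein identical to the base, this sketch reports the full sequence as both the shortest and the longest run, whereas the example output above lists an empty shortest sequence for that case.

```python
SEPARATOR = "******************** Next Protein *****************"


def read_sequences(path):
    """Return the non-blank lines of the file, stripped of whitespace."""
    with open(path) as infile:
        return [line.strip() for line in infile if line.strip()]


def matching_runs(base, other):
    """Return the runs of consecutive positions where base and other agree."""
    runs = []
    current = ""
    for a, b in zip(base, other):
        if a == b:
            current += a
        elif current:
            runs.append(current)
            current = ""
    if current:  # a run that reaches the end of the shorter sequence
        runs.append(current)
    return runs


def summarize(base, other):
    """Collect the statistics that the report prints for one comparison."""
    runs = matching_runs(base, other)
    matches = sum(len(run) for run in runs)
    longest_length = max(len(base), len(other))
    return {
        "shortest": min(runs, key=len, default=""),
        "longest": max(runs, key=len, default=""),
        "runs": len(runs),
        "matches": matches,
        "percent": matches / longest_length * 100 if longest_length else 0.0,
    }


def report(sequences):
    """Build the report lines; the first sequence is the base protein."""
    base = sequences[0]
    lines = ["Base Protein:", base]
    for other in sequences[1:]:
        s = summarize(base, other)
        lines += [
            SEPARATOR,
            other,
            "Shortest identical sequence: " + s["shortest"],
            "Length of shortest identical sequence: " + str(len(s["shortest"])),
            "Longest identical sequence: " + s["longest"],
            "Length of longest identical sequence: " + str(len(s["longest"])),
            "Number of matching sequences: " + str(s["runs"]),
            "Number of matching amino acids: " + str(s["matches"]),
            "Percentage match between the two strings: %" + str(s["percent"]),
        ]
    return lines


# In-memory demonstration with sequences from the example input file.
example = ["EIREAFREEFVGTITTEIR", "SLREAFREEFVGPNNMI", "EMREAFLEEFQGTITLEIF"]
print("\n".join(report(example)))
```

To run the same report on a file, pass the result of `read_sequences` to `report`, for example `report(read_sequences("proteins.txt"))`, where the file name is a placeholder.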

You must abide by the following:

Grading:

Project 3 will be graded on: