Peptide Identification

We are now ready to solve the Peptide
Identification Problem: find a peptide from a proteome with
maximum score against a spectrum. The input to the problem
is a spectral vector and an amino acid string Proteome, and the
output is an amino acid string Peptide that maximizes the score
against the spectral vector among all
substrings of Proteome. But even if we solve this problem, how will it help us, since we don’t
know what the T-Rex Proteome is? Fortunately, we can approximate
the T-Rex proteome. Note that 90% of the proteins making
up animal bones are collagens, and collagens are highly conserved across
species, so collagens in T-Rex will likely be similar to
collagens in some present-day species; the question is which ones. Asara therefore formed
a database of collagen proteins. As a sanity check, however, he actually
compared the T-Rex spectra against the entire UniProt database of
all known proteins, which currently consists of 200 million
amino acids from hundreds of species. Asara also included some mutated versions
of collagens from present-day species. We will call the augmented
database UniProt+. Asara then searched all T-Rex
spectra against the UniProt+ database, and it turned out that most of the high-scoring
peptides identified in UniProt+ were chicken collagens or
were very similar to chicken collagens, supporting the hypothesis that
birds evolved from dinosaurs. For example, this is one of the peptides
identified by Asara for a dinosaur spectrum, and it is
only one mutation away, shown in red,
from the chicken collagen peptide. But how can we be sure
that this DinosaurPeptide is the correct interpretation
of DinosaurSpectrum? Shouldn’t we analyze the statistical
significance of DinosaurPeptide? DinosaurPeptide is indeed
the highest-scoring peptide for DinosaurSpectrum among all
peptides in UniProt+. But the surprising fact is that
there are billions of peptides not occurring in UniProt+
that outscore DinosaurPeptide. Does this concern you? It looks like we need
to develop a method for evaluating the statistical
significance of identified peptides. To help us evaluate the statistical
significance of DinosaurPeptide, we need to recall the small print. The match between Peptide and
Spectrum was considered significant only if the resulting score was sufficiently high, and therefore we defined the notion
of a Peptide-Spectrum Match as follows. Given a parameter threshold,
a peptide Peptide, and a spectral vector Spectrum, the pair forms a Peptide-Spectrum Match,
abbreviated as PSM, if Peptide is a highest-scoring peptide against
Spectrum among all peptides in Proteome. But that’s not all. In addition, we require
the score between Peptide and Spectrum to be greater than or
equal to the threshold. Having defined the notion of a PSM,
we will define the set of PSMs for Proteome,
SpectralVectors, and a given threshold as the set of all
Peptide-Spectrum Matches resulting from the set SpectralVectors for
the given Proteome and threshold. The PSM Search Problem is now to
identify all Peptide-Spectrum Matches scoring at or above a threshold for
a set of spectra and a proteome. The input is a set SpectralVectors,
an amino acid string Proteome, and a score threshold. The output is the set of
Peptide-Spectrum Matches defined for Proteome, SpectralVectors,
and the parameter threshold.
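
The two problems above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in the lecture or by Asara. The passage does not say how a peptide is scored against a spectral vector, so the sketch assumes the score is the sum of the spectral vector entries at the peptide's prefix masses, and that only substrings whose total mass equals the length of the spectral vector are candidates. The function names identify_peptide and psm_search, the INTEGER_MASS table, and the mass parameter are hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Assumption: integer amino acid masses (standard residue masses, rounded down).
INTEGER_MASS: Dict[str, int] = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "I": 113, "L": 113, "N": 114, "D": 115, "K": 128, "Q": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
}


def identify_peptide(
    spectral_vector: Sequence[int],
    proteome: str,
    mass: Dict[str, int] = INTEGER_MASS,
) -> Tuple[Optional[str], Optional[int]]:
    """Return (peptide, score) for a best-scoring substring of proteome.

    Assumed scoring: sum of spectral_vector[m - 1] over every prefix mass m
    of the peptide. Returns (None, None) if no substring has total mass
    equal to len(spectral_vector). Raises KeyError on unknown letters.
    """
    target = len(spectral_vector)
    best_peptide: Optional[str] = None
    best_score: Optional[int] = None
    for start in range(len(proteome)):
        prefix_mass = 0
        score = 0
        for end in range(start, len(proteome)):
            prefix_mass += mass[proteome[end]]
            if prefix_mass > target:
                break  # this substring is already too heavy
            score += spectral_vector[prefix_mass - 1]
            if prefix_mass == target:
                # Candidate peptide; keep the first one on ties.
                if best_score is None or score > best_score:
                    best_peptide = proteome[start:end + 1]
                    best_score = score
                break
    return best_peptide, best_score


def psm_search(
    spectral_vectors: Sequence[Sequence[int]],
    proteome: str,
    threshold: int,
    mass: Dict[str, int] = INTEGER_MASS,
) -> List[Tuple[str, int]]:
    """Return (peptide, spectrum index) pairs scoring at or above threshold."""
    psms: List[Tuple[str, int]] = []
    for index, vector in enumerate(spectral_vectors):
        peptide, score = identify_peptide(vector, proteome, mass)
        if peptide is not None and score is not None and score >= threshold:
            psms.append((peptide, index))
    return psms
```

With a toy two-letter alphabet (masses X = 4 and Z = 5, also an assumption made only to keep the numbers small), the call identify_peptide([0, 0, 0, 2, 7, 0, 0, 0, 3], "XZZXX", {"X": 4, "Z": 5}) compares the candidates XZ (score 2 + 3 = 5) and ZX (score 7 + 3 = 10) and returns ZX. Note that this search only looks inside Proteome, which is exactly why it says nothing about the billions of peptides outside the database that may score higher.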
