Secondary Structure Element Alignment
Scoring and Benchmarking

Scoring System

Each secondary structure element is represented by a letter for its state
(H, E, C) and a number giving the length of the element. Matches
(H -> H, E -> E, C -> C) are scored by the length of the shorter element.
Mismatches (H -> E, E -> H) do not contribute to the score. Structure-to-coil
matches ({H, E} -> C) are scored at half the length of the shorter element. Unlike
the original method (Przytycka et al., Nat Struct Biol 6:672-682; 1999),
we do not split secondary structure elements to obtain better matches.
The similarity score is normalized to the range 0-100 by dividing it by the mean length of the two sequences (half the sum of their lengths) and multiplying by 100. A Z-score is calculated over all predictions in database alignment mode to estimate statistical significance.
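
The representation and the per-element scoring rules described here can be sketched in a few lines of Python. This is an illustrative sketch, not the SSEA source code: the function names `encode` and `pair_score` are hypothetical, and the rules are taken only from the description above.

```python
from itertools import groupby


def encode(ss):
    """Run-length encode a secondary structure string.

    'CCHHHHEEE' -> [('C', 2), ('H', 4), ('E', 3)]
    """
    return [(state, len(list(run))) for state, run in groupby(ss)]


def pair_score(a, b):
    """Score one aligned pair of elements, each given as (state, length)."""
    (state_a, len_a), (state_b, len_b) = a, b
    shorter = min(len_a, len_b)
    if state_a == state_b:
        return float(shorter)      # H->H, E->E, C->C: length of the shorter element
    if "C" in (state_a, state_b):
        return shorter / 2         # {H, E}->C: half the length of the shorter element
    return 0.0                     # H->E or E->H: no contribution


print(encode("CCHHHHEEE"))                # [('C', 2), ('H', 4), ('E', 3)]
print(pair_score(("H", 8), ("H", 6)))     # 6.0
print(pair_score(("E", 3), ("C", 4)))     # 1.5
print(pair_score(("H", 8), ("E", 6)))     # 0.0
```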

Example
Seq A = CCCCCHHHHHHHHCCCCHHHHHHHHHHCCCCCCC
ssea_rep = C5, H8, C4, H10, C7
Seq B = CCEEECCCHHHHHHCCCCHHHHHHHHCCCEEECCCC
ssea_rep = C2, E3, C3, H6, C4, H8, C3, E3, C4
SSEA global alignment:
-----CCCCCHHHHHHHHCCCCHHHHHHHHHH------CCCCCCC
CCEEECCC--HHHHHH--CCCCHHHHHHHH--CCCEEECCCC---
score = 3 + 6 + 4 + 8 + 4 = 25
normalized score = 25 / ((34+36)/2) * 100 = 71.4286
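
The example can be reproduced with a small dynamic-programming sketch over the run-length encoded elements. The page does not state how gaps are treated, so this version assumes that unaligned elements cost nothing and simply maximizes the summed element scores. The function names are hypothetical and this is not the SSEA implementation.

```python
from itertools import groupby


def encode(ss):
    """Run-length encode a secondary structure string into (state, length) pairs."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def pair_score(a, b):
    """Match: shorter length; structure-to-coil: half of it; H/E mismatch: 0."""
    shorter = min(a[1], b[1])
    if a[0] == b[0]:
        return float(shorter)
    return shorter / 2 if "C" in (a[0], b[0]) else 0.0


def global_score(elems_a, elems_b):
    """Best order-preserving pairing of elements (assumption: gaps cost 0)."""
    n, m = len(elems_a), len(elems_b)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j],                     # leave element i of A unaligned
                best[i][j - 1],                     # leave element j of B unaligned
                best[i - 1][j - 1] + pair_score(elems_a[i - 1], elems_b[j - 1]),
            )
    return best[n][m]


def normalize(score, len_a, len_b):
    """Scale the raw score by the mean sequence length, to the range 0-100."""
    return 100.0 * score / ((len_a + len_b) / 2)


seq_a = "CCCCCHHHHHHHHCCCCHHHHHHHHHHCCCCCCC"
seq_b = "CCEEECCCHHHHHHCCCCHHHHHHHHCCCEEECCCC"

raw = global_score(encode(seq_a), encode(seq_b))
print(raw)                                                    # 25.0
print(round(normalize(raw, len(seq_a), len(seq_b)), 4))       # 71.4286
```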

Benchmarking

A subset of the McGuffin and Jones set of 252 "known" domains
(Proteins 48:44-52; 2002) was used to benchmark SSEA. In
order to avoid conflicting domain definitions, only those domains with an
identical definition in both the test set and SCOP 1.65 were selected,
yielding a subset of 98 query domains.
The Z-score distribution and % correct hits were calculated over
all predictions made for each of the query sequences with DSSP secondary
structure and global alignment against the SCOP-95 fold library.
Self-hits were excluded. Correct hits were defined as those sharing the
same SCOP fold class between query and subject. This ensures a Z-score
distribution which is similar to a "typical" user request, where the
query may be similar to a known protein.
An alternate estimate for the % correct hits can be produced by employing the
full McGuffin set of 252 "known" proteins. This is a difficult test case,
because not all structures have a good representative in the SCOP-95 fold
library. Hence a Z-score cutoff that yields 100% accuracy should not be
expected. This set is also more representative of the case in which the user
wants to estimate the probability of encountering a novel fold.
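
The relation between Z-scores and % correct hits can be illustrated with a short sketch. The page does not give the exact formula SSEA uses, so this assumes the standard definition (score minus mean, divided by the population standard deviation over all predictions for one query). The function names and the input layout are hypothetical.

```python
from statistics import mean, pstdev


def z_scores(scores):
    """Z-score of each normalized score relative to all scores for one query."""
    mu = mean(scores)
    sigma = pstdev(scores)          # assumption: population standard deviation
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def percent_correct(hits, cutoff):
    """Percentage of correct hits among those with Z-score >= cutoff.

    `hits` is a list of (z_score, is_correct) pairs, where is_correct means
    the hit shares the query's SCOP fold. Returns None if nothing passes.
    """
    selected = [is_correct for z, is_correct in hits if z >= cutoff]
    if not selected:
        return None
    return 100.0 * sum(selected) / len(selected)
```

Sweeping `cutoff` over a range of values and recording `percent_correct` for each gives the kind of accuracy-versus-cutoff estimate described above.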

Statistical Significance

The results for the 98 domain benchmarking set are summarized in this
table:
An alternate estimate on the difficult McGuffin-252 fold recognition set follows: