BioComputing

Secondary Structure Element Alignment (SSEA)


Scoring and Benchmarking

Scoring System
Each secondary structure element is represented by a letter for its state (H, E, or C) and a number giving the length of the region. Matches (H -> H, E -> E, C -> C) score the length of the shorter element. Mismatches (H -> E, E -> H) do not contribute to the score. Structure-to-coil matches ({H, E} -> C) score half the length of the shorter element. Unlike the original method (Przytycka et al., Nat Struct Biol 6:672-682; 1999), we do not split secondary structure elements to obtain better matches.
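The representation and the per-element scoring rules can be sketched in a few lines of Python. This is an illustrative sketch only, not the SSEA source code; the function names `ssea_rep` and `element_score` are hypothetical.

```python
from itertools import groupby


def ssea_rep(ss):
    """Collapse a secondary structure string into (state, length) elements."""
    return [(state, len(list(run))) for state, run in groupby(ss)]


def element_score(a, b):
    """Score two aligned (state, length) elements with the rules above."""
    (state_a, len_a), (state_b, len_b) = a, b
    shorter = min(len_a, len_b)
    if state_a == state_b:
        return float(shorter)   # H -> H, E -> E, C -> C
    if "C" in (state_a, state_b):
        return shorter / 2      # {H, E} -> C: half the shorter length
    return 0.0                  # H -> E, E -> H: no contribution
```

For instance, `element_score(("H", 8), ("H", 6))` gives 6.0, while `element_score(("E", 3), ("C", 5))` gives 1.5.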

The similarity score is normalized to the range 0-100 by dividing it by the mean length of the two sequences (half the sum of their lengths) and multiplying by 100. In database alignment mode, a Z-score is calculated over all predictions to estimate the statistical significance.
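A minimal sketch of both steps follows. The text does not state how the Z-score is computed, so the conventional definition (score minus mean, divided by the standard deviation of all scores in the database run) is assumed here, as is the use of the population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev


def normalized_score(raw, len_a, len_b):
    """Scale a raw SSEA score to 0-100 using the mean sequence length."""
    return 100.0 * raw / ((len_a + len_b) / 2)


def z_scores(scores):
    """Z-score of each normalized score over one database run.

    Assumption: population standard deviation over all predictions.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]  # all scores identical
    return [(s - mu) / sigma for s in scores]
```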

Example
Seq A =    CCCCCHHHHHHHHCCCCHHHHHHHHHHCCCCCCC    
ssea_rep = C5,  H8,     C4, H10,      C7

Seq B =    CCEEECCCHHHHHHCCCCHHHHHHHHCCCEEECCCC  
ssea_rep = C2,E3,C3,H6,  C4, H8,     C3,E3,C4

SSEA global alignment:
         -----CCCCCHHHHHHHHCCCCHHHHHHHHHH------CCCCCCC
         CCEEECCC--HHHHHH--CCCCHHHHHHHH--CCCEEECCCC---
score =         3  +   6  +  4  +   8      +    4      =  25

normalized score = 25 / ((34+36)/2) * 100 = 71.4286
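
The example can be reproduced with a small dynamic programming sketch over the element lists. This is not the SSEA implementation: it assumes a zero gap penalty, which is consistent with the example (unpaired elements add nothing to the score) but is not stated in the text. The function names are hypothetical.

```python
def pair_score(a, b):
    """Score two (state, length) elements: match, coil half-match, or mismatch."""
    shorter = min(a[1], b[1])
    if a[0] == b[0]:
        return float(shorter)
    if "C" in (a[0], b[0]):
        return shorter / 2
    return 0.0


def global_alignment_score(elems_a, elems_b, gap=0.0):
    """Best global alignment score over elements (gap=0.0 is an assumption)."""
    n, m = len(elems_a), len(elems_b)
    # best[i][j]: best score aligning the first i elements of A with the first j of B
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + gap
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + pair_score(elems_a[i - 1], elems_b[j - 1]),
                best[i - 1][j] + gap,   # element of A left unpaired
                best[i][j - 1] + gap,   # element of B left unpaired
            )
    return best[n][m]


seq_a = [("C", 5), ("H", 8), ("C", 4), ("H", 10), ("C", 7)]
seq_b = [("C", 2), ("E", 3), ("C", 3), ("H", 6), ("C", 4),
         ("H", 8), ("C", 3), ("E", 3), ("C", 4)]
raw = global_alignment_score(seq_a, seq_b)
normalized = 100.0 * raw / ((34 + 36) / 2)
print(raw, round(normalized, 4))  # 25.0 71.4286
```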

Benchmarking
A subset of the McGuffin and Jones set of 252 "known" domains (Proteins 48:44-52; 2002) was used to benchmark SSEA. In order to avoid conflicting domain definitions, only those domains with an identical definition in both the test set and SCOP 1.65 were selected, yielding a subset of 98 query domains. The Z-score distribution and % correct hits were calculated over all predictions made for each of the query sequences with DSSP secondary structure and global alignment against the SCOP-95 fold library. Self-hits were excluded. Correct hits were defined as those sharing the same SCOP fold class between query and subject. This ensures a Z-score distribution which is similar to a "typical" user request, where the query may be similar to a known protein.

An alternate estimate for the % correct hits can be produced by employing the full McGuffin set of 252 "known" proteins. This is a difficult test case, because not all structures have a good representative in the SCOP-95 fold library. Hence, a Z-score cutoff that yields 100% correct hits should not be expected. This estimate is also more representative of the case in which the user wants to estimate the probability of encountering a novel fold.

Statistical Significance
The results for the 98 domain benchmarking set are summarized in this table:

Z-score cutoff    % correct hits
1.5                 6.2
2.0                 8.8
2.5                16.3
3.0                26.9
3.5                36.3
4.0                47.1
4.2                51.4
4.4                68.4
4.6               100.0
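
One way such a table could be derived from a list of benchmark hits is sketched below. It assumes that the % correct hits at a given cutoff is the share of hits with a Z-score at or above that cutoff whose fold matches the query; the text does not spell out this definition, and the function name and input format are hypothetical.

```python
def percent_correct(hits, cutoff):
    """Percentage of correct hits among those with Z-score >= cutoff.

    hits: iterable of (z_score, is_correct) pairs.
    Returns None when no hit reaches the cutoff.
    """
    selected = [ok for z, ok in hits if z >= cutoff]
    if not selected:
        return None
    return 100.0 * sum(selected) / len(selected)
```

Returning `None` when no hit reaches the cutoff is a design choice that keeps an empty bin distinct from a bin with 0% correct hits.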

An alternate estimate on the difficult McGuffin-252 fold recognition set follows:

Z-score cutoff    % correct hits
1.5                24.7
2.0                29.4
2.5                28.6
3.0                66.7
>= 3.1             N/A



Silvio Tosatto   08 / 2004