|
|
Quick Help and References |
| Description |
|
TESE is a web server for deriving curated sets of protein structures for use in a variety of situations. The most typical use is the construction of representative sets of protein sequences and/or structures for benchmarking novel methods.
The benchmarking process usually requires a set of structurally non-redundant proteins, ideally covering the whole known protein universe, from which to infer the accuracy and generality of the new method, e.g. in fold recognition or when deriving statistical potentials. Alternatively, the user may be interested in a specific class of proteins, e.g. solenoid repeats or the Rossmann fold, for which to derive a specific test set. Last but not least, given the exponential growth of available structural data, it can be desirable to update an existing test set previously used in the literature to improve the statistical significance of the results. TESE was constructed to address all of these needs.
The basic idea of TESE is to use the CATH structural classification to control the level of structural and sequence similarity contained in a set of protein structures. CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H), in order of increasing structural similarity. Subsequent levels define sequence similarity thresholds at 35% (S), 60% (O), 95% (L) and 100% (I) pairwise sequence identity. By restricting the selection to one protein per cluster at a given CATH level, it is possible to limit redundancy in a way that goes beyond purely sequence-based approaches.
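The idea of restricting a set to a given CATH level can be illustrated with a minimal Python sketch. It assumes that each domain carries a dot-separated CATH code with one number per level in the order C, A, T, H, S, O, L, I, and it simply keeps the first domain seen in each cluster. The function names and the "first seen" rule are illustrative assumptions, not a description of how TESE itself picks representatives.

```python
# Order of the CATH levels described above, from coarsest to finest.
LEVELS = "CATHSOLI"


def cluster_key(cath_code, level):
    """Truncate a dot-separated CATH code at the given level (e.g. "T")."""
    depth = LEVELS.index(level) + 1
    parts = cath_code.split(".")
    if len(parts) < depth:
        raise ValueError(f"{cath_code!r} has no {level!r} level")
    return ".".join(parts[:depth])


def representatives(domains, level):
    """Map each cluster at `level` to the first domain seen in it.

    `domains` is an iterable of (domain_id, cath_code) pairs.
    Keeping the first domain is an arbitrary choice made for this sketch.
    """
    chosen = {}
    for domain_id, cath_code in domains:
        chosen.setdefault(cluster_key(cath_code, level), domain_id)
    return chosen


# Hypothetical domain identifiers and codes, for illustration only.
example = [
    ("domA", "3.40.50.720.1.1.1.1"),
    ("domB", "3.40.50.720.2.1.1.1"),
    ("domC", "3.40.50.300.1.1.1.1"),
    ("domD", "1.10.8.10.1.1.1.1"),
]
```

With this input, filtering at the "T" level leaves two clusters, the "H" level leaves three, and the "S" level keeps all four domains, which shows how deeper levels admit more residual similarity.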
The server has been designed to make the selection process as easy as possible. It currently offers three different search modes to initiate the data collection process, described in the following sections: query, PDB ensemble and key word. |
| Search Mode |
| Query |
|
The Query mode allows the user to select structural and quality filters to generate a test set at any given level of residual sequence and structural similarity. To this end, it uses the CATH structural classification scheme. A detailed description of the CATH classification scheme can be found here. Briefly put, the CATH level allows the user to control the residual level of sequence and/or structural similarity. For instance, the "H" level corresponds to homologous superfamilies that are related at the structural level but whose similarity is beyond recognition for pairwise sequence-based methods. Subsequent CATH levels represent gradually less stringent redundancy filters at 35% ("S"), 60% ("O"), 95% ("L") and 100% ("I") sequence identity. Additionally, the Query mode implements several quality checks for the structures to be included, e.g. X-ray resolution and R-free cutoffs. For the Output see below.
The input form consists of two parts. The left half, termed "Query", contains the actual search statements in a script language. Since this would be unnecessarily complex for the average user, the right half of the input form contains the "Wizard".
The lower left part contains the visualization options. It is possible to choose between interactive and non-interactive visualization. The interactive option is useful for test sets with fewer than six hundred protein domains and allows the user to directly choose which protein domains to include in or exclude from the test set. It is also possible to choose whether to include images for each protein domain (slower, but differences are easier to identify) or not (faster, but differences are harder to identify). The non-interactive option is designed for large-scale test sets where the user is interested in selecting thousands of structures without checking each one by hand.
The "Wizard" on the right half contains four consecutive blocks of selection statements. These are:
|
| PDB Ensemble |
|
The PDB Ensemble mode can be used to seed the structural search from a limited number of PDB codes of proteins sharing the desired structural and/or sequence features. The server will then present a list of CATH codes from which to choose the desired set, as in the Query mode (see Output below). This can be useful if the intention is to extend a previously published test set.
In the input form, valid PDB codes are to be inserted one per line. Once satisfied with the query, press the "start search" button.
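When preparing a long list of seed structures, it can help to check the input before pasting it into the form. The following Python sketch assumes the classic four-character PDB code format (a digit followed by three letters or digits) and treats codes as case-insensitive. Both points, and the function name, are assumptions of this sketch rather than documented behaviour of the server.

```python
import re

# Classic PDB code: one digit followed by three alphanumeric characters.
PDB_CODE = re.compile(r"^[0-9][A-Za-z0-9]{3}$")


def parse_pdb_codes(text):
    """Split one-code-per-line input into (valid, rejected) lists.

    Blank lines are skipped, valid codes are lower-cased and duplicates
    are dropped while preserving the original order.
    """
    valid, rejected = [], []
    for line in text.splitlines():
        code = line.strip()
        if not code:
            continue
        if PDB_CODE.match(code):
            code = code.lower()
            if code not in valid:
                valid.append(code)
        else:
            rejected.append(code)
    return valid, rejected
```

Any entries returned in the rejected list can then be corrected by hand before the search is started.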
|
| Key word |
|
The Key word mode initiates a structural search from key words contained in the header and compound records of all PDB structures. A list of matching proteins is then presented and can be manipulated in analogy to the Query mode, see Output below. This can be useful if the user has no specific idea about the PDB codes of relevant proteins or their structural classification.
In the input form, key words on the same line are connected with Boolean "AND", whereas key words on different lines are connected with Boolean "OR". An optional CATH filter level serves to limit output redundancy. Once satisfied with the query, press the "start search" button.
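The way the lines of a key word query combine can be sketched in a few lines of Python. The sketch assumes that key words on a line are separated by whitespace and that matching is a case-insensitive substring test against the record text. These details and the function names are assumptions made for illustration, and the server's own matching rules may differ.

```python
def parse_query(text):
    """Turn the input form text into a list of key word groups, one per line."""
    groups = []
    for line in text.splitlines():
        words = [word.lower() for word in line.split()]
        if words:
            groups.append(words)
    return groups


def matches(record_text, groups):
    """True if every key word of at least one line occurs in the record.

    Within a line the key words are combined with AND (all),
    and the lines themselves are combined with OR (any).
    """
    haystack = record_text.lower()
    return any(all(word in haystack for word in group) for group in groups)


# Hypothetical two-line query: (ankyrin AND repeat) OR (leucine AND rich).
query = parse_query("ankyrin repeat\nleucine rich")
```

With this query, a record containing "ANKYRIN REPEAT PROTEIN" matches through the first line, whereas a record containing only "LEUCINE ZIPPER" matches neither line.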
|
| Output |
| Interactive mode |
|
The TESE output for the interactive mode consists of two parts: the protein structure listing (top part) and actions menu (bottom).
The protein structure listing is a clickable table of the CATH-classified protein structures that match the search criteria. It lists the first 30 structures (200 if images of protein domains are not shown) in a central table. Where more than 30 (respectively 200) structures match the search criteria, the results are split across multiple pages, which can be selected from a numbered button menu above and below the central table. If visualization with images was selected, each structure is shown with a colored diagram and the PDB code in the center column.
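The paging described above amounts to simple arithmetic, which can be sketched in Python as follows. The page sizes are taken from the text, while the function names and the 1-based page numbering are assumptions of this sketch.

```python
import math

PAGE_SIZE_WITH_IMAGES = 30
PAGE_SIZE_WITHOUT_IMAGES = 200


def page_size(with_images):
    """Number of structures listed per page for the chosen visualization."""
    return PAGE_SIZE_WITH_IMAGES if with_images else PAGE_SIZE_WITHOUT_IMAGES


def page_count(n_structures, with_images):
    """Number of pages needed to list all matching structures."""
    return math.ceil(n_structures / page_size(with_images))


def page_slice(items, page, with_images):
    """Return the items shown on a 1-based page number."""
    size = page_size(with_images)
    start = (page - 1) * size
    return items[start:start + size]
```

For example, 450 matching structures would occupy 15 pages with images and 3 pages without.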
The actions menu comprises the lower part of the output page. It is divided into three parts, allowing the user to perform different actions in order to manipulate and save the results. |
| Non-interactive mode |
|
The TESE output for the non-interactive mode is specifically designed for large-scale test sets of several hundred or thousands of protein domains. It presents a simplified interface from which the user may download the set as a compressed "TAR.GZ" archive. The archive contains a file named "tese.list" listing the protein domains, plus all relevant PDB structures and FASTA sequence files.
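Once downloaded, the archive can be inspected with standard tools. The Python sketch below reads "tese.list" directly from the compressed set using the standard `tarfile` module. The internal layout of the archive and the format of "tese.list" are not specified here, so the sketch assumes one entry per line, returns the lines unparsed, and looks for the file in any subdirectory. The function name is hypothetical.

```python
import tarfile


def read_domain_list(archive_path, list_name="tese.list"):
    """Return the non-empty lines of `list_name` found inside a .tar.gz set.

    The file is located by its base name, so it may sit in any subdirectory.
    Lines are returned as-is, since their exact format is not assumed.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if member.isfile() and member.name.split("/")[-1] == list_name:
                text = archive.extractfile(member).read().decode("utf-8", errors="replace")
                return [line.strip() for line in text.splitlines() if line.strip()]
    raise FileNotFoundError(f"{list_name} not found in {archive_path}")
```

The returned list can then be used to check the number of domains in the set before unpacking the structure and sequence files.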
|
| Examples |
|
Below are links to example TESE output for some typical cases. Additional examples can be found on the precompiled sets page. (NB: The action links are disabled on the example pages.)
Example 1 - All CATH. The entire CATH database is filtered at the "T" (topology) level, excluding NMR and segmented structures and limiting the X-ray structures to reasonable resolution and R-free values (i.e. <= 2.5 Å and <= 0.3 respectively). Results are shown interactively without images.
Example 2 - Solenoid proteins. Known solenoid architectures are selected from the CATH database filtered at the "H" (homologous superfamily) level, excluding NMR and segmented structures and limiting the X-ray structures to reasonable resolution and R-free values (i.e. <= 2.5 Å and <= 0.3 respectively). Results are shown interactively with images.
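The quality filters used in both examples can be written down as a small Python sketch. The record fields, the method labels and the function name are hypothetical. The sketch also assumes that X-ray structures with a missing resolution or R-free value are rejected and that structures from other experimental methods pass unchanged. The server's own handling of these cases is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Structure:
    pdb_code: str
    method: str                  # e.g. "X-RAY" or "NMR" (labels assumed)
    resolution: Optional[float]  # in Angstrom, None if not reported
    r_free: Optional[float]      # None if not reported
    segmented: bool              # True for segmented domains


def passes_quality(s, max_resolution=2.5, max_r_free=0.3):
    """Apply the filters of the examples: no NMR, no segmented structures,
    and X-ray structures within the resolution and R-free cutoffs."""
    if s.method == "NMR" or s.segmented:
        return False
    if s.method == "X-RAY":
        if s.resolution is None or s.r_free is None:
            return False  # assumption: missing values are rejected
        return s.resolution <= max_resolution and s.r_free <= max_r_free
    return True  # assumption: other methods are not filtered
```

A structure solved by X-ray crystallography at 1.8 Å with an R-free of 0.22 would pass, whereas one at 2.8 Å, one with an R-free of 0.35, or any NMR structure would be excluded.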
|
| References |
|
If you use the server in work leading to publications, please cite:
|