BioComputing

 TESE     Test Set Generator


Quick Help and References

Description
TESE is a web server that can be used to derive curated sets of protein structures to be used in a variety of situations. The most typical is to construct representative sets of protein sequences and/or structures for the benchmarking of novel methods.

The benchmarking process usually requires a set of structurally non-redundant proteins, ideally covering the whole known protein universe, from which to infer the accuracy and generality of the new method, e.g. in fold recognition or when deriving statistical potentials. Alternatively, the user may be interested in a specific class of proteins, e.g. solenoid repeats or the Rossmann fold, for which to derive a specific test set. Last but not least, given the exponential growth of available structural data, it can be desirable to update an existing test set previously used in the literature to improve the statistical significance of the results. TESE was constructed to address all of these needs.

The basic idea of TESE is to use the CATH structural classification to control the level of structural/sequence similarity contained in a set of protein structures. CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels of increasing structural similarity: Class (C), Architecture (A), Topology (T) and Homologous superfamily (H). Subsequent levels define sequence similarity thresholds at 35% (S), 60% (O), 95% (L) and 100% (I) pairwise sequence identity. By restricting the proteins to be selected to a given CATH level, it is therefore possible to limit redundancy in a way that goes beyond purely sequence-based approaches.
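The idea of limiting redundancy by CATH level can be illustrated with a minimal Python sketch. It is not the server's implementation: the dotted eight-field code layout, the function names and the sample domain identifiers are assumptions made for illustration only.

```python
# Order of the CATH levels described above: C, A, T, H, then S, O, L, I.
LEVELS = "CATHSOLI"


def cath_prefix(code: str, level: str) -> str:
    """Truncate a dotted CATH code at the given level (hypothetical layout)."""
    depth = LEVELS.index(level) + 1
    fields = code.split(".")
    if len(fields) < depth:
        raise ValueError(f"code {code!r} has no {level} level")
    return ".".join(fields[:depth])


def one_per_node(domains: dict[str, str], level: str) -> dict[str, str]:
    """Keep one representative domain per CATH node at the chosen level.

    The first domain in sorted order is kept; this choice is an assumption,
    a real selection could be random or quality-driven.
    """
    chosen: dict[str, str] = {}
    for domain_id, code in sorted(domains.items()):
        chosen.setdefault(cath_prefix(code, level), domain_id)
    return chosen


# Made-up identifiers and codes, for illustration only.
domains = {
    "dom_a": "3.40.50.720.1.1.1.1",
    "dom_b": "3.40.50.720.1.2.1.1",
    "dom_c": "3.40.50.300.1.1.1.1",
    "dom_d": "1.10.8.10.1.1.1.1",
}

h_level = one_per_node(domains, "H")
t_level = one_per_node(domains, "T")
```

Filtering at a higher level (here "T") leaves fewer representatives than filtering at a lower one ("H"), which is how the choice of CATH level controls the residual similarity in the set.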

TESE Overview

The server has been designed to make the selection process as easy as possible. It currently offers three different search modes to initiate the data collection process, described in the following sections: query, PDB ensemble and key word.
For a more detailed technical description of the methods used, see the Methods page.

Search Mode
Query
The Query mode allows the user to select structural and quality filters to generate a test set at any given level of residual sequence and structural similarity. To this end, it uses the CATH structural classification scheme. A detailed description of the CATH classification scheme can be found here. Briefly put, the CATH level allows the user to control the residual level of sequence and/or structural similarity. For instance, the "H" level corresponds to homologous superfamilies related on the structural level but beyond recognition for pairwise sequence-based methods. Subsequent CATH levels represent gradually less stringent redundancy filters at 35% ("S"), 60% ("O"), 95% ("L") and 100% sequence identity ("I"). Additionally, it implements several quality checks for the structures to be included, e.g. X-ray resolution and R-free cutoffs.
For the Output see below.

The input form consists of two halves. The left half, termed "Query", contains the actual search statements in a script language. Since writing these by hand would be unnecessarily complex for the average user, the right half of the input form contains the "Wizard", which inserts the statements into the query. The lower left part contains the visualization options. It is possible to choose between interactive and non interactive visualization. The interactive option is useful for test sets with fewer than six hundred protein domains and allows the user to directly choose which protein domains to include in or exclude from the test set. It is also possible to choose whether to include images for each protein domain (slower, but differences are easier to identify) or not (faster, but differences are harder to identify). The non interactive option is designed for large scale test sets where the user is interested in selecting thousands of structures without checking each one by hand.

The "Wizard" on the right half contains four consecutive blocks of selection statements. These are:

  1. Select node. Contains statements related to specific CATH nodes to include and/or exclude. It is possible to select directly the CATH classes and/or architectures to include or exclude from the test set.
  2. Redundancy reduction. Contains statements related to the CATH level to be used for filtering out redundant structures. A detailed description of the CATH classification scheme can be found here. For each level, it is possible to select the number of representatives to use and whether to choose them randomly.
  3. Quality. Contains statements related to experimental and computational quality parameters. The structures can be filtered for minimum and maximum cutoffs on several parameters. X-ray resolution and R-free are experimental quality measures. TAP score and WHAT_CHECK quality are computational parameters that estimate the "quality" of a structure from a theoretical point of view.
  4. Other parameters. Contains statements allowing the exclusion of X-ray, NMR and/or segmented proteins. It also allows the selection of proteins deposited in the PDB in a given year range, e.g. between 2006 and 2007.
Context dependent help texts are available by moving the mouse cursor over the numerical field next to a selection statement. Each selected statement is inserted into the query form by pressing the relevant "insert" button. Once satisfied with the query, press the "start search" button.
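The effect of the "Quality" and "Other parameters" blocks can be sketched as a simple filter. This is a hedged illustration rather than the server's query language: the `Domain` record, its field names and the default cutoffs (taken from the examples on this page) are assumptions, as is the decision to reject X-ray entries with missing values.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Domain:
    """Hypothetical record holding the parameters mentioned in the Wizard."""
    domain_id: str
    method: str                  # "X-ray" or "NMR"
    resolution: Optional[float]  # in Angstrom, None where not available
    r_free: Optional[float]
    segmented: bool
    year: int                    # PDB deposition year


def passes_filters(
    d: Domain,
    max_resolution: float = 2.5,
    max_r_free: float = 0.3,
    exclude_nmr: bool = True,
    exclude_segmented: bool = True,
    years: Optional[Tuple[int, int]] = None,
) -> bool:
    """Return True if the domain survives all selected filters."""
    if exclude_nmr and d.method == "NMR":
        return False
    if exclude_segmented and d.segmented:
        return False
    if years is not None and not (years[0] <= d.year <= years[1]):
        return False
    if d.method == "X-ray":
        # Assumption: X-ray entries without a reported value are rejected.
        if d.resolution is None or d.resolution > max_resolution:
            return False
        if d.r_free is None or d.r_free > max_r_free:
            return False
    return True


# Made-up entries, for illustration only.
candidates = [
    Domain("dom_a", "X-ray", 1.8, 0.22, False, 2006),
    Domain("dom_b", "X-ray", 3.1, 0.28, False, 2007),
    Domain("dom_c", "NMR", None, None, False, 2006),
    Domain("dom_d", "X-ray", 2.0, 0.25, True, 2007),
    Domain("dom_e", "X-ray", 2.0, 0.25, False, 2003),
]

selected = [d.domain_id for d in candidates if passes_filters(d)]
recent = [d.domain_id for d in candidates if passes_filters(d, years=(2006, 2007))]
```

Each keyword argument corresponds loosely to one selection statement in the Wizard; leaving `years` unset means no restriction on the deposition date.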

PDB Ensemble
The PDB Ensemble mode can be used to seed the structural search from a limited number of PDB codes of proteins sharing the desired structural and/or sequence features. The server will then present a list of CATH codes from which to choose the desired set in analogy to the Query mode, see Output below. This can be useful if the intention is to extend a previously published test set.

In the input form, valid PDB codes are to be inserted one per line. Once satisfied with the query, press the "start search" button.

Key word
The Key word mode initiates a structural search from key words contained in the header and compound records of all PDB structures. A list of matching proteins is then presented and can be manipulated in analogy to the Query mode, see Output below. This can be useful if the user has no specific idea about the PDB codes of relevant proteins or their structural classification.

In the input form, key words on the same line are connected with boolean "AND", whereas key words on different lines are connected with boolean "OR". An optional CATH filter level serves to limit output redundancy. Once satisfied with the query, press the "start search" button.
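The AND/OR semantics of the key word form can be sketched as follows. The matching rules used here (whitespace-separated key words, case-insensitive substring matching) are assumptions for illustration; the server's actual matching may differ.

```python
def matches(query: str, record_text: str) -> bool:
    """Check a PDB header/compound text against a multi-line key word query.

    Key words on one line must all be present (AND);
    a match on any single line is sufficient (OR).
    """
    haystack = record_text.lower()
    groups = [line.split() for line in query.splitlines() if line.strip()]
    return any(
        all(word.lower() in haystack for word in group)
        for group in groups
    )


# Two lines: (ankyrin AND repeat) OR (leucine AND rich)
query = "ankyrin repeat\nleucine rich"

hit_first_line = matches(query, "ANKYRIN REPEAT PROTEIN")
hit_second_line = matches(query, "LEUCINE-RICH REPEAT DOMAIN")
no_hit = matches(query, "LEUCINE ZIPPER")
```

An empty query matches nothing in this sketch, since there is no line whose key words could be satisfied.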

Output
Interactive mode
The TESE output for the interactive mode consists of two parts: the protein structure listing (top part) and actions menu (bottom).

The protein structure listing is a clickable table containing the protein structures classified by CATH matching the search criteria. It lists the first 30 structures (200 without images of protein domains) in a central table. Where more than 30 (resp. 200) structures match the search criteria, multiple pages are defined and can be selected by a numbered button menu above and below the central table. If visualization with images was selected, each structure is shown with a colored diagram and the PDB code in the center column.
The right part of each row contains structural information: structure length, X-ray resolution (where available) and experimental method (i.e. X-ray or NMR). The last three columns on the right link to the FASTA formatted protein sequence, the PDB structure and a more detailed summary page for each protein.
The left part of each row is a clickable CATH classification scheme. Depending on the previously chosen filter level, the user has the option to select and/or deselect parts of the classification tree. This can be useful to eliminate certain structures in an interactive fashion. Positioning the mouse cursor above the different classification levels displays the context sensitive classification information. The selected nodes can then be excluded by clicking on the "refresh" button in the lower part of the page (see actions menu below).

The actions menu comprises the lower part of the output page. It is divided into three parts, allowing the user to perform different actions in order to manipulate and save the results.
The top part of the menu contains the "refresh" button, which is used to generate a new test set from which the nodes manually deselected in the protein structure listing are excluded. Activating it generates a new output page.
The right part of the actions menu allows the user to formulate a new (and even totally different) test set query by using the same wizard procedure used in the initial Query mode. It is executed by pressing the "new list" button.
The left part of the actions menu is used to save the generated test set. Its options determine whether to include the individual FASTA sequences, a multiple FASTA sequence file, the PDB structures and even the images of the protein structures used by the server. In all cases, an HTML formatted index file is generated. Since the output can be quite large, it is compressed by the server in either "TAR.GZ" or "ZIP" format. Download of the test set is initiated by pressing the "save" button.
Of particular interest may be the option to automatically split the test set into different subsets, e.g. for training and testing or ten-fold cross-validation. The user may specify up to 10 subsets and optionally a granularity level. The granularity dictates the CATH level at which proteins are grouped into the same directory. This can be useful, for instance, if the user wishes to have one representative per "O" level (<60% seq.id.) yet prefers to group all similar structures ("H" level) into the same directory.
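One way to picture the subset split with a granularity level is the sketch below. It is only an assumed strategy, not the server's documented algorithm: domains are grouped by their CATH code truncated at the granularity level, and whole groups are assigned greedily to the currently smallest subset, so that similar structures never end up in different subsets. The code layout, function name and sample data are hypothetical.

```python
from collections import defaultdict

# Order of the CATH levels: C, A, T, H, then S, O, L, I.
LEVELS = "CATHSOLI"


def split_by_granularity(
    domains: dict[str, str], n_subsets: int, granularity: str = "H"
) -> list[list[str]]:
    """Distribute domains over subsets, keeping each CATH group together."""
    if not 1 <= n_subsets <= 10:
        raise ValueError("the number of subsets must be between 1 and 10")
    depth = LEVELS.index(granularity) + 1

    # Group domain identifiers by their code truncated at the granularity level.
    groups: dict[str, list[str]] = defaultdict(list)
    for domain_id, code in sorted(domains.items()):
        groups[".".join(code.split(".")[:depth])].append(domain_id)

    # Largest groups first, each placed into the currently smallest subset.
    subsets: list[list[str]] = [[] for _ in range(n_subsets)]
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for _, members in ordered:
        min(subsets, key=len).extend(members)
    return subsets


# Made-up identifiers and codes, for illustration only.
domains = {
    "dom_a": "3.40.50.720.1.1.1.1",
    "dom_b": "3.40.50.720.1.2.1.1",
    "dom_c": "3.40.50.300.1.1.1.1",
    "dom_d": "1.10.8.10.1.1.1.1",
}

subsets = split_by_granularity(domains, n_subsets=2, granularity="H")
```

Here "dom_a" and "dom_b" differ at the "O" level but share the same "H" node, so they are kept in the same subset. Keeping related structures together in this way avoids near-duplicates being spread across training and testing subsets.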

Non interactive mode
The TESE output for the non interactive mode is specifically designed for large scale test sets of several hundred or thousands of protein domains. It presents a simplified interface in which the user may download the "TAR.GZ" compressed set. This set contains a file named "tese.list" listing the protein domains, plus all relevant PDB structures and FASTA sequence files.

Examples
Below are links to example outputs showing the TESE results for some typical cases. Additional examples can be found on the precompiled sets page.
(NB: The action links are disabled on the example pages.)

Example 1   -    All CATH. The entire CATH database is filtered at the "T" (topology) level, excluding NMR and segmented structures and limiting the X-ray structures to reasonable resolution and R-free values (i.e. <= 2.5 Å and <= 0.3 respectively). Results are shown interactively without images.

Example 2   -    Solenoid proteins. Known solenoid architectures are selected from the CATH database filtered at the "H" (homologous superfamily) level, excluding NMR and segmented structures and limiting the X-ray structures to reasonable resolution and R-free values (i.e. <= 2.5 Å and <= 0.3 respectively). Results are shown interactively with images.


References

If you use the server in work leading to publications, please cite:
  • TESE:
    Francesco Sirocco and Silvio C.E. Tosatto.
    TESE: Generating specific protein structure test set ensembles.
    Bioinformatics, 24(22):2632-2633.   (2008)

  • CATH:
    Lesley H. Greene, Tony E. Lewis, Sarah Addou, Alison Cuff, Tim Dallman, Mark Dibley, Oliver Redfern, Frances Pearl, Rekha Nambudiry, Adam Reid, Ian Sillitoe, Corin Yeats, Janet M. Thornton and Christine A. Orengo.
    The CATH domain structure database: new protocols and classification levels give a more comprehensive resource for exploring evolution.
    Nucleic Acids Research 35:D291-D297.   (2007)


Francesco Sirocco   11 / 2008