Test Set Generator |
Methods | |
|
This page contains a more detailed description of the methods implemented in TESE. It is meant as a reference for the user interested in understanding the technical details and biases. If you are interested in a more general description of the approach, please refer to the Quick Help page instead. If you have any further questions or suggestions, please contact the author at silvio@cribi.unipd.it.
| |
| Overview | |
|
TESE works in three steps, as shown in the figure below. The user input is first transformed into a MySQL query that searches a local database containing information from CATH and PDBFINDERII. The results are then shown interactively (unless the user specifies otherwise) for visual inspection and validation. When requested, a download is prepared by combining the selected results with data from local copies of PDB structure files and their corresponding FASTA sequences.
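The first step, turning form input into a database query, can be sketched roughly as follows. This is only an illustration: the table name `domains`, the column names (`domain_id`, `cath_class`, `resolution`, `r_free`, `seq_length`) and the form field names are hypothetical, since the actual TESE schema is not described on this page.

```python
def build_query(criteria):
    """Turn a dict of form fields into a parameterized SQL query.

    All table, column and field names are hypothetical placeholders.
    """
    clauses = []
    params = []
    if "cath_class" in criteria:
        clauses.append("cath_class = %s")
        params.append(criteria["cath_class"])
    if "max_resolution" in criteria:
        clauses.append("resolution <= %s")
        params.append(criteria["max_resolution"])
    if "min_length" in criteria:
        clauses.append("seq_length >= %s")
        params.append(criteria["min_length"])
    # With no criteria, fall back to a condition that is always true.
    where = " AND ".join(clauses) if clauses else "1=1"
    sql = (
        "SELECT domain_id FROM domains "
        f"WHERE {where} ORDER BY resolution, r_free"
    )
    return sql, params


sql, params = build_query({"cath_class": 1, "max_resolution": 2.0})
print(sql)
print(params)
```

The values are passed separately from the SQL text (using the `%s` placeholder style common to Python MySQL drivers) so that user input never ends up inside the query string itself.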
| |
| Databases | |
|
The core functionality of TESE is provided by a MySQL database derived from the CATH structural classification. The classification data is extracted and enriched with quality parameters (e.g. X-ray resolution) taken from PDBFINDERII and the in-house TAP server. The resulting MySQL database can be searched for proteins that simultaneously match all criteria indicated in the input form. The local copy of the PDB is updated weekly and used to derive the structural and sequence data files, according to the CATH domain definitions. In-house scripts parse the PDB files and generate the corresponding FASTA sequence files.
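As an illustration of the last step, the sketch below derives a FASTA record for one domain segment from the ATOM records of a PDB file. It is not the in-house script: the function name `domain_fasta` is hypothetical, the sequence is read from C-alpha atoms (the real scripts may use SEQRES records instead), and only a single contiguous segment of one chain is handled, whereas CATH domains can consist of several segments.

```python
# Standard three-letter to one-letter amino acid codes.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def domain_fasta(pdb_lines, chain, start, end, header):
    """Build a FASTA record from the CA atoms of one chain segment.

    `start` and `end` are inclusive PDB residue numbers.
    """
    residues = []
    seen = set()
    for line in pdb_lines:
        if not line.startswith("ATOM"):
            continue
        # Fixed-column PDB format: atom name, chain ID, residue number.
        if line[12:16].strip() != "CA" or line[21] != chain:
            continue
        resseq = int(line[22:26])
        key = (resseq, line[26])  # residue number plus insertion code
        if key in seen or not (start <= resseq <= end):
            continue  # skip alternate locations and out-of-range residues
        seen.add(key)
        # Unknown or modified residues are written as "X".
        residues.append(THREE_TO_ONE.get(line[17:20], "X"))
    return f">{header}\n{''.join(residues)}\n"
```

A call such as `domain_fasta(open("1abc.pdb"), "A", 1, 120, "1abcA01")` (file name and identifiers are made up) would return the header line followed by the one-letter sequence of residues 1 to 120 of chain A.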
| |
| Selection Process | |
|
Understanding the selection process of TESE is important for recognizing potential biases introduced during test set generation. Throughout the application, two modes of operation are supported: systematic and random. The first yields reproducible results, returning the same test set every time, whereas the second produces a different test set each time it is used. During the initial MySQL database query, if a fixed number of proteins per hierarchy level is chosen in systematic (i.e. non-random) mode, these will be the ones satisfying the criteria and having the best X-ray resolution and R-free values. This is important to bear in mind when submitting multiple queries, as systematic mode will always choose the same protein subset. Using the random selection function removes this bias. The second main application of systematic vs. random selection is splitting the data set into two or more subsets. This situation is illustrated in the figure on the right. Systematic sampling divides the test set following a fixed order, starting from the first protein. Random sampling distributes the data randomly among the subsets, thus producing different ensembles each time it is repeated. When producing multiple subsets it is also possible to choose the granularity of the test set. This setting determines the classification level at which proteins are placed into the same (sub-)directory, which can be useful when several sequences should be grouped into the same cluster.
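The two selection modes and the subset split can be sketched as below. This is a minimal illustration under several assumptions, not the TESE implementation: the record fields (`cath`, `resolution`, `r_free`) and function names are hypothetical, resolution is used as the primary sort key with R-free as tie-breaker, systematic splitting is modeled as a round-robin over the input order, and granularity is modeled by keeping all proteins that share a CATH prefix of the given depth in the same subset.

```python
import random
from collections import defaultdict


def group_by_level(proteins, level):
    """Group records by the first `level` fields of their CATH code."""
    groups = defaultdict(list)
    for p in proteins:
        groups[".".join(p["cath"].split(".")[:level])].append(p)
    return groups


def pick_per_group(proteins, level, n, mode="systematic", seed=None):
    """Pick up to n proteins per CATH group at the given depth."""
    groups = group_by_level(proteins, level)
    rng = random.Random(seed)
    chosen = []
    for key in sorted(groups):
        members = groups[key]
        if mode == "systematic":
            # Lowest (best) resolution first, then lowest R-free:
            # repeated calls always return the same proteins.
            ranked = sorted(
                members, key=lambda p: (p["resolution"], p["r_free"])
            )
            chosen.extend(ranked[:n])
        else:
            chosen.extend(rng.sample(members, min(n, len(members))))
    return chosen


def split_subsets(proteins, k, level, mode="systematic", seed=None):
    """Split into k subsets, keeping each group at `level` together."""
    groups = group_by_level(proteins, level)
    keys = list(groups)  # order of first appearance in the input
    if mode == "random":
        random.Random(seed).shuffle(keys)
    subsets = [[] for _ in range(k)]
    for i, key in enumerate(keys):
        subsets[i % k].extend(groups[key])  # round-robin assignment
    return subsets
```

In this sketch, calling `pick_per_group` twice in systematic mode returns identical lists, while random mode without a fixed `seed` generally does not, which mirrors the bias discussed above.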
|
|
| Limitations | |
|
As with all servers, TESE has a number of intrinsic limitations of which the user should be aware. The most important one derives from relying entirely on the CATH classification scheme. While CATH is a widely used and overall reliable classification, it may at times be inconsistent with other classifications (e.g. SCOP). This should not pose a problem in general, but may hamper certain specific tasks where a continuous classification (e.g. QSCOP) would be preferable. The fixed sequence redundancy levels may also be an issue for certain applications. In this case, the expert user may want to subsequently apply a redundancy reduction tool such as CD-HIT, PISCES or UniqueProt to adjust the redundancy level more finely. | |