PROTEINS: Structure, Function, and Genetics 53:418–423 (2003)

Application of 3D-Jury, GRDB, and Verify3D in Fold Recognition

Marcin von Grotthuss, Jakub Pas, Lucjan Wyrwicz, Krzysztof Ginalski, and Leszek Rychlewski*

BioInfoBank Institute, Poznan, Poland

ABSTRACT

In CASP5, the BioInfo.PL group used the structure prediction Meta Server and the associated newly developed flexible meta-predictor, called 3D-Jury, as the main structure prediction tools. The most important feature of the meta-predictor is a high (86%) correlation between the reported confidence score and the quality of the selected model. The Gene Relational Database (GRDB) was used to confirm the fold recognition results by selecting distant homologues and subsequent structure prediction with the Meta Server. A fragment-splicing procedure was performed as a final processing step with large fragments extracted from selected models using model quality control provided by Verify3D. The comparison of submitted models with the native structure conducted after the CASP meeting showed that the GRDB-supported structure prediction led to a satisfactory template fold selection, whereas the fragment-splicing procedure must be improved in the future. Proteins 2003;53:418–423. © 2003 Wiley-Liss, Inc.

Key words: CASP; protein structure prediction; 3D-Jury; Meta Server; Verify3D; GRDB

INTRODUCTION

Every 2 years the CASP [1] experiment provides a strong incentive for the developers of protein structure prediction strategies to spend many weeks trying to get the most out of their methods to generate dozens of models with the highest possible quality. From the biological point of view, these models are most often useless, partially because other groups produced better models, but mainly because the correct structure will probably be known by the time the CASP meeting takes place. Nevertheless, this experiment is one of the main sources of experience for the improvement of current protein structure prediction strategies.
At no other time around the biennial CASP cycle is there so much detailed attention paid to each model. The results of the predictions are compared with native structures after the meeting, and the lessons learned are translated into added or improved prediction modules.

The main lesson we learned from the CASP4 experiment concerned the potential power of the consensus structure prediction approach. CASP4 was the first meeting in the CASP series where the fully automated version of the CASP experiment, CAFASP 2 [2], was conducted simultaneously. In CAFASP 2, developers of automated prediction methods are asked to couple their servers to the CAFASP meta-server, so that the servers are able to submit predicted models 2 days after a target is released. All models collected in the CAFASP experiment are public and can provide many hints for the human groups, which usually have to deposit their models at CASP several weeks later. One of the attempts to profit from these data was initiated by the CAFASP organizers. A CAFASP consensus group was created, consisting of four humans who analyzed the models deposited in CAFASP mainly in terms of their confidence scores and the abundance of fold families represented in the set of templates. Some biological knowledge was also used in a few cases. The selected models were submitted to CASP, usually without modifications. The group obtained a very good rank of 7 in the Fold Recognition category, which was much better than the rank obtained by any server participating in CAFASP. The lessons learned from this event led to the development of the currently fast-growing field of meta-predictors.
In contrast to the original Meta Server [3] used in CAFASP for data collection, the meta-predictors provide additional selection and processing of the models.

An additional important lesson from the CASP4 experiment stems from the fact that our own group, registered as BioInfo.PL, obtained rank 5 in the Fold Recognition category, higher than the rank of the CAFASP consensus group. The main difference between our prediction strategy and the CAFASP consensus strategy involved the expansion of the search for suitable structural templates using homologues of the target protein. The GRDB system (https://grdb.bioinfo.pl/) was used for this task. The system uses the ORFeus [4] program as the engine to find distant homologues. ORFeus combines the profile–profile alignment technology with the prediction of secondary structure, used very successfully in the threading field to boost the sensitivity of the sequence-structure matching functions. The distant homology detection algorithm starts with PSI-Blast searches to collect multiple alignments of sequences of proteins belonging to the family of the target. The usually large alignment is translated into a profile in the same spirit as in the PSI-Blast program but with many minor procedural differences. The profiles created by PSI-Blast are used to predict local secondary structure preferences of the target protein. The preferences are expressed for each residue as three probabilities of adopting a helical, extended, or coiled conformation. The resulting profile of three values per residue and the sequence profile computed from the multiple alignment represent the data that characterize the target protein and are used to search for distant homologues. The GRDB system contains the characteristic profiles computed for many protein families collected from both sections of Pfam [5] and from COGS [6] but also from representatives of the PDB [7] or from other genomic sources. The system facilitates the comparison of the target family with 100,000 other families, using the fact that ORFeus, in contrast to many other methods used in fold recognition, does not require any information about any native structure to conduct the comparison. GRDB also contains simple PSI-Blast search procedures on large databases, which include amino acid sequences translated from open reading frames of unfinished genomes. The application of the GRDB system and the consecutive application of the Meta Server to obtain fold recognition results from state-of-the-art structure prediction servers represented the essence of our successful prediction strategy at CASP4.

Fig. 1. Correlation between the 3D-Jury score and the number of correctly positioned residues in the model. The x axis shows the confidence scores reported by 3D-Jury for the models collected within the LiveBench 6 experiment. The y axis shows the corresponding number of correctly positioned residues (within 3 Å from the native position) in the model. The high correlation (86%) between both values helps to estimate the quality of the model.

*Correspondence to: Leszek Rychlewski, BioInfoBank Institute, Ul. Limanowskiego 24A, 60-744 Poznan, Poland. E-mail: leszek@bioinfo.pl

Received 17 February 2003; Accepted 17 March 2003

MATERIALS AND METHODS

Since CASP4, the main progress has been achieved in the area of meta-predictors. Other benchmarking experiments conducted within the LiveBench program have demonstrated the significant improvement in prediction accuracy and reliability provided by the meta-predictors. Before CASP5 started, several versions of the Pcons/Pmodeller [8] meta-predictor series, several versions of the ShotGun [9] meta-predictor series, and the Robetta [10] meta-predictor signed up for the parallel CAFASP 3 experiment. In addition, the number of participating autonomous servers doubled relative to the previous event.
To take full advantage of the large number of available models, a new, flexible, and simple meta-predictor called 3D-Jury [11] was developed. The 3D-Jury system makes it possible to select which servers are to be used for consensus building (meta-prediction) and offers the option to take into account one (closest) or all models per server. Initial benchmarking conducted with the ToolShop [12] program indicated that, despite its simplicity, the system is very powerful, accurate, and able to match the performance of other evaluated meta-predictors (i.e., the ShotGun and Pcons series). In particular, the specificity of the server (the reliability of the confidence score) proved to be a very useful feature for fold selection (Fig. 1). The 3D-Jury system can also operate as a meta-meta-predictor if models collected from other meta-predictors are used in consensus calculations. Because of its “versatility,” the results of the 3D-Jury system (compiled by using “settings” to enable the meta-meta processing) were posted on the main CAFASP pages and were made available to all CASP participants.

Knowing that many of the human predictor groups highly ranked in the previous CASP4 fold recognition category have acknowledged taking into account the predictions collected by the CAFASP 2 meta-server, and knowing that the performance of the meta-predictors was well known to at least all LiveBench participants, we expected that using the straight 3D-Jury models would place us in the middle of the “CASP-field.” Thus, we decided to try an experimental, not previously evaluated approach of combining our structure prediction procedure with model assessment tools.
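
The two 3D-Jury options described in this section (choosing the servers that take part in consensus building, and counting either the closest model or all models per server) can be illustrated with a minimal Python sketch. The text does not give the scoring formula, so the summation rule below is an assumption made purely for illustration, as are the names `consensus_scores`, `toy_similarity`, and the toy similarity values; a real similarity measure would come from superimposing pairs of models.

```python
from typing import Callable, Dict, List, Set, Tuple

Model = Tuple[str, str]  # (server name, model id); hypothetical representation


def consensus_scores(
    models: List[Model],
    similarity: Callable[[Model, Model], float],
    servers: Set[str],
    closest_only: bool = True,
) -> Dict[Model, float]:
    """Score each model by its similarity to the other pooled models.

    Assumed rule (not taken from the paper): sum the similarities,
    using either the closest model or all models of each server.
    """
    # Keep only models from the servers selected for consensus building.
    pool = [m for m in models if m[0] in servers]
    scores: Dict[Model, float] = {}
    for model in pool:
        by_server: Dict[str, List[float]] = {}
        for other in pool:
            if other != model:
                by_server.setdefault(other[0], []).append(similarity(model, other))
        if closest_only:
            scores[model] = sum(max(values) for values in by_server.values())
        else:
            scores[model] = sum(sum(values) for values in by_server.values())
    return scores


# Toy, invented pairwise similarities between four models from three servers.
A1, B1, B2, C1 = ("srvA", "m1"), ("srvB", "m1"), ("srvB", "m2"), ("srvC", "m1")
TOY = {
    frozenset((A1, B1)): 60.0,
    frozenset((A1, B2)): 20.0,
    frozenset((A1, C1)): 55.0,
    frozenset((B1, B2)): 10.0,
    frozenset((B1, C1)): 50.0,
    frozenset((B2, C1)): 5.0,
}


def toy_similarity(a: Model, b: Model) -> float:
    return TOY[frozenset((a, b))]


all_models = [A1, B1, B2, C1]
all_servers = {"srvA", "srvB", "srvC"}
closest = consensus_scores(all_models, toy_similarity, all_servers, closest_only=True)
everything = consensus_scores(all_models, toy_similarity, all_servers, closest_only=False)
print(max(closest, key=closest.get), max(everything, key=everything.get))
```

With these invented numbers the two modes select different top models, which shows why the choice between one and all models per server is exposed as a setting.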
We had gathered some experience with the application of Verify3D [13] (a protein structure assessment tool) in homology modeling, and we decided to apply it in our CASP5 effort.

The Verify3D program provides assessment of structures at the residue level, which enables the user to locate parts of the protein that are likely to have the correct conformation or to look for misfolded regions. We decided to take advantage of this feature and apply Verify3D not only to guide our model selection but also to cut correctly folded sections of models and splice them into the final structures. The splicing procedure was conducted by superimposing all chosen models and selecting model fragments of structurally diverged regions that showed the highest Verify3D scores. Structurally conserved regions were usually taken from the model with the highest 3D-Jury score and represented the crossing points for splicing the final model. The inserted fragments were usually of the size of supersecondary structures.

The idea of splicing models from fragments was inspired by the known progress in the ab initio methodology attributed to the application of fragment insertion methods and by the benchmarking of different versions of the ShotGun meta-predictors, which also include a fragment-splicing module. The ShotGun meta-predictor operating in the fragment-splicing mode clearly outperformed “the standard version,” which selected the same models without modifying them (without splicing). The improvement in scores resulted from a larger number of correctly positioned residues in the model, which positively affects almost any evaluation method.
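
The fragment selection step described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the per-residue quality scores are taken as given (no Verify3D interface is called), the diverged regions are assumed to be already identified from the superposition, and the use of the mean score over a region as the selection criterion is a guess rather than the documented procedure. The names `splice_plan`, `base_model`, and the toy scores are hypothetical.

```python
from typing import Dict, List, Tuple


def splice_plan(
    length: int,
    base_model: str,
    diverged_regions: List[Tuple[int, int]],
    residue_scores: Dict[str, List[float]],
) -> List[str]:
    """Return, for every residue index, the name of the model it is taken from.

    Conserved positions come from `base_model` (e.g. the top consensus model);
    each diverged region, given as a half-open interval [start, end), comes
    from the model with the highest mean per-residue score in that region.
    """
    source = [base_model] * length
    for start, end in diverged_regions:

        def mean_score(name: str) -> float:
            window = residue_scores[name][start:end]
            return sum(window) / len(window)

        best = max(residue_scores, key=mean_score)
        source[start:end] = [best] * (end - start)
    return source


# Invented scores for two 10-residue models; m2 is better only in residues 3..5.
toy_scores = {
    "m1": [0.5] * 10,
    "m2": [0.1, 0.1, 0.1, 0.8, 0.8, 0.8, 0.1, 0.1, 0.1, 0.1],
}
plan = splice_plan(10, "m1", [(3, 6)], toy_scores)
print(plan)
```

The sketch produces only a plan of which model supplies which residues. Building coordinates from such a plan still requires the superposition and a modeling step to join the fragments.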
The application of Verify3D and the subsequent splicing of positively evaluated fragments gave us the possibility to compete with the ShotGun meta-predictors and possibly with the (feared) Robetta server (also published in this journal issue), which used very successful fragment insertion-based ab initio procedures to improve models collected partially from other meta-predictors. (We were not able to benchmark the Robetta server before or during the CASP procedure because of the high CPU costs of predictions.)

The final protein structure prediction strategy applied in CASP5 is presented in Figure 2. The strategy met our prior requirements of being sufficiently novel and having the potential for automation. The GRDB system, which helped us to rank higher than the manual consensus prototype participating in CASP4, was used in very difficult cases for general fold recognition. The models subsequently generated by the Meta Server and 3D-Jury for distant homologues of the query protein selected with GRDB were discarded, but the confidence scores assigned to particular folds were used as guidance for the fold selection. Only the models (alignments) collected for the original target sequence were used for the fragment-splicing procedure and the final submission.

Fig. 2. General flowchart of the protein structure prediction strategy applied by the BioInfo.PL group in CASP5. The target was submitted to the Meta Server, and the 3D-Jury system was used to select suitable models for subsequent fragment splicing conducted with the help of Verify3D. In difficult cases, the GRDB system was used to select distant homologues for subsequent structure prediction with the Meta Server and the 3D-Jury system. The result affected the selection of models for the fragment-splicing procedure.

RESULTS

In CASP5, the BioInfo.PL group managed to submit predictions for all targets.
Table I presents our assessment of the results compiled for four meta-predictors (servers) and two human groups (i.e., BioInfo.PL and the group registered by Krzysztof Ginalski as Ginalski). Both groups, like many other predictors in CASP5, used the Meta Server and the 3D-Jury system for fold recognition. However, Ginalski used a completely different modeling protocol, which involved selection of consensus alignment regions backed by multiple alignment of protein families (partially guided by biological information about important conserved residues collected from the literature) and subsequent manual evaluation of the 3D models.

Comparison of the results obtained by both groups shows that the fragment-splicing method that we implemented was clearly inferior to the manual, sequence information-guided alignment and modeling conducted by Ginalski. Even more disappointing is the fact that, with the standard model evaluation procedures applied in LiveBench (using a cutoff of 40 correctly positioned residues in a model), our results are worse than the results of some of the meta-predictors, including the results of the 3D-Jury system obtained with the same settings as posted on the CAFASP site during the prediction season. Our prior assumption was that those results would be used by many predictors, which would position us far below the top 10 groups.

TABLE I. Performance of the BioInfo.PL Group on the Fold Recognition Targets Compared With Results Obtained by the Ginalski Group and Five Selected Meta-Predictors

The “target” column lists the ids of the domains selected by the CASP assessors for the fold recognition category (black indicates easy fold recognition; red indicates distant fold recognition). The “len” column lists the lengths of the domains. The seven left columns of the table show the results obtained by the Ginalski group (KGIN), the BioInfo.PL group (B-PL), and five meta-predictors: Robetta (RBTA), two 3D-Jury versions (3JCa and 3JAa; results of 3JCa were posted on the CAFASP page), ShotGun-on-5 (3DS5), and Pmodeller3 (PMO3). The values in the left columns correspond to the numbers of correctly positioned residues in the model (within 3 Å from the native position). Black numbers are >39; yellow values are ≥30. The four bottom rows on the left summarize the performance of the selected groups. The “sum_40” row gives the sum of the correct residues in models with at least 40 correct residues (black values). The “num_40” row lists the number of such models. The “sum_30” and “num_30” rows show the same summary for the cutoff of 30 correct residues in a model. The right part of the table shows the summary obtained by using the sequence-independent (alignment-independent) evaluation of models conducted with the LGscore program. The scores correspond to the log multiplied by 100 of the probability that the similarity between the model and the native structure is random. In LiveBench, the cutoff of 15 is used to separate wrong and correct models. The two right rows at the bottom list the number of correct models (num_40) and the total number of points summed over correct models (sum_40).

Had our assumption turned out to be correct, this manuscript would probably not have had a chance of being published. Fortunately for our ranking, the assessment conducted by the CASP assessor in the fold recognition category is much less restrictive on the model quality than the thresholds used in LiveBench. In CASP, points can be obtained for all models if they are better than models produced by most predictors. In LiveBench, models assessed as incorrect do not affect the total score.
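
The cutoff-based summary rows described in the Table I caption (“num_40”/“sum_40” and “num_30”/“sum_30”) amount to a simple filter-and-sum, sketched below in Python. The function name `summarize` and the residue counts in the example are invented for illustration and are not values from Table I.

```python
from typing import Iterable, Tuple


def summarize(correct_residues: Iterable[int], cutoff: int) -> Tuple[int, int]:
    """Return (num, total) over models with at least `cutoff` correct residues.

    Models below the cutoff are treated as incorrect and contribute nothing,
    mirroring the LiveBench-style scoring described in the text.
    """
    passing = [n for n in correct_residues if n >= cutoff]
    return len(passing), sum(passing)


# Hypothetical per-target counts of correctly positioned residues for one group.
example_counts = [72, 45, 38, 31, 12, 0]
num_40, sum_40 = summarize(example_counts, cutoff=40)
num_30, sum_30 = summarize(example_counts, cutoff=30)
print(num_40, sum_40, num_30, sum_30)
```

In this invented example, lowering the cutoff from 40 to 30 admits two additional models, which is the kind of shift that separates groups with many borderline models from those with few.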
Table I shows the scores that would be obtained if the default cutoff for a correct model were lowered from 40 correct residues to 30. With this less stringent cutoff, a difference between the performance of our group and the meta-predictors appears. The BioInfo.PL group gets as many correct models as the best meta-predictor in this assessment (i.e., Robetta). The total number of correctly positioned residues is 5% higher than that obtained by Robetta. A gap of an additional 15% remains between BioInfo.PL and Ginalski.

This gap illustrates the superiority of human experts in finding a better alignment. If the alignment were ignored, our results would be more competitive. Table I also shows the results obtained by using sequence-independent (alignment-independent) assessment of the models. Using the standard setting as applied in LiveBench (using the LGscore-2 [14] program), the BioInfo.PL group obtained the highest number of correct models (20), with the best meta-predictors obtaining only 50–65% (10–13) of that number. The group got points for 80% (20 of 25) of the fold recognition targets using sequence-independent assessment. We conclude that our procedure is very promising in finding suitable modeling targets, but the quality of the models remains low. This could be partially attributed to our attempt to use Verify3D on distant fold recognition templates, where the structural divergence is probably too high for correct assessment of the sequence-to-structure fit. Figure 3 shows an example in which Verify3D failed, in our hands, to select suitable fragments from chosen templates that shared sufficient similarity to the native structure.

Fig. 3. Modeling of the target T0156. The initial application of the Meta Server and the 3D-Jury system in the case of the target T0156 was inconclusive. The GRDB system was used to select 10 other sequences from the target family, which were subsequently submitted to the Meta Server. None of the homologues obtained confident structure prediction results, but the highest score was obtained for the representative sequence of the COG0684 family. The prediction suggested the template with the PDB code “1dik” or “2dik” as the most promising modeling candidate. Subsequent reevaluation of the initial fold recognition results obtained for the target sequence indicated three models in which the same PDB structure was used as template, obtained from 3D-PSSM [15], INBGU [16], and ORFeus. The three models were evaluated by using Verify3D. The plots shown in the figure correspond to the residue-based quality assessment results. Values below 0.2 indicate suspect regions in the model. The highest values are obtained for the 3D-PSSM prediction over large sections of the model (with the exception of the region 110–130, where the INBGU model obtains higher quality scores). The structure submitted by us was spliced from two fragments of the two models (fragment 110–130 from INBGU and the other fragment from 3D-PSSM) and processed with Modeller. The quality graph of the final BioInfo.PL model is plotted in the figure as a dotted line. The comparison with the native structure revealed that, despite the highest quality assessment obtained with Verify3D, the 3D-PSSM model had the lowest number of correctly positioned residues (27 residues within 3 Å from the native position) compared with models obtained from INBGU (33 residues) or ORFeus (50 residues). The alignment shown at the bottom of the figure indicates the residues that were correctly positioned in all four models.

DISCUSSION

The application of GRDB to select distant homologues to confirm or guide the fold assignment remains unreliable. It is not difficult to be driven to wrong conclusions when applying iterative PSI-Blast searches even if conservative confidence thresholds are applied. The ORFeus system
The ORFeus systemmay seem slightly more robust. A higher reliability of theconfidence score is reported for ORFeus in LiveBenchexperiments than for the evaluated PSI-Blast version.Nevertheless, it is quite seldom to obtain a clearly signifi-cant ORFeus hit completely missed by PSI-Blast. Thereliability of linking proteins with unknown structure ismuch lower than the reliability of linking structures ofproteins where advanced meta-predictors can be applied.Because of this, it will be difficult to automate the fold [ Pobierz całość w formacie PDF ]
