PROTEINS: Structure, Function, and Genetics 53:369–379 (2003)

A “FRankenstein’s Monster” Approach to Comparative Modeling: Merging the Finest Fragments of Fold-Recognition Models and Iterative Model Refinement Aided by 3D Structure Evaluation

Jan Kosinski, Iwona A. Cymerman, Marcin Feder, Michal A. Kurowski, Joanna M. Sasin, and Janusz M. Bujnicki*

Bioinformatics Laboratory, International Institute of Molecular and Cell Biology, Trojdena 4, 02-109 Warsaw, Poland

ABSTRACT

We applied a new multi-step protocol to predict the structures of all targets during CASP5, regardless of their potential category. 1) We used diverse fold-recognition (FR) methods to generate initial target-template alignments, which were converted into preliminary full-atom models by comparative modeling. All preliminary models were evaluated (scored) by VERIFY3D to identify well- and poorly-folded fragments. 2) Preliminary models with similar 3D folds were superimposed, poorly-scoring regions were deleted and the “average model” structure was created by merging the remaining segments. All template structures reported by FR were superimposed and a composite multiple-structure template was created from the most conserved fragments. 3) The average model was superimposed onto the composite template and the structure-based target-template alignment was inferred. This alignment was used to build a new (intermediate) comparative model of the target, again scored with VERIFY3D. 4) For all poorly scoring regions, a series of alternative alignments was generated by progressively shifting the “unfit” sequence fragment in either direction. Here, we considered additional information, such as secondary structure, placement of insertions and deletions in loops, conservation of putative catalytic residues, and the necessity to obtain a compact, well-folded structure. For all alternative alignments, new models were built and evaluated.
5) All models were superimposed and the “FRankenstein’s monster” (FR, fold recognition) model was built from best-scoring segments. The final model was obtained after limited energy minimization to remove steric clashes between sidechains from different fragments. The novelty of this approach is in the focus on “vertical” recombination of structure fragments, typical for the ab initio field, rather than “horizontal” sequence alignment typical for comparative modeling. We tested the usefulness of the “FRankenstein” approach for non-expert predictors: only the leader of our team had considerable experience in protein modeling; he registered as a separate group (020) and submitted models built only by himself. At the onset of CASP5, the other five members of the team (students) had very little or no experience with modeling. They followed the same protocol in a deliberately naïve way. In the fourth step they used solely the VERIFY3D criterion to compare their models and the leader’s model (the latter regarded only as one of the many alternatives) and generated the hybrid or selected only one model for submission (group 517). In order to compare our protocol with the traditional “one target-one template-one alignment” approach, we submitted (as a separate group 242) models selected from those automatically generated by all CAFASP servers (i.e. obtained without any human intervention). Here, we compare the results obtained by the three “groups”, describe successes and failures of the “FRankenstein” approach and discuss future developments of comparative modeling.
The automatic version of our multi-step protocol is being developed as a meta-server; the prototype is freely available at [...]

© 2003 Wiley-Liss, Inc.

Key words: homology modeling; bioinformatics; GeneSilico; consensus generation; model evaluation

Grant sponsor: Polish State Committee for Scientific Research and EMBO & HHMI Young Investigator Programme; Grant numbers: 6P04A01124 and 3P05A02024.

*Correspondence to: Janusz M. Bujnicki, Bioinformatics Laboratory, International Institute of Molecular and Cell Biology, Trojdena 4, 02-109 Warsaw, Poland. E-mail: iamb@genesilico.pl

Received 14 February 2003; Accepted 8 April 2003

INTRODUCTION

Assessments of protein structure prediction (CASP [1], CAFASP [2], Livebench [3]) have demonstrated that fold-recognition (FR) methods can identify remote similarities when standard sequence search methods fail, but the reported target-template alignments are often only partially correct, leading to models with misfolded parts. The use of additional information, such as secondary structure (SS), and/or localization of ligand-binding residues can help to improve the target-template alignments. Moreover, models constructed from multiple parents were often found to be more accurate than models constructed from single parents only. The final prediction accuracy can be therefore improved if the best fragments obtained from various FR alignments can be judiciously combined to generate a consensus model. Interestingly, in CASP4, several FR methods were shown to perform quite well also in the comparative modeling (CM) category [4]. The performance difference between the human experts and computer predictors continues to narrow, which suggests that most of the refinement procedures used by humans can be fully automated.
These observations led us to consider that the FR-based consensus approach may be applicable not only to modeling based on remotely related templates, where the critical issue is to identify the correct template, but to _classical_ comparative modeling as well, where the possible templates can be identified relatively easily, but the real challenge is to obtain a perfect sequence alignment.

In CASP4, one of us (J.M.B.) participated as a member of the BioInfo.PL duumvirate, as well as one of four experts of the CAFASP-consensus group. Within BioInfo.PL, J.M.B. was responsible for building and refinement of models for most of the targets in the HM and FR categories, while in CAFASP-consensus, he participated in the identification of the best models generated automatically by FR servers and in inference of a rational consensus between them. While the unrefined predictions gave CAFASP-consensus the overall rankings of 7th and 26th in the FR and CM categories, respectively, the refined predictions gave BioInfo.PL an even better score (5th in FR, 4th in CM) [4,5]. However, it was not always clear whether the improvement stemmed from the application of different criteria for selection of the best automated models as the starting points for modeling by the two groups or from a different degree of model refinement.

In CASP5, we attempted to assess the applicability of the FR-based consensus approach in targets beyond the “core” FR category, i.e. also in the CM (easy modeling) and FR/NF (extremely hard modeling) categories. Since it is impossible to distinguish between NF (novel fold) and FR/NF targets a priori, we applied the same modeling protocol to all targets, regardless of their apparent difficulty. Because our modeling protocol was quite complex (see below), we were also interested in testing whether it can be useful in the hands of non-experts, compared to simple selection of one of the crude models generated by the individual FR servers.
Previously, comparisons of human-refined models and crude automatic models have been made [2,6]. Fully automated protein structure prediction is usually applied in large-scale analyses, for instance on a genome scale, where human intervention is not feasible for practical reasons (refined modeling of thousands of proteins would take thousands of person-hours). However, many researchers are interested in protein structure prediction only for one particular protein at a time and they usually have some knowledge about the prediction target (for instance knowledge of the catalytic residues), which can be applied to selection of the potentially best model from several alternatives. Hence, in our opinion it would be informative to compare the quality of predictions made by non-experts, who either select crude models from the CAFASP set based on their agreement with data from the literature or generate refined models according to an elaborate, multi-step “expert” modeling protocol.

METHODS

At the outset of CASP5, our group comprised one protein modeling expert (J.M.B.) and five students with background in experimental biology, only fundamental knowledge of protein structure and function, and little or no experience with protein structure prediction and modeling.
We attempted to mimic three possible real-life scenarios:

i) a group of biologists with at most basic knowledge of protein modeling attempts to obtain the most useful model possible from the set of alternatives provided by publicly available automatic servers,

ii) a protein modeling expert uses a refined protocol and his intuition to build and refine the model,

iii) the same biologists as in scenario i) are provided with the model built by the expert in scenario ii), but they do not fully trust his prediction and attempt to use the expert’s refinement protocol and select the “best” model for submission based not on intuition, but on an acclaimed objective method for evaluation of protein structures.

Hence, we submitted predictions as three independent groups, and applied the same sets of protocols for submission of all targets, irrespective of their potential classification to the CM, FR and NF categories:

Group 242, GeneSilico-servers-only (unrefined FR models selected from the CAFASP results),

Group 020, Bujnicki-Janusz (models refined by a single experienced predictor),

Group 517, GeneSilico (consensus obtained after objective evaluation and comparison of models generated independently by six members of our team).

Selection of the best automated model by GeneSilico-servers-only in CASP5 was carried out in a similar manner to that of CAFASP-consensus in CASP4, but in a more disciplined way. For instance, in CAFASP-consensus, preparation of the model involved limited human intervention when homology modeling programs failed to generate any reasonable structures from the FR alignments because of large deletions or insertions in the protein core (in these cases gaps in the alignment were shifted to the surface-exposed regions).
On the other hand, human intervention of the GeneSilico-servers-only group in CASP5 involved only selection of one of the FR alignments or one of the atomic models generated by HM or ab initio servers. The criteria for selection included: the formation of a compact globular structure (assessed visually using RASMOL) [7], the functional similarity between the target and the template according to the literature and the SCOP database [8], the alignment of the functionally important and/or conserved residues, and the agreement between the secondary structure in the target and in the template. Similar criteria could be easily applied in the real-life scenario, by real researchers interested in obtaining a crude model of their protein. Automated FR models submitted by GeneSilico-servers-only were all based on single templates. Most of them were submitted in the AL format without explicit modeling of sidechains and insertions or deletions (indels), to avoid the inevitable distortion of the raw data by automatic homology modeling in cases such as disruption of the protein core. Occasionally, full-atom models built by ab initio and homology modeling servers were selected for submission.

Previously, in J.M.B.’s hands the carefully refined multiple sequence alignments gave much better results as FR queries compared to single sequences, even in those FR servers which used their own BLAST utility to construct alignments (our unpublished data and CASP4 results: the performance of BioInfo.PL vs the CAFASP-consensus group). In many cases we observed significant improvement of the prediction quality if sequences from unfinished genomes were included in the alignment, when the position of indels was manually refined and when highly diverged parts of the alignment or long insertions in the target were deleted. However, none of the available FR metaservers [9–11] offered satisfactory options to build and process user-defined alignments.
In order to reduce the workload and simplify submission of different variants of prediction jobs for the same target to multiple servers, we developed a novel metaserver [12]. It serves as a gateway to many of the FR servers available via the CAFASP metaserver (at the time of the writing: PDB-BLAST, 3DPSSM [13], BIOINBGU [14], FFAS-03 [15], FUGUE 2.0 [16], MGENTHREADER [17], RAPTOR [18], and SAM-T02 [6]) and offers a few options not available elsewhere, including submission of user-defined sequence alignments and generation of many variants of the consensus sequence [12].

Figure 1 shows the flow-chart of our sequence analysis and modeling strategy. As a prerequisite (step 0) for the refined modeling analysis carried out by Bujnicki-Janusz and GeneSilico in CASP5, as many homologs of the target sequence as possible were identified and included in the alignment. For this purpose, we created a database of putative translation products (length ≥20 aa) of all unfinished genomes whose sequences were publicly available. Combined with the non-redundant database (NCBI), this allowed a roughly two-fold increase of the size of the database used in local PSI-BLAST [19] searches. In a few cases, it increased the number of homologs of the target from ca. 5 to over 20 and hence allowed much better delineation of conserved and variable regions. The PSI-BLAST output (aligned sequence fragments) was saved as a multiple sequence alignment following the removal of all columns with 30% gaps.
Full-length sequences were retrieved from the ENTREZ database, and realigned by CLUSTALX [20], using the “align sequences to the profile” option (the PSI-BLAST output served as the “profile”). Alignments were refined manually and used to divide the query sequence into domain-size fragments, which were submitted to our new metaserver as independent predic[...] sequences, the FR results (target-template alignments in the AL format) as well as comparative and ab initio models (full-atom structures in the TS format) were obtained from [...] server was also used to carry out FR analysis for the full-length alignments and for the alignment sections corresponding to the individual domains. Three options were used: i) columns with 30% of gaps were deleted (i.e. only the core regions were analyzed), ii) gaps were treated as unknown characters (X) (i.e. the variable regions of the target sequence were “extended” to the size of the entire alignment, using the longest insertions present in homologous sequences as the reference), iii) the consensus sequences were generated for submission as additional single-sequence jobs, using the majority-rule criterion for selection of the most frequently observed amino acids in each position or the BLOSUM matrix criterion. All results for each CASP5 target (regardless of its category), its parts, corresponding multiple sequence alignments and consensus sequences were collected. For consensus and “core only” models, the original length and the amino acid sequence of the prediction target was restored by introducing insertions and deletions (indels) into the corresponding FR alignments. All alignments were converted to a common format, including the sequence of the target (or one of its domains) and the template.

In the first step of the modeling protocol, all FR alignments were converted into preliminary full-atom models by comparative modeling using MODELLER 6v1 [21] with default parameters.
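The three ways of preparing alignment-derived queries described above (deleting gap-rich columns, treating gaps as X, and building a majority-rule consensus) can be illustrated with a minimal Python sketch. It is not the metaserver's code: the function names, the representation of an alignment as a list of equal-length strings, the inclusive handling of the 30% cut-off and the alphabetical tie-breaking are all assumptions made for illustration, and the BLOSUM-based consensus criterion is not covered.

```python
from collections import Counter

GAP_CHARS = "-."  # characters treated as alignment gaps (assumption)


def drop_gappy_columns(alignment, max_gap_fraction=0.3):
    """Option i): keep only columns whose gap fraction is <= max_gap_fraction.

    Whether the cut-off is inclusive or strict is an assumption of this sketch.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_rows = len(alignment)
    keep = [
        col
        for col in range(length)
        if sum(row[col] in GAP_CHARS for row in alignment) / n_rows <= max_gap_fraction
    ]
    return ["".join(row[col] for col in keep) for row in alignment]


def gaps_to_unknown(sequence, unknown="X"):
    """Option ii): replace every gap in an aligned sequence with X."""
    return "".join(unknown if ch in GAP_CHARS else ch for ch in sequence)


def majority_consensus(alignment, unknown="X"):
    """Option iii): most frequent residue per column (majority rule).

    Gaps are ignored when counting; an all-gap column yields `unknown`.
    Ties are broken alphabetically, an arbitrary choice made here only
    to keep the result deterministic.
    """
    consensus = []
    for column in zip(*alignment):
        residues = [ch for ch in column if ch not in GAP_CHARS]
        if not residues:
            consensus.append(unknown)
            continue
        counts = Counter(residues)
        top = max(counts.values())
        consensus.append(min(res for res, n in counts.items() if n == top))
    return "".join(consensus)
```

Each function takes the same toy alignment format, so the three query variants for one target can be produced from a single refined alignment.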
For each CASP5 target, a database of models was created using the FR-based homology models and full-atom ab initio and homology models obtained from the CAFASP website. All these models were evaluated by VERIFY3D [22], using the atypically small window size of 5 in order to identify well- and poorly-folded fragments of the size comparable to the smallest secondary structure elements.

In the second step, preliminary models were divided into clusters based on the relationship of the template folds, according to SCOP [8]. Within each major cluster (≥5 FR alignments), all models were superimposed with SWISSPDBVIEWER and the superposition was used to generate a multiple sequence alignment. Structurally superimposable regions were identified and analyzed for the consistency of the alignment and the quality of the sequence-structure fit, according to VERIFY3D. A hybrid “consensus” model was created from the well-scored fragments (≥10 aa) of models corresponding to the most frequently reported alignments. If there were several alternative clusters (alternative folds), the consensus model composed of best scoring fragments over the entire length of the target sequence was selected for further analysis (step 3). If no recurring fold could be identified among the FR results (no major clusters with ≥5 superimposable models), the best-scoring preliminary model was selected for further analysis (step 4, see below). Alternatively, the analysis was halted with the conclusion that the target most likely represents a novel fold and neither a suitable template exists in PDB nor any attractive models were generated by ab initio servers for CAFASP.

Fig. 1. A flowchart illustrating the major stages of construction of the “FRankenstein’s monster”, from the target sequence to successful model submission. The individual steps are color coded: step 0 (pre-modeling, sequence analysis): gray; step 1 (generation of crude models): yellow; step 2 (generation of the first hybrid model): green; step 3 (generation of the composite template and construction of the first “real” comparative model based on the structural alignment between the composite template and the first hybrid model): cyan; step 4 (sampling of the alignment space): pink; step 5 (creation of the final hybrid model and unleashing the “FRankenstein’s monster”): cornsilk.

In the third step, a multiple structure alignment was created by pairwise superposition of all template structures and the consensus model of the target. The template most frequently reported by FR was used as a reference structure. The most diverged elements of the template structures (for instance large insertions not present in other templates and in the consensus model) were removed and the remaining parts of the template structures were regarded as a composite multiple-structure template. Based on this superposition, a sequence alignment was inferred. The alignment of the target sequence to the composite multiple-structure template was used to build a new (intermediate) comparative model of the target, using MODELLER [21] and SWISSMODEL [23]. We used both programs, because we observed that SWISSMODEL introduces less distortion into the protein core (compared to the structure of the template), while it sometimes has problems with insertions or ligation of loose ends generated by deletions. On the other hand, MODELLER is more “sloppy” in modeling the protein core, but has an extraordinary ability to accommodate indels in virtually any region of the protein structure, including “prohibited” regions of the protein core (our unpublished observations).
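The window-based evaluation of step 1 and the selection of well-scored fragments in step 2 can be illustrated with a short Python sketch. It does not reproduce VERIFY3D: it assumes that a raw per-residue score is already available as a list of numbers, smooths it with a plain centred moving average (window of 5, as in the text) and reports contiguous stretches above a cut-off. The cut-off value of 0.2 and the function names are hypothetical; only the window size and the minimum fragment length of 10 residues come from the text.

```python
def window_average(scores, window=5):
    """Centred moving average; windows are truncated at the chain termini."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    half = window // 2
    smoothed = []
    for i in range(len(scores)):
        lo = max(0, i - half)
        hi = min(len(scores), i + half + 1)
        smoothed.append(sum(scores[lo:hi]) / (hi - lo))
    return smoothed


def well_scored_fragments(smoothed, cutoff=0.2, min_length=10):
    """Return (start, end) pairs, end exclusive, of runs scoring >= cutoff.

    `cutoff` is a hypothetical threshold chosen for this sketch;
    runs shorter than `min_length` residues are discarded.
    """
    fragments = []
    start = None
    for i, score in enumerate(smoothed):
        if score >= cutoff:
            if start is None:
                start = i  # a new well-scored run begins here
        elif start is not None:
            if i - start >= min_length:
                fragments.append((start, i))
            start = None
    # close a run that extends to the last residue
    if start is not None and len(smoothed) - start >= min_length:
        fragments.append((start, len(smoothed)))
    return fragments
```

A small window keeps short, poorly scoring stretches visible instead of averaging them away, which matches the stated aim of resolving fragments the size of the smallest secondary structure elements.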
This was also the reason for using only MODELLER for generation of the preliminary models based on FR alignments (step 1), which often have indels in “prohibited” regions of the template structure.

In the fourth step, the quality of the local structure of the intermediate model (both the MODELLER and SWISSMODEL versions) was again evaluated using VERIFY3D. For all regions with unsatisfactory VERIFY3D scores in either version of the intermediate model, a series of alternative alignments was generated by progressively shifting the “unfit” sequence fragment in either direction. Here, additional information, such as secondary structure, placement of insertions and deletions in loops, conservation of putative catalytic residues, and the necessity to obtain a compact, well-folded structure was taken into account. Thereby, the sequence/structure space could be explored beyond the alignment variants reported by FR servers. Importantly, only one region was shifted at a time to avoid interference of effects from different parts of the model structure. Usually, the limits of the shift in either direction were dictated by the criterion of at least partial overlap of predicted vs observed secondary structure elements. For all these alternative alignments, new models were built (again, using both SWISSMODEL and MODELLER) and evaluated using VERIFY3D.

In the fifth step, all models were superimposed using SWISSPDBVIEWER and the “FRankenstein’s monster” model was built from best-scoring segments. An additional criterion for selection of loops was the similarity of their conformation to the corresponding loops in at least one of the template structures. Typically, the core elements were taken from the models built using SWISSMODEL, while the loops (especially the problematic ones) were often taken from the models built by MODELLER.
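The core idea of the fifth step, choosing for each segment the model that scores best there, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the segment boundaries are taken as given, each model is represented only by a per-residue score profile of equal length, the mean score over the segment is used as the ranking criterion, and ties go to the alphabetically first model name. The function name and the data layout are hypothetical. Splicing the coordinates, the loop-conformation criterion and the energy minimization are not shown.

```python
def pick_best_segments(profiles, segments):
    """For each (start, end) segment, end exclusive, name the best-scoring model.

    profiles: dict mapping a model name to its per-residue score list.
    segments: list of (start, end) residue index pairs.
    Returns a list of (start, end, model_name) tuples.
    """
    if not profiles:
        raise ValueError("at least one model profile is required")
    choices = []
    for start, end in segments:
        if end <= start:
            raise ValueError("segments must be non-empty")
        # iterate names in sorted order so that ties resolve deterministically
        best = max(
            sorted(profiles),
            key=lambda name: sum(profiles[name][start:end]) / (end - start),
        )
        choices.append((start, end, best))
    return choices
```

For example, a profile that is strong in the core but weak in a loop and a second profile with the opposite pattern would yield a hybrid that takes the core from the first model and the loop from the second.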
The final model was obtained after manual adjustment of conformation of selected side-chains followed by limited energy minimization with GROMOS96 [24] (200 cycles of steepest descent) to remove steric clashes between sidechains from different fragments.

The same multistep modeling protocol was used by the five novice members of the team. They used the same set of FR alignments, but conducted their analysis independently, especially with respect to the choice of fragments to construct hybrid models (in all steps) and generation of alternative alignments (in step 4). The only difference was that the manual adjustment of side-chains in the final “FRankenstein’s monster” model (carried out by the expert, step 5) was replaced by automated rotamer selection using SCWRL [25]. As a result, up to 5 alternative models were generated by the student members of the GeneSilico group. They were compared with the model obtained by the expert and all were assessed with VERIFY3D. All models were treated as equal, i.e. no higher weight was assigned to the model built by the expert. If one model with an outstanding score emerged, it was selected for submission. Otherwise, a hybrid model was constructed from the best-scoring parts and the side-chains were re-modeled using SCWRL. It is noteworthy that in all cases, the fold selected by the 5 novice members agreed with that identified by the expert; however, for some targets either the expert or some members of the team concluded (in step 2) that modeling was unfeasible and no good candidate for submission could be proposed. In these cases, the selection of the best model or the construction of the hybrid for the final submission by the GeneSilico group was made based on fewer than 6 models.
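The selection rule applied to the expert's and the students' models can be written down as a small decision function. The text does not quantify what made a score “outstanding”, so the `margin` parameter, its default value and the use of a single overall score per model are assumptions introduced purely for illustration.

```python
def choose_submission(model_scores, margin=5.0):
    """Decide between submitting one model and building a hybrid.

    model_scores: dict mapping a model name to one overall score
    (higher is better). All models are treated as equal, so the
    expert's model receives no extra weight.
    margin: hypothetical lead over the runner-up that counts as
    "outstanding"; this value is not taken from the paper.
    Returns ("single", name) or ("hybrid", None).
    """
    if not model_scores:
        raise ValueError("no models to compare")
    ranked = sorted(model_scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1 or ranked[0][1] - ranked[1][1] >= margin:
        return ("single", ranked[0][0])
    return ("hybrid", None)
```

When the function returns "hybrid", the best-scoring parts of the candidate models would be merged, as described in the text.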
In a few cases, the GeneSilico group submitted more than one model, if no confident decision could be made based on the VERIFY3D evaluation.

Furthermore, the final model and all the well-scoring parts of the intermediate models were used to calculate the average residue-residue separation distances (submitted as the RR prediction category). The secondary structure of the final models was inferred according to DSSP and combined with the independent sequence-based predictions (obtained from CAFASP or calculated in-house) to generate the output in the SS format. For our own alignments, we made SS predictions based on consensus of JPRED [26], PSIPRED [27], SSPRO [28], PROF [29], and SAM-T02 [6]. For targets with no reliable models of the tertiary structure, the independent SS prediction was based solely on alignment-based predictions. Bujnicki-Janusz and GeneSilico calculated their RR and SS predictions independently. Bujnicki-Janusz (group 020) also submitted order/disorder (DR) predictions, based on a combination of SS prediction (in particular the presence of long “coil” regions), analysis of structural divergence, R-factors and conformational variability in template structures, and identification of compositionally-biased sequence regions.

RESULTS AND DISCUSSION

We were interested in analyzing the relative ranking of the three groups (242, 020, and 517) and the official assessment of their performance in relation to each other and to the other groups. It is noticeable that in all categories (CM, FR, and NF) both the expert (020) and the group of non-experts using the expert’s protocol and having access to the expert’s models (517) consistently outperformed group 242, i.e. the CAFASP-consensus-like selection of crude FR models.
The models built according to the elaborate refinement protocol (Figure 1) were very good, much better than the unrefined models selected from among the pre-computed CAFASP results, suggesting that this protocol may be a valuable tool for protein predictors and is probably worth automating in the future. It should be mentioned that Bujnicki-Janusz tended to depart from the rigorous protocol more often than GeneSilico and sometimes selected the models for submission based on his intuition rather than the VERIFY3D score. Since GeneSilico had access to Bujnicki-Janusz’s models and more consistently used the refinement procedure, we expected GeneSilico to consistently outperform Bujnicki-Janusz. We also expected that our strategy, focused on generation of the best possible FR alignments, would result in superior performance in the FR category rather than in the CM category. However, in the CM category Bujnicki-Janusz and GeneSilico scored among the top groups, while in the FR category both groups obtained quite good scores (with the mutual position in the ranking depending on the evaluation method), but not as good as the absolute top groups (see the assessors’ papers in this issue of Proteins). It was very surprising for us that we performed remarkably well in the CM category rather than in the FR category. We find it noteworthy that in the FR category we were outperformed, among others, by two other groups (453 and 006), who also used the consensus FR approach
