Protein Science (1999), 8:897–904. Cambridge University Press. Printed in the USA.
Copyright © 1999 The Protein Society

Protein structural topology: Automated analysis and diagrammatic representation

DAVID R. WESTHEAD,1,5 TIMOTHY W.F. SLIDEL,1 TOMAS P.J. FLORES,2 and JANET M. THORNTON1,3,4

1 The European Bioinformatics Institute, EMBL Outstation, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom
2 Synomics Ltd., P.O. Box 71, Royston SG8 8TE, United Kingdom
3 Biomolecular Structure and Modelling Unit, Department of Biochemistry and Molecular Biology, University College, London WC1E 6BT, United Kingdom
4 Laboratory of Molecular Biology, Department of Crystallography, Birkbeck College, University of London, London WC1E 7HX, United Kingdom

(Received June 30, 1998; Accepted December 21, 1998)

Abstract

The topology of a protein structure is a highly simplified description of its fold including only the sequence of secondary structure elements, and their relative spatial positions and approximate orientations. This information can be embodied in a two-dimensional diagram of protein topology, called a TOPS cartoon. These cartoons are useful for the understanding of particular folds and making comparisons between folds. Here we describe a new algorithm for the production of TOPS cartoons, which is more robust than those previously available, and has a much higher success rate. This algorithm has been used to produce a database of protein topology cartoons that covers most of the data bank of known protein structures.

Keywords: protein structure; topological diagram; topological representation; topology; TOPS

In recent years, the experimental techniques of NMR and X-ray crystallography have delivered a large number of protein 3D structures. Knowledge of these structures is of central importance to studies of protein function and evolution, particularly since it has become apparent that structure is much more strongly conserved through evolution than sequence.
With the increasing number of structures has come a need for better tools for automated structural analysis and visualization.

Visualization of the 3D folds of proteins can be difficult. Three-dimensional models can be viewed using graphics programs like RASMOL (Sayle & Milner-White, 1995), and folds can be made clearer by display options that show only the peptide backbone with secondary structures represented by ribbons. However, these representations still have to be rotated to find a good viewing angle, and manual comparison of different structures quickly becomes very difficult when more than a few are involved.

Reprint requests to: Dr. David Westhead, Lecturer in Bioinformatics, School of Biochemistry and Molecular Biology, University of Leeds, Leeds, Yorkshire LS2 9JT, United Kingdom; e-mail: westhead@bmb.leeds.ac.uk.
5 Current address: School of Biochemistry and Molecular Biology, University of Leeds, Leeds, Yorkshire LS2 9JT, United Kingdom.
Abbreviations: 2D, two-dimensional; 3D, three-dimensional; DSSP, dictionary of secondary structure of proteins; SSE, secondary structure elements; TIM, triosephosphate isomerase.

Comparison of protein folds is much easier when they are reduced to a topological level, at which details like the lengths and precise orientations of secondary structures, and structures of connecting loops, are ignored. Such a representation is embodied in a 2D TOPS cartoon. Some examples are shown in Figure 1 along with the 3D structures for comparison. The cartoons show the secondary structure elements (SSEs) and how they are connected in sequence. Also represented are the relative spatial positions and approximate orientations of the SSEs. Strands that are linked by hydrogen bond ladders are adjacent to each other in the cartoon, and SSEs that are otherwise spatial neighbors in the fold are plotted close together. Orientations of the SSEs are shown in the approximation that they have one of two directions, “up” (out of the page) or “down” (into the page).
Figure 1 clearly shows how TOPS cartoons simplify the understanding of the folding topology of a single structure, and enable comparison between related structures. The folds in Figure 1 all contain a “jelly roll” folding motif (Richardson, 1981; Stirk et al., 1992) highlighted as shaded triangles, a fact that is much clearer from the TOPS cartoons than the 3D structures.

The first TOPS cartoons were drawn manually (Levitt & Chothia, 1976; Nagano, 1977; Sternberg & Thornton, 1977), and automatic generation was first attempted by Flores and coworkers (Flores et al., 1994). This original algorithm, which was implemented in the program TOPS, has now been tested on many diverse examples and found to be lacking in robustness. In a large proportion of cases, the algorithm either fails completely, or produces a very poor cartoon, as a consequence of problems in the topological analysis or cartoon layout phases. Most of the problems reflect the inability of the algorithm to deal with irregularities that are present in a large proportion of protein structures. Here we describe a new algorithm, which is much more robust and has a much higher success rate. This algorithm is implemented in a new version of the TOPS program.

The improvements in the algorithm have enabled the generation of a database of TOPS cartoons (Westhead et al., 1998), which covers most of the data bank of protein structures (Bernstein et al., 1977; Abola et al., 1987). To avoid the generation of many highly similar cartoons, the structural data bank was first clustered at a sequence identity threshold of 95% over the full length of both sequences, using a standard single linkage clustering algorithm. A single representative cartoon was generated from the best structure in each cluster, resulting in a cartoon database containing more than 2,000 protein chains.
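The clustering step just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the code used to build the database: the function name `single_linkage_clusters`, the dictionary of pairwise percent identities, and the inclusive comparison against the threshold are all hypothetical choices made for the example. Single linkage is implemented here with a union-find structure, so two chains share a cluster whenever a chain of pairwise links at or above the threshold connects them.

```python
def single_linkage_clusters(chains, identity, threshold=95.0):
    """Group chains by single linkage on pairwise percent identity.

    chains    -- iterable of chain identifiers (hypothetical input format)
    identity  -- dict mapping (chain_a, chain_b) to percent identity
    threshold -- pairs at or above this value are linked (assumed inclusive)
    """
    parent = {c: c for c in chains}

    def find(c):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Merge the clusters of every pair that meets the threshold.
    for (p, q), pct in identity.items():
        if pct >= threshold:
            parent[find(p)] = find(q)

    # Collect members under their root and return them in a stable order.
    clusters = {}
    for c in chains:
        clusters.setdefault(find(c), []).append(c)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: A and C are only 80% identical, but both are linked through B.
pairs = {("A", "B"): 97.0, ("B", "C"): 96.0, ("A", "C"): 80.0, ("C", "D"): 40.0}
print(single_linkage_clusters(["A", "B", "C", "D"], pairs))
# prints [['A', 'B', 'C'], ['D']]
```

The toy data show the characteristic chaining behavior of single linkage: A and C fall in one cluster even though their direct identity is below the threshold.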
The best structure in a cluster was deemed to be the highest resolution X-ray crystal structure, if one was available, or an NMR structure if not.

Results

Improved TOPS cartoon generation

Figure 1 shows some example cartoons generated by our new algorithm for some protein structures taken from the Brookhaven data bank (Bernstein et al., 1977; Abola et al., 1987). Each cartoon shows a single structural domain. The first example, the viral coat protein (2stv), is essentially a classic jelly roll. The other examples are more complex folds with extra strands sometimes extending the β-sandwich structure associated with the jelly roll. The final example (1dlc domain 2) is particularly complex. The strand nearest in sequence to the C (carboxy) terminus C5 is at the origin of a bifurcation of the lower sheet in the sandwich. To the left of this strand (in the cartoon), this sheet is shown as two layers, one drawn slightly above the origin of the bifurcation and the other below. While the new algorithm generates correct cartoons for each of these domains, we found that the previous algorithm produced a satisfactory cartoon for only the first example, 2stv.

Correct TOPS cartoons can be drawn in four possible orientations, corresponding to rotations of ±180° about x and y axes. To make structural relationships clearer, cartoons can be re-oriented to produce equivalent views, as in the examples in Figure 1. Our software permits this procedure to be carried out manually. An alternative automatic procedure requires the user to input an alignment between the secondary structure elements in each structure. Such an alignment could be obtained by sequence alignment (in the case of sufficiently similar sequences) or from a structural alignment algorithm. Further work will focus on full automation of this procedure using a topological alignment algorithm.

Our approach to automated generation of TOPS cartoons has two phases: topological analysis and cartoon layout (see Materials and methods).
Topological analysis is concerned with the determination of all the relevant topological features of the structure. The cartoon layout phase aims to express these features in a topologically correct and aesthetically pleasing cartoon. It is vitally important that the first phase produce a correct topological analysis, because otherwise production of a correct cartoon is impossible. The cartoon layout phase depends upon a process of optimization that aims to produce the best possible cartoon, subject to a number of different criteria. This process is frustrated when these criteria are in conflict, and an ideal solution (cartoon) may not exist.

Fig. 1. Some example TOPS cartoons (left) along with 3D structures (right) for comparison. Four letter codes are accession numbers from the Brookhaven protein structure data bank (Bernstein et al., 1977; Abola et al., 1987). From top to bottom, the virus coat protein (2stv, Jones & Liljas, 1984), the seed storage protein canavalin (1cau chain A, Ko et al., 1993), human tumor necrosis factor α (1tnf chain A, Eck & Sprang, 1989), and bacterial delta endotoxin domain 2 (1dlc, Li et al., 1991). The 3D structures were drawn with Molscript (Kraulis, 1991); in these diagrams, arrows represent β-strands and helical ribbons represent helices. In the cartoons, contiguous pieces of peptide chain run from amino (N) terminus Ni to carboxy terminus Ci+1, following the connecting lines between symbols. Domains can be composed of more than one peptide chain (e.g., 1dlc domain 2). Triangular symbols indicate β-strands and circular ones helices, with smaller symbols indicating shorter helices (five residues or less) and smaller sheets (six residues or less in the sheet). The approximate direction of the corresponding secondary structure element is either “up” (out of the page) or “down” (into the page). These directions are indicated in the cartoon by the way connecting lines are drawn to the symbols: connections drawn to the edge of the symbol connect to the base, while those drawn to the center connect to the top. This information is duplicated in the case of strands: “up” strands are drawn as upward pointing triangles and “down” strands as downward pointing triangles. Each structure contains a “jelly roll” topology, highlighted as filled triangles in the cartoon. Note that one manual edit has been made to the cartoon for 1dlc domain 2: a connection has been redrawn to avoid the triangle representing the C-terminal strand nearest C5.

In cases where the automatically produced cartoon is not ideal, it is useful to be able to edit it manually, and we have developed software for this purpose. It is occasionally necessary to modify cartoons for aesthetic reasons. An example is 1dlc domain 2 in Figure 1, where the connecting line between two symbols has been re-drawn because the original line was drawn through another symbol. Manual editing is sometimes useful for other reasons: cartoons can be modified to emphasize symmetry in the fold, re-oriented to emphasize similarity to another cartoon, or altered in the rare cases where the topological analysis fails.

A large-scale test was carried out to determine the proportion of automatically produced cartoons requiring manual editing (Westhead et al., 1998). Cartoons were generated for more than 2,000 nonidentical protein chains. These were checked manually, and it was judged that modification was necessary in 18% of cases. Judgment of the aesthetic properties of a cartoon is clearly highly subjective, and many of these cases were modified for aesthetic reasons. However, there were a small number of cases where topological analysis failed.
These were cases of complex β-architectures, like trefoils (Murzin et al., 1992), propellers, and solenoids, which are not adequately described in terms of simple β-sheets, -barrels, and -sandwiches.

The new topological analysis and cartoon layout algorithms show numerous improvements with respect to their predecessors (Flores et al., 1994). Full details of the new algorithms are given in Materials and methods. The most important improvements relate to
• calculation of the relative direction of strands
• calculation of the relative position of strands in barrels, sheets, and sandwiches
• chirality calculation
• the recognition of sheet curvature
• improvement of the score function used in cartoon layout
• improvement of the layout optimization algorithm.

Due to lack of robustness, large scale testing of the old algorithm proved to be impossible. In view of this, we chose to illustrate the improved performance of the new algorithm with some typical examples, which are shown in Figure 2. The first example is rhodanese (1rhd domain 1), which has a single five-stranded parallel β-sheet, flanked by helices making right-handed connections between the strands. While the new algorithm produces a correct cartoon, the old algorithm was unable to calculate the correct positions and directions of the strands in the sheet, and made errors in the calculation of the chiralities of some βαβ-units. These errors in the topological analysis resulted in an incorrect cartoon.

In the enolase structure of Figure 2 (1pdz domain 2), the classic TIM barrel fold is modified, the second canonical strand being antiparallel to the rest. In this example there is a further complication. The first canonical strand of the barrel is split in two, according to the DSSP (Kabsch & Sander, 1983) secondary structure assignments. The first part of it is hydrogen bonded to canonical strand eight and the second half to strand two. For this reason
For this reasonthere is no cycle in the hydrogen bonding graph~seeMaterials andmethods!, and the algorithm is unable to detect the barrel. This is899Fig. 2.Example cartoons illustrating some of the improvements made inthe algorithm for automatic production of topology cartoons. Cartoonsproduced with the old algorithm are shown on the left and those producedwith the new on the right. The examples are~fromtop to bottom!: rhoda-nese~1rhd,Ploegman et al., 1978! domain 1, enolase~1pdz,Duquerroyet al., 1995! domain 2, and diphtheria toxin~1ddt,Bennett et al., 1994!domain 3. Detailed discussion is given in the text.catastrophic for the old algorithm, which simply plots the barrel asa flat sheet. The new algorithm is saved by the fact that the cur-vature of the sheet is detected, resulting in the more reasonable plotwith the strands on a circular arc. The two parts of the first ca-nonical strand are plotted on either end of the arc, the larger spacebetween these symbols indicating the lack of hydrogen bonds be-tween them. It is also clear that the old algorithm did not producea correct calculation of the relative direction of the strands, or thechiralities of thebab-units.The final example of Figure 2 is theb-sandwichstructure of thethird domain of diphtheria toxin~1ddt!.This is a difficult structurebecause the long twisted strand at the right hand end~inthe car-toon! is hydrogen bonded to strands in both sheets of the sandwich.The old algorithm is unable to cope with this, but the new algo-rithm treats the whole structure as a single sheet that is bifurcated~theorigin of the bifurcation being the edge strand in this case! andis able to produce a good cartoon.DiscussionWe have shown that our new algorithms are able to produce correctTOPS cartoons with much higher success rates than those previ-ously available~Floreset al., 1994!. This is due to improvementsin both topological analysis and cartoon layout phases. 
The diffi-culty of generating TOPS cartoons automatically became apparentas the old algorithms were applied to progressively more complexstructures, which was the motivation for this work.900The 82% success rate of our algorithm is still less than ideal.Some of the failures are simply cases where the algorithm pro-duced a topologically correct cartoon that was not aestheticallypleasing. These cases can often be improved by changing the pa-rameters of the drawing algorithm~seeMaterials and methods!,which relies on a process of optimization. The cartoon for do-main 2 of 1pdz in Figure 2 is a typical example. Clearly thecartoon produced by the new algorithms is topologically correct,but it is not very neat around the N- and C-terminal helices. Changesof optimization parameters can improve matters, but we find itquicker in practice to generate the cartoon using default parametersand then do any necessary tidying up manually.There are still some more serious failures. The most commoncauses of these are complicatedb-architectures.The topologicalanalysis recognizes sheets, barrels, and sandwiches, and drawsthem in an appropriate way. Ideally it would recognize otherb-architectures,such as trefoils~Murzinet al., 1992!, propellers,and triangular prisms. However, recognition of such structures isnot straightforward and will be the subject of further work.The topological approach to protein structure has the virtue ofsimplicity. Topology cartoons make the visualization of folds andtheir relationships much easier, as illustrated in Figure 1. Thestructural relationships between these folds are very distant. Theyall contain a jelly roll, but there are many indels of secondarystructures, and the lengths of secondary structure elements andconnecting loops differ significantly. Since these proteins havevery different functions, it seems unlikely that the structural sim-ilarity results from divergence from a common ancestor. 
These are analogous folds, sharing a common folding pattern that is advantageous from a physical–chemical point of view. Although topological similarity does not guarantee an evolutionary relationship, it is a necessary requirement.

The utility of the topological approach extends to any protein fold with a substantial content of β-structure. β-strands have well-defined relative directions, parallel or antiparallel, which can be deduced from the patterns of hydrogen bonds linking them. Within such folds it is often the case that the up/down approximation to relative directions is also reasonably good for the helices. Within the realm of all-α folds, where packing angles between helices are more variable, this approximation is poor and the topological approach is much less powerful.

Another weakness of the topological approach is the reliance on automatic methods of secondary structure assignment, which do not always produce the answer expected by visual inspection of a structure. An example is the cartoon for 1pdz domain 2 in Figure 2. Most experts would consider the structure to be an eight-stranded barrel, yet DSSP, and in consequence the cartoon, shows nine strands. The first strand is split because of an irregularity that disrupts the ideal pattern of hydrogen bonds. This type of problem is not uncommon and can result in closely related structures appearing, from the TOPS cartoons, to be more different than they actually are. Within our program we have implemented some heuristic algorithms to detect such situations and make amends, for instance, merging the first two strands of 1pdz into one bigger strand based on geometrical criteria. These options are often useful, but are not turned on by default.

Future work would need to focus on the problems associated with automatic secondary structure assignment highlighted above. In addition, we plan to develop algorithms capable of recognizing more complex β-architectures. In a slightly different vein we have implemented database search algorithms (Gilbert et al., 1999). These algorithms, which are accessible from the WWW, enable searches of our protein topology database for cartoons containing user-defined topological patterns. Another service enables a topological search for protein structural similarity.

The software described in this paper is available for use via a WWW interface at http://tops.ebi.ac.uk/tops. It can also be downloaded free for local installation. The database of TOPS cartoons is also available at this address.

Materials and methods

This work has three main aspects: the algorithm for topological analysis, the algorithm that uses this analysis in the layout of TOPS cartoons, and the software for manual editing of cartoons. Each aspect is described in a separate section below.

The topological analysis algorithm

The main steps of the algorithm are shown in the form of a flow diagram in Figure 3 and explained in detail below. The first step of this process is assignment of secondary structure. For this, we rely on the established secondary structure assignment methods DSSP (Kabsch & Sander, 1983), which we use by default, or STRIDE (Frishman & Argos, 1995). These methods both require a protein structure in which positions of all backbone atoms are defined, and provide, in addition to secondary structure, information about hydrogen bonds between main-chain atoms.
This information, along with Cα coordinates, is the required input to our algorithm.

The algorithm calculates the following quantities, which constitute our definition of a topological representation of a protein structure: the sequence of secondary structure elements (β-strands and α-helices); the grouping of strands into sheets, barrels, and sandwiches; the exact position and orientation of strands within these β-structures; whether sheets are curved or flat; the approximate relative orientation (up or down) of helices, relative to other helices and β-structures; spatial neighbor relationships, other than those implied by adjacency in β-structures; and the chirality of selected supersecondary structures.

Fig. 3. A flow diagram showing the main steps of the topological analysis algorithm.

A major problem in the topological analysis is robustness in the face of the wide variety of irregularities observed in a large proportion of real protein structures. For example, β-strands often have bulges, bends, and twists; sheets can have bifurcations; and barrels can have “satellite” strands. Procedures in the new algorithm to handle such irregularities are responsible for many improvements in performance.

Step 1. Initialization of list of SSEs

Following Figure 3, the first step of the algorithm is to initialize the list of SSEs. These are defined to be sequentially contiguous groups of residues having either helix (H) or extended (E or strand) secondary structure type. There are user options to select whether 3₁₀ or π helices are to be included with α helices.

Step 2. Annotation of SSEs

SSEs of extended (strand) type are always linked to at least one other element of the same type by hydrogen bonds. Pairs of strands linked in this way are spatially adjacent in the fold and have a well-defined relative direction, parallel or antiparallel. We call such
We call suchpairs of strands “bridge partners”; here our terminology differsfrom the norm, which uses bridge to refer to a connection betweenpairs of residues, rather than pairs of strands. The relative directionof bridge partner strands can be determined from the pattern ofhydrogen bonds between them~see,for example, Kabsch & Sander,1983!. At this point, each strand is annotated with a list of otherstrands that are its bridge partners, and whether these relationshipsare parallel or antiparallel.The next step is the derivation of a best-fit vector approximationfor each SSE. This calculation yields start and end points for theSSE in 3D space, with the vector joining the points representingthe position of the SSE in the fold. While this vector is not itselfconsidered part of the topological description of the fold, it is usedin the calculation of topological quantities, like neighbor relation-ships and relative directions, when they are not determined by thestronger hydrogen bond related criteria. Two secondary structureelements are considered to be neighbors if the distance between themidpoints of their vector representations is less than a user-definedcutoff~typically12 Å!.Step 3. Detection and analysis ofb-sheetsandb-barrelsStrand type secondary structures form larger structures, likeb-sheetsandb-barrels,through their bridge partner relationships.Detection of these structures begins with theb-graph ~Kochet al.,1992; Flower, 1994a, 1994b!. The vertices of this graph are thestrand type secondary structure elements, and these are connectedby edges if and only if the two vertices are bridge partner strands.The algorithm recognizes barrel structures by searching for cyclesin theb-graph,using a method like that employed by Flower~Flower,1994b!. At this point, eachconnectedcomponent of theb-graphis labeled as either abarrel,if it contains a cycle, or asheetotherwise.The algorithm for barrel detection is far from ideal. 
The reasonfor this is that many protein structures contain a substructure thata human expert would call ab-barrel,but do not have cycles in901theirb-graphs.Usually this is because the barrel is slightly dis-torted, and the hydrogen bonds that are normally present are so farfrom ideal that they are missed by standard methods for theirdetection. Nevertheless, TOPS retains the cycle in theb-graphasthe defining feature of ab-barrel,leaving the “broken” barrels tobe considered as sheets of high curvature, which are dealt with instep 4.The next step for the algorithm is to determine the position ofindividual strands within sheets and barrels. By position we mean,in the case of barrels, the order of appearance of each individualstrand as we circumnavigate the barrel, and in the case of sheets,the order of appearance as we move from one edge of the sheet tothe other. The positions of strands within a barrel are determinedby the algorithm that detects cycles in theb-graph.In some pro-teinsb-barrelstructures are found to have “satellite” strands; thatis, strands that are hydrogen bonded to strands within the barrel butdo not form part of the cycle. Such strands are assigned a positionin the barrel that is the same as that of the nearest barrel strand towhich they are connected by a path in theb-graph.To determine the relative position of strands within sheets, thefirst step is to determine the local structure of the sheet aroundeach strand. This involves taking each strand in turn and splittingits bridge partner strands into two groups, those lying on one sidein the first group, those on the other side in the second. Thiscalculation uses the hydrogen bond overlap and geometric criteriashown in Figure 4. The hydrogen bond overlap criterion is thestronger, and robustness of the algorithm requires that all possiblebridge overlaps be used before resorting to the geometrical calcu-lation. 
When this is completed for each strand, the sheet is repre-sented by the graph illustrated in Figure 5A. Here nodes~numbered1–8! represent strands. The arrowsleavingeach node connect it toits bridge partners, and labels on these arrows indicate in which ofthe above groups~or!the bridge partner lies. For strand 3, forinstance, the bridge partners are 1 and 2 in group , and 4 ingroup . For strands with only one bridge partner, there is onlyone group~ !.Arrows always occur in pairs: if strand n is a bridgepartner of strand m, then strand m is a bridge partner of strand n.ABFig. 4.Criteria used to determine, for two strands, each a bridge partner ofa reference strand, whether they lie on the same or opposite sides of thatreference strand. Residues in each strand are labeled with integers 1– 6.~A!H bonds between the strands overlap~i.e.,both of the outside strandsform H bonds to residues 1, 2, and 3 of the central strand!. In this case, thetwo outside strands must be on opposite sides of the central strand. Whenthere are no such overlaps, it is necessary to resort to the geometric criteriaillustrated in~B!.Here the strands are on opposite sides if the angleubetween the planes ACB and ACD is in the range 908u2708. Other-wise they are on the same side.