Thesis (Index)   <-  Sean Forman   <-  You Are Here





Protein and Protein Folding Background

Proteins are constructed from the set of twenty naturally occurring amino acids. Amino acids, and hence proteins, are organic compounds, formed from carbon, hydrogen, nitrogen, oxygen, and sulfur. Amino acids are present in much of the food we eat, and many amino acids can also be synthesized by our bodies as needed. These amino acids then form proteins as they are strung together in our cells by molecular machines called ribosomes.


Protein and Amino Acid Structure

Figure 2.1: A typical tri-peptide sequence with torsion angles.

The structures of all twenty amino acids are similar (see Appendix A for a list of amino acids and Figures A.1 to A.5 for their residues and their three- and one-letter abbreviations). We will refer to all amino acids by their three-letter abbreviations. All amino acids have a backbone and a residue. The backbone is the same for every type of amino acid and contains a nitrogen, denoted by $ \mathbf{N}$, followed by two carbons labeled the $ \alpha$-carbon, $ \mathbf{C_{\alpha}}$, and the prime-carbon, $ \mathbf{C'}$, so the backbone atoms are ordered from left to right $ \mathbf{N}$- $ \mathbf{C_{\alpha}}$- $ \mathbf{C'}$. In addition, the $ \mathbf{N}$ has a hydrogen, $ \mathbf{H}$, attached to it, the $ \mathbf{C'}$ has an oxygen, $ \mathbf{O}$, attached to it (known as the carbonyl $ \mathbf{O}$), and the $ \mathbf{C_{\alpha}}$ has a $ \mathbf{H}$ and the residue attached to it. To form the protein, the amino acid backbones come together by forming peptide bonds between the $ \mathbf{C'}$ of an amino acid and the $ \mathbf{N}$ of the following amino acid (see Figure 2.1). Hence, the protein is a long chain of amino acids bonded end-to-end.

The other part of the amino acid is the residue (also known as the sidechain). The residue is a set of atoms attached to the central $ \mathbf{C_{\alpha}}$. Each type of amino acid is distinguished by its residue. The residues vary from the simple (Gly, with a single hydrogen atom) to the complicated (Trp, with a fused double ring). The atoms of the residue are labeled using the Greek alphabet, beginning with the backbone's $ \alpha$-carbon and followed by the residue's $ \beta$-carbon, $ \gamma$-carbon, $ \delta$-carbon, $ \epsilon$-carbon, etc.


Covalent Bonding

Covalent bonding between atoms is the primary definer of protein structure. Among the parameters of a bond are the bond length, $ L$, the bond angle, $ \kappa$, and the torsion angle, $ \theta$. See Figure 2.2 for examples of how these three parameters are calculated. The values of the parameters for a covalent bond depend upon the atoms forming the bond and the processes that went into creating the bond. We will mention the characteristics of four types of covalent bonding prevalent in proteins.

Figure 2.2: Examples of how the bond length, bond angle and torsion angle are determined for a bond.
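As a minimal sketch of how the three parameters can be computed from atomic coordinates, the following Python functions take points as `(x, y, z)` tuples. The function names are illustrative and not taken from the thesis software; the torsion angle is returned in degrees in the range from -180 to 180, using the common sign convention in which a clockwise rotation viewed along the central bond is positive.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def norm(a):
    return math.sqrt(dot(a, a))

def bond_length(p, q):
    """Bond length L: distance between two bonded atoms."""
    return norm(sub(q, p))

def bond_angle(p, q, r):
    """Bond angle kappa in degrees: planar angle at q formed by p-q-r."""
    u, v = sub(p, q), sub(r, q)
    cosine = dot(u, v) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def torsion_angle(p, q, r, s):
    """Torsion angle theta in degrees about the q-r bond for atoms p-q-r-s."""
    b1, b2, b3 = sub(q, p), sub(r, q), sub(s, r)
    n1, n2 = cross(b1, b2), cross(b2, b3)   # normals of the two planes
    y = dot(b2, cross(n1, n2)) / norm(b2)
    x = dot(n1, n2)
    return math.degrees(math.atan2(y, x))
```

With four atoms placed so that the first and last lie on the same side of the central bond, `torsion_angle` returns a value near 0 (the Cis arrangement); with the atoms on opposite sides it returns a value near 180 in magnitude (Trans).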

Single bonds are the most common bonds seen in proteins. The name refers to the single electron pair the two atoms share to complete their electron shells [18]. Atoms bonded in this manner are roughly equivalent to two balls connected by a single rigid stick. While the stick can be stretched or flexed only slightly, the balls are allowed to rotate freely on the end of the stick. This means that when considering these bonds, we have essentially only one degree of freedom, their torsion angles. The length of a single bond depends on the atoms involved: $ \mathbf{C}$- $ \mathbf{C}$ bonds have a length of $ 1.52 $Å (angstroms), while $ \mathbf{C}$- $ \mathbf{N}$ bonds have a length of $ 1.45 $Å [20]. Observed lengths deviate only slightly from these standard values.

Double bonds involve the sharing of two electron pairs. These bonds do not occur along the backbone of the protein, but are common in some sidechains: His, Asn, Gln, and Trp. Unlike single bonds, these bonds are not free to rotate. They tend to have fixed torsion angles. As with single bonds, the length of a double bond depends on the atoms involved: $ \mathbf{C}$= $ \mathbf{C}$ bonds have a length of $ 1.33 $Å, while $ \mathbf{C}$= $ \mathbf{N}$ bonds have a length of approximately $ 1.27 $Å [18].

Peptide bonds connect amino acids. Peptide bonds are neither single nor double bonds; they have a partial double bond character. On the backbone, this means that the molecule oscillates between two states. The first state is a single bond between the $ \mathbf{C'}$ and its $ \mathbf{O}$ and a double bond between $ \mathbf{C'}$ and the following amino acid's $ \mathbf{N}$. The other state is a double bond between the $ \mathbf{C'}$ and its $ \mathbf{O}$ and a single bond between $ \mathbf{C'}$ and the following amino acid's $ \mathbf{N}$. Due to its partial double bond character, the peptide bond between $ \mathbf{C'}$ and $ \mathbf{N}$ is slightly shorter ( $ 1.33 $Å) than a single $ \mathbf{C}$- $ \mathbf{N}$ bond and slightly longer than a double $ \mathbf{C}$= $ \mathbf{N}$ bond. It has also been observed that the $ \mathbf{C'}$- $ \mathbf{N}$ bond's torsion angle is very close to one of two values. An orientation of approximately $ 180^\circ$ (known as Trans) is the most common and occurs over 99% of the time. The orientation of approximately $ 0^\circ$ (known as Cis) accounts for all the other outcomes. The characteristics of this bond further restrict the freedom of motion in the backbone. Off the protein's backbone, partial double bonds occur in some sidechains as well: Asp, Glu, Arg, Phe, Tyr and Trp.

Disulfide bridges are a rare but important subset of single covalent bonds. The amino acid Cys (see Figure A.2) features a sulfur ( $ \mathbf{S}$) atom with an attached $ \mathbf{H}$ at the end of the residue. When two Cys amino acids are within close proximity to each other, an oxidation reaction will occur and the two $ \mathbf{S}$ atoms will release their $ \mathbf{H}$ and form a strong covalent bond [20]. These bonds are known as disulfide bridges and will dramatically reduce the maneuverability of the protein. Cys is not a very common amino acid, and there is also no guarantee that a Cys will take part in a disulfide bridge. However, even with just eight Cys in a protein there are $ \displaystyle{\frac{1}{4!}\binom {8}{2}\binom {6}{2}\binom {4}{2}\binom {2}{2}} = 105$ possible bonding configurations forming four disulfide bridges, and many more if we consider configurations with 0 to 4 disulfide bridges formed.
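The count above can be checked with a short calculation. The sketch below, with illustrative function names, evaluates the same product of binomial coefficients and then extends it to configurations in which only some of the Cys are bridged, by first choosing which Cys take part.

```python
from math import comb, factorial

def perfect_pairings(n_cys):
    """Ways to pair up all n_cys cysteines into n_cys/2 disulfide bridges."""
    if n_cys % 2 != 0:
        raise ValueError("an even number of cysteines is required")
    bridges = n_cys // 2
    ways = 1
    remaining = n_cys
    for _ in range(bridges):
        ways *= comb(remaining, 2)   # choose the next bonded pair
        remaining -= 2
    return ways // factorial(bridges)  # the order of the bridges is irrelevant

def configurations_with_bridges(n_cys, bridges):
    """Ways to form exactly `bridges` disulfide bridges among n_cys cysteines."""
    return comb(n_cys, 2 * bridges) * perfect_pairings(2 * bridges)

four_bridges = perfect_pairings(8)
zero_to_four = sum(configurations_with_bridges(8, k) for k in range(5))
print(four_bridges, zero_to_four)
```

For eight Cys this gives 105 configurations with four bridges and 764 configurations when anywhere from 0 to 4 bridges may form.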

In addition to the bond lengths, the backbone bond angles are also relatively fixed. A bond angle is the planar angle formed by the bonds of three sequential atoms. For instance, the bond angle formed by $ \mathbf{N}$- $ \mathbf{C_{\alpha}}$- $ \mathbf{C'}$ is $ 109.5^\circ$, by $ \mathbf{C_{\alpha}}$- $ \mathbf{C'}$- $ \mathbf{N}$ is $ 116^\circ$, and by $ \mathbf{C'}$- $ \mathbf{N}$- $ \mathbf{C_{\alpha}}$ is $ 122^\circ$ [20].

As mentioned before, the main degree of freedom within the bonds is the torsion angle. Canonical names have been given to the three torsion angles related to the backbone bonds. The torsion angle about the $ \mathbf{N}$- $ \mathbf{C_{\alpha}}$ bond is called $ \phi$, the torsion angle about the $ \mathbf{C_{\alpha}}$- $ \mathbf{C'}$ bond is called $ \psi$, and the torsion angle about the bond between the $ \mathbf{C'}$ and the following amino acid's $ \mathbf{N}$ is called $ \omega$ (see Figure 2.1). As we noted before, $ \omega$ is the torsion angle of the peptide bond and is essentially restricted to $ 180^\circ$ or $ 0^\circ$. The $ \phi$ and $ \psi$ angles, however, are allowed to rotate much more freely.

In addition to the three backbone torsion angles, there are rotamer angles associated with the torsion angles found in the single bonds of the amino acid sidechains. These angles begin with $ \chi_1$, the torsion angle of the $ \mathbf{C_{\alpha}}$- $ \mathbf{C_{\beta}}$ bond, and continue on with $ \chi_2, \chi_3$, etc. Changes in rotamer and backbone torsion angles have different effects on the protein's conformation. A change in a rotamer angle will only affect the location of a single sidechain's atoms, but a change in $ \phi$, $ \psi$, or $ \omega$ will affect the location of every backbone and sidechain atom following it.

A protein is, therefore, a long chain that varies in shape primarily because of the rotation of each amino acid's $ \phi$ and $ \psi$ angles, and their choice of two possible $ \omega$ values.
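To make this concrete, the following is a minimal sketch that builds backbone coordinates from a list of $ (\phi, \psi, \omega)$ triples, holding the bond lengths and bond angles fixed at the values quoted earlier in this section. The atom placement routine is a standard internal-to-Cartesian conversion; it and all function names are assumptions chosen for illustration, not the method used by our program. The $ \phi$ of the first amino acid and the $ \psi$ and $ \omega$ of the last are unused because the neighboring atoms they would position do not exist.

```python
import math

# Ideal backbone geometry quoted in the text (angstroms and degrees).
LEN_N_CA, LEN_CA_C, LEN_C_N = 1.45, 1.52, 1.33
ANG_N_CA_C, ANG_CA_C_N, ANG_C_N_CA = 109.5, 116.0, 122.0

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    length = math.sqrt(dot(a, a))
    return (a[0] / length, a[1] / length, a[2] / length)

def place_atom(a, b, c, length, angle_deg, torsion_deg):
    """Return d with |c-d| = length, angle b-c-d = angle_deg, torsion a-b-c-d = torsion_deg."""
    theta, tau = math.radians(angle_deg), math.radians(torsion_deg)
    bc = unit(sub(c, b))
    n = unit(cross(sub(b, a), bc))   # normal to the a-b-c plane
    m = cross(n, bc)                 # completes the local frame
    x = -length * math.cos(theta)
    y = length * math.sin(theta) * math.cos(tau)
    z = length * math.sin(theta) * math.sin(tau)
    return tuple(c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3))

def build_backbone(torsions):
    """torsions: list of (phi, psi, omega) in degrees. Returns [N, CA, C', N, CA, C', ...]."""
    a = math.radians(ANG_N_CA_C)
    ca = (LEN_N_CA, 0.0, 0.0)
    atoms = [(0.0, 0.0, 0.0), ca,
             (ca[0] - LEN_CA_C * math.cos(a), LEN_CA_C * math.sin(a), 0.0)]
    for i in range(1, len(torsions)):
        _, psi_prev, omega_prev = torsions[i - 1]
        phi = torsions[i][0]
        # psi of the previous amino acid positions the next N.
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], LEN_C_N, ANG_CA_C_N, psi_prev))
        # omega of the previous amino acid positions the next C-alpha.
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], LEN_N_CA, ANG_C_N_CA, omega_prev))
        # phi of the current amino acid positions its C'.
        atoms.append(place_atom(atoms[-3], atoms[-2], atoms[-1], LEN_CA_C, ANG_N_CA_C, phi))
    return atoms

def torsion_angle(p, q, r, s):
    """Torsion angle in degrees about the q-r bond, used to check the construction."""
    b1, b2, b3 = sub(q, p), sub(r, q), sub(s, r)
    n1, n2 = cross(b1, b2), cross(b2, b3)
    return math.degrees(math.atan2(dot(b2, cross(n1, n2)) / math.sqrt(dot(b2, b2)), dot(n1, n2)))
```

Changing a single $ \psi$ in the input leaves every atom before that bond where it was and moves every atom after it, which is the behavior described above for backbone torsion angles.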


Torsion Angles

Each atom has an electron shell which delineates the perimeter of its atomic sphere. This shell prevents atoms (not engaged in a covalent bond) from occupying the same volume. A molecular conformation that places two atoms into the same volume is said to have a steric clash. Steric clashes do not occur in practice because the energy of such a conformation is prohibitively high and therefore far from any minimum. Our program does not allow steric clashes to occur and disqualifies any potential conformations with a steric clash.
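A steric clash test can be sketched as a pairwise distance check against hard-sphere radii. The radii below are approximate van der Waals values supplied purely for illustration, and the function name, the `bonded_pairs` argument, and the `tolerance` parameter are hypothetical; this is not the check implemented in our program. Covalently bonded atoms legitimately sit closer than the sum of their radii, so the caller lists the index pairs to exempt.

```python
import math

# Approximate van der Waals radii in angstroms (illustrative assumption).
RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "H": 1.20}

def has_steric_clash(atoms, bonded_pairs=(), tolerance=0.0):
    """atoms: list of (element, (x, y, z)); bonded_pairs: index pairs exempt from the test."""
    exempt = {frozenset(pair) for pair in bonded_pairs}
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            if frozenset((i, j)) in exempt:
                continue
            (elem_i, pos_i), (elem_j, pos_j) = atoms[i], atoms[j]
            # Two spheres overlap when their centers are closer than the sum of radii.
            if math.dist(pos_i, pos_j) < RADII[elem_i] + RADII[elem_j] - tolerance:
                return True
    return False
```

A practical check would also exempt atoms separated by two or three bonds and would typically use a spatial grid rather than comparing all pairs.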

As mentioned before, the rotation of the bonds is the primary determinant of a protein's conformation. When the $ \phi - \psi$ angle pairs for known protein structures are studied, one finds that these angle pairs are not distributed uniformly among all possible angle choices. Specific $ \phi$ angles tend to occur with specific $ \psi$ angles and vice-versa.

Why does this happen? Certain $ \phi - \psi$ pairs swing two atoms into the same volume, causing a steric clash and making that $ \phi - \psi$ pair very unlikely. The most common $ \phi - \psi$ rotation angles are those that space the atoms a safe distance away from each other.

Ramachandran plots [64] are two-dimensional plots of $ \phi - \psi$ angle pairs with the $ \phi$ angle along the $ x$-axis and the $ \psi$ angle along the $ y$-axis. The angle combinations typically plotted come from $ \phi - \psi$ angle data found in the Protein Data Bank. The patterns of angles mentioned earlier are very easy to see when looking at Ramachandran plots of various amino acids. Note that the space being graphed is the periodic space $ [-180^\circ, 180^\circ] \times [-180^\circ, 180^\circ]$, in which $ -180^\circ$ and $ 180^\circ$ coincide on each axis. This is, in fact, a torus.
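Because the space is periodic, angles must be wrapped before they are binned or compared. The sketch below tallies $ \phi - \psi$ pairs into a grid of counts, which is the data behind a Ramachandran plot; the function names and the default bin width of 10 degrees are illustrative assumptions, and the bin width is assumed to divide 360 evenly.

```python
def wrap_angle(degrees):
    """Map any angle in degrees onto the interval [-180, 180)."""
    return (degrees + 180.0) % 360.0 - 180.0

def ramachandran_histogram(pairs, bin_width=10):
    """Count (phi, psi) pairs on a periodic grid; counts[row][col] has psi by row, phi by column."""
    bins = 360 // bin_width
    counts = [[0] * bins for _ in range(bins)]
    for phi, psi in pairs:
        col = int((wrap_angle(phi) + 180.0) // bin_width) % bins
        row = int((wrap_angle(psi) + 180.0) // bin_width) % bins
        counts[row][col] += 1
    return counts
```

With this wrapping, an angle pair reported as $ (303^\circ, 313^\circ)$ lands in the same cell as $ (-57^\circ, -47^\circ)$, as it should on a torus.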

The densely populated areas on the Ramachandran plot (see Appendix C) represent angle pairings that are most likely to occur. As one can see from Figures C.1 to C.4, the sidechain has a significant impact on the distribution of $ \phi - \psi$ angle pairs. Larger sidechains tend to be more restrictive, while Gly, which has only a solitary $ \mathbf{H}$ atom for a sidechain, has a much wider variety of observed $ \phi - \psi$ combinations. Interestingly, these common angle conformations also correspond to regular patterns within the backbone known as secondary structure.


Secondary Structure

A secondary structure is a repeating three-dimensional structure with a fixed bonding pattern. The most common structures are helices and sheets. These structures are not formed by strong covalent bonding, but by weaker hydrogen bonding between atoms on different amino acids.

$ \mathbf{N}$ atoms (both backbone and otherwise) often have attached $ \mathbf{H}$ atoms which they are willing to share with other atoms. These $ \mathbf{N}$ atoms are known as donors. Acceptors, mostly $ \mathbf{O}$, are attracted to these donated $ \mathbf{H}$ atoms because of their opposite partial charges. This interaction can occur between atoms in any part of the protein, but we are largely concerned with interactions where the donor is a backbone $ \mathbf{N}$, and the acceptor is the carbonyl $ \mathbf{O}$ (the $ \mathbf{O}$ bonded to $ \mathbf{C'}$) from a different amino acid. These atoms are typically on amino acids three or more positions apart in the protein's sequence. Hence, these interactions are long range, rather than local interactions as with covalent bonding. All of the common secondary structure is caused by patterned formation of hydrogen bonds between the backbone atoms.

Figure 2.3: $ \alpha$-helix hydrogen bonding pattern. These images are produced using Rasmol [62].

The most common protein secondary structures are $ \alpha$-helices. As can be seen in Figure 2.3, the hydrogen bonds in an $ \alpha$-helix occur between the carbonyl $ \mathbf{O}$ of the $ i$ amino acid and the $ \mathbf{N}$ of the $ i+4$ amino acid. Since this forms an extremely regular helical pattern, the $ \phi$ and $ \psi$ torsion angles usually are near a characteristic set of values (Table 2.1).


Table 2.1: Types of helices, their characteristic angles, and frequencies [77].

Type of Helix      $ \phi$-angle   $ \psi$-angle   Frequency   Bond Interval
$ \alpha$-helix    $ -57^\circ$    $ -47^\circ$    98%         4 amino acids
$ 3_{10}$ helix    $ -49^\circ$    $ -26^\circ$    1%          3 amino acids
$ \pi$-helix       $ -57^\circ$    $ -80^\circ$    1%          5 amino acids


Table 2.2: Types of $ \beta$-sheets and their characteristic angles [77].

Type of Strand     $ \phi$-angle   $ \psi$-angle
parallel           $ -119^\circ$   $ 113^\circ$
anti-parallel      $ -139^\circ$   $ 135^\circ$
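The characteristic angles in Tables 2.1 and 2.2 suggest a simple nearest-neighbor labeling of a $ \phi - \psi$ pair, sketched below. The distance is measured on the torus, so angles near $ \pm 180^\circ$ compare correctly. This labeling rule and its function names are illustrative assumptions, not a secondary structure assignment method taken from the literature cited here.

```python
import math

# Characteristic (phi, psi) pairs in degrees, from Tables 2.1 and 2.2.
CHARACTERISTIC = {
    "alpha-helix": (-57.0, -47.0),
    "3-10 helix": (-49.0, -26.0),
    "pi-helix": (-57.0, -80.0),
    "parallel strand": (-119.0, 113.0),
    "anti-parallel strand": (-139.0, 135.0),
}

def angular_difference(a, b):
    """Smallest absolute difference between two angles in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def torus_distance(p, q):
    """Euclidean distance between two (phi, psi) pairs on the periodic space."""
    return math.hypot(angular_difference(p[0], q[0]), angular_difference(p[1], q[1]))

def nearest_structure(phi, psi):
    """Name of the tabulated structure whose characteristic angles are closest."""
    return min(CHARACTERISTIC,
               key=lambda name: torus_distance((phi, psi), CHARACTERISTIC[name]))
```

For example, `nearest_structure(-60, -45)` selects the $ \alpha$-helix entry, since that pair lies only a few degrees from $ (-57^\circ, -47^\circ)$.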

Certainly, one can imagine that winding the helix more tightly or more loosely would likewise produce a regular pattern of hydrogen bonds, and indeed, two other helical structures do occur. However, the $ \phi - \psi$ angles required to produce these alternate helices orient the protein in such a way that steric clashes between amino acids within the helix occur more often. The $ \alpha$-helix configuration avoids these steric clashes between amino acids. The two other types of naturally occurring helices are called $ \pi$-helices and $ 3_{10}$ helices. Rather than producing hydrogen bonds between the $ i$ and $ i+4$ amino acids as with $ \alpha$-helices, the more tightly wound $ 3_{10}$ helices form hydrogen bonds between the $ i$ and $ i+3$ amino acids. The more loosely wound $ \pi$-helical structure is formed by a series of hydrogen bonds between the $ i$ and $ i+5$ amino acids. Left-handed helices, which rotate in the opposite direction of the previous three helices, are another rare secondary structure.

The other common secondary structures are $ \beta$-sheets. $ \beta$-sheets are formed when two relatively straight protein backbone segments lie alongside each other and hydrogen bonds form between the two segments. The individual segments are referred to as $ \beta$-strands. There are two types of $ \beta$-sheets, anti-parallel and parallel (see Table 2.2 for standard torsion angles).

$ \beta$-sheets are typically anti-parallel. In this form, the directions of the two $ \beta$-strands run opposite to each other (as illustrated in Figure 2.4). Just as with $ \alpha$-helices, hydrogen bonds form in a regular manner between the two $ \beta$-strands. Hydrogen bonds form between the $ i$th amino acid's carbonyl $ \mathbf{O}$ and the $ j$th amino acid's backbone $ \mathbf{N}$, and between the $ i$th amino acid's $ \mathbf{N}$ and the $ j$th amino acid's carbonyl $ \mathbf{O}$. Unlike the $ \alpha$-helices, the hydrogen bonding pattern between the two $ \beta$-strands skips the $ i+1$ and $ j-1$ amino acids and continues with the $ i+2$ and $ j-2$ amino acids.

Figure 2.4: Hydrogen bonding patterns for parallel and anti-parallel $ \beta$-sheets.

A less common conformation is the parallel $ \beta$-sheet. In this case, the two strands are running in the same direction (as illustrated in Figure 2.4). This orientation is less common because a much longer length of intermediary protein must occur for these two strands to align in this way. Rather than the carbonyl $ \mathbf{O}$ and backbone $ \mathbf{N}$ of one amino acid lining up with their counterparts on the other amino acid, the bonding pattern is staggered. The backbone $ \mathbf{N}$ of the $ i$ amino acid forms a hydrogen bond with the carbonyl $ \mathbf{O}$ of the $ j$ amino acid, but the $ i$ carbonyl $ \mathbf{O}$ forms a hydrogen bond with the $ j+2$ backbone $ \mathbf{N}$. Then the $ i+2$ backbone $ \mathbf{N}$ will form a hydrogen bond with the $ j+2$ carbonyl $ \mathbf{O}$ and so on.
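The two bonding patterns can be written out as lists of hydrogen bonds. In this sketch each bond is a pair `(donor, acceptor)`, meaning the backbone $ \mathbf{N}$ of amino acid `donor` bonds to the carbonyl $ \mathbf{O}$ of amino acid `acceptor`. The function names and the representation are assumptions made for illustration.

```python
def antiparallel_hbonds(i, j, n_pairs):
    """Anti-parallel pattern: amino acids i and j bond in both directions, then i+2 and j-2, etc."""
    bonds = []
    for k in range(n_pairs):
        a, b = i + 2 * k, j - 2 * k
        bonds.append((a, b))   # N of a to carbonyl O of b
        bonds.append((b, a))   # N of b to carbonyl O of a
    return bonds

def parallel_hbonds(i, j, n_steps):
    """Parallel pattern: staggered bonds N(i)-O(j), N(j+2)-O(i), N(i+2)-O(j+2), etc."""
    bonds = []
    for k in range(n_steps):
        a, b = i + 2 * k, j + 2 * k
        bonds.append((a, b))       # N of a to carbonyl O of b
        bonds.append((b + 2, a))   # N of b+2 to carbonyl O of a
    return bonds
```

In the anti-parallel list each bonded pair of amino acids appears twice, once in each direction, while in the parallel list every amino acid on the first strand bonds to two different amino acids on the second strand.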

Figure 2.5: Protein G exhibits the global nature of $ \beta$-sheet formation. It features a parallel $ \beta$-sheet between amino acids 3 through 8 and amino acids 50 through 55 which combines a pair of anti-parallel $ \beta$-sheets.

While $ \alpha$-helix formation is largely a local phenomenon, with bonds forming between amino acids three to five positions apart on the amino acid sequence, $ \beta$-sheets can form across a large number of amino acids. This makes prediction of sheet structure quite different from that of helix structure. Helix prediction can be thought of as a local optimization problem, while sheet prediction or formation is a global optimization problem. In optimization, global solutions are more difficult to find than local ones, so the prediction of $ \beta$-sheets is more difficult than the prediction of $ \alpha$-helices.

Often, the secondary structures are organized in larger groups or motifs. Common motifs include helix-helix, helix-loop-helix, and the Greek key motif, which is four adjacent anti-parallel $ \beta$-strands (see Figure 2.6).

Figure 2.6: Two examples of larger proteins with more complicated motifs [43,73].


Hydrophobicity

In day-to-day experience, we may think of water ( $ \mathbf{H}_2 \mathbf{O}$) as a neutral, non-interactive medium, but it actually has a small charge distributed on either side of the water molecule. Part of the molecule has a slightly positive charge, and part of it is slightly negative. Molecules with this characteristic are called polar. This means that when a charged or polar molecule is placed in water, the water molecules align around it and interact with it. Non-polar molecules placed in water will attempt to minimize their interaction with the polar water.

In the case of proteins, only some of the amino acids are polar and, therefore, interact with water when exposed to it. These amino acids are called hydrophilic (having an affinity for water) and include charged and polar amino acids like Ser, Thr, Asp, Tyr and Trp. Being exposed to water allows their polar atoms to hydrogen bond with the surrounding water molecules and is energetically beneficial to the protein. Other amino acids are called hydrophobic (repelled from water) and are generally seen on the interior of the protein away from the surface and the surrounding solvent. These include Ala, Val, Pro, and Met [12]. The hydrophobic and hydrophilic effect determines, to a large extent, how the protein will fold.


Protein Energetics

The hydrophobic and hydrophilic effect is one aspect of the protein's energetics. By energetics, we are referring to the protein's folding process as an attempt to minimize its thermodynamic energy. In addition to the hydrophobic and hydrophilic effect, interactions like hydrogen bonding (see Section 2.1.3) and sidechain entropy also contribute to the protein's energy.

Sidechain entropy relates to the conformation of each amino acid's sidechain. Sidechains have a variety of configurations that they can appear in. As they become buried, this freedom is reduced and entropy decreases. Energetically, it is preferable if these amino acid sidechains are free to interact with the solvent (as measured by accessible surface area) and are not buried within the protein.

The proper combination of these effects is still an open question and one that complicates the effort to effectively model protein folding.


Energy Minimization

Proteins are constructed of atoms constrained by bonds and affected by forces exerted on them by surrounding atoms, both in the protein and in the surrounding solvent. This naturally leads to the formulation of these problems as optimization problems. The energy function contains terms regarding the covalent bonds between atoms, the hydrophobic effect, hydrogen bonding, and other effects. This leads to a very large and complicated function that in theory should be solvable.

Due to the large number of atoms and forces involved in the protein, one may face over a thousand degrees of freedom. Additionally, the equations describing the inter-atomic forces are not linear or even quadratic, though they are polynomial. This leads to a vast number of local minima, some of which can be quite deep. Therefore, any solution technique seeking to minimize this energy function may take a long time, if the minimum can be computed at all.
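The difficulty posed by local minima shows up even in one dimension. The sketch below uses a toy torsional energy with three wells, invented purely for illustration and unrelated to any real force field, and applies plain gradient descent from two starting angles. The step size and iteration count are arbitrary choices that happen to converge for this function.

```python
import math

def energy(theta):
    """Toy one-dimensional torsional energy (hypothetical); theta in radians."""
    return math.cos(3.0 * theta) + 0.5 * math.cos(theta)

def gradient(theta):
    """Derivative of the toy energy with respect to theta."""
    return -3.0 * math.sin(3.0 * theta) - 0.5 * math.sin(theta)

def descend(theta, step=0.01, iterations=5000):
    """Follow the negative gradient downhill from the starting angle."""
    for _ in range(iterations):
        theta -= step * gradient(theta)
    return theta

trapped = descend(math.radians(30.0))    # settles in a shallow well near 63 degrees
best = descend(math.radians(170.0))      # settles in the deepest well at 180 degrees
print(math.degrees(trapped), energy(trapped))
print(math.degrees(best), energy(best))
```

The first run stops in a well whose energy is noticeably higher than the one found by the second run, and nothing in the local information available to the descent indicates that a deeper well exists. With a thousand coupled degrees of freedom the number of such wells grows enormously.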

As a sidebar, the fact that proteins are able to minimize this energy function so quickly and with such ease in nature raises a number of intriguing possibilities. First, it could be (though it is unlikely) that proteins can solve tremendously hard NP-complete problems in a trivial amount of time. This could lead to a new class of biological problem solvers. One could code a traveling salesman problem as an amino acid sequence and then study the completed structure in order to find a solution. Another possibility is that easily folded amino acid sequences are chosen by natural selection, and researchers have not been able to determine just what makes these proteins easier to fold.


Current Protein Folding Techniques

A wide variety of approaches from computer science and mathematics have been considered as potential solutions to the protein folding problem. This section is just a brief summary of current techniques used in protein folding. For gentle introductions, see Richards [66] and Hayes [35]. For detailed surveys of the field, see Neumaier [55], Creighton [20,21], Duan [26], or Dill [24]. To frame this issue, we will concentrate on techniques entered into blind protein prediction contests. Every other year since 1994, the Protein Structure Prediction Center at Lawrence Livermore Laboratories (http://predictioncenter.llnl.gov/) has conducted a Community Wide Experiment on the Critical Assessment of Techniques for Protein Structure Prediction (CASP). December 2000 saw the completion of the fourth such assessment (CASP4) [47]. These results are published in a special issue of the journal Proteins: Structure, Function, and Genetics. The review of the 2000 conference has yet to be published at this writing, so we will limit our comments to the third assessment (CASP3) from December of 1998.

CASP takes the form of blind predictions of protein structure. The target proteins have known amino acid sequences, but their recently determined three-dimensional structures are still unpublished. Since the target structures are not publicly known, the actual structures can be compared with the predicted structures, and the techniques can be evaluated for their accuracy. These targets range in function, length and type. Each research group submits five predictions along with their ``best'' prediction for each target.

Each iteration of CASP has utilized a number of different techniques for evaluating the predictions. While the emphasis has recently been placed on whole atom predictions in the form of 3-D atomic coordinate predictions, the predictions can take a number of other forms as well: alignments to publicly known structures, secondary structure assignments, and residue-residue distances [83].

There are several difficulties in evaluating a prediction. The predictions must be superimposed on the target protein, so that the two models can be compared. This superposition is not always easy to find and often iterative approximations must be performed to find the ideal superposition. Suboptimal superpositions could unduly penalize a predicted fold that has some portion of the protein poorly predicted, but another portion predicted well.

Once the superposition has been settled upon, the prediction can be evaluated in a number of ways. There are two primary methods. One is the root-mean-square deviation (RMSD) of the model's atoms from the target's atoms, which depends on the superposition used. The other is the RMSD of the torsion and rotamer angles for each amino acid. Additionally, measures like accessible surface area, buried residues, and secondary structure found can also be calculated. For comparative models, further measures related to the alignment to known structures are utilized.
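Both RMSD measures are straightforward once the atoms or angles have been matched. The sketch below assumes the model has already been superimposed on the target and that the two lists are in corresponding order; the function names are illustrative. For angles, the difference is taken the short way around the circle so that $ 179^\circ$ and $ -179^\circ$ are treated as two degrees apart.

```python
import math

def coordinate_rmsd(model, target):
    """RMSD in angstroms between matched, already superimposed atom positions."""
    if len(model) != len(target) or not model:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(model, target))
    return math.sqrt(total / len(model))

def angular_rmsd(model_angles, target_angles):
    """RMSD in degrees between matched torsion or rotamer angles, respecting periodicity."""
    if len(model_angles) != len(target_angles) or not model_angles:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for a, b in zip(model_angles, target_angles):
        d = abs(a - b) % 360.0
        d = min(d, 360.0 - d)      # shortest way around the circle
        total += d * d
    return math.sqrt(total / len(model_angles))
```

Unlike the coordinate RMSD, the angular RMSD does not depend on any superposition, since torsion angles are internal to each structure.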

For the purpose of evaluating the different techniques, CASP3 divides the predictions into three different types: comparative modeling, fold recognition and ab initio predictions.


Comparative Modeling

Comparative models attempt to match the target's amino acid sequence with the amino acid sequences of proteins whose structure is known. Then the target protein's structure is assumed to be similar to that of the matched protein's known structure. This technique performs best when one can find a family of similar proteins. The basic structure of these systems is similar with a significant amount of variation at each step.

1. Define families or classes of folds and note the sequences that produce those folds.
2. Search for sequences similar to our target by aligning it against a sequence database, such as SWISS-PROT [5].
3. The similar sequences then provide a template fold family for the target protein.
4. Fit the target protein to the template using the restraints inherent in the template.
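The sequence search step can be sketched with a global alignment score computed by dynamic programming. The scoring values below (a fixed reward for a match, fixed penalties for a mismatch or gap) are simplifying assumptions; practical tools use substitution matrices and more elaborate gap penalties. The function names and the dictionary-based library are hypothetical.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b under a simple scoring scheme."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal,               # align a[i-1] with b[j-1]
                               previous[j] + gap,      # gap in b
                               current[j - 1] + gap))  # gap in a
        previous = current
    return previous[-1]

def most_similar_family(target, library):
    """Name of the library entry whose representative sequence aligns best with the target."""
    return max(library, key=lambda name: global_alignment_score(target, library[name]))
```

Given a library mapping family names to one-letter sequences, `most_similar_family` returns the family whose sequence scores highest, which would then supply the template for the fitting step.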

There are several difficulties at each step. Protein families can be difficult to discern. Due to evolutionary effects within different species, proteins that at one point (millions and millions of years ago) were very similar in amino acid sequence may have dramatically divergent amino acid sequences, but retained a similar conformation. Likewise, there are many examples of proteins with a high degree of sequence similarity, but different folds. In addition, since unknown structures far outnumber known structures there is a large class of proteins for which no family can be found, and, hence, no model created.

For an overview of all comparative modeling techniques attempted at CASP3, see Jones and Kleywegt's analysis of the comparative models submitted [41]. Some examples of techniques that were successful at CASP3 include those submitted by Burke [16] and Yang [82].


Fold Recognition

Fold recognition (or threading) is a cousin to sequence homology. Instead of searching for significantly similar sequences and deducing the structure of the protein, a fold recognition package will assume that the unknown structure is similar to a fold that we have seen before and then search for the existence of that similar fold. The problem is then recast as determining the correct similar fold and not the correct similar sequence. Fold recognition techniques do not require similar sequences in the protein databank, just similar folds.

Proteins of decidedly different amino acid sequence within a single protein family are a byproduct of evolution. Proteins within a single family are likely to have evolved from a single protein. Along the way to their present state, there were a large number of insertions and deletions in their amino acid sequence. While the sequence similarity may now be distant, the structure similarity will remain as the evolutionary process maintains the protein's function, which depends on the protein's structure. This is seen most clearly in hemoglobin which has nearly the same shape for thousands of species [70].

Fold recognition assumes that there are a limited number of core folds from which all proteins draw their structures. One can think of these core folds as templates or patterns, that the amino acid sequence is molded or fitted to. Many of these templates have not yet been determined, but the expectation of a small number of core folds is implicit in the construction of these algorithms. Because of this, the efficacy of fold recognition, like sequence homology, is limited by the size of the Protein Databank. Assuming the proper template can be determined for a given sequence, one must then align the sequence to the fold. Due to phenomena such as deletions, insertions, varying sequence length and others, there are thousands of possible ways to match a sequence to a template.

To reconcile these two areas, threading approaches generally follow the same pattern [46]:

1. creation of a core fold library,
2. an objective function that can evaluate the quality of a sequence placement over a particular core fold (oftentimes this will be a measure using hydrophobicity, surface accessibility, and inter-amino acid interactions),
3. a search heuristic for finding the best alignment given a sequence and fold, commonly a dynamic programming technique, and
4. another search heuristic that will choose the best template from the best alignments.
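The pattern above can be illustrated with a deliberately small sketch. Each fold in the library is reduced to a string marking positions as buried (`B`) or exposed (`E`), the objective function rewards hydrophobic amino acids in buried positions and hydrophilic ones in exposed positions, and the alignment search simply tries every ungapped offset rather than using dynamic programming. The hydrophobic set follows the examples listed in the Hydrophobicity section; everything else, including the fold strings and names, is a hypothetical simplification.

```python
# One-letter codes for Ala, Val, Pro, and Met, the hydrophobic examples in the text.
HYDROPHOBIC = set("AVPM")

def placement_score(sequence, burial, offset):
    """Objective function: +1 when hydrophobicity agrees with burial, -1 otherwise."""
    score = 0
    for k, residue in enumerate(sequence):
        buried = burial[offset + k] == "B"
        score += 1 if (residue in HYDROPHOBIC) == buried else -1
    return score

def best_placement(sequence, burial):
    """Search every ungapped offset; returns (score, offset) or None if the fold is too short."""
    offsets = range(len(burial) - len(sequence) + 1)
    return max(((placement_score(sequence, burial, o), o) for o in offsets), default=None)

def best_template(sequence, library):
    """Choose the fold whose best placement scores highest; returns (name, (score, offset))."""
    placed = {name: best_placement(sequence, burial) for name, burial in library.items()}
    placed = {name: result for name, result in placed.items() if result is not None}
    best = max(placed, key=lambda name: placed[name][0])
    return best, placed[best]
```

A usage example: `best_template("AVSD", {"foldX": "EBBEE", "foldY": "EEEEBB"})` selects `foldX`, where the two hydrophobic amino acids can sit on the two buried positions.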

Murzin [52] provides an overview of the fold recognition techniques attempted at CASP3. Bryant [60] and Jones [40] are two examples of techniques that were above average performers at CASP3.


Ab Initio Techniques

Ab initio models attempt to discern the structure of a protein without any direct structural data from proteins in the same evolutionary family. Instead, they construct the protein using general principles, many of which are thermodynamic in nature. This is a far more difficult task than the other two techniques undertake, so the standards of success are often less stringent. In order to take this into account, CASP3 used some different measurements to evaluate the various entries.

RMSD results are often very poor for ab initio techniques. Instead, the CASP3 evaluators used measurements like proper recognition of protein class, proper prediction of protein fragments, and proper fold architecture. Fragments are defined by taking the RMSD over 25 or 40 amino acid Lesk Window Plots. These plots match portions of the prediction to portions of the target protein. The percentage of running window RMSDs below some minimum threshold is computed, and these percentages are used to evaluate the models. See Orengo for a full description [58] and a discussion of other evaluation techniques.

The ab initio field can be further differentiated into knowledge-based ab initio techniques and classical ab initio techniques. Knowledge-based techniques employ constraints which are developed using multiple sequence alignments or fragments from known structures. These fragments can then be combined, often utilizing an optimization technique, to produce a full protein representation. The most successful technique in this class was that of Baker's group [69]. Baker's group utilized three to nine amino acid structure fragments and then combined these fragments using simulated annealing while optimizing a scoring function. Their scoring function depended on things like hydrophobic burial, disulfide bonding, $ \alpha$-helix and $ \beta$-sheet packing and formation.

Classical ab initio techniques are the most ambitious structure prediction techniques. They typically rely on basic thermodynamic equations to define the relationships between atoms and amino acids in order to build the protein that minimizes some global energy function. These models will make simplifying assumptions about the protein's structure in order to make the enumeration of the protein's conformations manageable [24]. These methods tend to mix and match from a variety of simplifying assumptions about the geometry or energetics of a protein in order to make the problem tractable:

- they treat amino acids as point entities rather than multi-atom molecules by ignoring the sidechain or replacing it with a sphere the same radius as the fully defined amino acid [24],
- they restrict the location of the amino acids to a 2-D or 3-D lattice and lay out the protein as a self-avoiding walk where each walk's energy is evaluated [35], or
- they categorize each amino acid as either hydrophobic or hydrophilic and compute energy functions as either distances between hydrophobics and hydrophilics or as simple on/off interactions between adjacent hydrophobic and hydrophilic amino acids.
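Two of these simplifications, the lattice restriction and the hydrophobic/hydrophilic classification, combine naturally into a small example. The sketch below lays a chain out on a 2-D square lattice as a walk of unit moves, rejects walks that revisit a site, and scores a conformation with an on/off contact energy. The particular energy convention used here, -1 for each pair of hydrophobic (`H`) amino acids that are lattice neighbors without being adjacent in the chain, is an assumption chosen for illustration.

```python
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}

def walk_positions(moves):
    """Lattice sites visited by the walk, or None if the walk is not self-avoiding."""
    positions = [(0, 0)]
    for move in moves:
        dx, dy = MOVES[move]
        site = (positions[-1][0] + dx, positions[-1][1] + dy)
        if site in positions:
            return None
        positions.append(site)
    return positions

def hp_energy(hp_sequence, moves):
    """Contact energy of an H/P sequence folded along the given walk."""
    positions = walk_positions(moves)
    if positions is None:
        raise ValueError("walk is not self-avoiding")
    if len(positions) != len(hp_sequence):
        raise ValueError("a chain of n amino acids needs n - 1 moves")
    energy = 0
    for i in range(len(positions)):
        for j in range(i + 2, len(positions)):   # skip chain neighbors
            if hp_sequence[i] == "H" and hp_sequence[j] == "H":
                dx = abs(positions[i][0] - positions[j][0])
                dy = abs(positions[i][1] - positions[j][1])
                if dx + dy == 1:                 # neighbors on the lattice
                    energy -= 1
    return energy
```

For the four amino acid chain `HPPH`, the straight walk `RRR` has energy 0 while the folded walk `RUL` brings the two hydrophobic ends together for an energy of -1. Enumerating every self-avoiding walk to find the minimum is feasible only for short chains.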

These techniques often rely on very powerful computing facilities to solve their problems to completion. Also, due to the exponential growth in the number of possible conformations as the number of amino acids grows, these techniques do not scale well to large proteins.

CHARMM [15] is a widely used package for energy minimization, and Dill [24], Levitt [50] and others produced much of the early work on classical ab initio structures.

See Orengo's summary for a listing of ab initio techniques entered at CASP3 [58]. Scheraga's work [49] was deemed to be among the best of the classical ab initio techniques. This technique was most successful on proteins that were shorter than 150 amino acids and primarily helical in nature. They first use an off-lattice simplified model which treats residues as points. These models are then solved multiple times using a simulated annealing technique. This produces a distinct set of potential fold families. Each of these distinct families is then replaced by an all-atom backbone, which is again solved to minimize a potential function. Finally, sidechains are added to the backbone, and the energy is again minimized.

Overview of Techniques

Comparative modeling currently produces the most accurate models of protein structure. However, these techniques rely on the existence of families of similar proteins whose structures have already been found. While this class is continually growing, comparative modeling is not a viable option for a large number of proteins. In these cases, ab initio techniques are the best bet, but are currently limited by both a lack of understanding of the thermodynamics involved in the protein folding process and a lack of computational power.


sforman@sju.edu