Thesis (Index)
<-
Sean Forman <-
You Are Here
Torsion Angle Selection and Emergent Non-local Secondary
Structure in Protein Structure Prediction
Sean Forman
Department of Math and Computer Science
Saint Joseph's University
Working with
Alberto Segre, University of Iowa
Yinyu Ye, University of Iowa
Kenneth Murphy, University of Iowa
What use is a genome?
- The genome is a set of A, G, T, and C patterns.
- Within the genome, genes are informational sequences of DNA bases.
- They are a list of amino acids that make up different proteins.
Why do we care about proteins?
- Nearly all bodily processes are regulated by proteins.
- Knowing a protein's structure (also called
conformation or fold) can tell us what sort of drugs
will interact with that protein.
- Finding the structure is harder than it might seem.
What determines a protein's structure?
- It is widely believed that the sequence of amino acids
determines the 3-D structure of the protein, (Anfinsen
1973, [#!anfinsen73:-princ!#]).
- Sequence is relatively cheap to determine - Human Genome
Project
- We want to use this sequence to find new proteins and their
sequences, and then determine their 3-D structure.
- Expensive to determine - NMR, X-ray Crystallography
- Then we want to use the structure to determine how the protein
works, and then we want to determine how the proteins interact.
What do proteins look like?
- A protein is an open chain of amino acids, like beads on a
string.
- Amino acids are molecules of seven or more atoms. There are
twenty different types of amino acids.
What is an amino acid's structure?
- Amino acids are made of carbon, hydrogen, nitrogen,
oxygen and sulfur atoms.
- There are twenty different amino acids. Each amino acid has the
same backbone and a unique sidechain.
- Amino acid backbones plug together to form a protein.
How can the shape of the amino acid change?
- The bond lengths and angles are essentially fixed.
- The main allowable movement is in the bond torsion angles
. The rotation along the bonds.
- Also, the
angle occurs only as
or
since it is a partial double bond.
- Sidechains can rotate as well, but they are local changes.
- The primary variables in a protein's conformation are its
torsion angles.
Energetics: How does the long chain become a globular molecule?
- Folding is caused by molecular attractions and repulsions
(energetics) within the protein and with the surrounding water.
and
like to share hydrogens, but
doesn't.
- Proteins fold in water, and water is polar (shares hydrogens).
Proteins strive to expose
and
to outside or each other and
keep
on inside. These interactions are called hydrogen
bonds.
- Because of this sharing behavior some amino acids are termed
hydrophobic and some hydrophilic.
Effects of hydrogen bonding and secondary structure
- There are two common secondary structures that allow proteins
to form many hydrogen bonds.
-
are helical formations.
-
are long strands of amino acids that come
together to form
.
- There are other ways, but they occur less frequently.
- The amino acids in these structures correspond to repeating
torsion angle values.
Figure:
-helix hydrogen bonding pattern. These images
are produced using Rasmol [#!program-rasmol!#].
|
Figure:
Hydrogen bonding patterns for
parallel and anti-parallel
-sheets.
|
Problem Description
The protein folding problem is defined as the prediction of the
protein's three-dimensional (tertiary) structure from its amino acid
sequence.
MTYKLILNGKTLKGETTTEAVDAATAEK
VFKQYANDNGVDGEWTYDDATKTFTVTE
Current Techniques in Protein Folding
- Pattern Recognition
- sequence homology - match target's amino acid sequence
to sequences of known structure
- fold recognition - match target's sequence to known
structures and see how well they fit
- Molecular Modeling - ab initio techniques
- Formulate an energy function to minimize
- Search a space of potential conformations
- Simplifying assumptions
- Proven
NP-complete [#!berger98:-protein-hp-np-compl!#].
HOPS-Hybrid Optimizer of Protein Structure
- First, we reduce the continuous range of torsion angles to a
discrete set using clustering.
- Full protein representation
- We then search using the discrete set of torsion
angles and a simplified energy function.
- Search done in parallel with a variety of techniques.
- In certain cases, we will consider a continuous relaxation of
the discrete torsion angles.
Protein Representation
- The protein backbone is modeled as an open link chain, e.g. a
robot arm, with a homogeneous coordinate system.
- Updating an amino acid's position takes a series of
matrix multiplications. Position is location and orientation.
- Backbone bond lengths and angles are fixed.
- Only torsion angles can change.
- Sidechain libraries are pre-defined.
Searching
- We begin with the very first amino acid, and choose from the
and
angles found in our discretization.
- From there, we continue adding amino acids to our protein.
- This can be viewed as a search tree.
- Certain angle combinations can lead to steric clashes which
causes backtracking.
- Clashes stored in a failure cache accessed with a hash function.
@fontpicture(6316,4765)(301,-3962)
(281,-3415)(0,0)[lb]AAcid

%
(281,-3939)(0,0)[lb]AAcid

%
(5026,-3736)(0,0)[lb]at each level%
(5026,-3436)(0,0)[lb]

angle choices%
(1051,-286)(0,0)[lb]AAcid

%
(1051, 89)(0,0)[lb]AAcid

%
(2851,464)(0,0)[lb]

%
(4876,464)(0,0)[lb]

%
Continuous Optimization
Contributions to Energy Function
- The energy function and its parameters are based on laboratory
data.
- Accessible Surface Area of
atoms
- ASA of
and
atoms
- ASA of the sidechain (involves entropy)
- number of hydrogen bonds between
and
- Overestimate of the potential contributions of the
protein's unfolded portion
- Since we overestimate the best solution along this path, entire
branches can be pruned.
- A good early solution means more pruning.
A Parallel Programming Method: Nagging
- scalable (transfers well to a large number of machines)
- fault-tolerant (can survive a machine crashing)
- not a partitioning method (work isn't divided among machines)
- each machine is given the current state of the solution and is
asked to search the space in an order different from the parent
machine
- infrequent communication between processors keeps down overhead
costs
- relies on a large variance in the solution time dependent on
ordering
- Nagging has been implemented on 3SAT, TSP, First-Order Logic,
Matching Problem, Game-Tree Search
Nagging and Protein Folding
- The nagger is given the angles and amino acids set by the parent
in its search.
- It receives an energy value for the current best solution and
the current angle set from the parent and then searches the remaining
angles in a different manner from the parent.
- The nagger reports any solutions or lack thereof to the parent,
who uses that information to improve its search.
Clustering
- HOPS depends on the selection of torsion angle choices for each
type of amino acid to frame the search.
is fixed. Patterns emerge in Ramachandran plots
(
angle pairs).
- Patterns are a result of the amino acid shape and sidechains.
- Angle space is a torus.
- Should choose just enough angle pairs to represent the space.
Algorithms
- Hierarchical - build clusters of increasing or decreasing size,
MST.
- Optimization - build structures to minimize some objective,
-means.
- We use a Multiple Facilities Location technique, which is an
operations research technique.
- Cluster points are facilities and the torsion angles are demand
points.
- The Weber single facility location problem is defined as find
that minimizes,
, the total distance from the cluster
points to the torsion angles.
 |
(1) |
Weber's Solution
- For the single facility problem, we have an iterative
solution [#!wesolowsky93:-weber!#].
- We choose an initial solution, and we then use the following
iterations until we converge to a
solution [#!bricker98:-facil-locat-probl-plane!#].
 |
(2) |
 |
(3) |
Multiple Cluster Points
Clustering, other issues
- How many is enough?
- We usually strive for 2-10.
- Incremental addition of points
Clustering Samples
Tweaking
- HOPS creates many
-strands, but they rarely become
-sheets.
- In some cases, we need to continuously adjust the turn torsion angles
so that two
-strands will align.
- In our protein representation, the global positions of the atoms
are stored in
matrices (position matrices), as are the changes
from one backbone atom to the next (transform matrices).
Global Position Matrix
Backbone Transform Matrix
 |
(4) |
One atom to another
 |
(5) |
A string of atoms. To align two atoms, we adjust
intermediate transform matrices.
 |
(6) |
 |
(7) |
Each transform matrix is a function of
. We want to find
that cause an equality in Equation
.
Problem Specification
or restated using vector notation as
Solving the System of Equations
- To solve this problem, we replace the equality constraint with
its First-Order Taylor Expansion.
 |
(11) |
- This creates a linear equality-constrained least-squares problem
(in terms of
which approximates the
original problem. We can then use Lagrangian multipliers to solve our
system. We iteratively improve our constraint,
- Our matrices are generally small (10-50 variables). This can be
easily and relatively quickly solved for
using a QR factorization and Meschach [#!stewart94:-mesch!#].
Tweaking, other issues
- Hold previous secondary structure fixed
- Implemented in parallel
- Implemented with iterative deepening and optimism
Future Work
- Caching of tweaks
- Tweaking and other structures - disulfides and motifs
- Incremental tweaking
- Enter CASP V
No References!
This document was generated using the
LaTeX2HTML translator Version 99.2beta8 (1.46)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -split 8 -dir /home/sforman/thesis-html/defense defense
The translation was initiated by Sean Forman on 2001-11-21
sforman@sju.edu