Thesis (Index)   <-  Sean Forman   <-  You Are Here



Next: Tweaking Results Up: TORSION ANGLE SELECTION AND Previous: Tweaking Algorithms for -Strand Subsections


Clustering Results

This chapter presents results from an implementation of the algorithms given in Chapter 4. We collect data from a test bed of 248 proteins and run our clustering algorithm on this data. We then discuss the appropriate selection of torsion angles for HOPS and provide results from the incremental clustering algorithm.


Test Proteins

For our test bed we are using a set of 403 proteins chosen by Hobohm and Sander [37] as a large representative set from the more than 16,000 in the Protein Data Bank. These proteins have low sequence homology and have especially accurate representations in the structure database. From this set, we were able to extract torsion angles for just over 248 using PEX (see Section 3.5). PEX requires a strict adherence to the PDB standard (Appendix B), so a good number of the 403 were in a form which PEX would not read. Next, these torsion angles were divided by amino acid type. The number of data points for each amino acid type is given in Table 6.1.


Table 6.1: Amino acids and amount of torsion angle data available.
Amino Acid Angles Perc. Amino Acid Angles Perc.
Ala 4393 8.8 Arg 1937 3.9
Asn 2445 4.9 Asp 3107 6.3
Cys 792 1.6 Gln 1819 3.7
Glu 2722 5.5 Gly 4241 8.5
His 1056 2.1 Ile 2731 5.5
Leu 3817 7.7 Lys 3056 6.2
Met 852 1.7 Phe 1933 3.9
Pro 2148 4.3 Ser 3335 6.7
Thr 3201 6.4 Trp 741 1.5
Tyr 1857 3.7 Val 3468 7.0
        49651 100.0


It is interesting to note that of those 49,651 torsion angles trios, just 166 of them had an $ \omega$ angle in the cis conformation ($ 0.33\%$). We can also test our assumption that the $ \omega$ angle can be set to $ 180^\circ$. If we shift the $ \omega$ angles from a scale of $ [\ensuremath{-180^\circ}, \ensuremath{180^\circ}]$ to $ [\ensuremath{0^\circ},\ensuremath{360^\circ}]$ then we find that $ 83\%$ of all angles are between $ 175^\circ$ and $ 185^\circ$ and $ 96.6\%$ are between $ 170^\circ$ and $ 190^\circ$.


Some Clustering Results

When we generate a clustering plan, we generally begin with a set of example torsion angles (our set from Section 6.1), and a pair of parameters: the percentage of torsion angles to be within a certain distance of a cluster point and the specific distance used. For the results presented in this section, we will be using $ 30^\circ$ as our standard distance and then varying the number of torsion angles that need to be within that distance from 75% to 85% to 95%.

We will present results for two specific amino acids: Asn and Thr. The deviation of torsion angles for these two amino acids goes from high to low. For both of these amino acids, we generated clustering plans that satisfied the three constraints listed in the previous paragraph. The chosen cluster points are then graphed as circles of radius $ 30^\circ$ over top of the torsion angles used to generate the points. From Figure 6.1, you can see the number of cluster points needed increases as we expect a higher percentage of points within close proximity to a cluster point. We also see that the number of cluster points needed is dependent upon the amino acid being considered.

\begin{displaymath}
\begin{array}{c c}
\multicolumn{1}{l}{\mbox{\bf Thr}} &
\mu...
...ght=2.8in, angle=-90]{GRAPHICS/ASN_graph_3.ps} \\
\end{array}\end{displaymath} [Cluster plans of varying granularity (75%, 85%, 95%) for two aminos acids (Asn and Thr).] Cluster points are represented as circles of radius $ 30^\circ$.


Full Results

We will expand our results to a full set of amino acids. In performing these experiments, we must keep in mind that we are primarily interested in how many cluster points are needed to perform a suitable search. The larger the number of cluster points needed to represent the system, the less tractable a full search by HOPS becomes, so we wish to have ``just enough'' cluster points to represent each system. In Table 6.2, we have measured the number of cluster points needed to bring a threshold percentage of torsion angles within $ 30^\circ$ of a cluster point. As we can see, amino acids like Gly can require a large number of torsion angles to be modeled effectively.


Table: Number of clusters needed to reach a specific percentage of torsion angles within $ 30^\circ$ of a cluster point. All distances are euclidean distances.
AAcids 0.50 0.75 0.90 0.95 0.99
Pro 2 2 2 5 13
Gly 4 11 19 29 $ > 100$
Ala 1 9 9 13 55
Val 2 3 6 7 47
Leu 2 3 7 11 37
Ile 2 2 4 8 27
Met 1 4 5 9 31
Cys 2 7 10 19 36
Phe 2 4 7 11 33
Tyr 2 5 9 14 62
Trp 2 3 8 12 30
Arg 2 3 10 16 52
Lys 2 3 10 11 45
His 2 5 8 12 37
Asp 2 5 9 15 60
Glu 1 3 9 12 51
Ser 2 3 8 18 73
Thr 2 4 5 9 64
Asn 3 8 13 16 54
Gln 2 3 6 14 57


In Table 6.3, we are studying the sensitivity of the number of clusters needed and the size of our radii. Here we are requiring 90% of the torsion angles to be within our radius.


Table 6.3: Number of clusters needed to reach 90% of torsion angles within a specific number of degrees of a cluster point.
AAcids $ 50^\circ$ $ 45^\circ$ $ 40^\circ$ $ 35^\circ$ $ 30^\circ$ $ 25^\circ$ $ 20^\circ$ $ 15^\circ$
Pro 2 2 2 2 2 3 5 8
Gly 9 11 15 18 19 28 46 96
Ala 4 9 9 9 9 12 20 28
Val 2 3 5 6 6 7 10 17
Leu 2 3 5 5 7 9 18 37
Ile 2 3 3 4 4 7 10 18
Met 3 4 4 4 5 9 19 29
Cys 7 7 7 8 10 17 23 35
Phe 4 4 4 5 7 10 15 30
Tyr 5 5 5 6 9 13 16 34
Trp 3 4 4 5 8 12 24 30
Arg 4 4 4 9 10 14 28 39
Lys 4 4 4 9 10 11 14 36
His 5 5 6 7 8 11 18 35
Asp 5 5 6 8 9 13 20 48
Glu 3 7 7 7 9 11 18 42
Ser 3 4 4 7 8 13 18 38
Thr 3 4 4 5 5 9 14 29
Asn 7 8 8 11 13 16 26 50
Gln 4 4 6 6 6 8 18 28


If you look at the previously mentioned Figure 6.1, you'll notice that in the plans with large numbers of cluster points there is significant overlap in the radii of the cluster points. This occurs naturally as we aren't concerned in tiling the space with disjoint radii. We are most interested in minimizing the distances from a cluster point to the largest number of torsion angles possible. This means that the overall plan may be best served by placing more cluster points in the densely populated areas and ignoring the sparsely populated regions.

Using this reasoning, we may be better off utilizing an additional condition to determine how many clusters is sufficient. One such condition would continue to add cluster points until the average distance between the cluster points and their torsion angles is less than some distance (for instance, $ 20^\circ$ yields a reasonable number of cluster points for HOPS to search through, 2-10 for each amino acid). We could also require that the median distance be less than some cutoff, but the average is a better choice here. If we have poorly modeled a torsion angle, the average distance is important because a large number of outliers could indicate we need to add a cluster point to appropriately model that region.

If we re-run this system with a requirement that the average of torsion angle to cluster point distance is less than $ 20^\circ$, we find the number of cluster points given in Table 6.4.


Table: Amino acids and number of clusters required for an average torsion angle to cluster point distance of less than $ 20^\circ$.
Amino Acid Clusters Amino Acid Clusters
Ala 5 Arg 4
Asn 8 Asp 8
Cys 7 Gln 4
Glu 4 Gly 10
His 10 Ile 2
Leu 3 Lys 8
Met 4 Phe 6
Pro 2 Ser 6
Thr 4 Trp 4
Tyr 6 Val 3


We are also interested in the sensitivity of the number of cluster points found to the data set used. For instance, if we randomly sample from our set of data points and use only half of the data to cluster on will we find the same cluster points as with the original set? Using Arg as our test case, generated ten random data sets from the full data set of 1,937 torsion angles.46

The full data set required four cluster points, but the number required for the random subsets ranged from four (three times) to five (five times) to six (two times) cluster points (see Figure 6.2).

Figure 6.2: Clustering plans for random subsets of the full Arg data set.
\begin{displaymath}
\begin{array}{c c}
\multicolumn{1}{l}{\mbox{\bf Full Arg Dat...
...=2.8in, angle=-90]{GRAPHICS/ARG_graph_sub3.ps} \\
\end{array}\end{displaymath}

We are also interested in whether the ordering of the points entering the algorithm affects the number of cluster points found. In fact, it does. We reordered our ten random subsets and in five of the ten cases the re-ordering changed the number of clusters found. Our algorithm begins with an assignment of the torsion angles to the cluster points. This assignment is dependent on the order in which the torsion angles are entered into the algorithm. Different entry orders will result in the different cluster point descriptions. However, these results do not vary a great deal from one another (see Figure 6.3).

\includegraphics[width=3.5in, angle=-90]{GRAPHICS/ARG_manyclusters.ps}
[All the cluster points found by ten random selections from the full data set for amino acid Arg.] Despite cases where the algorithm found varying numbers of torsion angles each run, the algorithm does pick out the same underlying structure in the data set.

Our algorithm does not guarantee a global solution of the Facilities Location Problem (FLP). With such a large number of data points (1,000+) we are susceptible to being caught in one of the large number of local minima in the FLP's objective function depending on our starting solution.

While this lack of reproducibility may be somewhat disconcerting, we should also keep in mind that the data in the protein databank is not completely precise. Therefore, cluster points within $ 5^\circ$ of each other could likely be viewed as equivalent cluster points.

Incremental Addition of Points

In Chapter 4.1, we mentioned an incremental approach to clustering. While this is fully implemented in our clustering algorithm, it has not yet been utilized by HOPS. Therefore, we will only give an example of the incremental addition of points to our clustering plan, and will not evaluate the quality of solutions found by HOPS using this technique when compared to non-incremental searches in HOPS.

We will use one of the more difficult amino acids to test our incremental algorithm. In a typical run, with typical parameters, His (HIS) would require 10 cluster points to provide an average distance from torsion angle to cluster point of less than $ 15^\circ$. To demonstrate how incremental addition of points works, we first select a clustering plan with just five cluster points, and then we add five more cluster points while maintaining the five points that we already have in hand (see Figure 6.4).

\begin{displaymath}
\begin{array}{c c}
\multicolumn{1}{l}{\mbox{\bf Original His...
...t=2.8in, angle=-90]{GRAPHICS/HIS_graph_inc.ps} \\
\end{array}\end{displaymath} [Comparison of the original clustering plan for His and an incremental plan.] We are comparing the original plan with ten cluster points and an incremental plan with five original clusters (marked by circles) and additional five cluster points chosen later ($ \times$'s).

Recommendations

Since the location and number of torsion angles is sensitive to the ordering of torsion angles and the use of random subsets, our clustering should be used as a strong starting point for the selection of cluster points appropriate for the HOPS backbone angles. While the cluster points from any of the runs describe a considerable portion of the information found in the Ramachandran plots, they would likely need a post-processing step in order to be optimally utilized by HOPS. This lack of online-ness is an undesirable quality, but is a byproduct of the size of the data sets we are using and the random nature of our algorithm.


next up previous
Next: Tweaking Results Up: TORSION ANGLE SELECTION AND Previous: Tweaking Algorithms for -Strand
sforman@sju.edu