Thesis (Index) <- Sean Forman <- You Are Here
This chapter presents results from an implementation of the algorithms given in Chapter 4. We collect data from a test bed of 248 proteins and run our clustering algorithm on this data. We then discuss the appropriate selection of torsion angles for HOPS and provide results from the incremental clustering algorithm.
For our test bed we are using a set of 403 proteins chosen by Hobohm and Sander [37] as a large representative set from the more than 16,000 in the Protein Data Bank. These proteins have low sequence homology and have especially accurate representations in the structure database. From this set, we were able to extract torsion angles for just over 248 using PEX (see Section 3.5). PEX requires a strict adherence to the PDB standard (Appendix B), so a good number of the 403 were in a form which PEX would not read. Next, these torsion angles were divided by amino acid type. The number of data points for each amino acid type is given in Table 6.1.
It is interesting to note that of those 49,651 torsion angles trios,
just 166 of them had an
angle in the cis
conformation (
). We can also test our assumption that the
angle can be set to
. If we shift the
angles from a scale of
to
then we find that
of all angles
are between
and
and
are between
and
.
When we generate a clustering plan, we generally begin with a set of
example torsion angles (our set from Section 6.1),
and a pair of parameters: the percentage of torsion angles to be
within a certain distance of a cluster point and the specific distance
used. For the results presented in this section, we will be using
as our standard distance and then varying the number of
torsion angles that need to be within that distance from 75% to 85%
to 95%.
We will present results for two specific amino acids: Asn and Thr.
The deviation of torsion angles for these two amino acids goes from
high to low. For both of these amino acids, we generated clustering
plans that satisfied the three constraints listed in the previous
paragraph. The chosen cluster points are then graphed as circles of
radius
over top of the torsion angles used to generate
the points. From Figure 6.1, you can see
the number of cluster points needed increases as we expect a higher
percentage of points within close proximity to a cluster point. We
also see that the number of cluster points needed is dependent upon
the amino acid being considered.
[Cluster plans of varying granularity (75%, 85%, 95%) for
two aminos acids (Asn and Thr).] Cluster points are represented as
circles of radius
We will expand our results to a full set of amino acids. In
performing these experiments, we must keep in mind that we are
primarily interested in how many cluster points are needed to perform
a suitable search. The larger the number of cluster points needed to
represent the system, the less tractable a full search by HOPS
becomes, so we wish to have ``just enough'' cluster points to
represent each system. In Table 6.2, we have
measured the number of cluster points needed to bring a threshold
percentage of torsion angles within
of a cluster point.
As we can see, amino acids like Gly can require a large number of
torsion angles to be modeled effectively.
|
In Table 6.3, we are studying the sensitivity of the number of clusters needed and the size of our radii. Here we are requiring 90% of the torsion angles to be within our radius.
|
If you look at the previously mentioned Figure 6.1, you'll notice that in the plans with large numbers of cluster points there is significant overlap in the radii of the cluster points. This occurs naturally as we aren't concerned in tiling the space with disjoint radii. We are most interested in minimizing the distances from a cluster point to the largest number of torsion angles possible. This means that the overall plan may be best served by placing more cluster points in the densely populated areas and ignoring the sparsely populated regions.
Using this reasoning, we may be better off utilizing an additional
condition to determine how many clusters is sufficient. One such
condition would continue to add cluster points until the average
distance between the cluster points and their torsion angles is less
than some distance (for instance,
yields a reasonable
number of cluster points for HOPS to search through, 2-10 for each
amino acid). We could also require that the median distance be less
than some cutoff, but the average is a better choice here. If we have
poorly modeled a torsion angle, the average distance is important
because a large number of outliers could indicate we need to add a
cluster point to appropriately model that region.
If we re-run this system with a requirement that the average of
torsion angle to cluster point distance is less than
, we
find the number of cluster points given in
Table 6.4.
We are also interested in the sensitivity of the number of cluster points found to the data set used. For instance, if we randomly sample from our set of data points and use only half of the data to cluster on will we find the same cluster points as with the original set? Using Arg as our test case, generated ten random data sets from the full data set of 1,937 torsion angles.46
The full data set required four cluster points, but the number required for the random subsets ranged from four (three times) to five (five times) to six (two times) cluster points (see Figure 6.2).
We are also interested in whether the ordering of the points entering the algorithm affects the number of cluster points found. In fact, it does. We reordered our ten random subsets and in five of the ten cases the re-ordering changed the number of clusters found. Our algorithm begins with an assignment of the torsion angles to the cluster points. This assignment is dependent on the order in which the torsion angles are entered into the algorithm. Different entry orders will result in the different cluster point descriptions. However, these results do not vary a great deal from one another (see Figure 6.3).
Our algorithm does not guarantee a global solution of the Facilities Location Problem (FLP). With such a large number of data points (1,000+) we are susceptible to being caught in one of the large number of local minima in the FLP's objective function depending on our starting solution.
While this lack of reproducibility may be somewhat disconcerting, we
should also keep in mind that the data in the protein databank is not
completely precise. Therefore, cluster points within
of
each other could likely be viewed as equivalent cluster points.
In Chapter 4.1, we mentioned an incremental approach to clustering. While this is fully implemented in our clustering algorithm, it has not yet been utilized by HOPS. Therefore, we will only give an example of the incremental addition of points to our clustering plan, and will not evaluate the quality of solutions found by HOPS using this technique when compared to non-incremental searches in HOPS.
We will use one of the more difficult amino acids to test our
incremental algorithm. In a typical run, with typical parameters, His
(HIS) would require 10 cluster points to provide an average distance
from torsion angle to cluster point of less than
. To
demonstrate how incremental addition of points works, we first select
a clustering plan with just five cluster points, and then we add five
more cluster points while maintaining the five points that we already
have in hand (see Figure 6.4).
[Comparison of the original clustering plan for His and
an incremental plan.] We are comparing the original plan with ten
cluster points and an incremental plan with five original clusters
(marked by circles) and additional five cluster points chosen later
(
Since the location and number of torsion angles is sensitive to the ordering of torsion angles and the use of random subsets, our clustering should be used as a strong starting point for the selection of cluster points appropriate for the HOPS backbone angles. While the cluster points from any of the runs describe a considerable portion of the information found in the Ramachandran plots, they would likely need a post-processing step in order to be optimally utilized by HOPS. This lack of online-ness is an undesirable quality, but is a byproduct of the size of the data sets we are using and the random nature of our algorithm.