Predicting RNA 3D structure using a coarse-grain helix-centered model

A 3D model of RNA structure can provide information about its function and regulation that is not possible with just the sequence or secondary structure. Current models suffer from low accuracy and long running times and either neglect or presume knowledge of the long-range interactions which stabilize the tertiary structure. Our coarse-grained, helix-based, tertiary structure model operates with only a few degrees of freedom compared with all-atom models while preserving the ability to sample tertiary structures given a secondary structure. It strikes a balance between the precision of an all-atom tertiary structure model and the simplicity and effectiveness of a secondary structure representation. It provides a simplified tool for exploring global arrangements of helices and loops within RNA structures. We provide an example of a novel energy function relying only on the positions of stems and loops. We show that coupling our model to this energy function produces predictions as good as or better than the current state of the art tools. We propose that given the wide range of conformational space that needs to be explored, a coarse-grain approach can explore more conformations in less iterations than an all-atom model coupled to a fine-grain energy function. Finally, we emphasize the overarching theme of providing an ensemble of predicted structures, something which our tool excels at, rather than providing a handful of the lowest energy structures.


A.1. Energies A.1.1. Clash Energy
Having established that we can re-position the atoms of a helix reasonably well given its coarse grain parameters (see Sec. 2.2), we can use this technique to detect clashes within stem regions. This is done by 'virtual' base pairs, or interpolations of where a base pair and its constituent nucleotides and atoms would lie given its position within the stem and the stem's coarse-grain parameters.
Thus, if a stem has a length of n base pairs, a starting position vector s, an ending position vector e and two twist vectors t 1 and t 2 , the position of the i'th virtual base pair along the axis of the helix will be s + i−1 n−1 (e − s). The direction of the angular position can be calculated in a similar fashion by taking the corresponding proportion of the angle that one would need to rotate the first twist around the stem axis to align it with the second twist. This total angle, naturally, needs to be adjusted for helices in which the base pairs twist fully around the stem axis one or more times (usually around the 11th basepair).
Using the stem axis vector a = (e − s), the starting position of the virtual base pair v s , and the virtual twist vector v t , we can define a coordinate system for each base pair. Within this coordinate system, the position of all the atoms of the two nucleotides in the base pair can be represented in a manner that only depends on the strand which contains the nucleotide and the identity of the atom. These positions were calculated for all stem-contained atoms in the ribosome structure 1jj2 and averaged to create an average base-pair representation.
The clash energy function determines if the positions of the atoms calculated from the virtual base-pairs of each helix intersect (i.e. are positioned within 1.8Åof each other). The number of intersections is multiplied by a large energetic penalty to yield a certain rejection of the sampled structure.

A.1.2. A-Minor Energy A.2. Best Sampled Structures
A.3. Enumeration      The target and reference distributions for the radius of gyration. The target and reference measures are obtained from subgraphs of a native ribosome and an artificially constructed structure without clashes or broken junctions. Clearly evident is the tendency for native structures to have a smaller radius of gyration than larger structures. The energy values are thus lower for more compact structures and higher for more spread out structures. The effect becomes more pronounced for larger structures due to their inherent ability to spread out more.   Figure A.4: The distribution of the RMSD between the native and the sampled structures using a naive (constant-energy) energy function and the combination of the three non-constraint coarse grain energies (aminor-perloop-rog).
Figure A.5: The best model for the structure 3R4F (bright colors) aligned to the native structure (dark colors) from a simulation using the naive energy function (above right) and the energy described in Section 2.4 (below right). On the left are the graphs indicating the coarse grain value distributions for each of the measures. The value of the best sampled structure is shown as an orange dashed line, whereas the value from the native is shown as a dashed blue line. The two lines show a perfect overlap for the interior loop A-Minor Energy in the upper right hand corner and thus only the red line is visible.

A.4. Example of a Prediction Failure
As seen in figure A.4, sampling using the presented energy function does not always lead to an improvement over a naive energy. A good example of this is the structure 3R4F, show in Supplementary Figure A.5. The poor prediction performance is most likely attributed to the incorrect close positioning of the two hairpin loops in the predicted structure. The other coarse-grain values differ by smaller magnitudes and likely contribute less to the wrong configuration. Examining the energy of all of the sampled structures show that this structure has a very unfavorable energy-rmsd profile (see Supplementary  Table A.2). This example illustrates one of the pitfalls of using a knowledge-based energy function wherein a strong pattern in the training data is not necessarily reflected in individual structures. We plan on addressing this issue by adding additional terms to our energy function such that a global maximum probability (minimum energy) occurs near the native structure. This, however, is an ongoing pursuit and is beyond the scope of this paper.

A.5. Helix Fitting
Four different methods for fitting a coarse grain helix (consisting of an axis segment and two twist vectors) to an all-atom helix were tested for their ability to accurately represent the positions of the atoms on the helix.
The ad-hoc method uses the directions of the base normals as well as the vector between the estimated centers of the outermost two base pairs of the helix as a way of calculating the axis vector.
The fit method assumes that the projection of the positions of backbone atoms onto the plane normal to the axis vector should form a circle. The axis is calculated by optimizing its vector so as to minimize the root-mean square deviation (RMSD) between the projected heavy atom positions and a circle fit onto their positions.    For the best method, the average aligned root mean square deviation increases for larger stems due to the increasing number of nucleotides present.
The superimpose method fits an axis as per the fit method onto an ideal helix (generated by using the fiber program of the 3DNA package [25]). The ideal helix is then superimposed onto the target helix using Kabsch's algorithm [17] and the resulting transformation is applied to the fitted axis to yield an axis for the target helix.
The template method differs from the superimpose method insofar as it uses an extremely long (30 base pair) helix to fit the axis parameters and then superimposes the first n base pairs of the long ideal helix onto the target helix to generate the transformation for the axis.

A.5.1. Results
To measure the quality of each helix fitting method described in the previous section, we took pairs of helices containing the same number of base-pairs and calculated the parameters for both helices using each of the four described methods. For both helices, we calculated a twist parameter equal to a halfway rotation of one twist onto the other around the helix axis. The direction of this twist parameter as well as the line segment defining the axis of the helix provide enough parameters to direct the superposition of one helix onto another. That is, we can define a translation and a rotation that will superimpose one coarse grain representation onto another. By applying this transformation to each backbone atom of the second helix, we attempt to move and rotate it to its equivalent location on the first helix. We would expect a consistent helix fitting method to yield the lowest RMSD deviation between the atoms of the first helix and the atoms of the second (fitted) helix when superimposed on the first using the axes and twist parameters defining both helices.
The results (illustrated in Fig. A.9) confirm the expected dominance of the superpositionbased methods (superimpose and template). The ad-hoc method's performance on short helices was notably worse than any of the others due to the fact that the conformations of terminal base pairs tend to have a variable geometry which is not tempered by the length of the helix. In longer helices, the vector between the computed centers of the terminal base pairs is longer and thus has less room for variation, leading to better performance of the ad-hoc method. The fit method, which fits a helix axis by optimizing the mean error of a circle fit to the projection of the backbone atoms onto the plane perpendicular to the axis, also lags behind the superposition based methods on shorter helices. The paucity of data points (backbone atoms) leads to semi-circular and noisy projections which in-turn lead to potentially inaccurate circle fits. The two superpositionbased methods show almost identical performance due to the regular nature of the ideal helices to which the axis and circle are fit. The quality of all methods degrades slightly for larger helices due to the increased tendency of helices to bend slightly and deviate from the ideal geometry as they increase in length.

A.5.2. Ad-hoc
Let the residues on one strand of the helix be numbered s a1 ,, s a2 , ..., s an , and on the other strand s b1 ,, s b2 , ... , s bn , where s an−1 < s an and s bn−1 < s bn then the two residues that cap one end of the helix are s a1 and s bn while the residues that cap the other end of the helix are s an and s b1 . Let c α (s) define the location of the C α atom in the residue numbered s. The helix vector which is used as an initial estimate for the optimization function described above is calculated as follows: The vectors V s1 , V s2 , V e1 , and V e2 are between the C α atoms of two base-paired nucleotides. They should be roughly orthogonal to the axis of the helix.
The vectors S n and E n should be two estimates of the helix axis as calculated from the top and bottom two nucleotides. S y and E y should be orthogonal to the estimated helix-axis as well as the vector between the first and last C α atom of the helix, respectively.
Taking the average of the normalized C α vectors and the previous orthogonal vector should yield a vector roughly towards the center of the helix. Finally, adding a multiple of this vector to the C α atoms of the start and end residues, yields rough estimates for the start and end of the helix. This is the crudest method of estimating the cylinder axis vector and performs consistently worse than the other methods (figure A.9).

A.5.3. Fit Method
An improvement over this method comes from the realization that a helix is a path along a cylindrical manifold. We proceed by fitting a cylinder to the backbone atoms of an RNA helix. Given a cylinder with an axis vector C v , then a coordinate system can be created with the z-axis lying along C v . Assuming the axis vector defines a cylinder with radius r which minimizes the root-mean-square distance of the backbone atoms from the surface of the cylinder, then transforming the locations of the backbone atoms into the coordinate system defined by the cylinder vector should yield a circle on the plane defined by the x and y axes which has a minimum root-mean-square deviation from the circle created by intersecting our ideal cylinder with the x-y plane of the our transformed coordinate system. These assumptions yield a straightforward method of fitting a cylinder to an RNA helix by fitting an axis such as to minimize the root-mean-square deviation of the best circle that can be fit onto the transformed backbone atoms of the RNA helix. The use of this method significantly improves the ability to align a second helix onto the first over the naive 'estimate' method ( Fig. A.9).
Fitting the circle is done via the method described by [4], while fitting the axis is done using Python's leastsq optimization function. An initial estimate for the axis vector is calculated using the ad-hoc method.

A.5.4. Superimpose and Template Methods
RNA helices, especially of the shorter variety, often vary slightly in their structure such that their atoms do not form an ideal circle when viewed along the helix axis. To compensate for this, we first created an idealized stem using the fiber program [25], superimposed it onto the stem being parameterized and then calculated the cylinder axis for the ideal stem. The resulting method (called superimpose) showed a marked improvement over all of the previous attempts at approximating a helix.
Longer helices, as one may expect, should be easier to parameterize due to the larger number of data points available. The final method we tried was to create an idealized 30 base-pair helix and to approximate its axis using the 'fit' method described above. For every subsequent helix of length n (n < 30) for which we wanted to calculate an axis, we simply calculated the best rotation and translation to align its atoms onto those of the first n base pairs of long reference helix. The resulting transformation was applied to the reference axis which was further cropped to match the length of the query helix.
The results indicate that while this method (called template) was more consistent, the parameters it created were on average no better than those created by the 'superimpose' method ( Fig. A.9). Nevertheless, it is the method that is used throughout this paper for fitting coarse grain helices onto all-atom models.

A.5.5. Twist Parameters
In addition to the cylinder defining the helix, we need an approximation for how the base pairs are positioned along this cylinder. This is accomplished simply by storing vectors (henceforth called twist vectors) which indicate the direction from each end of the cylinder to the middle of the terminal base pairs. If the axis of the stem cylinder is defined by a start point, C s , and an end point, C e , then the twist vectors are calculated by the taking the vector rejection of a, (the average of the vectors from the end of the cylinder to the terminal nucleotides' C α atoms), from the cylinder's direction vector.
A.6. Sampling Quality and Energy Factor A.6.1. Energy Function Evolution As the sampling proceeds, the distribution of the sampled values for each coarse grain measure change and the reference distribution for the energy function must also change. This is starkly illustrated in the case of the adenine riboswitch (PDB: 1Y26). The initial energy function, as calculated from substructures of the native and artificially constructed ribosome (PDB: 1jj2), favors more compact structures due to the relative paucity of such structures in the initial reference distribution (see Fig. A.3, upper left hand plot labeled 1Y26). If, however, we look at conformations sampled for this particular secondary structure (as opposed to the collection of ribosome substructures of roughly the same size), it becomes apparent that the background distribution actually favors more compact structures than the target distribution (see green and orange, respectively in Fig. A.6, upper left corner labeled 1Y26). Thus as the sampling proceeds, the energy values for more compact structures stay stable whereas the energy values for more spread out structures decrease (see Fig. A

A.6.2. Energy Factor
The energy for each coarse grained value (x) is calculated as the log of the ratio of the target distribution (p t (x)) divided by the reference distribution (p r (x)) and multiplied by a factor c which serves as a parameter for tuning how closely the target distribution should match the sampled values.
Since the reference distribution is re-calculated every ten sampling steps, it tends to converge toward the target distribution. Under-sampled and over-sampled regions of the conformation space will contain structures whose energy values have large absolute values. For over-sampled regions, these values will be large and lead to a quick traversal into a different region of the conformational space whereas in under-sampled regions the energy values will be low, yielding more structures in this region. The factor c in the energy formulation above determines how strictly the sampled distribution should follow the target distribution.
The measure of the Jensen-Shannon (JS) divergence between the target and sampled distributions provides a natural measure of when to terminate the sampling. While other methods sample a fixed number of steps in the hope that a sufficiently low energy structure is found, we can simply identify when the JS divergence stabilizes and return the most frequently sampled conformations at that point. Fig. A.13  iterations have been performed. We present this merely as an avenue for further research. In actuality, all of our simulations were run to 20000 sampling steps. Table A.2: Energy landscape and profile over the course of an Ernwin simulation. See Main Text Figure  9 for a more detailed description.  Figure  9 for a more detailed description.  Table A.4 represent the entire time spent for each simulation. Due to the lack of sequence dependence, all fragment statistics used by Ernwin are precomputed so normal usage does not require their recalculation. FARNA, however, requires the creation of sub-models for each loop region, leading to the large running time listed in Table A.4. Such models could, in principle, be computed ahead of time, stored externally and assembled to form a complete structure which would make simulations using FARNA significantly faster.