Data-efficient machine learning of biomolecules

Life on the cellular level is governed by different types of biomolecules. These building blocks represent the machinery of living organisms and play crucial roles in the functioning of each cell. Machine learning is becoming increasingly important for studying their structure and the function tightly coupled to it. Members of the Multiscale Biomolecular Simulation research group and the Helmholtz AI team, in cooperation with Forschungszentrum Jülich and the German Aerospace Center (DLR), have now presented a method that makes structural predictions possible in data-sparse scenarios by combining self-supervised and classical machine learning techniques.

Biological macromolecules

As the smallest building blocks, various biomolecules form the molecular machinery of all living organisms. They take on crucial functions in every cell, such as the storage, reading and interpretation of genetic information, the interplay of numerous chemical reactions or the recognition and interpretation of signals or dangers for the cells. These biomolecules include proteins, ribonucleic acids (RNA), carbohydrates and lipids, which interact with each other in complex ways to enable the processes of life.

One of the most important biomolecules is the aforementioned RNA. Put simply, RNA is a single-stranded copy of deoxyribonucleic acid (DNA) that performs an array of different tasks. One important task is to transport the genetic information that the DNA carries. Some viruses store their genetic information on RNA molecules; for example, the coronavirus is an RNA-based virus, and many of the corona vaccines are also based on RNA technologies. Some RNA molecules can, like proteins, fold into complex three-dimensional structures. This 3D structure is in turn of crucial importance for understanding the function of RNA and for developing treatments for diseases caused by RNA dysfunction.

Structure determination

The experimental determination of such 3D structures of RNA molecules is very complex - and this is where machine learning (ML) comes into play. ML has revolutionized many fields, including biology, by providing powerful tools for analysing complex data and making predictions. In recent years, deep learning (DL) has emerged as a particularly promising approach for analysing biological data, including the prediction of biomolecular structures. DL is a form of machine learning in which artificial neural networks are used to learn from data. These networks consist of numerous layers of interconnected nodes that process information. By training these networks with large data sets, DL algorithms can learn to recognize patterns and make accurate predictions. For proteins, the task of structure prediction has recently been solved by AlphaFold2, a deep neural network, with impressive results. AlphaFold2 uses the extensive databases of protein information to directly predict three-dimensional structures. However, a direct transfer to RNA is not possible, as only a fraction of the training data is available: while extensive sequence data also exists for RNA, i.e., sequences of the individual RNA building blocks represented as long letter chains, only about 200 RNA structures are known, and these differ significantly from each other.

The BARNACLE model

A new model, called BARNACLE, now uses a DL approach to predict spatial neighbourhoods between RNA building blocks (so-called nucleotides, represented by single letters).

You can think of it like a LEGO model: if individual LEGO bricks are replaced in one place, the bricks in the surrounding area must adapt so that the entire structure still fits together. The BARNACLE model uses this idea for RNA: nucleotides that are spatially close to each other in RNA are also more likely to mutate together during evolution. And it is precisely these emerging mutation patterns that the model looks for. To train the model, a combination of self-supervised pre-training on a large amount of sequence data (Figure 1) and efficient use of the few available structures is employed. In self-supervised pre-training, the model is trained on a large dataset of sequences to learn general features of the data. For this purpose, the input data is corrupted, and the model must restore the original; in doing so, it learns the rules underlying the data. This pre-trained model is then fine-tuned on the smaller structural data set to make specific predictions.
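The corruption step of such self-supervised pre-training can be illustrated with a small sketch. The mask token, masking fraction, and function name below are illustrative choices, not details taken from the BARNACLE paper; the sketch only shows how a training example (corrupted input plus restoration targets) is constructed from a raw RNA sequence.

```python
import random

MASK = "?"  # illustrative placeholder token for corrupted positions

def corrupt_sequence(seq, mask_fraction=0.15, rng=None):
    """Build a self-supervised training example from an RNA sequence:
    a random subset of positions is replaced by a mask token, and the
    original letters at those positions become the restoration targets.
    A model trained to fill the masks back in must learn statistical
    rules of RNA sequences."""
    rng = rng or random.Random(0)
    n_mask = max(1, int(len(seq) * mask_fraction))
    positions = rng.sample(range(len(seq)), n_mask)
    corrupted = list(seq)
    for p in positions:
        corrupted[p] = MASK
    # Targets: the original nucleotide at each masked position
    targets = {p: seq[p] for p in positions}
    return "".join(corrupted), targets

seq = "GGGAAACUUUCCCAUGGCUA"
corrupted, targets = corrupt_sequence(seq, mask_fraction=0.2)
```

No labels are needed for this step, which is why it can exploit the abundant sequence data: the sequence itself provides both the input and the answer.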

Once the model has been fine-tuned, another model is trained that produces the actual prediction from the internal representations of the network. In a sense, this model "reads the thoughts" of the first model. The advantage of this approach is that the second model does not have to be a deep neural network; in this case, it is a so-called random forest, which learns more efficiently from the limited data. BARNACLE showed a significant improvement over established classical statistical approaches as well as other neural networks (Figure 2). The work also shows that this approach is transferable to related tasks with similar data limitations.
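The second stage can be sketched as follows. The feature vectors below are random stand-ins for the network's internal representations, and the toy labels are synthetic; sizes and the contact criterion are purely illustrative. The sketch only demonstrates the pattern of fitting a classical random forest on frozen embeddings rather than on raw sequences.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical stand-in for the pre-trained network's internal
# representations: one feature vector per nucleotide pair, with a
# binary label marking whether the pair is in spatial contact.
n_pairs, dim = 400, 32
embeddings = rng.normal(size=(n_pairs, dim))
labels = (embeddings[:, 0] + embeddings[:, 1] > 0).astype(int)  # toy signal

# The second model "reads the thoughts" of the first: a random
# forest, which copes well with limited data, is fitted on the
# frozen embeddings instead of on the raw sequences.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(embeddings[:300], labels[:300])
accuracy = clf.score(embeddings[300:], labels[300:])
```

Because the random forest has far fewer parameters to fit than a deep network head, it can extract a usable decision rule from only a few hundred labelled examples.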

The paper was published as "RNA Contact Prediction by Data Efficient Deep Learning" in Communications Biology.