In our previous article, we described one of the most difficult and significant problems in the biological sciences, the protein-folding problem, and how DeepMind’s AI system largely “solved” it. But what is the secret behind AlphaFold? How is the algorithm configured and applied to a complex biological problem like predicting the 3D structures of proteins?
Solving the Protein Folding Problem before AlphaFold
If you have read the first article of this series, you are already aware of how important proteins are for the proper functioning of our body, and how a protein’s shape is fundamental to performing its role. Whether it is an enzyme digesting your food or an antibody defending you from a virus, every protein depends on its 3D structure to function correctly.
We also showed that knowing the 3D structure of proteins is so important that many experimental approaches have been developed to determine it. In fact, impressively complicated and extremely costly experiments are still used to this day. If you want to take a look, see this video on nuclear magnetic resonance (NMR) (Figure 1), one of the most widely used methods for obtaining information about protein structures.
Figure 1: Workflow for protein structure by NMR. Source: https://www.creative-biostructure.com/nmr-platform_67.htm
Complicated, expensive, and time-consuming. So now you can understand the scientific community’s excitement about much more efficient computational approaches to the protein folding problem. Indeed, others had tried computers and deep-learning methods for protein structure prediction before AlphaFold, but issues persisted, and AlphaFold’s method was far more successful.
Protein Coevolution: It’s the Contact That Matters
Let’s first look at how people tried to predict protein structures in the past. According to the laws of thermodynamics, systems tend to move towards the state of lowest free energy, and so do proteins. Traditionally, computational approaches simulated this process, in which a protein moves through conformational states toward states of lower free energy.
In these approaches, a stochastic sampling process such as simulated annealing is used to predict structures. Simulated annealing is a probabilistic technique for approximating the global optimum of a given function; here, it minimizes a statistical potential derived from summary statistics extracted from known protein structures. This approach takes a long time and is considered inefficient.
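To make the idea concrete, here is a minimal sketch of simulated annealing applied to a toy one-dimensional energy function. The function, cooling schedule, and parameters are all invented for illustration; a real folding program would minimize a statistical potential over thousands of atomic coordinates, not a single variable.

```python
import math
import random

def simulated_annealing(energy, x0, n_steps=10000, t_start=5.0, t_end=0.01):
    """Minimize `energy` by randomly perturbing x, accepting uphill
    moves with a temperature-dependent probability (Metropolis rule)."""
    random.seed(0)
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        # Exponentially cool the temperature from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / n_steps)
        x_new = x + random.gauss(0, 0.5)   # propose a random perturbation
        e_new = energy(x_new)
        # Always accept downhill moves; sometimes accept uphill ones
        # (more often while the temperature is still high).
        if e_new < e or random.random() < math.exp((e - e_new) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e

# A toy "energy landscape" with a local and a global minimum,
# standing in for a statistical potential.
energy = lambda x: 0.1 * x**4 - x**2 + 0.5 * x
x, e = simulated_annealing(energy, x0=5.0)
```

The high-temperature phase lets the search escape local minima; as the system cools, it settles into a low-energy state, just as a slowly cooled protein model settles into a low-potential conformation.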
So, what can we use as constraints to make the process more efficient? An answer that gained popularity in recent years is: using evolutionary covariation data. This may sound unfamiliar, but don’t worry, we will explain.
Recent protein folding models suggest that certain defined regions initiate the folding process, and that fragments assemble stepwise into the final native structure by establishing points of contact. This means that certain amino acids are in contact with each other when the protein is folded, and when in contact, they create stable states with lower potential energy. For a better understanding, see Figure 2 below.
Figure 2: Consider a protein as a long necklace of beads (amino acids). A long necklace can adopt many forms. In the case of proteins, there are beads that are in contact (blue and green beads) that determine how the protein is folded. Image credit: Nature. Source: https://www.nature.com/articles/s41598-019-55047-4/figures/1
Now you can guess that these contact points are very important for the final structure of the protein.
So, what happens if one of these amino acids suddenly mutates or changes? Let’s say, the blue M in the figure above changes into a K.
These amino acids will now probably be unable to come into contact, so the protein structure becomes unstable, its potential energy rises, and eventually it folds into a different structure that is unable to perform its task. Long story short, that protein will go extinct.
So, here is where we introduce coevolution. If we take the case above, coevolution occurs when one amino acid mutates and the other amino acid that used to be in contact also mutates in order to maintain the contact. That is how “evolution” works. The more important these amino acids are for maintaining the native 3D structure, the more likely they are to coevolve or change together.
A few lines back, I said we were looking for ways to improve protein structure prediction and that the answer lay in evolutionary data. Well, these contact amino acids in the sequence, the ones that coevolve, are exactly the information we are looking for.
But how can you search for such correlated changes between two amino acid positions across the entire sequence of a protein? You can do so by collecting the sequences of similar proteins (often called related or homologous proteins) from a large sequence database and measuring the evolutionary covariation between positions.
Using such methods, AlphaFold is defined as a “coevolution-dependent method” that employs Multiple Sequence Alignment (MSA) to detect residues that coevolve, thereby suggesting physical proximity in space (see Figure 3).
Figure 3: A protein multiple sequence alignment (MSA). Each row represents the sequence of the same protein from different organisms. The letters in the sequence represent the different amino acids or residues. Image credit: Wikipedia. Source: https://en.wikipedia.org/wiki/Multiple_sequence_alignment
First of all, let us define what an MSA is. An MSA, short for multiple sequence alignment, is the alignment of three or more amino acid (or nucleic acid) sequences, used to find evolutionary relationships and to identify shared patterns of structural and functional importance.
Consider the figure as an example. Among the many rows, you can see that only the two multi-colored sequences differ at certain positions. This shows coevolution. How? As one position changes (in the 3rd row), a corresponding change can be observed at a second position (in the 11th row). This “connection” implies spatial proximity of those amino acids in the protein’s 3D form. Thus, an MSA can be used to infer which residues might be in contact. These contact predictions are incorporated into structure prediction, where they guide the folding process (see Figure 4).
Figure 4: Correlated mutations carry information about the distance relationships in protein structure. Paired correlations in the multiple sequence alignment or MSA (left) allow deducing which residue pairs are likely to be close to each other in the 3D structure (right). Image credit: Nature. Source: https://www.nature.com/articles/s41598-019-55047-4/figures/2
When this coevolutionary information is transformed into a matrix (a binary contact map), the possible protein structures can be narrowed down. In summary, AlphaFold analyzes the coevolution apparent in the MSA and assembles the most probable fragments based on that information.
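As an illustration of this idea, here is a toy sketch that scores covariation between MSA columns using mutual information and thresholds the scores into a binary contact map. The mini-MSA and the 0.5 threshold are invented for the example; real pipelines use alignments with thousands of sequences and more sophisticated statistics (such as direct coupling analysis) to separate true contacts from indirect correlations.

```python
import math
from collections import Counter

# A toy MSA: each row is the same protein from a different organism.
# Columns 1 and 4 covary (A pairs with V, M pairs with K),
# hinting at a physical contact between those two positions.
msa = [
    "GAPLV",
    "GAPLV",
    "GMPLK",
    "GAPLV",
    "GMPLK",
    "GMPLK",
]

def mutual_information(msa, i, j):
    """Mutual information between alignment columns i and j:
    a high value means the two columns change together."""
    n = len(msa)
    col_i = [s[i] for s in msa]
    col_j = [s[j] for s in msa]
    p_i, p_j = Counter(col_i), Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), count in p_ij.items():
        pab = count / n
        mi += pab * math.log2(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

L = len(msa[0])
scores = {(i, j): mutual_information(msa, i, j)
          for i in range(L) for j in range(i + 1, L)}

# Threshold the covariation scores into a binary contact map.
contacts = {pair for pair, mi in scores.items() if mi > 0.5}
print(contacts)  # only the covarying pair (1, 4) survives the threshold
```

Columns that never vary carry no covariation signal (their mutual information is zero), so only the pair of columns that mutate together is predicted to be in contact.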
Why Deep Learning Is Suitable for Solving the Protein Folding Problem
As I briefly mentioned above, AlphaFold was not the first attempt to use computer algorithms to solve the protein folding problem. In fact, since 1994, the Critical Assessment of protein Structure Prediction (CASP) has benchmarked protein structure prediction methods and delivered an evaluation of the current advances in protein structure modeling.
But how did DeepMind, or any other group, end up training AI to answer biological questions? Actually, the beauty of AI is that the same methods or algorithms can be used to answer almost any question, even one of biology’s most complicated issues, the protein folding problem. All you need is to feed the system the right input and ask the right questions! That sounds easy, but it is actually the most difficult part.
If it’s so difficult, then why did DeepMind select the protein folding problem as its next target? What makes deep learning suitable to answer this question?
Scientists and engineers in the AI sector have long kept an eye on the protein folding problem as something that AI could solve. In fact, Jeff Hawkins, the founder of Palm and co-founder of Numenta Inc., predicted in his 2004 book On Intelligence that artificial intelligence could solve this problem.
First of all, the protein folding problem is a well-defined problem, with well-defined questions and answers.
The question is: given an amino acid sequence, where each position can be one of 20 different amino acid types, which 3D form will the protein take? To address this, you need to “transform” biological data into something a computer can process. Thankfully, an amino acid sequence can easily be converted into a series of one-hot encoding vectors, each representing one amino acid.
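A minimal sketch of such an encoding (the alphabetical ordering of the 20 standard amino acids below is just one common convention, not the only possible one):

```python
# The 20 standard amino acids, one letter each, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(sequence):
    """Encode an amino acid sequence as a list of 20-dimensional
    one-hot vectors, one vector per residue."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    vectors = []
    for aa in sequence:
        v = [0] * len(AMINO_ACIDS)
        v[index[aa]] = 1  # a single 1 marks which amino acid this is
        vectors.append(v)
    return vectors

encoded = one_hot("MKV")
# 'M' is at index 10 in AMINO_ACIDS, so the first vector
# has its single 1 in position 10.
```

A sequence of length N thus becomes an N x 20 matrix of zeros and ones, exactly the kind of numerical input a neural network can consume.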
The answer is also well defined: a protein’s structure can be described by the three-dimensional coordinates of its amino acids, or by the torsion angles between amino acid residues. And, in principle, knowing a protein’s sequence alone is enough to deduce its form.
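For instance, a torsion (dihedral) angle is fully determined by the coordinates of four consecutive atoms. Here is a small, pure-Python sketch of computing one from raw 3D points; the example coordinates are invented for illustration, not taken from a real protein.

```python
import math

def dihedral(p0, p1, p2, p3):
    """Torsion (dihedral) angle, in degrees, defined by four 3D points,
    e.g. the backbone atoms that define a residue's phi or psi angle."""
    def sub(a, b):
        return [a[i] - b[i] for i in range(3)]
    def cross(a, b):
        return [a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0]]
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    b0, b1, b2 = sub(p1, p0), sub(p2, p1), sub(p3, p2)
    n1, n2 = cross(b0, b1), cross(b1, b2)   # normals to the two planes
    norm = math.sqrt(dot(b1, b1))
    m1 = cross(n1, [c / norm for c in b1])
    # atan2 keeps the sign, distinguishing +90 from -90 degrees.
    return math.degrees(math.atan2(dot(m1, n2), dot(n1, n2)))

# Four points where the last one is rotated out of the plane
# of the first three by a quarter turn: the torsion angle is 90.
angle = dihedral((1, 0, 0), (0, 0, 0), (0, 1, 0), (0, 1, 1))
```

Describing a backbone by its torsion angles rather than raw coordinates is attractive because it removes the arbitrary overall rotation and translation of the molecule.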
In addition to the points above, deep learning methods are easier to apply when there is plenty of data for training. In the case of proteins, the Protein Data Bank holds the 3D structural data of proteins obtained through the experimental approaches we discussed earlier. DeepMind used this database to develop their program.
Do you remember AlphaGo back in 2015? DeepMind made its great debut with the highly successful AlphaGo. Like the protein folding problem, the game of Go is a well-defined problem. The question in Go is where to place your stone on the 19x19 board, and the answer is judged by whether you win or lose. To get to that answer, there is a large amount of data from records of previous games and their results.
Thus, great achievements on the protein folding problem could be expected as well, since AlphaGo’s case had proved the approach possible.
How AlphaFold Differs From Previous Approaches
DeepMind’s debut at the 13th edition of CASP in 2018 was excellent, even snatching 1st place in the competition. They showed everybody what can be achieved when you feed the right data into the right model. In fact, “What just happened?” were the exact words that Mohammed AlQuraishi, a professor at Columbia University and an expert in the field, wrote at the time.
Before DeepMind’s entry into CASP, there had been several attempts to use neural networks to predict 3D protein structure from amino acid sequence. While coevolutionary analysis has been responsible for most of the progress in protein structure prediction, its conversion into contact maps was limited by the types of input used and the output generated. AlphaFold’s approach distinguishes itself from other methods in that the neural network is trained to predict a histogram of distances, a “distogram”. Instead of predicting only whether two amino acids are in contact, AlphaFold goes a step further and produces more informative output: a map of probability distributions over distances (see Figure 5).
Figure 5: Matrix of inter-residue distances. Image credit: DeepMind. Source: https://predictioncenter.org/casp13/doc/presentations/Pred_CASP13-DeepLearning-AlphaFold-Senior.pdf
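To see why a distogram is richer than a binary contact map, consider the toy sketch below. The number of bins, the distance range, and the example probabilities are all invented for illustration; they are not DeepMind’s actual configuration.

```python
N_BINS = 10                 # hypothetical number of distance bins
MIN_D, MAX_D = 2.0, 22.0    # hypothetical distance range covered (angstroms)
CONTACT_CUTOFF = 8.0        # classic contact-map threshold (angstroms)
BIN_WIDTH = (MAX_D - MIN_D) / N_BINS   # here, 2 angstroms per bin

def bin_index(distance):
    """Map a distance to its histogram bin, clamping out-of-range values."""
    d = min(max(distance, MIN_D), MAX_D - 1e-9)
    return int((d - MIN_D) / BIN_WIDTH)

# A toy predicted distribution over the 10 bins for one residue pair
# (in a real model, these probabilities come out of the network).
probs = [0.01, 0.02, 0.30, 0.40, 0.15, 0.05, 0.03, 0.02, 0.01, 0.01]

# A distogram still lets you recover a contact probability by summing
# the mass in the bins below the cutoff...
p_contact = sum(probs[:bin_index(CONTACT_CUTOFF)])

# ...but it also carries more detail, e.g. an expected distance,
# computed here from the midpoints of the bins.
expected_d = sum(p * (MIN_D + (i + 0.5) * BIN_WIDTH)
                 for i, p in enumerate(probs))
```

A binary contact map would reduce this whole distribution to a single yes/no bit per residue pair; the distogram preserves how far apart the residues are likely to be, which gives the structure-building stage much more to work with.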
For AlphaFold, the entire system can be divided into two steps, and the distogram plays an important role in both. In the next article of our AlphaFold series, we will carefully explain the input and output data of the AlphaFold algorithm and its successor, AlphaFold 2, by reviewing the information revealed in DeepMind’s blog as well as their peer-reviewed articles.