If you are interested in building a business in the pharmaceutical industry using AI and machine learning techniques, you have probably heard about AlphaFold and the stir DeepMind’s AI team caused in the scientific community. And no wonder: AlphaFold made enormous advances toward solving one of the most challenging problems in biology, the so-called “protein folding problem.”
We talked about this problem and its importance in a previous article.
At the end of 2020, Google’s artificial intelligence team DeepMind took part in CASP14, a competition on protein structure prediction, and significantly outperformed the other participants. Predictions are scored from 0 to 100 on a measure called the Global Distance Test (GDT), with a score of around 90 considered on par with the accuracy of experimental methods. AlphaFold demonstrated the accuracy of its algorithm by scoring a median of 92.4 and earning an overwhelming first place.
Why Does Protein Structure Matter?
Proteins perform many biological functions in our body: they keep our brain and heart running, digest our food, and protect us from viruses and bacteria.
The basic structure of a protein is a simple chain of basic units, but this linear form is thermodynamically unstable. To reach a stable state, the chain folds until it takes on a characteristic shape.
The shape of a protein is important because it defines the protein’s role. For example, antibodies are typically Y-shaped proteins that bind viruses to protect our body. Globular proteins such as hemoglobin transport other molecules (oxygen, in the case of hemoglobin), while fibrous proteins such as collagen are responsible for the strength and elasticity of skin. The characteristic shape of a protein says a lot about its function.
A fine example of the importance of a protein’s structure is a single protein in the novel coronavirus SARS-CoV-2. The virus attaches to and enters host cells using a protein on its surface called Spike, which, by the way, was among the CASP14 competition targets. Depending on the shape of Spike, a cell will either interact with the virus and let it enter, or not. This interaction determines whether the virus can infect humans or animals. You can see a video of how this happens here.
If the spike protein had a slightly different shape, we would probably not be experiencing the pandemic in this way!
Building Blocks of Life
Proteins start as long chains of amino acids, which act like building blocks. Proteins are made up of 20 different amino acids, each sharing a similar skeleton: a central carbon (the α carbon) bonded to a nitrogen (the amino group, NH2) and to another carbon (the carboxyl group, COOH).
In addition, each of the 20 amino acids has a specific side chain (the R group). Side chains differ in shape, charge, and reactivity, and they give each amino acid its particular chemical properties.
Experimental vs Computational Approaches
These days it is relatively easy to determine the amino acid sequence of a given protein, that is, the order in which the amino acids appear in the chain. It is still difficult, however, to predict which shape that chain will fold into. This is why experimental methods are used.
To determine the 3D shape of a protein, scientists rely on costly and complex methods (such as X-ray crystallography, cryo-electron microscopy, and nuclear magnetic resonance) that measure its atomic and molecular structure. Over the years, scientists have worked out around 150,000 protein structures using these methods. All of these structures and their corresponding sequences are stored in a repository called the Protein Data Bank (abbreviated PDB).
These experimental approaches are accurate, but they are expensive and can take more than a year to identify the structure of a single protein. We cannot go through such a procedure for the millions of different proteins that exist in nature. Therefore, more and more researchers have moved toward computational approaches, and more recently toward machine learning methods, attempting to predict protein shapes instead of doing the “real work” of measuring them.
Computing Protein Structures from Sequences
The protein folding problem asks how the amino acid sequence of a protein dictates the three-dimensional structure it folds into.
The order of the amino acids in the sequence and the electrochemical properties of each amino acid determine the final form of a protein. Also, since there are 20 different types of amino acids, a protein sequence can be expressed simply as a string of symbols drawn from a 20-letter alphabet.
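To make the "20 symbols" idea concrete, here is a minimal sketch of one common way to turn a sequence into numbers: one-hot encoding. The function name and the three-residue example are our own illustration and are not taken from AlphaFold itself.

```python
# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element row per residue, with a single 1 marking its type."""
    rows = []
    for residue in sequence.upper():
        if residue not in AA_INDEX:
            raise ValueError(f"unknown amino acid code: {residue!r}")
        row = [0] * len(AMINO_ACIDS)
        row[AA_INDEX[residue]] = 1
        rows.append(row)
    return rows


# Hypothetical three-residue fragment: methionine, lysine, valine.
encoded = one_hot_encode("MKV")
```

A protein of length L becomes an L x 20 table of zeros and ones, which is the kind of input a neural network can work with directly.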
Seen from an AI point of view, the protein folding problem is an ideal setting for deep learning. We have a database of correct answers (the PDB), and both the problem (the protein sequence) and the answer (the protein’s 3D shape, against which a prediction can be scored) can be expressed mathematically.
Over the years, the protein folding problem has been tackled with various computational methods, including template-based modeling (based on the idea that similar sequences lead to similar structures), fragment assembly, and so on. In fragment assembly, a target protein sequence is broken down into small overlapping fragments, and the database is searched for known structures of similar fragment sequences, which are then assembled into a full-length prediction. Although this method proved successful, it is considered inefficient because it takes too much computing time.
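The first stage of fragment assembly, cutting the target into overlapping windows, can be sketched in a few lines. The fragment length of 9 and the sequence below are assumptions chosen for illustration; the database search and the assembly stages are not shown.

```python
def overlapping_fragments(sequence, length=9, step=1):
    """Yield (start_index, fragment) for each window of `length` residues."""
    for start in range(0, len(sequence) - length + 1, step):
        yield start, sequence[start:start + length]


target = "MKVLAAGIVGLLLAQ"  # made-up 15-residue sequence
fragments = list(overlapping_fragments(target, length=9))
```

Each fragment would then be looked up against known structures, which hints at why the approach is slow: the number of lookups and possible combinations grows quickly with sequence length.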
This is where “evolutionary covariation data” comes in and improves the accuracy of structure predictions. AlphaFold 1 is described as a “coevolution-dependent method.” We will look at this in detail in the following part.
When a protein folds, there are contact points in the structure where amino acids from different parts of the chain stick together.
If we can figure out those contact points, the remaining amino acids, which are not stuck to each other, become much easier to arrange around them.
Protein Coevolution: It’s the Contact That Matters
Contact points play a crucial role in maintaining the overall 3D structure of proteins. These contact points arise from electrochemical interactions between pairs of amino acids. However, a mutation can change one of these amino acids during evolution, and consequently change the structure and function of the protein. That is why, in general, when one amino acid of a pair is modified, its partner tends to change as well, so that the pair keeps similar electrochemical properties and the protein shape stays unchanged. In this situation, we say that the pair of amino acids has coevolved.
In other words, after one amino acid is replaced, coevolution tends to compensate for the change and restore the balance of the protein.
Thus, if you want to look for relevant amino acid interactions that contribute to protein structure (and function), you should look for amino acid pairs that have coevolved.
Amino acid covariation can be found by analyzing large sequence databases. Multiple sequence alignment (MSA) programs align the amino acid sequences of a protein family across different species so that the correlation between positions can be examined. This analysis reports pairs of positions whose amino acids are correlated and that may therefore contribute to the protein shape.
In other words, coevolution-dependent methods use MSA information to find positions that have coevolved within a protein, and use those to infer where along the chain amino acids are in contact.
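As a rough illustration of "examining the correlation between positions," the sketch below computes the mutual information between two columns of a toy alignment. Mutual information is only one simple covariation score, and it is our choice for this example; practical coevolution methods use more elaborate statistics and corrections. The alignment is made up.

```python
import math
from collections import Counter


def mutual_information(msa, i, j):
    """Mutual information (in bits) between columns i and j of an alignment."""
    n = len(msa)
    col_i = Counter(seq[i] for seq in msa)
    col_j = Counter(seq[j] for seq in msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in pairs.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((col_i[a] / n) * (col_j[b] / n)))
    return mi


# Toy alignment (0-based columns): columns 1 and 3 change together
# (K with E, R with D), while column 2 varies independently of column 1.
toy_msa = [
    "AKLEG",
    "ARLDG",
    "AKVEG",
    "ARVDG",
]

covarying = mutual_information(toy_msa, 1, 3)
independent = mutual_information(toy_msa, 1, 2)
```

The covarying pair scores higher than the independent pair, which is the signal a coevolution-based method would read as a hint that the two positions may be in contact.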
As mentioned previously, AlphaFold 1 is a coevolution-based approach as well, but it goes a step further compared with other algorithms.
How Does AlphaFold 1 Predict Protein Structures?
AlphaFold does not simply look for contact points. Instead, it predicts the distance between each pair of amino acids in a protein, along with a probability distribution over that distance.
Let’s say that there are amino acids A, B, C, and D present in a protein. A distance map as below can be drawn:
The distance between an amino acid and itself is zero, and the distance between A and B is the same as that between B and A. You can draw a table like the one above by calculating the distance between each pair of amino acids.
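For readers who prefer code to tables, a distance map is easy to compute once each amino acid has a position in space. The coordinates below are invented for the example and do not come from a real protein.

```python
import math


def distance_map(coords):
    """Return the symmetric matrix of Euclidean distances between all points."""
    n = len(coords)
    dmap = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(coords[i], coords[j])
            dmap[i][j] = dmap[j][i] = d  # distance A-B equals distance B-A
    return dmap


# Made-up 3D positions (in angstroms) for amino acids A, B, C and D.
coords = {
    "A": (0.0, 0.0, 0.0),
    "B": (3.0, 4.0, 0.0),
    "C": (3.0, 4.0, 12.0),
    "D": (0.0, 0.0, 12.0),
}
dmap = distance_map(list(coords.values()))
```

The diagonal stays at zero and the matrix is symmetric, matching the two properties described above.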
Furthermore, AlphaFold also predicts the probability distribution of those distances.
AlphaFold Step 1: Predicting the Distogram with a Convolutional Neural Network (CNN)
This is a picture from the AlphaFold 1 paper. Here you can see that each pixel in the distance map represents a probability distribution. DeepMind calls this a distogram.
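To give a feel for what one "pixel" of a distogram holds, here is a small sketch in which a handful of raw scores for one amino acid pair are turned into a probability distribution over distance bins. The five bins and the scores are assumptions made for readability; they are not the values used by the actual network.

```python
import math

# Five illustrative distance bins, in angstroms (an assumption for this sketch).
BIN_EDGES = [2.0, 6.0, 10.0, 14.0, 18.0, 22.0]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def expected_distance(probs):
    """Probability-weighted average of the bin centres."""
    centres = [(lo + hi) / 2 for lo, hi in zip(BIN_EDGES, BIN_EDGES[1:])]
    return sum(p * c for p, c in zip(probs, centres))


# Hypothetical raw network scores for one amino acid pair (one distogram pixel).
pixel_probs = softmax([0.1, 2.5, 1.0, -1.0, -2.0])
most_likely_bin = max(range(len(pixel_probs)), key=pixel_probs.__getitem__)
```

A full distogram is simply this kind of distribution repeated for every pair of positions in the sequence, so it carries both a best guess and a measure of how confident that guess is.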
Applying AI algorithms to biological problems largely comes down to representing things mathematically. In this case, how can the final 3D structure of a protein be defined? One option is to use the three-dimensional coordinates of the central atom of each amino acid.
AlphaFold instead represents a 3D structure through the pairs of torsion angles along the chain of amino acids. Even when a protein is in a folded state, the structure of its basic blocks, the amino acids, remains unchanged. What changes are the torsion angles between one amino acid and the next. AlphaFold calculates the probability distribution of these torsion angles.
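A torsion (dihedral) angle is defined by four consecutive points and measures the twist around the bond joining the middle two. In a protein backbone, the angle usually called phi is measured over the atoms C, N, Cα, C and the angle called psi over N, Cα, C, N. The sketch below computes such an angle for arbitrary points; the helper names are ours and the test points are invented.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed torsion angle in degrees about the bond from p1 to p2."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))


# Four made-up points: the last one is rotated a quarter turn around the z axis.
angle = dihedral((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

Because bond lengths and bond angles are nearly fixed, a list of torsion angles is enough to rebuild the path of the backbone, which is why it works as a compact description of a fold.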
One can say that AlphaFold 1 is a two-step process. In the first step, a CNN trained on the PDB dataset takes the amino acid sequence of a target protein, together with features derived from its MSA, and predicts 1) the distogram of the protein and 2) the probability distribution of its torsion angles.
AlphaFold Step 2: Repeated Gradient Descent on Protein-Specific Potential
With the two outputs from step 1, AlphaFold 1 starts the second step, in which it tries to find the final folded structure. To do this, it builds a potential function that is specific to the protein being folded.
First, an initial structure is chosen using the torsion angle distribution predicted by the CNN. Comparing this initial structure with the distogram estimated by the CNN tells us how well the structure fits the predicted distance probabilities. The resulting score is called the distance potential. The bigger the difference between the distances predicted in the distogram and those in the proposed structure, the higher the potential energy of that structure.
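One simple way to score how well a structure fits a distogram is to add up the negative log-probability that the distogram assigns to each distance found in the structure. The sketch below does this with a lookup over a few illustrative bins; the bins, the probabilities, and the function names are all assumptions. It is not the published implementation, which among other things needs a smooth potential so that gradient descent can be applied.

```python
import math

# Illustrative distance bins, in angstroms (an assumption for this sketch).
BIN_EDGES = [2.0, 6.0, 10.0, 14.0, 18.0, 22.0]


def bin_index(distance):
    """Index of the bin containing `distance`, clamped to the covered range."""
    for k in range(len(BIN_EDGES) - 1):
        if distance < BIN_EDGES[k + 1]:
            return k
    return len(BIN_EDGES) - 2


def distance_potential(distances, distogram):
    """Sum of -log P(distance) over amino acid pairs.

    `distances[pair]` is a distance measured in the candidate structure;
    `distogram[pair]` lists the predicted probability of each bin.
    """
    total = 0.0
    for pair, d in distances.items():
        p = distogram[pair][bin_index(d)]
        total += -math.log(max(p, 1e-9))  # small floor avoids log(0)
    return total


# Hypothetical prediction: A and B are most likely 6 to 10 angstroms apart.
distogram = {("A", "B"): [0.05, 0.80, 0.10, 0.03, 0.02]}
good_fit = distance_potential({("A", "B"): 7.5}, distogram)
poor_fit = distance_potential({("A", "B"): 19.0}, distogram)
```

A structure that places A and B at a likely distance gets a low potential, and one that places them at an unlikely distance gets a high one, which matches the behavior described above.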
Similarly, comparing the initially predicted structure with the torsion angle distribution gives another potential function, the geometric potential.
Lastly, the algorithm also considers physical constraints. As mentioned above, an amino acid consists of a backbone and a side chain. However, AlphaFold 1 predicts the initial structure using only the backbone, without considering the side chains. AlphaFold therefore adds a van der Waals term that penalizes steric clashes, since atoms in a real protein cannot occupy the same space. This is referred to as the smooth potential.
Therefore, AlphaFold calculates three potentials and sums them up in a single combined potential function, the protein-specific potential.
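Written as a formula, and using the names from this article, the combined potential for a candidate structure x is the sum of the three terms. The symbols below are our own notation, and the second line is one plausible reading of the distance term (a sum of negative log-probabilities over amino acid pairs i, j) rather than a quotation from the paper.

```latex
V_{\text{total}}(x) = V_{\text{distance}}(x) + V_{\text{geometric}}(x) + V_{\text{smooth}}(x)

V_{\text{distance}}(x) = -\sum_{i \neq j} \log P\bigl(d_{ij}(x)\bigr)
```

Here d_ij(x) is the distance between amino acids i and j in structure x, and P is the probability the distogram assigns to that distance.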
The last step is to go through iteration and find an optimal solution that minimizes the corresponding potential function. AlphaFold 1 uses gradient descent minimization to obtain well-packed protein structures.
After repeated gradient descent steps, the optimization converges, and the structure with the lowest potential found during the run is stored as one of the candidate answers.
However, a search that relies on a single initial structure may get stuck in a local minimum, so the descent is repeated from starting structures with noise added. The best structure from each of these runs is also stored as a candidate answer.
Among the candidate answers, the structure with the lowest potential is selected as the final answer.
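The overall search strategy (descend, restart with noise, keep the best) can be illustrated on a toy problem. In the sketch below a one-dimensional function with several dips stands in for the protein-specific potential, and the step size, noise level, and number of restarts are arbitrary choices; the real optimization runs over the torsion angles of an entire protein.

```python
import math
import random


def potential(x):
    """Toy one-dimensional stand-in for the protein-specific potential."""
    return 0.1 * x * x + math.sin(3.0 * x)


def gradient(x):
    """Derivative of the toy potential."""
    return 0.2 * x + 3.0 * math.cos(3.0 * x)


def descend(x, rate=0.01, steps=2000):
    """Plain gradient descent from a starting point."""
    for _ in range(steps):
        x -= rate * gradient(x)
    return x


def best_of_restarts(x_init, restarts=20, noise=2.0, seed=0):
    """Descend from the initial point and from noisy copies; keep the lowest."""
    rng = random.Random(seed)
    candidates = [descend(x_init)]
    for _ in range(restarts):
        candidates.append(descend(x_init + rng.gauss(0.0, noise)))
    return min(candidates, key=potential)


x_start = 4.0
x_single = descend(x_start)          # may settle in a nearby local minimum
x_best = best_of_restarts(x_start)   # lowest-potential candidate overall
```

Because the run from the unperturbed starting point is always among the candidates, the final answer can never be worse than a single descent, and the noisy restarts give the search a chance to escape a poor local minimum.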
This is how AlphaFold 1 predicts protein structures from protein sequence.
We hope we gave a good explanation of why predicting protein structure is important, why deep learning is well suited to the protein folding problem, and how AlphaFold works as a two-step system that takes a protein sequence and MSA features as input and outputs accurate predictions of the protein’s 3D structure.
In our next article, we will explain the new features of AlphaFold 2 and will take a look at the three existing patents related to AlphaFold 1.