“We’ve discovered more about the world than any civilisation before us. But we have been stuck on one problem: how do proteins fold up?” - John Moult, Co-Founder of CASP [1]
In 2018, an interdisciplinary team at DeepMind entered CASP13, the 13th Critical Assessment of protein Structure Prediction competition, for the first time with AlphaFold 1. Having previously developed AlphaGo and AlphaZero, which learned to beat human players at Go and chess through extensive reinforcement learning [2], DeepMind took a detour into biochemistry to tackle the decades-long enigma of how proteins fold into their tertiary structure from a linear sequence of amino acids.
Since Anfinsen’s experimental research into the denaturation and renaturation of ribonuclease first sparked inquiries into protein conformation, no one has been able to reliably predict the shape of an arbitrary protein from its amino acid sequence alone. The implications of such a discovery would be profound: if we can accurately predict the structure of a sizeable chunk of the proteome, we can significantly expand our database linking gene sequences with protein misfolding diseases, design drugs with greater efficacy, and manufacture synthetic proteins for agricultural purposes. Yet over the past 60 years of compounding research effort, only about 170 000 unique protein structures, out of the several billion proteins known to exist, have been determined via slow and expensive experimental techniques.
The current theory that proteins undergo reversible disorder ⇌ order folding transitions when in contact with water under physiological conditions [3] derives from Anfinsen’s Nobel Prize-winning investigations into the denaturation of ribonuclease, an enzyme that catalyses the hydrolysis of 3′,5′-phosphodiester bonds in RNA [4]. In his Nobel lecture, he postulated that the native structure of a protein in its normal “physiological milieu” is the one in which the Gibbs free energy of the whole system is lowest [5]. His team and peers undertook further research into the conformation of other proteins in the subsequent years, including studies of ribonuclease in urea with mercaptoethanol, a reducing agent that denatures proteins by reducing their disulfide bonds [6], added. In this case, disulfide interchange occurred, converting the original mixture into native ribonuclease, indistinguishable from a regular native ribonuclease mixture. The only problem was that this renaturation of ribonuclease took hours when it should have taken just a couple of minutes.
Anfinsen’s further analysis of the discrepancy between in vitro and in vivo rates of folding led to his discovery of an enzyme system in the endoplasmic reticulum that rapidly catalysed the formation of native disulfide bonds. Effectively, this meant that proteins whose S-S bonds had been broken and reformed incorrectly could be converted into their native structure within a matter of minutes when exposed to this enzyme system, resolving Anfinsen’s earlier dilemma.
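The free-energy claim from Anfinsen’s lecture can be written compactly. The notation below is an illustrative assumption rather than Anfinsen’s own: x stands for a conformation of the chain, s for its amino acid sequence, and G for the Gibbs free energy of the whole system (protein plus solvent) under physiological conditions.

```latex
% Native structure as the free-energy minimum (symbols are illustrative):
%   x   : a conformation of the polypeptide chain
%   s   : the amino acid sequence
%   G   : Gibbs free energy of the whole system at fixed temperature T
\[
  x_{\text{native}}(s) \;=\; \arg\min_{x}\; G(x \mid s)
\]
% Folding is then favourable when the free-energy change is negative:
\[
  \Delta G_{\text{fold}}
  \;=\; G_{\text{native}} - G_{\text{unfolded}}
  \;=\; \Delta H - T\,\Delta S \;<\; 0
\]
```

Read this way, structure prediction amounts to finding that minimum from the sequence alone, without having to follow the folding pathway the chain takes to reach it.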
The central tenet of the CASP competition rests on Anfinsen’s dogma and the thermodynamic hypothesis: that a protein’s native conformation is the most thermodynamically stable structure its amino acid sequence can adopt in the intracellular environment [5]. The first decade of CASP competitions, which began in 1994, showed little progress: research teams failed to find a shortcut around monotonous experimental work, and the speed of structure determination remained bottlenecked by how laborious X-ray crystallography, cryo-electron microscopy and nuclear magnetic resonance are [7][8].
However, in 2020, the same interdisciplinary team at DeepMind entered CASP14 with AlphaFold 2, a redesign of its predecessor with a significantly upgraded architecture [9]. Historically, research teams had predicted protein structures by drawing on already-known structural relatives via homology (comparative) modelling. De novo structure prediction, where no such relatives are available, therefore proved to be the greatest hurdle for the other CASP competitors, and the AI’s relative success with proteins for which no homologous structures were known electrified the bioinformatics community. A multiple sequence alignment (MSA) of roughly 30 amino acid sequences thought to be evolutionarily related to the target, alongside a representation of residue pairs, was enough to output an accurate structural prediction within days. Increasing the MSA depth beyond about 100 sequences leads to only small accuracy gains, likely because the MSA is mainly needed to find the coarse structure, while refining that structure into a high-accuracy model does not depend crucially on MSA information and is instead computed by the network itself [9]. AlphaFold 2 could learn protein folding far more efficiently from limited databases, yet was still able to cope with the sheer complexity and variety of the structural data provided [9]. Able to produce accurate models even for highly challenging targets such as intertwined homomers or proteins requiring chaperone proteins, and with few handcrafted features, AlphaFold was lauded as a breakthrough in the AI and biochemistry communities.
All CASP entries are assessed with the Global Distance Test (GDT) metric, which measures how closely each competitor’s model matches the experimentally determined ground truth for proteins whose structures are not yet in public databases. GDT is scored from 0 to 100 and broadly reflects the percentage of residues placed within a threshold distance of their experimental positions. A score of around 90 is informally regarded as competitive with experimental methods, as it implies that any remaining discrepancies between the ground truth and the prediction could be rooted in experimental error rather than prediction error in the AI. At CASP14, AlphaFold 2 achieved a median GDT score of 92.4 across all targets, which corresponds to an average error of approximately 1.6 angstroms, signifying atomic accuracy even when predicting rare, unrelated proteins. After a series of incremental steps by research teams across the past 60 years, AlphaFold took the leap that transformed the landscape of protein folding.
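To make the GDT idea concrete, here is a minimal Python sketch of the commonly used GDT_TS variant, which averages the percentage of residues within 1, 2, 4 and 8 angstroms of their experimental positions. It is a simplification, not CASP’s assessment software: the function name `gdt_ts` and the toy coordinates are invented for illustration, and the two structures are assumed to be already superposed, whereas the real metric also searches over many superpositions to maximise each count.

```python
import math

# Distance cutoffs (in angstroms) averaged by the GDT_TS variant of the score.
GDT_TS_CUTOFFS = (1.0, 2.0, 4.0, 8.0)


def gdt_ts(predicted, experimental, cutoffs=GDT_TS_CUTOFFS):
    """Return a simplified GDT_TS score (0-100) for two C-alpha traces.

    Assumption: both lists hold (x, y, z) tuples for the same residues, in
    the same order, and the structures are already superposed. The real
    metric additionally searches over superpositions for each cutoff.
    """
    if len(predicted) != len(experimental) or not predicted:
        raise ValueError("need two non-empty coordinate lists of equal length")

    # Per-residue deviation between predicted and experimental positions.
    deviations = [math.dist(p, e) for p, e in zip(predicted, experimental)]

    # For each cutoff, the percentage of residues that fall within it.
    percentages = [
        100.0 * sum(d <= cutoff for d in deviations) / len(deviations)
        for cutoff in cutoffs
    ]
    return sum(percentages) / len(percentages)


# Toy example with four residues (coordinates are made up for illustration).
experimental = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
predicted = [(0.5, 0.0, 0.0), (3.8, 1.5, 0.0), (7.6, 0.0, 3.0), (11.4, 10.0, 0.0)]
score = gdt_ts(predicted, experimental)
print(f"GDT_TS = {score:.2f}")  # prints "GDT_TS = 56.25"
```

In the toy example the per-residue deviations are 0.5, 1.5, 3.0 and 10.0 angstroms, so 25%, 50%, 75% and 75% of residues fall within the four cutoffs, giving an average of 56.25. A prediction identical to the experimental structure would score 100.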
Though AlphaFold is a promising sign of how research methodologies can be accelerated, it is not yet the all-knowing bible of native protein structure and function. According to Kathryn Tunyasuvunakool, a member of the DeepMind team working on AlphaFold, “[the AI’s outputs] are just predictions and they do need to be interpreted critically in light of the confidence metrics – so they’re useful for hypothesis generation but you do need to think critically about them” [8]. The team recommended using experimental data, such as chemical footprinting, hydrogen-deuterium exchange and smFRET, alongside AI prediction to establish the ground truth for proteins’ native structures. Philip Ball, writer and member of the Royal Society of Chemistry, has taken a more sceptical view of DeepMind’s achievement, suggesting that “the algorithm doesn’t so much solve the protein folding problem as evade it. How it ‘reasons’ from [amino acid] sequence to structure remains a black box” [10]. His doubts are echoed by some researchers, who regard AlphaFold’s domain placement and its predictions for multi-chain complexes and membrane proteins as still too inaccurate for it to be hailed as the “solution” to this 60-year head-scratcher.
Nevertheless, it is clear that AlphaFold has been a game-changer in our search for a definitive method of predicting native protein structures. Demis Hassabis’ aim of steering DeepMind towards solving real-world problems and creating a revolutionary model to aid scientists’ experimental efforts has been achieved, and it has sparked a discussion on using AI in tandem with lab work to guide research. DeepMind’s work this time has permanently altered the philosophy of how scientific research is conducted: perhaps in the coming decades, we will see more discoveries made with computational methods and neural networks, and less painstaking pipetting in a lab.
AlphaFold’s source code and database are publicly available.
References
[1] DeepMind - “AlphaFold: The making of a scientific breakthrough” | https://www.youtube.com/watch?v=gg7WjuFs8F4 (2021)
[2] Silver et al. “Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm” (2017) https://arxiv.org/abs/1712.01815
[3] Rose G., Fleming P., Banavar J., Maritan A. “A backbone-based theory of protein folding” (2006) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1636505/
[4] Lee H.H., Wang Y.N., Hung M.C. “Functional roles of the human ribonuclease A superfamily in RNA metabolism and membrane receptor biology” (2019) https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6788960/
[5] Anfinsen C. “Studies on the Principles that Govern the Folding of Protein Chains” (1972) https://www.nobelprize.org/uploads/2018/06/anfinsen-lecture.pdf
[6] Nelson D., Lehninger A., Cox M. “Lehninger Principles of Biochemistry” New York: W.H. Freeman. p. 148. ISBN 0-7167-4339-6 https://archive.org/details/lehningerprincip00lehn_0/page/148/mode/2up
[7] Heaven W.D. “DeepMind’s protein-folding AI has solved a 50-year-old grand challenge of biology” (2020) https://www.technologyreview.com/2020/11/30/1012712/deepmind-protein-folding-ai-solved-biology-science-drugs-disease/
[8] European Bioinformatics Institute (EMBL-EBI) “How to interpret AlphaFold structures” (2021) https://www.youtube.com/watch?v=UqeQfRDA8Yk
[9] Jumper J., Evans R., Pritzel A. et al. “Highly accurate protein structure prediction with AlphaFold” Nature 596, 583-589 (2021) https://doi.org/10.1038/s41586-021-03819-2
[10] Ball P. “Behind the screens of AlphaFold” (2020) https://www.chemistryworld.com/opinion/behind-the-screens-of-alphafold/4012867.article