The cost of developing new drug candidates is rising rapidly. It takes nearly $2 billion and 12 years to take a drug from target discovery to passing clinical trials. For genetic diseases, the situation is even worse: our ability to interpret the 3 billion nucleotides that make up our genome remains incredibly limited despite the falling cost of genome sequencing. My project aims to apply the principles of computer vision and NLP to this problem in drug discovery.
Phase 1 uses convolutional neural networks to understand transcription-factor binding patterns in A549 lung epithelial cells. Using one-hot encoded ChIP-seq data, which gives signals for binding strength across an entire genome, the model is able to learn binding motifs and flag potentially disease-causing regulatory variants that can negatively impact gene expression. It achieves an accuracy of 90.5%, surpassing traditional approaches based on Position Weight Matrices (PWMs) by nearly 20%.
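To make the one-hot encoding and the motif-learning idea concrete, here is a minimal sketch in plain Python. It is not the project's code: the helper names `one_hot` and `scan` and the toy "TGA" filter are assumptions made for illustration. It shows how a DNA string becomes a matrix of 0s and 1s, and how sliding a small weight matrix along it (which is what a single 1-D convolution filter does, and also how a PWM is scanned) produces a score per position.

```python
# Illustrative sketch only; names and the toy motif are hypothetical.
BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows ordered A, C, G, T.
    Unknown bases such as N become all-zero rows."""
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]

def scan(encoded, filt):
    """Slide a (width x 4) filter along the encoded sequence.
    Returns one score per window: the sum of element-wise products,
    i.e. a single convolution filter with no bias or activation."""
    width = len(filt)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for i in range(width):
            for j in range(4):
                total += encoded[start + i][j] * filt[i][j]
        scores.append(total)
    return scores

# A made-up filter that rewards the 3-base pattern "TGA".
toy_filter = one_hot("TGA")
encoded = one_hot("ACTGACN")
scores = scan(encoded, toy_filter)

# Index of the window where the filter responds most strongly.
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

In a trained network the filter weights are learned from the ChIP-seq labels rather than fixed by hand, and many filters are stacked with nonlinearities, which is one plausible reason such a model can outperform a single fixed PWM.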
The second phase of the project uses bidirectional LSTMs for de novo molecular design (hence Project De Novo)! The model can capture the structure and syntax of SMILES molecular representations with near-perfect accuracy, achieving a loss of only ~1.3. After sampling the model on 100,000 SMILES strings (millions of characters of chemical information!), the model can generate valid molecular structures 84% of the time.
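As a rough illustration of how SMILES strings can be fed to a character-level sequence model, the sketch below builds a character vocabulary and turns one molecule into (prefix, next character) training pairs. It is an assumption-laden example, not the project's pipeline: the start and end markers `!` and newline, and the helper names `build_vocab`, `encode` and `next_char_pairs`, are all hypothetical.

```python
# Illustrative sketch only; markers and helper names are assumptions.
START, END = "!", "\n"  # assumed sequence markers, not SMILES characters

def build_vocab(smiles_list):
    """Map every character seen in the dataset (plus markers) to an integer id."""
    chars = sorted({c for s in smiles_list for c in s} | {START, END})
    return {c: i for i, c in enumerate(chars)}

def encode(smiles, vocab):
    """Wrap a SMILES string in start/end markers and convert it to ids."""
    return [vocab[c] for c in START + smiles + END]

def next_char_pairs(ids):
    """Build (prefix, next id) pairs: the model sees the prefix
    and is trained to predict the character that follows."""
    return [(ids[:i], ids[i]) for i in range(1, len(ids))]

data = ["CCO", "c1ccccc1"]  # ethanol and benzene
vocab = build_vocab(data)
ids = encode("CCO", vocab)
pairs = next_char_pairs(ids)
print(len(vocab), ids, pairs[0])
```

Generation then runs the same idea in reverse: start from the start marker, sample one character at a time, and stop at the end marker. Whether a sampled string is a chemically valid molecule would need to be checked with a cheminformatics toolkit, which this sketch does not attempt.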
The initial models (version 1: March 2019) remain open-source and allow genomics + cheminformatics researchers to train deep learning models on any ChIP-seq or SMILES string dataset and obtain detailed analysis within hours.
What inspired you (or your team)?
Nearly 20 years ago, we sequenced the first human genome. After $3 billion in investment and 13 years of research, we had finally found all 3 billion A’s, C’s, T’s and G’s which cracked the code for life, and with it the future of creating personalized medicine and curing genetic disease seemed within reach.
Unfortunately, we’re still far from achieving personalized medicine and curing genetic disease today. And while the cost of sequencing a human genome has fallen from eight figures to only $47 in 2019, genomics research is still at a similar standpoint. Why? Because biology is fundamentally hard to read. Humans cannot conceptualize or reason about the hundreds or possibly thousands of mutations which result in complex diseases such as cancer.
Inspired by this and the research being done by companies like Deep Genomics, I decided to try to tackle this problem exactly one year ago. With the rise of more powerful computing, and a rapid increase in genomic data, machine learning can be used to help increase our understanding of the human genome.
What excites me most about AI + biology is the opportunity for immense scale. Drug design and healthcare today are similar to manufacturing right before the beginning of the industrial revolution: incredibly costly, time-consuming and hard to scale. By applying computer vision to genomics, my project is able to scale to understand any gene’s DNA-protein binding patterns without changing one line of code. With computer science, we have the potential to completely disrupt healthcare by changing how we understand the genome and design drugs.
Last month, I got the incredible opportunity to present my research at the Re-Work Deep Learning Summit in Montreal [https://www.re-work.co/events/deep-learning-summit-montreal-2019/speakers], where I gave a talk on my project in front of machine learning researchers and practitioners at Google Brain, Facebook AI Research, and MIT! Being able to present and get validation from academia was a really valuable experience.
I owe a lot to my supervisors and mentors, Divya Sivasankaran and Andrei Fajardo from Integrate.ai (where I worked this summer), Amit Deshwar and Omar Wagih from Deep Genomics (who helped me start researching into ML + genomics over a year ago), and Jasper Snoek from Google Brain. Without their support and feedback, this project wouldn’t be possible!
Demo + Further Explanation of Project: https://youtu.be/mlUIQnlffFM