Twist Bioscience HQ
681 Gateway Blvd
South San Francisco, CA 94080
codonGPT: reinforcement learning on a generative language model enables scalable mRNA design
PRODUCTS USED
ABSTRACT
Reinforcement learning (RL) has become a powerful tool for design tasks in engineering. However, in biology, the combination of RL with language models has not been explored. Moreover, emerging generative language models for biology focus on DNA, non-coding RNA, or proteins. This ignores information hidden in coding messenger RNA (mRNA). Even though non-generative models for mRNA exist, the absence of generative models hinders scalable mRNA design for therapeutics, synthetic biology, and protein engineering. To address this, first, we built codonGPT, the first generative language model trained exclusively on 338 417 mRNA sequences of model organisms. To constrain synonymous sequences unique to mRNA, we introduce a novel inference-time masking method, enabling us to frame mRNA design as a constrained language modeling task. Second, we demonstrate a robust method for mRNA optimization with two example genes HLA-A and ACTB, by implementing RL on codonGPT with multiple biological constraints such as expression, stability, and GC-content. Our methods can be applied for optimization of widely used reporter genes such as GFP, beta-lactamase, and erythropoietin.