codonGPT: reinforcement learning on a generative language model enables scalable mRNA design

PRODUCTS USED

Genes

ABSTRACT

Reinforcement learning (RL) has become a powerful tool for design tasks in engineering. In biology, however, the combination of RL with language models has not been explored. Moreover, emerging generative language models for biology focus on DNA, non-coding RNA, or proteins, which ignores information hidden in coding messenger RNA (mRNA). Although non-generative models for mRNA exist, the absence of generative models hinders scalable mRNA design for therapeutics, synthetic biology, and protein engineering. To address this, we first built codonGPT, the first generative language model trained exclusively on mRNA, using 338 417 mRNA sequences from model organisms. To restrict generation to synonymous sequences, a constraint unique to mRNA, we introduce a novel inference-time masking method, which lets us frame mRNA design as a constrained language modeling task. Second, we demonstrate a robust method for mRNA optimization on two example genes, HLA-A and ACTB, by implementing RL on codonGPT with multiple biological constraints such as expression, stability, and GC-content. Our methods can be applied to optimize widely used reporter genes such as GFP, beta-lactamase, and erythropoietin.
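
The idea of inference-time masking can be illustrated with a minimal sketch. It is not the codonGPT implementation: the function names (`mask_logits`, `design`, `gc_preference`), the greedy decoding, and the toy scoring function standing in for a language model are all assumptions made for illustration. Only the standard genetic code table is fixed. At each position, every codon that does not encode the target amino acid has its logit set to negative infinity, so any decoded sequence is synonymous with the input protein by construction.

```python
import math
from itertools import product

# Standard genetic code (NCBI translation table 1), bases ordered T, C, A, G.
# DNA letters are used here; replace T with U for an RNA alphabet.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO))


def mask_logits(logits, amino_acid):
    """Keep logits of codons encoding `amino_acid`; set all others to -inf."""
    return [
        logit if CODON_TO_AA[codon] == amino_acid else -math.inf
        for codon, logit in zip(CODONS, logits)
    ]


def design(protein, score_fn):
    """Greedy decoding under the synonymous-codon mask.

    `score_fn(prefix)` is a hypothetical stand-in for a language model:
    it takes the codons chosen so far and returns 64 logits, one per codon.
    """
    chosen = []
    for amino_acid in protein:
        masked = mask_logits(score_fn(chosen), amino_acid)
        best = max(range(len(CODONS)), key=lambda i: masked[i])
        chosen.append(CODONS[best])
    return "".join(chosen)


def translate(seq):
    """Translate a coding sequence back to its amino acid string."""
    return "".join(CODON_TO_AA[seq[i:i + 3]] for i in range(0, len(seq), 3))


def gc_content(seq):
    """Fraction of G and C bases in a sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


def gc_preference(prefix):
    """Toy scoring function (assumption): prefer codons with more G/C bases."""
    return [float(sum(base in "GC" for base in codon)) for codon in CODONS]


if __name__ == "__main__":
    protein = "MKVLA"
    mrna = design(protein, gc_preference)
    print(mrna, translate(mrna) == protein, round(gc_content(mrna), 2))
```

Because the mask is applied to the logits rather than to the training data, the same trained model can be reused for any target protein, and a quantity such as `gc_content` could serve as one term of a reward when the generator is later tuned with RL.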
