CodonBERT large language model for mRNA vaccines

  1. Sven Jager1
  1. 1Digital R&D, Sanofi, Cambridge, Massachusetts 02141, USA
  2. 2mRNA Center of Excellence, Sanofi, Waltham, Massachusetts 02451, USA
  3. 3mRNA Center of Excellence, Sanofi, 69280 Marcy L'Etoile, France
  1. 4These authors contributed equally to this work.

  • Corresponding authors: zivbj{at}cs.cmu.edu, sven.jager{at}sanofi.com
  • Abstract

    mRNA-based vaccines and therapeutics are gaining traction across a wide range of conditions. One critical issue in designing such mRNAs is sequence optimization: even small proteins or peptides can be encoded by an enormous number of distinct mRNAs, and the choice of sequence can strongly affect properties including expression, stability, and immunogenicity. To enable selection of an optimal sequence, we developed CodonBERT, a large language model (LLM) for mRNAs. Unlike prior models, CodonBERT uses codons as inputs, which enables it to learn better representations. CodonBERT was trained on more than 10 million mRNA sequences from a diverse set of organisms. The resulting model captures important biological concepts and can be extended to perform prediction tasks for various mRNA properties. CodonBERT outperforms previous mRNA prediction methods, including on a new flu vaccine data set.
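    The abstract's key design choice is codon-level input: a coding sequence is read as non-overlapping 3-nucleotide codons rather than single bases. A minimal sketch of what such a tokenizer might look like is below; the vocabulary layout, special-token names, and function names are illustrative assumptions, not the paper's actual implementation.

    ```python
    # Hypothetical sketch of codon-level tokenization (illustrative, not the
    # authors' code): an mRNA coding sequence is split into non-overlapping
    # 3-nucleotide codons, which form the model's vocabulary instead of
    # individual nucleotides.
    from itertools import product

    # All 64 codons plus BERT-style special tokens (names are assumptions).
    SPECIAL_TOKENS = ["[PAD]", "[CLS]", "[SEP]", "[UNK]", "[MASK]"]
    CODONS = ["".join(c) for c in product("ACGU", repeat=3)]
    VOCAB = {tok: i for i, tok in enumerate(SPECIAL_TOKENS + CODONS)}

    def tokenize_codons(seq: str) -> list:
        """Split a coding sequence into codon tokens (reading frame at 0)."""
        seq = seq.upper().replace("T", "U")  # accept DNA-style input
        if len(seq) % 3 != 0:
            raise ValueError("coding sequence length must be a multiple of 3")
        return [seq[i:i + 3] for i in range(0, len(seq), 3)]

    def encode(seq: str) -> list:
        """Map a sequence to vocabulary ids, wrapped in [CLS] ... [SEP]."""
        tokens = ["[CLS]"] + tokenize_codons(seq) + ["[SEP]"]
        return [VOCAB.get(t, VOCAB["[UNK]"]) for t in tokens]

    print(tokenize_codons("AUGGCCAAAUAA"))  # ['AUG', 'GCC', 'AAA', 'UAA']
    ```

    Because synonymous codons get distinct token ids, two mRNAs encoding the same protein produce different token sequences, which is what lets a codon-level model learn sequence-dependent properties such as expression and stability.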

    Footnotes

    • Received December 15, 2023.
    • Accepted June 25, 2024.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
