Biology-Instructions: A Dataset and Benchmark for Multi-Omics Sequence Understanding Capability of Large Language Models

He, Haonan; Ren, Yuchen; Tang, Yining; Xu, Ziyang; Li, Junxian; Yang, Minghao; Zhang, Di; Yuan, Dong; Chen, Tao; Zhang, Shufei; Li, Yuqiang; Dong, Nanqing; Ouyang, Wanli; Zhou, Dongzhan; Ye, Peng

arXiv:2412.19191 (q-bio)

[Submitted on 26 Dec 2024 (v1), last revised 23 Sep 2025 (this version, v2)]

Abstract: Large language models (LLMs) have shown remarkable capabilities in general domains, but their application to multi-omics biology remains underexplored. To address this gap, we introduce Biology-Instructions, the first large-scale instruction-tuning dataset for multi-omics biological sequences, covering DNA, RNA, proteins, and multi-molecule tasks. The dataset bridges LLMs and complex biological sequence-related tasks, enhancing their versatility and reasoning while maintaining conversational fluency. We also highlight significant limitations of current state-of-the-art LLMs on multi-omics tasks without specialized training. To overcome these limitations, we propose ChatMultiOmics, a strong baseline trained on Biology-Instructions with a novel three-stage pipeline, which demonstrates superior biological understanding. Both resources are publicly available, paving the way for better integration of LLMs in multi-omics analysis. Biology-Instructions is available at: this https URL.

Comments:	EMNLP 2025 findings
Subjects:	Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2412.19191 [q-bio.BM]
	(or arXiv:2412.19191v2 [q-bio.BM] for this version)
	https://doi.org/10.48550/arXiv.2412.19191

Submission history

From: Haonan He
[v1] Thu, 26 Dec 2024 12:12:23 UTC (10,331 KB)
[v2] Tue, 23 Sep 2025 12:55:03 UTC (7,084 KB)
