JanusDNA: A Powerful Bi-directional Hybrid DNA Foundation Model

Update

2025-08-29: Pre-training weights for models with final MLPs are available here.

2025-08-01: Attaching two MLP layers after the Janus fusion layer significantly improves model performance with only a minimal increase in parameter count.

The JanusDNA_mlp models are the latest versions.

Nucleotide Transformer (NT)

Task	JanusDNA (w/ midattn)	JanusDNA (w/o midattn)	JanusDNA_mlp (w/ midattn; w/ rc)	JanusDNA_mlp (w/o midattn; w/ rc)
Size (M)	1.980	1.988	2.001	2.009
H3	0.821±0.021	0.824±0.012	0.835±0.009	0.831±0.023
H3K14ac	0.665±0.034	0.685±0.016	0.729±0.022	0.718±0.026
H3K36me3	0.658±0.024	0.670±0.012	0.702±0.015	0.699±0.025
H3K4me1	0.563±0.041	0.571±0.018	0.615±0.035	0.616±0.018
H3K4me2	0.509±0.056	0.548±0.022	0.589±0.023	0.586±0.019
H3K4me3	0.605±0.030	0.629±0.022	0.688±0.026	0.675±0.014
H3K79me3	0.716±0.017	0.727±0.023	0.747±0.013	0.743±0.009
H3K9ac	0.641±0.024	0.639±0.019	0.673±0.014	0.661±0.027
H4	0.809±0.021	0.816±0.008	0.812±0.011	0.813±0.013
H4ac	0.637±0.060	0.653±0.034	0.698±0.013	0.705±0.023
enhancers	0.564±0.022	0.535±0.036	0.559±0.042	0.542±0.044
EnhancersTypes	0.462±0.049	0.470±0.025	0.503±0.038	0.492±0.096
PromoterAll	0.969±0.002	0.971±0.002	0.970±0.002	0.970±0.003
PromoterNoTata	0.971±0.003	0.971±0.002	0.971±0.004	0.971±0.003
PromoterTata	0.956±0.010	0.958±0.008	0.958±0.007	0.960±0.008
SpliceSitesAll	0.963±0.022	0.960±0.009	0.967±0.005	0.943±0.020
SpliceSitesAcceptors	0.949±0.020	0.939±0.022	0.957±0.012	0.961±0.009
SpliceSitesDonors	0.947±0.015	0.936±0.014	0.948±0.008	0.935±0.016

DNALongBench

Dataset	Caduceus-PH	JanusDNA (w/o midattn)	JanusDNA_mlp (w/o midattn)
Size (M)	7.7	7.662	7.745
AT	0.690	0.802	0.851
AS	0.759	0.740	0.768
CCF	0.689	0.770	0.801
MS	0.789	0.803	0.864
NT	0.841	0.877	0.913
SNSES	0.812	0.874	0.903
SSELL	0.691	0.706	0.845
Thyroid	0.703	0.752	0.792
WB	0.768	0.794	0.821

2025-06-28: Pretrained JanusDNA weights are now available for download here.

Getting Started

To begin, create a conda environment with the required dependencies:

conda env create -f janusdna.yml

Activate the environment:

conda activate janusdna

Install Mamba:

wget https://github.com/state-spaces/mamba/releases/download/v2.2.4/mamba_ssm-2.2.4+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl
pip install mamba_ssm-2.2.4+cu12torch2.5cxx11abiFALSE-cp311-cp311-linux_x86_64.whl

pip install selene-sdk --no-deps
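
After installation, a quick sanity check can confirm that the key packages are importable inside the activated environment. The snippet below is a minimal sketch, not part of the repository; the import names `torch`, `mamba_ssm`, and `selene_sdk` are assumptions based on the packages installed above.

```python
import importlib.util


def missing_packages(names):
    """Return the names in `names` that cannot be found as importable modules."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Assumed top-level import names for the packages installed above.
    required = ["torch", "mamba_ssm", "selene_sdk"]
    missing = missing_packages(required)
    print("missing:", missing if missing else "none")
```

If anything is reported as missing, re-check that the janusdna environment is active before re-running the install commands.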

Reproducing Experiments

As described in the paper, there are four main experimental components:

Pretraining JanusDNA on the Human Reference Genome
GenomicBenchmarks
Nucleotide Transformer Datasets
DNALongBench

Pretraining on the Human Reference Genome

(Data download instructions adapted from HyenaDNA.)

First, download the Human Reference Genome data, which consists of two files: a .fasta file containing all sequences, and a .bed file specifying the intervals used.

The directory structure should be:

data
|-- hg38/
  |-- hg38.ml.fa
  |-- human-sequences.bed

Download the FASTA (.fa) file for the entire human genome into ./data/hg38. It merges all chromosomes (approximately 24) into a single file. Then download the .bed file of sequence intervals (chromosome name, start, end, split), which is used to retrieve sequences from the FASTA file.

mkdir -p data/hg38/
curl https://storage.googleapis.com/basenji_barnyard2/hg38.ml.fa.gz > data/hg38/hg38.ml.fa.gz
gunzip data/hg38/hg38.ml.fa.gz  # unzip the fasta file
curl https://storage.googleapis.com/basenji_barnyard2/sequences_human.bed > data/hg38/human-sequences.bed
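
To check the downloaded interval file, the four columns described above (chromosome name, start, end, split) can be tallied per split. This is a minimal sketch that assumes whitespace-separated columns with no header row; the function name `count_intervals_by_split` is hypothetical and not part of the repository.

```python
from collections import Counter


def count_intervals_by_split(lines):
    """Count intervals per split from lines of 'chrom start end split'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Assumption: whitespace-separated columns, no header row.
        chrom, start, end, split = line.split()[:4]
        if int(end) <= int(start):
            raise ValueError(f"empty or inverted interval on {chrom}: {start}-{end}")
        counts[split] += 1
    return counts


if __name__ == "__main__":
    # Illustrative rows only; these are not taken from the real file.
    sample = ["chr1\t0\t1000\ttrain", "chr1\t1000\t2000\tvalid", "chr2\t0\t500\ttrain"]
    print(count_intervals_by_split(sample))
```

For the real data, pass an open file handle instead of the sample list, for example `open("data/hg38/human-sequences.bed")`.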

Run a pre-training script from scripts/pre_train/:

|-- scripts
  |-- pre_train
    |-- slurm_JanusDNA_w_midattn_32dim.sh
    |-- slurm_JanusDNA_w_midattn_72dim.sh
    |-- slurm_JanusDNA_wo_midattn_32dim.sh
    |-- slurm_JanusDNA_wo_midattn_72dim.sh
    |-- slurm_JanusDNA_wo_midattn_144dim.sh

For example:

cd scripts/pre_train/
sbatch slurm_JanusDNA_w_midattn_32dim.sh

GenomicBenchmarks

The GenomicBenchmarks suite, as presented in Grešová et al. (2023), comprises eight classification tasks.

You can launch fine-tuning with 5-fold cross-validation using gb_janusdna.sh:

bash scripts/benchmark/gb/gb_janusdna.sh

Nucleotide Transformer Datasets

The Nucleotide Transformer suite of tasks was introduced in Dalla-Torre et al. (2023). The data is available on HuggingFace: InstaDeepAI/nucleotide_transformer_downstream_tasks.

You can launch fine-tuning with 10-fold cross-validation using nt_janusdna.sh:

bash scripts/benchmark/nt/nt_janusdna.sh
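
Cross-validation results are typically reported as mean±std over folds, as in the results table in the Update section. The sketch below shows one way to aggregate per-fold scores into that form. It uses the sample standard deviation, which is an assumption; the repository may compute it differently, and `summarize_folds` is a hypothetical helper.

```python
from statistics import mean, stdev


def summarize_folds(scores):
    """Format per-fold scores as 'mean±std' with three decimals."""
    if len(scores) < 2:
        raise ValueError("need at least two fold scores to compute a standard deviation")
    # Assumption: sample standard deviation (n - 1 in the denominator).
    return f"{mean(scores):.3f}±{stdev(scores):.3f}"


if __name__ == "__main__":
    # Illustrative fold scores only, not real results.
    print(summarize_folds([0.80, 0.82, 0.84]))
```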

DNALongBench

Download the dataset from the dataset website following the instructions in the DNALongBench repository.

Place the eQTL dataset zip file in the data directory and unzip it:

mkdir -p <root of project dir>/data/dnalongbench
mv <the downloaded zip file> <root of project dir>/data/dnalongbench
unzip <root of project dir>/data/dnalongbench/<the downloaded zip file> -d <root of project dir>/data/dnalongbench

cd data/dnalongbench/eQTL/seqs
gunzip hg38.fa.gz

You can fine-tune on a specific cell-type dataset using eqtl_train_janus_8gpu.sh:

sbatch scripts/benchmark/dnalong/eqtl_train_janus_8gpu.sh

After fine-tuning, evaluate the results on the corresponding test dataset using eqtl_evaluation_janus.sh:

sbatch scripts/benchmark/dnalong/eqtl_evaluation_janus.sh

Evaluation output and all log files for fine-tuning and evaluation will be stored in the watch_folder directory. For example:

watch_folder
|-- eQTL
  |-- janusdna_len-131k_d_model-144_inter_dim-576_n_layer-8_lr-8e-3_step-50K_moeloss-true_1head_onlymoe
    |-- Whole_Blood_lr-4e-4_cjtrain_false_batch_8_seed_1.log  (fine-tuning log)
    |-- Whole_Blood_lr-4e-4_cjtrain_false_batch_8_seed_1_cjtest_true.log  (evaluation log)
    |-- Whole_Blood_lr-4e-4_cjtrain_false_batch_4_seed_1_cjtest_true_output.txt  (evaluation output)

To calculate AUROC based on the evaluation output, use the script auroc.py.

To calculate AUROC for all cell-type datasets at once, use the script evaluate_auroc_janus.py.
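
For reference, AUROC is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The following is a minimal pure-Python sketch of that definition, not the implementation in auroc.py; the format of the evaluation output file is not assumed here, so the function takes plain lists of labels and scores.

```python
def auroc(labels, scores):
    """Compute AUROC by comparing every positive score with every negative score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0  # positive ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Illustrative labels and scores only.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

This pairwise version is quadratic in the number of examples; for large evaluation outputs, a rank-based routine such as `sklearn.metrics.roc_auc_score` is a more practical choice.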

Acknowledgements

This repository is adapted from Caduceus and leverages much of the training, data loading, and logging infrastructure defined there. Caduceus was originally derived from HyenaDNA. We also acknowledge the contributions of Jamba-v0.1, which provided the initial codebase for hybrid architectures.
