MADRe

Strain-level metagenomic classification with Metagenome Assembly driven Database Reduction approach

Installation

OPTION 1: Conda

conda install bioconda::madre

set up the configuration (config.ini file):

[PATHS]
metaflye = flye
metaMDBG = metaMDBG
minimap = minimap2
hairsplitter = hairsplitter.py
seqkit = seqkit

[DATABASE]
predefined_db = /path/to/database.fna
strain_species_json = /path/to/taxids_species.json

NOTE: A prebuilt version of taxids_species.json can be found in the database folder of this repository; more information is available under the section Build database.
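
Since config.ini uses standard INI syntax, you can sanity-check it with Python's configparser before running; a minimal sketch (how MADRe itself parses the file is not shown here):

import configparser

# Parse config.ini and print the resolved tool and database paths.
cfg = configparser.ConfigParser()
cfg.read("config.ini")
print(cfg["PATHS"]["metaflye"])          # e.g. flye
print(cfg["DATABASE"]["predefined_db"])  # e.g. /path/to/database.fna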

simple run:

madre --reads [path_to_the_reads] --out-folder [path_to_the_out_folder] --config config.ini

more information:

madre --help

OPTION 2: Docker

docker pull jlipovac13/madre:0.0.4

simple run:

docker run --rm -v $PWD:/data jlipovac13/madre:0.0.4 madre --reads /data/reads.fastq --config /data/config.ini --out-folder /data/out_folder

more information:

docker run --rm -v $PWD:/data jlipovac13/madre:0.0.4 madre --help

set up the configuration (config.ini file):

[PATHS]
metaflye = flye
metaMDBG = metaMDBG
minimap = minimap2
hairsplitter = hairsplitter.py
seqkit = seqkit

[DATABASE]
predefined_db = /data/database.fna
strain_species_json = /data/taxids_species.json

NOTE: Ensure that, along with the input data, database.fna and taxids_species.json are in the /data/ folder. A prebuilt version of taxids_species.json can be found in the database folder of this repository; more information is available under the section Build database.

OPTION 3: Running from source

git clone https://github.com/lbcb-sci/MADRe
cd MADRe

To run from source, you need to install the following dependencies:

  • python >= 3.10
  • scikit-learn
  • minimap2
  • flye
  • metamdbg
  • hairsplitter
  • seqkit
  • kraken2

Dependencies can be installed through conda:

conda create -n MADRe_env python=3.10 scikit-learn minimap2 flye metamdbg hairsplitter seqkit kraken2 -c conda-forge -c bioconda 
conda activate MADRe_env

set up the configuration (config.ini file):

[PATHS]
metaflye = /path/to/flye
metaMDBG = /path/to/metaMDBG
minimap = /path/to/minimap2
hairsplitter = /path/to/hairsplitter.py
seqkit = /path/to/seqkit

[DATABASE]
predefined_db = /path/to/database.fna
strain_species_json = ./database/taxids_species.json

simple run:

python MADRe.py --reads [path_to_the_reads] --out-folder [path_to_the_out_folder] --config config.ini

more information:

python MADRe.py --help

The recommended database is the Kraken2 bacteria database; instructions on how to build it can be found under the section Build database.

Information on how to run specific MADRe steps can be found under the section Run specific steps.

Build database

The recommended database is the Kraken2 bacteria database, built with the following steps:

kraken2-build --download-taxonomy --db $DBNAME
kraken2-build --download-library bacteria --db $DBNAME
kraken2-build --build --db $DBNAME

Detailed instructions, including the steps listed here, can be found on the Kraken2 GitHub page.

If you want to use your own database, it is important to have taxonomy information for the references included in it.

References in the database should have headers formatted as follows:

>|taxid|accession_number
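
For illustration, a minimal Python sketch of splitting such a header into its parts (the function name and example values are hypothetical):

# Split a ">|taxid|accession_number" header into (taxid, accession).
def parse_header(header: str) -> tuple[str, str]:
    _, taxid, accession = header.lstrip(">").split("|", maxsplit=2)
    return taxid, accession

print(parse_header(">|511145|NC_000913.3"))  # ('511145', 'NC_000913.3')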

The database/taxids_species.json file contains the species taxid for every strain taxid, obtained from the NCBI taxonomy (downloaded December 2024).
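
The file is a plain JSON object mapping each strain taxid to its species taxid; a minimal sketch of reading it (the exact key/value formatting is an assumption):

import json

# Hypothetical excerpt: {"511145": "562", "93061": "1280"}
with open("taxids_species.json") as f:
    strain_to_species = json.load(f)
print(strain_to_species["511145"])  # "562" (E. coli K-12 MG1655 -> E. coli)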

To build a new taxid index from a newer taxonomy file, or for different taxonomic levels, you can use database/build_json_taxids.py.
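
A minimal sketch of the idea behind such an index builder, assuming the standard NCBI nodes.dmp format (this is not the actual build_json_taxids.py):

import json

# Read taxid -> (parent, rank) from NCBI nodes.dmp (fields separated by "|").
parent, rank = {}, {}
with open("nodes.dmp") as f:
    for line in f:
        fields = [x.strip() for x in line.split("|")]
        parent[fields[0]], rank[fields[0]] = fields[1], fields[2]

def species_of(taxid: str) -> str | None:
    # Walk up the taxonomy until a node with rank "species" is reached.
    while taxid != parent[taxid]:  # the root is its own parent
        if rank[taxid] == "species":
            return taxid
        taxid = parent[taxid]
    return None

strain_to_species = {t: s for t in rank if (s := species_of(t))}
with open("taxids_species.json", "w") as f:
    json.dump(strain_to_species, f)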

Run specific steps

MADRe is a pipeline consisting of two main steps: 1) database reduction and 2) read classification.

It is possible to run these steps independently. More information on running them can be obtained with:

database-reduction --help
read-classification --help

if installed from source:

python src/DatabaseReduction.py --help
python src/ReadClassification.py --help

Database reduction information

To run the database reduction step separately, you need to provide the output paths, a PAF file containing the contig mappings to the large database (the database needs to follow the rules from the Build database section), and a text file listing how many strains are collapsed in each contig. The file has one contig_id:index line per strain: a contig representing only one strain gets a single line with index 0, a contig representing two strains (one collapsed) gets lines with indices 0 and 1, and so on. The file should look like this:

...
contig_7:0 
contig_8:0 
contig_8:1 
contig_8:2 
contig_8:3
...
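
In the snippet above, contig_7 represents a single strain while contig_8 represents four. A minimal sketch of counting strains per contig from such a file (the file name is hypothetical):

from collections import Counter

# One "contig_id:index" line per strain, so counting lines per contig
# gives the number of strains each contig represents.
strains_per_contig = Counter()
with open("collapsed_strains.txt") as f:  # hypothetical file name
    for line in f:
        contig, sep, _ = line.strip().partition(":")
        if sep:
            strains_per_contig[contig] += 1

print(strains_per_contig)  # e.g. Counter({'contig_8': 4, 'contig_7': 1})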

If you only specify --reduced_list_txt as output, you won't get a FASTA file of the reduced database, just a list of the references that should go into it. To get the FASTA file of the reduced database, specify --reduced_db.

The database reduction step uses a taxid index, by default database/taxids_species.json. If a custom large database is used, the matching taxid index should be provided using --strain_species_info.

Read classification information

To run the read classification step separately, you need to provide a PAF file containing the read mappings to the reference database. This step can be run on any database (the database needs to follow the rules from the Build database section), so it doesn't have to be a previously reduced one.

The read classification step uses a taxid index, by default database/taxids_species.json. If a custom large database is used, the matching taxid index should be provided using --strain_species_info.

The output is a text file containing lines of the form: read_id : reference.
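
A minimal sketch of loading this output into a Python dict (the file name is hypothetical):

# Load "read_id : reference" lines into a dict.
read_to_ref = {}
with open("read_classification.out") as f:  # hypothetical file name
    for line in f:
        read_id, sep, reference = line.partition(":")
        if sep:
            read_to_ref[read_id.strip()] = reference.strip()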

Read Classification with clustering

As part of the read classification step, clustering of very similar strains can also be performed. To perform clustering, provide a path to the directory for the clustering output files using --clustering_out. The output clustering files are (see the sketch after this list):

clusters.txt - Every line represents one cluster, with the references in that cluster separated by spaces.
representatives.txt - Every line contains the representative reference of the cluster on the corresponding line of clusters.txt.
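
A minimal sketch of mapping every reference to its cluster representative, assuming line i of representatives.txt corresponds to line i of clusters.txt:

# Pair each clusters.txt line with its representatives.txt line.
with open("clusters.txt") as c, open("representatives.txt") as r:
    ref_to_representative = {
        member: rep.strip()
        for cluster, rep in zip(c, r)
        for member in cluster.split()
    }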

Abundance calculation

For abundance calculation information run:

calculate-abundances --help

if installed from source:

python src/CalculateAbundances.py --help

The input to this step is a read classification output file with lines of the form read_id : reference, as produced by the read classification step.

The default output is rc_abundances.out, containing read count abundances. If you want to calculate abundance as sum_of_read_lengths/reference_length, you need to provide the database path used in the read classification step via --db; be aware that with a large database this takes somewhat longer than calculating read count abundances alone.
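
A minimal sketch of the read count abundance idea, computing each reference's share of classified reads (the file name and output format are assumptions):

from collections import Counter

# Read count abundance: fraction of classified reads assigned to a reference.
counts = Counter()
with open("read_classification.out") as f:  # "read_id : reference" lines
    for line in f:
        _, sep, reference = line.partition(":")
        if sep:
            counts[reference.strip()] += 1

total = sum(counts.values())
for ref, n in counts.most_common():
    print(f"{ref}\t{n}\t{n / total:.4f}")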

If you want to calculate cluster abundances, you need to provide the path to the directory containing the clusters.txt and representatives.txt files. In that case, the output files will contain only representative references, with summed abundances for the cluster each reference represents.
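
A minimal sketch of the rollup, reusing counts from the abundance sketch and ref_to_representative from the clustering sketch above (both names are illustrative, not MADRe internals):

from collections import Counter

# Sum per-reference read counts into their cluster representatives.
def cluster_abundances(counts: Counter, ref_to_representative: dict) -> Counter:
    rolled = Counter()
    for ref, n in counts.items():
        rolled[ref_to_representative.get(ref, ref)] += n
    return rolled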

Citing MADRe

bioRxiv preprint (https://www.biorxiv.org/content/10.1101/2025.05.12.653324v1):

Lipovac, J., Sikic, M., Vicedomini, R., & Krizanovic, K. (2025). MADRe: Strain-Level Metagenomic Classification Through Assembly-Driven Database Reduction. bioRxiv, 2025-05.
