Code for paper about loss of bacterial inter-phylum horizontally acquired genes

This repo contains code to reproduce the results of the paper "The fate of horizontally acquired genes: rapid initial turnover followed by long-term persistence".

The main script is imli.py, which implements an IMplicit phylogenetic Lateral gene transfer Inference (IMLI) method to infer lateral gene transfer events between species that belong to different taxonomic groups but share high sequence identity (run python imli.py -h for usage instructions). All other scripts are in the misc/ directory. The code was run on an HPC cluster with 50-128 threads (depending on the script) via Python multiprocessing, so it is not optimized for local execution.
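
For example, the available options can be listed from the command line (a minimal sketch; assumes the conda environment described under Dependencies below is active):

python imli.py -h # prints the usage instructions and available options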

For the data from all analyses performed in the paper, see the Zenodo repository.

Dependencies

Please use the hgt_analyses.yml file to create a conda/mamba environment with the required dependencies.
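
For example (a sketch; the environment name is assumed here to be hgt_analyses, i.e. whatever name is defined inside hgt_analyses.yml):

mamba env create -f hgt_analyses.yml # or: conda env create -f hgt_analyses.yml
conda activate hgt_analyses # assumes the environment name defined in the .yml file is "hgt_analyses"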

Pipeline overview

The data (see the Zenodo repository) is organized into nested subdirectories, one per pipeline step:

bash extractSubsetOfDataset.sh # script in the Zenodo data directory to extract a subset from EggNOG v5
# then root the trees with MAD, using the script from misc/ and the downloaded MAD executable
nohup python runMADinParallelOnSplitEggNOGTreesFile.py -i 2_subset_trees.tsv -m ~/bin/mad/mad > mad_rooting_nohup.log & disown

mkdir makeInputFiles # directory in which to prepare the input files for imli using this subset

This results in a directory with files like:

min10TaxaMax2000Genes
├── 2_members_subset.tsv
├── 2_MSA.faa.filepaths_list.txt
├── 2_msaNOGsFound_sorted.txt
├── 2_MSA_subset.faa.filepaths_list.txt
├── 2_NOGIDs_MSAexists.tsv
├── 2_NOGIDs_subset.tsv
├── 2_NOGIDstwocolumned_MSAexists.tsv
├── 2_NOGIDstwocolumned_subset.tsv
├── 2_trees.subset.info
├── 2.trees_subset.nwk
├── 2.trees_subset.nwk.rooted
├── 2_trees_subset.rooted.tsv
├── 2_trees_subset.tsv
├── extractSubsetOfDataset.sh
├── mad_rooting_nohup.log
└── makeInputFiles
cd makeInputFiles # use this subset to prepare input files for imli
# then prepare the sequence identity input files for imli
nohup python prepare_SeqIdentityInputFiles.parallel.py -i ../2_MSA_subset.faa.filepaths_list.txt -o ./2_sequenceIdentities -thr 75 -t 80 > 2_nohup_sequenceIdentitiesCalculation.out & disown
# prepare PPI and gene position files based on STRING downloaded data
nohup python preparePPIAndGenePositionsFiles.py -l COG.links.v11.5.txt -map COG.mappings.v11.5.txt -mem ../2_members_subset.tsv -p gene.chromosome.positions_eggNOG.tsv -o 2_PPI.csv > 2_nohup_PPIFilePreparation.out & disown
# and taxonomic grouping file
python prepareTaxonomicGroupingFile.py -i ../2_members_subset.tsv -o 2_taxonomicGroupings.tsv

mkdir runIMLI # directory to run imli

which results in the following directory structure (with some symlinks for convenience):

makeInputFiles
├── 2_nohup_PPIFilePreparation.out
├── 2_nohup_sequenceIdentitiesCalculation.out
├── 2_PPI.csv
├── 2_sequenceIdentities
├── 2_sequenceIdentities.log
├── 2_taxonomicGroupings.csv
├── COG.links.v11.5.txt -> ../../../../../STRING_downloads/COG.links.v11.5.txt
├── COG.mappings.v11.5.txt -> ../../../../../STRING_downloads/COG.mappings.v11.5.txt
├── gene.chromosome.positions_eggNOG.tsv -> ../../../../../STRING_downloads/gene.chromosome.positions_eggNOG.tsv
├── gene.chromosome.positions_eggNOG.tsv_subset
├── preparePPIAndGenePositionsFiles.py -> /root/imli/misc/preparePPIAndGenePositionsFiles.py
├── prepare_SeqIdentityInputFiles.parallel.py -> /root/imli/misc/prepare_SeqIdentityInputFiles.parallel.py
├── prepareTaxonomicGroupingFile.py -> /root/imli/misc/prepareTaxonomicGroupingFile.py
└── runIMLI

Now we run IMLI on the prepared input files:

cd runIMLI
# run imli on the prepared input files
nohup ~/miniconda3/envs/pali/bin/python imli.py -m p -name phylumID_log20_sit80 -log 20 -tr "phylum ID" -sit 80,85,90,95,97,99,100 -gt ../../2_trees_subset.rooted.tsv  -tg ../2_taxonomicGroupings.csv -si ../2_sequenceIdentities  -ppi ../2_PPI.csv -thr 75 -hit 80 -al ../../../2_eggnog/2/ >> nohup_phylumID_log20_sit80.out & disown

This produces a results directory, which is then used for downstream analyses with the notebooks in the notebooks/ directory. The results are also available in the Zenodo repository.
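
For example, the notebooks can be opened with JupyterLab from the repository root (a sketch; assumes Jupyter is available in the environment):

conda activate hgt_analyses # assumed environment name from hgt_analyses.yml
jupyter lab notebooks/ # open the downstream-analysis notebooks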

About

Mirror of my GitLab repository at CS@HHU.
