This repo contains code to reproduce the results of the paper "The fate of horizontally acquired genes: rapid initial turnover followed by long-term persistence".
The main script is `imli.py`, which implements an IMplicit phylogenetic Lateral gene transfer Inference (IMLI) method to infer lateral gene transfer events between species that belong to different taxonomic groups but share highly similar sequences (run `python imli.py -h` for usage instructions). All other scripts are in the `misc/` directory. The code was executed on an HPC cluster with 50-128 threads depending on the script, using Python's `multiprocessing` module, so it is not optimized for local execution.
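The screening principle behind IMLI can be illustrated with a toy sketch. This is only an illustration of the idea stated above (high sequence identity across taxonomic groups as a transfer signal), not the actual algorithm in `imli.py`; all names, data, and the threshold here are invented:

```python
# Toy illustration of the IMLI screening idea (NOT the actual algorithm):
# within a gene family, a pair of sequences from different taxonomic groups
# (e.g. phyla) that exceeds an identity threshold is a candidate lateral
# gene transfer.

def candidate_transfers(pairs, group_of, min_identity=95.0):
    """pairs: iterable of (gene_a, gene_b, percent_identity) tuples."""
    hits = []
    for a, b, ident in pairs:
        if ident >= min_identity and group_of[a] != group_of[b]:
            hits.append((a, b, ident))
    return hits

# Hypothetical example data.
group_of = {"g1": "Proteobacteria", "g2": "Firmicutes", "g3": "Firmicutes"}
pairs = [("g1", "g2", 97.2), ("g2", "g3", 98.0), ("g1", "g3", 60.0)]
print(candidate_transfers(pairs, group_of))
# only (g1, g2) is flagged: high identity across different phyla
```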
For the data underlying all analyses performed in the paper, see the Zenodo repository.
Please use the `hgt_analyses.yml` file to create a conda/mamba environment with the required dependencies.
The data (see the Zenodo repository) is organized in nested subdirectories that follow this pipeline:
bash extractSubsetOfDataset.sh # script in the Zenodo data directory to extract a subset from EggNOG v5
# then root the trees with MAD, using the script from misc/ and the downloaded MAD executable
nohup python runMADinParallelOnSplitEggNOGTreesFile.py -i 2_subset_trees.tsv -m ~/bin/mad/mad > mad_rooting_nohup.log & disown
mkdir makeInputFiles # directory to prepare input files for imli using this subset
This results in a directory with files like:
min10TaxaMax2000Genes
├── 2_members_subset.tsv
├── 2_MSA.faa.filepaths_list.txt
├── 2_msaNOGsFound_sorted.txt
├── 2_MSA_subset.faa.filepaths_list.txt
├── 2_NOGIDs_MSAexists.tsv
├── 2_NOGIDs_subset.tsv
├── 2_NOGIDstwocolumned_MSAexists.tsv
├── 2_NOGIDstwocolumned_subset.tsv
├── 2_trees.subset.info
├── 2.trees_subset.nwk
├── 2.trees_subset.nwk.rooted
├── 2_trees_subset.rooted.tsv
├── 2_trees_subset.tsv
├── extractSubsetOfDataset.sh
├── mad_rooting_nohup.log
└── makeInputFiles
cd makeInputFiles # use this subset to prepare input files for imli
# then prepare sequence identity input files for imli
nohup python prepare_SeqIdentityInputFiles.parallel.py -i ../2_MSA_subset.faa.filepaths_list.txt -o ./2_sequenceIdentities -thr 75 -t 80 > 2_nohup_sequenceIdentitiesCalculation.out & disown
# prepare PPI and gene position files based on STRING downloaded data
nohup python preparePPIAndGenePositionsFiles.py -l COG.links.v11.5.txt -map COG.mappings.v11.5.txt -mem ../2_members_subset.tsv -p gene.chromosome.positions_eggNOG.tsv -o 2_PPI.csv > 2_nohup_PPIFilePreparation.out & disown
# and taxonomic grouping file
python prepareTaxonomicGroupingFile.py -i ../2_members_subset.tsv -o 2_taxonomicGroupings.csv
mkdir runIMLI # directory to run imli
This results in the following directory structure (with some symlinks for convenience):
makeInputFiles
├── 2_nohup_PPIFilePreparation.out
├── 2_nohup_sequenceIdentitiesCalculation.out
├── 2_PPI.csv
├── 2_sequenceIdentities
├── 2_sequenceIdentities.log
├── 2_taxonomicGroupings.csv
├── COG.links.v11.5.txt -> ../../../../../STRING_downloads/COG.links.v11.5.txt
├── COG.mappings.v11.5.txt -> ../../../../../STRING_downloads/COG.mappings.v11.5.txt
├── gene.chromosome.positions_eggNOG.tsv -> ../../../../../STRING_downloads/gene.chromosome.positions_eggNOG.tsv
├── gene.chromosome.positions_eggNOG.tsv_subset
├── preparePPIAndGenePositionsFiles.py -> /root/imli/misc/preparePPIAndGenePositionsFiles.py
├── prepare_SeqIdentityInputFiles.parallel.py -> /root/imli/misc/prepare_SeqIdentityInputFiles.parallel.py
├── prepareTaxonomicGroupingFile.py -> /root/imli/misc/prepareTaxonomicGroupingFile.py
└── runIMLI
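The sequence identity step above computes pairwise identities from the MSA files. A minimal sketch of one common way to compute percent identity between two rows of an alignment follows; the exact definition used by `prepare_SeqIdentityInputFiles.parallel.py` (for example, how gap columns are treated) may differ:

```python
# Sketch: percent identity between two aligned sequences, skipping columns
# where either sequence has a gap. The repo script's exact definition may differ.
def percent_identity(a, b):
    assert len(a) == len(b), "sequences must come from the same alignment"
    matches = aligned = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":  # skip columns with a gap in either sequence
            continue
        aligned += 1
        matches += (x == y)
    return 100.0 * matches / aligned if aligned else 0.0

print(percent_identity("MK-TAYIA", "MKQTA-IA"))  # 6 gap-free columns, all identical
```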
Now we run IMLI on the prepared input files:
cd runIMLI
# run imli on the prepared input files
nohup ~/miniconda3/envs/pali/bin/python imli.py -m p -name phylumID_log20_sit80 -log 20 -tr "phylum ID" -sit 80,85,90,95,97,99,100 -gt ../../2_trees_subset.rooted.tsv -tg ../2_taxonomicGroupings.csv -si ../2_sequenceIdentities -ppi ../2_PPI.csv -thr 75 -hit 80 -al ../../../2_eggnog/2/ >> nohup_phylumID_log20_sit80.out & disown
This produces a `results` directory that is used for the downstream analyses in the `notebooks/` directory. The results are also available in the Zenodo repository.