This repo contains code to reproduce the results of the paper "The fate of horizontally acquired genes: rapid initial turnover followed by long-term persistence".
The main script is `imli.py`, which implements an IMplicit phylogenetic Lateral gene transfer Inference (IMLI) method to infer lateral gene transfer events between species that belong to different taxonomic groups but share highly similar sequences (run `python imli.py -h` for usage instructions). All other scripts are in the `misc/` directory. The code was executed on an HPC cluster with 50-128 threads depending on the script, using Python's `multiprocessing` module, so it is not optimized for local execution.
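The screening principle behind IMLI can be illustrated with a toy sketch. This is only an illustration of the idea stated above (high sequence identity across taxonomic groups as a transfer signal), not the actual algorithm in `imli.py`; all names, data, and the threshold here are invented:

```python
# Toy illustration of the IMLI screening idea (NOT the actual algorithm):
# within a gene family, a pair of sequences from different taxonomic groups
# (e.g. phyla) that exceeds an identity threshold is a candidate lateral
# gene transfer.

def candidate_transfers(pairs, group_of, min_identity=95.0):
    """pairs: iterable of (gene_a, gene_b, percent_identity) tuples."""
    hits = []
    for a, b, ident in pairs:
        if ident >= min_identity and group_of[a] != group_of[b]:
            hits.append((a, b, ident))
    return hits

# Hypothetical example data.
group_of = {"g1": "Proteobacteria", "g2": "Firmicutes", "g3": "Firmicutes"}
pairs = [("g1", "g2", 97.2), ("g2", "g3", 98.0), ("g1", "g3", 60.0)]
print(candidate_transfers(pairs, group_of))
# only (g1, g2) is flagged: high identity across different phyla
```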
For the data underlying all analyses performed in the paper, see the Zenodo repository.
Please use the `hgt_analyses.yml` file to create a conda/mamba environment with the required dependencies.
The data (see the Zenodo repository) is organized in nested subdirectories that follow this pipeline:
bash extractSubsetOfDataset.sh # script in the Zenodo data directory to extract a subset from EggNOG v5
# then root the trees with MAD, using the script from misc/ and the downloaded MAD executable
nohup python runMADinParallelOnSplitEggNOGTreesFile.py -i 2_subset_trees.tsv -m ~/bin/mad/mad > mad_rooting_nohup.log & disown
mkdir makeInputFiles # directory to prepare input files for imli using this subset
This results in a directory with files like:
min10TaxaMax2000Genes
├── 2_members_subset.tsv
├── 2_MSA.faa.filepaths_list.txt
├── 2_msaNOGsFound_sorted.txt
├── 2_MSA_subset.faa.filepaths_list.txt
├── 2_NOGIDs_MSAexists.tsv
├── 2_NOGIDs_subset.tsv
├── 2_NOGIDstwocolumned_MSAexists.tsv
├── 2_NOGIDstwocolumned_subset.tsv
├── 2_trees.subset.info
├── 2.trees_subset.nwk
├── 2.trees_subset.nwk.rooted
├── 2_trees_subset.rooted.tsv
├── 2_trees_subset.tsv
├── extractSubsetOfDataset.sh
├── mad_rooting_nohup.log
└── makeInputFiles
cd makeInputFiles # use this subset to prepare input files for imli
# then prepare sequence identity input files for imli
nohup python prepare_SeqIdentityInputFiles.parallel.py -i ../2_MSA_subset.faa.filepaths_list.txt -o ./2_sequenceIdentities -thr 75 -t 80 > 2_nohup_sequenceIdentitiesCalculation.out & disown
# prepare PPI and gene position files based on STRING downloaded data
nohup python preparePPIAndGenePositionsFiles.py -l COG.links.v11.5.txt -map COG.mappings.v11.5.txt -mem ../2_members_subset.tsv -p gene.chromosome.positions_eggNOG.tsv -o 2_PPI.csv > 2_nohup_PPIFilePreparation.out & disown
# and taxonomic grouping file
python prepareTaxonomicGroupingFile.py -i ../2_members_subset.tsv -o 2_taxonomicGroupings.csv
mkdir runIMLI # directory to run imli
This results in the following directory structure (with some symlinks for convenience):
makeInputFiles
├── 2_nohup_PPIFilePreparation.out
├── 2_nohup_sequenceIdentitiesCalculation.out
├── 2_PPI.csv
├── 2_sequenceIdentities
├── 2_sequenceIdentities.log
├── 2_taxonomicGroupings.csv
├── COG.links.v11.5.txt -> ../../../../../STRING_downloads/COG.links.v11.5.txt
├── COG.mappings.v11.5.txt -> ../../../../../STRING_downloads/COG.mappings.v11.5.txt
├── gene.chromosome.positions_eggNOG.tsv -> ../../../../../STRING_downloads/gene.chromosome.positions_eggNOG.tsv
├── gene.chromosome.positions_eggNOG.tsv_subset
├── preparePPIAndGenePositionsFiles.py -> /root/imli/misc/preparePPIAndGenePositionsFiles.py
├── prepare_SeqIdentityInputFiles.parallel.py -> /root/imli/misc/prepare_SeqIdentityInputFiles.parallel.py
├── prepareTaxonomicGroupingFile.py -> /root/imli/misc/prepareTaxonomicGroupingFile.py
└── runIMLI
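The sequence identity step above computes pairwise identities from the MSA files. A minimal sketch of one common way to compute percent identity between two rows of an alignment follows; the exact definition used by `prepare_SeqIdentityInputFiles.parallel.py` (for example, how gap columns are treated) may differ:

```python
# Sketch: percent identity between two aligned sequences, skipping columns
# where either sequence has a gap. The repo script's exact definition may differ.
def percent_identity(a, b):
    assert len(a) == len(b), "sequences must come from the same alignment"
    matches = aligned = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":  # skip columns with a gap in either sequence
            continue
        aligned += 1
        matches += (x == y)
    return 100.0 * matches / aligned if aligned else 0.0

print(percent_identity("MK-TAYIA", "MKQTA-IA"))  # 6 gap-free columns, all identical
```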
Now we run IMLI on the prepared input files:
cd runIMLI
# run imli on the prepared input files
nohup ~/miniconda3/envs/pali/bin/python imli.py -m p -name phylumID_log20_sit80 -log 20 -tr "phylum ID" -sit 80,85,90,95,97,99,100 -gt ../../2_trees_subset.rooted.tsv -tg ../2_taxonomicGroupings.csv -si ../2_sequenceIdentities -ppi ../2_PPI.csv -thr 75 -hit 80 -al ../../../2_eggnog/2/ >> nohup_phylumID_log20_sit80.out & disown
This produces a `results` directory that is used for the downstream analyses in the `notebooks/` directory. The results are also available in the Zenodo repository.