Nerpa 2.0 Manual

  1. About Nerpa
    1.1. Nerpa pipeline
    1.2. Supported data types
  2. Installation
    2.1. Prerequisites
    2.2. Installation from tarball
    2.3. Verifying your installation
  3. Running Nerpa
    3.1. Quick start
    3.2. Command-line options
    3.3. Output files
  4. Citation
  5. Feedback and bug reports

About Nerpa

Nerpa is a tool for linking biosynthetic gene clusters (BGCs) to known nonribosomal peptide (NRP) structures. You can read more about the Nerpa algorithm and the practical applications of the tool in our papers. Nerpa is currently developed and maintained by the Gurevich Lab at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the Center for Bioinformatics Saar (CBI).

This manual will help you install and run the tool. Nerpa version 2.0.0 was released on 19.03.2025. The tool is dual-licensed and is available under GPLv3 or Creative Commons BY-NC-SA 4.0; see LICENSE.txt.

Nerpa pipeline

The simplified Nerpa pipeline is depicted in the figure below.

[Figure: Nerpa pipeline]

Nerpa takes as input an NRP structure database and genome sequences. The pipeline goes as follows:

  1. Construct tentative NRP synthetase assembly lines along with respective sequences of genome-predicted residues (using antiSMASH).
  2. Construct representations of the database structures as monomer graphs (using rBAN).
  3. Build HMMs for genome-predicted NRP synthetase assembly lines as described in the Nerpa 2 paper.
  4. Extract NRP linearizations from the monomer graphs.
  5. Score the NRP linearizations against the HMMs in an all-vs-all manner (using the Viterbi algorithm).
  6. Create an interactive report with the best matches and detailed alignments.
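
Step 5 relies on the Viterbi algorithm. As a rough illustration only, the sketch below runs generic Viterbi decoding in log space on a toy two-state HMM. The state names (`A1`, `A2`), the probabilities, and the function name are made up for this example and do not reflect Nerpa's actual HMM architecture or scoring parameters.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_log_score, best_state_path) for a non-empty observation list."""
    # best[t][s]: best log-probability of any state path ending in s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return best[-1][last], path

def log_table(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {kk: math.log(v) for kk, v in row.items()} for k, row in table.items()}

# Toy model (hypothetical values): two modules emitting monomers
STATES = ("A1", "A2")
LOG_START = {"A1": math.log(0.9), "A2": math.log(0.1)}
LOG_TRANS = log_table({"A1": {"A1": 0.1, "A2": 0.9}, "A2": {"A1": 0.1, "A2": 0.9}})
LOG_EMIT = log_table({"A1": {"Val": 0.8, "Leu": 0.2}, "A2": {"Val": 0.3, "Leu": 0.7}})

score, path = viterbi(["Val", "Leu"], STATES, LOG_START, LOG_TRANS, LOG_EMIT)
print(path, score)
```

In an all-vs-all setting, a routine of this kind would be run once for every (HMM, linearization) pair, and the resulting scores ranked.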

Supported data types

For genome sequences:

  • Recommended: complete antiSMASH output after processing your raw genome sequence (e.g., downloaded from the antiSMASH web server); or antiSMASH job IDs (in this case, Nerpa will download the results automatically).
  • Also accepted: raw genome sequences in the FASTA and GenBank formats; in this case, Nerpa will predict NRP BGCs in them with antiSMASH (which should be installed separately and either present in PATH or provided to Nerpa via --antismash-installation-dir).

For NRP structures:

  • Recommended: isomeric SMILES format; Nerpa distinguishes between L- and D-configurations of amino acids, so the use of the isomeric format leads to more accurate results.
  • Also accepted: any other SMILES, i.e., without stereochemistry information.

Note: you can use free online converters to get (isomeric) SMILES from other popular chemical formats such as MDL MOL or InChI, e.g., this one from UNM. Alternatively, there are many command-line converters, e.g., molconvert, or programming libraries, e.g., RDKit.
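
To check quickly whether your SMILES strings carry any stereochemistry at all, a string-level heuristic may be enough. The sketch below relies only on SMILES syntax (`@` marks chirality; `/` and `\` mark double-bond geometry). The function name is hypothetical, and this is not a SMILES parser or a validity check.

```python
def has_stereo_info(smiles: str) -> bool:
    """Heuristic: True if the SMILES string contains any stereo marker.

    '@' / '@@' annotate chiral centres; '/' and '\\' annotate
    cis/trans geometry around double bonds.
    """
    return any(marker in smiles for marker in ("@", "/", "\\"))

# Alanine with an annotated stereocentre vs. the same molecule without one
print(has_stereo_info("N[C@@H](C)C(=O)O"))
print(has_stereo_info("CC(N)C(=O)O"))
```

A string that passes this check is not guaranteed to annotate every stereocentre, so treat the result as a first filter only.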

Installation

Prerequisites

  • (Required) Nerpa relies on Java (to run the embedded rBAN), Python v3.10 or higher, and a number of Python dependencies specified in the environment.yml file.
    We highly recommend installing Conda to easily set up all dependencies, as demonstrated below.

  • (Optional) If you plan to use Nerpa with raw genome sequences (FASTA or GenBank) rather than antiSMASH-processed files, you will also need to install antiSMASH locally.
    Alternatively, you can use the antiSMASH web server.

  • (Optional) Nerpa is quite fast by default, but we provide an even faster C++ implementation. To use it, you will need a C++20 compiler and CMake v3.10 or higher.

Installation from tarball

First, download and unpack the release tarball:

wget https://github.com/gurevichlab/nerpa/releases/download/nerpa_2.0.0/nerpa-2.0.0.tar.gz
tar -xzf nerpa-2.0.0.tar.gz
cd nerpa-2.0.0

Next, install all required dependencies. We recommend creating and activating a Conda environment:

conda env create -f environment.yml
conda activate nerpa-env

Finally, run

bash install.sh

This will download the PARAS specificity prediction model and compile the C++ part.

Verifying your installation

We recommend adding the nerpa directory to PATH. In this case, you can run Nerpa simply as nerpa.py from anywhere; otherwise, you need to specify the path to nerpa.py relative to the current directory. All running examples below assume that Nerpa is in PATH.

To test your installation, first, try to get the list of the Nerpa command-line options:

nerpa.py -h

Then, try any example from the Quick start section and ensure the log contains no error messages.

If you have any problems, please do not hesitate to contact us.

Running Nerpa

Quick start

Sample test data with three antiSMASH-processed BGCs and three NRP structures in the SMILES format is included in the release tarball.
Alternatively, you can download it from here and unpack it in your current working directory.

To run Nerpa on the test data, execute:

nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv

The output will be saved in the nerpa_results/{CURRENT_TIME} directory and symlinked to nerpa_results/latest for your convenience.
For details on the output directory contents and their interpretation, refer to the Output files section.

Command-line options

To see the full list of available options, type

nerpa.py -h

All options are divided into four categories. The most important options in each category are listed below.

Genomic input (genome sequences or BGCs)

The most convenient way to obtain antiSMASH predictions of BGCs in your genomic data is to upload your
FASTA or GenBank file to the antiSMASH web server.
Once the server job is completed, download the results (Download -> Download all results), unpack the archive,
and provide the path to the unpacked directory using the -a option.

Alternatively, you can provide Nerpa with the server job ID (e.g., bacteria-2a9bb79e-e804-42c9-bb62-516cac47eca2)
via the --antismash-job-ids option, and Nerpa will download everything automatically.
For multiple jobs, specify them as a space-separated list of IDs.

You can also use the command-line version of antiSMASH.
Nerpa has been tested with outputs from antiSMASH version 7 (7.0.0 and 7.1.0).
If antiSMASH is installed on your system, you can provide raw genome sequences in FASTA or GenBank format via the --genome option,
and Nerpa will run antiSMASH automatically.
To enable this, antiSMASH should be available in your system’s PATH variable, or the path to its installation directory should be specified via the --antismash-installation-path option.

Note that you can specify an unlimited number of antiSMASH outputs in any of the following ways:

  • Using the -a option multiple times.
  • Specifying a root directory containing many inputs.
  • Writing paths to all antiSMASH outputs in a single text file and providing it via the --antismash-paths-file option.
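
For the last option, the list file can be generated with a short script. The sketch below assumes that the file holds one path per line and that every immediate subdirectory of a root folder is an unpacked antiSMASH result; both are assumptions for illustration, and the function name is hypothetical.

```python
from pathlib import Path

def write_antismash_paths_file(root: Path, list_file: Path) -> list[Path]:
    """Write the absolute path of each immediate subdirectory of `root`
    to `list_file`, one per line (assumed format), and return the paths."""
    dirs = sorted(p.resolve() for p in root.iterdir() if p.is_dir())
    list_file.write_text("".join(f"{d}\n" for d in dirs), encoding="utf-8")
    return dirs

# Example (hypothetical locations):
# write_antismash_paths_file(Path("antismash_runs"), Path("antismash_paths.txt"))
```

The resulting file could then be passed via --antismash-paths-file.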

Chemical input (compounds)

NRP molecules should be specified in the SMILES format.
You can provide them in one of the following ways:

  • As a space-separated list of SMILES strings using the --smiles option.
  • In a multi-column file specified via the --smiles-tsv option.

In the latter case, the default column separator (\t), the name of the SMILES column (SMILES), and the column with molecule IDs (row index) can be adjusted using the --sep, --col-smiles, and --col-id options, respectively.
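
As an illustration, a file suitable for --smiles-tsv could be produced as follows. The ID column name (`ID`), the compound identifier, and the function name are hypothetical; the SMILES column name and the tab separator match the defaults described above.

```python
import csv

def write_smiles_tsv(path, compounds, id_col="ID", smiles_col="SMILES", sep="\t"):
    """Write (compound_id, smiles) pairs to a delimited file with a header row."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter=sep)
        writer.writerow([id_col, smiles_col])
        for compound_id, smiles in compounds:
            writer.writerow([compound_id, smiles])

# Example (hypothetical compound ID):
# write_smiles_tsv("my_smiles.tsv", [("cpd_1", "N[C@@H](C)C(=O)O")])
```

With this layout, you would presumably add `--col-id ID` to the command line so that Nerpa uses the `ID` column instead of the row index.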

The Nerpa release package comes with a set of NRP databases in the SMILES format.

Advanced input

You can reuse preprocessed BGCs and/or chemical structures from a previous Nerpa run.
This is useful, for example, if you want to screen the same BGCs against different NRPs, or vice versa.

The preprocessed files are stored in the Nerpa output directory in:

  • BGC_variants/ (for BGCs).
  • NRP_variants/ (for NRPs).

To reuse them, provide the corresponding paths via the --bgc-variants and --nrp-variants options.

Pipeline options

  • --output_dir <DIR>, -o <DIR>
    Path to the output directory.
    If the directory already exists, Nerpa will exit with an error unless --force-output-dir is specified.
    If not set, Nerpa will create the directory nerpa_results/{CURRENT_TIME} and symlink it to nerpa_results/latest.

  • --process-hybrids
    Process NRP-polyketide hybrid monomers (requires rBAN to be used). Recommended.

  • --threads
    Number of threads for running Nerpa. Default: 1.

  • --skip-molecule-drawing
    Disable drawing of NRP compounds (they will not appear in the HTML report). Enabling this option speeds up the run and reduces output size.

  • --fast-matching
    Enable the fast C++-based matching (requires pre-compilation; see the Installation section).
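
If you launch Nerpa from a script, the options above can be assembled programmatically. The helper below is a sketch: its name and keyword arguments are hypothetical, it uses only flags listed in this manual, and it assumes that --threads takes an integer value.

```python
import shlex

def build_nerpa_command(antismash_dir, smiles_tsv, output_dir=None, threads=1,
                        process_hybrids=True, skip_drawing=False,
                        fast_matching=False):
    """Return the argument list for a nerpa.py call (nerpa.py assumed to be in PATH)."""
    cmd = ["nerpa.py", "-a", str(antismash_dir), "--smiles-tsv", str(smiles_tsv)]
    if output_dir is not None:
        cmd += ["-o", str(output_dir)]       # short form of the output directory option
    cmd += ["--threads", str(threads)]
    if process_hybrids:
        cmd.append("--process-hybrids")
    if skip_drawing:
        cmd.append("--skip-molecule-drawing")
    if fast_matching:
        cmd.append("--fast-matching")        # requires the compiled C++ part
    return cmd

cmd = build_nerpa_command("test_data/antismash", "test_data/smiles.tsv", threads=4)
print(shlex.join(cmd))
# To run it: subprocess.run(cmd, check=True)
```

Building the command as a list avoids shell-quoting problems when paths contain spaces.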

Output files

The key files and directories inside the Nerpa output directory (--output-dir) are:

  • report.html
    An interactive HTML report showing the best Nerpa matches, along with the corresponding annotated BGCs and NRPs.

  • report.tsv
    A tab-separated file containing matched NRP-BGC pairs with their corresponding scores.

  • BGC_variants/
    Directory containing preprocessed antiSMASH outputs. These can be reused for another run via the --bgc-variants option.

  • NRP_variants/
    Directory containing preprocessed compounds. These can be reused for another run via the --nrp-variants option.
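
For downstream analysis, report.tsv can be loaded with standard tools. The sketch below assumes only that the first row is a header; the manual does not list the column names, so the code makes no assumptions about them. The function name is hypothetical.

```python
import csv
from pathlib import Path

def load_report(output_dir="nerpa_results/latest"):
    """Read report.tsv from a Nerpa output directory into a list of dicts,
    one per row, keyed by whatever column names the header provides."""
    report_path = Path(output_dir) / "report.tsv"
    with report_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))

# Example:
# rows = load_report()
# print(len(rows), "matches; columns:", list(rows[0].keys()) if rows else [])
```

Inspecting the keys of the first row is a simple way to discover which columns your Nerpa version writes.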

Citation

If you use Nerpa in your research, please cite our papers:
Nerpa v.2 is described in Olkhovskii et al., bioRxiv 2024.
Nerpa v.1 is published in Kunyavskaya, Tagirdzhanov et al., Metabolites 2021.

Feedback and bug reports

You can leave your comments and bug reports at https://github.com/gurevichlab/nerpa/issues (recommended way) or send them via e-mail to alexey.gurevich@helmholtz-hips.de.

Your comments, bug reports, and suggestions are very welcome. They will help us to improve Nerpa further. In particular, we would love to hear your thoughts on desired features of the future Nerpa web service.

If you have any trouble running Nerpa, please attach nerpa.log from the output directory.
