Nerpa 2.0 Manual

About Nerpa
1.1. Nerpa pipeline
1.2. Supported data types
Installation
2.1. Prerequisites
2.2. Installation from tarball
2.3. Verifying your installation
Running Nerpa
3.1. Quick start
3.2. Command-line options
3.3. Output files
Citation
Feedback and bug reports

About Nerpa

Nerpa is a tool for linking biosynthetic gene clusters (BGCs) to known nonribosomal peptide (NRP) structures. You can read more about the Nerpa algorithm and the practical applications of the tool in our papers. Nerpa is currently developed and maintained by the Gurevich Lab at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the Center for Bioinformatics Saar (CBI).

This manual will help you install and run the tool. Nerpa version 2.0.0 was released on 19.03.2025. The tool is dual-licensed and is available under GPLv3 or Creative Commons BY-NC-SA 4.0; see LICENSE.txt.

Nerpa pipeline

The simplified Nerpa pipeline is depicted in the figure below.

Nerpa takes as input an NRP structure database and genome sequences. The pipeline goes as follows:

Construct tentative NRP synthetase assembly lines along with respective sequences of genome-predicted residues (using antiSMASH).
Construct representations of the database structures as monomer graphs (using rBAN).
Build HMMs for genome-predicted NRP synthetase assembly lines as described in the Nerpa 2 paper.
Extract NRP linearizations from the monomer graphs.
Score the NRP linearizations against the HMMs in an all-vs-all manner (using the Viterbi algorithm).
Create an interactive report with the best matches and detailed alignments.

Supported data types

For genome sequences:

Recommended: complete antiSMASH output after processing your raw genome sequence (e.g., downloaded from the antiSMASH web server); or antiSMASH job IDs (in this case, Nerpa will download the results automatically).
Also accepted: raw genome sequences in the FASTA and GenBank formats; in this case, Nerpa will predict NRP BGCs in them with antiSMASH (should be installed separately and present in PATH or provided to Nerpa via --antismash-installation-dir).

For NRP structures:

Recommended: isomeric SMILES format; Nerpa distinguishes between L- and D-configurations of amino acids, so the use of the isomeric format leads to more accurate results.
Also accepted: any other SMILES, i.e., without stereochemistry information.

Note: you can use free online converters to get (isomeric) SMILES from other popular chemical formats such as MDL MOL or InChI, e.g., this one from UNM. Alternatively, there are many command-line converters, e.g., molconvert, and programming libraries, e.g., RDKit.
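
A quick way to see whether a SMILES string carries any stereochemistry is to look for the chirality marks (@ and @@) and the double-bond direction marks (/ and \). The shell sketch below does only that. The function name has_stereo is chosen here for illustration and is not part of Nerpa, and the check is a heuristic: it cannot tell whether every stereocentre in the molecule is annotated.

```shell
# has_stereo: succeed if the SMILES string contains any stereo mark (@, / or \).
# Hypothetical helper for illustration; not part of Nerpa.
has_stereo() {
    case "$1" in
        *@* | */* | *\\*) return 0 ;;
        *) return 1 ;;
    esac
}

# L-alanine with and without its stereo annotation.
for smiles in 'N[C@@H](C)C(=O)O' 'NC(C)C(=O)O'; do
    if has_stereo "$smiles"; then
        echo "isomeric:     $smiles"
    else
        echo "non-isomeric: $smiles"
    fi
done
```

Strings reported as non-isomeric are still accepted by Nerpa, but, as noted above, the L-/D-configuration information is then unavailable.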

Installation

Prerequisites

(Required) Nerpa relies on Java (to run the embedded rBAN), Python v3.10 or higher, and a number of Python dependencies specified in the environment.yml file.
We highly recommend installing Conda to easily set up all dependencies, as demonstrated below.
(Optional) If you plan to use Nerpa with raw genome sequences (FASTA or GenBank) rather than antiSMASH-processed files, you will also need to install antiSMASH locally.
Alternatively, you can use the antiSMASH web server.
(Optional) Nerpa is quite fast by default, but we provide an even faster C++ implementation. To use it, you will need a C++20 compiler and CMake v3.10 or higher.

Installation from tarball

First, download and unpack the release tarball:

wget https://github.com/gurevichlab/nerpa/releases/download/nerpa_2.0.0/nerpa-2.0.0.tar.gz
tar -xzf nerpa-2.0.0.tar.gz
cd nerpa-2.0.0

Next, install all required dependencies. We recommend creating and activating a Conda environment:

conda env create -f environment.yml
conda activate nerpa-env

Finally, run

bash install.sh

This will download the PARAS specificity prediction model and compile the C++ part.

Verifying your installation

We recommend adding the Nerpa directory to PATH. In this case, you can run Nerpa simply as nerpa.py from anywhere; otherwise, you need to specify the path from the current directory to nerpa.py. All running examples below assume that Nerpa is in PATH.
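
For example, assuming you are still inside the unpacked nerpa-2.0.0 directory, the following adds it to PATH for the current shell session. The variable name NERPA_DIR is chosen here for illustration; to make the change permanent, put the export line into your shell start-up file.

```shell
# Assumption: the current directory is the unpacked Nerpa directory containing nerpa.py.
NERPA_DIR="$PWD"
export PATH="$NERPA_DIR:$PATH"

# Report where the shell now finds nerpa.py (or warn if it does not).
command -v nerpa.py || echo "nerpa.py not found in PATH; check NERPA_DIR" >&2
```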

To test your installation, first, try to get the list of the Nerpa command-line options:

nerpa.py -h

Then, try any example from the Quick start section and ensure the log contains no error messages.

If you have any problems, please do not hesitate to contact us.

Running Nerpa

Quick start

Sample test data with three antiSMASH-processed BGCs and three NRP structures in the SMILES format is included in the release tarball.
Alternatively, you can download it from here and unpack it in your current working directory.

To run Nerpa on the test data, execute:

nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv

The output will be saved in the nerpa_results/{CURRENT_TIME} directory and symlinked to nerpa_results/latest for your convenience.
For details on the output directory contents and their interpretation, refer to the Output files section.

Command-line options

To see the full list of available options, type

nerpa.py -h

All options are divided into four categories. The most important options in each category are listed below.

Genomic input (genome sequences or BGCs)

The most convenient way to obtain antiSMASH predictions of BGCs in your genomic data is to upload your
FASTA or GenBank file to their web server.
Once the server job is completed, download the results (Download -> Download all results), unpack the archive,
and provide the path to the unpacked directory using the -a option.

Alternatively, you can provide Nerpa with the server job ID (e.g., bacteria-2a9bb79e-e804-42c9-bb62-516cac47eca2)
via the --antismash-job-ids option, and Nerpa will download everything automatically.
For multiple jobs, specify them as a space-separated list of IDs.

You can also use the command-line version of antiSMASH.
Nerpa has been tested with outputs from antiSMASH version 7 (7.0.0 and 7.1.0).
If antiSMASH is installed on your system, you can provide raw genome sequences in FASTA or GenBank format via the --genome option,
and Nerpa will run antiSMASH automatically.
To enable this, antiSMASH should be available in your system’s PATH variable, or the path to its installation directory should be specified via the --antismash-installation-path option.

Note that you can specify an unlimited number of antiSMASH output files by either:

Using the -a option multiple times.
Specifying a root directory containing many inputs.
Writing paths to all antiSMASH outputs in a single text file and providing it via the --antismash-paths-file option.
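
A paths file for the last option can be generated with a short loop. The sketch below assumes a hypothetical directory antismash_runs/ with one unpacked antiSMASH result per subdirectory, and it assumes that the file lists one path per line (the exact format is not specified above). The file name antismash_paths.txt is arbitrary.

```shell
# Hypothetical layout: antismash_runs/<genome>/ holds one unpacked antiSMASH result each.
# The two directories created here are empty stand-ins so the sketch runs as-is.
mkdir -p antismash_runs/genome_A antismash_runs/genome_B

# Write one path per line (assumed format of the paths file).
for dir in antismash_runs/*/; do
    printf '%s\n' "${dir%/}"
done > antismash_paths.txt

cat antismash_paths.txt

# With real antiSMASH outputs in place, the run would then look like:
#   nerpa.py --antismash-paths-file antismash_paths.txt --smiles-tsv test_data/smiles.tsv
```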

Chemical input (compounds)

NRP molecules should be specified in the SMILES format.
You can provide them in one of the following ways:

As a space-separated list of SMILES strings using the --smiles option.
In a multi-column file specified via the --smiles-tsv option.

In the latter case, the column separator (default: \t), the name of the SMILES column (default: SMILES), and the column with molecule IDs (default: row index) can be adjusted using the --sep, --col-smiles, and --col-id options, respectively.
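
As an illustration, the sketch below writes a small comma-separated file with non-default column names and shows the matching Nerpa call. The file name, the column names compound_id and structure, and the two toy molecules (plain amino acids, not real NRPs) are made up for the example. It also assumes that --sep accepts the literal separator character.

```shell
# Toy comma-separated input with custom column names (all names are illustrative).
cat > my_compounds.csv <<'EOF'
compound_id,structure
toy_L_alanine,N[C@@H](C)C(=O)O
toy_glycine,NCC(=O)O
EOF

# Show the header to confirm which column holds the SMILES strings.
head -n 1 my_compounds.csv

# Matching Nerpa call (assumes --sep takes the literal separator character):
#   nerpa.py -a test_data/antismash --smiles-tsv my_compounds.csv \
#       --sep , --col-smiles structure --col-id compound_id
```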

The Nerpa release package comes with a set of NRP databases in the SMILES format:

Compounds from MIBiG 4.0 and Norine, available in data/mibig_norine.tsv.
Our own database of putative NRP structures, pNRPdb, available in data/pnrpdb2rc1_summary.tsv.

Advanced input

You can reuse preprocessed BGCs and/or chemical structures from a previous Nerpa run.
This is useful, for example, if you want to screen the same BGCs against different NRPs, or vice versa.

The preprocessed files are stored in the Nerpa output directory in:

BGC_variants/ (for BGCs).
NRP_variants/ (for NRPs).

To reuse them, provide the corresponding paths via the --bgc-variants and --nrp-variants options.
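
For instance, to screen the BGCs of the most recent run against the bundled MIBiG/Norine database without preprocessing them again, a call along the following lines should work when run from the Nerpa directory. The output directory name is arbitrary. The sketch assumes that the previous run is still reachable via nerpa_results/latest and that data/mibig_norine.tsv uses the default column names.

```shell
# Preprocessed BGCs from the previous run (directory layout as described above).
PREV_BGCS="nerpa_results/latest/BGC_variants"

if [ -d "$PREV_BGCS" ]; then
    # Reuse the BGCs; only the compounds are processed from scratch.
    nerpa.py --bgc-variants "$PREV_BGCS" \
             --smiles-tsv data/mibig_norine.tsv \
             -o rescreen_mibig_norine
else
    echo "No previous run found at $PREV_BGCS; run Nerpa on your genomes first." >&2
fi
```

The reverse case (same compounds, new genomes) is symmetric: pass the NRP_variants/ directory via --nrp-variants and supply the new genomic input as usual.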

Pipeline options

--output_dir <DIR>, -o <DIR>
Path to the output directory.
If the directory already exists, Nerpa will exit with an error unless --force-output-dir is specified.
If not set, Nerpa will create the directory nerpa_results/{CURRENT_TIME} and symlink it to nerpa_results/latest.
--process-hybrids
Process NRP-polyketide hybrid monomers (requires rBAN to be used). Recommended.
--threads
Number of threads for running Nerpa. Default: 1.
--skip-molecule-drawing
Disable drawing of NRP compounds (they will not appear in the HTML report). Enabling this option speeds up the run and reduces output size.
--fast-matching
Enable the fast C++-based matching (requires pre-compilation; see the Installation section).
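
Putting several of these options together, a run on the test data that uses all available cores, processes hybrid monomers, and skips molecule drawing might look as follows. The output directory name my_nerpa_run is a placeholder, and getconf is used here only to look up the core count.

```shell
# Number of online CPU cores; fall back to 1 if getconf cannot report it.
THREADS=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "Using $THREADS thread(s)"

# Run only if Nerpa is available in PATH.
if command -v nerpa.py >/dev/null 2>&1; then
    nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv \
             --threads "$THREADS" --process-hybrids --skip-molecule-drawing \
             -o my_nerpa_run
fi
```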

Output files

The key files and directories inside the Nerpa output directory (--output-dir) are:

report.html
An interactive HTML report showing the best Nerpa matches, along with the corresponding annotated BGCs and NRPs.
report.tsv
A tab-separated file containing matched NRP-BGC pairs with their corresponding scores.
BGC_variants/
Directory containing preprocessed antiSMASH outputs. These can be reused for another run via the --bgc-variants option.
NRP_variants/
Directory containing preprocessed compounds. These can be reused for another run via the --nrp-variants option.
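
Since report.tsv is plain tab-separated text, it is easy to inspect from the shell. The helper below (list_columns, a name chosen here) prints the numbered column headers of any TSV file, which is a convenient first step before filtering or sorting with cut, sort, or awk. It makes no assumption about the actual column names in the report.

```shell
# list_columns: print the header fields of a tab-separated file, one per line, numbered.
list_columns() {
    head -n 1 "$1" | tr '\t' '\n' | awk '{ print NR ": " $0 }'
}

REPORT="nerpa_results/latest/report.tsv"
if [ -f "$REPORT" ]; then
    list_columns "$REPORT"
else
    echo "No report found at $REPORT" >&2
fi
```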

Citation

If you use Nerpa in your research, please cite our papers:
Nerpa v.2 is described in Olkhovskii et al., bioRxiv 2024.
Nerpa v.1 is published in Kunyavskaya, Tagirdzhanov et al., Metabolites 2021.

Feedback and bug reports

You can leave your comments and bug reports at https://github.com/gurevichlab/nerpa/issues (recommended) or send them by e-mail to alexey.gurevich@helmholtz-hips.de.

Your comments, bug reports, and suggestions are very welcome. They will help us improve Nerpa further. In particular, we would love to hear your thoughts on desired features of the future Nerpa web service.

If you have any trouble running Nerpa, please attach nerpa.log from the output directory to your report.
