- About Nerpa
1.1 Nerpa pipeline
1.2 Supported data types - Installation
2.1. Prerequisites
2.2. Installation from tarball
2.3. Verifying your installation - Running Nerpa
3.1. Quick start
3.2. Command-line options
3.3. Output files - Citation
- Feedback and bug reports
Nerpa is a tool for linking biosynthetic gene clusters (BGCs) to known nonribosomal peptide (NRP) structures. You can read more about the Nerpa algorithm and the practical applications of the tool in our papers. Nerpa is currently developed and maintained by Gurevich Lab at the Helmholtz Institute for Pharmaceutical Research Saarland (HIPS) and the Center for Bioinformatics Saar (CBI).
This manual will help you to install and run the tool. Nerpa version 2.0.0 was released on 19.03.2025. The tool is dual-licensed and is available under GPLv3 or Creative Commons BY-NC-SA 4.0, see LICENSE.txt.
The simplified Nerpa pipeline is depicted in the figure below.
Nerpa takes as input an NRP structure database and genome sequences. The pipeline goes as follows:
- Construct tentative NRP synthetase assembly lines along with respective sequences of genome-predicted residues (using antiSMASH).
- Construct representations of the database structures as monomer graphs (using rBAN).
- Build HMMs for genome-predicted NRP synthetase assembly lines as described in the Nerpa 2 paper.
- Extract NRP linearizations from the monomer graphs.
- Score the NRP linearizations against the HMMs all-vs-all manner (using the Viterbi algorithm).
- Create an interactive report with the best matches and detailed alignments.
For genome sequences:
- Recommended: complete antiSMASH output after processing your raw genome sequence (e.g., downloaded from the antiSMASH web server); or antiSMASH job IDs (in this case, Nerpa will download it automatically).
- Also accepted: raw genome sequences in the FASTA and GenBank formats;
in this case, Nerpa will predict NRP BGCs in them with antiSMASH
(should be installed separately and present in
PATH
or provided to Nerpa via--antismash-installation-dir
).
For NRP structures:
- Recommended: isomeric SMILES format; Nerpa distinguishes between L- and D-configurations of amino acids, so the use of the isomeric format leads to more accurate results.
- Also accepted: any other SMILES, i.e., without stereochemistry information.
Note: you can use free online converters to get (isomeric) SMILES from other popular chemical formats such as MDL MOL or InChI, e.g., this one from UNM. Alternatively, there are many command-line convertors, e.g. molconvert, or programming libraries, e.g. RDKit.
-
(Required) Nerpa relies on Java (to run the embedded rBAN), Python v3.10 or higher, and a number of Python dependencies specified in the environment.yml file.
We highly recommend installing Conda to easily set up all dependencies, as demonstrated below. -
(Optional) If you plan to use Nerpa with raw genome sequences (FASTA or GenBank) rather than antiSMASH-processed files, you will also need to install antiSMASH locally.
Alternatively, you can use the antiSMASH web server. -
(Optional) Nerpa is quite fast by default, but we provide an even faster C++ implementation. To use it, you will need a C++20 compiler and CMake v3.10 or higher.
First, download and unpack the release tarball:
wget https://github.com/gurevichlab/nerpa/releases/download/nerpa_2.0.0/nerpa-2.0.0.tar.gz
tar -xzf nerpa-2.0.0.tar.gz
cd nerpa-2.0.0
Next, install all required dependencies. We recommend creating and activating a Conda environment:
conda env create -f environment.yml
conda activate nerpa-env
Finally, run
bash install.sh
This will download PARAS specificity prediction model and compile the C++ part.
We recommend adding the nerpa
directory to PATH
. In this case, you can run Nerpa simply as nerpa.py
from anywhere;
otherwise, you would need to specify path from the current directory to ./nerpa.py
.
All running examples below assume that Nerpa is in PATH
.
To test your installation, first, try to get the list of the Nerpa command-line options:
nerpa.py -h
Then, try any example from the Quick start section and ensure the log contains no error messages.
If you have any problems, please do not hesitate to contact us.
Sample test data with three antiSMASH-processed BGCs and three NRP structures in the SMILES format is included in the release tarball.
Alternatively, you can download it from here and unpack it in your current working directory.
To run Nerpa on the test data, execute:
nerpa.py -a test_data/antismash --smiles-tsv test_data/smiles.tsv
The output will be saved in the nerpa_results/{CURRENT_TIME}
directory and symlinked to nerpa_results/latest
for your convenience.
For details on the output directory contents and their interpretation refer to the corresponding section.
To see the full list of available options, type
nerpa.py -h
All options are divided into four categories. The most important options in each category are listed below.
The most convenient way to obtain antiSMASH predictions of BGCs in your genomic data is to upload your
FASTA or GenBank file to their web server.
Once the server job is completed, download the results (Download -> Download all results
), unpack the archive,
and provide the path to the unpacked directory using the -a
option.
Alternatively, you can provide Nerpa with the server job ID (e.g., bacteria-2a9bb79e-e804-42c9-bb62-516cac47eca2
)
via the --antismash-job-ids
option, and Nerpa will download everything automatically.
For multiple jobs, specify them as a space-separated list of IDs.
You can also use the command-line version of antiSMASH.
Nerpa has been tested with outputs from antiSMASH version 7 (7.0.0 and 7.1.0).
If antiSMASH is installed on your system, you can provide raw genome sequences in FASTA or GenBank format via the --genome
option,
and Nerpa will run antiSMASH automatically.
To enable this, antiSMASH should be available in your system’s PATH
variable, or the path to its installation directory should be specified via the --antismash-installation-path
option.
Note that you can specify an unlimited number of antiSMASH output files by either:
- Using the
-a
option multiple times. - Specifying a root directory containing many inputs.
- Writing paths to all antiSMASH outputs in a single text file and providing it via the
--antismash-paths-file
option.
NRP molecules should be specified in the SMILES format.
You can provide them in one of the following ways:
- As a space-separated list of SMILES strings using the
--smiles
option. - In a multi-column file specified via the
--smiles-tsv
option.
In the latter case, the default column separator (\t
), names of the SMILES column (SMILES
) and the column with molecule IDs
(row index) could be adjusted using the --sep
, --col-smiles
, and --col-id
options, respectively.
The Nerpa release package comes with a set of NRP databases in the SMILES format:
- Compounds from MIBiG 4.0 and Norine, available in data/mibig_norine.tsv.
- Our own database of putative NRP structures, pNRPdb, available in data/pnrpdb2rc1_summary.tsv.
You can reuse preprocessed BGCs and/or chemical structures from a previous Nerpa run.
This is useful, for example, if you want to screen the same BGCs against different NRPs, or vice versa.
The preprocessed files are stored in the Nerpa output directory in:
BGC_variants/
(for BGCs).NRP_variants/
(for NRPs).
To reuse them, provide the corresponding paths via the --bgc-variants
and --nrp-variants
options.
-
--output_dir <DIR>, -o <DIR>
Path to the output directory.
If the directory already exists, Nerpa will exit with an error unless--force-output-dir
is specified.
If not set, Nerpa will create the directorynerpa_results/{CURRENT_TIME}
and symlink it tonerpa_results/latest
. -
--process-hybrids
Process NRP-polyketide hybrid monomers (requires rBAN to be used). Recommended. -
--threads
Number of threads for running Nerpa. Default:1
. -
--skip-molecule-drawing
Disable drawing of NRP compounds (they will not appear in the HTML report). Enabling this option speeds up the run and reduces output size. -
--fast-matching
Enable the fast C++-based matching (requires pre-compilation; see the Installation section).
The key files and directories inside the Nerpa output directory (--output-dir
) are:
-
report.html
An interactive HTML report showing the best Nerpa matches, along with the corresponding annotated BGCs and NRPs. -
report.tsv
A tab-separated file containing matched NRP-BGC pairs with their corresponding scores. -
BGC_variants/
Directory containing preprocessed antiSMASH outputs. These can be reused for another run via the--bgc-variants
option. -
NRP_variants/
Directory containing preprocessed compounds. These can be reused for another run via the--nrp-variants
option.
If you use Nerpa in your research, please cite our papers:
Nerpa v.2 is described in Olkhovskii et al, bioRxiv 2024.
Nerpa v.1 is published in Kunyavskaya, Tagirdzhanov et al., Metabolites 2021.
You can leave your comments and bug reports at https://github.com/gurevichlab/nerpa/issues (recommended way) or sent it via e-mail to alexey.gurevich@helmholtz-hips.de.
Your comments, bug reports, and suggestions are very welcomed. They will help us to improve Nerpa further. In particular, we would love to hear your thought on desired features of the future Nerpa web service.
If you have any troubles running Nerpa, please attach nerpa.log
from the output directory.