Myrio

Myrio is a command-line application designed to identify the taxonomy of plants, and potentially other organisms, from amplified DNA barcode sequences.

It is tailored to handle mixed reads specifically, wherein multiple barcode regions (e.g. matK, rbcL, ITS, trnH-psbA, etc.) are amplified and sequenced together.

In addition to taxonomic identification, Myrio also estimates whether a given sample corresponds to an organism that is already represented in the provided reference databases, or whether it may represent an unrecorded lineage. (This feature is still experimental and will most likely not work for an unrecorded species in a known genus.)

The name Myrio is inspired by the scientific name of the plant Myriophyllum spicatum, commonly known as Eurasian watermilfoil, an aquatic plant found in Lake Geneva (Lac Léman).

Installation

If you are running Linux or Windows on an x86_64 processor, you can download a pre-built binary from the releases page.

To build and install myrio on your system, you'll need Rust installed. Then run:

cargo install --locked --git https://github.com/anesthetice/Myrio myrio-cli

Note

You can also compile with CPU-specific optimizations if desired (although I personally haven't noticed much of a difference):

cargo install --locked --config build.rustflags="['-C', 'target-cpu=native']" --git https://github.com/anesthetice/Myrio myrio-cli

Quickstart

After installing (or downloading) myrio, head to the releases page and download the four .myrtree files and Hedera_Helix_Fulvia_matk_rbcL_psbA-trnH_ITS.fastq.zst from the latest release where these are available.

Then, open a terminal, cd into the directory where these files were downloaded and run the following:

# If you downloaded a pre-built binary, call that binary instead of calling `myrio`
myrio run -i Hedera_Helix_Fulvia_matk_rbcL_psbA-trnH_ITS.fastq.zst -t ./ --cache-counts

Now sit back and wait a little while for the k-mer counts to be computed. Don't worry: they will be cached, so subsequent runs will take substantially less time. Note that the .myrtree files will grow considerably in size as a result.

Once the counts have finished being computed, myrio will perform clustering, then finally compute and display the results per tree.

The first tree (for the ITS gene) shows very positive results:

The ◎ (bullseye) symbol represents the application's confidence that this clade was correctly identified. Confidence scores range from 0.0 to 1.0, but all you really have to keep in mind is that "green or cyan = good". A bullseye symbol appearing without a score next to it means that no other clade remained for the analysis (the other clades did not contain a sequence in the top 'x' of scores).

Here are the remaining results:

I recommend looking at all the available options for the run subcommand by running myrio run --help, as well as the configuration file, which can be located by running myrio misc get-config-location.

Important

As of version 0.3.0, a confidence score is displayed at the end, indicating how confident the program is that the analyzed sample exists in one of the databases. Please note that this feature is still experimental and in need of improvement.

Usage

Creating a database (a "tree")

Each reference database corresponds to a single barcode gene (e.g. one database for matK, another for rbcL, etc.). Databases are generated from a single .fasta file.

FASTA entries must contain a tax={...} annotation. For example:

>BOLD_PROCESS_ID=ZPLPP049-13|tax={p:Tracheophyta, c:Magnoliopsida, o:Rosales, f:Rosaceae, g:Prunus, s:Prunus persica}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...

The parser is flexible, so the following would also pass:

>BOLD_PROCESS_ID=ZPLPP049-13|tax={domain: Eukarya, kingdom: Plantae, phylum: Tracheophyta, class: Magnoliopsida, order: Rosales, family: Rosaceae, genus: Prunus, species: Prunus persica}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...

>BOLD_PROCESS_ID=ZPLPP049-13|tax={g:prunus; species: Prunus persica;}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...
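
To make the accepted header variants concrete, here is a minimal Python sketch of how a tax={...} annotation could be split into (rank, name) pairs. It is an illustration only, not Myrio's actual (Rust) parser: the function name parse_tax and the alias table are hypothetical, and only the abbreviations and separators that appear in the examples above are handled.

```python
import re

# Rank names from highest to lowest, as described in this README.
RANKS = ["domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"]

# Abbreviations seen in the example headers, mapped to full rank names.
# (Hypothetical table: other abbreviations may or may not be accepted by Myrio.)
RANK_ALIASES = {
    "p": "phylum", "c": "class", "o": "order",
    "f": "family", "g": "genus", "s": "species",
}


def parse_tax(header: str) -> list[tuple[str, str]]:
    """Extract (rank, name) pairs from the tax={...} part of a FASTA header."""
    match = re.search(r"tax=\{([^}]*)\}", header)
    if match is None:
        raise ValueError("header has no tax={...} annotation")
    pairs = []
    # Entries may be separated by commas or semicolons; a trailing separator is tolerated.
    for entry in re.split(r"[,;]", match.group(1)):
        entry = entry.strip()
        if not entry:
            continue
        key, sep, name = entry.partition(":")
        if not sep:
            raise ValueError(f"malformed entry: {entry!r}")
        key = key.strip().lower()
        rank = RANK_ALIASES.get(key, key)
        if rank not in RANKS:
            raise ValueError(f"unknown rank: {key!r}")
        pairs.append((rank, name.strip()))
    return pairs


if __name__ == "__main__":
    print(parse_tax(">BOLD_PROCESS_ID=ZPLPP049-13|tax={g:prunus; species: Prunus persica;}"))
```

Names are kept exactly as written in the header; whether Myrio normalizes capitalization (e.g. "prunus") is not stated here.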

Important

All entries must share the same highest-ranked clade. For example, if the highest rank of the first record is family: Araliaceae, then every other record must also have family: Araliaceae as its highest-ranked clade. The highest rank defined is Domain, while the lowest is Species.
No rank gaps are allowed. For instance, if you specify family, you cannot skip genus and go directly to species.
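
The two rules above can be sketched as a pair of checks in Python. This is an assumption-laden illustration rather than Myrio's own validation code: the function names are hypothetical, and the error wording is simply modeled on the message format quoted in this section.

```python
# Rank names from highest to lowest, as described in this README.
RANKS = ["domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"]


def check_no_rank_gaps(ranks: list[str]) -> None:
    """Raise ValueError if the ranks of one record skip a level or repeat one."""
    if not ranks:
        raise ValueError("record has no ranks")
    indices = sorted(RANKS.index(r) for r in ranks)
    # Every level between the highest and the lowest rank must be present once.
    expected = indices[-1] - indices[0] + 1
    if len(indices) != expected or len(set(indices)) != expected:
        raise ValueError(
            f"cannot have rank gaps, expected {expected} elements, got {len(indices)}"
        )


def check_same_top_clade(records: list[list[tuple[str, str]]]) -> None:
    """Raise ValueError unless every record shares the same highest-ranked clade."""
    tops = {min(rec, key=lambda pair: RANKS.index(pair[0])) for rec in records}
    if len(tops) > 1:
        raise ValueError(f"records disagree on the highest-ranked clade: {sorted(tops)}")


if __name__ == "__main__":
    check_no_rank_gaps(["family", "genus", "species"])  # passes silently
    try:
        check_no_rank_gaps(["family", "species"])  # genus is missing
    except ValueError as err:
        print(err)
```

Running such checks over a FASTA file before calling myrio tree new is one way to find problematic records early, although myrio itself reports and skips them anyway.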

Once your FASTA database is ready, you can convert it to the format used by myrio:

myrio tree new --input BOLD_Plantae_20250831_ITS.fasta --gene "ITS"
# To also pre-compute k-mer counts with `k=18` (highly recommended for performance, at the cost of a larger file):
myrio tree new --input BOLD_Plantae_20250831_ITS.fasta --gene "ITS" -k 18

This will create a file called BOLD_Plantae_20250831_ITS.myrtree.
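
For readers unfamiliar with k-mer counts, the following Python sketch shows what is being pre-computed for each reference sequence: every overlapping substring of length k is tallied. It is a generic illustration, not Myrio's implementation; details such as the handling of reverse complements or ambiguous bases are not covered here, and the function name count_kmers is hypothetical.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a DNA sequence."""
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers (none if n < k).
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


if __name__ == "__main__":
    # Tiny example with k=3; Myrio's example above uses k=18 on full barcode sequences.
    print(count_kmers("ATACATA", 3))
```

Storing these counts alongside every reference sequence is what makes the database larger, and it is also why later runs with the same k can skip this step.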

If errors are encountered, they will be reported and the problematic entries skipped. For example:

Failed to parse taxonomic identity of the record starting on line 356021, Failed to parse string into a list of clade: cannot have rank gaps, expected 6 elements, got 5; string: '>BOLD_PROCESS_ID=MHPAF950-11|tax={p:Tracheophyta, c:Liliopsida, o:Poales, f:Poaceae, s:Poaceae A.guadamuz275}'

Bio-seq parsing error for the record starting on line 393997, Unrecognised character: 'I' (0x49)

See myrio-py/db_gen.py for an example of how to generate a FASTA precursor database (note that it is a marimo notebook).

Pre-built databases are also available on the releases page.

Running

The main entry point for analysis is the run subcommand. At this stage, you should have:

One or more .myrtree databases corresponding to the barcode genes you wish to target (e.g., one database each for matK, rbcL, ITS, etc.).
A .fastq file containing sequencing reads from your sample. These reads should originate from mixed amplifications of the barcode genes you are targeting.

Usage of the run subcommand:

Usage: myrio run [OPTIONS] --input <input> --trees <trees>...

Options:
  -i, --input <input>
          The `.fastq` file to use as input

          [aliases: -q, --query]

  -t, --trees <trees>...
          The one or more `.myrtree` reference databases, also accepts directories

          [aliases: -r, --refs, --references, --db]

  -k, --k-search <k-search>
          The length of each k-mer (i.e., `k` itself) used for sequence comparison

  -s, --save-clusters <save-clusters>
          Save clusters to the specified path

      --output-csv <output-csv>
          Write results to a `.csv` file (e.g., `--csv .` will write to a timestamped file in the current directory)

          [aliases: --csv, --csv-output]

      --output-txt <output-txt>
          Write results to a `.txt` file (e.g., `--txt .` will write to a timestamped file in the current directory)

          [aliases: --txt, --txt-output]

      --cache-counts
          Flag that decides if newly-computed kmer counts are then cached

  -n, --nb-clusters <nb-clusters>
          The number of clusters to expect, defaults to the number of `.myrtree` files found

  -h, --help
          Print help (see a summary with '-h')

Example runs:

# Input must be a single `.fastq` file.
# `--trees` can be a directory containing multiple `.myrtree` databases.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/

# Newly computed k-mer counts (if not already pre-computed) can be cached directly into their respective database files.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/ --k-search 19 --cache-counts
# Note that if `--k-search` is not provided, the value is read from the configuration file (`~/.config/myrio/myrio.conf.toml`).

# If you expect more clusters than gene databases, you can set `--nb-clusters`.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/BOLD_Plantae_20250831_ITS.myrtree --nb-clusters 4
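
The way `--trees` and `--nb-clusters` interact can be pictured with a short Python sketch: files and directories are expanded into a list of .myrtree databases, and the cluster count falls back to the size of that list. This is only a model of the behavior described in the help text; the function names are hypothetical, and scanning only the top level of a directory is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_trees(paths: list[str]) -> list[Path]:
    """Expand files and directories into a list of .myrtree database files."""
    found: list[Path] = []
    for raw in paths:
        path = Path(raw)
        if path.is_dir():
            # Assumption: only the top level of a directory is scanned.
            found.extend(sorted(path.glob("*.myrtree")))
        elif path.suffix == ".myrtree":
            found.append(path)
    return found


def default_nb_clusters(trees: list[Path], nb_clusters: Optional[int] = None) -> int:
    """Use the explicit value if given, otherwise one cluster per database."""
    return nb_clusters if nb_clusters is not None else len(trees)


if __name__ == "__main__":
    trees = resolve_trees(["myrio-db/"])
    print(len(trees), "databases,", default_nb_clusters(trees), "clusters expected")
```

In the last example run above, a single ITS database is combined with `--nb-clusters 4` because the reads still come from four amplified regions.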

Features

Cross-platform (Windows, macOS, and Linux are all supported)
Single binary, doesn't depend on any external libraries
Developed in Rust (free from the hassle of installing/using Python code)
Optimized codebase, including but not limited to:
- Custom sparse vector implementation with efficient operations
- Parallelism via Rayon
- Specialized database format able to store pre-computed k-mer counts efficiently
Flexible output: results can be exported as .csv or visualized as a tree in .txt format
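
To illustrate why a sparse representation suits k-mer counts: with k=18 there are 4^18 (about 69 billion) possible k-mers, yet a single barcode sequence contains at most a few hundred or thousand distinct ones. The Python sketch below stores only the non-zero counts in a dictionary and compares two such vectors. Cosine similarity is used purely as an example; this README does not state which operations or similarity measure Myrio's sparse vectors actually implement.

```python
import math


def cosine_similarity(a: dict[str, int], b: dict[str, int]) -> float:
    """Cosine similarity of two sparse count vectors stored as {k-mer: count}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the smaller vector only
    # Only k-mers present in both vectors contribute to the dot product.
    dot = sum(count * b.get(kmer, 0) for kmer, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    read = {"ATA": 2, "TAC": 1}
    reference = {"ATA": 1, "CAT": 1}
    print(round(cosine_similarity(read, reference), 3))
```

The cost of the comparison depends on the number of stored entries rather than on the size of the full k-mer space.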

Acknowledgments

Special thanks to the Paoli Lab for hosting this project, as well as for all the encouragement, feedback, and support they provided along the way.
Special thanks to GenoRobotics, and especially our team for the 2025 Lemanic Life Sciences Hackathon, which built the proof-of-concept for this application.
