
Myrio is a command-line tool for identifying plant taxonomy from mixed barcode gene reads. It can also estimate whether a sample is already represented or not in reference databases.

anesthetice/Myrio

Myrio

Myrio is a command-line application designed to identify the taxonomy of plants, and potentially other organisms, from amplified DNA barcode sequences.

It is tailored to handle mixed reads specifically, wherein multiple barcode regions (e.g. matK, rbcL, ITS, trnH-psbA, etc.) are amplified and sequenced together.

In addition to taxonomic identification, Myrio also estimates whether a given sample corresponds to an organism that is already represented in the provided reference databases, or whether it may represent an unrecorded lineage. (This feature is still experimental and will most likely not work for an unrecorded species in a known genus.)

The name Myrio is inspired by the scientific name of Myriophyllum spicatum, commonly known as Eurasian watermilfoil, an aquatic plant found in Lake Geneva (Lac Léman).

Installation

If you are running Linux or Windows on an x86_64 processor, you can download a pre-built binary from the releases page.

To build and install myrio on your system, you'll need Rust installed. Then run:

cargo install --locked --git https://github.com/anesthetice/Myrio myrio-cli

Note

You can also compile with CPU-specific optimizations if desired (although I personally haven't noticed much of a difference):

cargo install --locked --config build.rustflags="['-C', 'target-cpu=native']" --git https://github.com/anesthetice/Myrio myrio-cli

Quickstart

After installing (or downloading) myrio, head to the releases page and download the four .myrtree files and Hedera_Helix_Fulvia_matk_rbcL_psbA-trnH_ITS.fastq.zst from the latest release where these are available.

Then, open a terminal, cd into the directory where these files were downloaded and run the following:

# If you downloaded a pre-built binary, call that binary instead of calling `myrio`
myrio run -i Hedera_Helix_Fulvia_matk_rbcL_psbA-trnH_ITS.fastq.zst -t ./ --cache-counts

Now sit back and wait a little while for the k-mer counts to be computed. Don't worry: because of --cache-counts, they are cached in the .myrtree files, so subsequent runs will take substantially less time. Note that the .myrtree files will grow considerably in size as a result.
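
For readers unfamiliar with the term, a k-mer count is simply a tally of every overlapping substring of length k in a sequence. The following is a minimal Python sketch of that idea only. It is not Myrio's implementation (which is written in Rust and uses its own sparse vector type), and the function name count_kmers is hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence."""
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


if __name__ == "__main__":
    # Print the 3-mer tally of a short toy sequence.
    print(count_kmers("ATATAT", 3))
```

Counting is repeated for every reference sequence in a database, which is why caching the result pays off on later runs.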

Once the counts have finished being computed, myrio will perform clustering, then finally compute and display the results per tree.

The first tree (for the ITS gene) shows very positive results:

The (bullseye) symbol represents the application's confidence that this clade was correctly identified. Confidence scores range from 0.0 to 1.0, but all you really have to keep in mind is that "green or cyan = good". A bullseye symbol without a score next to it means that no other clade remained for the analysis (the others did not contain a sequence in the top 'x' of scores).

Here are the remaining results:

I recommend looking at all the available options for the run subcommand by running myrio run --help, as well as the configuration file, which can be located by running myrio misc get-config-location.

Important

As of version 0.3.0, a confidence score is displayed at the end, indicating how confident the program is that the analyzed sample exists in one of the databases. Please note that this feature is still experimental and in need of improvement.

Usage

Creating a database (a "tree")

Each reference database corresponds to a single barcode gene (e.g. one database for matK, another for rbcL, etc.). Databases are generated from a single .fasta file.

FASTA entries must contain a tax={...} annotation. For example:

>BOLD_PROCESS_ID=ZPLPP049-13|tax={p:Tracheophyta, c:Magnoliopsida, o:Rosales, f:Rosaceae, g:Prunus, s:Prunus persica}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...

The parser is flexible, so the following would also pass:

>BOLD_PROCESS_ID=ZPLPP049-13|tax={domain: Eukarya, kingdom: Plantae, phylum: Tracheophyta, class: Magnoliopsida, order: Rosales, family: Rosaceae, genus: Prunus, species: Prunus persica}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...

>BOLD_PROCESS_ID=ZPLPP049-13|tax={g:prunus; species: Prunus persica;}
ATACCCTACCCCATTCATCTGGAAATCTTGGTTCAAACCCTTCGCTATTGGGTGAAAGACGCCTCTTCTTTGCATTTATTACGACTCTTTCTTCACGAGTATTATAATTGGAATAG...

Important

  1. All entries must share the same highest-ranked clade. For example, if the highest rank of the first record is family: Araliaceae, then every other record must also have family: Araliaceae as its highest-ranked clade (note that the highest rank defined is Domain, while the lowest is Species).
  2. No rank gaps are allowed. For instance, if you specify family, you cannot skip genus and go directly to species.
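
The two rules above can be checked before handing a FASTA file to myrio. The following Python sketch is an illustration under stated assumptions, not Myrio's actual parser: it assumes that one-letter rank keys are the first letter of the rank name, that fields are separated by commas or semicolons (as in the examples above), and that clade names are compared exactly, including case. The names parse_tax, check_no_gaps, and check_headers are hypothetical.

```python
import re

# Rank order from highest to lowest, as described above.
RANKS = ["domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"]
# Assumption: a one-letter key is the first letter of the rank name.
ABBREVIATIONS = {name[0]: name for name in RANKS}


def parse_tax(header):
    """Return sorted (rank_index, clade_name) pairs from a FASTA header line."""
    match = re.search(r"tax=\{([^}]*)\}", header)
    if match is None:
        raise ValueError("missing tax={...} annotation")
    clades = []
    for field in re.split(r"[,;]", match.group(1)):
        if not field.strip():
            continue  # tolerate a trailing separator
        key, sep, name = field.partition(":")
        if not sep:
            raise ValueError(f"expected 'rank: name', got {field!r}")
        key = key.strip().lower()
        rank = ABBREVIATIONS.get(key, key)
        if rank not in RANKS:
            raise ValueError(f"unknown rank {key!r}")
        clades.append((RANKS.index(rank), name.strip()))
    if not clades:
        raise ValueError("empty tax annotation")
    return sorted(clades)


def check_no_gaps(clades):
    """Raise ValueError if a rank between the highest and lowest is missing."""
    indices = [index for index, _ in clades]
    expected = indices[-1] - indices[0] + 1
    if indices != list(range(indices[0], indices[-1] + 1)):
        raise ValueError(f"rank gap: expected {expected} elements, got {len(indices)}")


def check_headers(headers):
    """Check every header for gaps and for a shared highest-ranked clade."""
    root = None
    for header in headers:
        clades = parse_tax(header)
        check_no_gaps(clades)
        if root is None:
            root = clades[0]
        elif clades[0] != root:
            raise ValueError(f"highest-ranked clade differs: {clades[0]} vs {root}")
    return root
```

Passing the header lines of a file (those starting with ">") to check_headers either returns the shared highest-ranked clade or raises an error describing the first offending record.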

Once your FASTA database is ready, you can convert it to the format used by myrio:

myrio tree new --input BOLD_Plantae_20250831_ITS.fasta --gene "ITS"
# And if we want to pre-compute k-mer counts with `k=18` (highly recommended for performance, comes at the cost of size however)
myrio tree new --input BOLD_Plantae_20250831_ITS.fasta --gene "ITS" -k 18

This will create a file called BOLD_Plantae_20250831_ITS.myrtree.

If errors are encountered, they will be reported and the problematic entries skipped. For example:

Failed to parse taxonomic identity of the record starting on line 356021, Failed to parse string into a list of clade: cannot have rank gaps, expected 6 elements, got 5; string: '>BOLD_PROCESS_ID=MHPAF950-11|tax={p:Tracheophyta, c:Liliopsida, o:Poales, f:Poaceae, s:Poaceae A.guadamuz275}'

Bio-seq parsing error for the record starting on line 393997, Unrecognised character: 'I' (0x49)

See myrio-py/db_gen.py for an example of how to generate a FASTA precursor database (note: it is a marimo notebook).

Pre-built databases are also available on the releases page.

Running

The main entry point for analysis is the run subcommand. At this stage, you should have:

  • One or more .myrtree databases corresponding to the barcode genes you wish to target (e.g., one database each for matK, rbcL, ITS, etc.).
  • A .fastq file containing sequencing reads from your sample. These reads should originate from mixed amplifications of the barcode genes you are targeting.

Usage of the run subcommand:

Usage: myrio run [OPTIONS] --input <input> --trees <trees>...

Options:
  -i, --input <input>
          The `.fastq` file to use as input

          [aliases: -q, --query]

  -t, --trees <trees>...
          The one or more `.myrtree` reference databases, also accepts directories

          [aliases: -r, --refs, --references, --db]

  -k, --k-search <k-search>
          The length of each k-mer (i.e., `k` itself) used for sequence comparison

  -s, --save-clusters <save-clusters>
          Save clusters to the specified path

      --output-csv <output-csv>
          Write results to a `.csv` file (e.g., `--csv .` will write to a timestamped file in the current directory)

          [aliases: --csv, --csv-output]

      --output-txt <output-txt>
          Write results to a `.txt` file (e.g., `--txt .` will write to a timestamped file in the current directory)

          [aliases: --txt, --txt-output]

      --cache-counts
          Flag that decides if newly-computed kmer counts are then cached

  -n, --nb-clusters <nb-clusters>
          The number of clusters to expect, defaults to the number of `.myrtree` files found

  -h, --help
          Print help (see a summary with '-h')

Example runs:

# Input must be a single `.fastq` file.
# `--trees` can be a directory containing multiple `.myrtree` databases.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/

# Also, k-mer counts computed (if not already pre-computed) can be cached directly into their respective database file.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/ --k-search 19 --cache-counts
# Note that if `--k-search` is not provided, the value is read from the configuration file (`~/.config/myrio/myrio.conf.toml`).

# If you expect more clusters than gene databases, you can set `--nb-clusters`.
myrio run --input Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq --trees myrio-db/BOLD_Plantae_20250831_ITS.myrtree --nb-clusters 4
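
When many samples need to be processed, the same invocations can be scripted. The following Python sketch only assembles the flags documented above (--input, --trees, --k-search, --nb-clusters, --cache-counts) and hands them to subprocess. The helper names build_run_command and run_myrio are hypothetical, and the sketch assumes that myrio is on your PATH and exits with a non-zero status on failure.

```python
import subprocess


def build_run_command(fastq, trees, k_search=None, nb_clusters=None,
                      cache_counts=False, binary="myrio"):
    """Assemble the argument list for `myrio run` from the documented options."""
    command = [binary, "run", "--input", str(fastq), "--trees", *map(str, trees)]
    if k_search is not None:
        command += ["--k-search", str(k_search)]
    if nb_clusters is not None:
        command += ["--nb-clusters", str(nb_clusters)]
    if cache_counts:
        command.append("--cache-counts")
    return command


def run_myrio(*args, **kwargs):
    """Run myrio; raises CalledProcessError if it exits with a non-zero status."""
    return subprocess.run(build_run_command(*args, **kwargs), check=True)


if __name__ == "__main__":
    # Show the command that would be executed, without running it.
    print(build_run_command(
        "Berberis_Julianae_matK_rbcL_psbA-trnH_ITS.fastq",
        ["myrio-db/"],
        k_search=19,
        cache_counts=True,
    ))
```

Keeping command construction separate from execution makes it easy to inspect or log the exact command line before launching a long run.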

Features

  • Cross-platform (Windows, macOS, and Linux are all supported)
  • Single binary, doesn't depend on any external libraries
  • Developed in Rust (free from the hassle of installing/using Python code)
  • Optimized codebase, including but not limited to:
    • Custom sparse vector implementation with efficient operations
    • Parallelism via Rayon
    • Specialized database format able to store pre-computed k-mer counts efficiently
  • Flexible output, results can be exported as .csv or visualized as a tree in .txt format

Acknowledgments

  • Special thanks to the Paoli Lab for hosting this project, as well as for all the encouragement, feedback, and support they provided along the way.
  • Special thanks to GenoRobotics, and especially our team for the 2025 Lemanic Life Sciences Hackathon, which built the proof-of-concept for this application.
