sigsim

sigsim simplifies simulation of mutational signature catalogues.

sigsim is part of the sigverse

sigsim is in early development and not yet ready for use.

Philosophy

Simulating mutational catalogues is crucial for evaluating the performance of signature analysis tools. To assess how reliably signatures can be detected, we use simulated catalogues with known compositions.

To generate simulated datasets, we first parameterise a multinomial distribution based on a single mutational signature OR some combination of multiple mutational signatures. We then randomly sample this multinomial distribution ‘n’ times (where n = number of mutations we want to simulate for a sample). It is the randomness of the sampling which adds noise to profiles.

When simulating a catalogue generated from multiple mutational signatures, sigsim offers three different approaches, mixture distribution (generally recommended), stratified sampling, and noiseless reconstruction.

For example, to simulate catalogue of 400 mutations where 30% are from signature 1 and 70% are from signature 2:

Mixture distribution sampling (sig_simulate_mixed):

First creates a combined signature using a weighted average (30% Signature 1, 70% Signature 2).
Then simulates mutations by drawing 400 samples from a multinomial distribution defined by this combined signature.
This adds realistic randomness - you don’t get exactly 30%/70% in every simulation, just on average.

Stratified Sampling (sig_simulate_stratified):

Independently sample 120 mutations (30% x 400) from signature 1 and 280 (70% x 400) from signature 2, then combine them together.
Gives us exact knowledge over the balance of signatures in the final sample (explicitly enforced to be 120 & 280).
However, imagine we simulate a catalogue with total of 10 mutations, 96% from signature 1 and 4% from signature 2, the result would never include any mutations from signature 2! The mixture distribution sampling avoids this this problem.

Perfect (noiseless) catalogues (sig_simulate_perfect):

Creates a deterministic catalogue from a signature model (e.g., 30% SBS2 and 70% SBS13), without any stochastic noise.
This is useful for generating ground truth profiles or for validating how well tools can detect signatures when no noise is present.
By default, counts are rounded to whole numbers to reflect biological realism (mutations are discrete).
However, rounding can distort the expected proportions, especially with low mutation counts. For example, simulating 1 mutation will rarely produce a catalogue with good cosine similarity to the underlying signature profile.
If exact proportions are needed (e.g., for cosine similarity of 1 with the model), set round = FALSE.

Installation

You can install the development version of sigsim from GitHub with:

# install.packages("devtools")
devtools::install_github("selkamand/sigsim")

Quick Start

Which Function Should I Use?

Goal	Use This Function
Simulate from 1 signature	`sig_simulate_single()`
Simulate from multiple (stochastic)	`sig_simulate_mixed()`
Simulate from multiple (fixed proportions)	`sig_simulate_stratified()`
Create noiseless catalogue (based on model)	`sig_simulate_perfect()`
Create noiseless catalogue (from one signature)	`sigstats::sig_reconstruct()`
Bootstrap/resample observed catalogue	`sig_bootstrap_catalogue()`

library(sigsim) # Signature simulation
library(sigstash) # Signature collections
library(sigvis) # Visualisation of signatures
library(sigstats) # Signature Statistics

# Load signature collections from sigstash
signatures = sig_load('COSMIC_v3.4_SBS_GRCh38')


# Randomly simulate 5 samples, each with 400 mutations, by multinomial random sampling of SBS1
catalogues <- sig_simulate_single(
  signatures[["SBS1"]], 
  n = 400, 
  n_catalogues = 5
)

# Randomly simulate 5 samples, each with 400 mutations, by multinomial random sampling of an underlying reconstructed signature containing 30% SBS1 and 70% SBS3
catalogues_from_model <- sig_simulate_mixed(
  signatures, 
  model = c('SBS2' = 0.3, 'SBS13' = 0.7), 
  n = 400, 
  n_catalogues = 5, 
  seed = 42
)

# Create a noiseless catalogue (i.e., exact contribution: 30% SBS2, 70% SBS13, total 400 mutations)
perfect_catalogue <- sig_simulate_perfect(
  signatures,
  model = c('SBS2' = 0.3, 'SBS13' = 0.7),
  n = 400
)

# Compare a randomly simulated catalogue to its corresponding theoretical (perfect) model
sig_visualise_compare(catalogue1 = perfect_catalogue, catalogue2 = catalogues_from_model[[1]], "count", names = c("Simulated", "Perfect"), title = "Simulated (coloured) vs theoretically perfect (black outline)", subtitle = "400 mutations: 30% SBS2 and 70% SBS13")
#> ✔ All channels matched perfectly to set [sbs_96]. Using this set for sort order
#> ✔ All types matched perfectly to set [sbs_type]. Using this set for sort order
#> ✔ Types matched perfectly to palette [snv_type]

Bootstrapping Existing Catalogues

Sometimes instead of simulating from known signatures, you want to resample from an observed catalogue to explore uncertainty or assess robustness.

Use sig_bootstrap_catalogue() to generate bootstrap replicates:

library(sigshared)
#> 
#> Attaching package: 'sigshared'
#> The following object is masked from 'package:sigvis':
#> 
#>     example_umap
library(sigsim)

# Load an example observed catalogue
catalogue <- example_catalogue()

# Generate 5 bootstrap replicates
bootstraps <- sig_bootstrap_catalogue(catalogue, n_catalogues = 5)

# View one bootstrap replicate
bootstraps[[1]]
#>   channel type count  fraction
#> 1 A[T>C]G  T>C     4 0.1481481
#> 2 A[T>C]C  T>C     8 0.2962963
#> 3 A[T>C]T  T>C    15 0.5555556

Simulating Datasets for Sensitivity, Inter-Signature Interference, and Overfitting Assessment

The datasets simulated above are well-suited for evaluating sensitivity and inter-signature interference. However, if your focus is on assessing overfitting, you may need more control over the level and type of ‘noise’ added to the model.

Simulating Overfitting with Dropped Signatures

One strategy to test for overfitting is to simulate mutation catalogues using a combination of known signatures, but then drop one signature when performing the fitting. If the fitting algorithm attempts to explain the missing signature by over-relying on other available signatures, it indicates overfitting.

This method allows you to test how well the algorithm can handle ‘unknown’ components - i.e., real mutational processes that aren’t represented in your signature collection.

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
R		R
man		man
tests		tests
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
LICENSE		LICENSE
LICENSE.md		LICENSE.md
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
cran-comments.md		cran-comments.md
sigsim.Rproj		sigsim.Rproj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Licenses found

Uh oh!

Repository files navigation

sigsim

Philosophy

Installation

Quick Start

Which Function Should I Use?

Bootstrapping Existing Catalogues

Simulating Datasets for Sensitivity, Inter-Signature Interference, and Overfitting Assessment

Simulating Overfitting with Dropped Signatures

About

Licenses found

Uh oh!

Releases

Packages

Languages

License

Licenses found

selkamand/sigsim

Folders and files

Latest commit

History

Repository files navigation

sigsim

Philosophy

Installation

Quick Start

Which Function Should I Use?

Bootstrapping Existing Catalogues

Simulating Datasets for Sensitivity, Inter-Signature Interference, and Overfitting Assessment

Simulating Overfitting with Dropped Signatures

About

Resources

License

Licenses found

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages