Abstract
Accurate identification of somatic small variants in tumors plays a crucial role in cancer diagnosis. Various somatic mutation callers have been developed; however, existing methods face limitations in modeling mapping information of flanking genomic sites (genomic sites adjacent to a somatic site) that influence the state of the somatic site. Additionally, they are unable to analyze inter-site interactions within the context sequence centered around a somatic site or appropriately weigh the effects of flanking genomic sites on the somatic site. To address these limitations, the Transformer model is utilized to develop TransSSVs for detecting somatic small variants. The core functionality of TransSSVs relies on the multi-head attention mechanism, which generates a reliable representation of interactions between a candidate somatic site and its flanking genomic sites within the context sequence. TransSSVs effectively extract mapping features of various genomic sites in the context sequence to enhance prediction accuracy. Benchmarking experiments demonstrate that TransSSVs exhibit robust performance when compared with state-of-the-art methods on well-characterized real and simulated tumor datasets. Furthermore, the contributions of flanking genomic sites to the detection of somatic sites are assessed, and attention weight patterns for positive and negative somatic sites are analyzed.
Similar content being viewed by others
Explore related subjects
Discover the latest articles and news from researchers in related subjects, suggested using machine learning.Avoid common mistakes on your manuscript.
1 Introduction
Somatic mutations are genetic alterations that occur in non-germline cells (i.e., cells other than sperm or eggs) and are not inherited from parents. These mutations arise during an individual’s lifetime and play important roles in cancer development and mutational progression [1, 2]. Identifying these mutations is considered a critical step in precision medicine [3]. High-throughput sequencing technologies have improved the accessibility of sequencing data [4]. However, accurately detecting somatic mutations from sequencing data remains challenging due to biological and technological noise, including intra-tumor heterogeneity (the presence of genetically distinct subclones within a tumor), sample contamination, read alignment uncertainties, and base sequencing artifacts [5, 6]. A major challenge is the effective modeling of these noise factors and the application of robust filtering techniques to reduce false-positive predictions [7].
Various computational tools, referred to as somatic callers, are available for identifying somatic mutations, including Mutect2 [8], EBCall [9], VarScan2 [10], SomaticSeq [11], and MutationSeq [12]. These tools are broadly classified into statistical-based and machine learning-based methods. Statistical approaches primarily rely on prior and likelihood probabilities, while machine learning methods depend on feature engineering, which requires domain-specific expertise [3, 13]. Despite their widespread use, both approaches have a major limitation: they focus on individual loci and do not model interactions between candidate somatic sites and adjacent genomic regions. This limitation hinders the detection of sequencing artifacts influenced by genomic context [14, 15]. As a result, distinguishing true somatic mutations from noise remains particularly challenging in regions with low variant allele frequencies (VAFs) (the proportion of reads supporting a variant allele) and high intra-tumor heterogeneity.
Deep learning models have demonstrated potential in overcoming these challenges by processing high-dimensional raw data and learning abstract feature representations [16,17,18,19,20,21,22]. For example, NeuSomatic [23] employs a convolutional neural network (CNN)-based approach to detect somatic mutations by converting the mapping information of the context sequence surrounding a candidate somatic site into a 3D matrix, which serves as input for a deep residual model. Likewise, DeepSSV [24] encodes the raw mapping data of the context sequence into a 2D matrix to differentiate true somatic mutations from background noise. Despite their effectiveness, these CNN-based models are unable to assess inter-site interactions or quantify the impact of flanking genomic sites on candidate somatic sites. This limitation in capturing long-range dependencies and contextual relationships affects their accuracy and reliability, particularly in complex genomic regions.
The Transformer model, widely recognized for its multi-head attention mechanism in natural language processing, has gained attention in bioinformatics due to its ability to capture interactions among genomic sites using attention weights, providing biological interpretability [25,26,27,28,29,30]. However, its application in somatic mutation detection remains in the early stages. Current implementations of Transformers in bioinformatics primarily focus on areas such as single-cell genomics, protein structure prediction, and the functional effects of genomic variations. For instance, research by Artur Szałata et al. demonstrates their effectiveness in analyzing complex single-cell data [30]. Other studies have utilized Transformer models to predict the functional consequences of genetic variants and model gene expression patterns, highlighting their adaptability in various genomic applications [25, 29]. However, few studies have specifically addressed somatic small variant detection using Transformers, indicating an unexplored opportunity in this area.
To address the limitations of existing somatic callers, TransSSVs has been developed as a tool that utilizes the Transformer model for detecting somatic small variants. Unlike conventional methods, TransSSVs effectively captures interactions between candidate somatic sites and their adjacent genomic regions through the multi-head attention mechanism. This innovative approach enhances the representation of the genomic context, leading to improved accuracy in somatic mutation detection. The multi-head attention mechanism allows the model to assess the influence of flanking genomic sites and quantify their impact on the candidate somatic site. Furthermore, the attention weights provide biological interpretability, offering insights into inter-site interactions. Benchmarking experiments conducted on real and simulated tumor datasets demonstrate the strong performance of TransSSVs, confirming its effectiveness in comparison to state-of-the-art methods. Code availability:https://github.com/jingmeng-bioinformatics/TransSSVs.
2 Materials and methods
2.1 Datasets
2.1.1 Real tumors
In this study, four real tumor and matched normal whole genome sequencing datasets were selected to evaluate TransSSVs alongside six somatic callers, as these datasets were well characterized by their providers to identify high-confidence somatic mutation sites. The first dataset consists of paired COLO829 (metastatic melanoma) and COLO829BL (matched lymphoblastoid) cell lines with sequencing depths of 80x, provided by the Translational Genomics Research Institute (TGen) [31]. The second and third datasets include medulloblastoma (MB) and chronic lymphocytic leukemia (CLL) samples from the International Cancer Genome Consortium (ICGC) [7]. The final dataset consists of a primary acute myeloid leukemia (AML) sample with a sequencing depth of 312x and its matched normal sample with a sequencing depth of 121x [32]. High-confidence somatic mutation sites identified by dataset providers were designated as ground-truth positive mutations, while the remaining genomic sites were considered ground-truth negative mutations.
For AML, MB, and CLL samples, the provided BAM files were not used directly due to inconsistencies in the human reference genome versions. Instead, BWA was employed to align reads from the original FASTQ files to the human reference genome hg38/GRCh38 using default parameters [33]. The resulting BAM files were then processed using Picard to mark PCR and optical duplicates, followed by base quality score adjustments and realignment using GATK [34]. Since tumor genomes are typically sequenced at depths between 30x and 50x, BAM files for COLO829 and AML samples were downsampled from 80x to 50x for both tumor and normal samples and from 312x and 121x to 50x for tumor and normal samples, respectively, to better reflect real application scenarios. BAM files for MB and CLL samples were downsampled to 40x for tumor samples and 30x for normal samples.
Somatic mutation detection presents a highly skewed class distribution problem, as non-somatic sites vastly outnumber somatic sites in tumor genomes. Most currently sequenced tumors are not fully characterized, and only a limited number of small-sized somatic mutations have been identified and experimentally validated in specific genome regions rather than across the entire genome, leading to biased evaluations of somatic callers [32]. Additionally, training models with a small number of high-confidence somatic mutations is inadequate, as such models are prone to high false-positive rates and reduced performance. Considering these factors, the highly mutated COLO829 genome was selected as the training set, containing about 35,000 ground-truth somatic sites. This dataset al.lows for the creation of a large, balanced training set that enhances the ability to distinguish somatic sites from non-somatic sites. The other three tumor datasets, MB, CLL, and AML, pose significant challenges for somatic callers due to their low mutation rates, substantial clonal heterogeneity, and the presence of copy number variations. These datasets were chosen for independent validation to assess the generalization capabilities of somatic callers.
2.1.2 Simulated tumors
To comprehensively evaluate the generalization capabilities of somatic mutation callers, three simulated whole-genome sequencing tumor datasets (M1, M2, and M3) with a sequencing depth of 50x were used [35]. These datasets were generated by introducing somatic mutations into the well-characterized pre-tumor or normal NA12878 genome [36]. The simulation process was designed to replicate real-world tumor heterogeneity and mutation burdens, providing a controlled environment for assessing the performance of somatic mutation detection tools.
The simulated tumors M1, M2, and M3 were constructed with different mutation loads to represent various tumor types. M1 and M3 exhibit a higher mutation burden of 10 somatic SNVs per megabase, simulating hypermutated tumors commonly observed in melanoma or colorectal cancer. In contrast, M2 has a lower mutation load of 5 somatic SNVs (single nucleotide variants) per megabase, reflecting tumors with moderate mutation rates, such as those associated with breast or prostate cancers. Additionally, the mutation load for somatic small INDELs (insertions and deletions) in all three datasets was set at 10% of the somatic SNV load, aligning with the typical ratio observed in real tumors. This design enables the evaluation of mutation caller robustness under varying mutation burdens and mutation types.
The simulated tumors also incorporated sub-clonal structures to replicate intra-tumor heterogeneity. M1 and M3 contain four cell sub-clones with expected VAFs of 0.1, 0.2, 0.35, and 0.5. The presence of multiple sub-clones with varying VAFs reflects the complexity of real tumors, where distinct cell populations may harbor unique mutations. In contrast, M2 consists of three sub-clones with VAFs of 0.2, 0.35, and 0.5, representing tumors with a less complex sub-clonal architecture. This variation in sub-clonal composition provides a basis for assessing the ability of mutation callers to detect mutations in heterogeneous tumor environments.
2.2 Overview of TransSSVs
Figure 1 presents an overview of the TransSSVs model, illustrating its workflow for detecting somatic mutations. The process begins with the extraction of candidate somatic sites from a mixed pileup file, which is generated by aligning tumor and normal sequencing data. TransSSVs then encodes the mapping information of the genomic context surrounding each candidate site, constructing an input feature matrix that captures the characteristics of the context sequence. This matrix is subsequently processed by the Transformer-based deep learning model, an attention-driven neural network designed to capture interactions between the candidate somatic site and adjacent genomic sites within the context sequence. By leveraging this approach, the model generates a refined feature representation of the context sequence, which is essential for accurately identifying somatic mutations and improving the precision of mutation detection.
Overview of the TransSSVs model. The workflow begins with the identification of candidate somatic sites from a mixed pileup file, which is generated from BAM files of both tumor and normal samples. An input matrix is then constructed to represent the context sequence surrounding each candidate site, with each column corresponding to either the candidate site itself or its adjacent genomic sites. These columns contain summarized information on three key mapping features: base count, mapping quality, and base quality, derived from reads covering the genomic sites in both tumor and normal samples. The input matrix is subsequently processed by a Transformer-based deep learning model, an attention-driven neural network designed to capture interactions between the candidate somatic site and its neighboring genomic sites within the context sequence. The multi-head attention mechanism plays a crucial role in predicting the probability that a candidate site represents a true somatic mutation, thereby improving the accuracy of somatic mutation detection
TransSSVs consists of four interconnected components: the preparation block, embedding block, encoder block, and optimization block (Fig. 2) [25]. The process begins with an input matrix representing a candidate somatic site, which undergoes transformation in the preparation block through a series of fully connected layers to generate a refined feature matrix. The embedding block then integrates positional embeddings for genomic sites within the context sequence surrounding the candidate somatic site, incorporating a dropout layer to enhance model robustness (embedding position vector in Supplementary Figure S1). This positionally embedded context sequence is subsequently fed into the encoder block, which serves as the core of TransSSVs. Within the encoder block, the multi-head attention mechanism is employed to capture interactions between the candidate somatic site and its neighboring genomic sites. Additionally, feature representation is refined through a combination of dropout, layer normalization, fully connected layers, and activation functions. The output of the encoder block is then processed by the optimization block, which consists of fully connected neural layers designed to predict the somatic probability of a candidate site. This probability indicates the likelihood that a candidate site represents a true somatic mutation rather than sequencing noise or a germline variant. The optimization block utilizes the enriched feature representation to effectively distinguish somatic mutations from background noise, a critical task in regions with low VAFs and high intra-tumor heterogeneity where detection is particularly challenging. By generating a probabilistic assessment, TransSSVs improves detection accuracy and enhances the reliability of somatic mutation identification in cancer genomics. This structured workflow highlights the model’s innovative approach to somatic mutation detection.
The architecture of the TransSSVs model. TransSSVs is an attention-based deep neural network adapted from the encoder component of the Transformer model (a). It consists of four key blocks: the preparation block (b), the embedding block (c), the encoder block (d), and the optimization block (e). The preparation block processes the input matrix to generate a new feature matrix. The embedding block incorporates positional embeddings for genomic sites within the context sequence centered on a candidate somatic site, allowing the model to recognize the relative positions of these sites. The encoder block, which forms the core of TransSSVs, utilizes a multi-head attention mechanism to capture interactions between the candidate somatic site and its surrounding genomic sites. Finally, the optimization block processes the features extracted by the encoder to predict the somatic probability of a candidate site. The dimensions (256, L, 512) represent a batch size of 256, L as the number of genomic sites in the context sequence, and 512 as the number of neurons in the linear layers
2.2.1 Preparation of candidate somatic sites
The mixed pileup file generated by Samtools from tumor and matched normal BAM files is scanned to identify candidate somatic sites, following the approach used in DeepSSV [24]. To ensure the reliability of candidate somatic sites, genomic sites are selected based on the following criteria: (1) the base in the reference genome must be a standard nucleotide; (2) the strand-specific frequency must fall within the range of 10–90%; (3) the length of INDELs in the tumor sample must not exceed 50 bases; (4) the read depth covering both tumor and normal samples must be greater than 10; and (5) the Phred-scaled base quality scores and mapping quality scores of covering reads must exceed 10.
2.2.2 Generation of input matrix of candidate somatic sites
To generate the input feature matrix for each candidate somatic site, an L \times 52 matrix is constructed, centered around the candidate site, where L represents the number of columns. For instance, if the number of flanking genomic sites is set to 7, L equals 15, with each column corresponding to a flanking genomic site on either side of the candidate somatic site within the context sequence. Each column comprises 52 rows, divided into three sections. The first Sect. (4 rows) represents the base counts of four allele types: variant allele in normal, reference allele in normal, variant allele in tumor, and reference allele in tumor among the covering reads. The second Sect. (24 rows) captures six mapping quality metrics (maximum, minimum, average, median, upper quartile, and lower quartile) for these four allele types. Similarly, the third Sect. (24 rows) represents six base quality metrics for the alleles. For normalization, base counts are adjusted by dividing them by the depth of covering reads, and base quality and mapping quality metrics are converted from Phred scores to base-calling error probabilities. The resulting input matrix for each candidate somatic site is denoted as \(M=\left\{ {{M_1},{M_2},{...},{M_L}} \right\}\), where Mi represents the ith column with 52 rows.
2.2.3 Capture of inter-site interactions by a Transformer-based deep learning model
The input feature matrix is processed by a Transformer-based deep learning model, an attention-driven neural network designed to capture inter-site interactions within the context sequence centered around a candidate somatic site [25]. To represent these interactions, the feature vector of each site is initially encoded using a fully connected layer, with mapping information Mi as input:
where f denotes the transformation applied by the fully connected layer. The model utilizes multiple attention heads, represented as h, to extract diverse information from different sites across various representation spaces. Each attention head corresponds to a distinct representation space, focusing on different regions of the context sequence that are aligned with specific genomic sites. This multi-head attention mechanism enhances the model’s ability to generate comprehensive site-based feature representations enriched with contextual semantics. The matrices \(\:{W}_{Q}\),\(\:{\:W}_{k}\), and \(\:{W}_{v}\) are learnable parameter matrices used in the attention mechanism.
The attention weight is computed by considering the feature vectors within the context sequence:
where d represents the dimension of the feature vectors Q (query) and K (key). The attention weight, which determines the weighted representation of each genomic site, quantifies the interactions between the candidate somatic site and its flanking genomic sites within the context sequence. The signal propagated from one site to another in the context sequence is expressed as:
where V represents the value matrix. The signal vectors generated by different attention heads are then concatenated and processed through a fully connected layer, followed by a normalization layer to refine the mapping features [35]. By integrating signals across the context sequence, the final feature vector of the candidate somatic site, \(\:{F}_{final}\), captures its interactions with flanking genomic sites. This feature vector is subsequently used to predict the somatic probability of the candidate site.
2.3 Model training
To mitigate the challenges posed by class imbalance and ensure effective model training, the highly mutated COLO829 genome was selected as the training dataset. This dataset comprises about 35,000 ground truth somatic sites, facilitating the creation of a large and balanced training set to enhance the accurate differentiation between somatic and non-somatic sites. Specifically, the training dataset consists of 34,691 ground truth somatic sites, including 34,298 somatic SNVs and 393 somatic INDELs, along with 85,263 ground truth non-somatic sites, which include 64,430 non-somatic SNVs and 20,833 non-somatic INDELs. Furthermore, the dataset was divided into five test subsets (COLO829-1, COLO829-2, COLO829-3, COLO829-4, and COLO829-5), each containing between 6,811 and 7,025 ground truth somatic sites, comprising somatic SNVs and somatic INDELs, with 6,738 to 6,946 somatic SNVs in each subset.
Due to the imbalance between non-somatic and somatic sites, focal loss was utilized for training the TransSSVs model [37],
Focal loss integrates two essential components: the balance factor (α) and the focusing parameter (γ). The balance factor α modifies the significance of positive and negative somatic sites by assigning greater weights to true positive sites while reducing weights for true negative sites [38]. The focusing parameter γ regulates the rate at which easily identified sites are down-weighted. To effectively address class imbalance during model training, the balance factor α was set to 0.25, and the focusing parameter γ was set to 2.
We used TensorFlow 2 to train the TransSSVs model on a TITAN RTX GPU with 24 GB memory, employing the Adam optimizer with an initial learning rate of 0.001 [39]. The training process was conducted for 30 epochs, after which training performance reached saturation. Training and validation loss were monitored at each epoch to ensure convergence and minimize overfitting. Additionally, we implemented early stopping to halt training if validation loss did not improve over a predefined number of epochs. This strategy ensured effective model training, enabling generalization to unseen data while addressing class imbalance.
2.4 Performance evaluation metrics
For the training set of the COLO829 genome, ground truth somatic sites were divided into training and testing subsets. The training subset was utilized for model fitting and hyperparameter optimization, while the testing subset was designated for model evaluation. Independent validations were conducted using three well-characterized real tumors (MB, CLL, and AML) and simulated tumors (M1, M2, and M3) to assess the generalization abilities of somatic callers. Details on running TransSSVs and six competing methods are provided in the “Parameters of Somatic Callers” section of the Supplementary Material.
We calculated precision and recall for benchmarking somatic callers. Precision represents the fraction of true somatic sites among predicted sites, while recall denotes the fraction of true somatic sites that were correctly identified [40]. The F1 score, a harmonic mean of precision and recall, was also calculated.
where TP is true-positive, TN is true-negative, FP is false-positive and FN is false-negative.
2.5 Extraction of attention weights in TransSSVs
The multi-head attention mechanism is a crucial component of TransSSVs, facilitating the relationship between different genomic sites within the context sequence to capture interactions between a candidate somatic site and its surrounding genomic sites [41]. To analyze the attention mechanism, attention weight matrices were first extracted across multiple attention heads and layers in TransSSVs. Subsequently, the row corresponding to each candidate somatic site was extracted, illustrating the interactions between the candidate somatic site and its flanking genomic sites in the context sequence through attention weights.
3 Results
3.1 Comparison on the COLO829 dataset
The performance of TransSSVs was compared against six state-of-the-art somatic callers, including NeuSomatic, Mutect2, VarScan2, DeepSSV, VarNet [42], and Octopus [43]. To assess the impact of the number of flanking genomic sites around a candidate somatic site on TransSSVs’ performance, multiple models were trained with varying numbers of flanking sites. A 5-fold cross-validation analysis was conducted on the COLO829 dataset for comparison. The training set included 34,691 ground truth somatic sites (34,298 somatic SNVs and 393 somatic INDELs) and 85,263 ground truth non-somatic sites (64,430 non-somatic SNVs and 20,833 non-somatic INDELs). The five test subsets (COLO829-1, COLO829-2, COLO829-3, COLO829-4, and COLO829-5) contained 6,811, 6,946, 6,974, 6,935, and 7,025 ground truth somatic sites, respectively, of which 6,738, 6,870, 6,893, 6,851, and 6,946 were somatic SNVs. Different numbers of attention heads were tested in TransSSVs, and it was determined that four attention heads achieved the best performance across the five test subsets (Supplementary Figure S2).
Table 1 presents the performance of these somatic callers on the COLO829 test set, while performance on each of the five test subsets is provided in Supplementary Tables S1-S6. The results indicate that TransSSVs, with different numbers of flanking genomic sites, achieves higher F1 scores than its six competitors. Five TransSSVs models attain the highest overall F1 scores of 0.9268, 0.9265, 0.9257, 0.9257, and 0.9254, all of which surpass the F1 score of DeepSSV at 0.9238 (Table 1 and Supplementary Table S6). For all somatic detection methods, performance on somatic SNVs is superior to that on somatic INDELs. The F1 scores for somatic SNVs across all somatic callers range from 0.8741 to 0.9355, whereas the F1 scores for somatic INDELs fall between 0.1991 and 0.5612. These differences are expected, as distinguishing noise in somatic INDELs is more challenging, and the number of somatic SNVs in the training set is significantly larger than that of somatic INDELs.
3.2 Generalization abilities of somatic callers on real datasets
Independent validation experiments were conducted on three real tumor genomes (MB, CLL, and AML) to compare the generalization abilities of TransSSVs with six state-of-the-art somatic mutation detection callers. Figure 3, Supplementary Figure S3, and Supplementary Tables S7–S9 present the validation performance of these somatic callers. Compared to the 5-fold cross-validation results on the COLO829 genome, a higher number of false-positive somatic mutations were predicted in the three real tumor genomes (Supplementary Tables S7–S9). The COLO829 genome is highly mutated, whereas the MB, CLL, and AML genomes exhibit lower mutational burdens, with about one mutation per megabase, which may partially account for the lower performance observed in these real tumor genomes.
TransSSVs models with varying numbers of flanking genomic sites around a candidate somatic site either outperformed or performed comparably to all six competitors. The highest AUPRC values for the MB, CLL, and AML genomes were obtained by TransSSVs-07, with values of 0.838, 0.665, and 0.846, respectively, followed by VarNet with 0.834, VarNet with 0.587, and Octopus with 0.721 (Fig. 3 and Supplementary Figure S3). The highest overall F1 scores for the MB, CLL, and AML genomes were achieved by TransSSVs-40 with 0.7770, TransSSVs-07 with 0.6346, and TransSSVs-07 with 0.7974, respectively (Supplementary Tables S7–S9). For the tested numbers of flanking genomic sites around a candidate somatic site, TransSSVs models exhibited improved performance in terms of AUPRC value and overall F1 score as the number of flanking genomic sites increased, up to a threshold of 7, beyond which the performance remained stable or declined (Fig. 3, Supplementary Figure S3, and Supplementary Tables S7–S9).
Performance comparison of TransSSVs and six state-of-the-art somatic callers on three real datasets. (a) Overall performance and performance on somatic SNVs and somatic INDELs for the MB genome. (b) Overall performance and performance on somatic SNVs and somatic INDELs for the CLL genome. (c) Overall performance and performance on somatic SNVs and somatic INDELs for the AML genome
The lower precision values contribute to the reduced performance of the six competing methods. For instance, TransSSVs-07, which achieves an overall F1 score of 0.6346, produces recall values of 0.4795 and 0.5911, along with precision values of 0.3271 and 0.7607 for somatic INDELs and somatic SNVs on the CLL genome, respectively. In contrast, Mutect2 attains recall values of 0.6849 and 0.9200 but has significantly lower precision values of 0.1497 and 0.3026 for somatic INDELs and somatic SNVs on the CLL genome, respectively, leading to an overall F1 score of 0.4322 (Supplementary Table S9). Mutect2 exhibits particularly poor performance in detecting somatic INDELs, as indicated by AUPRC values of 0.1, 0.014, and 0.032 for the MB, CLL, and AML genomes, respectively (Fig. 3).
3.3 Generalization abilities of somatic callers on three simulated datasets
Next, we applied TransSSVs and six competing methods to three simulated tumor genomes, M1, M2, and M3, to conduct independent validation experiments for evaluating generalization capabilities. Figure 4, Supplementary Figure S4, and Supplementary Tables S10-S12 present the benchmarking performance of these somatic callers. The results indicate that performance on simulated datasets is superior to that on real datasets, which may be attributed to the lower noise levels in simulated datasets. Consistent with the findings from the five-fold cross-validation on the COLO829 genome and independent validations on real datasets, the performance of somatic callers in detecting somatic INDELs was lower than that for somatic SNVs across all three simulated datasets. The AUPRC values for somatic SNVs ranged from 0.5743 to 0.8543 for M1, from 0.5841 to 0.9476 for M2, and from 0.5191 to 0.9035 for M3. In contrast, the AUPRC values for somatic INDELs ranged from 0.3111 to 0.6467 for M1, from 0.1319 to 0.6461 for M2, and from 0.1770 to 0.6633 for M3 (Fig. 4 and Supplementary Figure S4).
TransSSVs consistently achieves the highest overall F1 scores across the tested datasets. The best-performing TransSSVs models, which incorporate different numbers of flanking genomic sites around a candidate somatic site, yield F1 scores ranging from 0.6269 to 0.6711 for M1, from 0.7789 to 0.8863 for M2, and from 0.7026 to 0.8191 for M3 (Supplementary Tables S10-S12). In comparison, competing methods demonstrate similar but lower overall F1 scores, which range from 0.2485 to 0.6253 for M1, from 0.5514 to 0.7483 for M2, and from 0.5311 to 0.6676 for M3 (Supplementary Tables S12-S14). TransSSVs models also outperform competing methods in terms of AUPRC values across all three simulated tumor genomes, with DeepSSV and NeuSomatic ranking as the next best-performing tools (Fig. 4 and Supplementary Figure S4). As observed in real dataset evaluations, TransSSVs models with fewer than 10 flanking genomic sites demonstrate superior performance. This trend can be attributed to the fact that models with larger input matrices require more extensive training datasets to reach performance saturation during validation. When the number of flanking genomic sites is increased beyond a certain threshold, the model’s ability to generalize may be affected if the available training data is insufficient to optimize its learning parameters effectively.
Performance comparison of TransSSVs and six state-of-the-art somatic callers on three simulated datasets. (a) Overall performance, as well as performance on somatic SNVs and somatic INDELs for the M1 genome. (b) Overall performance, as well as performance on somatic SNVs and somatic INDELs for the M2 genome. (c) Overall performance, as well as performance on somatic SNVs and somatic INDELs for the M3 genome
3.4 TransSSVs reveals underlying patterns of inter-site interactions in the context sequence
The multi-head attention mechanism in TransSSVs plays a crucial role in analyzing genomic interactions. By evaluating attention weights across four attention heads, insights are gained into how the model interprets the context sequence. The extracted attention weight matrices (Fig. 5 and Supplementary Tables S13-S15) indicate that each attention head captures inter-site interactions from different perspectives. A global view, which incorporates both the candidate somatic site and its flanking genomic sites, is essential for understanding broader genomic patterns. This perspective is particularly significant, as somatic mutation evidence extends beyond the specific site to its surrounding regions. For positive samples with seven flanking genomic sites, the uniform attention weights observed in the first attention head suggest a global focus, indicating its role in capturing comprehensive genomic interactions. In contrast, the last three attention heads assign higher weights to the candidate somatic site, emphasizing its local characteristics.
Analyzing the attention weights for samples with different numbers of flanking genomic sites provides deeper insights into the performance of TransSSVs. The distinct attention patterns observed in samples with five versus ten flanking sites highlight the model’s adaptability to varying genomic contexts. For positive samples with five flanking sites, the first two attention heads primarily focus on local features, while the last two heads emphasize global interactions. In contrast, for samples with seven flanking sites, attention is distributed more evenly across the genomic context, suggesting that TransSSVs dynamically adjusts its attention strategy based on the available genomic information. The lower proportion of attention weights assigned to flanking sites in the global view for samples with fewer flanking sites may contribute to performance variations. This analysis provides valuable insights into how TransSSVs processes genomic data and explains its superior performance when seven flanking genomic sites are considered. The findings indicate that this configuration achieves an optimal balance between local site-specific features and broader genomic context, enhancing somatic mutation detection.
In conclusion, TransSSVs’ capability to capture interactions between candidate somatic sites and their flanking genomic regions represents a notable advancement in somatic mutation detection. The attention weight analysis demonstrates that the model effectively utilizes the multi-head attention mechanism to assess the influence of genomic context, enabling a refined approach to distinguishing true somatic mutations from background noise. This detailed examination of attention weights enhances the understanding of TransSSVs’ decision-making process and underscores its potential to improve the accuracy of somatic variant detection in cancer genomics.
Attention weights of genomic sites, including the candidate somatic site and its flanking genomic sites, in the context sequence in TransSSVs with four attention heads. (a) Attention weights for correctly predicted positive and negative samples with seven flanking genomic sites. (b) Attention weights for correctly predicted positive samples with five flanking genomic sites. (c) Attention weights for correctly predicted positive samples with ten flanking genomic sites
4 Discussion
Accurate identification of somatic mutations in tumor samples plays a crucial role in cancer diagnosis and precision medicine. In this study, we employ the advantages of the Transformer model to develop a tool, TransSSVs, for detecting somatic small variants. The core component of TransSSVs is an attention-based deep neural network that employs the multi-head attention mechanism to capture interactions among genomic sites within the context sequence centered on a candidate somatic site. This mechanism enables the generation of a reliable representation of mapping features across different genomic sites in the context sequence, facilitating accurate somatic mutation detection.
Ground truth data from the COLO829 genome were utilized to train TransSSVs, and its performance was evaluated against six state-of-the-art somatic detection methods through 5-fold cross-validation on the COLO829 genome. Additionally, independent validation experiments were conducted on three real tumor datasets and three simulated tumor datasets with low mutational burdens to assess the generalization capabilities of these somatic callers. Benchmarking results indicate that TransSSVs demonstrates strong and consistent performance in terms of overall F1 score and AUPRC value for detecting somatic small variants.
We also analyzed the contributions of different numbers of flanking genomic sites around a candidate somatic site in the context sequence. By employing the multi-head attention mechanism, TransSSVs effectively attends to different genomic sites in the context sequence, capturing a reliable feature representation for precise somatic mutation identification. Analysis of attention weights revealed that higher proportions of attention weights assigned to flanking genomic sites in a global view contributed to improved performance of TransSSVs. Distinct attention weight patterns were observed between positive and negative samples.
TransSSVs and DeepSSV exhibit distinct levels of performance and behavioral differences in somatic mutation detection. TransSSVs, utilizing the Transformer model, achieves superior performance, particularly in capturing complex genomic interactions through its multi-head attention mechanism. This mechanism enables TransSSVs to dynamically adjust focus based on the genomic context, resulting in a more refined and accurate prediction of somatic mutations. In contrast, DeepSSV, which employs a CNN-based approach, may be less effective in capturing intricate long-range dependencies emphasized by the Transformer model. Consequently, DeepSSV tends to display a more uniform attention distribution, potentially leading to a less precise understanding of genomic interactions. These differences highlight TransSSVs’ advantage in accurately detecting somatic mutations, particularly in regions with complex genomic contexts, while also indicating potential areas for improving DeepSSV’s architecture to enhance detection capabilities.
Although TransSSVs demonstrates strong performance in somatic mutation detection, certain limitations should be considered. The computational demands of the Transformer-based architecture are substantial (Supplementary Table S16). Training on the COLO829 dataset using a TITAN RTX GPU (24 GB memory) required significant time and resources, with each epoch taking several hours to complete. Additionally, inference time and disk usage increase as the number of flanking genomic sites included in the context sequence grows, which may pose challenges for scalability in large datasets or real-time applications. The model also relies on high-quality, well-characterized training data to achieve optimal performance. Differences in sequencing depth, mutation burden, or dataset characteristics could influence its generalizability. Furthermore, while TransSSVs captures interactions within the context sequence, it may not fully account for higher-order genomic interactions or epigenetic factors. Future research will focus on improving computational efficiency, increasing training data diversity, and exploring advanced architectures to address these challenges.
References
Aparicio S, Caldas C (2013) The implications of clonal genome evolution for cancer medicine. N Engl J Med 368:842–851
Greaves M (2012) Maley. Clonal evolution in cancer. Nature 481:306–313
Xu C (2018) A review of somatic single nucleotide variant calling algorithms for next-generation sequencing data. Comput Struct Biotechnol J 16:15–24
J. K. Teer (2014) An improved Understanding of cancer genomics through massively parallel sequencing. Transl Cancer Res. 3:243–259
Zhang S, Fan R, Liu Y, Chen S, Liu Q (2023) Zeng. Applications of transformer-based Language models in bioinformatics: a survey. Bioinform Adv 3:vbad001
Mwenifumbo JC (2013) Marra. Cancer genome-sequencing study design. Nat Rev Genet 14:321–332
Alioto TS, Buchhalter I, Derdak S, Hutter B, Eldridge MD, Hovig E, Heisler LE, Beck TA, Simpson JT, Tonon L, Sertier AS, Patch AM, Jager N, Ginsbach P, Drews R, Paramasivam N, Kabbe R, Chotewutmontri S, Diessl N, Previti C, Schmidt S, Brors B, Feuerbach L, Heinold M, Grobner S, Korshunov A, Tarpey PS, Butler AP, Hinton J, Jones D, Menzies A, Raine K, Shepherd R, Stebbings L, Teague JW, Ribeca P, Giner FC, Beltran S, Raineri E, Dabad M, Heath SC, Gut M, Denroche RE, Harding NJ, Yamaguchi TN, Fujimoto A, Nakagawa H, Quesada V, Valdes-Mas R, Nakken S, Vodak D, Bower L, Lynch AG, Anderson CL, Waddell N, Pearson JV, Grimmond SM, Peto M, Spellman P, He M, Kandoth C, Lee S, Zhang J, Letourneau L, Ma S, Seth S, Torrents D, Xi L, Wheeler DA, Lopez-Otin C, Campo E, Campbell PJ, Boutros PC, Puente XS, Gerhard DS, Pfister SM, J. D., McPherson TJ, Hudson M, Schlesner P, Lichter R, Eils (2015) D. T. Jones, I. G. Gut. A comprehensive assessment of somatic mutation detection in cancer using whole-genome sequencing, Nat Commun. 6:10001
Cibulskis K, Lawrence MS, Carter SL, Sivachenko A, Jaffe D, Sougnez C, Gabriel S, Meyerson M, Lander ES (2013) Getz. Sensitive detection of somatic point mutations in impure and heterogeneous cancer samples. Nat Biotechnol 31:213–219
Shiraishi Y, Sato Y, Chiba K, Okuno Y, Nagata Y, Yoshida K, Shiba N, Hayashi Y, Kume H, Homma Y, Sanada M, Ogawa S (2013) Miyano. An empirical bayesian framework for somatic mutation detection from cancer genome sequencing data. Nucleic Acids Res 41:e89
Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, Miller CA, Mardis ER, Ding L (2012) Wilson. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res 22:568–576
Fang LT, Afshar PT, Chhibber A, Mohiyuddin M, Fan Y, Mu JC, Gibeling G, Barr S, Asadi NB, Gerstein MB, Koboldt DC, Wang W, Wong WH (2015) Lam. An ensemble approach to accurately detect somatic mutations using somaticseq. Genome Biol 16:197
Ding J, Bashashati A, Roth A, Oloumi A, Tse K, Zeng T, Haffari G, Hirst M, Marra MA, Condon A, Aparicio S, Shah SP (2012) Feature-based classifiers for somatic mutation detection in tumour-normal paired sequencing data. Bioinformatics 28:167–175
D. J. Wilkinson (2007) Bayesian methods in bioinformatics and computational systems biology. Brief Bioinform. 8:109–116
Frazer Meacham D, Boffelli J, Dhahbi DIK, Martin M, Singer (2011) Lior Pachter. Identification and correction of systematic error in high-throughput sequence data. BMC Bioinformatics 12:1–11
Nathan D, Olson J, Wagner N, Dwarshuis KH, Miga FJ, Sedlazeck M, Salit, Justin M, Zook (2023) Variant calling and benchmarking in an era of complete human genome sequences. Nat Rev Genet 24:464–483
Bengio Y, Courville A, Vincent P (2013) Representation learning: a review and new perspectives. IEEE Trans Pattern Anal Mach Intell 35:1798–1828
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521:436–444
Schmidhuber J (2015) Deep learning in neural networks: an overview. Neural Netw 61:85–117
Jing Meng J, Liu W, Song H, Li J, Wang L, Zhang Y, Peng A, Wu T, Jiang (2024) PREDAC-CNN: predicting antigenic clusters of seasonal influenza A viruses with convolutional neural network. Brief Bioinform 25:bbae033
Mansoor Hayat N, Ahmad A, Nasir (2024) Zeeshan Ahmad Tariq. Hybrid deep learning EfficientNetV2 and vision transformer (EffNetV2-ViT) model for breast Cancer histopathological image classification. IEEE Access 12:184119–184131
Nouman Ahmad R, Strand Björn, Sparresäter S, Tarai E, Lundström (2023) Göran Bergström, Håkan Ahlström, Joel Kullberg. Automatic segmentation of large-scale CT image datasets for detailed body composition analysis. BMC Bioinformatics, 24
Nouman Ahmad S, Asghar, Saira Andleeb Gillani (2021) Transfer learning-assisted multi-resolution breast cancer histopathological images classification. Visual Comput 38:2751–2770
Sahraeian SME, Liu R, Lau B, Podesta K, Mohiyuddin M (2019) Y. K. Lam. Deep convolutional neural networks for accurate somatic mutation detection. Nat Commun 10:1041
Meng J, Victor B, He Z, Liu H, Jiang T (2021) DeepSSV: detecting somatic small variants in paired tumor and normal sequencing data with convolutional neural network. Brief Bioinform, 22
Brandes N, Goldman G, Wang CH, Ye CJ, Ntranos V (2023) Genome-wide prediction of disease variant effects with a deep protein Language model. Nat Genet 55:1512–1522
Chen J, Xu H, Tao W, Chen Z, Zhao Y, Jing-Dong J (2023) Han. Transformer for one stop interpretable cell type annotation. Nat Commun, 14
Chu Y, Zhang Y, Wang Q, Zhang L, Wang X, Wang Y, Salahub DR, Xu Q, Wang J, Jiang X, Xiong Y (2022) Dong-Qing Wei. A transformer-based model to predict peptide–HLA class I binding and optimize mutated peptides for vaccine design. Nat Mach Intell 4:300–311
Zhang S, Liu RFY, Chen S, Liu Q, Zeng W (2023) Alex Bateman. Applications of transformer-based Language models in bioinformatics: a survey. Bioinf Adv, 3
Žiga Avsec V, Agarwal D, Visentin JR, Ledsam A, Grabska-Barwinska KR, Taylor Y, Assael J, Jumper P, Kohli, David R (2021) Kelley. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 18:1196–1203
Artur Szałata K, Hrovatin H, Cui B, Wang, Fabian J (2024) Theis. Transformers in single-cell omics: a review and new perspectives, Nature Methods. 21:1430–1443
Craig DW, Nasser S, Corbett R, Chan SK, Murray L, Legendre C, Tembe W, Adkins J, Kim N, Wong S, Baker A, Enriquez D, Pond S, Pleasance E, Mungall AJ, Moore RA, McDaniel T, Ma Y, Jones SJ, Marra MA, Carpten JD (2016) Liang. A somatic reference standard for cancer genome sequencing. Sci Rep 6:24607
Griffith M, Miller CA, Griffith OL, Krysiak K, Skidmore ZL, Ramu A, Walker JR, Dang HX, Trani L, Larson DE, Demeter RT, Wendl MC, McMichael JF, Austin RE, Magrini V, McGrath SD, Ly A, Kulkarni S, Cordes MG, Fronick CC, Fulton RS, Maher CA, Ding L, Klco JM, Mardis ER (2015) Ley, R. K. Wilson. Optimizing cancer genome sequencing and analysis. Cell Syst 1:210–223
Li H (2013) Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM., arXiv:1303.3997v2 [q-bio.GN]
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M (2010) DePristo. The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303
Meng J, Chen YP (2018) A database of simulated tumor genomes towards accurate detection of somatic small variants in cancer. PLoS ONE 13:e0202982
Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W (2014) Salit. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol 32:246–251
Lin T-Y, Goyal P, Girshick R, He K (2017) Piotr Dollar. Focal Loss for Dense Object Detection. IEEE International Conference on Computer Vision (ICCV). 2999–3007
Liangpeng Nie L, Quan T, Wu R, He Q, Lyu (2022) Pier Luigi Martelli. TransPPMP: predicting pathogenicity of frameshift and non-sense mutations by a transformer based on protein features. Bioinformatics 38:2705–2711
Jimmy Lei Ba Diederik P Kingma. Adam: A method for stochastic optimization, arXiv:1412.6980 2015
Vijayan V, Yiu SM, Zhang L (2017) Improving somatic variant identification through integration of genome and exome data. BMC Genomics 18:748
Vaswani NSA, Parmar N, Uszkoreit J, Jones L, Gomez AN (2017) Łukasz Kaiser, Illia Polosukhin. Attention is all you need, 31st Conference on Neural Information Processing Systems (NIPS 2017)
Kiran Krishnamachari D, Lu A, Swift-Scott A, Yeraliyev K, Lee W, Huang (2022) Sim Ngak Leng, Anders Jacobsen Skanderup. Accurate somatic variant detection using weakly supervised deep learning. Nat Commun, 13
Daniel P, Cooke DC, Wedge (2021) Gerton lunter. A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol 39:885–892
Acknowledgements
JM and JW conceived the idea, designed and implemented the algorithm, and analyzed the results and drafted the manuscript. JL, WS and ML helped in algorithm implementation. TJ and AW supervised the study and revised the manuscript. All authors read and approved the final manuscript.
Funding
This work was supported by the National Natural Science Foundation of China (32070678, 31671371, 31900472, 92169106 and 9216910042); Natural Science Foundation of Jiangsu Province (BK20220278); the Emergency Key Program of Guangzhou Laboratory, Grant No. EKPG21-12; the National Key Research and Development Program of China (2021YFC2302000 and 2021YFC2301305); Capital’s Funds for Health Improvement and Research (2022-1G-1131); Suzhou Science and Technology Plan Project (szs2020311); the Non-profit Central Research Institute Fund of Chinese Academy of Medical Sciences (2021-PT180-001) and CAMS Innovation Fund for Medical Sciences (CIFMS) (2021-I2M-1-061 and 2022-I2M-2-004) and the NCTIB Fund for R&D Platform for Cell and Gene Therapy.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Competing interests
The authors have declared that no competing interests exist.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Meng, J., Wang, J., Liu, J. et al. TransSSVs: a Transformer-based deep learning model for accurate detection of somatic small variants in paired tumor and normal sequencing data. Appl Intell 55, 874 (2025). https://doi.org/10.1007/s10489-025-06607-x
Accepted:
Published:
DOI: https://doi.org/10.1007/s10489-025-06607-x