HKDE-LACM: a hybrid model for lactic acid bacteria classification via k-mer and DNABERT-2 embedding fusion with cyclic DE-BO optimization

Zou, Jie; Liu, Weichi; Dai, Jinhui; Dong, Gaifang

doi:10.1186/s12864-025-12009-7

Research
Open access
Published: 25 September 2025

BMC Genomics volume 26, Article number: 815 (2025)

Abstract

Background

Lactic acid bacteria (LAB) play vital roles in food production and clinical applications. Accurate classification of LAB strains facilitates their functional development and targeted utilization. Although machine learning and deep learning methods have been widely applied to genome sequence classification, challenges remain in capturing comprehensive feature representations and enhancing model generalizability.

Results

We present HKDE-LACM, a hybrid classification model that integrates high-dimensional k-mer frequency features with contextual embeddings derived from DNABERT-2. To optimize model hyperparameters, we introduce a Cyclic Differential Evolution and Bayesian Optimization with Failure Avoidance (C-DBFA) framework. We conducted 10-fold cross-validation on three LAB datasets and evaluated performance. Experimental results demonstrate that HKDE-LACM outperforms existing methods in terms of both classification accuracy and robustness.

Conclusions

HKDE-LACM overcomes the limitations of traditional k-mer features by incorporating semantic embeddings, thereby enriching the representation of genomic sequences. In addition, the model can automatically identify optimal combinations of feature extractors and classifiers through the C-DBFA optimization framework. These advantages effectively enhance the model’s generalization ability, making it a promising tool for genome-based LAB classification and related tasks.

Background

Lactic acid bacteria (LAB) are capable of metabolizing carbohydrates to produce lactic acid [1]. They are typically spherical or rod-shaped, predominantly Gram-positive [2]. LAB can produce organic acids and other metabolites through fermentation, which help prevent food spoilage and improve sensory qualities [3]. In agriculture, their metabolites can promote biodegradation and increase the organic matter content in soil [4]. Probiotics are live microorganisms that confer health benefits when consumed in adequate amounts [5]. Some LAB are used as probiotics, which can enhance immunity, balance the gut microbiota, and prevent constipation and diarrhea [6,7,8]. LAB are classified based on cell morphology, glucose fermentation patterns, growth temperature ranges, and sugar utilization modes [9]. Different categories of LAB exhibit variations in their metabolic products, probiotic capabilities, antimicrobial activities, and abilities to inhibit the growth of pathogenic microorganisms [10, 11]. LAB classification facilitates a deeper understanding of their properties and supports rational application under diverse conditions. However, conventional LAB classification methods—based on physiological and gene-level characteristics—are time-consuming, expensive, and prone to bias arising from operator variability [12].

In recent years, with the continuous advancements in bioinformatics and computational biology, the cost of DNA sequencing has decreased significantly, while the volume of sequences requiring analysis has increased substantially. This has drawn the attention of researchers to the challenge of efficiently analyzing and processing DNA sequence data. Machine learning techniques are capable of efficiently processing large-scale, heterogeneous, and complex datasets. By identifying patterns and correlations among features, they enable the construction of predictive models, making them a powerful tool for addressing this challenge [13]. However, traditional machine learning algorithms typically focus on extracting features from local gene sequences, often failing to fully utilize global contextual information. In contrast, deep learning algorithms effectively address this limitation [14, 15]. Deep learning algorithms can construct complex neural network models to identify local patterns and their interactions within gene sequences, thereby learning contextual information from the sequences [16, 17]. Currently, these algorithms have been applied to various aspects of genomics, including but not limited to predicting the sequence specificity of DNA- and RNA-binding proteins [18], analyzing the impact of mutations on protein-RNA interactions [19], predicting RNA secondary structures [20], and estimating DNA methylation states in single cells [21]. In recent years, the application of fine-tuned deep learning-based large language models for processing and analyzing genomic sequences has gradually expanded [22, 23].

Meanwhile, processing long genomic sequences remains challenging. K-mer-based methods convert sequences into numerical representations, capturing key microbial genomic features while significantly enhancing processing efficiency and information extraction, advancing genomic data analysis [24]. Marcel et al. [25] proposed the use of k-mers to capture and analyze 16S rRNA gene sequence fragments, demonstrating the reliability of this approach. Aaron et al. [26] introduced a novel method that analyzes DNA sequences by integrating the content, position, and related information of k-mers. Currently, methods combining artificial intelligence (AI) techniques with k-mer sequences have been developed to achieve more accurate classification and prediction. For example, Davis et al. [27] employed the AdaBoost algorithm to construct a classifier for identifying specific bacterial resistance to particular antibiotics, based on metagenomic data from antibiotic-resistant bacteria. Nguyen et al. [28] used 10-mer nucleotide sequences of Klebsiella pneumoniae as input and applied the XGBoost model to successfully predict the minimum inhibitory concentrations (MICs) for 20 antibiotics with high precision. Shuyi Wang et al. [29] extracted k-mers from whole-genome sequencing data of Staphylococcus aureus and employed three machine learning methods (Random Forest, Support Vector Machine, and XGBoost) to predict the MICs of ten antimicrobial agents against this bacterium. These achievements not only demonstrated the effectiveness of k-mer sequences as features in predicting microbial phenotypes but also highlighted their potential in genomic function prediction and sequence classification. Additionally, Jie Ren et al. [30] developed a k-mer-based method capable of successfully distinguishing viral and host genomic sequences in metagenomic data. María et al. [31] drew inspiration from the “bag-of-words” and word2vec vector space models to construct a computational framework that achieved the classification of regulatory regions based on k-mers. Simon et al. [32] utilized k-mer frequency as features, combined with methods such as multilayer perceptron and random forest, and achieved significant results in the lineage/family classification of plant LTR retrotransposons. Harwah et al. [33] employed k-mers, frequency chaos game representation, one-hot encoding, and integer encoding to represent 16S rRNA gene sequences, respectively, and used a CNN architecture to classify and predict bacteria from three major phyla. The results demonstrated that the representation method using k-mers achieved better performance. González et al. [34] combined k-mers with vector embedding methods and successfully distinguished bacteriocin sequences produced by LAB using deep neural networks.

Extending these techniques to the analysis of probiotic-rich LAB enables more efficient capture of key features in the genomes of LAB, significantly improving classification accuracy and computational efficiency. Karlsen et al. [35] found that using 9-mer features with a Random Forest model achieved the highest accuracy in predicting the acidification capacity of Lactococcus lactis, outperforming other genomic representations. Sun et al. [36] utilized a combined feature matrix of k-mers (k=2 to 8) to represent probiotic genomic sequences and employed a Support Vector Machine model to distinguish between probiotic and non-probiotic genomes. Their results indicated that higher-dimensional k-mers (particularly k=7 and k=8) achieved superior performance in clustering, highlighting the importance of combinatorial k-mer patterns in probiotic functionality.

In the field of genomic sequence classification of LAB, existing methods have achieved some progress but still face notable limitations. Although k-mer methods efficiently capture local sequence patterns, they typically neglect sequence context and positional information, resulting in suboptimal feature representation. Additionally, many models exhibit limited generalization capability, requiring retraining when applied to new datasets. Higher-dimensional k-mer features also remain underexplored, primarily due to computational and data volume constraints. To address these limitations, we propose HKDE-LACM, a novel LAB classification framework that integrates high-dimensional k-mer statistics with contextual genomic embeddings. By fusing k-mer frequency features (capturing local patterns) and DNABERT-2 fine-tuned embeddings (modeling sequence context), our approach significantly enhances feature expressiveness for microbial genomes. Furthermore, we propose an automated feature fusion and modeling pipeline. This end-to-end optimized framework systematically explores combinations of feature selection strategies and classifiers, enabling robust cross-dataset generalization without manual intervention. We evaluate our model on three LAB datasets using 10-fold cross-validation. The results demonstrate that HKDE-LACM consistently outperforms existing methods across all datasets, highlighting its robustness and effectiveness.

Results

Overall performance of HKDE-LACM

This study aims to classify LAB based on their genomic sequences. To achieve this, we developed a model named HKDE-LACM and conducted 10-fold cross-validation on three separate datasets. The feature matrix used in the experiments was composed of k-mer frequency features and embedding vectors derived from the genomic sequences, with the embeddings further divided into forward and reverse components. We designed three comparative experiments: using k-mer features alone, using embedding vectors alone, and using a fusion of both.

Tables 1, 2, and 3 present the test set results on Dataset 1, Dataset 2, and Dataset 3, respectively, under different feature construction strategies. The results reported in all three tables are based on a single run, with the random seed fixed at 42 to ensure reproducibility. Since the fused feature models already achieved near-saturated classification accuracy, and the k-mer-based models served only as baselines for evaluating the effectiveness of feature fusion, we did not repeat the experiments to report variance. It can be observed that the integration of k-mer features and bidirectional embeddings significantly enhances model performance. Compared to the study by Sun et al., all evaluation metrics show improvements, validating the effectiveness of the fused feature matrix. Considering dataset characteristics, the proposed HKDE-LACM model achieves higher accuracy in distinguishing probiotics from non-probiotics, remains robust on imbalanced data, and effectively differentiates between probiotic strains.

Table 1 Dataset 1 test results

Table 2 Dataset 2 test results

Moreover, some model combinations performed well on the training set but showed a noticeable drop on the test set, suggesting overfitting. To avoid bias from a single best-training result, we retained and compared the top five combinations with the highest training accuracy, as shown in Fig. 1C. For instance, the combination of SVM + VarianceThreshold achieved the highest training accuracy across the 8-mer, 9-mer, and 10-mer feature matrices, yet its test accuracy dropped significantly, displaying a typical overfitting pattern. This phenomenon may result from several factors: the standardization process reduced variability among features; the Variance Threshold method retained nearly all features without considering their relevance to the label; and SVMs’ sensitivity to redundant or noisy inputs likely amplified overfitting. In contrast, some models achieved much better performance on the test set than on the training set, reflecting stronger generalization. Therefore, retaining multiple top-performing combinations during training allows us to identify models with better generalization in the final testing phase, rather than relying solely on the one with the highest training accuracy.

Table 3 Dataset 3 test results

The data in Table 4 show that the optimal feature processing methods and classifiers selected after optimization differ across datasets and k-mer values. These results highlight the importance of the hybrid optimization strategy in model construction. This adaptability enhances generalization, enabling robust and accurate predictions across diverse datasets.

Table 4 Main control key combinations under different datasets and feature representations

Analysis of feature representation and fusion strategies

To intuitively illustrate the effectiveness of different feature construction strategies in distinguishing positive and negative samples, t-SNE was used to visualize each feature matrix in two dimensions, as shown in Fig. 1A. To better present the distribution trends across different classes, we selected Dataset 1, which contains the largest number of samples, for visualization. It can be observed that when using only 8-mer frequency features, the samples are relatively mixed in the feature space. After incorporating forward embeddings, positive and negative samples begin to cluster more clearly, significantly improving separability. With the further integration of bidirectional embeddings, the boundary between classes becomes more distinct. Upon combining k-mer frequencies with embedding vectors, the overall sample distribution remains stable. Notably, the combination of 10-mers with bidirectional embeddings exhibits a near-linearly separable structure, indicating enhanced classification capability. Although t-SNE does not explicitly reflect the decision boundaries of classification models, it reveals spatial patterns that provide useful insights into feature effectiveness. Biologically, k-mers capture local sequence motifs, while embeddings encode contextual and global information, making them complementary. In our classification models, the integration of both features led to improved accuracy, further confirming the effectiveness of the fused feature representation in practical applications.

It is worth noting that for Dataset 2, before integrating embedding vectors, the feature matrices constructed using 8-mer, 9-mer, and 10-mer frequencies already achieved approximately 95% accuracy on the test set, indicating strong discriminative power. After incorporating DNABERT-2 embeddings, test accuracy further improved, approaching perfect scores across all main control key configurations. The entire experimental process strictly followed the separation between training and test sets, ensuring that no data leakage occurred. This notable improvement may be partly explained by the relatively small sample size and clear class boundaries of Dataset 2. These findings highlight that under favorable data conditions, the proposed feature fusion strategy can further enhance classification performance by capturing global sequence features.

Since Dataset 3 was constructed by randomly sampling positive samples from Dataset 1, we reused the corresponding embedding vectors from Dataset 1 for feature construction. The training and test sets of Dataset 3 were derived from those of Dataset 1, ensuring that no data leakage occurred. To investigate the impact of k-mer contribution on model performance, we controlled the proportion of retained k-mer features in the feature fusion pipeline by adjusting the number of selected k-mers in the first step. We conducted a visualization analysis on the fused feature matrices of 8-mer and 9-mer under varying numbers of selected k-mers, as shown in Fig. 2C and D. To maintain a balanced feature representation, we did not experiment with feature matrices in which k-mer features constitute a larger proportion (i.e., >5000 k-mers) relative to the embedding features. As shown, performance continues to improve within the current range, without a clear saturation point. These results indicate that increasing the relative weight of k-mer features within reasonable limits benefits classification performance. Although embedding vectors alone did not strongly differentiate probiotic species, they preserved valuable contextual information. Compared to using k-mer features alone (accuracy: 95.35%), the fusion with embeddings improved the accuracy to 99.41%. In practical applications, the relative weight of the two feature types can be flexibly adjusted according to the task requirements, enabling better adaptation to different types of datasets.

Performance of the C-DBFA optimization strategy

As shown in Table 1, even when using only 8-mer frequency features, the overall model performance is comparable to or even surpasses that of Sun et al. Specifically, improvements were observed in Recall (0.9623), Precision (0.9808), F1-score (0.9714), AUC (0.9907), and MCC (0.9522). In Table 2, models using 8-mer, 9-mer, and 10-mer features all outperform the baseline methods. The corresponding ROC curves are shown in Fig. 2B, indicating that the proposed C-DBFA optimization mechanism is effective in identifying optimal combinations and demonstrates strong generalization ability across datasets with varying compositions.
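For reference, the threshold-based metrics quoted above can be computed from a binary confusion matrix as in the following minimal Python sketch. The function name and the counts are illustrative assumptions: the confusion matrix is not given in the text, and the counts below were chosen only because they reproduce the reported Recall, Precision, F1-score, and MCC to four decimals. AUC is omitted because it requires the ranked prediction scores rather than counts.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary classification metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # Matthews correlation coefficient: balanced even when classes are unequal.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom
    return {"recall": recall, "precision": precision, "f1": f1,
            "accuracy": accuracy, "mcc": mcc}


# Hypothetical counts (not taken from the paper) for a 130-sample test set.
metrics = binary_metrics(tp=51, fp=1, tn=76, fn=2)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

MCC uses all four cells of the confusion matrix, which is why it is informative on the imbalanced Dataset 1.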

We conducted an ablation study to evaluate the feature optimization strategy on Dataset 2, using only 8-mer frequency features. The results are presented in Table 5. Compared to the first three approaches, the combination of Bayesian Optimization (BO), Differential Evolution (DE), and the failure-region avoidance mechanism achieved the highest accuracy. Although grid search can systematically explore the entire hyperparameter space, its search space is typically discrete, which makes it prone to missing potential optima in the continuous space. Furthermore, grid search often leads to overfitting on the training set, limiting its generalization to the test set. BO, on the other hand, enables efficient exploration in continuous spaces but may suffer from redundant sampling when handling discrete parameters and is prone to getting stuck in local optima. By introducing DE, globally diverse candidate parameters are periodically injected into BO, enhancing its ability to escape local optima. Figure 1B illustrates the alternating process between BO and DE: DE tends to explore more broadly across the feature space, while BO focuses more narrowly. The combination of both improves the search for better solutions. With the further integration of the failure-region avoidance mechanism, the model actively avoids regions of the parameter space previously identified as suboptimal, thereby reducing computational waste and improving the likelihood of discovering optimal configurations. As shown in Fig. 2A, after applying the avoidance mechanism, the model continues to concentrate its search within high-performing main-control-key regions, while expanding coverage within these regions, leading to more efficient parameter space exploration.

Table 5 Comparison of optimization strategies for model performance

Discussion

In this study, we proposed a model named HKDE-LACM, designed to distinguish between different types of LAB. The model leverages both local and global features derived from genomic sequences and does not rely on phenotypic data or annotation information, enabling effective classification of various LAB solely based on genomic data. We validated the model using three datasets, and the experimental results demonstrate that HKDE-LACM can accurately differentiate probiotics from non-probiotics and also achieve high accuracy in distinguishing among different probiotic strains. Therefore, the proposed model not only provides a powerful tool for the accurate and rapid identification of probiotics but also shows great potential for extension to a broader range of LAB classification tasks.

The superior performance of HKDE-LACM can be attributed to the following key factors: First, unlike previous studies, the model utilizes the Jellyfish tool to rapidly compute high-dimensional k-mer frequencies from genomic sequences, enabling the exploration of a broader range of informative patterns. Second, in addition to k-mer features, the model incorporates embedding vectors during matrix construction, providing a more comprehensive representation of genomic information. Finally, we designed a feature fusion pipeline based on a cyclic differential evolution–Bayesian optimization strategy with failure-region avoidance, which adopts a two-stage feature processing mechanism. This approach not only improves modeling efficiency but also enhances the model’s adaptability across different LAB datasets.

It is worth noting that this study retained only the top 20% most frequent k-mer features. Although these features account for only approximately 16% of the total chi-squared score across all k-mers, the combined representation with embedding features still achieved excellent classification performance. This indicates that the frequency-based filtering strategy effectively reduced feature dimensionality while preserving informative signals, balancing computational efficiency with predictive power.

Despite the promising results achieved in this study, several limitations remain. First, the datasets primarily consist of probiotic and non-probiotic samples, which may not comprehensively represent the full diversity of LAB. Future work could expand the sample size by including more diverse LAB strains to enhance model generalization. In addition, the interpretability of the model is relatively limited, making it difficult to provide detailed biological explanations for its predictions. Experimental validation could help elucidate the biological basis of model predictions. Integrating genomic and microbiological knowledge, along with bioinformatics tools, may further improve feature representation and model performance. Finally, although the model performed well in distinguishing among probiotic strains, the dataset used for that task (Dataset 3) included only probiotic samples, which may limit generalization to non-probiotic strains or broader LAB categories. Expanding the training data to include non-probiotic strains and applying transfer learning are potential strategies to improve classification across diverse LAB types.

Although HKDE-LACM was originally designed for LAB classification, its core methodology is not restricted to a specific taxonomic group. LAB constitute an important subset of microorganisms, and the principle that nucleotide sequences determine functional traits (e.g., metabolic properties) is also applicable to other microbes [37]. Previous studies have shown that k-mer features and deep learning models can effectively support microbial classification tasks [38]. Therefore, we believe that our framework holds promise for broader applications in microbial taxonomy.

Materials and methods

Datasets

In this study, we obtained the dataset collected by Sun et al. [36] from the iProbiotics website (http://bioinfor.imu.edu.cn/iprobiotics/public/download.html). A total of three datasets were used for training and testing: Dataset 1, which includes 239 probiotic samples and 411 non-probiotic samples (probiotic samples labeled as positive); Dataset 2, which includes 57 probiotic Lactobacillus samples and 57 non-probiotic Lactobacillus samples (probiotic Lactobacillus samples labeled as positive); and Dataset 3, which includes 70 probiotic Lactobacillus samples, 30 probiotic Bifidobacterium samples, and 112 other probiotic samples (probiotic Lactobacillus and Bifidobacterium samples labeled as positive). The samples in Dataset 3 were randomly selected, proportionally, from the positive samples in Dataset 1 (the sample IDs of Dataset 3 are provided in Supplementary Material 2). We divided each of the three datasets into a training set and an independent test set at an 8:2 ratio.

Feature extraction and representation description

In this study, we developed a model named HKDE-LACM for classifying LAB based on genomic sequences, as illustrated in Fig. 3. Using Dataset 2 as an example, we first employed the Jellyfish tool [39] to count the occurrences of all k-mers in each genome and constructed an initial feature matrix with a dimension of ($4^{k} \times 114$). When the k value is large, the number of features increases exponentially (e.g., when k = 10, the number of features reaches 1,048,576). To address this, we retained only the top 20% most frequent k-mers to reduce noise, avoid overfitting, lower memory consumption, and improve processing speed. To obtain more consistent features and enhance the model’s generalization capability, we converted raw k-mer counts into k-mer frequency values, resulting in the final k-mer feature matrix. This preprocessing strategy not only ensured feature quality but also provided a robust basis for downstream classification.
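The k-mer preprocessing described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study: plain Python counting stands in for Jellyfish, the function names are hypothetical, and two details are assumptions because the text does not specify them, namely that the "most frequent" k-mers are ranked by their total count across all genomes and that frequencies are obtained by dividing each count by the genome's total k-mer count.

```python
from collections import Counter
from itertools import product


def count_kmers(sequence, k):
    """Count overlapping k-mers made only of A, C, G, T (stand-in for Jellyfish)."""
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_frequency_matrix(genomes, k, keep_fraction=0.2):
    """Return (retained k-mers, one frequency row per genome)."""
    per_genome = [count_kmers(g, k) for g in genomes]
    totals = Counter()
    for counts in per_genome:
        totals.update(counts)
    # Rank all 4**k possible k-mers by total count; ties are broken alphabetically.
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    ranked = sorted(all_kmers, key=lambda km: (-totals[km], km))
    n_keep = max(1, int(len(all_kmers) * keep_fraction))
    kept = ranked[:n_keep]
    rows = []
    for counts in per_genome:
        genome_total = sum(counts.values()) or 1
        # Assumed normalization: count divided by the genome's total k-mer count.
        rows.append([counts[km] / genome_total for km in kept])
    return kept, rows


# Toy example with k = 2: 16 possible k-mers, of which 20% (3) are retained.
kept, matrix = kmer_frequency_matrix(["ACGTACGTAC", "TTTTACGTTT"], k=2)
print(kept)
print(matrix)
```

With k = 10, the same 20% rule retains 209,715 of the 1,048,576 possible k-mers, which is why enumerating all k-mers in pure Python is only practical for small k and a dedicated counter such as Jellyfish is used in practice.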

To capture global features of genomic sequences and compensate for the limitation that k-mers primarily reflect local characteristics, we fine-tuned DNABERT-2 [40] (preprint) to obtain embedding vectors enriched with contextual information. The DNABERT model [41], based on the BERT architecture, is a pre-trained model for DNA sequences that learns contextual representations through masked language modeling, effectively capturing global semantic features. DNABERT-2 further improves upon DNABERT by introducing Byte Pair Encoding (BPE) and Attention with Linear Biases (ALiBi), enhancing its capacity for modeling DNA sequences and improving overall performance. In practice, we observed that as the sequence length increased, the fine-tuning process became significantly slower and more memory-intensive. To address this issue, we adopted a sliding window approach to segment the genome into overlapping fragments of length 800 with a stride of 200, preserving sequence continuity. During the fine-tuning process, we set the learning rate to 2e-5, the batch size to 8, and trained the model for 6 epochs (see Supplementary Material 3 for detailed parameter settings). The output includes 768-dimensional embedding vectors and their corresponding prediction probabilities for each input sequence.
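The sliding-window segmentation can be sketched as follows. This is a minimal illustration under stated assumptions: the window length and stride are taken to be in nucleotides (the text does not state the unit), the function name is hypothetical, and the handling of the sequence tail (a final window aligned to the end of the sequence) is a choice made here, not one described in the study.

```python
def sliding_windows(sequence, window=800, stride=200):
    """Split a sequence into overlapping fragments of a fixed length."""
    if len(sequence) <= window:
        return [sequence]
    fragments = []
    start = 0
    while start + window < len(sequence):
        fragments.append(sequence[start:start + window])
        start += stride
    # Assumed tail handling: one last window aligned to the sequence end.
    fragments.append(sequence[-window:])
    return fragments


toy_genome = "ACGT" * 350  # 1,400 nucleotides
fragments = sliding_windows(toy_genome)
print(len(fragments), [len(f) for f in fragments])
```

With a window of 800 and a stride of 200, consecutive fragments share 600 positions, so each region of the genome is seen in several contexts during fine-tuning.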

It is worth noting that fine-tuning DNABERT-2 is computationally intensive and requires substantial memory resources. Therefore, during the generation of embedding vectors, we randomly selected approximately 20% of the samples, in proportion to the original class distribution, for model fine-tuning. This strategy mitigates the computational and memory constraints noted above. It also reduces the risk of overfitting by not using the entire dataset for fine-tuning. In this way, the generalizability of the pretrained model is better preserved.

To enhance the discriminative power of the embedding vectors, we proposed a feature extraction strategy based on high-confidence prediction results. Specifically, for positive samples, embedding vectors with prediction probabilities greater than 0.9 are categorized as positive embedding vectors, while those with probabilities less than 0.1 are categorized as negative embedding vectors. For negative samples, the categorization is reversed. This threshold setting helps reduce the influence of ambiguous or borderline predictions, yielding more discriminative feature subsets for downstream classification. These vectors were separately aggregated using a BiLSTM model [42], and the negative feature matrix was negated to emphasize polarity. This process resulted in two global feature matrices, each with a dimension of ($114 \times 1024$). This strategy effectively improved the separability between positive and negative samples in the feature space. Finally, the two global feature matrices were concatenated with the k-mer features, producing a comprehensive feature matrix of dimension ($114 \times (4^{k}/5 + 2048)$). This matrix integrates both local and global information from genomic sequences, providing a more comprehensive and discriminative representation for the downstream classification task.
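The confidence-based partition of fragment embeddings can be sketched in Python as below. The sketch rests on one interpretation that the text does not confirm: each probability is taken to be the probability the fine-tuned model assigns to the sample's own label, so that a confident agreement keeps the sample's polarity and a confident disagreement flips it. The function name is hypothetical, and the BiLSTM aggregation and the negation of the negative matrix are not shown.

```python
def split_by_confidence(embeddings, probs, sample_is_positive, high=0.9, low=0.1):
    """Partition fragment embeddings into confident positive and negative groups.

    probs[i] is assumed to be the probability assigned to the sample's own
    label for fragment i (an interpretation, not confirmed by the source).
    """
    positive, negative = [], []
    for vec, p in zip(embeddings, probs):
        if p > high:
            agrees_with_label = True
        elif p < low:
            agrees_with_label = False
        else:
            continue  # ambiguous or borderline fragment, dropped
        # Agreement keeps the sample's polarity; disagreement reverses it.
        is_positive = (sample_is_positive == agrees_with_label)
        (positive if is_positive else negative).append(vec)
    return positive, negative


toy_embeddings = [[1.0], [2.0], [3.0]]
toy_probs = [0.95, 0.50, 0.05]
pos_a, neg_a = split_by_confidence(toy_embeddings, toy_probs, sample_is_positive=True)
pos_b, neg_b = split_by_confidence(toy_embeddings, toy_probs, sample_is_positive=False)
print(pos_a, neg_a, pos_b, neg_b)
```

In the toy example the fragment with probability 0.50 is discarded in both cases, which mirrors the stated goal of removing borderline predictions before aggregation.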

Cyclic DE-BO with failure avoidance (C-DBFA)

To enable intelligent feature selection and classification of LAB genomic data, we designed and implemented an automated modeling pipeline. This pipeline systematically combines three feature selection methods—variance thresholding, ANOVA F-value, and principal component analysis (PCA)—with three classifiers: support vector machine (SVM), random forest (RF), and XGBoost. Model performance was optimized through hyperparameter tuning across all combinations (see the detailed List of hyperparameter search ranges in Supplementary Material 4). Experimental results showed that applying unified feature selection directly on the high-dimensional concatenated feature matrix significantly reduced computational efficiency. To balance efficiency and the discriminative power of k-mer features, we first applied a chi-squared test to the k-mer feature set and selected the most label-associated features. These were then concatenated with the embedding vectors and further refined using the feature selection methods within the pipeline. The final fused features were input into the classifier for prediction.
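The first stage of this two-stage feature processing (chi-squared filtering of k-mer features followed by concatenation with the embeddings) can be sketched in plain Python. The chi-squared score below follows the common formulation for non-negative features, in which the observed per-class feature sums are compared with the sums expected from the class proportions; whether the study used exactly this formulation is an assumption, and both function names are hypothetical. The second stage (variance thresholding, ANOVA F-value, or PCA, followed by a classifier) is not shown.

```python
def chi2_scores(X, y):
    """Chi-squared score of each non-negative feature against the class labels."""
    n_samples, n_features = len(X), len(X[0])
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for c in sorted(set(y)):
        class_prob = sum(1 for label in y if label == c) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            expected = class_prob * feature_totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores


def fuse_features(kmer_rows, embedding_rows, y, n_kmers):
    """Keep the n_kmers highest-scoring k-mer columns, then append the embeddings."""
    scores = chi2_scores(kmer_rows, y)
    top = sorted(range(len(scores)), key=lambda j: -scores[j])[:n_kmers]
    top.sort()  # preserve the original column order
    return [[row[j] for j in top] + list(emb)
            for row, emb in zip(kmer_rows, embedding_rows)]


# Toy data: 4 samples, 3 k-mer frequencies, 2 embedding dimensions.
toy_kmers = [[0.5, 0.1, 0.4], [0.6, 0.1, 0.3], [0.1, 0.1, 0.8], [0.2, 0.1, 0.7]]
toy_embeddings = [[1.0, -1.0]] * 4
toy_labels = [1, 1, 0, 0]
fused = fuse_features(toy_kmers, toy_embeddings, toy_labels, n_kmers=2)
print(fused[0])
```

The uninformative middle column, which is identical in both classes, receives a score near zero and is dropped. Only the reduced k-mer block is passed on to the more expensive selectors, which is the efficiency argument made above.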

To address the tendency of Bayesian Optimization (BO) to become trapped in local optima, we designed a cyclic hybrid optimization strategy, Cyclic Differential Evolution and Bayesian Optimization with Failure Avoidance (C-DBFA). In this approach, Differential Evolution (DE) is first employed for global exploration, and a subset of high-quality solutions is injected into BO to guide its search. If BO fails to improve performance after multiple iterations, the process automatically switches back to DE, and the two optimizers alternate cyclically. To improve efficiency, we incorporated early stopping mechanisms. Specifically, the optimization process is automatically terminated if the performance does not improve over a predefined number of trials.
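The alternation and early-stopping logic can be sketched as follows. This sketch illustrates only the control flow and is not the C-DBFA implementation: uniform random sampling stands in for DE, Gaussian perturbation of the current best point stands in for BO, and every name and default value (global_budget, stall_limit, patience, max_trials) is a hypothetical choice made for the example.

```python
import random


def cyclic_search(objective, bounds, seed=42, global_budget=10,
                  stall_limit=5, patience=30, max_trials=200):
    """Alternate a global and a local stage; stop early when progress stalls.

    Uniform sampling stands in for DE and Gaussian perturbation of the
    incumbent stands in for BO; only the switching logic is illustrated.
    """
    rng = random.Random(seed)
    best_x, best_y = None, float("-inf")
    trials = since_improvement = 0
    stages = []

    def evaluate(x, stage):
        nonlocal best_x, best_y, trials, since_improvement
        y = objective(x)
        trials += 1
        stages.append(stage)
        if y > best_y:
            best_x, best_y, since_improvement = x, y, 0
            return True
        since_improvement += 1
        return False

    def done():
        # Early stopping: no improvement over `patience` consecutive trials.
        return trials >= max_trials or since_improvement >= patience

    while not done():
        # Global stage: a fixed budget of broad exploration.
        for _ in range(global_budget):
            if done():
                break
            evaluate([rng.uniform(lo, hi) for lo, hi in bounds], "global")
        # Local stage: refine around the incumbent until it stalls,
        # then fall back to the global stage.
        stall = 0
        while stall < stall_limit and not done():
            x = [min(hi, max(lo, c + rng.gauss(0, 0.1 * (hi - lo))))
                 for c, (lo, hi) in zip(best_x, bounds)]
            stall = 0 if evaluate(x, "local") else stall + 1
    return best_x, best_y, stages


best_x, best_y, stages = cyclic_search(lambda x: -(x[0] - 0.3) ** 2, [(0.0, 1.0)])
print(best_x, best_y, len(stages))
```

The local stage always starts from the best point found so far, which loosely mirrors the injection of high-quality DE solutions into BO, and a stalled local stage hands control back to the global stage.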

To further alleviate the problem of redundant sampling in the same region, or even at the exact same point, we propose a failure-aware mechanism for Bayesian Optimization (BO). This mechanism detects local failure regions and injects simulated failure samples into the Tree-structured Parzen Estimator (TPE) sampler to guide it away from these areas. Let the set of all historical trials be denoted as $T=\left\{ t_{i} \right\}$. Each parameter vector is represented as $x_{i} \in \mathbb{R}^{d}$, with the corresponding objective value denoted as $y_{i}$. The grouping function based on the main control keys is defined as $g\left( t_{i} \right)$. First, all completed trials are grouped according to their main control keys, as shown in Eq. (1). Then, within each group, all trials are normalized, and a local neighborhood is defined based on the Euclidean distance, as described in Eq. (2).

$$\begin{aligned} G_{k} =\left\{ t_{i}:g\left( t_{i} \right) =k \right\} \end{aligned}$$

(1)

$$\begin{aligned} d\left( x_{i},x_{j}\right) =\left\| x_{i}-x_{j} \right\| _{2} \end{aligned}$$

(2)

Let the neighborhood radius be r. The neighborhood set for each trial is then defined as in Eq. (3).

$$\begin{aligned} N_{i} =\left\{ j\mid d\left( x_{i}, x_{j}\right) \le r,t_{j}\in G_{g(ti)} \right\} \end{aligned}$$

(3)

If the number of trials within a trial’s neighborhood, $\left| N_{i} \right|$, exceeds a predefined threshold M, and the range of objective values in that neighborhood (i.e., the difference between the maximum and minimum) is smaller than a predefined threshold $\varepsilon$, then the corresponding trial is considered to belong to a failure region. For each detected failure region, the trial with the smallest total Euclidean distance to all other trials in the same region is selected as the center point, as in Eq. (4).

$$\begin{aligned} t_{c} =\arg \min _{t_{i}\in G_{k}^{\text {fail}}} \sum _{t_{j}\in G_{k}^{\text {fail}}} d\left( x_{i},x_{j} \right) \end{aligned}$$

(4)

Around this center point, a fixed number of perturbed and randomly sampled trials are generated and injected into the optimization process, with artificially low objective values assigned to guide the TPE sampler to avoid this region.
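The detection and center-selection steps of Eqs. (2) to (4) can be sketched for a single group $G_{k}$ as follows. This is an illustrative sketch under stated assumptions: min-max scaling is assumed for the normalization (the text does not name a method), and `min_neighbors` is a hypothetical value for the threshold M. The toy trials are invented for the example.

```python
import math

def normalize(points):
    """Min-max scale each dimension to [0, 1] (assumed normalization)."""
    dims = range(len(points[0]))
    lo = [min(p[k] for p in points) for k in dims]
    hi = [max(p[k] for p in points) for k in dims]
    return [
        [(p[k] - lo[k]) / (hi[k] - lo[k]) if hi[k] > lo[k] else 0.0 for k in dims]
        for p in points
    ]

def failure_region(x, values, r=0.25, eps=1e-3, min_neighbors=3):
    """Indices of trials whose neighborhood is dense (> min_neighbors)
    and flat (objective range < eps), following Eq. (3)."""
    flagged = []
    for i in range(len(x)):
        # The neighborhood includes the trial itself, since d(x_i, x_i) = 0 <= r.
        nbrs = [j for j in range(len(x)) if math.dist(x[i], x[j]) <= r]
        ys = [values[j] for j in nbrs]
        if len(nbrs) > min_neighbors and max(ys) - min(ys) < eps:
            flagged.append(i)
    return flagged

def center_trial(x, members):
    """Trial with the smallest summed distance to the other members, Eq. (4)."""
    return min(members, key=lambda i: sum(math.dist(x[i], x[j]) for j in members))

# Toy trials from one group: five clustered, near-identical results plus three others.
points = [
    (0.00, 0.00), (0.05, 0.00), (0.00, 0.05), (0.05, 0.05), (0.10, 0.10),
    (1.00, 1.00), (0.50, 0.90), (0.90, 0.40),
]
values = [0.8000, 0.8002, 0.8001, 0.8003, 0.8002, 0.90, 0.70, 0.85]

x = normalize(points)
members = failure_region(x, values)
center = center_trial(x, members)
```

In this toy group the five clustered trials are flagged and the trial at (0.05, 0.05) is chosen as the center, because it minimizes the summed distance to the other flagged trials.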

We empirically set the neighborhood radius to $r = 0.25$, the performance threshold to $\varepsilon = 0.001$, and injected 10 trials per failure region. These values were selected to balance failure avoidance and computational cost. Specifically, r defines the scope of the failure zone, $\varepsilon$ is the spread of objective values below which a neighborhood is treated as stagnant (equivalently, the minimum improvement required to escape it), and the number of injected trials controls how effectively such regions are bypassed.
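A minimal sketch of the injection step, using the radius and trial count stated above, might look like the following. The function name, the uniform perturbation, the clipping to the unit cube, and the penalty value of 0.0 are all assumptions made for illustration; the text does not specify how the perturbed and random samples are mixed or which low value is assigned.

```python
import random

def synthetic_failures(center, n_inject=10, radius=0.25, penalty=0.0, seed=0):
    """Generate penalized pseudo-trials around a failure-region center.

    `center` is a normalized parameter vector in [0, 1]^d. Each pseudo-trial
    is a uniform perturbation of the center, clipped to the unit cube, and
    carries an artificially low objective value (`penalty`, hypothetical).
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_inject):
        point = [min(1.0, max(0.0, c + rng.uniform(-radius, radius))) for c in center]
        trials.append({"params": point, "value": penalty})
    return trials

injected = synthetic_failures([0.05, 0.05])
```

With Optuna, for instance, records of this kind could be registered through `study.add_trial` together with `optuna.trial.create_trial`, so that the TPE sampler sees them as completed low-scoring trials; whether the authors used this route is not stated.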

We retained the five best-performing parameter configurations, requiring them to differ in either the feature selector or the classifier so that candidate models were not concentrated in a single local optimum. The standardized feature matrix was fed into each of the five models, and the model achieving the highest accuracy supplied the final prediction. Compared with exhaustive grid search, C-DBFA significantly reduces search time by leveraging surrogate models for intelligent exploration of the search space, while maintaining comparable performance.
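One way to read the diversity requirement is that each retained configuration must use a distinct selector-classifier pair. The sketch below implements that reading; the interpretation, the function name `top_k_diverse`, the record fields, and the toy accuracies are all assumptions, not taken from the paper.

```python
def top_k_diverse(trials, k=5):
    """Keep the k most accurate trials, at most one per (selector, classifier) pair."""
    ranked = sorted(trials, key=lambda t: t["accuracy"], reverse=True)
    chosen, seen = [], set()
    for trial in ranked:
        combo = (trial["selector"], trial["classifier"])
        if combo in seen:
            continue  # a better trial with the same pair is already kept
        chosen.append(trial)
        seen.add(combo)
        if len(chosen) == k:
            break
    return chosen

# Hypothetical optimization history.
history = [
    {"selector": "pca", "classifier": "svm", "accuracy": 0.95},
    {"selector": "pca", "classifier": "svm", "accuracy": 0.94},
    {"selector": "anova", "classifier": "svm", "accuracy": 0.93},
    {"selector": "pca", "classifier": "rf", "accuracy": 0.92},
    {"selector": "pca", "classifier": "svm", "accuracy": 0.91},
    {"selector": "variance", "classifier": "xgb", "accuracy": 0.90},
    {"selector": "anova", "classifier": "rf", "accuracy": 0.89},
    {"selector": "variance", "classifier": "svm", "accuracy": 0.88},
]

top5 = top_k_diverse(history)
```

Here the second- and fifth-ranked trials are skipped because they repeat the PCA plus SVM pair, so the retained set spans five different combinations.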

Performance evaluation

To comprehensively evaluate model performance, we adopted six quantitative metrics: Accuracy, Precision, Recall, Matthews Correlation Coefficient (MCC), F1-score, and Area Under the Curve (AUC). The first five are defined in terms of confusion-matrix counts in Eqs. (5) to (9):

$$\begin{aligned} \text {Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$

(5)

$$\begin{aligned} \text {Recall}=\frac{TP}{TP+FN} \end{aligned}$$

(6)

$$\begin{aligned} \text {Precision}=\frac{TP}{TP+FP} \end{aligned}$$

(7)

$$\begin{aligned} \text {MCC}=\frac{ TP\times TN - FP\times FN }{\sqrt{\left( TP+FP \right) \left( TP+FN \right) \left( TN+FP \right) \left( TN+FN \right) }} \end{aligned}$$

(8)

$$\begin{aligned} \text {F1-score}=2\times \frac{\text {Precision}\times \text {Recall}}{\text {Precision}+\text {Recall}} \end{aligned}$$

(9)

Here, TP (true positives) is the number of correctly predicted positive instances, FN (false negatives) is the number of positive instances erroneously predicted as negative, FP (false positives) is the number of negative instances erroneously predicted as positive, and TN (true negatives) is the number of correctly predicted negative instances. Accuracy was used as the evaluation metric both during model optimization and when selecting the top five models.
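As a worked example of Eqs. (5) to (9), the following sketch computes the five count-based metrics from a confusion matrix. The counts are invented for illustration, and the zero-denominator fallbacks are a convention chosen here rather than something the text specifies. AUC is omitted because it requires ranked prediction scores, not counts.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, recall, precision, MCC and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "mcc": mcc,
        "f1": f1,
    }

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 0.85, recall 0.80, precision about 0.889, MCC about 0.704, and F1 about 0.842.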

Conclusions

This study presents HKDE-LACM, a classification model designed for accurate LAB identification based solely on genomic sequences. It integrates high-dimensional k-mer features with DNABERT-2 embeddings and applies C-DBFA for automated optimization. Experimental results demonstrate consistent improvements over previous approaches across three datasets, with Dataset 2 showing more than a 6% increase in accuracy and over a 5% gain in AUC. Furthermore, the model underscores the value of combining traditional sequence features with semantic contextual representations. Overall, this work provides a reliable and generalizable tool for genome-based LAB classification, with potential for other LAB-related applications.

Data availability

The three datasets employed in this study can be freely and openly accessed from the iProbiotics website (http://bioinfor.imu.edu.cn/iprobiotics/public/download.html).


Funding

This research was funded by the Inner Mongolia Natural Science Foundation Project (No. 2025MS06027) and the 2022 Basic Scientific Research Business Fee Project of Universities Directly under the Inner Mongolia Autonomous Region–Interdisciplinary Research Fund of Inner Mongolia Agricultural University (No. BR22-14-01).

Author information

Authors and Affiliations

College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, 010018, Inner Mongolia, China
Jie Zou, Jinhui Dai & Gaifang Dong
College of Food Science and Engineering, Inner Mongolia Agricultural University, Hohhot, 010018, Inner Mongolia, China
Weichi Liu
Inner Mongolia Autonomous Region Key Laboratory of Big Data Research and Application of Agriculture and Animal Husbandry, Hohhot, Inner Mongolia, China
Gaifang Dong

Contributions

J.Z. designed and conducted the experiments, performed the data analyses, and drafted the manuscript. W.L. provided computational resources and contributed to the visualization and presentation of experimental results. J.D. supported the refinement of the experimental design and provided constructive feedback on the manuscript. G.D. supervised the study, provided experimental guidance, and reviewed the manuscript.

Corresponding author

Correspondence to Gaifang Dong.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1. Complete and precise training and test results for Dataset 2

Additional file 2. Sample IDs and class labels of Dataset 3

Additional file 3. Main parameters for fine-tuning DNABERT-2

Additional file 4. Hyperparameter search space during optimization

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Zou, J., Liu, W., Dai, J. et al. HKDE-LACM: a hybrid model for lactic acid bacteria classification via k-mer and DNABERT-2 embedding fusion with cyclic DE-BO optimization. BMC Genomics 26, 815 (2025). https://doi.org/10.1186/s12864-025-12009-7

Received: 13 July 2025
Accepted: 14 August 2025
Published: 25 September 2025
Version of record: 25 September 2025
DOI: https://doi.org/10.1186/s12864-025-12009-7
