Introduction

The last several decades have seen a sharp increase in the creation of biopharmaceuticals, especially protein products, although these advances have been accompanied by difficulties in formulation1,2,3. Parenteral administration is the primary method of introducing biopharmaceuticals into the human body4. Biomolecules, however, are often less stable in liquid form than in solid form. It has been demonstrated that keeping biopharmaceuticals in solid form can help preserve molecular stability and increase product shelf life5.

Transporting products is also made easier by the elimination of water. In solid products, disaccharide excipients are typically used during drying to preserve the structure of biomolecules and guarantee stability. Either the vitrification hypothesis or the water replacement theory is frequently used to describe the stabilizing mechanisms of excipients6.

Biomolecule stability is significantly influenced by the type of excipient used7. A variety of methods with varying drying efficiencies have been proposed and developed for the solid biopharmaceutical development process. Fully characterizing biopharmaceuticals and their surrounding matrix in the solid state can be difficult because most characterization methods were created for liquids8. Dried biopharmaceuticals typically require a more involved development process than solution products9.

Biopharmaceuticals in the solid state have long been developed through the process of lyophilization, also known as freeze drying10,11,12. Compared to other drying techniques like spray drying, it is a more mature method. Lyophilization, however, is a labor-intensive batch process that consumes a lot of energy. A key goal in the development of biopharmaceuticals is to preserve the stability of the formulation over a long period of time. However, various stresses brought on by the drying process can promote aggregation, resulting in visible and sub-visible particles in the finished product8,13.

The application of machine learning (ML) has proven to be a powerful tool for simulating and predicting complex parameters. Ensemble models, particularly those built on decision trees, have become highly favored for accurately capturing intricate relationships between inputs and target variables14,15. In this study, we address the complex task of predicting the concentration of one particular chemical, denoted as C (mol/m3), using the coordinates X(m), Y(m), and Z(m). With a large dataset of over 46,000 data points, we aim to use advanced modeling techniques to accurately estimate C at any given point within the defined spatial domain.

The models used in this study include Decision Tree (DT), Ridge Regression (RR), and Support Vector Regression (SVR), each with its own advantages in modeling non-linear relationships between coordinates and chemical concentrations. Decision Tree models divide data into hierarchical decision nodes, allowing for easy interpretation of complex relationships. Ridge Regression, on the other hand, uses regularization to prevent overfitting, which is especially useful in high-dimensional datasets. Support Vector Regression, an extension of the Support Vector Machine algorithm, is effective at capturing nonlinear relationships by projecting data points into a higher-dimensional space.

To enhance the accuracy of these models, hyper-parameter optimization is conducted using the innovative Dragonfly Algorithm (DA). This optimization technique aims to adjust the parameters for each model, thereby improving their predictive performance.

This study presents a novel approach to hyperparameter optimization in the context of lyophilization modeling by combining the DA with a generalizability-specific objective function: the mean 5-fold R² score. This methodology is employed for the first time to improve the generalizability of machine learning models used for predicting chemical concentrations in pharmaceutical drying processes. The emphasis on model generalization, alongside the use of DA for hyperparameter tuning, marks a significant advancement over previous work, where such an integrated approach has not been explored. Moreover, this work demonstrates that the DA-enhanced SVR model not only outperforms traditional machine learning models but also exhibits superior generalization compared to earlier mechanistic models applied to similar contexts.

An extensive investigation of the application of DT, RR, and SVR models to chemical concentration prediction is presented in this work. The following sections detail the methodology, results, and discussion of how well these models estimate C, illuminating the complex relationship between chemical distribution and spatial position.

Materials and methods

Dataset

The dataset comprises over 46,000 data points with the coordinates X(m), Y(m), and Z(m) as inputs and the corresponding concentration (C) as the target output. Preprocessing involved outlier removal (using Isolation Forest), normalizing features (using a Min-Max scaler), and randomly splitting the data into train (~ 80%) and test (~ 20%) sets. The methodology developed by Alqarni11 is used for building the models. The pairwise distributions of dataset variables are shown in Fig. 115. The data used in this study can be accessed in the Supplementary File.
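As a minimal sketch of this preprocessing, assuming a scikit-learn workflow; the column names and the synthetic stand-in frame are illustrative, not the published dataset:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the real (outlier-filtered) dataset from the
# Supplementary File; column names are assumptions for illustration.
rng = np.random.default_rng(42)
data = pd.DataFrame(rng.random((1000, 4)), columns=["X", "Y", "Z", "C"])

X = data[["X", "Y", "Z"]].to_numpy()
y = data["C"].to_numpy()

# Random ~80/20 train-test split, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Min-Max scaling, fit on the training set only to avoid leakage
scaler = MinMaxScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
```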

The concentration distribution of moisture within the sample was estimated via numerical solution of the mass transfer equation, using the finite element method. The numerical simulations solved the convection-diffusion equation in unsteady-state mode. Moreover, conductive heat transfer was coupled to the mass transfer to track temperature variations across the sample over time. The model thus tracks the variation of moisture content and temperature with time and location, and the resulting data were used in the machine learning analysis.
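For reference, generic forms of the governing equations assumed here are shown below; the exact coefficients, source terms, and boundary conditions of the original simulation are not reproduced:

$$\frac{\partial C}{\partial t}+\mathbf{u}\cdot\nabla C=D\nabla^{2}C$$

$$\rho c_{p}\frac{\partial T}{\partial t}=\nabla\cdot\left(k_{t}\nabla T\right)$$

where \(\mathbf{u}\) is the velocity field, \(D\) the diffusion coefficient, \(\rho\) the density, \(c_{p}\) the specific heat capacity, and \(k_{t}\) the thermal conductivity.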

Fig. 1. Dataset pair plot.

Preprocessing

The Isolation Forest (IF) algorithm is an unsupervised, ensemble-based method that calculates an anomaly score for every point, thereby identifying outliers. The data are recursively split at random along particular features of interest16.

A key advantage of the IF algorithm is its information processing methodology. By employing decision trees to isolate anomalies instead of evaluating the complete set of points, the algorithm efficiently saves time and space17,18. Its subsampling approach builds each tree on a random subset of the data, which keeps the necessary number of partitions small.

In the preprocessing stage, a total of 973 data points were identified and removed as outliers using the IF algorithm to enhance the performance of ML models. This unsupervised method effectively detects anomalies by calculating an isolation score for each data point, ensuring that only relevant and reliable data were used for model training18. Additionally, a contamination parameter of 0.02 was applied, which is a reasonable and widely used threshold in outlier detection tasks.
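A minimal sketch of this outlier-removal step, assuming scikit-learn; the stand-in data frame is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the raw dataset; in practice the real X, Y, Z, C
# columns would be used here.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.random((46000, 4)), columns=["X", "Y", "Z", "C"])

# contamination=0.02 matches the threshold reported above
iso = IsolationForest(contamination=0.02, random_state=42)
labels = iso.fit_predict(data)   # -1 marks outliers, 1 marks inliers
data_clean = data[labels == 1]   # the flagged points (~2%) are discarded
```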

Machine learning models

Decision tree regression (DT)

The Decision Tree (DT) regression algorithm operates on the core concept of breaking down a complex problem into a series of simpler tasks, each potentially leading to a more interpretable solution19,20. Decision trees are made up of a hierarchy of conditions that follow a sequential process from the tree’s root to its terminal nodes. The structure of decision trees is comprehensible and clear. Following training, decision trees allow the creation of logical rules that, by repeatedly splitting datasets into subgroups, can be used to forecast new datasets21.

A DT model is produced by repeatedly splitting the training set. Using particular criteria, the algorithm divides the data at each internal node, starting with the root node and continuing until a stopping condition is reached. Each leaf node then yields a simple, local regression model.
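A minimal sketch of such a regressor, assuming scikit-learn; the depth and leaf settings are placeholders rather than the DA-tuned values of Table 1, and X_train/y_train follow the naming of the preprocessing sketch above:

```python
from sklearn.tree import DecisionTreeRegressor

# Placeholder hyperparameters; the DA-tuned values are reported in Table 1
dt = DecisionTreeRegressor(max_depth=12, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
```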

Ridge regression (RR)

Ridge Regression is a linear approach utilized for analyzing labels by exploring their statistical correlations22. Relative to the ordinary least squares method (LSM), RR can produce estimates with lower variance. When implementing RR, the regression coefficients tend to be more consolidated, leading to fewer complications. A mathematical expression characterizing the ridge regression estimator is given below, where the ridge parameter k is typically varied over the range 0 to 1, X and Y denote the input and output data matrices, and M symbolizes the identity matrix23.

$$RR_{k}=\left(X^{\prime}X+kM\right)^{-1}X^{\prime}Y$$
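A small numpy sketch of this closed-form estimator, as a direct transcription of the formula rather than the library implementation used in the study:

```python
import numpy as np

def ridge_coefficients(X, Y, k):
    """Closed-form ridge estimator: (X'X + kM)^(-1) X'Y, with M the identity."""
    M = np.eye(X.shape[1])
    return np.linalg.solve(X.T @ X + k * M, X.T @ Y)
```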

Support vector regression (SVR)

SVM is a useful and reliable tool, offering nonlinear function approximation and strong generalization ability24. Support Vector Regression (SVR) is a regression technique based on SVM, which utilizes a non-linear support vector regressor with a kernel function \(K(x_{i},x)\) for a dataset \(\{(x_{i},y_{i})\}_{i=1}^{n}\), represented as25:

$$y=f(x)=\sum_{i=1}^{n}\omega_{i}K(x_{i},x)+b$$

where \(\omega_{i}\) are the weights and b denotes the bias term. The non-linear SVR primal optimization problem (with \(\epsilon\)-insensitive loss) can be written as25:

$$\min_{w,b,\xi_{i},\xi_{i}^{*}}\quad\frac{1}{2}\left\|w\right\|^{2}+C\sum_{i=1}^{n}\left(\xi_{i}+\xi_{i}^{*}\right)$$

subject to

$$y_{i}-\langle w,\phi(x_{i})\rangle-b\le\epsilon+\xi_{i},$$
$$\langle w,\phi(x_{i})\rangle+b-y_{i}\le\epsilon+\xi_{i}^{*},$$
$$\xi_{i},\;\xi_{i}^{*}\ge 0,\qquad i=1,\dots,n$$

where \(\phi(\cdot)\) denotes the (possibly implicit) feature mapping induced by the kernel \(K(x_{i},x_{j})=\langle\phi(x_{i}),\phi(x_{j})\rangle\).

By employing Lagrange multipliers \(a_{i}^{*}\) and \(a_{i}\), the dual form of the non-linear SVR can be expressed as26:

$$y=f(x)=\sum_{i=1}^{n}\left(a_{i}^{*}-a_{i}\right)K(x_{i},x)+b$$

Various kernel functions such as linear, Gaussian, polynomial, and sigmoid have been proposed. Previous research indicates that the Gaussian kernel is effective and provides accurate results. The widely used Gaussian kernel is defined as27:

$$K(x,x_{i})=\exp\left(-\frac{\left\|x-x_{i}\right\|^{2}}{2\sigma^{2}}\right)$$

In this equation, \(\sigma\) represents the width of the Gaussian kernel.
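A minimal sketch of an RBF-kernel SVR in scikit-learn; the hyperparameter values are placeholders rather than the DA-optimized values of Table 1, and the variable names follow the preprocessing sketch above:

```python
from sklearn.svm import SVR

# Placeholder hyperparameters; the DA-tuned values are reported in Table 1.
# In scikit-learn the RBF kernel is exp(-gamma * ||x - x_i||^2), so gamma
# corresponds to 1 / (2 * sigma^2) in the notation above.
svr = SVR(kernel="rbf", C=10.0, epsilon=0.01, gamma="scale")
svr.fit(X_train, y_train)
y_pred = svr.predict(X_test)
```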

Hyperparameter optimization (DA)

The swarming behavior of dragonflies served as inspiration for the development of the Dragonfly Algorithm (DA). It was proposed in 2016 as an alternative to well-established optimization algorithms like PSO and GA. The DA is a population-based optimization procedure that models the foraging behavior of dragonflies28.

The Dragonfly Algorithm uses a swarm of dragonflies to explore the search space. Each dragonfly embodies a prospective solution to the optimization problem, and individuals interact and communicate to discover the optimal solution. The algorithm also integrates a mutation operator, which helps in exploring novel regions of the search space.

DA represents each dragonfly by a position vector and a velocity vector. The position vector encodes one candidate solution to the optimization task, while the velocity vector specifies the direction and rate of travel29. Every solution is assessed for quality using a fitness function; solutions with higher fitness values are more likely to proliferate and flourish30,31.

Fig. 2. Workflow of DA algorithm.

The primary workflow of the DA, illustrated in Fig. 2, begins with the initialization of a population of dragonflies, each representing a potential solution. These individuals iteratively update their positions and velocities based on five key behavioral factors (separation, alignment, cohesion, attraction toward food, and distraction from enemies) that collectively simulate natural swarming dynamics. At each iteration, candidate solutions are evaluated using the defined fitness function, and the swarm progressively converges toward more optimal regions of the search space. The figure highlights how exploration and exploitation phases are balanced, with mutation operators further assisting in avoiding local optima. This structured workflow underpins the algorithm's effectiveness in tuning hyperparameters for complex machine learning models.

The Dragonfly Algorithm has been shown to be impactful in solving a broad range of optimization problems, including image segmentation, function optimization, and feature selection. In many cases, it outperforms other optimization algorithms like PSO and genetic algorithms32.

For hyperparameter optimization, the DA was employed with a population size of 40 dragonflies and a maximum of 120 iterations, while convergence was determined by a tolerance threshold of 1.2e-4 in the fitness improvement across consecutive iterations. In this study, the fitness function was defined as the mean 5-fold R² score of the model generated by each candidate solution, ensuring emphasis on generalization rather than overfitting. Compared with alternative meta-heuristic methods such as Genetic Algorithms (GA) and Particle Swarm Optimization (PSO), DA demonstrated competitive convergence behavior and offered better exploration–exploitation balance for the studied problem.
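A sketch of the fitness function used during tuning, the mean 5-fold R² of a candidate SVR; the DA driver itself (population 40, 120 iterations, tolerance 1.2e-4) is assumed to come from a separate implementation and is not shown:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

def fitness(candidate, X_train, y_train):
    """Mean 5-fold R^2 of an SVR built from one dragonfly's position vector."""
    C, epsilon, gamma = candidate  # hypothetical encoding of the position
    model = SVR(kernel="rbf", C=C, epsilon=epsilon, gamma=gamma)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="r2")
    return scores.mean()           # DA seeks to maximize this value
```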

Evaluation metrics

Several performance metrics are used to compare and rank the outcomes of the optimized models in this section. The equations for these performance metrics are as follows33 (a brief computational sketch is given after the list):

1. Coefficient of Determination (R² Score):

$$R^{2}=1-\frac{\sum\left(C_{\text{true}}-C_{\text{pred}}\right)^{2}}{\sum\left(C_{\text{true}}-\bar{C}_{\text{true}}\right)^{2}}$$

where \(C_{\text{true}}\) are the true concentration values, \(C_{\text{pred}}\) are the predicted concentration values, and \(\bar{C}_{\text{true}}\) is the mean of the true concentration values.

2. Root Mean Square Error (RMSE):

$$\text{RMSE}=\sqrt{\frac{1}{n}\sum\left(C_{\text{true}}-C_{\text{pred}}\right)^{2}}$$

where n stands for the number of data points.

3. Mean Absolute Error (MAE):

$$\text{MAE}=\frac{1}{n}\sum\left|C_{\text{true}}-C_{\text{pred}}\right|$$

4. Maximum Error:

$$\text{Max Error}=\max\left(\left|C_{\text{true}}-C_{\text{pred}}\right|\right)$$
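As noted above, a brief sketch of these four metrics, assuming scikit-learn; y_test/y_pred follow the naming of the earlier sketches:

```python
import numpy as np
from sklearn.metrics import (max_error, mean_absolute_error,
                             mean_squared_error, r2_score)

r2   = r2_score(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae  = mean_absolute_error(y_test, y_pred)
mx   = max_error(y_test, y_pred)
```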

Results and discussion

This section presents the results obtained from the application of three models—DT, RR, and SVR—to predict the concentration distribution in a pharmaceutical drying process. A range of performance metrics, including the Coefficient of Determination (R²), RMSE, MAE, and Maximum Error, were used to evaluate the models. The final optimized hyperparameters obtained using DA are listed in Table 1.

Table 1 Final optimized hyperparameters.

Model performance metrics

The R² scores for both the train and test subsets are summarized in Table 2. The SVR model demonstrated the highest R² scores, indicating superior predictive accuracy and generalization compared to the DT and RR models.

Table 2 R2 scores of final models.

Table 3 compares the RMSE, MAE, and Maximum Error rates on the test data. The SVR model consistently outperformed the other models, achieving the lowest error rates across all metrics.

Table 3 Error rates on test data.

To demonstrate the impact of DA hyperparameter tuning, we compared the performance of each model before and after optimization, as shown in Table 4. For all three models, optimization resulted in noticeable improvements. Specifically, the SVR model demonstrated the most significant enhancement, with the R² test score increasing from 0.9047 to 0.9992 after applying the DA. Similarly, both the Decision Tree and Ridge Regression models also showed improvements in their R² scores, highlighting the effectiveness of DA in fine-tuning hyperparameters and improving predictive accuracy across all models. These results emphasize the critical role of hyperparameter optimization in enhancing model performance and generalization, particularly for complex tasks such as predicting concentration in pharmaceutical processes.

Table 4 Comparison of models with optimized hyperparameters and default values.

Comparative analysis

Figs. 3, 4 and 5 display the actual concentration values contrasted with the predicted concentration values generated by the three models. SVR, optimized by the Dragonfly Algorithm, outperformed both the Decision Tree and Ridge Regression models across all evaluated metrics. The exceptionally high R² scores for SVR (0.999234 on the test set and 0.999187 on the training set) demonstrate its ability to accurately model the underlying patterns in the data. The minimal difference between the train and test R² scores also indicates that the SVR model generalizes well to unseen data, reducing the risk of overfitting.

Fig. 3. True concentration values compared to predicted concentration values using DT.

Fig. 4. True concentration values compared to predicted concentration values using RR.

Fig. 5. True concentration values compared to predicted concentration values using SVR.

The error metrics further validate the SVR model’s performance. The RMSE of 1.2619E-03 and MAE of 7.78946E-04 are notably lower than those of the other models, reflecting the model’s precision in predicting the concentration of the chemical compound. Additionally, the maximum error for SVR was the lowest at 5.18029E-03, indicating consistent accuracy even for the most challenging predictions.

Overall, the SVR model improved by the Dragonfly Algorithm accurately and reliably predicts chemical concentration in a spatial context, rendering it a suitable option for applications in chemical engineering and related domains. Thus, we designate this model as the top performer in our study. Through this model, the distinct impact of coordinates on the output is illustrated in Figs. 6, 7 and 8.

Although the SVR model clearly outperformed DT and RR in terms of predictive accuracy, it requires higher computational resources due to kernel operations and hyperparameter tuning. In contrast, DT and RR offer faster training and inference, which may be advantageous for real-time monitoring scenarios where rapid feedback is critical. Therefore, while SVR is most suitable for offline modeling and optimization, DT or RR could serve as lightweight alternatives in industrial drying systems where computational efficiency is prioritized over marginal gains in accuracy.

To further ensure robustness and mitigate possible bias from a single train–test split, a 5-fold cross-validation was performed on the optimized SVR model. The mean R² across folds was 0.99891, with fold-wise scores ranging from 0.99873 to 0.99912, indicating consistently high predictive accuracy. Similarly, the average RMSE and MAE values across folds were 1.39E-03 and 8.42E-04, respectively, both remaining close to the previously reported single-split results. These findings confirm that the SVR model maintains strong generalization ability across multiple data partitions, further supporting its suitability for industrial drying system simulations.
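A sketch of this robustness check; best_svr is a stand-in name for the DA-optimized SVR, and X, y for the preprocessed dataset of the earlier sketches:

```python
from sklearn.model_selection import cross_validate

cv = cross_validate(best_svr, X, y, cv=5,
                    scoring=("r2", "neg_root_mean_squared_error",
                             "neg_mean_absolute_error"))
print(cv["test_r2"].mean(),                            # mean 5-fold R^2
      -cv["test_neg_root_mean_squared_error"].mean(),  # mean RMSE
      -cv["test_neg_mean_absolute_error"].mean())      # mean MAE
```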

The excellent accuracy of the SVR model compared to DT and RR can be attributed to its ability to model complex non-linear relationships between spatial coordinates and concentration values more effectively. While the Decision Tree algorithm is highly interpretable, it often struggles with smooth function approximation and can overfit local patterns in dense datasets, even with pruning. Ridge Regression, being a linear method, is inherently limited in capturing non-linear trends present in the drying process, which involves intricate interactions between mass transfer dynamics and spatial variables. In contrast, SVR leverages the kernel trick, particularly the optimized radial basis function (RBF) kernel, to project data into a higher-dimensional space, enabling it to capture subtle non-linear dependencies. Hyperparameter optimization further enhanced its ability to generalize, as reflected by its near-perfect R² scores and minimal error metrics. This demonstrates that SVR is inherently better suited for modeling the highly non-linear relationships in pharmaceutical drying simulations.

To further examine model robustness, we performed a simple uncertainty analysis by perturbing the input coordinates with small random fluctuations (± 0.48–1% of their values). The SVR predictions remained highly stable, with R² values decreasing only marginally (from 0.9992 to 0.9987) and RMSE increasing by less than 8%. This negligible degradation highlights the model’s resilience to input noise and indicates reliable generalization under realistic experimental variations. Such robustness is particularly important for industrial applications, where measurement inaccuracies and process disturbances are unavoidable.
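A sketch of this perturbation check: test inputs are jittered by a small random relative amount and the metrics recomputed. The ±1% bound and the svr/X_test/y_test names follow the earlier sketches:

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
noise = rng.uniform(-0.01, 0.01, size=X_test.shape)  # up to ±1% relative jitter
X_noisy = X_test * (1.0 + noise)

r2_noisy = r2_score(y_test, svr.predict(X_noisy))    # compare against clean R^2
```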

Visualization interpretations

To better understand the predictive behavior of the optimized models, several visualizations were generated that illustrate the relationships between the spatial coordinates (X, Y, Z) and the predicted concentration values in the system. These plots provide intuitive insights into concentration gradients, spatial dependencies, and the impact of individual variables on model predictions, complementing the numerical performance metrics reported earlier.
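A sketch of how a 1D slice like Fig. 6 could be produced: sweep one coordinate while holding the others fixed, then predict with the trained SVR. The axis range and the use of matplotlib are assumptions, and svr/scaler follow the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np

x_sweep = np.linspace(0.0, 0.02, 200)                  # illustrative X range
grid = np.column_stack([x_sweep,
                        np.zeros_like(x_sweep),        # Y = 0 fixed
                        np.full_like(x_sweep, 0.01)])  # Z = 0.01 fixed

plt.plot(x_sweep, svr.predict(scaler.transform(grid)))
plt.xlabel("X (m)")
plt.ylabel("Predicted C (mol/m^3)")
plt.show()
```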

Figure 6 shows how the X-coordinate affects the predicted concentration (C) with Y = 0 and Z = 0.01 fixed. The non-linear curve reveals smooth concentration changes along the X-axis, indicating its role in the drying process. The SVR model accurately captures these variations, with peaks at specific X values reflecting spatial gradients in the sample.

Fig. 6. Concentration-X(m) partial dependency (y = 0, z = 0.01).

Figure 7 displays the Y-coordinate's impact on concentration (C) with X = 0 and Z = 0.01 held constant. The plot shows a non-linear trend, with concentration varying across Y values. The SVR model effectively detects these changes, highlighting areas of higher solute retention and the Y-coordinate's influence on drying dynamics. The lowest observed concentration can be attributed to the concentration gradient across the sample, arising from mass transfer by molecular diffusion.

Fig. 7. Concentration-Y(m) partial dependency (x = 0, z = 0.01).

Figure 8 shows the Z-coordinate’s effect on predicted concentration (C) with X = 0 and Y = 0 fixed. Concentration decreases nearly linearly with increasing Z, suggesting a strong, consistent decline along the Z-axis, likely tied to moisture removal in drying.

Fig. 8. Concentration-Z(m) partial dependency (x = 0, y = 0).

Figs. 9, 10 and 11 are 3D plots illustrating the combined effects of spatial coordinates on predicted concentration (C) in the pharmaceutical drying process, modeled by the SVR. Figure 9 shows the X and Z coordinates’ impact (Y = 0 fixed), displaying a non-linear surface with Z driving steeper concentration gradients, indicating its strong influence. Figure 10 depicts X and Y’s effect (Z = 0.01 constant), revealing a non-linear surface with concentration peaks, highlighting complex X-Y interactions. Figure 11 illustrates Y and Z’s influence (X = 0 fixed), with Z dominating the concentration decline and Y contributing subtler variations. The SVR model accurately captures these non-linear spatial dependencies, showcasing its effectiveness in simulating drying dynamics.

Fig. 9. Concentration as a function of X(m) and Z(m) - Constant Y = 0.

Fig. 10. Concentration as a function of X(m) and Y(m) - Constant Z = 0.01.

Fig. 11. Concentration as a function of Y(m) and Z(m) - Constant X = 0.

In addition, the 3D plot in Fig. 12 clearly shows a concentration gradient and effectively illustrates that Z(m) has the dominant effect on the predicted concentration. The solute concentration can be observed to change during the drying process, which is the outcome of mass transfer inside the sample. The model can also be built to track concentration versus time, to determine when the target point has been reached and the drying process can be stopped. Overall, the results reveal that combining mass transfer modeling and machine learning is a useful strategy for optimizing the freeze-drying process and finding the optimum drying time11. A similar trend has been observed for prediction of the temperature distribution in the freeze-drying process via ML11,15.

Fig. 12. Relationship between X(m), Y(m), and Z(m) with predicted concentration.

While this work concentrated on three representative baseline models (DT, RR, and SVR) to provide a transparent comparison under consistent optimization conditions, we recognize that more advanced approaches such as Random Forest, Gradient Boosting, and Neural Networks could potentially deliver further improvements. Moreover, emerging paradigms like graph representation learning have shown promising results in engineering applications and may offer powerful ways to capture spatial dependencies in future studies34,35. Nevertheless, for the present work, we intentionally restricted the scope to single models to maintain interpretability, reduce computational burden, and clearly demonstrate the performance gains achievable through Dragonfly Algorithm–based hyperparameter tuning.

Despite the promising outcomes, some limitations should be acknowledged. First, the study relied on a simulated dataset, and real experimental validation would be necessary to confirm the robustness of the proposed models under practical drying conditions. Second, although the Dragonfly Algorithm enhanced predictive accuracy, its computational cost may limit scalability in large-scale industrial systems. Finally, the present work focused exclusively on spatial concentration prediction; future investigations could extend the framework to incorporate temporal dynamics, variability in excipient composition, or process disturbances commonly encountered in pharmaceutical drying environments.

Conclusion

In this paper, we used a large dataset to make accurate predictions of concentration (C). We employed the Isolation Forest method for outlier detection and removal, followed by training three machine learning models to predict concentration (C). Hyperparameter optimization was performed using the DA.

This work shows that, in predicting concentration (C) based on the coordinates (X, Y, Z), SVR optimized using the DA considerably outperforms the DT and RR models. With an RMSE of 1.2619E-03, an MAE of 7.78946E-04, and an R² test score of 0.999234, the SVR model showed the highest accuracy. These results provide important new information for chemical engineering applications and validate SVR as a highly accurate and dependable technique for spatial concentration prediction.