CN118609814A

CN118609814A - A blood glucose concentration prediction method, prediction system, terminal device, and storage medium based on machine learning

Info

Publication number: CN118609814A
Application number: CN202410755598.9A
Authority: CN
Inventors: 王伟; 张紫鹏; 王召巴; 杜文斌; 赵晓宙
Original assignee: North University of China
Current assignee: North University of China
Priority date: 2024-06-12
Filing date: 2024-06-12
Publication date: 2024-09-06

Abstract

The application provides a machine-learning-based blood glucose concentration prediction method, prediction system, terminal device, and storage medium. The prediction method comprises the following steps: acquiring an odor feature data set of a subject and the corresponding measured raw blood glucose concentration data; performing wavelet-transform denoising and normalization on the odor feature data set; classifying the processed odor feature data set with a pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set; searching for the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set and to the predicted hyperglycemia data set, respectively; and performing regression with the pre-built regression model for each class to obtain the blood glucose concentration prediction. By combining classification and regression, the blood glucose concentration of the subject can be predicted flexibly, efficiently, and accurately, and the efficiency and generalization ability of blood glucose prediction are improved. The application is applicable to the technical field of blood glucose concentration detection.

Description

A blood glucose concentration prediction method, prediction system, terminal device, and storage medium based on machine learning

Technical Field

The present application relates to the technical field of blood glucose concentration detection, and specifically to a machine-learning-based blood glucose concentration prediction method, prediction system, terminal device, and storage medium.

Background Art

With socioeconomic development, population aging, and changes in lifestyle, the prevalence and incidence of diabetes are rising worldwide. Diabetes is a chronic disease marked by hyperglycemia and caused by absolutely or relatively insufficient insulin secretion together with impaired insulin utilization. It is mainly divided into three types: type 1, type 2, and gestational diabetes. Its cause is mainly attributed to the combined effect of genetic and environmental factors, including decreased insulin secretion due to islet cell dysfunction, insensitivity of the body to insulin, or both, so that glucose in the blood cannot be effectively utilized and stored. The disease clusters in some patients and their families. Monitoring and accurately predicting blood glucose is therefore a technical problem in urgent need of a solution.

In recent years, researchers in China and abroad have begun to explore odor recognition as a tool for diabetes detection. For example, deep neural networks have been used to process large-scale odor data and learn specific patterns of exhaled gas to assist in diagnosing diabetes, and random forest algorithms have been used to build an association model between odor features and diabetes for diagnosis. However, these algorithms do not identify blood glucose well and do not yield accurate blood glucose levels.

Summary of the Invention

To solve one of the above technical defects, the present application provides a machine-learning-based blood glucose concentration prediction method, prediction system, terminal device, and storage medium.

According to a first aspect of the present application, a machine-learning-based blood glucose concentration prediction method is provided, comprising the following steps:

acquiring an odor feature data set of a subject and the corresponding measured raw blood glucose concentration data, wherein the odor feature data set is collected through a gas sensor array;

performing wavelet-transform denoising and normalization on the odor feature data set to obtain a processed odor feature data set;

classifying the processed odor feature data set with a pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set;

searching for the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set and to the predicted hyperglycemia data set, respectively;

performing regression on the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set with a pre-built regression model for predicting hypoglycemia concentration, and on the measured raw blood glucose concentration data corresponding to the predicted hyperglycemia data set with a pre-built regression model for predicting hyperglycemia concentration, to obtain the blood glucose concentration prediction result of the subject, and outputting the prediction result.
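The classify-then-regress flow of the steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function `predict_glucose`, the labels `"low"` and `"high"`, and the three toy models are hypothetical stand-ins for the trained classifier and regressors, and the sketch routes odor feature vectors to the regressors, which is one reading of the steps.

```python
from typing import Callable, List, Sequence, Tuple

Features = Sequence[float]


def predict_glucose(
    samples: Sequence[Features],
    classify: Callable[[Features], str],
    regress_low: Callable[[Features], float],
    regress_high: Callable[[Features], float],
) -> List[Tuple[str, float]]:
    """Route each sample to the regressor that matches its predicted class."""
    results = []
    for g in samples:
        label = classify(g)  # "low" or "high" (hypothetical labels)
        model = regress_low if label == "low" else regress_high
        results.append((label, model(g)))
    return results


# Toy stand-ins for the trained models; the numbers have no clinical meaning.
def toy_classifier(g: Features) -> str:
    return "low" if sum(g) / len(g) < 0.5 else "high"


def toy_low_regressor(g: Features) -> float:
    return 3.0 + sum(g)


def toy_high_regressor(g: Features) -> float:
    return 8.0 + sum(g)


predictions = predict_glucose(
    [[0.1, 0.2], [0.9, 0.8]],
    toy_classifier,
    toy_low_regressor,
    toy_high_regressor,
)
```

Passing the models in as callables keeps the routing logic independent of how the classifier and the two regressors are trained.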

Preferably, the method for constructing the classification model comprises the following steps:

acquiring an odor feature data set of experimental samples and the corresponding blood glucose concentration data;

preprocessing the data to handle missing values and outliers in the odor feature data set and the blood glucose concentration data;

dividing the preprocessed odor feature data set of the experimental samples into a first training set and a first test set;

initializing the parameters of the XGBoost model, including the type of weak learner, the minimum loss reduction threshold required to split a leaf node, the learning rate, and the maximum tree depth;

optimizing the parameters of the XGBoost model with the CFAWOA algorithm, outputting the optimal parameters, and assigning them to the XGBoost model to obtain the optimized XGBoost model;

training the optimized XGBoost model: the first training set is fed into the optimized XGBoost model, and the model is trained by adding weak learners that update its predictions; when the maximum number of iterations is reached or the prediction performance no longer improves, training stops and the trained XGBoost model is used as the classification model; otherwise, weak learners continue to be added to the model from the previous round;

evaluating the classification model: the mean square error or root mean square error between the classification results that the trained XGBoost model produces for the first test set and the corresponding blood glucose concentration data is computed and used to evaluate the classification model.
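The data split and the error metrics named in the steps above can be sketched as follows. This is a minimal illustration under assumptions: the helper names, the 20% test fraction, and the fixed random seed are hypothetical choices, since the text does not state how the first training set and first test set are divided.

```python
import math
import random
from typing import List, Sequence, Tuple


def train_test_split(
    samples: Sequence, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[List, List]:
    """Shuffle the samples and split them into (training set, test set)."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_test = max(1, int(round(len(shuffled) * test_fraction)))
    return shuffled[n_test:], shuffled[:n_test]


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean square error between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error, in the same unit as the targets."""
    return math.sqrt(mse(y_true, y_pred))


train_set, test_set = train_test_split(range(10))
```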

More preferably, the methods for constructing the regression model for predicting hypoglycemia concentration and the regression model for predicting hyperglycemia concentration both comprise the following steps:

dividing the preprocessed odor feature data set of the experimental samples into a hypoglycemia training set and a hyperglycemia training set according to blood glucose concentration; the hypoglycemia training set consists of S experimental samples, $D_{\text{low}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_S, y_S)\}$, where $g_s \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_s$ is the blood glucose concentration of the s-th hypoglycemia experimental sample, $s = 1, 2, \ldots, S$, and S is the number of experimental samples in the hypoglycemia training set; the hyperglycemia training set consists of V experimental samples, $D_{\text{high}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_V, y_V)\}$, where $g_v \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_v$ is the actual blood glucose value of the v-th hyperglycemia experimental sample, $v = 1, 2, \ldots, V$, and V is the number of experimental samples in the hyperglycemia training set;

constructing the regression model for predicting hypoglycemia concentration from the hypoglycemia training set, and constructing the regression model for predicting hyperglycemia concentration from the hyperglycemia training set.
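The partition into the two training sets can be sketched as below. This is a minimal illustration: the text does not state the blood glucose concentration that separates the two sets, so `threshold` is a hypothetical parameter, and the value 7.0 in the example is a placeholder with no clinical meaning.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], float]  # (odor features g, blood glucose y)


def split_by_glucose(
    samples: Sequence[Sample], threshold: float
) -> Tuple[List[Sample], List[Sample]]:
    """Return (low set, high set) by comparing each y with the threshold."""
    low = [(g, y) for g, y in samples if y < threshold]
    high = [(g, y) for g, y in samples if y >= threshold]
    return low, high


# Toy data: two odor features per sample; values are illustrative only.
data = [([0.1, 0.4], 4.8), ([0.7, 0.9], 9.6), ([0.2, 0.3], 5.5)]
d_low, d_high = split_by_glucose(data, threshold=7.0)
```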

Preferably, optimizing the parameters of the XGBoost model with the CFAWOA algorithm, outputting the optimal parameters, and assigning them to the XGBoost model to obtain the optimized XGBoost model comprises the following steps:

S10, initialize the number of agents and the maximum number of iterations of the WOA algorithm, and randomly initialize the position of every agent in each dimension within the value range;

S20, compute the decision variable length of the WOA algorithm;

S30, set the parameters of the WOA algorithm and randomly generate the initial whale positions; the parameters of the WOA algorithm include the population size N, the dimension of the search space, the whale velocity, and the adaptive weight;

S40, take the mean square error as the fitness value, compute the fitness f(x) of each whale, and record the current best individual and its position;

S50, apply the sin chaotic self-mapping model to initialize the population and improve the population distribution; in the sin chaotic self-mapping model, $n = 0, 1, 2, \ldots, N$, $-1 \le x_n \le 1$, and $x_n \ne 0$;

S60, dynamically adjust the adaptive weight ω, where f(x) is the fitness of whale x, u is the best fitness value in the whale population in the first iteration, and iter is the current iteration number;

S70, update the control parameter a, whose value decreases linearly from 2 to 0: $a = 2 - 2t/T_{\max}$, where t is the current iteration number and $T_{\max}$ is the maximum number of iterations;

update the current position X(t) of the whale;

update the values of A, C, l, and p, with $A = 2ar_1 - a$ and $C = 2r_2$, where A is a random number in $[-a, a]$, l is a random number in $(-1, 1)$, and $r_1$, $r_2$ are random numbers in $(0, 1)$;

select the position vector $X_{\text{worst}}$ of the worst whale and the position vector $X^*(t)$ of the best whale in the population;

S80, dynamically adjust the value of b according to the current state of the search, selecting the best constant b, which defines the shape of the spiral;

S90, the whale population selects a behavior model to update the position of the current whale;

S100, update the position $X^*(t)$ of the best whale according to the worst whale position $X_{\text{worst}}$, and compute the fitness f(x) of the individuals in the population;

S110, determine whether the maximum number of iterations has been reached; if so, end the loop and output the optimal parameters; otherwise, return to step S30;

or determine whether the fitness value of the current best solution equals the minimum fitness value of the previous iterations; if so, end the loop and output the optimal parameters; otherwise, return to step S30;

S120, assign the optimal parameters to the XGBoost model to obtain the optimized XGBoost model.
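Two pieces of the procedure above, the chaotic initialization of S50 and the control parameter of S70, can be sketched as follows. The sketch rests on assumptions: the map equation is not reproduced in the text, so the common sine self-map $x_{n+1} = \sin(2/x_n)$ is assumed because it matches the stated constraints; the rescaling from $[-1, 1]$ to the search bounds, the seed `x0`, and the example bounds are hypothetical. The linear schedule follows from the statement that a falls linearly from 2 to 0.

```python
import math
from typing import List, Sequence, Tuple


def sine_chaotic_sequence(length: int, x0: float = 0.7) -> List[float]:
    """Iterate x_{n+1} = sin(2 / x_n); values stay in [-1, 1] (assumed map)."""
    if x0 == 0.0 or not -1.0 <= x0 <= 1.0:
        raise ValueError("x0 must be non-zero and lie in [-1, 1]")
    seq, x = [], x0
    for _ in range(length):
        x = math.sin(2.0 / x)
        if x == 0.0:  # guard against division by zero on the next step
            x = 1e-12
        seq.append(x)
    return seq


def init_population(
    n_agents: int, bounds: Sequence[Tuple[float, float]], x0: float = 0.7
) -> List[List[float]]:
    """Rescale chaotic values from [-1, 1] into each dimension's bounds."""
    dim = len(bounds)
    chaos = sine_chaotic_sequence(n_agents * dim, x0)
    population = []
    for i in range(n_agents):
        agent = []
        for d, (lo, hi) in enumerate(bounds):
            value = lo + (chaos[i * dim + d] + 1.0) / 2.0 * (hi - lo)
            agent.append(min(hi, max(lo, value)))  # clamp rounding overshoot
        population.append(agent)
    return population


def control_parameter(t: int, t_max: int) -> float:
    """Linear decrease of a from 2 (t = 0) to 0 (t = t_max)."""
    return 2.0 - 2.0 * t / t_max


# Hypothetical bounds for (max_depth, learning_rate, gamma).
bounds = [(3.0, 10.0), (0.01, 0.3), (0.0, 1.0)]
population = init_population(30, bounds)
```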

More preferably, step S90, in which the whale population selects a behavior model to update the position of the current whale, comprises the following steps:

preset a probability threshold $p_i$;

if $p < p_i$, perform the following:

if $|A| < 1$, update the position of the current whale with the shrinking-encircling position update model, $X(t+1) = \omega X^*(t) - A \cdot D$, where $D = |C X^*(t) - X(t)|$;

if $|A| \ge 1$, update the position of the current whale with the search-for-prey position update model, and update the position of the current worst whale with the feedback model; the search-for-prey position update model is $X(t+1) = X_{\text{rand}} - A \cdot D$, where $D = |C X_{\text{rand}} - X(t)|$ and $X_{\text{rand}}$ is the position vector of a randomly selected whale; the feedback model is $X_{\text{worstnew}} = X_{\text{worst}} - r \cdot (X_p - X_{\text{worst}})$, where r is a coefficient, and $X_{\text{worstnew}}$ is accepted if it is better than $X_{\text{worst}}$;

if $p \ge p_i$, update the position of the current whale with the spiral position update model, $X(t+1) = \omega X^*(t) + D_p e^{bl} \cos(2\pi l)$, where $D_p = |X^*(t) - X(t)|$.
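The branch selection and the three update models above can be sketched as one function, with the feedback model as a second. This is a minimal illustration under assumptions: A and C are drawn once per whale rather than per dimension, lower fitness is taken as better because the fitness is a mean square error, and since the text does not define $X_p$ or r, both are plain arguments supplied by the caller. All function and parameter names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def update_position(
    x: Sequence[float],
    x_best: Sequence[float],
    population: Sequence[Sequence[float]],
    a: float,
    omega: float,
    b: float,
    p_i: float,
    rng: random.Random,
) -> List[float]:
    """Return X(t+1) for one whale using the three update models."""
    A = 2.0 * a * rng.random() - a      # A = 2*a*r1 - a
    C = 2.0 * rng.random()              # C = 2*r2
    l = rng.uniform(-1.0, 1.0)
    p = rng.random()
    if p < p_i:
        if abs(A) < 1.0:
            # Shrinking encirclement around the best whale.
            return [omega * xb - A * abs(C * xb - xi) for xi, xb in zip(x, x_best)]
        # Search for prey around a randomly chosen whale.
        x_rand = rng.choice(population)
        return [xr - A * abs(C * xr - xi) for xi, xr in zip(x, x_rand)]
    # Spiral update toward the best whale.
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [omega * xb + abs(xb - xi) * spiral for xi, xb in zip(x, x_best)]


def feedback_worst(
    x_worst: Sequence[float],
    x_p: Sequence[float],
    r: float,
    fitness: Callable[[Sequence[float]], float],
) -> List[float]:
    """Apply the feedback model; keep the candidate only if it is better."""
    candidate = [xw - r * (xp - xw) for xw, xp in zip(x_worst, x_p)]
    return candidate if fitness(candidate) < fitness(x_worst) else list(x_worst)
```

With `a = 0` the coefficient A is zero, so the encirclement branch reduces to `omega * x_best`, which makes the role of the adaptive weight ω easy to check in isolation.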

Preferably, the regression model for predicting hypoglycemia concentration is an adaptive boosting regression model:

where G(g) is the median of all $\omega'_k h_k(g_s)$, $k = 1, 2, \ldots, K$, K is the number of weak learners, $s = 1, 2, \ldots, S$, S is the number of experimental samples in the hypoglycemia training set, $\omega'_k$ is the weight of the k-th weak learner, and $h_k(g_s)$ is the prediction of the k-th weak learner for the s-th hypoglycemia experimental sample.
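One way to combine the weak learners is sketched below. It is an assumption rather than the formula of the application, which is not reproduced in the text: the sketch takes the weighted median of the weak learner predictions $h_k(g)$ with weights $\omega'_k$, as in the widely used AdaBoost.R2 scheme, whereas the wording could also be read as the median of the products $\omega'_k h_k(g)$. The function names are hypothetical.

```python
from typing import Callable, List, Sequence


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Smallest value whose cumulative weight reaches half the total weight."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and equal length")
    pairs = sorted(zip(values, weights))
    half = 0.5 * sum(w for _, w in pairs)
    cumulative = 0.0
    for value, weight in pairs:
        cumulative += weight
        if cumulative >= half:
            return value
    return pairs[-1][0]


def adaboost_predict(
    g: Sequence[float],
    learners: List[Callable[[Sequence[float]], float]],
    weights: Sequence[float],
) -> float:
    """Combine the weak learner predictions h_k(g) by weighted median."""
    return weighted_median([h(g) for h in learners], weights)


# Three toy weak learners with equal weights.
toy_learners = [lambda g: g[0], lambda g: g[0] + 1.0, lambda g: g[0] + 2.0]
combined = adaboost_predict([3.0], toy_learners, [1.0, 1.0, 1.0])
```

A median-based combination is less sensitive to a single badly wrong weak learner than a weighted mean would be.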

More preferably, the regression model for predicting hyperglycemia concentration is a gradient boosting regression model:

where $g \in R_{qj}$, $L(y_v, c)$ is the loss function, $v = 1, 2, \ldots, V$, V is the number of experimental samples in the hyperglycemia training set, $y_v$ is the actual blood glucose value of the v-th hyperglycemia experimental sample, c is the cumulative model obtained in round q-1, Q is the number of iteration rounds, $c_{qj}$ is the best fitting value of the q-th regression tree at its j-th leaf node, $j = 1, 2, \ldots, J$, and J is the number of leaf nodes.
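The role of the leaf values $c_{qj}$ can be illustrated with a small gradient boosting sketch. It is a simplification under stated assumptions, not the model of the application: the loss is taken to be squared error, each round fits a two-leaf regression tree (a stump) on a single feature, and a learning rate is added. Under squared error, the best fitting value of a leaf is the mean residual of the samples that fall into it. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Stump = Tuple[float, float, float]  # (threshold, left leaf value, right leaf value)


def fit_stump(xs: Sequence[float], residuals: Sequence[float]) -> Stump:
    """Best single split on a 1-D feature under squared error."""
    best = None
    order = sorted(set(xs))
    for lo, hi in zip(order, order[1:]):
        thr = (lo + hi) / 2.0
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        c_left = sum(left) / len(left)      # leaf value = mean residual
        c_right = sum(right) / len(right)
        sse = sum((r - c_left) ** 2 for r in left) + sum((r - c_right) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, thr, c_left, c_right)
    if best is None:  # all inputs identical: one leaf holds every sample
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def fit_gbr(xs: Sequence[float], ys: Sequence[float], rounds: int = 50,
            learning_rate: float = 0.1) -> Tuple[float, List[Stump], float]:
    """Fit Q = rounds stumps, each on the residuals of the cumulative model."""
    f0 = sum(ys) / len(ys)                  # initial constant model
    preds = [f0] * len(ys)
    stumps: List[Stump] = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, c_left, c_right = fit_stump(xs, residuals)
        stumps.append((thr, c_left, c_right))
        preds = [p + learning_rate * (c_left if x <= thr else c_right)
                 for x, p in zip(xs, preds)]
    return f0, stumps, learning_rate


def predict_gbr(model: Tuple[float, List[Stump], float], x: float) -> float:
    """Sum the initial model and the scaled leaf values of every stump."""
    f0, stumps, lr = model
    return f0 + sum(lr * (cl if x <= thr else cr) for thr, cl, cr in stumps)


toy_model = fit_gbr([1.0, 2.0, 3.0, 4.0], [8.0, 8.0, 12.0, 12.0],
                    rounds=100, learning_rate=0.5)
```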

According to a second aspect of the present application, a machine-learning-based blood glucose concentration prediction system is provided, comprising:

a data acquisition unit, configured to acquire an odor feature data set of a subject and the corresponding measured raw blood glucose concentration data, wherein the odor feature data set is collected through a gas sensor array;

a preliminary processing unit, configured to perform wavelet-transform denoising and normalization on the odor feature data set to obtain a processed odor feature data set;

a classification unit, configured to classify the processed odor feature data set of the subject with a pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set;

a search unit, configured to search for the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set and to the predicted hyperglycemia data set, respectively;

a blood glucose concentration prediction unit, configured to perform regression on the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set with a pre-built regression model for predicting hypoglycemia concentration, and on the measured raw blood glucose concentration data corresponding to the predicted hyperglycemia data set with a pre-built regression model for predicting hyperglycemia concentration, to obtain the blood glucose concentration prediction result of the subject;

a prediction result output unit, configured to output the blood glucose concentration prediction result.

According to a third aspect of the present application, a terminal device is provided, comprising:

a memory;

a processor; and

a computer program;

wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the blood glucose concentration prediction method described in any one of the above.

According to a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored; the computer program is executed by a processor to implement the blood glucose concentration prediction method described in any one of the above.

The present application combines classification and regression so that the blood glucose concentration of the subject can be predicted flexibly, efficiently, and accurately. The classification model provides class outputs that are easy to interpret, which simplifies subsequent work; it can also capture complex relationships between features, which helps reduce the feature dimension in the regression models. The classification model determines the predicted class of the blood glucose concentration; after classification, the odor feature data of each class are regressed with the corresponding regression model, yielding an accurate numerical prediction of the blood glucose concentration. The classification model is more resistant to outliers and data errors than the regression models. Classifying first and then regressing simplifies the problem, improves the efficiency and generalization ability of blood glucose prediction, improves the interpretability and robustness of the prediction process, and better captures the nonlinear relationships in the odor feature data.

Other features and advantages of the present application will be set forth in the description that follows and will in part become apparent from the description or be understood by practicing the present application. The objectives and other advantages of the present application can be realized and obtained through what is pointed out in the written description and the accompanying drawings.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions are used to explain the present application and do not unduly limit it. In the drawings:

FIG. 1 is a flow chart of the machine-learning-based blood glucose concentration prediction method provided in the present application;

FIG. 2 is the ROC curve of prediction classification using classification method 1;

FIG. 3 is the ROC curve of prediction classification using classification method 2 provided in the present application;

FIG. 4 shows the results of predicting blood glucose concentration directly with a gradient boosting regression model;

FIG. 5 shows the results of predicting blood glucose concentration with the machine-learning-based blood glucose concentration prediction method provided in the present application;

FIG. 6 is a schematic structural diagram of the machine-learning-based blood glucose concentration prediction system provided in the present application;

FIG. 7 is a schematic structural diagram of the classification unit provided in the present application;

In the figures: 100 is the data acquisition unit, 110 is the preliminary processing unit, 120 is the classification unit, 130 is the search unit, 140 is the blood glucose concentration prediction unit, 150 is the prediction result output unit, 1201 is the sample data acquisition module, 1202 is the data preprocessing module, 1203 is the division module, 1204 is the parameter initialization module, 1205 is the optimization module, 1206 is the classification model training module, and 1207 is the classification model evaluation module.

Detailed Description

To make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present application, not an exhaustive list of all of them. Where no conflict arises, the embodiments in the present application and the features in the embodiments may be combined with one another.

As shown in FIG. 1, in response to the above problems, an embodiment of the present application provides a machine-learning-based blood glucose concentration prediction method, comprising the following steps:

acquiring an odor feature data set of a subject and the corresponding measured raw blood glucose concentration data, wherein the odor feature data set is collected through a gas sensor array comprising the following sensors: WSP2110, MICS-5524, TGS2602, TGS822, SMD-1015, ENS160, TGS826, and MICS-4514;

performing wavelet denoising and normalization on the odor feature data set to obtain a processed odor feature data set;

performing regression on the measured raw blood glucose concentration data corresponding to the predicted hypoglycemia data set with a pre-built regression model for predicting hypoglycemia concentration, and on the measured raw blood glucose concentration data corresponding to the predicted hyperglycemia data set with a pre-built regression model for predicting hyperglycemia concentration, to obtain the blood glucose concentration prediction result of the subject, and outputting the prediction result.
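The wavelet denoising and normalization step listed above can be sketched for a single sensor channel as follows. The text does not name the wavelet family, the decomposition level, the thresholding rule, or the normalization scheme, so the one-level Haar transform, soft thresholding of the detail coefficients, and min-max scaling used here are all assumptions chosen for brevity.

```python
import math
from typing import List, Sequence, Tuple

_S = 1.0 / math.sqrt(2.0)


def haar_forward(signal: Sequence[float]) -> Tuple[List[float], List[float]]:
    """One-level Haar transform: (approximation, detail) coefficients."""
    approx = [(signal[i] + signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * _S for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx: Sequence[float], detail: Sequence[float]) -> List[float]:
    """Invert the one-level Haar transform."""
    out: List[float] = []
    for a, d in zip(approx, detail):
        out.extend(((a + d) * _S, (a - d) * _S))
    return out


def soft_threshold(coeffs: Sequence[float], thr: float) -> List[float]:
    """Shrink each coefficient toward zero by thr."""
    return [math.copysign(max(abs(c) - thr, 0.0), c) for c in coeffs]


def wavelet_denoise(signal: Sequence[float], thr: float) -> List[float]:
    """Threshold the detail coefficients and reconstruct the signal."""
    if len(signal) % 2:
        raise ValueError("signal length must be even for one Haar level")
    approx, detail = haar_forward(signal)
    return haar_inverse(approx, soft_threshold(detail, thr))


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values linearly into [0, 1]; a constant signal maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Toy sensor trace: denoise, then normalize.
processed = min_max_normalize(wavelet_denoise([1.0, 3.0, 5.0, 5.0], thr=10.0))
```

With a threshold of zero the transform reconstructs the input, and a large threshold replaces each pair of samples with its mean.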

In the present application, a non-invasive odor recognition system collects the exhaled gas of each subject multiple times to obtain odor feature data, and classification and regression are then combined so that the blood glucose concentration of the subject can be predicted flexibly, efficiently, and accurately. The classification model provides class outputs that are easy to interpret, which simplifies subsequent work; it can also capture complex relationships between features, which helps reduce the feature dimension in the regression models. The classification model determines the predicted class of the blood glucose concentration; after classification, the odor feature data of each class are regressed with the corresponding regression model, yielding an accurate numerical prediction of the blood glucose concentration. The classification model is more resistant to outliers and data errors than the regression models. Classifying first and then regressing simplifies the problem, improves the efficiency and generalization ability of blood glucose prediction, improves the interpretability and robustness of the prediction process, and better captures the nonlinear relationships in the odor feature data.

Further, the method for constructing the classification model comprises the following steps:

dividing the preprocessed odor feature data set of the experimental samples into a first training set and a first test set; the first training set consists of Z experimental samples, $D_{\text{first training set}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_Z, y_Z)\}$, where $g_z \in \mathbb{R}^m$, $z = 1, 2, \ldots, Z$, m is the number of odor features in each experimental sample, $y_z$ is the blood glucose concentration of the z-th experimental sample, and Z is the number of experimental samples in the first training set;

initializing the parameters of the distributed gradient boosting model (XGBoost model), including the type of weak learner, the minimum loss reduction threshold required to split a leaf node, the learning rate, and the maximum tree depth; the maximum tree depth (max_depth), the learning rate that determines the weight update step size in each iteration (learning_rate), and the minimum loss reduction threshold required to split a leaf node (gamma) are selected as the main parameters;

optimizing the parameters of the XGBoost model with the chaotic feedback adaptive whale optimization algorithm (CFAWOA algorithm), outputting the optimal parameters, and assigning them to the XGBoost model to obtain the optimized XGBoost model; in the optimized XGBoost model, max_depth is 9, learning_rate is 0.15, and gamma is 0;

training the optimized XGBoost model: the first training set is fed into the optimized XGBoost model, and the model is trained by adding weak learners that update its predictions; when the maximum number of iterations is reached or the prediction performance no longer improves, training stops and the trained XGBoost model is used as the classification model; otherwise, weak learners continue to be added to the model from the previous round; the classification model F is:

$F = \{ f_e(z) = W_{Q(z)} \}$;

where F is the classification model, that is, the finally trained XGBoost model as a set of tree structures; $e = 1, 2, \ldots, E$, and E is the number of decision trees; $f_e(z)$ is the blood glucose concentration predicted by the e-th decision tree for the odor feature data of the z-th experimental sample; Q(z) is the tree structure that maps the z-th experimental sample to a leaf node; and W is the real-valued score of the leaf nodes (the leaf node weights);

The present application classifies the odor feature data set of the subject with the pre-built classification model and optimizes the parameters of the XGBoost model with the CFAWOA algorithm, which avoids the low optimization accuracy and the tendency to fall into local minima that the standard whale optimization algorithm shows on complex function optimization problems. The CFAWOA algorithm generalizes well and can help the XGBoost model fit the data better and improve prediction performance. In the present application, the classification model is trained on the first training set and then evaluated on the first test set, which ensures the classification accuracy of the resulting model.

Furthermore, optimizing the parameters of the XGboost model with the CFAWOA algorithm, outputting the optimal parameters, and assigning the optimal parameters to the XGboost model to obtain the optimized XGboost model comprises the following steps:

S10, initialize the number of agents and the maximum number of iterations of the WOA algorithm, and randomly initialize the position of every agent in each dimension within the value range; specifically, the maximum number of iterations is initialized to 500;

Specifically, the population size is set to 30, the dimension of the search space is set to 4, the whale speed is set to 0, and the adaptive weight is set to 0.5;

S90, the whale population selects a behavior model to update the position of the current whale, comprising the following steps:

A probability threshold $p_i$ is set in advance;

If $p \ge p_i$, the spiral position-update model is used to update the position of the current whale; the spiral position-update model is: $X(t+1) = \omega X^*(t) + D_p e^{bl} \cos(2\pi l)$, where $D_p = |X^*(t) - X(t)|$;
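The spiral position-update model can be sketched as follows. This is a minimal illustration and not the applicant's implementation: the function name, the element-wise treatment of the position vectors, and the default b = 1.0 are assumptions, and l is drawn uniformly between -1 and 1.

```python
import math
import random

def spiral_update(x, x_best, omega, b=1.0, rng=random):
    """X(t+1) = omega * X*(t) + D_p * exp(b*l) * cos(2*pi*l), per dimension.

    x      : current whale position X(t)
    x_best : best position found so far, X*(t)
    omega  : adaptive inertia weight
    b      : constant defining the spiral shape (default is an assumption)
    """
    l = rng.uniform(-1.0, 1.0)  # one spiral parameter shared by all dimensions
    return [
        omega * xb + abs(xb - xi) * math.exp(b * l) * math.cos(2.0 * math.pi * l)
        for xi, xb in zip(x, x_best)
    ]

# Four dimensions, matching the search-space dimension set in S10.
new_position = spiral_update([0.2, 0.4, 0.1, 0.9], [0.3, 0.5, 0.2, 0.7], omega=0.5)
```

When the current whale already sits on the best position, $D_p$ is zero and the update reduces to $\omega X^*(t)$, which shows how the inertia weight alone scales the best solution.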

In the present application, chaos theory is introduced into the whale optimization algorithm to generate the initial population, which increases population diversity and lays the foundation for the global search. In addition, a feedback model is added in the later stage of the whale position update: through communication and learning, the worst whale moves quickly toward the best whale, which further improves the global search ability of the algorithm. An adaptive inertia weight ω is also introduced into the position-update formulas of the individual whales, and balancing exploitation and exploration in this way further improves the optimization performance of the algorithm.

Furthermore, the methods for constructing the regression model for predicting low blood glucose concentration and the regression model for predicting high blood glucose concentration both comprise the following steps:

The preprocessed odor feature data set of the experimental samples is divided by blood glucose concentration into a hypoglycemia training set and a hyperglycemia training set. The hypoglycemia training set consists of S experimental samples: $D_{\text{low}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_S, y_S)\}$, where $g_s \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_s$ is the blood glucose concentration of the s-th hypoglycemia experimental sample, $s = 1, 2, \ldots, S$, and S is the number of experimental samples in the hypoglycemia training set. The hyperglycemia training set consists of V experimental samples: $D_{\text{high}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_V, y_V)\}$, where $g_v \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_v$ is the actual blood glucose value of the v-th hyperglycemia experimental sample, $v = 1, 2, \ldots, V$, and V is the number of experimental samples in the hyperglycemia training set;

Specifically, the regression model for predicting low blood glucose concentration is an adaptive boosting regression model, and its construction method comprises the following steps:

1. Initialize the weights: the initial distribution of the hypoglycemia training set is denoted $\text{Dist}_1$, and the weight of every hypoglycemia experimental sample is initialized to 1/n. The distribution $\text{Dist}_1$ is used to train the first learner $h_1$, the distribution $\text{Dist}_k$ is used to train the k-th weak learner $h_k$, and so on;

2. Loop iteration:

The following steps are iterated for each learner in turn. The learner index is denoted k, with $k \in \{1, 2, 3, \ldots, K\}$.

1) Calculate the maximum error $E_k$ of the weak learner $h_k$ on the hypoglycemia training set;

$E_k = \max |y_s - h_k(g_s)|$;

Where $h_k(g_s)$ is the prediction of the k-th weak learner for the s-th hypoglycemia experimental sample;

2) Calculate the relative error $e_{ks}$ of $h_k$ for each hypoglycemia experimental sample;

3) Calculate the error rate $e_k$ of the current weak learner;

4) Update the weight $\omega'_k$ of the current weak learner;

5) Update the weight distribution of the hypoglycemia training set;

Where $Z_k$ is the normalization factor;

3. After the loop iteration is complete, the weighted average of the predictions of the weak learners is calculated to obtain the strong regressor, that is, the adaptive boosting regression model $H(g)$;

Where $G(g)$ is the median of all $\omega'_k h_k(g_i)$.

In the adaptive boosting regression model provided by the present application, a weak learner $h_k$ is trained on the hypoglycemia training set under its current weight distribution. The weighted error rate $e_k$ is calculated from the predictions of $h_k$ and the blood glucose concentration data of the hypoglycemia training set, and the weight $\omega'_k$ of the weak learner is calculated from $e_k$. The weights of the samples that the weak learner predicts poorly are increased and the weights of the samples it predicts well are reduced, which gives a new weight distribution. Finally, the K weak learners are combined by weight into one strong learner, giving the adaptive boosting regression model, which can accurately predict the blood glucose concentration for the predicted hypoglycemia data set.
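The loop above has the shape of the well-known AdaBoost.R2 procedure, and a minimal sketch under that assumption follows. The formulas for $e_{ks}$, $e_k$, $\omega'_k$ and the weight update are not legible in this text, so the sketch substitutes the standard AdaBoost.R2 choices (linear relative error, $\beta_k = e_k / (1 - e_k)$, learner weight $\ln(1/\beta_k)$, weighted-median combination). The regression stump used as the weak learner, all function names, and the toy data are likewise illustrative assumptions.

```python
import math

def weighted_mean(idx, y, w):
    total = sum(w[i] for i in idx)
    return sum(w[i] * y[i] for i in idx) / total

def fit_stump(X, y, w):
    """Weighted regression stump: one threshold on one feature."""
    everyone = list(range(len(y)))
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [i for i in everyone if X[i][j] <= thr]
            right = [i for i in everyone if X[i][j] > thr]
            ml, mr = weighted_mean(left, y, w), weighted_mean(right, y, w)
            err = (sum(w[i] * (y[i] - ml) ** 2 for i in left)
                   + sum(w[i] * (y[i] - mr) ** 2 for i in right))
            if best is None or err < best[0]:
                best = (err, j, thr, ml, mr)
    if best is None:  # no usable split: fall back to a constant model
        m = weighted_mean(everyone, y, w)
        return lambda g: m
    _, j, thr, ml, mr = best
    return lambda g: ml if g[j] <= thr else mr

def adaboost_r2(X, y, n_learners=10):
    n = len(y)
    w = [1.0 / n] * n                               # Dist_1: uniform weights
    learners, learner_weights = [], []
    for _ in range(n_learners):
        h = fit_stump(X, y, w)
        abs_err = [abs(yi - h(gi)) for gi, yi in zip(X, y)]
        E = max(abs_err)                            # maximum error E_k
        if E == 0.0:                                # perfect fit: keep and stop
            learners.append(h)
            learner_weights.append(1.0)
            break
        rel = [e / E for e in abs_err]              # assumed linear e_ks
        e_k = sum(wi * ri for wi, ri in zip(w, rel))  # weighted error rate
        if e_k >= 0.5:                              # learner too weak: stop
            if not learners:
                learners.append(h)
                learner_weights.append(1.0)
            break
        beta = e_k / (1.0 - e_k)
        learners.append(h)
        learner_weights.append(math.log(1.0 / beta))  # assumed learner weight
        w = [wi * beta ** (1.0 - ri) for wi, ri in zip(w, rel)]
        Z = sum(w)                                  # normalization factor Z_k
        w = [wi / Z for wi in w]
    return learners, learner_weights

def predict(learners, learner_weights, g):
    """Weighted median of the weak learners' predictions."""
    pairs = sorted((h(g), a) for h, a in zip(learners, learner_weights))
    half = 0.5 * sum(learner_weights)
    acc = 0.0
    for value, a in pairs:
        acc += a
        if acc >= half:
            return value
    return pairs[-1][0]

# Toy hypoglycemia training set: one odor feature, glucose values (illustrative).
X_low = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
y_low = [3.0, 3.0, 3.0, 3.8, 3.8, 3.5]
learners, learner_weights = adaboost_r2(X_low, y_low)
estimate = predict(learners, learner_weights, [0.35])
```

In each round, samples with a small relative error have their weight multiplied by a factor close to $\beta_k < 1$, so after normalization the poorly predicted samples carry more weight for the next weak learner.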

Furthermore, the regression model for predicting high blood glucose concentration is a gradient boosting regression model, and its construction method comprises the following steps:

Loop iteration:

1. Initialize the weak learner $H'_0(g)$: find, in the function space, a weak learner that minimizes the loss of the cumulative model once that weak learner is added;

Where $L(y_v, c)$ is the loss function and c is the cumulative model obtained in round q-1;

2. For each iteration round $q = 1, 2, \ldots, Q$, where Q is the total number of rounds, perform the following steps in turn:

1) Calculate the negative gradient for each hyperglycemia experimental sample;

The negative gradient ensures that, when the best-fit values are calculated later, the new weak learner improves the model in the direction that reduces the loss of the cumulative model;

2) Fit a CART regression tree to obtain the q-th regression tree with leaf node regions $R_{qj}$, where $j = 1, 2, \ldots, J$ and J is the number of leaf nodes;

3) Calculate and update the best-fit value $c_{qj}$ of the q-th regression tree at the j-th leaf node;

4) Obtain the strong learner, that is, the gradient boosting regression model $H'(g)$;

The gradient boosting regression model provided by the present application is a technique that learns from its own errors. It generates multiple weak learners in sequence, and each weak learner is fitted to the negative gradient of the loss function of the previous cumulative model, so that adding the weak learner reduces the loss of the cumulative model along the negative gradient direction. The resulting strong learner can accurately predict the blood glucose concentration for the predicted hyperglycemia data set.
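A minimal sketch of these steps follows. The source does not name the loss function, so squared loss is assumed here: the initial model is then the mean of the targets, the negative gradient is the residual, and the best-fit value of each leaf is the mean residual in that leaf. The depth-1 trees, the learning rate, the function names, and the toy data are illustrative assumptions.

```python
def fit_stump(X, targets):
    """Fit a depth-1 regression tree (two leaf regions) to the targets."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [t for row, t in zip(X, targets) if row[j] <= thr]
            right = [t for row, t in zip(X, targets) if row[j] > thr]
            cl, cr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - cl) ** 2 for t in left)
                   + sum((t - cr) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, cl, cr)
    if best is None:  # no usable split: constant leaf
        c = sum(targets) / len(targets)
        return lambda g: c
    _, j, thr, cl, cr = best
    return lambda g: cl if g[j] <= thr else cr

def gradient_boost(X, y, n_rounds=20, learning_rate=0.5):
    h0 = sum(y) / len(y)                  # H'_0: minimizer of squared loss
    trees = []
    pred = [h0] * len(y)
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - H)^2 with respect to H is y - H.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        tree = fit_stump(X, residuals)    # leaf values are mean residuals
        trees.append(tree)
        pred = [pi + learning_rate * tree(gi) for pi, gi in zip(pred, X)]
    return lambda g: h0 + learning_rate * sum(tree(g) for tree in trees)

def mse(model, X, y):
    return sum((yi - model(gi)) ** 2 for gi, yi in zip(X, y)) / len(y)

# Toy hyperglycemia training set: one odor feature, glucose values (illustrative).
X_high = [[0.1], [0.4], [0.6], [0.9]]
y_high = [7.2, 8.1, 9.5, 11.0]
model = gradient_boost(X_high, y_high)
baseline = lambda g: sum(y_high) / len(y_high)
```

Comparing `mse(model, X_high, y_high)` with `mse(baseline, X_high, y_high)` shows the training loss falling as trees are added, which is the behavior the paragraph above describes.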

To further illustrate how well the classification model provided by the present application classifies the preprocessed odor feature data set, 160 experimental samples were selected, divided into a training set and a test set at a ratio of 70% to 30%, and classified by two different methods.

Classification method 1: prediction and classification with the XGboost model;

Classification method 2: prediction and classification with the classification model provided by the present application;

As shown in Figures 2 and 3, the AUC of the ROC curve is 0.89 for classification method 1 and 0.94 for classification method 2. The classification model provided by the present application therefore classifies better, which provides a reliable basis for the subsequent accurate detection of blood glucose concentration by the regression models.
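For reference, the AUC of a ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, and it can be computed directly from that definition. The sketch below is illustrative only; the labels and scores are made-up values and not the experimental data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# 1 = hyperglycemia, 0 = hypoglycemia; scores are classifier outputs.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

In this toy case three of the four positive/negative pairs are ordered correctly, so the AUC is 0.75.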

To further illustrate the prediction performance of the regression model for predicting low blood glucose concentration and the regression model for predicting high blood glucose concentration provided by the present application, a comparative example and an experimental example are set up.

Comparative example: the gradient boosting regression model is applied directly to predict the blood glucose concentration of the experimental samples;

Experimental example: the predicted hypoglycemia data set and the predicted hyperglycemia data set produced by classification method 2 are passed to the adaptive boosting regression model $H(g)$ and the gradient boosting regression model $H'(g)$ provided by the present application, respectively, to obtain the blood glucose concentration prediction results.

As shown in Figure 4, when the gradient boosting regression model is applied directly to the experimental samples, the mean absolute error between the true and predicted blood glucose concentrations is 0.84 mmol/L, the root mean square error (RMSE) is 0.32, $R^2$ is 0.76, and the training time is 370 ms.

As shown in Figure 5, when the machine learning-based blood glucose concentration prediction method provided by the present application is used: for blood glucose concentrations below 0.7, the mean absolute error is 0.53 (RMSE 0.13, $R^2$ 0.93); for blood glucose concentrations of 0.7 or above, the mean absolute error is 0.84 (RMSE 0.24, $R^2$ 0.82). Overall, the mean absolute error of the method provided by the present application is approximately 0.60 mmol/L (RMSE 0.21, $R^2$ 0.87). Compared with the comparative example, the method provided by the present application reduces the error and improves the accuracy of the prediction results, and it performs better than the prior-art approach of applying a regression model directly without classification.
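The three metrics quoted above can be computed as in the following sketch. The function name and the sample values are illustrative assumptions, not the experimental data. Note that for values expressed on the same scale the RMSE is never smaller than the mean absolute error, so the reported RMSE and MAE figures may have been computed on different scales; the source does not say.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for paired true and predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# Illustrative values in mmol/L.
mae, rmse, r2 = regression_metrics([4.0, 6.0, 8.0], [5.0, 6.0, 7.0])
```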

As shown in Figure 6, an embodiment of the present application correspondingly provides a machine learning-based blood glucose concentration prediction system, comprising:

a data acquisition unit 100, configured to acquire the odor feature data set of the subject and the corresponding measured original blood glucose concentration data, the odor feature data set being collected by a gas sensor array;

a preliminary processing unit 110, configured to perform wavelet transform denoising and normalization on the odor feature data set to obtain a processed odor feature data set;

a classification unit 120, configured to classify the processed odor feature data set with the pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set;

a search unit 130, configured to search for the measured original blood glucose concentration data corresponding to the predicted hypoglycemia data set and to the predicted hyperglycemia data set, respectively;

a blood glucose concentration prediction unit 140, configured to perform regression on the measured original blood glucose concentration data corresponding to the predicted hypoglycemia data set with the pre-built regression model for predicting low blood glucose concentration, and to perform regression on the measured original blood glucose concentration data corresponding to the predicted hyperglycemia data set with the pre-built regression model for predicting high blood glucose concentration, to obtain the blood glucose concentration prediction result of the subject;

a prediction result output unit 150, configured to output the blood glucose concentration prediction result.
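The data flow through units 110 to 150 can be sketched as a classify-then-regress pipeline. This is a simplified illustration: wavelet denoising and the search step of unit 130 are omitted, the regressors are assumed to take the normalized feature vector as input, and the classifier and the two regression models are hypothetical stand-in functions rather than trained models.

```python
def min_max_normalize(rows):
    """Scale every feature column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def predict_glucose(rows, classifier, low_model, high_model):
    """Classify each sample, then send it to the matching regressor."""
    results = []
    for g in min_max_normalize(rows):
        if classifier(g) == "low":
            results.append(("low", low_model(g)))
        else:
            results.append(("high", high_model(g)))
    return results

# Hypothetical stand-ins for the trained models.
classifier = lambda g: "low" if g[0] < 0.5 else "high"
low_model = lambda g: 3.0 + g[0]
high_model = lambda g: 7.0 + 4.0 * g[0]

readings = [[10.0, 1.0], [20.0, 3.0], [30.0, 2.0]]  # raw sensor values
results = predict_glucose(readings, classifier, low_model, high_model)
```

Keeping the classifier and the two regressors as separate callables mirrors the unit structure above, so each model can be trained and replaced independently.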

As shown in Figure 7, the classification unit 120 specifically comprises:

a sample data acquisition module 1201, configured to acquire the odor feature data set of the experimental samples and the corresponding blood glucose concentration data;

a data preprocessing module 1202, configured to preprocess the data so as to handle missing values and outliers in the odor feature data set and the blood glucose concentration data;

a division module 1203, configured to divide the preprocessed odor feature data set of the experimental samples into a first training set and a first test set;

a parameter initialization module 1204, configured to initialize the parameters of the distributed gradient boosting model (XGboost model), including the type of weak learner, the minimum loss reduction threshold required to split a leaf node, the learning rate, and the maximum tree depth;

an optimization module 1205, configured to optimize the parameters of the XGboost model with the chaotic feedback adaptive whale optimization algorithm (CFAWOA algorithm), output the optimal parameters, and assign the optimal parameters to the XGboost model to obtain the optimized XGboost model;

a classification model training module 1206, configured to train the optimized XGboost model on the first training set by adding weak learners that update the prediction of the XGboost model; training stops when the maximum number of iterations is reached or when the prediction performance of the XGboost model no longer improves, and the trained XGboost model is used as the classification model; otherwise, further weak learners are added to the model obtained in the previous round;

a classification model evaluation module 1207, configured to evaluate the classification model by calculating the mean square error or root mean square error between the classification results that the trained XGboost model produces for the first test set and the corresponding blood glucose concentration data.

Correspondingly, an embodiment of the present application further provides a terminal device, comprising:

a memory;

a processor; and

a computer program;

Correspondingly, an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to implement the blood glucose concentration prediction method described in any of the above.

In practice, a subject needs to monitor and obtain accurate blood glucose concentration levels in order to manage his or her health in real time. Current methods of obtaining an accurate blood glucose concentration are still invasive, which makes for a poor user experience. The present application provides a non-invasive, machine learning-based blood glucose concentration prediction method that predicts blood glucose from the odor feature data of the subject and the corresponding measured original blood glucose concentration data. By combining classification and regression, with classification first and regression second, the method simplifies the problem, improves the efficiency and generalization ability of blood glucose prediction, improves the interpretability and robustness of the prediction process, and better captures the nonlinear relationships in the odor feature data. The prediction results are therefore more accurate and provide the subject with reliable blood glucose prediction data.

Those skilled in the art will appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. The present application may therefore take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code. The solutions in the embodiments of the present application may be implemented in various computer languages, for example C, VHDL, Verilog, the object-oriented programming language Java, and the interpreted scripting language JavaScript.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Although preferred embodiments of the present application have been described, those skilled in the art may make further changes and modifications to these embodiments once they have learned the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.

Obviously, those skilled in the art can make various changes and modifications to the present application without departing from its spirit and scope. If these changes and modifications fall within the scope of the claims of the present application and their technical equivalents, the present application is intended to include them as well.

Claims

1. A method for predicting blood glucose concentration based on machine learning, characterized in that it comprises the following steps:

Acquire an odor characteristic data set of the subject to be tested and the corresponding measured original blood glucose concentration data, wherein the odor characteristic data set is acquired by collecting data through a gas sensor array;

The odor feature data set is subjected to wavelet transform denoising and normalization processing to obtain a processed odor feature data set;

The processed odor feature data set is classified by a pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set;

Searching for the measured original blood glucose concentration data corresponding to the predicted hypoglycemia dataset and the predicted hyperglycemia dataset respectively;

The measured original blood glucose concentration data corresponding to the predicted hypoglycemia data set are regressed through a pre-constructed regression model for predicting hypoglycemia concentration, and the measured original blood glucose concentration data corresponding to the predicted hyperglycemia data set are regressed through a pre-constructed regression model for predicting hyperglycemia concentration, to obtain the predicted result of the blood glucose concentration of the subject, and the predicted result of the blood glucose concentration is output.

2. The method for predicting blood glucose concentration based on machine learning according to claim 1, wherein the method for constructing the classification model comprises the following steps:

Obtain the odor feature data set and corresponding blood glucose concentration data of the experimental samples;

Preprocess the data to handle missing values and outliers in the odor feature dataset and blood glucose concentration data;

Dividing the odor feature data set of the preprocessed experimental samples into a first training set and a first test set;

Initialize the parameters of the XGboost model, including the type of weak learner, the minimum loss reduction threshold required for leaf node splitting, the learning rate, and the maximum tree depth;

The parameters of the XGboost model are optimized through the CFAWOA algorithm, the optimal parameters are output, and the optimal parameters are assigned to the XGboost model to obtain the optimized XGboost model;

Train the optimized XGboost model, substitute the first training set into the optimized XGboost model, and train the XGboost model by adding weak learners to update the prediction value of the XGboost model. When the maximum number of iterations of the XGboost model is met or the prediction performance of the XGboost model is no longer improved, stop training and use the trained XGboost model as the classification model; otherwise, continue to add weak learners to train the XGboost model after the previous round of training;

Evaluate the classification model, calculate the mean square error or root mean square error of the classification result obtained by the trained XGboost model for the first test set and the corresponding blood glucose concentration data, and then evaluate the classification model.

3. The method for predicting blood glucose concentration based on machine learning according to claim 2, characterized in that the method for constructing the regression model for predicting low blood glucose concentration and the regression model for predicting high blood glucose concentration both comprise the following steps:

The odor feature data set of the preprocessed experimental samples is divided into a hypoglycemia training set and a hyperglycemia training set according to the blood glucose concentration; the hypoglycemia training set consists of S experimental samples: $D_{\text{low}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_S, y_S)\}$, where $g_s \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_s$ is the blood glucose concentration of the s-th hypoglycemia experimental sample, $s = 1, 2, \ldots, S$, and S is the number of experimental samples in the hypoglycemia training set; the hyperglycemia training set consists of V experimental samples: $D_{\text{high}} = \{(g_1, y_1), (g_2, y_2), \ldots, (g_V, y_V)\}$, where $g_v \in \mathbb{R}^m$, m is the number of odor features in each experimental sample, $y_v$ is the actual blood glucose value of the v-th hyperglycemia experimental sample, $v = 1, 2, \ldots, V$, and V is the number of experimental samples in the hyperglycemia training set;

A regression model for predicting low blood sugar concentration is constructed using the low blood sugar training set, and a regression model for predicting high blood sugar concentration is constructed using the high blood sugar training set.

4. The method for predicting blood glucose concentration based on machine learning according to claim 2, characterized in that the optimization of the parameters of the XGboost model by the CFAWOA algorithm, outputting the optimal parameters, assigning the optimal parameters to the XGboost model, and obtaining the optimized XGboost model comprises the following steps:

S10, initialize the number of agents and the maximum number of iterations of the WOA algorithm, and randomly initialize the position values of all agents in each dimension within the value range;

S20, calculating the decision variable length of the WOA algorithm;

S30, setting parameters of the WOA algorithm, randomly generating initial whale individual position parameters, the parameters of the WOA algorithm include: population size N, dimension of search space, whale speed, and adaptive weight;

S40, taking the mean square error as the fitness value, calculating the fitness f(x) of each individual whale, and recording the current optimal individual and position;

S50, the sin chaos self-mapping model is applied to initialize the population and improve the population distribution; the sin chaos self-mapping model is: $n = 0, 1, 2, \ldots, N$; $-1 \le x_n \le 1$, $x_n \ne 0$;

S60, dynamically adjust the adaptive weight ω; where $f(x)$ is the fitness of individual whale x, u is the best fitness value in the whale population in the first iteration, and iter is the current number of iterations;

S70, update the value of the control parameter a, which decreases linearly from 2 to 0; where t is the current iteration number and $T_{\max}$ is the maximum iteration number;

Update the current whale individual position $X(t)$;

Update the values of A, C, l, and p: $A = 2 a r_1 - a$, $C = 2 r_2$, where A is a random number in $[-a, a]$, l is a random number in $(-1, 1)$, and $r_1$ and $r_2$ are random numbers in $(0, 1)$;

Select the worst individual position vector X _worst and the best individual position vector X ^* (t) in the population;

S80, dynamically adjusting the b value according to the current search situation, and selecting the optimal constant b to define the shape of the spiral;

S90, the whale population selects a behavioral model to update the current position of individual whales;

S100, update the optimal whale position X ^* (t) according to the worst whale position X _worst , and calculate the fitness f (x) of the individuals in the group;

S110, determining whether the maximum number of iterations has been reached, if so, ending the loop and outputting the optimal parameters; otherwise, returning to step S30;

or determining whether the fitness value of the current optimal solution is the same as the minimum fitness value in the previous iteration; if so, ending the loop and outputting the optimal parameters; otherwise, returning to step S30;

S120, assigning the optimal parameters to the XGBoost model to obtain an optimized XGBoost model.
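As an illustration of steps S50 and S70 and of the coefficient update, the following is a minimal Python sketch, not the claimed implementation. It assumes the commonly used sin chaos self-map $x_{n+1} = \sin(2/x_n)$, a linear rescaling of $[-1, 1]$ onto the search bounds, and the decay $a = 2 - 2t/T_{max}$; the function names, the seed `x0`, and the bounds are hypothetical.

```python
import math
import random


def sin_chaos_population(pop_size, dim, lower, upper, x0=0.7):
    """Initial positions from the map x_{n+1} = sin(2 / x_n) (assumed form)."""
    population = []
    x = x0
    for _ in range(pop_size):
        individual = []
        for _ in range(dim):
            x = math.sin(2.0 / x)
            if x == 0.0:
                # the map is undefined at 0, so nudge away from it
                x = 1e-12
            # rescale x from [-1, 1] onto [lower, upper] (assumed scaling)
            individual.append(lower + (x + 1.0) / 2.0 * (upper - lower))
        population.append(individual)
    return population


def control_parameter(t, t_max):
    """Linear decay of a from 2 (t = 0) to 0 (t = t_max)."""
    return 2.0 - 2.0 * t / t_max


def coefficients(a, rng=random):
    """A = 2*a*r1 - a and C = 2*r2.

    rng.random() draws from [0, 1), used here as a stand-in for (0, 1).
    """
    r1, r2 = rng.random(), rng.random()
    return 2.0 * a * r1 - a, 2.0 * r2
```

With this construction, every $A$ falls in $[-a, a]$ and every $C$ in $[0, 2]$, so the range of $A$ shrinks as the iterations proceed.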

5. The blood glucose concentration prediction method based on machine learning according to claim 4, characterized in that, in step S90, the whale population selecting a behavioral model to update the current position of individual whales comprises the following steps:

The probability threshold is preset as $p_i$;

If $p < p_i$, the following steps are performed:

If $|A| < 1$, the position of the current individual whale is updated using the shrinking encirclement position update model: $X(t+1) = \omega X^*(t) - A \cdot D$, where $D = |C \cdot X^*(t) - X(t)|$;

If $|A| \ge 1$, the search and foraging mode position update model is used to update the current individual whale position, and the current worst individual whale position is updated according to the feedback model; the search and foraging mode position update model is: $X(t+1) = X_{rand} - A \cdot D$, where $D = |C \cdot X_{rand} - X(t)|$ and $X_{rand}$ is a randomly selected whale position vector; the feedback model is: $X_{worstnew} = X_{worst} - r \cdot (X_p - X_{worst})$ with coefficient $r$, where $X_{worstnew}$ is accepted if it is better than $X_{worst}$;

If $p \ge p_i$, the spiral position update model is used to update the current individual whale position: $X(t+1) = \omega X^*(t) + D_p e^{bl} \cos(2\pi l)$, where $D_p = |X^*(t) - X(t)|$.
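The three branches above and the feedback model can be sketched in Python as follows. This is an illustrative reading under stated assumptions: positions are plain lists of floats, $A$ and $C$ are scalars applied element-wise, and a lower fitness (mean square error) is better. The function names are hypothetical, and $X_p$ is simply passed in by the caller because the claim does not define it.

```python
import math


def update_position(x, x_best, x_rand, A, C, l, p, p_i, omega, b):
    """Return the new position of one whale (lists of floats, same length)."""
    if p < p_i:
        if abs(A) < 1:
            # shrinking encirclement around the current best individual
            return [omega * xb - A * abs(C * xb - xi)
                    for xi, xb in zip(x, x_best)]
        # search and foraging around a randomly selected individual
        return [xr - A * abs(C * xr - xi) for xi, xr in zip(x, x_rand)]
    # spiral update toward the current best individual
    spiral = math.exp(b * l) * math.cos(2.0 * math.pi * l)
    return [omega * xb + abs(xb - xi) * spiral for xi, xb in zip(x, x_best)]


def feedback_worst(x_worst, x_p, r, fitness):
    """Propose X_worst - r*(X_p - X_worst); keep it only if fitness improves.

    Assumes lower fitness is better, as with a mean-square-error fitness.
    """
    candidate = [w - r * (xp - w) for w, xp in zip(x_worst, x_p)]
    return candidate if fitness(candidate) < fitness(x_worst) else x_worst
```

For example, with `x = [1.0, 2.0]`, `x_best = [0.0, 0.0]`, `A = 0.5`, `C = 1.0`, `omega = 1.0`, and `p < p_i`, the shrinking branch is taken and each coordinate moves by $-A \cdot D$ from the weighted best position.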

6. The method for predicting blood glucose concentration based on machine learning according to claim 3, characterized in that the regression model for predicting low blood glucose concentration is an adaptive boosting regression model:

where $G(g)$ is the median of all $\omega'_k h_k(g_s)$, $k = 1, 2, \ldots, K$, $K$ is the number of weak learners, $s = 1, 2, \ldots, S$, $S$ is the number of experimental samples in the hypoglycemia training set, $\omega'_k$ is the weight of the $k$-th weak learner, and $h_k(g_s)$ is the prediction result of the $k$-th weak learner for the $s$-th hypoglycemia experimental sample.
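The quantity $G(g)$ described above can be illustrated with a short Python sketch. It computes only the median of the weighted weak-learner outputs as the text describes it, not the complete adaptive boosting model; the function name and the representation of weak learners as callables are assumptions.

```python
import statistics


def median_of_weighted_predictions(weights, learners, g):
    """Median over k of w_k * h_k(g), following the description of G(g)."""
    weighted = [w * h(g) for w, h in zip(weights, learners)]
    return statistics.median(weighted)
```

With three hypothetical weak learners returning 5.0, 8.0, and 3.0 for a sample and weights 1.0, 0.5, and 2.0, the weighted outputs are 5.0, 4.0, and 6.0, and the median is 5.0.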

7. The method for predicting blood glucose concentration based on machine learning according to claim 6, wherein the regression model for predicting high blood glucose concentration is a gradient boosting regression model:

where $g \in R_{qj}$, $L(y_v, c)$ is the loss function, $v = 1, 2, \ldots, V$, $V$ is the number of experimental samples in the hyperglycemia training set, $y_v$ is the actual blood glucose value of the $v$-th hyperglycemia experimental sample, $c$ is the cumulative model obtained in the $(q-1)$-th round, $Q$ is the number of iterations, $c_{qj}$ is the best-fit value of the $q$-th regression tree at the $j$-th leaf node, $j = 1, 2, \ldots, J$, and $J$ is the number of leaf nodes.
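To make the role of the leaf value $c_{qj}$ concrete, the sketch below shows one boosting round in Python under the assumption of a squared-error loss, for which the best-fit constant in a leaf is the mean residual of the samples falling in that leaf. The loss choice, the function names, and the representation of leaves as lists of sample indices are assumptions, not details taken from the claim.

```python
def leaf_value_squared_loss(y_true, prev_pred, leaf_indices):
    """Best constant for one leaf under squared loss: the mean residual."""
    residuals = [y_true[v] - prev_pred[v] for v in leaf_indices]
    return sum(residuals) / len(residuals)


def boost_update(prev_pred, leaves, y_true):
    """Add each leaf's best-fit value to the previous round's predictions."""
    new_pred = list(prev_pred)
    for leaf_indices in leaves:
        c = leaf_value_squared_loss(y_true, prev_pred, leaf_indices)
        for v in leaf_indices:
            new_pred[v] += c
    return new_pred
```

Other loss functions lead to other best-fit values (for instance, the median residual for an absolute-error loss), so the mean used here is specific to the squared-error assumption.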

8. A blood glucose concentration prediction system based on machine learning, comprising:

A data acquisition unit (100) is used to acquire an odor characteristic data set of a subject to be tested and corresponding original measured blood glucose concentration data, wherein the odor characteristic data set is acquired by collecting data through a gas sensor array;

A preliminary processing unit (110) is used to perform wavelet transform denoising and normalization processing on the odor feature data set to obtain a processed odor feature data set;

A classification unit (120), used to classify the processed odor feature data set by using a pre-built classification model to obtain a predicted hypoglycemia data set and a predicted hyperglycemia data set;

A search unit (130), used to search for the measured original blood glucose concentration data corresponding to the predicted hypoglycemia data set and the predicted hyperglycemia data set respectively;

A blood glucose concentration prediction unit (140), used to perform regression processing on the measured original blood glucose concentration data corresponding to the predicted hypoglycemia data set through a pre-constructed regression model for predicting low blood glucose concentration, and to perform regression processing on the measured original blood glucose concentration data corresponding to the predicted hyperglycemia data set through a pre-constructed regression model for predicting high blood glucose concentration, so as to obtain a blood glucose concentration prediction result for the subject to be tested;

A prediction result output unit (150), used to output the blood glucose concentration prediction result.
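One plausible reading of how the classification unit (120) and the prediction unit (140) cooperate at prediction time is sketched below in Python. The three models are represented by hypothetical callables, and the labels `"low"` and `"high"` are assumptions; the sketch shows only the routing of each processed sample to the regressor matching its predicted class.

```python
def predict_glucose(samples, classify, regress_low, regress_high):
    """Route each processed odor sample to the regressor chosen by the classifier."""
    results = []
    for features in samples:
        label = classify(features)  # "low" or "high" (assumed label names)
        model = regress_low if label == "low" else regress_high
        results.append((label, model(features)))
    return results
```

Keeping the classifier and the two regressors as separate callables mirrors the unit structure of the claim and lets each model be trained or replaced independently.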

9. A terminal device, comprising:

a memory;

a processor; and

a computer program;

The computer program is stored in the memory and is configured to be executed by the processor to implement the blood glucose concentration prediction method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that a computer program is stored thereon; the computer program is executed by a processor to implement the blood glucose concentration prediction method according to any one of claims 1 to 7.