CN118333397A

CN118333397A - Prediction method for severity of marine traffic accident

Info

Publication number: CN118333397A
Application number: CN202410492527.4A
Authority: CN
Inventors: 王新建; 冯胤伟; 李键; 王焕新; 王津; 王欣; 刘正江
Original assignee: Dalian Maritime University
Current assignee: Dalian Maritime University
Priority date: 2024-04-23
Filing date: 2024-04-23
Publication date: 2024-07-12

Abstract

The present invention provides a method for predicting the severity of marine traffic accidents, comprising: constructing a dataset of marine accident risk influencing factors from marine accident investigation reports; applying feature selection methods to the constructed dataset so as to improve both the accuracy of the trained machine learning model and the interpretability of the selected features; evaluating the performance of the feature selection methods with a three-stage evaluation consisting of stability evaluation, prediction performance evaluation, and comprehensive evaluation with statistical testing; comparing six machine learning models to measure the performance of the different predictors, and taking the model that best predicts ship accident severity as the benchmark model; and predicting accident severity with the best-performing model and the optimal features, evaluating the benefits, and quantitatively assessing the effect of risk control measures through counterfactual analysis. The method can effectively analyze and predict the severity of marine accidents.

Description

A method for predicting the severity of marine traffic accidents

Technical Field

The present invention relates to the technical field of marine accident risk prediction, and in particular to a method for predicting the severity of marine traffic accidents.

Background Art

In recent years, with the development of artificial intelligence, machine learning (ML) has been widely used to analyze maritime accidents and to improve navigation safety and efficiency. Its application is indispensable for preventing accidents and establishing a safe marine environment. Accident prediction is the basis for scientific decision-making and is crucial for accident prevention. Machine learning is effective at identifying risk influencing factors (RIFs) and predicting accidents. However, its current application is largely limited to accident frequency analysis and the identification of risk influencing factors, and its ability to predict the severity of maritime accidents has yet to be developed.

Machine learning can learn and explore the patterns of accidents of different severity, which is essential for targeted accident prevention, post-accident identification, and simulation. However, few studies have attempted to estimate or predict the severity of maritime accidents. Some studies have proposed an overlay model that identifies accident-prone areas and predicts accident severity from the spatial distribution of maritime accidents. By analyzing accident clustering and spatial correlation, these studies determine patterns of accident severity, use the overlay model to identify the traffic characteristics most closely related to severity, and make predictions from those characteristics. Other studies use data-driven methods to predict the severity of maritime casualties by analyzing the relationships among the risk influencing factors of such accidents, and thereby determine the key aspects for predicting the severity of ship collisions. Although these studies provide valuable information and insights for maritime accident analysis, several research gaps remain:

1. Most studies on predicting the severity of marine accidents examine only a few risk influencing factors and have not conducted a comprehensive analysis from the perspective of safety system engineering.

2. Few studies focus on predicting the severity of maritime accidents, and those that do have shortcomings in their data processing methods or in the accuracy of their prediction models, such as data leakage and a high recall rate.

3. Existing studies rarely present both accident severity and the related risk control measures, and even fewer quantitatively analyze the effectiveness of those measures.

The above background is intended to assist in understanding the inventive concept and technical solution of the present invention. It does not necessarily belong to the prior art of this patent application. In the absence of clear evidence that the above content was disclosed before the filing date of this patent application, the above background should not be used to evaluate the novelty and inventiveness of the technical solution of this application.

Summary of the Invention

In view of the insufficient research on predicting the severity of marine accidents described above, and with the goal of preventing serious marine accidents and providing a safe marine environment, a method for predicting the severity of marine traffic accidents is provided. The present invention first analyzes relevant marine accident investigation reports from the perspective of safety system engineering to identify risk influencing factors and establish accident severity labels. Second, three feature selection (FS) methods are developed and used to determine the key risk influencing factors, and their performance is evaluated. Then, six state-of-the-art machine learning models are trained on a dataset containing the key risk influencing factors and used to predict accident severity. Finally, the key risk influencing factors are interpreted and analyzed to understand the benefits of controlling specific risk influencing factors.

To solve at least one of the technical problems mentioned in the background above, the present invention provides a method for predicting the severity of maritime traffic accidents that covers five aspects: establishing a database from the perspective of safety system engineering; developing feature selection methods; evaluating the performance of the feature selection methods; evaluating accident severity prediction; and evaluating the effectiveness of risk control measures based on counterfactual analysis. The present invention not only provides a basis for safety assessment and for research on risk prevention and control, but also avoids the data leakage problems of previous studies, filling a research gap in the field of maritime accident severity prediction.

The technical means adopted by the present invention are as follows:

A method for predicting the severity of a marine traffic accident, comprising:

S1. Constructing a dataset of marine accident risk influencing factors from marine accident investigation reports;

S2. Based on the constructed dataset of marine accident risk influencing factors, applying feature selection methods to improve the accuracy of the trained machine learning model and the interpretability of feature selection;

S3. Evaluating the performance of the feature selection methods with a three-stage performance evaluation consisting of stability evaluation, prediction performance evaluation, and comprehensive evaluation with statistical testing;

S4. Comparing six machine learning models to measure the performance of the different predictors, and taking the machine learning model with the best performance in predicting ship accident severity as the benchmark model;

S5. Predicting accident severity with the selected best-performing model and the optimal features, evaluating the benefits, and quantitatively assessing the effect of risk control measures through counterfactual analysis.

Further, step S1 specifically includes:

S11. Extracting human, ship, environmental, and management factors, together with basic accident information, from the accident investigation reports as the standard-level features of the marine accident risk influencing factor dataset;

S12. Classifying the standard-level features into 68 index-level features during the data processing stage;

S13. Converting continuous data and character data into discrete categorical data.
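Step S13 can be illustrated with a minimal sketch. The bin edges, the `SA*` labels, the `ST_OTHER` fallback, and the field names `ship_age` and `ship_type` are hypothetical and are not taken from the patent; only the codes ST1 (bulk carrier) and ST7 (fishing vessel) appear in the description. The sketch only shows how a continuous value and a character value might each be mapped to a discrete category.

```python
from bisect import bisect_right

# Hypothetical bin edges (years) for ship age; the patent does not state its cut points.
AGE_EDGES = [5, 10, 15, 20]
AGE_LABELS = ["SA1", "SA2", "SA3", "SA4", "SA5"]  # hypothetical category codes

# Mapping from free-text ship type to a category code (ST1 and ST7 are named in the text).
SHIP_TYPE_CODES = {"bulk carrier": "ST1", "fishing vessel": "ST7"}


def discretize_age(age_years: float) -> str:
    """Map a continuous ship age to a discrete category label."""
    return AGE_LABELS[bisect_right(AGE_EDGES, age_years)]


def encode_ship_type(text: str) -> str:
    """Map a character value to a category code; unknown values fall back to 'ST_OTHER'."""
    return SHIP_TYPE_CODES.get(text.strip().lower(), "ST_OTHER")


record = {"ship_age": 12.5, "ship_type": "Fishing Vessel"}
encoded = {
    "ship_age": discretize_age(record["ship_age"]),
    "ship_type": encode_ship_type(record["ship_type"]),
}
```

Keeping the cut points in one table makes it easy to audit that the same discretization is applied to training and test records.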

Further, step S2 specifically includes:

S21. Applying a feature selection method based on association rule mining and a three-weight ranking algorithm to mine the interactions among the accident features that affect marine accidents and to rank those features;

S22. Applying an accident-consequence-oriented feature selection method based on a new mutual information entropy (NMIE), which ranks the influencing factors by the information entropy between each influencing factor and the accident severity;

S23. Using the feature selection method of step S21 to mine the relationships among the features and rank them; then using the feature selection method of step S22 to calculate the mutual information entropy between the target and the features not mined by the method of step S21, and ranking those features accordingly; and combining the two partial rankings to obtain the final feature ranking.
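The combination in step S23 can be sketched as follows. The patent does not state how the two partial rankings are merged; this sketch assumes that features the association-rule-based ranking did not cover are appended after it, ordered by descending mutual information score. The function name `combine_rankings` and the scores are hypothetical.

```python
def combine_rankings(network_ranking, mi_scores):
    """Append features missing from the first ranking, ordered by descending MI score.

    network_ranking: features ranked by the association-rule-based method (best first).
    mi_scores: mapping from feature to its mutual information score with the target.
    """
    ranked = list(network_ranking)
    seen = set(ranked)
    # Only features the first method did not rank are ordered by the second method.
    remaining = [f for f in mi_scores if f not in seen]
    remaining.sort(key=lambda f: mi_scores[f], reverse=True)
    return ranked + remaining


network_ranking = ["H5", "M3", "H6"]  # illustrative output of the first method
mi_scores = {"H5": 0.30, "ST7": 0.12, "A5": 0.25, "T6": 0.05}  # made-up scores
final_ranking = combine_rankings(network_ranking, mi_scores)
```

Other merge policies (for example interleaving by normalized score) are equally compatible with the wording of the step.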

Further, step S21 specifically includes:

S211. Using association rule mining to find association, co-occurrence, or causal relationships among accident features: to find these relationships accurately and efficiently, the FP-Growth algorithm is used to mine frequent itemsets, association rules are generated by constructing an FP-tree, and the important association rules are then screened out. An association rule is considered valid if it meets the minimum support and confidence and its lift is greater than 1. A valid association rule is denoted X → Y, and the measures are calculated as follows:

$$\mathrm{Supp}(X) = \frac{N_X}{N}$$

$$\mathrm{Conf}(X \rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}$$

$$\mathrm{Lift}(X \rightarrow Y) = \frac{\mathrm{Conf}(X \rightarrow Y)}{\mathrm{Supp}(Y)}$$

Here, Supp(X) is the support of itemset X; Conf(X → Y) is the confidence of the rule X → Y; Lift(X → Y) is the lift of the rule X → Y; N_X is the number of transactions in the transaction set that contain X; and N is the total number of transactions in the transaction set.
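The three screening measures can be checked on a toy example. This is a minimal sketch using the standard textbook definitions of support, confidence, and lift; it enumerates nothing and does not implement FP-Growth itself. The transactions, the thresholds `min_support=0.3` and `min_confidence=0.6`, and the choice to apply the minimum support to the whole rule are assumptions for illustration only.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, x, y):
    """Support of X and Y together, divided by the support of X."""
    return support(transactions, set(x) | set(y)) / support(transactions, x)


def lift(transactions, x, y):
    """Confidence of X -> Y divided by the support of Y."""
    return confidence(transactions, x, y) / support(transactions, y)


def is_valid_rule(transactions, x, y, min_support=0.3, min_confidence=0.6):
    """Validity test from the text: minimum support and confidence met, lift above 1."""
    return (
        support(transactions, set(x) | set(y)) >= min_support
        and confidence(transactions, x, y) >= min_confidence
        and lift(transactions, x, y) > 1
    )


# Each transaction is the set of feature codes recorded for one accident (made-up data).
transactions = [
    {"H5", "M3", "A1"},
    {"H5", "M3"},
    {"H5", "A1"},
    {"M3", "A2"},
    {"H6", "A2"},
]

rule_ok = is_valid_rule(transactions, {"H5"}, {"A1"})
```

In this toy data the rule H5 → A1 has support 0.4, confidence of about 0.67, and lift of about 1.67, so it passes all three conditions.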

S212. Using graph theory to map the association rules among different accident features onto a complex influencing-factor interaction network: the nodes of the network represent accident features, the edges represent the associations between them, and the weight of each edge corresponds to the confidence of the association rule. The network reflects how the various accident features interact during the development of a marine accident. The feature selection method based on association rule mining and the three-weight ranking algorithm is then used to determine the importance of each accident feature in the network and to rank the accident features accordingly.
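The mapping from rules to a network can be sketched with a plain adjacency dictionary. The sketch assumes each rule has a single feature on each side, which the patent does not state, and it keeps the highest confidence when the same edge appears more than once. The function name `build_network` and the rule values are hypothetical.

```python
from collections import defaultdict


def build_network(rules):
    """Build a weighted directed graph {source: {target: confidence}} from rules.

    rules: iterable of (antecedent, consequent, confidence) triples, assuming
    single-feature antecedents and consequents.
    """
    graph = defaultdict(dict)
    for src, dst, conf in rules:
        # Edge weight is the rule confidence; keep the largest value on repeats.
        graph[src][dst] = max(conf, graph[src].get(dst, 0.0))
        # Make sure nodes without outgoing edges still appear in the graph.
        graph.setdefault(dst, {})
    return dict(graph)


rules = [("H5", "A1", 0.67), ("M3", "H5", 0.80), ("H6", "H5", 0.50)]  # made-up rules
network = build_network(rules)
```

Storing every node as a key, even when it has no outgoing edges, simplifies later steps that iterate over all nodes.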

Further, the feature selection method based on association rule mining and the three-weight ranking algorithm adopted in step S21 includes two improvements, as follows:

The first improvement: a weight-based transition probability matrix is introduced, and the confidence is used as the transition probability between adjacent risk influencing factors, replacing the definition in the original algorithm. The calculation formula is as follows:

Here, PM denotes the transition probability matrix; n is the number of influencing factors mined by the FP-Growth algorithm; g denotes the Ground node added to the original network; δ_ij is the transition probability from node n_i to node n_j; i→j indicates that an edge exists from node n_i to node n_j; e_ji describes the association from influencing factor n_j to influencing factor n_i; t is the iteration number; (matrix)_ij denotes the square matrix of order n+1 whose only nonzero element is a 1 in row i, column j; (zeros-one)_j denotes the 1×(n+1) matrix whose only nonzero element is a 1 in column j; (ones) denotes the (n+1)×1 matrix whose elements are all 1; * denotes the Hadamard product; and · denotes the matrix product. The formula also uses the in-degree of node n_j and the score obtained by node n_i at the t-th iteration.
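The role of the Ground node and of confidence-weighted transitions can be illustrated with a generic weighted LeaderRank-style iteration. This is not the patent's transition probability matrix, whose formula is not reproduced here. The sketch assumes that the Ground node is linked to every node in both directions with weight 1.0, that each node's outgoing weights are normalized to sum to 1, and that ordinary nodes start with score 1 and the Ground node with score 0. The function name and graph values are hypothetical.

```python
def leaderrank_scores(graph, iterations=100):
    """Iterate confidence-weighted score propagation with an added ground node.

    graph: {node: {successor: confidence}} with every node present as a key.
    Returns the score of every node, including the ground node '__ground__'.
    """
    nodes = sorted(graph)
    ground = "__ground__"
    # Assumption: ground links in both directions carry weight 1.0.
    weights = {u: dict(graph[u]) for u in nodes}
    for u in nodes:
        weights[u][ground] = 1.0
    weights[ground] = {u: 1.0 for u in nodes}

    scores = {u: 1.0 for u in nodes}
    scores[ground] = 0.0
    for _ in range(iterations):
        new_scores = {u: 0.0 for u in weights}
        for u, out_edges in weights.items():
            total = sum(out_edges.values())
            for v, w in out_edges.items():
                # Each node passes its score on in proportion to its edge weights.
                new_scores[v] += scores[u] * w / total
        scores = new_scores
    return scores


network = {"H5": {"A1": 0.67}, "A1": {}, "M3": {"H5": 0.80}, "H6": {"H5": 0.50}}
scores = leaderrank_scores(network)
```

Because every row of weights is normalized, the total score stays equal to the number of ordinary nodes; the score left on the Ground node is what a final redistribution step would hand back to the features.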

The second improvement: based on the graph-theoretic concepts of node out-strength and betweenness centrality, a new comprehensive weight is proposed. After the algorithm has finished iterating, this comprehensive weight is used to allocate additional score to each node from the final score of the Ground node. The calculation formula is as follows:

Here, CC_i denotes the betweenness centrality coefficient of influencing factor n_i; σ_ij(v) is the number of shortest paths from influencing factor n_i to influencing factor n_j that pass through influencing factor n_v; σ_ij is the number of shortest paths from influencing factor n_i to influencing factor n_j; μ_i is the comprehensive weight proposed in this study; and t_end is the iteration count at which the algorithm terminates. The formula also uses the out-strength of influencing factor n_i.
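The two ingredients of the comprehensive weight can be computed on a small graph as follows. This sketch takes out-strength to be the sum of a node's outgoing edge weights and computes an unnormalized betweenness from hop-count shortest paths; both choices are assumptions, and the way the patent combines the two quantities into μ_i is not reproduced. Function names and graph values are hypothetical.

```python
from collections import deque


def out_strength(graph, node):
    """Sum of the weights on a node's outgoing edges."""
    return sum(graph[node].values())


def shortest_path_counts(graph, source):
    """Breadth-first search returning hop distances and shortest-path counts."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                count[v] += count[u]
    return dist, count


def betweenness(graph, v):
    """Sum over pairs (i, j) of the share of shortest i-to-j paths passing through v."""
    info = {s: shortest_path_counts(graph, s) for s in graph}
    dist_v, count_v = info[v]
    total = 0.0
    for i in graph:
        if i == v:
            continue
        dist_i, count_i = info[i]
        if v not in dist_i:
            continue
        for j in graph:
            if j in (i, v) or j not in dist_i or j not in dist_v:
                continue
            # v lies on a shortest path exactly when the two legs add up.
            if dist_i[v] + dist_v[j] == dist_i[j]:
                total += count_i[v] * count_v[j] / count_i[j]
    return total


network = {"H5": {"A1": 0.67}, "A1": {}, "M3": {"H5": 0.80}, "H6": {"H5": 0.50}}
h5_betweenness = betweenness(network, "H5")
```

In this toy graph H5 sits on the only path from M3 to A1 and from H6 to A1, so its unnormalized betweenness is 2.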

Further, in step S22, for the accident-consequence-oriented feature selection method based on the new mutual information entropy (NMIE), the mutual information entropy is calculated as follows:

Here, NMIE(X,Y) is the mutual information entropy between feature X and label Y; m is the number of states of the label; n is the number of states of feature X; and I(x_i, y_k) is the mutual information between state i of feature X and state k of the label.
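As a point of reference for step S22, the classical Shannon mutual information between a discrete feature and a discrete label can be computed as below. This is the standard textbook quantity, not the patent's NMIE, whose formula is not reproduced here; the function name and sample data are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(feature, label):
    """Shannon mutual information (in bits) between two equal-length discrete sequences."""
    n = len(feature)
    count_x = Counter(feature)
    count_y = Counter(label)
    count_xy = Counter(zip(feature, label))
    mi = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        # Contribution of the state pair (x, y) to the total.
        mi += p_xy * log2(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi


# A feature that fully determines a balanced binary label carries one bit.
dependent = mutual_information([0, 0, 1, 1], ["minor", "minor", "serious", "serious"])
# A feature that is independent of the label carries none.
independent = mutual_information([0, 0, 1, 1], ["minor", "serious", "minor", "serious"])
```

Ranking features by a score of this kind orders them by how much their states tell about the severity label, which is the intent described for the second method.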

Further, step S3 specifically includes:

S31. Stability evaluation: the stability of a feature selection method is measured by considering the effects of randomness and of changes in training set size. A k-fold-like procedure, random seeds, and repeated experiments are used to determine how randomness and training set size affect the feature selection process, and Spearman's rho is used to measure the stability of the ranking results. The calculation formula is as follows:

Here, Sprr^{R1,R2} is the correlation between ranking R1 and ranking R2; R1 is the reference ranking; R2 is the experimental ranking; N is the number of main features; M is the set of all features included in the main features of R1; and if feature i is not among the main features of R2, then R2_i = N + 1.
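The N + 1 convention for features missing from the experimental ranking can be illustrated with the classical Spearman formula, 1 − 6Σd² / (N(N² − 1)). This is a minimal sketch under the assumption that the classical formula applies; the patent's exact expression is not reproduced here, and the function name is hypothetical.

```python
def spearman_top_n(r1, r2, n):
    """Spearman-style agreement between the top-n features of two rankings.

    r1: reference ranking (best first); r2: experimental ranking (best first).
    A top-n feature of r1 that is absent from the top-n of r2 gets rank n + 1.
    Requires n >= 2.
    """
    rank_in_r2 = {f: rank for rank, f in enumerate(r2[:n], start=1)}
    squared_diff = 0
    for rank1, f in enumerate(r1[:n], start=1):
        rank2 = rank_in_r2.get(f, n + 1)
        squared_diff += (rank1 - rank2) ** 2
    return 1 - 6 * squared_diff / (n * (n ** 2 - 1))


same = spearman_top_n(["H5", "M3", "H6"], ["H5", "M3", "H6"], n=3)
reversed_order = spearman_top_n(["H5", "M3", "H6"], ["H6", "M3", "H5"], n=3)
```

Repeating the comparison across folds and random seeds, and averaging the resulting values, gives one stability figure per feature selection method.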

S32. Prediction performance evaluation: prediction models are trained on the risk influencing factors chosen by feature selection and their accuracy is measured, in order to find the subset of risk influencing factors that achieves the highest accuracy with the fewest factors; the corresponding metrics are used to evaluate the prediction performance of the model;

S33. Comprehensive evaluation and statistical testing: the stability of the feature selection process is assessed by determining whether each feature selection method reaches an acceptable level of stability. If the best stability of a feature selection method is lower than 0.6, or if it differs from that of the most stable feature selection method by more than 0.3, its stability is considered unacceptable, and unacceptable stability means the feature selection method is invalid. For the methods with acceptable stability, the prediction performance is analyzed using the prediction performance evaluation results. If the prediction performance metrics of different feature selection methods are not similar, the method with the better metrics is considered the better-performing method. If the prediction performance metrics are similar, a 5-fold cross-validation test is performed to check for differences in the prediction performance of the machine learning models; if the significance (p-value) of the test is greater than 0.05, there is no significant difference between the prediction performance of the two methods. When the prediction performance of different feature selection methods is similar, the method with the better stability is considered to perform better.
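The acceptance rule in step S33 reduces to two threshold checks, sketched below with the 0.6 and 0.3 values and the 0.05 significance level given in the text. The function names, method names, and stability values are hypothetical.

```python
def acceptable_methods(best_stability, floor=0.6, max_gap=0.3):
    """Return the methods whose stability is acceptable.

    best_stability: mapping from method name to its best stability value.
    A method is rejected if its stability is below `floor` or trails the most
    stable method by more than `max_gap`.
    """
    top = max(best_stability.values())
    return {
        name
        for name, value in best_stability.items()
        if value >= floor and top - value <= max_gap
    }


def no_significant_difference(p_value, alpha=0.05):
    """A p-value above alpha means the two methods' performance does not differ significantly."""
    return p_value > alpha


stability = {"method_1": 0.95, "method_2": 0.70, "method_3": 0.55, "method_4": 0.62}
kept = acceptable_methods(stability)
```

Here `method_3` is rejected for falling below 0.6 and `method_4` for trailing the best method by more than 0.3, so only the first two methods proceed to the prediction performance comparison.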

Further, step S4 specifically includes:

S41. Introducing six machine learning models: support vector machine, naive Bayes classifier, random forest, AdaBoost, XGBoost, and LightGBM;

S42. Evaluating the prediction performance of the machine learning models using accuracy, precision, recall, F1 score, and area under the curve (AUC), and taking LightGBM (light gradient boosting machine), which has the best performance in predicting ship accident severity, as the benchmark model.
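Four of the five metrics in step S42 follow directly from the confusion counts, as the sketch below shows for a binary case. Treating severity as a two-class label (1 for serious, 0 otherwise) is an assumption made for illustration, and AUC is left out because it requires predicted scores rather than hard labels. The function name and sample labels are hypothetical.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for a binary classification task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


y_true = [1, 1, 1, 0, 0, 0, 0, 1]  # made-up severity labels
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # made-up model output
metrics = binary_metrics(y_true, y_pred)
```

Computing the same dictionary for each of the six models on identical test folds gives a like-for-like basis for choosing the benchmark model.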

Compared with the prior art, the present invention has the following advantages:

1. The method provided by the present invention can effectively analyze and predict the severity of marine accidents, and provides a theoretical basis and an application example for applying artificial intelligence to safety assessment and accident prevention research. Shipping companies and competent authorities can use the proposed framework and the improved machine learning model to develop a marine accident severity prediction system and establish a more effective risk control system. They can also use the proposed feature selection methods to study the interactions among risk influencing factors in the marine accident system (such as mutual drive and dependence, or mutual causality) and the correlation between risk influencing factors and accident severity, so as to identify the risk influencing factors that are critical to marine accidents, and can use the proposed feature selection evaluation framework to evaluate the validity of the key risk influencing factors identified.

2. The method provided by the present invention establishes a new research framework for predicting the severity of marine accidents. The framework covers five aspects: establishing a database from the perspective of safety system engineering; developing feature selection methods; evaluating the performance of the feature selection methods; evaluating accident severity prediction; and evaluating the effectiveness of risk control measures based on counterfactual analysis.

3. The method provided by the present invention develops three feature selection methods for marine accidents. The first method mainly uses association rule mining, an improved weighted LeaderRank algorithm, and complex network techniques to mine the associations among risk influencing factors. The second method mainly uses an improved mutual information entropy algorithm to analyze the association between risk influencing factors and accident severity. The third method combines the advantages of the first two, taking into account both the interactions among risk influencing factors and the relationship between the different states of those factors and accident severity.

4. The method provided by the present invention develops a new evaluation method that comprehensively measures the performance of the proposed feature selection methods in terms of stability, performance improvement, and statistical testing. For the stability evaluation, a new convergence and stability algorithm is proposed to improve the performance of the stability evaluation.

5. The method provided by the present invention carries out a comparative analysis of the main classical machine learning models and compares their prediction performance, provides a baseline model for developing marine accident severity prediction models, and uses statistical methods to verify that the feature selection methods are effective in improving the performance of the machine learning models.

For the above reasons, the present invention can be widely applied in fields such as marine accident risk prediction.

Brief Description of the Drawings

To illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 is a distribution diagram of the sources of the marine traffic accident reports provided by an embodiment of the present invention.

FIG. 3 illustrates the feature selection method based on association rule mining and 3-WLR provided by an embodiment of the present invention.

FIG. 4 is a tree diagram of the frequency distribution of all risk influencing factors provided by an embodiment of the present invention.

FIG. 5 is a distribution diagram of accident severity provided by an embodiment of the present invention.

FIG. 6 is a stability comparison diagram of the feature selection methods provided by an embodiment of the present invention.

FIG. 7 shows the prediction accuracy obtained with different feature selection methods provided by an embodiment of the present invention.

FIG. 8 shows the generalization performance of six advanced machine learning models trained on the risk influencing factors selected by the three feature selection methods provided by an embodiment of the present invention.

FIG. 9 shows the win, draw, and loss percentages between different machine learning models provided by an embodiment of the present invention.

FIG. 10 is a diagram of the benefits of single risk control measures provided by an embodiment of the present invention.

Detailed Description

To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. Data used in this way may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.

As shown in FIG. 1, the present invention provides a method for predicting the severity of a marine traffic accident, comprising:

S2. Based on the constructed dataset of marine accident risk influencing factors, applying feature selection methods to improve the accuracy of the trained machine learning model and the interpretability of feature selection. In this embodiment, the three feature selection methods proposed by the present invention, together with four traditional machine learning feature selection methods used as benchmarks (RFLV, SVM, RF and GBDT), are all trained on the training set data from step S1.

S4. Comparing six machine learning (ML) models to measure the performance of the different predictors, and taking the machine learning model with the best performance in predicting ship accident severity as the benchmark model;

In a preferred embodiment of the present invention, the dataset of marine accident risk influencing factors constructed in step S1 consists of the risk influencing factors (the accident features) and the accident severity label (the target category) of each accident. The dataset is constructed in three stages: extracting the marine accident dataset, labeling accident severity, and extracting accident features. Step S1 specifically includes:

In this embodiment, the main data source is the 2000-2019 data in the databases of seven maritime investigation agencies: the China Maritime Safety Administration (MSA), the German Federal Bureau of Maritime Casualty Investigation (BSU), the United States National Transportation Safety Board (NTSB), the Japan Transport Safety Board (JTSB), the Australian Transport Safety Bureau (ATSB), the Transportation Safety Board of Canada (TSB), and the United Kingdom Marine Accident Investigation Branch (MAIB). After screening, 1294 maritime accident investigation reports were obtained. From these 1294 reports, 68 risk influencing factors were identified and divided into five categories: human factors, ship factors, environmental factors, management factors, and accident information. The source distribution of the accident reports is shown in FIG. 2, and the features are described in detail in Table A of Appendix A.

FIG. 4 shows the frequency distribution of all risk influencing factors. Among the five categories, human factors and management factors occur most frequently. Among the human factors, operational errors (H5) and violations of rules and regulations (H6) occur most often. Among the management factors, safety management system defects (M3) occur relatively often. FIG. 4 also shows that collision (A1) and grounding/stranding (A2) are the main types of marine accident, and that bulk carriers (ST1) and fishing vessels (ST7) are the main ship types involved. In most accidents, traffic in the waterway was busy (ET) at the time. FIG. 5 uses bar charts to show the distribution of accident severity by accident time, accident type, ship type, and ship age. FIG. 5(a) shows that serious accidents mostly occurred between 00:00 and 04:00 (T6), and FIG. 5(b) shows that capsizing/sinking (A5) is the accident type with the highest frequency of serious accidents. FIG. 5(c) shows that accident severity varies greatly with ship type, with fishing vessels (ST7) the most prone to serious accidents. FIG. 5(d) shows that the proportion of ships involved in serious accidents increases with ship age.

In a specific implementation, as a preferred embodiment of the present invention, step S2 specifically includes:

S21. Apply a feature selection method based on association rule mining and the three-weight ranking algorithm (3-WLR) to mine the interactions between the accident features that affect marine accidents and to rank those features. Step S21 specifically includes:

S211. Use association rule mining to find association, co-occurrence or causal relationships between accident features; this step yields valuable information about how the features relate to one another. To find these relationships accurately and efficiently, the FP-Growth algorithm mines frequent itemsets by constructing an FP-tree and then generates association rules. Three common evaluation metrics, namely support, confidence and lift, are used to screen out the important rules: a rule is considered valid if it meets the minimum support and the minimum confidence and its lift is greater than 1. A valid association rule is denoted $X \Rightarrow Y$, and the metrics are calculated as follows:

$$\mathrm{Supp}(X) = \frac{N_X}{N}, \qquad \mathrm{Conf}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}, \qquad \mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Conf}(X \Rightarrow Y)}{\mathrm{Supp}(Y)}$$

where $N_X$ is the number of transactions that contain $X$ and $N$ is the total number of transactions.
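As a minimal sketch of the validity test in S211, the snippet below computes support, confidence and lift by direct counting over a handful of made-up transactions. It is not an FP-Growth implementation; the function names and the toy data are assumptions for illustration, and the default thresholds are the min_sup = 0.1 and min_con = 0.3 values noted with Table 3.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) of the rule antecedent => consequent."""
    supp_xy = support(set(antecedent) | set(consequent), transactions)
    supp_x = support(antecedent, transactions)
    supp_y = support(consequent, transactions)
    conf = supp_xy / supp_x if supp_x else 0.0
    lift = conf / supp_y if supp_y else 0.0
    return supp_xy, conf, lift


def is_valid_rule(antecedent, consequent, transactions, min_sup=0.1, min_conf=0.3):
    """A rule is kept when support and confidence reach their minimums and lift > 1."""
    supp, conf, lift = rule_metrics(antecedent, consequent, transactions)
    return supp >= min_sup and conf >= min_conf and lift > 1


# Hypothetical accident records, each reduced to the set of factors present.
toy_reports = [
    frozenset({"H5", "M3", "A1"}),
    frozenset({"H5", "M3"}),
    frozenset({"H6", "A2"}),
    frozenset({"H5", "A1"}),
    frozenset({"M3", "A2"}),
]

supp, conf, lift = rule_metrics({"H5"}, {"M3"}, toy_reports)
```

On real data the frequent itemsets would come from an FP-Growth implementation rather than from brute-force counting; only the screening criterion is illustrated here.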

The feature selection method based on association rule mining and the three-weight ranking algorithm (3-WLR) adopted in step S21 includes two optimization improvements, as follows:

S22. Apply an accident-consequence-oriented feature selection method based on a new mutual information entropy (NMIE), ranking the influencing factors (features) by the information entropy between each factor and the accident severity. In step S22, the mutual information entropy is calculated as follows:
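The snippet below is only a sketch of the building block that the NMIE definition refers to in claim 6, the mutual information $I(x_i, y_k)$ between one state of a feature and one state of the severity label. It uses the standard pointwise mutual information with a base-2 logarithm, which is an assumption; how NMIE combines these terms into a single score is not shown, and the function name and toy data are hypothetical.

```python
import math
from collections import Counter


def pointwise_mutual_information(feature_values, labels):
    """Return {(x_state, y_state): I(x, y)} for every state pair that co-occurs.

    I(x, y) = log2( p(x, y) / (p(x) * p(y)) ), estimated from relative frequencies.
    """
    n = len(feature_values)
    count_x = Counter(feature_values)
    count_y = Counter(labels)
    count_xy = Counter(zip(feature_values, labels))
    return {
        (x, y): math.log2((c / n) / ((count_x[x] / n) * (count_y[y] / n)))
        for (x, y), c in count_xy.items()
    }


# Hypothetical binary feature (1 = factor present) and severity labels.
informative = pointwise_mutual_information(
    [1, 1, 0, 0], ["severe", "severe", "minor", "minor"]
)
uninformative = pointwise_mutual_information(
    [1, 0, 1, 0], ["severe", "severe", "minor", "minor"]
)
```

A feature whose states line up with the label gives positive terms, while a feature that is independent of the label gives terms of zero, which is the intuition behind ranking features by their information about accident severity.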

In a specific implementation, as a preferred embodiment of the present invention, step S3 specifically includes:

In this embodiment, the metric used in the stability evaluation is bounded both above and below. Spearman's rho is at most 1, which gives the upper bound, and at least -1, which gives the lower bound. The more similar two ranking results are, the larger Spearman's rho becomes: a more stable feature selection method yields a value closer to 1, and a less stable one yields a value closer to -1. Finally, the metric increases monotonically with the similarity of the rankings and can be determined by the above formula.
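The bounds described above can be checked with a small sketch. It uses the textbook Spearman formula for two complete rankings of the same features without ties, which is an assumption; the variant in the claims, where a feature missing from the experimental ranking is assigned rank N+1, is not covered. The function name and the feature lists are illustrative only.

```python
def spearman_rho(reference_ranking, experimental_ranking):
    """Spearman's rho between two orderings of the same features (no ties).

    Each argument lists feature names from most to least important.
    """
    if set(reference_ranking) != set(experimental_ranking):
        raise ValueError("both rankings must contain the same features")
    n = len(reference_ranking)
    if n < 2 or len(set(reference_ranking)) != n:
        raise ValueError("need at least two distinct features")
    rank_in_experiment = {f: r for r, f in enumerate(experimental_ranking, start=1)}
    squared_diffs = sum(
        (r - rank_in_experiment[f]) ** 2
        for r, f in enumerate(reference_ranking, start=1)
    )
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


reference = ["M3", "H5", "H6", "A1"]
same_order = spearman_rho(reference, ["M3", "H5", "H6", "A1"])
reversed_order = spearman_rho(reference, ["A1", "H6", "H5", "M3"])
one_swap = spearman_rho(reference, ["H5", "M3", "H6", "A1"])
```

Identical rankings reach the upper bound, a fully reversed ranking reaches the lower bound, and a single swap of neighbours lands in between.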

S32. Prediction performance evaluation: train the prediction model on the risk influencing factors chosen by feature selection, measure the accuracy of the prediction model, and find the subset of risk influencing factors that achieves the highest accuracy with the fewest factors. (In this embodiment, the marine accident risk influencing factor dataset is divided into an 80% training set and a 20% test set by random sampling without replacement. The class-imbalanced data are handled with the support vector machine synthetic minority oversampling technique (SVM-SMOTE), which is designed for class-imbalanced data.) The prediction performance of the model is then evaluated with the corresponding metrics;
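The 80%/20% split by random sampling without replacement can be sketched with the standard library alone. The function name, the seed and the use of integer stand-ins for the 1294 reports are assumptions; the SVM-SMOTE oversampling step is not shown.

```python
import random


def split_without_replacement(records, train_fraction=0.8, seed=0):
    """Shuffle the record indices once, then cut them into train and test parts.

    Every record lands in exactly one of the two parts, so no record is drawn twice.
    """
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    cut = int(round(train_fraction * len(records)))
    train = [records[i] for i in indices[:cut]]
    test = [records[i] for i in indices[cut:]]
    return train, test


# Stand-in for the 1294 accident reports; real records would be feature vectors.
all_records = list(range(1294))
train_records, test_records = split_without_replacement(all_records)
```

Consistent with the later description of balancing the training data, any oversampling would be applied to the training part only, so that the test part keeps the original class distribution.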

In this embodiment, the stability of each feature selection method is measured with the ranking-based stability algorithm proposed above. As shown in Figure 6(a), the 3-WLR algorithm proposed in the present invention is the most robust in terms of feature selection stability: once more than 21 risk influencing factors are selected, its stability stays above 0.9, whereas the highest stability values of the PageRank and WLR algorithms are 0.8461 and 0.8541, respectively. According to Figure 6(b), the traditional embedded feature selection techniques based on GBDT and SVM are not sufficiently stable on the marine accident risk influencing factor dataset, possibly because the underlying classifiers lack robustness and generality. In contrast, the RFLV-based filter method and the random-forest-based embedded feature selection method are more stable. This may be because the variance test in RFLV is unaffected by changes in the size of identically distributed data, while random forest applies the bagging framework, which reduces classifier prediction error and enhances stability.

Compared with the traditional machine learning feature selection methods (RF, RFLV, GBDT, SVM, etc.), all three feature selection methods proposed in the present invention are more stable. The 3-WLR algorithm and the feature selection method adopted in step S23 are markedly more stable than the NMIE algorithm, possibly because association rule mining (the FP-Growth algorithm) is insensitive to data perturbations and therefore more robust. Comparing the 3-WLR algorithm with the method adopted in step S23, the latter is more stable when few risk influencing factors are selected; as the number of factors increases (beyond 30), the 3-WLR algorithm gradually becomes the more stable of the two.

In a specific implementation, as a preferred embodiment of the present invention, step S4 specifically includes:

S41. Introduce six machine learning models: support vector machine (SVM), naive Bayes classifier (NB), random forest (RF), AdaBoost, XGBoost and LightGBM;

Support vector machine (SVM): the basic model is a linear classifier defined on the feature space with the maximum margin. The basic idea is to find the separating hyperplane that correctly divides the training dataset with the largest geometric margin. With kernel functions, SVM can also solve nonlinear classification problems;

Naive Bayes classifier (NB): NB is a prediction model based on the assumption that the features are independent of one another and that the samples are independent. Its classification efficiency is stable, and it performs well when only a small amount of data is available. This study uses a naive Bayes model based on the Bernoulli distribution to predict accident severity.

Random forest (RF): RF is a standard ensemble learning model. It adopts the bagging framework: sampling is used to select sample data and train multiple independent decision trees, and the final prediction is obtained by voting.

AdaBoost: AdaBoost is an ensemble learning method based on the boosting framework. It changes the sample weights so that training concentrates on the samples that are fitted poorly, thereby improving training accuracy. The final prediction is obtained as a weighted sum of the outputs of all decision trees in the model.

XGBoost: XGBoost is also an ensemble learning method based on the boosting framework, mainly using gradient boosting with a second-order Taylor expansion of the loss. Each round concentrates on the samples that are still poorly fitted by training the next tree on updated targets rather than on the original labels. XGBoost runs efficiently, predicts accurately, and handles missing data automatically.

LightGBM: like XGBoost, LightGBM is a prediction model based on gradient boosting and a second-order Taylor expansion. In addition, LightGBM uses techniques such as the histogram optimization algorithm and gradient-based one-side sampling to improve computational efficiency and generalization ability, and it adopts a maximum-depth limit strategy to reduce computation and prevent overfitting.
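For reference, the second-order Taylor expansion mentioned for XGBoost and LightGBM is usually written as below. This is the commonly cited form from the gradient boosting literature rather than a formula taken from this document, and the symbols ($l$ for the loss, $f_t$ for the tree added in round $t$, $\Omega$ for the regularization term, $g_i$ and $h_i$ for the first and second derivatives) are introduced here as assumptions.

```latex
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ l\!\left(y_i, \hat{y}_i^{(t-1)}\right)
  + g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
g_i = \frac{\partial\, l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial\, \hat{y}_i^{(t-1)}},
\quad
h_i = \frac{\partial^2 l\!\left(y_i, \hat{y}_i^{(t-1)}\right)}{\partial \left(\hat{y}_i^{(t-1)}\right)^2}
```

Because the first term does not depend on $f_t$, each new tree is chosen using only the per-sample statistics $g_i$ and $h_i$, which is what lets histogram-based and sampling-based speed-ups operate on gradients instead of raw labels.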

S42. Use accuracy, precision, recall, F1 score and area under the curve (AUC) to evaluate the prediction performance of the machine learning models, and take the Light Gradient Boosting Machine (LightGBM), which has the best performance in predicting ship accident severity, as the benchmark model.

In this embodiment, accuracy is the proportion of correctly identified samples among all samples, and precision is the proportion of truly positive samples among those identified as positive. Recall is the proportion of correctly identified positive samples among all positive samples. Precision and recall tend to trade off against each other: raising one usually lowers the other. The F1 score is the harmonic mean of precision and recall. AUC is the area under the receiver operating characteristic (ROC) curve; the closer it is to 1, the better the model discriminates between the classes.
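A minimal sketch of these metrics, computed from the four confusion-matrix counts with the standard definitions (F1 taken as the harmonic mean of precision and recall). The function name and the counts are made up for illustration; AUC is omitted because it needs ranked scores rather than counts.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts.

    tp/fp: samples predicted positive that are truly positive/negative;
    fn/tn: samples predicted negative that are truly positive/negative.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, with "serious accident" as the positive class.
toy_metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
```

With these counts precision is higher than recall, and F1 sits between the two, closer to the lower value, as a harmonic mean does.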

In this embodiment, taking the LightGBM model as an example, 5-fold cross-validation is performed on the training data after feature selection. The experimental results are shown in Table 1 and Figure 7; the prediction performance of the other machine learning models under the different feature selection methods is given in Appendix B. Figure 7 and Table 1 show that the 3-WLR algorithm selects fewer risk influencing factors and achieves better prediction performance than the PageRank and WLR algorithms, which indicates that the improvements in the 3-WLR algorithm are effective. The results also show that a LightGBM predictor trained on the risk influencing factors selected by any of the three feature selection methods proposed in the present invention performs better than one trained on the factors selected by the four traditional feature selection methods. In particular, the feature selection method adopted in step S23 selects the fewest risk influencing factors and gives the best prediction performance.
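The 5-fold cross-validation procedure can be sketched as index bookkeeping, independent of the model being trained. The generator name, the seed and the sample count of 1035 (80% of the 1294 reports) are assumptions for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds.

    The shuffled indices are dealt into k folds; each fold serves once as the
    validation set while the remaining k-1 folds form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


splits = list(k_fold_indices(1035, k=5))
```

In use, a model would be fitted on each `train` part and scored on the matching `validation` part, and the five scores averaged.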

Table 1. Prediction performance for the number of risk influencing factor features selected by different methods

The present invention aims to predict the severity of marine accidents accurately and to develop a benchmark model for accident severity prediction. To this end, the generalization performance of six advanced machine learning models is compared. First, three different feature selection methods are used to select risk influencing factors from the training data. Second, the training data are balanced with the SVM-SMOTE technique. Then, the six machine learning models are trained on the balanced training data. Finally, the generalization performance of the models is evaluated on the test data. Figure 8 shows the generalization performance of the six models: the NB model generalizes poorly, while the three boosting-based ensemble learning models, LightGBM, AdaBoost and XGBoost, stand out.

To compare the generalization performance of the models more intuitively, the models are compared pairwise, win or loss, on the maximum value of each performance metric shown in Figure 8 (see Figure 9). Figure 9 shows that the LightGBM model generalizes best, followed by the AdaBoost and XGBoost models. The RF model generalizes slightly better than the SVM model; in particular, its area under the curve is larger, which indicates that the RF results are more reliable than the SVM results. The NB model generalizes worst.
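The pairwise win/loss comparison can be sketched as a count of the metrics on which one model's best value exceeds another's. The function name, the placeholder model names and all numbers below are made up for illustration and are not results from Figure 8 or Figure 9.

```python
def pairwise_wins(best_scores):
    """For every ordered pair (a, b), count the metrics on which a beats b.

    `best_scores` maps a model name to {metric name: best value of that metric}.
    """
    wins = {}
    for a, a_scores in best_scores.items():
        for b, b_scores in best_scores.items():
            if a == b:
                continue
            wins[(a, b)] = sum(a_scores[m] > b_scores[m] for m in a_scores)
    return wins


# Hypothetical best-metric values for three placeholder models.
toy_scores = {
    "model_A": {"accuracy": 0.80, "f1": 0.78, "auc": 0.85},
    "model_B": {"accuracy": 0.75, "f1": 0.79, "auc": 0.82},
    "model_C": {"accuracy": 0.70, "f1": 0.65, "auc": 0.72},
}
toy_wins = pairwise_wins(toy_scores)
```

A model that wins on most metrics against every other model would be the natural candidate for the benchmark role.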

According to the results in Figure 9, the LightGBM model after feature selection has the best generalization performance. Table 2 shows how much the feature selection methods of the present invention improve the generalization performance of the LightGBM model. The 3-WLR algorithm and the feature selection method adopted in step S23 select the fewest features; that is, compared with the NMIE algorithm, these two methods remove more redundant information. The NMIE algorithm also improves the generalization performance of the LightGBM model the least. Table 2 further shows that the feature selection method adopted in step S23 gives the largest improvement in the overall generalization performance of the LightGBM model.

Table 2. Improvement in the generalization performance of the predictor (LightGBM) through feature selection methods

In a specific implementation, as a preferred embodiment of the present invention, in step S5 the model with the highest prediction performance and the optimal features are used to evaluate how much controlling certain human factors and management factors can help prevent serious accidents. First, the feature selection method adopted in step S23 is taken as an example to explain the selection of key features. With a minimum support of 0.1 and a minimum confidence of 0.3, the feature ranking produced by the method adopted in step S23 is shown in Table 3. The first 30 features in Table 3 are those mined by association rules and then ranked by the 3-WLR algorithm; the remaining 38 features are ranked by the mutual information entropy algorithm. The LightGBM model is trained on the top 41 risk influencing factors selected by the method adopted in step S23, giving an accuracy of 82.63% and a reliability of 81.98%. Second, to minimize the influence of model performance on the evaluation results, the 914 serious accidents that the model predicts correctly are taken as the evaluation data. Then, new evaluation data are generated with the "do-operator", i.e., by changing the state of certain factors in the evaluation data. Finally, the model predicts the severity of the new evaluation data, and the proportion of accidents predicted as non-serious is used to evaluate the effectiveness of controlling each factor.
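The counterfactual evaluation can be sketched as follows. The trained LightGBM model is replaced by a hypothetical rule (`toy_predict`), the records are made up, and "controlling" a factor is assumed to mean setting its indicator to 0; all three are assumptions for illustration only.

```python
def intervention_benefit(predict, severe_records, factor):
    """Share of correctly predicted serious accidents that the model re-labels
    as non-serious once `factor` is forced to 0, i.e. do(factor = 0)."""
    flipped = 0
    for record in severe_records:
        counterfactual = dict(record)      # copy, so the evaluation data stay intact
        counterfactual[factor] = 0         # assumed meaning of controlling the factor
        if predict(counterfactual) == 0:   # 0 = non-serious, 1 = serious
            flipped += 1
    return flipped / len(severe_records)


def toy_predict(record):
    """Hypothetical stand-in for the trained model: serious if two or more factors are present."""
    return 1 if record["H3"] + record["M4"] + record["A5"] >= 2 else 0


# Made-up evaluation data: every record is predicted serious before the intervention.
toy_severe = [
    {"H3": 1, "M4": 1, "A5": 0},
    {"H3": 1, "M4": 1, "A5": 1},
    {"H3": 0, "M4": 1, "A5": 1},
    {"H3": 1, "M4": 0, "A5": 1},
]
benefit_h3 = intervention_benefit(toy_predict, toy_severe, "H3")
```

Restricting the evaluation data to accidents the model already predicts correctly means that any change in the predicted label can be attributed to the intervention rather than to a prior model error.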

Table 3. Feature ranking based on the feature selection method adopted in step S23

Note: min_sup = 0.1; min_con = 0.3

The human factors and management factors among the top 41 risk influencing factors selected by the method adopted in step S23 are controlled one at a time, and the resulting benefits are shown in Figure 10. Shipping companies gain the most from hiring experienced crew members (controlling H3), which helps prevent 10.12% of serious accidents. The second most effective strategy is correcting safety problems in a timely manner (controlling M4), which reduces serious accidents by 7.00%. Controlling H2 or M3 is the least effective strategy. This is notable because Table 3 ranks M3 as the most important risk influencing factor, yet Figure 10 shows that controlling M3 is among the least effective strategies. The main reason is that M3 promotes the occurrence of non-serious accidents and suppresses the occurrence of serious ones. As for why M3 is nevertheless the most important factor for prediction: first, it is the most important from the perspective of the whole accident system, which includes both serious and non-serious accidents. Second, prediction accuracy depends mainly on whether a risk influencing factor can distinguish serious accidents from non-serious ones, not on whether controlling that factor effectively reduces accident severity.

In summary, the present invention provides a machine-learning-based research framework for predicting the severity of marine traffic accidents. The framework covers four aspects: dataset construction, feature selection, evaluation of feature selection performance, and accident severity prediction. It analyzes accident influencing factors comprehensively from the perspective of safety systems engineering and incorporates these risk influencing factors, as accident features, into the marine accident risk influencing factor dataset. The present invention offers a relatively advanced methodology for marine accident prediction research and also has practical value: with this framework, the severity of marine traffic accidents can be analyzed and predicted effectively, providing a basis for safety assessment and for research on risk prevention and control.

Appendix A. Feature description of the marine accident risk influencing factor dataset

Table A. Description of the 68 risk influencing factors

Appendix B. Prediction accuracy based on different feature selection methods

Table B1. Prediction accuracy based on RF.

Table B2. Prediction accuracy based on SVM.

Table B3. Prediction accuracy based on NB.

Table B4. Prediction accuracy based on AdaBoost.

Table B5. Prediction accuracy based on XGBoost.

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and some or all of their technical features may be replaced by equivalents, without causing the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for predicting the severity of a marine traffic accident, comprising:

S1. Use marine accident investigation reports to construct a marine accident risk influencing factor dataset;

S2. Based on the constructed marine accident risk influencing factor dataset, use feature selection methods to improve the accuracy of the trained machine learning models and the interpretability of feature selection;

S3. Use a three-stage performance evaluation method, consisting of stability evaluation, prediction performance evaluation, and comprehensive evaluation with statistical testing, to evaluate the performance of the feature selection methods;

S4. Compare six machine learning models to measure the performance of the different predictors, and take the machine learning model with the best performance in predicting ship accident severity as the benchmark model;

S5. Use the selected model with the highest prediction performance and the optimal features to predict accident severity, perform a benefit evaluation, and analyze the effectiveness of risk control measures quantitatively.

2. A method for predicting the severity of a marine traffic accident according to claim 1, characterized in that step S1 specifically comprises:

S11. Extract human, ship, environment, management factors and basic accident information from the accident investigation report as standard-level features of the marine accident risk influencing factor dataset;

S12. Classify the standard-level features into 68 index-level features during the data processing stage;

S13. Convert continuous data and character data into discrete categorical data.

3. The method for predicting the severity of a marine traffic accident according to claim 1, wherein step S2 specifically comprises:

S21. Feature selection method based on association rule mining and three-weight ranking algorithm to mine the interactions between accident features affecting marine accidents and rank them;

S22. Adopting the feature selection method based on the new mutual information entropy (NMIE) guided by accident consequences, the influencing factors are ranked by using the information entropy of each influencing factor and the severity of the accident;

S23. Using the feature selection method adopted in step S21, mine the relationships between the features and rank the mined features. Then, using the feature selection method adopted in step S22, calculate the mutual information entropy between the target and each feature not mined by the method adopted in step S21, and rank those features by that mutual information entropy. Combine the two rankings to obtain the final feature ranking.

4. The method for predicting the severity of a marine traffic accident according to claim 3, wherein step S21 specifically comprises:

S211. Use association rule mining to find association, co-occurrence or causal relationships between accident features: to find these relationships accurately and efficiently, the FP-Growth algorithm is used to mine frequent itemsets by constructing an FP-tree and to generate association rules, from which the important rules are screened out; if an association rule meets the minimum support and the minimum confidence and its lift is greater than 1, the rule is considered valid. A valid association rule is denoted $X \Rightarrow Y$. The calculation formulas are as follows:

$$\mathrm{Supp}(X) = \frac{N_X}{N}, \qquad \mathrm{Conf}(X \Rightarrow Y) = \frac{\mathrm{Supp}(X \cup Y)}{\mathrm{Supp}(X)}, \qquad \mathrm{Lift}(X \Rightarrow Y) = \frac{\mathrm{Conf}(X \Rightarrow Y)}{\mathrm{Supp}(Y)}$$

where $\mathrm{Supp}(X)$ is the support of transaction $X$; $\mathrm{Conf}(X \Rightarrow Y)$ is the confidence of the rule $X \Rightarrow Y$; $\mathrm{Lift}(X \Rightarrow Y)$ is the lift of the rule $X \Rightarrow Y$; $N_X$ is the number of occurrences of transaction $X$ in the transaction set; and $N$ is the number of transactions in the transaction set;

S212. Use graph theory methods to map the association rules between different accident characteristics to the complex influencing factor interaction network; the nodes in the complex influencing factor interaction network represent accident characteristics, and the edges represent the associations between them; the weights of the edges correspond to the confidence of the association rules; the complex influencing factor interaction network reflects the interaction of various accident characteristics in the development process of marine accidents; use the feature selection method based on association rule mining and three-weight ranking algorithm to determine the importance of each accident feature in the complex influencing factor interaction network, and rank the accident features accordingly.

5. A method for predicting the severity of marine traffic accidents according to claim 3, characterized in that the feature selection method based on association rule mining and three-weight ranking algorithm adopted in step S21 includes two optimization improvements, which are as follows:

The first improvement: The concept of a weighted transition probability matrix is proposed, and confidence is used as the transition probability between adjacent risk influencing factors to replace the definition in the original algorithm. The calculation formula is as follows:

where $PM$ is the transition probability matrix; $n$ is the number of influencing factors mined by the FP-Growth algorithm; $g$ is the Ground node added to the original network; $\delta_{ij}$ is the transition probability from node $n_i$ to node $n_j$; $i \to j$ indicates that an edge exists from node $n_i$ to node $n_j$; $e_{ji}$ is the association between influencing factor $n_j$ and influencing factor $n_i$; $t$ is the number of iterations; $(\mathrm{matrix})_{ij}$ is a square matrix of order $n+1$ in which only the element in row $i$ and column $j$ is 1 and all others are 0; $(\mathrm{zeros\text{-}one})_j$ is a $1 \times (n+1)$ matrix in which only the element in column $j$ is 1 and all others are 0; $(\mathrm{ones})$ is an $(n+1) \times 1$ matrix in which all elements are 1; $*$ is the Hadamard product; and $\cdot$ is the matrix product. The formula also involves the in-degree of node $n_j$ and the score obtained by node $n_i$ at the $t$-th iteration;

The second improvement: based on the graph-theoretic concepts of node out-strength and node betweenness centrality, a new comprehensive weight is proposed. After the algorithm finishes iterating, the comprehensive weight is used to distribute additional score to each node from the final score of the Ground node. The calculation formula is as follows:

where $CC_i$ is the betweenness centrality coefficient of influencing factor $n_i$; $\sigma_{ij}(v)$ is the number of shortest paths from influencing factor $n_i$ to influencing factor $n_j$ that pass through influencing factor $n_v$; $\sigma_{ij}$ is the number of shortest paths from influencing factor $n_i$ to influencing factor $n_j$; $\mu_i$ is the comprehensive weight proposed in this study; and $t_{end}$ is the number of iterations when the algorithm terminates. The formula also involves the out-strength of influencing factor $n_i$.

6. A method for predicting the severity of a marine traffic accident according to claim 3, characterized in that, in step S22, a feature selection method based on a new mutual information entropy (NMIE) guided by accident consequences is used, and the calculation formula of the mutual information entropy is as follows:

where $\mathrm{NMIE}(X, Y)$ is the mutual information entropy between feature $X$ and label $Y$; $m$ is the number of states of the label; $n$ is the number of states of feature $X$; and $I(x_i, y_k)$ is the mutual information between state $i$ of feature $X$ and state $k$ of the label.

7. A method for predicting the severity of a marine traffic accident according to claim 1, characterized in that step S3 specifically comprises:

S31. Stability evaluation: the stability of the feature selection method is measured with respect to randomness and to changes in the size of the training set. Methods such as k-fold splitting, random number variation, and repeated experiments are used to determine how randomness and changes in the size of the training set affect the feature selection process. Spearman's rho is used to measure the stability of the ranking results. The calculation formula is as follows:

Where Sprr^(R1,R2) represents the correlation between ranking R1 and ranking R2; R1 is used as the reference ranking; R2 is used as the experimental ranking; N represents the number of main features; M represents the set of features included in the main features of R1; if feature i is not among the main features of R2, then R2_i = N+1;

S32. Prediction performance evaluation: train the prediction model on the risk influencing factors chosen by feature selection, measure the accuracy of the prediction model, find the subset of risk influencing factors that achieves the highest accuracy with the fewest risk influencing factors, and evaluate the prediction performance of the model with the corresponding indicators;

S33. Comprehensive evaluation and statistical test: first, determine whether each feature selection method has an acceptable level of stability; if the best stability of a feature selection method is lower than 0.6, or if it falls short of the feature selection method with the best stability by more than 0.3, the stability of that method is considered unacceptable, and unacceptable stability indicates that the feature selection method is invalid; next, based on the prediction performance evaluation, analyze the prediction performance of the feature selection methods with acceptable stability; if the prediction performance indicators of different feature selection methods are not similar, the method with the better indicators is considered the better-performing method; if the prediction performance indicators are similar, perform 5-fold cross-validation and test the difference in the prediction performance of the machine learning model; if the significance value (p-value) of the test is greater than 0.05, there is no significant difference in the prediction performance of the two methods; and where the prediction performance of different feature selection methods is similar, the feature selection method with better stability is considered to perform better.
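The stability check of steps S31 and S33 can be sketched as follows. Because the exact formula of this claim is not reproduced here, the sketch assumes the classical Spearman form 1 − 6Σd²/(N(N²−1)) applied to the N main features of R1, with rank N+1 assigned to any feature missing from the main features of R2. The acceptance rule treats each method's stability as a single score. All function names, feature names, and numeric examples are hypothetical.

```python
def spearman_stability(r1, r2, n_main):
    """Rank correlation between reference ranking r1 and experimental ranking r2.

    r1, r2: feature names ordered from most to least important.
    Assumption: classical Spearman formula; requires n_main >= 2.
    """
    main_r1 = r1[:n_main]
    rank_r2 = {feature: pos + 1 for pos, feature in enumerate(r2[:n_main])}
    squared_diff = 0
    for pos, feature in enumerate(main_r1):
        # A feature absent from the main features of r2 gets rank n_main + 1
        d = (pos + 1) - rank_r2.get(feature, n_main + 1)
        squared_diff += d * d
    return 1 - 6 * squared_diff / (n_main * (n_main ** 2 - 1))

def acceptable_stability(stabilities, min_stability=0.6, max_gap=0.3):
    """Flag each method as acceptable or not, following the thresholds in S33."""
    best = max(stabilities.values())
    return {
        method: score >= min_stability and (best - score) <= max_gap
        for method, score in stabilities.items()
    }

# Hypothetical rankings of risk influencing factors from two repeated runs
reference = ["ship_type", "visibility", "wind", "sea_state"]
experiment = ["ship_type", "wind", "visibility", "time_of_day"]
rho = spearman_stability(reference, experiment, n_main=3)

# Hypothetical stability scores of three feature selection methods
flags = acceptable_stability({"method_a": 0.90, "method_b": 0.65, "method_c": 0.50})
```

Only the methods flagged as acceptable would proceed to the prediction performance comparison; the 5-fold cross-validation and significance test are left out of this sketch.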

8. The method for predicting the severity of a marine traffic accident according to claim 1, wherein step S4 specifically comprises:

S41. Introduce six machine learning models: support vector machine, naive Bayes classifier, random forest, AdaBoost, XGBoost, and LightGBM;

S42. Accuracy, precision, recall, F1 score, and area under the curve (AUC) are used to evaluate the prediction performance of the machine learning models, and the light gradient boosting machine (LightGBM), which has the best prediction performance for ship accident severity, is taken as the benchmark model.
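The five indicators named in S42 can be computed as in the following minimal sketch. It assumes a binary severity label (1 for the severe class, 0 otherwise), whereas the actual severity scale may have more classes. AUC is computed through its pairwise-ranking interpretation. The function names and sample values are hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

def auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5).

    Assumes y_true contains at least one positive and one negative label.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels, hard predictions and predicted probabilities
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
y_score = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1]
metrics = binary_metrics(y_true, y_pred)
area = auc(y_true, y_score)
```

Applying the same functions to the predictions of each of the six models gives a like-for-like comparison from which the benchmark model can be chosen.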