CN119474171B

CN119474171B - Data mining method, apparatus, device, and storage medium

Info

Publication number: CN119474171B
Application number: CN202411573239.8A
Authority: CN
Inventors: 曾诗然
Original assignee: Beijing Xitianjingjing Technology Co ltd
Current assignee: Beijing Xitianjingjing Technology Co ltd
Priority date: 2024-11-06
Filing date: 2024-11-06
Publication date: 2025-04-29
Anticipated expiration: 2044-11-06
Also published as: CN119474171A

Abstract

The invention provides a data mining method, apparatus, device, and storage medium. The method comprises: preprocessing multi-source heterogeneous scientific research data to generate a unified-format data set, and mining a multi-dimensional scientific research portrait feature set from that data set; performing pattern mining on researchers' academic output data based on the feature set to obtain a research interest pattern set and a research trend index; performing predictive mining on historical research data to obtain a research interest prediction model and project matching degree scores, and using the model to perform graph structure mining on the researcher cooperation network to obtain network structure features and predict potential cooperation relationships; and finally performing knowledge mining on the scientific research project data according to the project matching degree scores, the network structure features, and the cooperation relationship prediction results to generate a personalized project recommendation list and its interpretable recommendation basis. The method jointly considers researchers' interest evolution, project characteristics, and cooperation potential, and thereby improves the accuracy and interpretability of recommendations.

Description

Data mining method, apparatus, device, and storage medium

Technical Field

The present invention relates to the field of data mining, and in particular to a data mining method, apparatus, device, and storage medium.

Background

With the increasing complexity and diversity of research activities, researchers face significant challenges in selecting research directions and projects. Traditional research interest identification and project recommendation methods rely mainly on keyword matching and simple statistical analysis, and cannot make full use of multi-source heterogeneous scientific research data. In the prior art, some studies have attempted to improve the accuracy of research interest identification and project recommendation using machine learning and data mining techniques. For example, some studies analyze researchers' published literature with topic models to identify their research interests, and others recommend scientific research projects using collaborative filtering algorithms. Still other approaches attempt to optimize recommendations through social network analysis, taking into account the cooperative relationships between researchers. The major drawback of these existing methods is that they typically treat interest identification, project matching, and cooperative relationship analysis as independent tasks and lack a unified framework for integrating these different dimensions of information. As a result, recommendations often fail to fully consider researchers' interest evolution, project characteristics, and cooperation potential, which limits their accuracy and interpretability.

Disclosure of Invention

The main purpose of the invention is to solve the technical problem that existing data mining methods treat interest identification, project matching, and cooperative relationship analysis as independent tasks, which limits the accuracy and interpretability of recommendations.

The first aspect of the present invention provides a data mining method, including:

performing data mining preprocessing on multi-source heterogeneous scientific research data to obtain a preprocessed unified-format data set, and performing feature mining on the unified-format data set to obtain a multi-dimensional scientific research portrait feature set;

performing pattern mining on the research activities of researchers according to the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and a research trend index;

performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index to obtain a research interest prediction model and a project matching degree score;

performing graph structure mining on the researcher cooperation network in the multi-dimensional scientific research portrait feature set according to the research interest prediction model to obtain network structure features and a potential cooperation relationship prediction result;

and performing knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching degree score, the network structure features, and the potential cooperation relationship prediction result to obtain mining results, and performing comprehensive processing and knowledge representation on the mining results to obtain a personalized project recommendation list and an interpretable recommendation basis.

Optionally, in a first implementation manner of the first aspect of the present invention, the performing pattern mining on the research activities of researchers according to the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and a research trend index includes:

performing topic modeling and keyword extraction on academic output data in the multidimensional scientific research portrait feature set to obtain research topic distribution and keyword set;

according to the research topic distribution and the keyword set, performing time-series cluster analysis on academic achievements of researchers in the academic output data to obtain research interest evolution tracks;

based on the research interest evolution track, analyzing a research behavior sequence in the academic output data by using a frequent pattern mining algorithm to obtain a research interest pattern set;

and applying a trend prediction algorithm and an abnormality detection method to the research interest pattern set, and combining preset external subject development data to obtain a research trend index.

Optionally, in a second implementation manner of the first aspect of the present invention, the analyzing the research behavior sequence in the academic output data by using a frequent pattern mining algorithm based on the research interest evolution track, to obtain a research interest pattern set includes:

performing time window segmentation on the research interest evolution track to obtain a plurality of time sequence segments;

according to the time sequence segment, coding and serializing research behavior sequences in academic output data to obtain a standardized behavior sequence set;

Applying a preset sequence pattern mining algorithm to the behavior sequence set, and extracting frequent subsequence patterns to obtain candidate research interest patterns;

And combining the research topic distribution, classifying and merging the candidate research interest modes by using a hierarchical clustering algorithm, and obtaining a research interest mode set.

Optionally, in a third implementation manner of the first aspect of the present invention, the performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index to obtain a research interest prediction model and a project matching degree score includes:

Feature fusion is carried out on the research interest mode set and the research trend index, and comprehensive feature vectors are obtained by combining historical research data in the multidimensional scientific research portrait feature set;

According to the comprehensive feature vector, performing trend, period and random component separation on historical research data by using a time sequence decomposition technology to obtain a multi-dimensional time sequence component;

Generating a research interest prediction model by applying a deep learning model to the multi-dimensional time sequence component, and carrying out interest prediction on a preset future time point according to the research interest prediction model and research activity data in the multi-dimensional scientific research portrait feature set to obtain a predicted interest vector;

And according to the predicted interest vector, combining a preset project feature vector, and obtaining a project matching degree score through cosine similarity calculation and multi-factor weighting.

Optionally, in a fourth implementation manner of the first aspect of the present invention, according to the research interest prediction model, performing graph structure mining on a researcher cooperation network in the multi-dimensional scientific research portrait feature set, to obtain a network structure feature and a potential cooperation relationship prediction result includes:

performing graph embedding processing on the researcher cooperation network in the multi-dimensional scientific research portrait feature set to obtain a low-dimensional node representation vector;

According to the node representation vector and the output of the research interest prediction model, carrying out attribute enhancement on the researcher nodes in the researcher cooperation network to obtain enhanced node representation of the fused interest information;

applying a graph attention network to the enhanced node representations to perform multi-level information propagation and aggregation, so as to obtain dynamically updated node features;

according to the node characteristics, analyzing the cooperation network of the researchers by using a community detection algorithm to obtain network structure characteristics;

And based on the network structure characteristics and the node characteristics, carrying out link prediction through a graph neural network to obtain a potential cooperative relation prediction result.

Optionally, in a fifth implementation manner of the first aspect of the present invention, the knowledge mining is performed on the research project data of the multi-source heterogeneous scientific research data according to the project matching degree score, the network structural feature and the potential cooperative relationship prediction result to obtain a mining result, and comprehensive processing and knowledge representation are performed on the mining result to obtain a personalized project recommendation list and an interpretable recommendation basis, where the method includes:

performing multi-modal feature fusion on the project matching degree scores, the network structure features, and the potential cooperation relationship prediction results to obtain a comprehensive feature representation;

According to the comprehensive characteristic representation, carrying out semantic analysis and topic modeling on research project data of the multi-source heterogeneous scientific research data to obtain a project semantic network;

Based on the project semantic network and the comprehensive feature representation, a knowledge graph construction algorithm is used for mining and reasoning the association relation among projects to obtain a scientific research project knowledge graph;

performing multi-dimensional correlation calculation and ranking on the scientific research project knowledge graph by using a graph neural network and an attention mechanism to obtain a personalized project recommendation list;

and performing relationship path analysis and feature importance quantification on each project in the personalized project recommendation list based on the scientific research project knowledge graph to obtain an interpretable recommendation basis.

Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing relationship path analysis and feature importance quantification on each project in the personalized project recommendation list based on the scientific research project knowledge graph to obtain an interpretable recommendation basis includes:

performing a bidirectional breadth-first search on the nodes related to the recommended projects in the scientific research project knowledge graph to obtain a multi-level relationship path set;

carrying out semantic embedding processing on nodes and edges of paths in the relation path set to obtain a path representation vector;

applying an attention mechanism and a path pruning algorithm to the path representation vectors to obtain core interpretation paths, and quantifying feature contributions based on the core interpretation paths and the comprehensive feature representation to obtain a feature importance ranking;

and constructing a structured interpretation template through a natural language generation technique according to the core interpretation paths and the feature importance ranking to obtain an interpretable recommendation basis.

A second aspect of the present invention provides a data mining apparatus, the data mining apparatus comprising:

The feature mining module is used for performing data mining preprocessing on the multi-source heterogeneous scientific research data to obtain a preprocessed unified-format data set, and performing feature mining on the unified-format data set to obtain a multi-dimensional scientific research portrait feature set;

The pattern mining module is used for performing pattern mining on the research activities of researchers according to the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and a research trend index;

The predictive mining module is used for performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index to obtain a research interest prediction model and a project matching degree score;

The graph structure mining module is used for mining graph structures of the researcher cooperation networks in the multidimensional scientific research portrait feature set according to the research interest prediction model to obtain network structure features and potential cooperation relation prediction results;

The knowledge mining module is used for carrying out knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching degree scores, the network structural features and the potential cooperative relation prediction results to obtain mining results, and carrying out comprehensive processing and knowledge representation on the mining results to obtain personalized project recommendation lists and interpretable recommendation bases.

A third aspect of the present invention provides a data mining device comprising a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line, the at least one processor invoking the instructions in the memory to cause the data mining device to perform the steps of the data mining method described above.

A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the steps of the data mining method described above.

According to the data mining method, apparatus, device, and storage medium provided by the invention, multi-source heterogeneous scientific research data is preprocessed to generate a unified-format data set, and a multi-dimensional scientific research portrait feature set is mined from that data set. Pattern mining is performed on researchers' academic output data based on the feature set to obtain a research interest pattern set and a research trend index. A research interest prediction model and project matching degree scores are obtained from the historical research data through predictive mining, and the model is used to perform graph structure mining on the researcher cooperation network to obtain network structure features and predict potential cooperation relationships. Finally, knowledge mining is performed on the scientific research project data according to the project matching degree scores, the network structure features, and the cooperation relationship prediction results to generate a personalized project recommendation list and its interpretable recommendation basis. The method jointly considers researchers' interest evolution, project characteristics, and cooperation potential, and thereby improves the accuracy and interpretability of recommendations.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

FIG. 1 is a schematic diagram of a first embodiment of a data mining method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of an embodiment of a data mining apparatus according to an embodiment of the present invention;

FIG. 3 is a schematic diagram of an embodiment of a data mining device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terms "comprising" and "having" and any variations thereof, as used in the embodiments of the present invention, are intended to cover non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed or inherent to such process, method, article, or apparatus but may optionally include other steps or elements not listed or inherent to such process, method, article, or apparatus.

To facilitate understanding of the present embodiment, the data mining method disclosed herein is first described in detail. As shown in FIG. 1, the method comprises the following steps:

101. Performing data mining preprocessing on multi-source heterogeneous scientific research data to obtain a preprocessed unified-format data set, and performing feature mining on the unified-format data set to obtain a multi-dimensional scientific research portrait feature set;

In one embodiment of the present invention, the sources of the multi-source heterogeneous scientific research data are determined first. Such data typically includes academic paper databases, patent databases, research project applications and funding records, personal academic homepages, social media platforms, academic conference and seminar records, and institutions' internal research management systems. A corresponding data acquisition module is developed for each data source, which may involve web crawling, API calls, database queries, and natural language processing. The collected raw data often suffers from inconsistent formats, missing information, or redundancy, so data cleaning and preprocessing are needed. This process includes removing duplicate and invalid data, unifying data formats, handling missing values and outliers, and normalizing and standardizing the data. For text data in particular, natural language processing operations such as word segmentation, stop-word removal, and part-of-speech tagging are also required. In addition, inconsistencies between data sources, such as differences in name spelling or changes in organization names, need to be resolved, which may be accomplished through entity alignment techniques. The preprocessed data is then integrated into a unified-format data set. This process relies on data integration techniques, in which data from different sources is organized and stored in a predefined format. Structured data may be stored in a relational database, semi-structured data in a document database, and large-scale raw data in a distributed file system. Next, feature mining is performed on the unified-format data set to construct a multi-dimensional scientific research portrait feature set.
This process first involves feature extraction, in which meaningful features are extracted from different types of data. For text data, techniques such as TF-IDF and Word2Vec can be used to extract semantic features; for numerical data, various statistical indicators can be calculated; and for time series data, features such as trend and periodicity can be extracted. The most relevant and representative features are then selected by feature selection techniques such as the chi-square test and information gain. To capture the dynamics of researchers' academic activity, temporal features also need to be constructed. This can be achieved by time series analysis of a researcher's academic output, for example by using a sliding window to calculate how features change across time periods. Meanwhile, to reflect researchers' position and influence in academic networks, network features need to be built. This can be done by performing social network analysis on the cooperation network and calculating indicators such as centrality and structural holes. Finally, all extracted features are integrated and standardized to form the multi-dimensional scientific research portrait feature set. This feature set should comprehensively reflect researchers' academic background, research interests, cooperation patterns, and influence. To facilitate subsequent analysis and model training, these features may be represented as high-dimensional vectors or matrices. In view of the dynamic nature of scientific research activities, an incremental update mechanism also needs to be designed so that the feature set can be updated efficiently when new data arrives.
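The TF-IDF feature extraction mentioned above can be sketched as follows. This is a minimal illustration, assuming documents are already tokenized and non-empty and using a plain `tf * log(N / df)` weighting; the function name `tfidf_top_keywords` and the sample documents are hypothetical and not taken from the embodiment.

```python
import math
from collections import Counter


def tfidf_top_keywords(docs, top_n=3):
    """Return the top_n highest-scoring TF-IDF terms for each tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    results = []
    for doc in docs:
        counts = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
        # Sort by descending score, then alphabetically so ties are deterministic.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_n])
    return results


# Illustrative tokenized abstracts (assumed data).
docs = [
    ["graph", "embedding", "graph", "embedding", "mining"],
    ["topic", "model", "topic", "text", "mining"],
    ["graph", "pattern", "sequence", "pattern", "mining"],
]
keywords = tfidf_top_keywords(docs, top_n=2)
```

A term that occurs in every document (here `mining`) receives a score of zero under this weighting, which is why it never appears among the selected keywords.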

102. Performing pattern mining on the research activities of researchers according to the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and a research trend index;

In one embodiment of the invention, performing pattern mining on the research activities of researchers according to the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and a research trend index comprises: performing topic modeling and keyword extraction on the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research topic distribution and a keyword set; performing time-series cluster analysis on the researchers' academic achievements in the academic output data according to the research topic distribution and the keyword set to obtain a research interest evolution track; analyzing the research behavior sequences in the academic output data with a frequent pattern mining algorithm, based on the research interest evolution track, to obtain the research interest pattern set; and applying a trend prediction algorithm and an anomaly detection method to the research interest pattern set, in combination with preset external subject development data, to obtain the research trend index.

Specifically, topic modeling and keyword extraction are first performed on the academic output data in the multi-dimensional scientific research portrait feature set. Topic modeling may use the Latent Dirichlet Allocation (LDA) algorithm, which can discover latent topic structures in a large collection of documents. In practice, researchers' academic papers, patent texts, and similar materials are taken as input documents, a suitable number of topics is specified in advance, and the document-topic and topic-word distributions are then estimated through iterative optimization. This yields the topic distribution of each document, and aggregating these distributions gives a researcher's overall research topic distribution. Keyword extraction may use the TF-IDF (term frequency-inverse document frequency) method, which identifies words that are highly important within a specific document. By setting a threshold or selecting the top N terms, a group of keywords can be extracted for each document and then aggregated into the researcher's keyword set. Next, time-series cluster analysis is performed on the researchers' academic achievements in the academic output data according to the obtained research topic distribution and keyword set. The purpose of this step is to trace the evolution of research interests. The academic achievements are first arranged in chronological order, after which a Dynamic Time Warping (DTW) algorithm can be used to measure the similarity between achievements at different points in time. DTW can handle time series of unequal length, which makes it suitable for analyzing research activities from different periods. Based on the DTW distance metric, hierarchical clustering or a density-based clustering algorithm (e.g., DBSCAN) can be used to identify clusters of research interest in different periods.
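As a minimal sketch of the DTW distance described above, assume each period of academic output is summarized as a one-dimensional numeric series (for example, the weight of one topic per year). The function name `dtw_distance` and the sample series are illustrative assumptions only.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: step in a, step in b, or step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Two series of different length that follow the same shape (assumed data).
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# Two series of equal length that are shifted by one unit.
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

Because the warping path may repeat elements, the two series of different length align with zero cost, which is the property that makes DTW usable for periods with unequal numbers of observations. The resulting pairwise distances could then be passed to a clustering step.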
By analyzing how these clusters evolve, the evolution track of research interests can be obtained, which reflects how research interests change and shift over time. Based on the obtained research interest evolution track, the next step is to analyze the research behavior sequences in the academic output data with a frequent pattern mining algorithm. A projection-based sequential pattern mining algorithm, such as PrefixSpan, may be employed here. The research activities are first encoded as sequences; for example, each research topic or keyword can be treated as an item, and the sequence is then constructed in chronological order. PrefixSpan finds frequent sequential patterns by recursively constructing prefix-projected databases, which allows it to process long sequences and large data sets efficiently. By setting a minimum support threshold, a series of frequently occurring research activity patterns can be obtained, and these patterns constitute the research interest pattern set. Finally, a trend prediction algorithm and an anomaly detection method are applied to the obtained research interest pattern set, in combination with preset external subject development data, to obtain the research trend index. Trend prediction may use time series analysis methods such as the ARIMA (autoregressive integrated moving average) model or the Prophet model. These models can capture trend, periodicity, and seasonality in time series data and thereby predict future behavior. For each research interest pattern, the change in its frequency of occurrence over a past period can be analyzed and its future trend predicted.
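The prefix-projection mining described in this step can be illustrated with a simplified PrefixSpan sketch. It assumes each sequence element is a single item (a topic identifier) rather than an itemset, and it counts support as the number of sequences containing the pattern; the function name `prefixspan` and the sample sequences are hypothetical.

```python
def prefixspan(sequences, min_support):
    """Mine frequent subsequences of single-item sequences by prefix projection."""
    patterns = []

    def mine(prefix, projected):
        # Count each item at most once per projected sequence.
        support = {}
        for seq in projected:
            for item in set(seq):
                support[item] = support.get(item, 0) + 1
        for item in sorted(support):
            count = support[item]
            if count < min_support:
                continue
            new_prefix = prefix + [item]
            patterns.append((tuple(new_prefix), count))
            # Projected database: the suffix after the first occurrence of the item.
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            mine(new_prefix, new_projected)

    mine([], sequences)
    return patterns


# Chronological topic sequences of three researchers (assumed data).
sequences = [
    ["nlp", "graph", "recsys"],
    ["nlp", "recsys"],
    ["graph", "nlp", "recsys"],
]
frequent = dict(prefixspan(sequences, min_support=2))
```

With a minimum support of 2, the pattern `("nlp", "recsys")` is supported by all three sequences, while `("graph", "nlp")` occurs in only one and is therefore pruned.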
Anomaly detection may use a statistical or machine-learning-based method, such as a One-Class SVM or the Isolation Forest algorithm. Such methods can identify research behaviors that differ significantly from conventional patterns and that may represent an emerging research direction or point of innovation. Trend prediction and anomaly detection also need to take the preset external subject development data into account. Such data includes overall trends in the discipline, hot topics at important conferences or in important journals, and the priority funding directions of research funders. By comparing and fusing researchers' personal research interest patterns with this external data, a more comprehensive and objective research trend index can be obtained. In particular, the degree of match between personal research interests and discipline hotspots can be calculated, the frontier nature and potential impact of research directions can be evaluated, and the future development potential of specific research topics can be predicted.
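One possible way to compute the match between personal research interests and discipline hotspots is cosine similarity between two topic-weight vectors. The embodiment does not fix the measure at this point, so the choice of cosine similarity here is an assumption, as are the function name and the sample vectors.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_u * norm_v)


# Topic weights over the same three topics (assumed data).
researcher_topics = [0.6, 0.3, 0.1]
discipline_hotspots = [0.5, 0.4, 0.1]
match = cosine_similarity(researcher_topics, discipline_hotspots)
```

Both vectors must be indexed by the same topic order for the score to be meaningful.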

Further, analyzing the research behavior sequences in the academic output data with a frequent pattern mining algorithm, based on the research interest evolution track, to obtain the research interest pattern set comprises: performing time window segmentation on the research interest evolution track to obtain a plurality of time series segments; encoding and serializing the research behavior sequences in the academic output data according to the time series segments to obtain a standardized behavior sequence set; applying a preset sequential pattern mining algorithm to the behavior sequence set and extracting frequent subsequence patterns to obtain candidate research interest patterns; and classifying and merging the candidate research interest patterns with a hierarchical clustering algorithm, in combination with the research topic distribution, to obtain the research interest pattern set.

Specifically, time window segmentation is first performed on the research interest evolution track. This step slices the continuous time series data into discrete segments to facilitate subsequent pattern mining. In particular, a sliding window technique may be used: a time window of fixed size (e.g., 6 months or 1 year) is set and then slid along the time axis by a fixed step (e.g., 1 month) each time. This yields a series of overlapping time series segments, each representing research interest in a particular period. To capture patterns at different scales, several time windows of different sizes may also be set, producing a set of multi-scale time series segments. Next, based on these time series segments, the research behavior sequences in the academic output data are encoded and serialized. The key to encoding is to convert complex research activities into compact and meaningful symbolic or numeric representations. For example, each topic or keyword may be assigned a unique identifier, and the researcher's principal research activity within each time window is then represented as a sequence of these identifiers. Specifically, the N topics or keywords that occur most frequently in each time window can be selected as representatives of the window and ranked by importance to form an ordered symbol sequence. This encoding preserves the temporal information of research behaviors while greatly simplifying the data structure, which facilitates subsequent pattern mining. Serialization then organizes the encoded sequences in chronological order to form a standardized behavior sequence set. This set covers the evolution of the researcher's research behavior over the entire observation period.
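The window segmentation and top-N encoding described above might look like the following sketch. It assumes each research event is reduced to a `(month_index, topic_id)` pair and that frequency alone determines importance; the function name `encode_windows`, the parameter names, and the sample events are hypothetical.

```python
from collections import Counter


def encode_windows(events, window, step, top_n):
    """Slide a window over (month, topic) events and keep the top_n topics per window."""
    if not events:
        return []
    first = min(month for month, _ in events)
    last = max(month for month, _ in events)
    encoded = []
    start = first
    while start <= last:
        # Topics of all events that fall inside [start, start + window).
        in_window = [topic for month, topic in events if start <= month < start + window]
        counts = Counter(in_window)
        # Rank by descending frequency, then by identifier for determinism.
        ranked = sorted(counts, key=lambda topic: (-counts[topic], topic))
        encoded.append((start, ranked[:top_n]))
        start += step
    return encoded


# Month index and dominant topic of each publication (assumed data).
events = [(0, "T1"), (1, "T1"), (2, "T2"), (5, "T2"), (7, "T3"), (8, "T3"), (9, "T2")]
sequence = encode_windows(events, window=6, step=3, top_n=2)
```

In this sample the leading topic shifts from `T1` in the first window to `T3` in later windows, which is the kind of ordered symbol sequence that the subsequent pattern mining step consumes. Running the function with several `window` values would give the multi-scale segments mentioned above.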
After the standardized behavior sequence set is obtained, the next step is to apply a preset sequence pattern mining algorithm to extract frequent subsequence patterns. A projection-based sequence pattern mining algorithm, such as PrefixSpan or a modified version thereof, may be used here. The core idea of the PrefixSpan algorithm is to find frequent sequence patterns by recursively constructing prefix-projected databases, which allows long sequences and large-scale data sets to be handled efficiently. A minimum support threshold is set first, and only subsequences whose frequency of occurrence reaches this threshold are considered frequent patterns. The algorithm then expands the prefix step by step, starting from sequences of length 1; each expansion generates a projected database, in which frequent items are searched for. This process proceeds recursively until no further frequent sequences can be found. In this way, the algorithm efficiently finds all frequent subsequences that meet the minimum support requirement, and these form the candidate research interest pattern set. Finally, to further refine and organize the candidate research interest patterns, a hierarchical clustering algorithm is used to classify and merge them in combination with the research topic distribution. First, a feature vector is calculated for each candidate research interest pattern; it can be constructed from the distribution of research topics contained in the pattern. Specifically, the frequency of occurrence of each topic in the pattern may be counted and normalized to form a topic distribution vector. These feature vectors are then clustered with a hierarchical clustering algorithm. An advantage of hierarchical clustering is that it does not require the number of clusters to be specified in advance; instead, clusters are gradually merged bottom-up or split top-down.
In this process, a metric such as Euclidean distance or cosine similarity may be used to measure the similarity between patterns. The clustering process may be controlled by a distance threshold: merging stops when the distance between any two clusters is greater than this threshold. Each resulting cluster represents one type of similar research interest pattern, and the cluster center can serve as the representative of that type. A classified and merged research interest pattern set is thus obtained, which reflects the temporal characteristics of the research behaviors and also takes the semantic similarity of the research topics into account.
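As a non-limiting illustration of the prefix-projection mining step described above, a minimal PrefixSpan-style sketch is given below. It assumes, for simplicity, that each sequence element is a single symbol rather than an itemset, and it treats a pattern as frequent when its support is greater than or equal to `min_support`; both choices are assumptions of this example.

```python
def prefixspan(sequences, min_support):
    """Return {pattern_tuple: support} for all frequent subsequences."""
    results = {}

    def mine(prefix, projected):
        # `projected` holds (sequence_index, start_offset) suffix pointers,
        # i.e. the projected database for the current prefix.
        support = {}
        for seq_idx, start in projected:
            seen = set()
            for item in sequences[seq_idx][start:]:
                if item not in seen:  # count each sequence at most once
                    seen.add(item)
                    support[item] = support.get(item, 0) + 1
        for item, count in sorted(support.items()):
            if count < min_support:
                continue
            pattern = prefix + (item,)
            results[pattern] = count
            # Project: keep the suffix after the first occurrence of `item`.
            next_projected = []
            for seq_idx, start in projected:
                seq = sequences[seq_idx]
                for pos in range(start, len(seq)):
                    if seq[pos] == item:
                        next_projected.append((seq_idx, pos + 1))
                        break
            mine(pattern, next_projected)

    mine((), [(i, 0) for i in range(len(sequences))])
    return results
```

The returned patterns would then serve as the candidate research interest patterns that are passed to the clustering step.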

103. Predictive mining is carried out on historical research data in the multidimensional scientific research portrait feature set according to the research interest mode set and the research trend index, so that a research interest prediction model and a project matching degree score are obtained;

In one embodiment of the invention, performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index to obtain the research interest prediction model and the project matching degree score comprises: performing feature fusion on the research interest pattern set and the research trend index and combining the result with the historical research data in the multi-dimensional scientific research portrait feature set to obtain a comprehensive feature vector; separating the trend, periodic, and random components of the historical research data with a time series decomposition technique according to the comprehensive feature vector to obtain multi-dimensional time series components; applying a deep learning model to the multi-dimensional time series components to generate the research interest prediction model; predicting interest at a preset future time point according to the research interest prediction model and the research activity data in the multi-dimensional scientific research portrait feature set to obtain a predicted interest vector; and obtaining the project matching degree score through cosine similarity calculation and multi-factor weighting according to the predicted interest vector and a preset project feature vector.

Specifically, feature fusion is first performed on the research interest pattern set and the research trend index. This step integrates the research interest patterns and trend information obtained in the previous step into one unified feature representation. In a specific implementation, feature concatenation can be adopted: each pattern in the research interest pattern set is represented as a vector in which each element represents a certain feature of the pattern (such as frequency or duration), and the research trend index is likewise converted into a vector in which each element corresponds to a specific trend index value. These two vectors are then concatenated to form a preliminary fused feature vector. The fused feature vector next needs to be combined with the historical research data in the multi-dimensional scientific research portrait feature set. The historical research data comprises information in several dimensions, such as the number of published papers, the number of citations, and the characteristics of the cooperation network. The historical data are also converted into vector form and are concatenated or fused by weighting with the previously obtained fused feature vector, finally yielding a comprehensive feature vector. This comprehensive feature vector contains both the researcher's historical research interests and trend information and the various aspects of the researcher's historical research performance. After the comprehensive feature vector is obtained, the next step is to separate the trend, periodic, and random components of the historical research data with a time series decomposition technique. The purpose of this step is to gain a thorough understanding of the inherent structure and variation patterns of the historical research data. Common time series decomposition methods include classical decomposition and STL (Seasonal and Trend decomposition using Loess) decomposition.
Taking STL decomposition as an example, it can effectively cope with nonlinear trends and multiple seasonalities. In particular, the time-related features in the comprehensive feature vector are first extracted to form one or more time series. The STL decomposition algorithm is then applied to each time series to decompose it into a trend component, a seasonal component, and a residual (random) component. The trend component reflects the long-term direction of research activity, the seasonal component captures periodic patterns that may exist (e.g., annual academic conference cycles), and the random component contains short-term fluctuations and noise. Through this decomposition, multi-dimensional time series components are obtained, each representing an important aspect of the historical research data. Next, a deep learning model is applied to the resulting multi-dimensional time series components to generate the research interest prediction model. Given the nature of time series data, recurrent neural networks (RNNs) or their variants, such as long short-term memory networks (LSTM) or gated recurrent units (GRU), are suitable. These models can effectively capture long-term dependencies in time series data. In particular, the multi-dimensional time series components first need to be arranged into a format suitable as input to the deep learning model, usually a three-dimensional tensor whose dimensions represent the number of samples, the number of time steps, and the number of features, respectively. A network structure containing multiple layers of LSTM or GRU units is then constructed, and an attention mechanism may be added to strengthen the model's sensitivity to key time points. The model is trained in a supervised manner: a portion of the historical data is used as input, and the model predicts the research interest at the next time step.
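As a non-limiting illustration of the input format mentioned above, the following sketch arranges a per-step component series into (samples, time steps, features) windows with next-step targets. The function names and the `lookback` parameter are hypothetical, and nested Python lists stand in for whatever tensor type a deep learning framework would use.

```python
def make_supervised_windows(series, lookback):
    """Build next-step prediction samples from a multivariate series.

    `series` is a list of per-step feature lists, e.g. one
    [trend, seasonal, residual] triple per time step.
    Returns (inputs, targets): inputs has shape
    (samples, lookback, features); targets has shape (samples, features).
    """
    inputs, targets = [], []
    for end in range(lookback, len(series)):
        inputs.append([list(step) for step in series[end - lookback:end]])
        targets.append(list(series[end]))
    return inputs, targets


def shape3d(tensor):
    """Return (samples, time_steps, features) of a nested-list tensor."""
    return (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Each input window covers `lookback` consecutive steps, and its target is the step immediately after the window, which matches the supervised next-step setup described above.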
Model parameters are adjusted iteratively through the back-propagation algorithm and a gradient descent optimizer until the model's performance on the validation set is optimal. After training, a deep learning model capable of predicting future research interests, namely the research interest prediction model, is obtained. With the research interest prediction model available, the next step is to use it to predict interest at a preset future time point. The latest research activity data in the multi-dimensional scientific research portrait feature set is used as the input of the model. Specifically, the research activity data for the most recent period (e.g., the last year) may be selected and subjected to the same preprocessing and feature extraction as the training data to form the model input. The trained deep learning model then forward-propagates the input data to produce prediction results for several future time points (for example, over the next three years). These predictions may be represented as a series of vectors, each representing the predicted interest distribution at one future time point, collectively referred to as predicted interest vectors. Finally, the project matching degree score is calculated from the predicted interest vector in combination with a preset project feature vector. A project feature vector is a feature description of each candidate research project, covering factors such as the research topic of the project, the skills required, and the funding scale. To calculate the matching degree, the cosine similarity between the predicted interest vector and each project feature vector is computed first. Cosine similarity effectively measures the directional similarity of two vectors in a high-dimensional space and is therefore suitable for measuring how well research interests match project features.
Then, because different factors may differ in importance, a multi-factor weighting mechanism is introduced. The factors here include the researcher's professional background, past project experience, cooperation network, and the like. By assigning different weights to these factors and multiplying them with the cosine similarity result, a comprehensive project matching degree score is finally obtained. This score takes into account the researcher's predicted future interests, the characteristics of the project, and other relevant factors, and can therefore support more personalized and forward-looking project recommendations.
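As a non-limiting illustration, one possible reading of the scoring described above is a cosine similarity scaled by a normalized weighted sum of factor values. The factor names, the assumption that each factor is already scaled to [0, 1], and the normalization by the total weight are choices made for this example only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def match_score(predicted_interest, project_vector, factors, weights):
    """Scale the cosine similarity by a weighted average of factor values.

    `factors` maps a factor name to a value assumed to lie in [0, 1];
    `weights` maps the same names to non-negative weights.
    """
    similarity = cosine_similarity(predicted_interest, project_vector)
    total_weight = sum(weights.values())
    factor_term = sum(weights[name] * factors[name] for name in weights) / total_weight
    return similarity * factor_term
```

Other combinations, such as adding the weighted factors to the similarity instead of multiplying, would fit the same description; the multiplicative form is shown only because the text mentions multiplication.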

104. According to the research interest prediction model, graph structure mining is carried out on the researcher cooperation network in the multi-dimensional scientific research portrait feature set to obtain network structure features and a potential cooperative relationship prediction result;

In one embodiment of the invention, performing graph structure mining on the researcher cooperation network in the multi-dimensional scientific research portrait feature set according to the research interest prediction model to obtain the network structure features and the potential cooperative relationship prediction result comprises: performing graph embedding on the researcher cooperation network in the multi-dimensional scientific research portrait feature set to obtain low-dimensional node representation vectors; performing attribute enhancement on the researcher nodes in the researcher cooperation network according to the node representation vectors and the output of the research interest prediction model to obtain enhanced node representations that incorporate interest information; applying a graph attention network to the enhanced node representations for multi-level information propagation and aggregation to obtain dynamically updated node features; analyzing the researcher cooperation network with a community detection algorithm according to the node features to obtain the network structure features; and performing link prediction through a graph neural network based on the network structure features and the node features to obtain the potential cooperative relationship prediction result.

Specifically, graph embedding is first performed on the researcher cooperation network in the multi-dimensional scientific research portrait feature set. Graph embedding is a technique for mapping high-dimensional graph-structured data into a low-dimensional vector space; it reduces the dimensionality of the data while retaining the graph structure information, which facilitates subsequent processing and analysis. In a specific implementation, an algorithm such as DeepWalk, node2vec, or GraphSAGE may be used. Taking node2vec as an example, the algorithm generates node sequences by performing biased random walks on the graph and then learns vector representations of the nodes with a method similar to Word2Vec. When a researcher cooperation network is processed, each researcher is treated as a node, and cooperative relationships are represented as edges between nodes. By setting an appropriate walk strategy (e.g., balancing depth-first and breadth-first search), node2vec can capture both local and global structural features of the network. Ultimately, each researcher node is mapped to a low-dimensional vector, and these vectors constitute the set of node representation vectors. After the node representation vectors are obtained, the next step is to perform attribute enhancement on the researcher nodes in the researcher cooperation network according to these vectors and the output of the research interest prediction model. This step incorporates the researcher's predicted interest information into the network structure. Specifically, a predicted interest vector is first generated for each researcher with the previously trained research interest prediction model. The predicted interest vector is then concatenated or fused by weighting with the node representation vector to form an enhanced node representation.
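As a non-limiting illustration of the biased random walk used by node2vec in the graph embedding step above, the following sketch assumes an unweighted, undirected graph stored as a dictionary of neighbor sets. The return parameter `p` and in-out parameter `q` follow the usual node2vec convention; the function name and data layout are assumptions of this example.

```python
import random


def biased_walk(graph, start, length, p=1.0, q=1.0, rng=None):
    """Generate one second-order biased random walk of up to `length` nodes.

    `graph` maps each node to the set of its neighbors.
    """
    rng = rng or random.Random()
    walk = [start]
    while len(walk) < length:
        current = walk[-1]
        neighbors = sorted(graph[current])
        if not neighbors:
            break  # isolated node: the walk cannot continue
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is unbiased
            continue
        previous = walk[-2]
        weights = []
        for candidate in neighbors:
            if candidate == previous:
                weights.append(1.0 / p)   # step back to the previous node
            elif candidate in graph[previous]:
                weights.append(1.0)       # stay near the previous node (BFS-like)
            else:
                weights.append(1.0 / q)   # move farther away (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

A small `q` favors outward, depth-first-like exploration, while a large `q` keeps the walk local; the walks would then be fed to a Word2Vec-style model to learn the node vectors.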
This fusion not only preserves the original network structure information but also introduces dynamic interest prediction information, making the node representation richer and multi-dimensional. The fusion can be a simple vector concatenation, or a more complex attention mechanism can be used to adjust the weights dynamically according to the importance of different features. Next, a graph attention network (GAT) is applied to the resulting enhanced node representations for multi-level information propagation and aggregation. The core idea of GAT is to adaptively assign different importance weights to neighbor nodes through a self-attention mechanism, thereby achieving finer-grained information aggregation. In an implementation, a multi-layer GAT structure is first constructed, each layer containing multiple attention heads. Each attention head independently calculates the attention coefficients between pairs of nodes and then aggregates the features of neighboring nodes as a sum weighted by these coefficients. Stacking multiple GAT layers realizes multi-hop propagation of information, so that each node not only acquires information from its direct neighbors but also captures the influence of more distant nodes. This process is dynamic in that the attention weights are calculated from the current features of the node and its neighbors, so that changes in network structure and node attributes can be accommodated. After the GAT processing, each node obtains a dynamically updated feature representation that contains not only the node's own information but also its structural position in the network and the information of its neighbors. Based on the dynamically updated node features, the next step is to analyze the researcher cooperation network with a community detection algorithm to obtain the network structure features.
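As a non-limiting illustration of the attention-weighted aggregation performed by one GAT attention head, as described above, the following sketch uses plain Python lists. The weight matrix `W` and attention vector `a` would normally be learned; here they are supplied by the caller, and the output nonlinearity of a full GAT layer is omitted for brevity. All names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_head(features, neighbors, W, a):
    """One attention head: returns (updated_features, attention_coefficients).

    `features` maps node -> feature list, `neighbors` maps node -> list of
    neighbor nodes, and `a` has length 2 * (number of rows of W).
    """
    z = {node: matvec(W, h) for node, h in features.items()}
    updated, attention = {}, {}
    for i in features:
        hood = [i] + [j for j in neighbors[i] if j != i]  # include a self-loop
        scores = [
            leaky_relu(sum(ak * v for ak, v in zip(a, z[i] + z[j]))) for j in hood
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        attention[i] = dict(zip(hood, alphas))
        updated[i] = [
            sum(alpha * z[j][k] for alpha, j in zip(alphas, hood))
            for k in range(len(z[i]))
        ]
    return updated, attention
```

A multi-head layer would run several such heads with different parameters and concatenate or average their outputs.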
The purpose of community detection is to identify tightly connected groups of nodes in the network; in a research cooperation network, these groups represent different research teams or academic circles. Common community detection algorithms include the Louvain method, label propagation algorithms, spectral clustering, and the like. Taking the Louvain method as an example, it discovers community structure by optimizing modularity. In a particular implementation, the algorithm first treats each node as an independent community and then iteratively moves nodes into the neighboring community that maximizes the gain in overall modularity. This process is repeated until the modularity no longer increases significantly. Through community detection, network structure features such as the number of communities, the community size distribution, and the community membership of each node can be obtained; these features reflect the organizational structure and grouping pattern of the researcher cooperation network. Finally, based on the obtained network structure features and node features, link prediction is performed through a graph neural network to obtain the potential cooperative relationship prediction result. Link prediction aims to predict edges that have not yet formed in the network but may appear in the future; in a research cooperation network, such edges represent potential new cooperative relationships. In an implementation, a graph convolutional network (GCN) or a graph attention network (GAT) may be employed as the base model. First, the network structure features (such as node degree, centrality indices, and community membership) and the previously obtained dynamically updated node features are combined as the input features of each node. A multi-layer graph neural network is then constructed, in which each layer aggregates and transforms the node features.
At the last layer of the model, a link prediction head is used; it accepts the final representations of two nodes as input and outputs the probability that an edge forms between them. The model may be trained with negative sampling, i.e., some non-existent edges are randomly selected as negative samples and used for training together with the known cooperative relationships (positive samples). By minimizing the binary cross-entropy loss function, the model learns to distinguish edges that exist from edges that do not. After training is completed, predictions are made for all unconnected node pairs in the network to obtain the probability that each pair forms a cooperative relationship in the future, which yields the potential cooperative relationship prediction result.
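As a non-limiting illustration of the negative sampling and binary cross-entropy objective described above, the following sketch scores node pairs with a sigmoid over the dot product of their representations. The dot-product head is an assumption of this example, since the embodiment does not fix the form of the link prediction head, and the training loop itself is omitted.

```python
import math
import random


def sample_negative_edges(nodes, positive_edges, k, rng):
    """Randomly pick up to k node pairs that are not known edges."""
    existing = {frozenset(edge) for edge in positive_edges}
    candidates = [
        (u, v)
        for idx, u in enumerate(nodes)
        for v in nodes[idx + 1:]
        if frozenset((u, v)) not in existing
    ]
    return rng.sample(candidates, min(k, len(candidates)))


def link_probability(emb_u, emb_v):
    """Sigmoid of the dot product of two node representations."""
    score = sum(a * b for a, b in zip(emb_u, emb_v))
    return 1.0 / (1.0 + math.exp(-score))


def bce_loss(embeddings, positive_edges, negative_edges, eps=1e-12):
    """Mean binary cross-entropy over positive and negative node pairs."""
    losses = [
        -math.log(link_probability(embeddings[u], embeddings[v]) + eps)
        for u, v in positive_edges
    ]
    losses += [
        -math.log(1.0 - link_probability(embeddings[u], embeddings[v]) + eps)
        for u, v in negative_edges
    ]
    return sum(losses) / len(losses)
```

After training, applying `link_probability` to every unconnected pair would give the predicted probability of a future cooperative relationship for that pair.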

105. And carrying out knowledge mining on research project data of the multi-source heterogeneous scientific research data according to the project matching degree scores, the network structure characteristics and the potential cooperative relation prediction results to obtain mining results, and carrying out comprehensive processing and knowledge representation on the mining results to obtain a personalized project recommendation list and an interpretable recommendation basis.

In one embodiment of the invention, performing knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching degree score, the network structure features, and the potential cooperative relationship prediction result to obtain a mining result, and performing comprehensive processing and knowledge representation on the mining result to obtain the personalized project recommendation list and the interpretable recommendation basis comprises: performing multi-modal feature fusion on the project matching degree score, the network structure features, and the potential cooperative relationship prediction result to obtain a comprehensive feature representation; performing semantic analysis and topic modeling on the research project data of the multi-source heterogeneous scientific research data according to the comprehensive feature representation to obtain a project semantic network; mining and reasoning about the association relationships among projects with a knowledge graph construction algorithm to obtain a scientific research project knowledge graph; performing multi-dimensional relevance calculation and ranking on the scientific research project knowledge graph with a graph neural network and an attention mechanism to obtain the personalized project recommendation list; and performing relationship path analysis and feature importance quantification on each project in the personalized project recommendation list based on the scientific research project knowledge graph to obtain the interpretable recommendation basis.

Specifically, multi-modal feature fusion is first performed on the project matching degree score, the network structure features, and the potential cooperative relationship prediction result. This step integrates features of different sources and forms into one unified representation. In particular, feature concatenation or an attention mechanism may be employed for the fusion. Feature concatenation directly joins the different feature vectors into one high-dimensional vector. An attention mechanism can dynamically adjust the importance weights of the different features according to the requirements of the current task. For example, with a multi-head attention mechanism, each feature is assigned an attention head, and the different features are then combined as a weighted sum using the learned weights. This approach adaptively adjusts the contributions of the different features, yielding a more refined and information-rich comprehensive feature representation. After the comprehensive feature representation is obtained, the next step is to perform semantic analysis and topic modeling on the research project data in the multi-source heterogeneous scientific research data according to this representation. The purpose of this step is to gain a thorough understanding of the content and topic structure of the research projects. First, the project description text is preprocessed, including word segmentation, stop-word removal, lemmatization, and the like. Each word may then be converted to a dense vector representation with a word embedding technique (e.g., Word2Vec or BERT). Next, a topic modeling algorithm such as Latent Dirichlet Allocation (LDA) or one of its variants is applied. LDA assumes that each document is a mixture of topics and that each topic is in turn a probability distribution over words. Through iterative optimization, LDA can discover the latent topic structure in a large collection of documents.
When the research project data is processed, each project description is regarded as a document, and LDA yields the topic distribution of each project and the keyword distribution of each topic. Based on these results, a project semantic network can be constructed in which nodes represent projects or topics and edges represent the strength of the semantic association between them. Then, based on the project semantic network and the comprehensive feature representation, a knowledge graph construction algorithm is used to mine and reason about the association relationships among projects, so as to obtain the scientific research project knowledge graph. The knowledge graph construction process comprises the key steps of entity recognition, relation extraction, and knowledge fusion. First, key entities such as research topics, methods, and application areas are identified from the project descriptions by named entity recognition (NER). Then, semantic relationships between entities are identified from the text with a relation extraction algorithm, such as a distant supervision or neural network based approach. In addition, potential associations between projects can be inferred through feature similarity calculations over the previously obtained comprehensive feature representation. In the knowledge fusion stage, the problems of entity alignment and relation alignment need to be solved, and duplicate and contradictory information is eliminated. Finally, the knowledge graph is extended by inference mechanisms, such as using transitivity rules to infer new relationships. The constructed scientific research project knowledge graph contains not only the direct associations among projects but also rich semantic information and implicit knowledge structures.
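As a non-limiting illustration of extending a knowledge graph with a transitivity rule, as mentioned above, the following sketch repeatedly adds implied triples until nothing new can be derived. The triple layout `(head, relation, tail)` and the relation name `subtopic_of` used in the usage note are hypothetical.

```python
def infer_transitive(triples, relation):
    """Return `triples` plus the transitive closure of one relation.

    If (x, relation, y) and (y, relation, z) are known, then
    (x, relation, z) is added, and the rule is applied to a fixed point.
    """
    known = set(triples)
    changed = True
    while changed:
        changed = False
        edges = [(h, t) for h, r, t in known if r == relation]
        for head1, tail1 in edges:
            for head2, tail2 in edges:
                inferred = (head1, relation, tail2)
                if tail1 == head2 and head1 != tail2 and inferred not in known:
                    known.add(inferred)
                    changed = True
    return known
```

For example, from `("GNN", "subtopic_of", "deep learning")` and `("deep learning", "subtopic_of", "machine learning")` the rule derives `("GNN", "subtopic_of", "machine learning")`. Whether a given relation is actually transitive is a modeling decision that has to be made for each relation type.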
After the scientific research project knowledge graph is available, the next step is to perform multi-dimensional relevance calculation and ranking with a graph neural network and an attention mechanism to obtain the personalized project recommendation list. The core of this step is to design a deep learning model that can perform inference over the knowledge graph. A graph attention network (GAT) may be employed as the base model because it can adaptively assign importance weights to different neighbor nodes. First, each node in the knowledge graph is initialized with a feature vector, which may combine the attribute information of the node with the previously obtained comprehensive feature representation. Then, a multi-layer GAT structure is built, each layer containing multiple attention heads. In each layer, each node updates its own representation by aggregating the information of its neighboring nodes. The attention mechanism allows the model to dynamically adjust the importance of different relationships and neighbors according to the current task. After multi-layer propagation, each project node obtains a representation containing rich context information. Finally, a recommendation score is calculated from the similarity between the user's research interests (which may be the final representation of the user node) and the representation of each project node. The projects are ranked according to these scores, thereby obtaining the personalized project recommendation list. Lastly, relationship path analysis and feature importance quantification are performed on each project in the personalized project recommendation list based on the scientific research project knowledge graph to obtain the interpretable recommendation basis. This step is intended to provide an intuitive and understandable explanation of the recommendation results.
In a particular implementation, the explanation may be generated by finding the shortest path or the most significant path from the user node to the recommended project node in the knowledge graph. At the same time, the contribution of different features to the recommendation is quantified with attention weights or with techniques such as SHAP (SHapley Additive exPlanations). This explanatory information is ultimately integrated into an easily understood textual or visual form and presented to the user as the basis for the recommendation.

Further, performing relationship path analysis and feature importance quantification on each project in the personalized project recommendation list based on the scientific research project knowledge graph to obtain the interpretable recommendation basis comprises: performing a bidirectional breadth-first search on the nodes related to the recommended project in the scientific research project knowledge graph to obtain a multi-level relationship path set; performing semantic embedding on the nodes and edges of the paths in the relationship path set to obtain path representation vectors; applying an attention mechanism and a path pruning algorithm to the path representation vectors to obtain core interpretation paths; quantifying feature contribution based on the core interpretation paths and the comprehensive feature representation to obtain a feature importance ranking; and constructing a structured interpretation template through natural language generation according to the core interpretation paths and the feature importance ranking to obtain the interpretable recommendation basis.

Specifically, a bidirectional breadth-first search is first performed on the nodes related to the recommended project in the scientific research project knowledge graph. This step explores the multi-level relationships between the recommended project and other entities. In particular, the breadth-first search starts from the recommended project node and proceeds in both directions, along incoming edges and along outgoing edges. During the search, the level and path information of each visited node are recorded. To control the breadth and depth of the search, a maximum search depth and a maximum number of expanded nodes per level may be set. The bidirectional search strategy comprehensively captures the context of the project in the knowledge graph, including its prerequisites, related topics, potential applications, and the like. The search results form a multi-level relationship path set, in which each path represents a semantic association between the recommended project and another entity. After the relationship path set is obtained, the next step is to perform semantic embedding on the nodes and edges of the paths to obtain a vector representation of each path. The purpose of this step is to convert the discrete graph structure information into a continuous vector space representation, which facilitates subsequent similarity calculation and machine learning. In the implementation, embedding vectors are trained in advance for the nodes and edges of the knowledge graph. For nodes, a knowledge graph embedding method such as TransE or RotatE can be used, and for edges, specific vector representations can be assigned according to their types. Then, for each relationship path, the embedding vectors of its nodes and edges are concatenated in sequence order to form a preliminary path representation.
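As a non-limiting illustration of the bounded search at the start of this step, the following sketch runs a level-by-level breadth-first search from the recommended project node, once along outgoing edges and once along incoming edges, and records the path to every visited node. The adjacency layout, the path encoding as an alternating node/relation list, and the parameter names are assumptions of this example.

```python
def collect_paths(out_edges, start, max_depth, max_per_level):
    """Collect relationship paths around `start` in both edge directions.

    `out_edges` maps a node to a list of (relation, target) pairs.
    Each result records the direction, the level, and the path as
    [node, relation, node, relation, node, ...].
    """
    # Build a reverse index so incoming edges can be followed as well.
    in_edges = {}
    for source, edges in out_edges.items():
        for relation, target in edges:
            in_edges.setdefault(target, []).append((relation, source))

    paths = []
    for direction, adjacency in (("out", out_edges), ("in", in_edges)):
        visited = {start}
        frontier = [[start]]
        for depth in range(1, max_depth + 1):
            next_frontier = []
            for path in frontier:
                for relation, node in adjacency.get(path[-1], []):
                    # Skip visited nodes and respect the per-level cap.
                    if node in visited or len(next_frontier) >= max_per_level:
                        continue
                    visited.add(node)
                    new_path = path + [relation, node]
                    next_frontier.append(new_path)
                    paths.append(
                        {"direction": direction, "depth": depth, "path": new_path}
                    )
            frontier = next_frontier
    return paths
```

The depth limit and per-level cap correspond to the maximum search depth and the maximum number of expanded nodes per level mentioned above.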
To capture the overall semantics of the path, this preliminary representation may be further encoded with a sequence model such as an LSTM (long short-term memory network) or a Transformer to obtain the final path representation vector. This approach not only preserves the semantic information of each element in the path but also captures the order of and the long-range dependencies among the elements. Next, an attention mechanism and a path pruning algorithm are applied to the resulting path representation vectors to select the most representative core interpretation paths. The attention mechanism allows the model to learn the importance weights of different paths automatically. In a particular implementation, a multi-head self-attention mechanism may be used, in which each head calculates correlations between paths and the results of the multiple heads are then merged. In this way, the model can capture complex interactions between paths. Meanwhile, to reduce computational complexity and keep the explanation concise, a path pruning algorithm is applied. Pruning can be based on the attention weights, so that paths with higher weights are retained, while path diversity is also taken into account so that different types of explanation are preserved. The core interpretation path set obtained after pruning is highly relevant and comprehensively reflects the different aspects of the recommendation reasons. After the core interpretation paths are obtained, the next step is to quantify feature contribution based on these paths and the previously obtained comprehensive feature representation, and to obtain a feature importance ranking. This step determines the degree of influence of each individual feature on the recommendation result. In the implementation, the SHAP (SHapley Additive exPlanations) value calculation method may be used.
SHAP values are based on the Shapley value concept from game theory and distribute the contribution of each feature to the prediction result fairly. Specifically, an interpretation model is built that takes the core interpretation paths and the comprehensive feature representation as inputs and the recommendation score as output. The SHAP value of each feature is then calculated to obtain that feature's contribution to the recommendation result. These contributions may be normalized and ranked to form the feature importance ranking. This approach accounts for the effects of individual features and also captures interactions between features, providing a more comprehensive and accurate interpretation.

Finally, according to the core interpretation paths and the feature importance ranking, a structured interpretation template is constructed by natural language generation technology to obtain the interpretable recommendation basis. The goal of this step is to translate the mathematical representations obtained in the previous steps into a natural language description that the user can easily understand. In implementation, a series of interpretation templates is designed first; the templates should cover different types of recommendation reasons, such as similarity, complementarity, and novelty. Then, based on the semantic information of the core interpretation paths and the feature importance ranking, the most appropriate template is selected and filled with the relevant content. During filling, a pre-trained language model may be used to generate fluent and natural text. To personalize the interpretation, the language style and degree of technical specialization may be adjusted according to the user's background knowledge and preferences.
In addition, visual elements may be added, such as drawing the core interpretation paths as a simplified knowledge graph, so that the association between the recommended project and the user's interests is displayed intuitively. In this way, the finally generated recommendation basis accurately reflects the decision logic of the recommendation system and is presented in a user-friendly manner, which helps to enhance the user's understanding of and trust in the recommendation result.
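
The feature contribution quantification and template filling steps described above can be illustrated with the following minimal sketch. It computes exact Shapley values by enumerating feature coalitions, which is feasible only for a handful of features, and then fills a fixed string template. The scoring function `toy_score`, the feature names, the template wording, and the project and path labels are all hypothetical; a practical system would more likely approximate SHAP values for a trained model and could use a pre-trained language model instead of plain string formatting.

```python
from itertools import combinations
from math import factorial


def shapley_values(features, score):
    """Exact Shapley values; `score` maps a frozenset of feature names to a number."""
    n = len(features)
    values = {}
    for feature in features:
        others = [f for f in features if f != feature]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (score(coalition | {feature}) - score(coalition))
        values[feature] = total
    return values


def importance_ranking(values):
    """Normalize absolute contributions to sum to 1 and sort in descending order."""
    norm = sum(abs(v) for v in values.values())
    return sorted(((f, abs(v) / norm) for f, v in values.items()),
                  key=lambda item: item[1], reverse=True)


# Hypothetical recommendation score with one interaction between two features.
BASE = {"project_match": 0.5, "network_structure": 0.2, "collaboration_potential": 0.1}


def toy_score(coalition):
    total = sum(BASE[f] for f in coalition)
    if {"project_match", "collaboration_potential"} <= coalition:
        total += 0.1  # interaction bonus, shared equally by the two features
    return total


# Hypothetical template for the "similarity" type of recommendation reason.
TEMPLATES = {
    "similarity": ("Project {project} is recommended mainly because of {feature} "
                   "({share:.0%} of the explained score). Related path: {path}."),
}


def render_explanation(project, path, ranking, reason="similarity"):
    feature, share = ranking[0]
    return TEMPLATES[reason].format(project=project, feature=feature.replace("_", " "),
                                    share=share, path=" -> ".join(path))


values = shapley_values(list(BASE), toy_score)
ranking = importance_ranking(values)
text = render_explanation("P1", ["P1", "has_topic", "T1"], ranking)
print(ranking)
print(text)
```

Because the interaction bonus is split between the two interacting features, the sketch also shows how Shapley-based attribution reflects interactions between features and not only their individual effects.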

In this embodiment, a unified-format data set is generated by performing data preprocessing on multi-source heterogeneous scientific research data, and a multi-dimensional scientific research portrait feature set is mined from the data set. Pattern mining is performed on the academic output data of researchers based on the feature set to obtain a research interest pattern set and research trend indexes. A research interest prediction model and project matching degree scores are generated from the historical research data through predictive mining, and the model is used to perform graph structure mining on the researcher cooperation network to predict potential cooperation relationships. Finally, knowledge mining is performed on the scientific research project data according to the project matching degree scores, the network structure features, and the cooperation relationship prediction results to generate a personalized project recommendation list and an interpretable recommendation basis. The method comprehensively considers the interest evolution of researchers, project characteristics, and cooperation potential, and improves the accuracy and interpretability of recommendation.

The data mining method in the embodiment of the present invention is described above, and the data mining apparatus in the embodiment of the present invention is described below. Referring to fig. 2, an embodiment of the data mining apparatus in the embodiment of the present invention includes:

The feature mining module 201 is configured to perform data mining preprocessing on multi-source heterogeneous scientific research data to obtain a preprocessed unified format data set, and perform feature mining on the unified format data set to obtain a multi-dimensional scientific research portrait feature set;

The pattern mining module 202 is configured to perform pattern mining on a research activity of a researcher according to the academic output data in the multi-dimensional scientific research portrait feature set, so as to obtain a research interest pattern set and a research trend index;

The predictive mining module 203 is configured to predictively mine historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index, so as to obtain a research interest prediction model and a project matching degree score;

The graph structure mining module 204 is configured to perform graph structure mining on the researcher cooperation network in the multi-dimensional scientific research portrait feature set according to the research interest prediction model, so as to obtain network structure features and a prediction result of potential cooperation relations;

The knowledge mining module 205 is configured to perform knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching degree score, the network structure feature and the potential cooperative relationship prediction result to obtain a mining result, and perform comprehensive processing and knowledge representation on the mining result to obtain a personalized project recommendation list and an interpretable recommendation basis.

In the embodiment of the invention, the data mining apparatus runs the data mining method: it generates a unified-format data set by performing data preprocessing on multi-source heterogeneous scientific research data and mines a multi-dimensional scientific research portrait feature set from the data set. Pattern mining is performed on the academic output data of researchers based on the feature set to obtain a research interest pattern set and research trend indexes. A research interest prediction model and project matching degree scores are generated from the historical research data through predictive mining, and the model is used to perform graph structure mining on the researcher cooperation network to predict potential cooperation relationships. Finally, knowledge mining is performed on the scientific research project data according to the project matching degree scores, the network structure features, and the cooperation relationship prediction results to generate a personalized project recommendation list and an interpretable recommendation basis. The method comprehensively considers the interest evolution of researchers, project characteristics, and cooperation potential, and improves the accuracy and interpretability of recommendation.

The data mining apparatus in the embodiment of the present invention is described in detail above with reference to fig. 2 from the perspective of modularized functional entities, and the data mining device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.

Fig. 3 is a schematic structural diagram of a data mining device according to an embodiment of the present invention. The data mining device 300 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 310, a memory 320, and one or more storage media 330 (e.g., one or more mass storage devices) storing applications 333 or data 332. The memory 320 and the storage medium 330 may be transitory or persistent storage. The program stored on the storage medium 330 may include one or more modules (not shown), each of which may include a series of instruction operations for the data mining device 300. Further, the processor 310 may be configured to communicate with the storage medium 330 and execute the series of instruction operations in the storage medium 330 on the data mining device 300 to implement the steps of the data mining method described above.

The data mining device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input/output interfaces 360, and/or one or more operating systems 331, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will appreciate that the device structure illustrated in fig. 3 does not limit the data mining device provided by the present invention, which may include more or fewer components than shown, combine certain components, or arrange the components differently.

The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, or a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of the data mining method.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in whole or in part, may be embodied in the form of a software product stored in a storage medium, including instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. The storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code.

While the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those skilled in the art that the foregoing embodiments may be modified or equivalents may be substituted for some of the features thereof, and that the modifications or substitutions do not depart from the spirit and scope of the embodiments of the invention.

Claims

1. A data mining method, characterized in that the data mining method comprises:

Perform data mining preprocessing on multi-source heterogeneous scientific research data to obtain a preprocessed unified format data set, and perform feature mining on the unified format data set to obtain a multi-dimensional scientific research portrait feature set;

Based on the academic output data in the multi-dimensional scientific research portrait feature set, pattern mining is performed on the research activities of researchers to obtain a research interest pattern set and research trend indicators;

According to the research interest pattern set and research trend indicators, predictive mining is performed on the historical research data in the multi-dimensional scientific research portrait feature set to obtain a research interest prediction model and a project matching score;

According to the research interest prediction model, the graph structure mining is performed on the researcher cooperation network in the multi-dimensional scientific research portrait feature set to obtain the network structure characteristics and potential cooperation relationship prediction results;

According to the project matching scores, network structure characteristics and potential cooperation relationship prediction results, knowledge mining is performed on the research project data of the multi-source heterogeneous scientific research data to obtain mining results, and the mining results are comprehensively processed and knowledge represented to obtain a personalized project recommendation list; for each project in the personalized project recommendation list, relationship path analysis and feature importance quantification are performed based on the scientific research project knowledge graph to obtain an explainable recommendation basis, including: performing a bidirectional breadth-first search on the nodes related to the recommended projects in the scientific research project knowledge graph to obtain a multi-level relationship path set; performing semantic embedding processing on the nodes and edges of the paths in the relationship path set to obtain a path representation vector; applying an attention mechanism and a path pruning algorithm to the path representation vector to obtain the core explanation path, and, based on the core explanation path and the comprehensive feature representation, quantifying the feature contribution to obtain the feature importance ranking; and, according to the core explanation path and the feature importance ranking, constructing a structured explanation template through natural language generation technology to obtain the explainable recommendation basis; wherein the scientific research project knowledge graph is obtained by mining and reasoning about the correlations between projects, and the comprehensive feature representation is obtained by multimodal feature fusion of the project matching score, network structure characteristics and potential cooperation relationship prediction results.

2. The data mining method according to claim 1, characterized in that performing pattern mining on the research activities of researchers based on the academic output data in the multi-dimensional scientific research portrait feature set to obtain the research interest pattern set and research trend indicators comprises:

Performing topic modeling and keyword extraction on the academic output data in the multi-dimensional scientific research portrait feature set to obtain research topic distribution and keyword set;

According to the research topic distribution and keyword set, a time series cluster analysis is performed on the academic achievements of researchers in the academic output data to obtain the evolution trajectory of research interests;

Based on the evolution trajectory of the research interests, a frequent pattern mining algorithm is used to analyze the research behavior sequence in the academic output data to obtain a research interest pattern set;

The trend prediction algorithm and anomaly detection method are applied to the research interest pattern set, combined with preset external discipline development data, to obtain research trend indicators.

3. The data mining method according to claim 2, characterized in that using a frequent pattern mining algorithm, based on the evolution trajectory of the research interests, to analyze the research behavior sequence in the academic output data to obtain the research interest pattern set comprises:

Segmenting the evolution trajectory of the research interest into time windows to obtain multiple time series segments;

According to the time series segments, the research behavior sequences in the academic output data are encoded and serialized to obtain a standardized behavior sequence set;

Applying a preset sequence pattern mining algorithm to the behavior sequence set to extract frequent subsequence patterns and obtain candidate research interest patterns;

In combination with the research topic distribution, a hierarchical clustering algorithm is used to classify and merge the candidate research interest patterns to obtain a research interest pattern set.

4. The data mining method according to claim 1, characterized in that performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index to obtain the research interest prediction model and the project matching score comprises:

Performing feature fusion on the research interest pattern set and the research trend index, and combining the historical research data in the multi-dimensional scientific research portrait feature set to obtain a comprehensive feature vector;

Based on the comprehensive feature vector, the time series decomposition technique is used to separate the trend, cycle and random components of the historical research data to obtain multidimensional time series components;

Applying a deep learning model to the multidimensional time series component to generate a research interest prediction model, and performing interest prediction for a preset future time point based on the research interest prediction model and the research activity data in the multidimensional scientific research portrait feature set to obtain a predicted interest vector;

According to the predicted interest vector, combined with the preset project feature vector, the project matching score is obtained through cosine similarity calculation and multi-factor weighting.

5. The data mining method according to claim 1, characterized in that performing graph structure mining on the researcher cooperation network in the multi-dimensional scientific research portrait feature set according to the research interest prediction model to obtain the network structure characteristics and potential cooperation relationship prediction results comprises:

According to the node representation vector and the output of the research interest prediction model, the attributes of the researcher nodes in the researcher cooperation network are enhanced to obtain an enhanced node representation that incorporates interest information;

Applying a graph attention network to the enhanced node representation to perform multi-level information propagation and aggregation to obtain dynamically updated node features;

According to the node characteristics, the community detection algorithm is used to analyze the researcher cooperation network to obtain the network structure characteristics;

Based on the network structure characteristics and the node characteristics, link prediction is performed through a graph neural network to obtain a potential cooperation relationship prediction result.

6. The data mining method according to claim 1, characterized in that performing knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching score, network structure characteristics and potential cooperative relationship prediction results to obtain mining results, and performing comprehensive processing and knowledge representation on the mining results to obtain a personalized project recommendation list and an explainable recommendation basis, comprises:

Performing multimodal feature fusion on the project matching scores, network structure characteristics and potential cooperative relationship prediction results to obtain a comprehensive feature representation;

According to the comprehensive feature representation, semantic analysis and topic modeling are performed on the research project data of the multi-source heterogeneous scientific research data to obtain a project semantic network;

Based on the semantic network and comprehensive feature representation of the project, a knowledge graph construction algorithm is used to mine and infer the associations between projects to obtain a scientific research project knowledge graph;

Using graph neural networks and attention mechanisms, multi-dimensional correlation calculation and sorting are performed on the scientific research project knowledge graph to obtain a personalized project recommendation list;

For each project in the personalized project recommendation list, relationship path analysis and feature importance quantification are performed based on the scientific research project knowledge graph to obtain an explainable recommendation basis.

7. A data mining device, characterized in that the data mining device comprises:

A feature mining module is used to perform data mining preprocessing on multi-source heterogeneous scientific research data to obtain a preprocessed unified format data set, and perform feature mining on the unified format data set to obtain a multi-dimensional scientific research portrait feature set;

A pattern mining module is used to perform pattern mining on the research activities of researchers based on the academic output data in the multi-dimensional scientific research portrait feature set to obtain a research interest pattern set and research trend indicators;

A predictive mining module, for performing predictive mining on the historical research data in the multi-dimensional scientific research portrait feature set according to the research interest pattern set and the research trend index, to obtain a research interest prediction model and a project matching score;

A graph structure mining module is used to perform graph structure mining on the researcher cooperation network in the multi-dimensional scientific research portrait feature set according to the research interest prediction model, and obtain network structure characteristics and potential cooperation relationship prediction results;

A knowledge mining module is used to perform knowledge mining on the research project data of the multi-source heterogeneous scientific research data according to the project matching score, network structure characteristics and potential cooperation relationship prediction results to obtain mining results, and perform comprehensive processing and knowledge representation on the mining results to obtain a personalized project recommendation list; for each project in the personalized project recommendation list, relationship path analysis and feature importance quantification are performed based on the scientific research project knowledge graph to obtain an explainable recommendation basis, including: performing a bidirectional breadth-first search on the nodes related to the recommended projects in the scientific research project knowledge graph to obtain a multi-level relationship path set; performing semantic embedding processing on the nodes and edges of the paths in the relationship path set to obtain a path representation vector; applying an attention mechanism and a path pruning algorithm to the path representation vector to obtain a core explanation path, and, based on the core explanation path and the comprehensive feature representation, quantifying the feature contribution to obtain a feature importance ranking; and, based on the core explanation path and the feature importance ranking, constructing a structured explanation template through natural language generation technology to obtain the explainable recommendation basis; wherein the scientific research project knowledge graph is obtained by mining and reasoning about the correlations between projects, and the comprehensive feature representation is obtained by multimodal feature fusion of the project matching score, network structure characteristics and potential cooperation relationship prediction results.

8. A data mining device, characterized in that the data mining device comprises: a memory and at least one processor, wherein the memory stores instructions;

The at least one processor calls the instructions in the memory to enable the data mining device to perform the steps of the data mining method according to any one of claims 1 to 6.

9. A computer-readable storage medium having instructions stored thereon, wherein when the instructions are executed by a processor, the steps of the data mining method according to any one of claims 1 to 6 are implemented.