CN118171645B

CN118171645B - Business information analysis method and system based on text classification

Info

Publication number: CN118171645B
Application number: CN202410593096.0A
Authority: CN
Inventors: 郭桦宜; 丁靖洋; 周阳; 马雪; 王兵; 何幸珊
Original assignee: Yunnan Paidong Technology Co ltd
Current assignee: Yunnan Paidong Technology Co ltd
Priority date: 2024-05-14
Filing date: 2024-05-14
Publication date: 2024-07-09
Anticipated expiration: 2044-05-14
Also published as: CN118171645A

Abstract

The invention discloses a business information analysis method and a business information analysis system based on text classification, which relate to the technical field of natural language processing, wherein the method comprises the following steps: determining an information monitoring main body, carrying out information monitoring and acquisition on the information monitoring main body, and constructing an information data set; labeling the information data set to construct a business information analysis data set; training a business information analysis model based on the business information analysis data set to obtain an information analysis result; according to the preset analysis duration, acquiring an information data set in the preset analysis duration and an information analysis result obtained by a business information analysis model, and analyzing by using methods such as descriptive statistical analysis to obtain a comprehensive information analysis result; based on the information analysis result and the comprehensive information analysis result, an information report is generated and stored. The invention solves the technical problem that the network comments with complexity and variability are difficult to effectively and quickly process in the prior art, and achieves the technical effects of improving analysis accuracy and analysis efficiency.

Description

Business information analysis method and system based on text classification

Technical Field

The invention relates to the technical field of natural language processing, in particular to a business information analysis method and system based on text classification.

Background

In the age of social media and the internet today, network comments have become one of the important ways people express views, emotions, and opinions. However, with the rapid development and popularity of the internet, the vast amount of network comment data makes manual processing difficult and time consuming. Traditional NLP techniques based on rules and statistical methods face some challenges in handling network reviews. Network reviews are highly complex and variable, such that the accuracy and efficiency of traditional methods in processing network reviews is limited.

Disclosure of Invention

The application provides a business information analysis method and a business information analysis system based on text classification, which are used for solving the technical problem that the prior art is difficult to effectively and quickly process network comments with complexity and variability.

In view of the above problems, the present application provides a method and a system for analyzing business information based on text classification.

In a first aspect of the present application, there is provided a business information analysis method based on text classification, the method comprising:

Determining an information monitoring main body, connecting to the Internet to perform information monitoring and acquisition on the information monitoring main body, and preprocessing the acquired data, including data cleaning and information abstract generation, to construct an information data set; labeling the information data set with monitoring-subject relevance, topic type and emotional tendency to construct a business information analysis data set; training a business information analysis model composed of a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model based on the business information analysis data set, and performing information analysis with the business information analysis model to obtain an information analysis result; according to a preset analysis duration, acquiring all information data sets within the preset analysis duration and all information analysis results obtained by the business information analysis model, and analyzing them by descriptive statistical analysis, time series analysis and visualization methods to obtain a comprehensive information analysis result; and generating an information report based on the information analysis results and the comprehensive information analysis result, and storing the information report.

Connecting to the Internet to perform information monitoring and acquisition on the information monitoring main body, preprocessing the acquired data (including data cleaning and information abstract generation), and constructing the information data set comprises the following steps:

determining the Internet platform sources of network information based on the information monitoring main body, and periodically collecting information data from the Internet platforms using crawler technology, wherein the information data is publicly available and does not contain sensitive information;

cleaning the text data using data cleaning technology, which comprises extracting the valid business information text from the HTML source code, removing illegal characters from the text, and normalizing the text characters;

extracting keywords from the information data through a preset information abstract generation algorithm, and capturing the context before and after the keywords to generate an information abstract;

and storing the acquired and preprocessed data in a database to construct the information data set.
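As a non-limiting illustration, the data cleaning step described above (extracting text from HTML source code, removing illegal characters and normalizing text characters) may be sketched in Python as follows. The function name `clean_html`, the choice of Unicode NFKC normalization, and the treatment of Unicode control and format characters as "illegal characters" are assumptions made for this sketch; the source does not specify them.

```python
import re
import unicodedata
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_html(raw: str) -> str:
    """Hypothetical helper: HTML source in, cleaned plain text out."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    text = " ".join(parser.parts)
    # Assumption: NFKC folds full-width and other compatibility forms.
    text = unicodedata.normalize("NFKC", text)
    # Assumption: control/format/unassigned characters count as illegal;
    # they are replaced by a space so that words are not glued together.
    text = "".join(
        " " if unicodedata.category(ch).startswith("C") else ch for ch in text
    )
    # Collapse runs of whitespace left over from markup and replacements.
    return re.sub(r"\s+", " ", text).strip()
```

For example, `clean_html("<p>Ｈｅｌｌｏ&nbsp;world</p><script>x=1</script>")` yields the plain text of the paragraph with the full-width letters folded to ASCII and the script content dropped.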

Obtaining the comprehensive information analysis result comprises the following steps:

according to the corpus of the monitoring main body, setting a special vocabulary and a stop-word vocabulary, and performing word segmentation on the acquired text;

extracting a keyword sequence from the text by using a TF-IDF algorithm, calculating a hash value of the text keyword sequence by using a hash function, and performing weighting and dimension reduction processing to generate a value represented by a 16-bit hexadecimal number;

comparing the values of the texts of different pieces of information through a Hamming distance algorithm, determining the similarity between pieces of text information, and determining the information proportion according to the number of similar texts;

and, according to the information analysis results, analyzing the proportion of each emotional tendency, the release time and the discussion-volume trend of the information based on the information proportion, to obtain the comprehensive information analysis result.

Determining the similarity between pieces of text information comprises the following steps:

decomposing the text $T$ into a vocabulary set $W = \{w_1, w_2, \dots, w_n\}$, and extracting the TF-IDF values of the word sequence in the information text by a TF-IDF algorithm to obtain the TF-IDF weight sequence $\{t_1, t_2, \dots, t_n\}$;

calculating the hash value $h_i$ of each keyword through a hash function, and for each word $w_i$ constructing a 16-dimensional vector $v_i$, calculated as:

$$v_{i,j} = \begin{cases} t_i, & h_{i,j} = 1 \\ -t_i, & h_{i,j} = 0 \end{cases}$$

wherein $h_{i,j}$ is the value of the hash value $h_i$ at the $j$-th position;

accumulating all the feature vectors to obtain the total feature vector $V = \sum_{i=1}^{n} v_i$, wherein the fingerprint $S$, whose $j$-th bit is 1 when $V_j > 0$ and 0 otherwise, is a value represented as a 16-bit binary number;

and comparing the values of the two pieces of information by using a Hamming distance algorithm to obtain the similarity between the text information.
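The weighting, accumulation and Hamming-distance comparison described above follow the general pattern of a SimHash fingerprint. A minimal Python sketch under stated assumptions is given below: the TF-IDF weights are taken as already computed, the per-word hash is derived from SHA-256 (the source does not name a hash function), and a positive accumulated component maps to bit 1. All function names are hypothetical.

```python
import hashlib

BITS = 16  # fingerprint width, matching the 16-dimensional vector in the text


def hash_bits(word: str) -> list[int]:
    """Stable 16-bit hash of a word as a list of 0/1 values.

    Assumption: the first two bytes of SHA-256 stand in for the
    unspecified hash function.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    value = int.from_bytes(digest[:2], "big")
    return [(value >> (BITS - 1 - j)) & 1 for j in range(BITS)]


def simhash(weights: dict[str, float]) -> int:
    """Combine per-word TF-IDF weights into one 16-bit fingerprint."""
    total = [0.0] * BITS
    for word, weight in weights.items():
        for j, bit in enumerate(hash_bits(word)):
            # Add the weight where the hash bit is 1, subtract it otherwise.
            total[j] += weight if bit else -weight
    fingerprint = 0
    for component in total:
        # Dimension reduction: keep only the sign of each component.
        fingerprint = (fingerprint << 1) | (1 if component > 0 else 0)
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Two texts whose fingerprints differ in only a few bits would be treated as similar; the distance threshold is a tuning choice that the source does not state.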

In a second aspect of the present application, there is provided a business information analysis system based on text classification, the system comprising:

The information data set construction module is used for determining an information monitoring main body, connecting to the Internet to perform information monitoring and acquisition on the information monitoring main body, and preprocessing the acquired data, including data cleaning and information abstract generation, to construct an information data set; the business information analysis data set construction module is used for labeling the information data set with monitoring-subject relevance, topic type and emotional tendency to construct a business information analysis data set; the information analysis result obtaining module is used for training a business information analysis model composed of a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model based on the business information analysis data set, and performing information analysis with the business information analysis model to obtain an information analysis result; the comprehensive information analysis result acquisition module is used for acquiring, according to a preset analysis duration, all information data sets within the preset analysis duration and all information analysis results obtained by the business information analysis model, and analyzing them by descriptive statistical analysis, time series analysis and visualization methods to obtain a comprehensive information analysis result; and the information report generation module is used for generating and storing an information report based on the information analysis results and the comprehensive information analysis result.

One or more technical schemes provided by the application have at least the following technical effects or advantages:

(1) According to the invention, the correlation analysis sub-model is used in information analysis, so that the accuracy of information analysis is effectively improved, a large amount of irrelevant noise information can be filtered, and the overall operation efficiency of the system is improved;

(2) The topic classification sub-model adopts the same network structure as the correlation analysis sub-model; it can capture text information features and automatically classify information into preset business topics related to the monitoring main body; it can reveal different hot topics during information analysis, and can provide finer-grained emotion tendency distributions and emotion change trends when the information report is generated; moreover, the topic classification model captures the contextual semantics of the information, so no special dictionary or stop-word list needs to be designed. Meanwhile, because the classified topics focus on the monitoring subject, the classification results are more interpretable and practical and can support more refined information analysis. In addition, compared with clustering methods, the parameter selection scheme is clear and the computational complexity is low, so the requirement of real-time information processing and analysis can be met;

(3) The relevance analysis and topic classification are not used in isolation, but are designed to be tightly combined with other components of the whole system, so that a high-efficiency analysis system which works cooperatively is formed; the real-time performance and the efficiency of analysis are improved, the fineness and the accuracy of emotion tendency analysis are further enhanced through accurate correlation filtering and topic classification, the reliability and the accuracy of information analysis reports are improved, and high-quality service is provided for business information analysis;

(4) The invention not only presets the analysis duration, but also optimizes the time-sensitive analysis flow by combining the correlation analysis and topic classification models; this optimization enables the system to provide more accurate and refined analysis results within the preset time, and meets the quick-response requirement of business information analysis; the specific combination and cooperation of the preset analysis duration, the correlation analysis and the topic classification achieve efficient and accurate information analysis within a limited time;

(5) The invention solves the technical problem that the prior art is difficult to effectively and quickly process the network comments with complexity and variability, not only improves the effectiveness and reliability of data preprocessing, but also provides high-quality data input for the subsequent information analysis assembly, thereby providing more accurate and comprehensive data support for business information analysis.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a business information analysis method based on text classification according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a business information analysis system based on text classification according to an embodiment of the present application;

FIG. 3 is a diagram of a correlation analysis sub-model;

FIG. 4 is a flow chart of the collaborative work between models;

Description of reference numerals: information data set construction module 11, business information analysis data set construction module 12, information analysis result acquisition module 13, comprehensive information analysis result acquisition module 14, and information report generation module 15.

Detailed Description

The application provides a business information analysis method and a business information analysis system based on text classification, which are used for solving the technical problem that the prior art is difficult to effectively and rapidly process network comments with complexity and variability, so as to achieve the technical effect of improving analysis accuracy and analysis efficiency.

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or modules not expressly listed or inherent to such process, method, article, or apparatus.

Example 1: as shown in fig. 1, the present application provides a method for analyzing business information based on text classification, the method comprising:

Step S100: determining an information monitoring main body, connecting to the Internet to perform information monitoring and acquisition on the information monitoring main body, and preprocessing the acquired data, including data cleaning and information abstract generation, to construct an information data set;

In an embodiment of the present application, an information monitoring subject refers to a specific object that needs to be followed and analyzed, such as a company, brand, product, policy or event. When determining the information monitoring subject, the purpose and requirements of the monitoring need to be clear, so that the most suitable monitoring object is selected. For example, a company may want to monitor information about its brand, feedback on its products, or the activities of its competitors.

Information monitoring and collection begins by determining the information sources, including social media platforms, news websites, comment sections and the like, based on the selected information monitoring subject. Existing web crawler tools are then used to capture data related to the monitoring subject from the determined information sources. Crawler technology allows a large amount of online information to be collected automatically, which improves data acquisition efficiency. When collecting data, the relevant laws and regulations must be observed to ensure that data collection is legal and compliant; the information data is publicly available and does not contain sensitive information.

The collected data is cleaned by removing irrelevant information, handling illegal characters, normalizing the text and removing stop words. Information abstract generation is then performed, comprising keyword extraction and text abstract generation. After these preprocessing steps, a structured information data set is constructed.
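The abstract generation described above (extracting keywords and capturing the context before and after them) may be illustrated by the following minimal Python sketch. It assumes the keywords have already been extracted, uses a fixed character window around the first occurrence of each keyword, and merges overlapping windows; the function name, the window size and the `" ... "` separator are hypothetical choices, not details given in the source.

```python
def keyword_context_abstract(text: str, keywords: list[str], window: int = 20) -> str:
    """Build an abstract from the text surrounding each keyword.

    Assumption: a character window is used because it also works for
    unsegmented text such as Chinese.
    """
    spans = []
    for keyword in keywords:
        pos = text.find(keyword)
        if pos == -1:
            continue  # keyword not present in this text
        start = max(0, pos - window)
        end = min(len(text), pos + len(keyword) + window)
        spans.append((start, end))

    # Merge overlapping or touching windows so text is not repeated.
    spans.sort()
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    return " ... ".join(text[s:e].strip() for s, e in merged)
```

A larger window gives more context per keyword at the cost of a longer abstract; a sentence-based window would be an equally plausible design.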

Step S200: labeling the information data set with monitoring-subject relevance, topic type and emotional tendency, and constructing a business information analysis data set;

In the embodiment of the application, monitoring-subject relevance labeling refers to judging whether each piece of information is directly related to the monitoring subject. In the labeling process, it is necessary to determine whether the information is directly related to the brand, products, services, senior management, competitors and so on of the monitoring subject. Relevance labeling uses natural language processing techniques, such as named entity recognition and keyword matching, to automatically recognize whether the information contains entities or keywords related to the monitoring subject and to label it accordingly.

Topic type labeling refers to classifying information by topic. For labeling, a series of topic categories, such as product feedback, brand image, market dynamics, industry policy and competitor dynamics, is predefined according to the characteristics and requirements of the monitoring subject. A topic classification model is then trained using text classification techniques, such as a Transformer model with bidirectional encoder representations (BERT) or other deep learning algorithms, and the information is classified automatically.

Emotional tendency labeling refers to judging the emotional tendency of each piece of information, namely positive, negative or neutral. The emotional tendency of the information is judged automatically using emotion analysis techniques, such as rule-based methods, machine learning methods or deep learning methods.

After the labeling process, the labeled information data set can be used as the business information analysis data set. For each item, the data set comprises the original information text, a label indicating whether the information is related to the monitoring subject, the topic category to which it belongs, and its emotional tendency.
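The keyword-matching variant of relevance labeling mentioned above can be sketched as follows. This is only an illustration under simple assumptions: matching is a case-insensitive substring test, and the term list is supplied by the user; named entity recognition is not shown. The function name `label_relevance` and the example term "acme" are hypothetical.

```python
def label_relevance(text: str, subject_terms: set[str]) -> int:
    """Return 1 if any monitoring-subject term occurs in the text, else 0.

    Assumption: a case-insensitive substring match is sufficient for a
    first-pass label that a human annotator can later correct.
    """
    lowered = text.lower()
    return int(any(term.lower() in lowered for term in subject_terms))


# Hypothetical usage with an invented brand name.
labels = [label_relevance(t, {"acme"}) for t in ["Acme launches a phone", "Rain expected"]]
```

Such rule-based labels are cheap to produce but miss paraphrases and indirect mentions, which is one reason a trained sub-model is used for the actual relevance decision.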

Step S300: training a business information analysis model formed by a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model based on the business information analysis data set, and utilizing the business information analysis model to perform information analysis to obtain an information analysis result;

In embodiments of the present application, the quality and integrity of the business information analysis data set need to be ensured before training begins. This includes checking the data set for missing values, outliers and duplicate items and handling them accordingly.

The correlation analysis sub-model judges whether information is relevant to the monitoring subject, so that information unrelated to the monitored business subject can be identified and excluded. Machine learning algorithms such as logistic regression, or deep learning models such as a Transformer model with bidirectional encoder representations (BERT), may be used to train this model. When training the correlation analysis sub-model, features related to correlation analysis, such as keyword frequencies and entity recognition results, are first extracted from the information. A suitable machine learning or deep learning algorithm is then selected as the basis of the sub-model: a BERT model serves as the text sequence encoder to generate word embedding features that fuse contextual semantic information, a BiGRU network layer is introduced to extract contextual and long-distance dependency features from the sequence, and finally a multilayer perceptron (MLP) and a Softmax function classify the relevance of the information (see FIG. 3). The correlation analysis sub-model is trained on the labeled business information analysis data set, and its performance is optimized by iteratively adjusting the model parameters and hyperparameters. The trained sub-model is evaluated on a test set, and metrics such as accuracy and recall are calculated to ensure that the model performs well.

In the embodiment of the application, the correlation analysis submodel is specifically represented as follows:

Given a text sequence $X = \{x_1, x_2, \dots, x_n\}$, where $x_i$ denotes the token of the $i$-th word in the sequence, the Transformer model with bidirectional encoder representations (BERT) encodes the text sequence using a self-attention mechanism and generates contextualized word embedding representations:

$$H = \mathrm{BERT}(X);$$

wherein $H = \{h_1, h_2, \dots, h_n\}$ is the sequence of encoded hidden states, and $h_i \in \mathbb{R}^{d}$ is the $d$-dimensional hidden representation of the $i$-th word. The sequence is then further processed by a bidirectional gated recurrent unit (BiGRU) network layer to extract contextual and long-range dependency information:

$$H' = \mathrm{BiGRU}(H);$$

wherein $H'$ is the new hidden-layer sequence after processing by the BiGRU network layer. Finally, a multilayer perceptron and a Softmax function perform the classification task:

$$O = \mathrm{MLP}(H'), \qquad P(y \mid X) = \mathrm{Softmax}(O);$$

wherein $O$ is the output vector of the MLP, and $P(y \mid X)$ is the probability distribution over the predicted label $y$ given the text sequence $X$. The model is trained with a categorical cross-entropy loss, which optimizes its performance on the relevance classification task.
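For reference, the Softmax function and the classification cross-entropy mentioned above can be written out numerically. The sketch below uses only the Python standard library and operates on a single vector of MLP outputs; the encoder, BiGRU and MLP themselves are not reproduced, and the function names are illustrative.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Turn raw class scores into a probability distribution."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: list[float], label: int) -> float:
    """Loss for one example: negative log-probability of the true class."""
    return -math.log(probs[label])


# Two classes (e.g. unrelated = 0, related = 1) with equal scores.
probs = softmax([0.0, 0.0])
loss = cross_entropy(probs, 1)
```

With equal scores the two classes each receive probability 0.5 and the loss equals ln 2; training lowers the loss by raising the probability assigned to the labeled class.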

The correlation analysis submodel is used in information analysis, so that the accuracy of information analysis is effectively improved, a large amount of irrelevant noise information can be filtered, and the overall operation efficiency of the system is improved.

Among information analysis techniques, conventional techniques for topic analysis have methods based on keyword extraction, text clustering, and the like. Taking these two types of methods as examples for explanation, topic classification methods based on keyword extraction have the following limitations:

1) Contextual information missing: keyword extraction is typically based on statistical frequency and cannot fully take into account the contextual meaning of the vocabulary. This may result in the extracted keywords not accurately reflecting the actual content or intent of the text;

2) Key phrases ignored: some important concepts are expressed by phrases rather than individual words; traditional keyword extraction methods may fail to recognize these key phrases and thus miss important information;

3) Dependence on language resources: for Chinese Internet information, effective keyword extraction relies on specially designed dictionaries, stop-word lists and the like.

Topic classification techniques based on text clustering have the following limitations:

1) Parameter selection: clustering algorithms typically require preset parameters such as the number of clusters. Selecting incorrect parameters may lead to inaccurate clustering results;

2) Lack of interpretability: because clustering is based on the inherent structure of the data, the generated clusters may not be easily interpreted. The clustering result may require further analysis by a domain expert to obtain meaningful topic classification;

3) Computational cost: for large-scale data samples and real-time requirements, text clustering consumes a significant amount of computational resources and time.

The topic classification sub-model is used for classifying information by topic. Machine learning algorithms or deep learning models can likewise be used to train this model. For training, a series of topic categories is predefined according to the characteristics and requirements of the monitoring subject. Features related to topic classification, such as TF-IDF values and word vectors, are extracted from the information. The topic classification sub-model, based on a Transformer model with bidirectional encoder representations (BERT), is then trained, and its performance is optimized by iteratively adjusting the model parameters and hyperparameters. The trained topic classification sub-model is evaluated on a test set, and metrics such as classification accuracy and F1 score are calculated to ensure that the model can accurately classify information into the corresponding topic categories.

Specifically, the topic classification sub-model has the same deep-learning-based structure as the correlation analysis sub-model: a Transformer model with bidirectional encoder representations (BERT) serves as the text sequence encoder to generate word embedding features that fuse contextual semantic information, a BiGRU network layer is then introduced to extract contextual and long-distance dependency features from the sequence, and finally an MLP combined with a Softmax function classifies the topic of the information. The topic classification model captures the contextual semantics of the information, so no special dictionary or stop-word list needs to be designed.
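The evaluation metrics named above (accuracy, recall and F1 score) can be computed from test-set labels and predictions as in the following sketch. Macro-averaging of F1 across topic categories is an assumption; the source does not say which averaging is used, and the function names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(precision_recall_f1(y_true, y_pred, c)[2] for c in labels) / len(labels)
```

Macro-averaging weights every topic equally, which keeps rare topics from being hidden by frequent ones; a frequency-weighted average would be the alternative when overall volume matters more.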

The topic classification sub-model can capture text information features and automatically classify out commercial topics preset to be related to the monitoring subject. Different hot topics of information can be revealed during information analysis, and finer emotion tendency distribution and emotion change tendency can be provided during information report generation. Specifically, the topic classification sub-model is a supervised model based on deep natural language processing technology, and before training the model, a coarser-granularity classification scheme is formulated for topics concerned by a monitoring subject, and a certain amount of manually marked data is used for training the topic classification sub-model. After model training is completed, topics related to information can be effectively classified, and in an information report, finer granularity, namely emotion tendency distribution and emotion change tendency on topics concerned by a monitoring subject can be provided.

Therefore, the topic of classification is focused on the monitoring subject, and the classification result has higher interpretability and practicability and can assist more refined information analysis work. In addition, compared with a clustering method, the method has the advantages that the parameter selection scheme is clear, the calculation complexity is low, and the requirement of information real-time processing analysis can be met.

The emotion tendency analysis sub-model is used for judging the emotional tendency of the information, namely positive, negative or neutral. Machine learning algorithms or deep learning models can likewise be used to train this model. During training, features related to emotion tendency analysis, such as emotion dictionary matching results and emotion word vectors, are extracted from the information. A Transformer model with bidirectional encoder representations (BERT) is chosen as the basis of the emotion tendency analysis sub-model. The sub-model is trained on the labeled business information analysis data set, and its performance is optimized by iteratively adjusting the model parameters and hyperparameters. The trained emotion tendency analysis sub-model is evaluated on a test set, and metrics such as emotion classification accuracy and emotion score are calculated to ensure that the model can accurately judge the emotional tendency of the information.

After the relevance analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model are trained, the relevance analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model are combined into a complete business information analysis model. Then, the new information is automatically analyzed by using the model, and an information analysis result is obtained.

The relevance analysis sub-model and the topic classification sub-model not only improve analysis accuracy by using labeled information, but also adopt deep learning technology, in particular the combination of a Transformer model and a BiGRU network layer, to achieve deep semantic understanding and accurate classification of text data.

Relevance analysis and topic classification are not used in isolation but are tightly combined with the other components of the system to form an efficient, cooperative analysis system. The collaborative workflow between the models is shown in FIG. 4. First, the monitoring subject is set and information data about it is collected. Part of the information data is then sampled, data standardization and data enhancement are applied, and the relevance to the monitoring subject, topic category and emotional tendency of each sample are labeled manually. The relevance analysis model, topic classification model and emotion tendency analysis model are then trained. After training, the models are called on the full set of information data. Specifically, the correlation analysis model is called first to judge whether each piece of information is related to the monitoring subject, and unrelated data is removed. Next, the topic classification sub-model is called on the related data to classify topics; data on topics that are of no interest to the monitoring subject is removed, and the remaining information data is grouped by topic category. Finally, the emotion tendency analysis model is called on the grouped information data to analyze emotional tendency.
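The inference flow described above (relevance filtering, then topic classification and grouping, then emotion tendency analysis) may be sketched as a simple pipeline. In this illustration the three trained sub-models are represented by plain callables passed in as arguments; their names, the `topics_of_interest` parameter and the return shape are assumptions made for the sketch.

```python
from collections import defaultdict
from typing import Callable, Iterable


def analyze(
    items: Iterable[str],
    is_relevant: Callable[[str], bool],
    classify_topic: Callable[[str], str],
    classify_sentiment: Callable[[str], str],
    topics_of_interest: set[str],
) -> dict[str, list[tuple[str, str]]]:
    """Run the three sub-models in sequence and group results by topic."""
    grouped = defaultdict(list)
    for text in items:
        # Stage 1: drop information unrelated to the monitoring subject.
        if not is_relevant(text):
            continue
        # Stage 2: classify the topic and drop topics of no interest.
        topic = classify_topic(text)
        if topic not in topics_of_interest:
            continue
        grouped[topic].append(text)
    # Stage 3: emotion tendency analysis on the grouped information.
    return {
        topic: [(text, classify_sentiment(text)) for text in texts]
        for topic, texts in grouped.items()
    }
```

Because each stage only sees what the previous stage kept, the more expensive later models run on a smaller set of items, which is consistent with the efficiency argument made in the text.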

The business information analysis model formed by the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model not only improves the real-time performance and the efficiency of analysis, but also further enhances the fineness and the accuracy of emotion tendency analysis through accurate correlation filtering and topic classification, improves the reliability and the accuracy of information analysis reports, and provides high-quality service for business information analysis.

Step S400: according to the preset analysis duration, acquiring all information data sets in the preset analysis duration and all information analysis results obtained by the business information analysis model, and analyzing by using descriptive statistical analysis, time sequence analysis and a visualization method to obtain comprehensive information analysis results;

In an embodiment of the application, all information data sets within the preset analysis duration, for example a week, a month or a year, are obtained. The business information analysis model is applied to these data sets to obtain the corresponding information analysis results, including relevance, topic classification and emotional tendency.

Descriptive statistical analysis is performed on the collected information data to understand its basic characteristics, such as quantity, distribution, mean, median, mode and standard deviation. The frequency and proportion of different topics are analyzed to find which topics are most prominent within the preset time period. The proportions of positive, negative and neutral emotion are counted to gauge the overall attitude of the public.

Time sequence analysis is used to observe how different topics and emotional tendencies change over time. This helps to discover the laws and patterns of information evolution over time. Periodic variations that may be present in the information data, such as seasonal variations or the effects of recurring events, are explored. The influence of specific events, such as product releases and crisis events, is analyzed by observing the change in the information before and after the event.
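As a rough illustration of these descriptive statistics, the sketch below computes topic and emotion proportions and a few summary values with the Python standard library. The sample labels, the daily counts and the helper name `proportions` are invented for the example.

```python
from collections import Counter
from statistics import mean, median, pstdev

def proportions(labels):
    """Return the share of each label among all labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}

# Hypothetical analysis results for one preset time period.
topics = ["price", "quality", "price", "service"]
sentiments = ["positive", "negative", "neutral", "negative"]
daily_counts = [12, 15, 9, 20, 14]   # items collected per day

topic_share = proportions(topics)
sentiment_share = proportions(sentiments)
summary = {
    "mean": mean(daily_counts),
    "median": median(daily_counts),
    "std": pstdev(daily_counts),     # population standard deviation
}
```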

The method for obtaining the analysis result according to the preset analysis duration is as follows. First, the analysis time axis is divided into successive time periods. The length of each time period may be set according to the needs of the analysis and the requirements of the monitoring subject.

For each time period, the following indices are calculated:

1) Relevance trend analysis index

[formula not reproduced in the source text]

wherein the count term denotes, for the given time period, the number of pieces of information data that are related to the monitoring subject and belong to a given category. This index gives the proportion of related information in that category within the time period;

2) Topic trend analysis index

[formula not reproduced in the source text]

wherein the count term denotes, for the given time period, the number of pieces of information data that are related to the monitoring subject and belong to a given topic. This index gives the proportion of related information on that topic within the time period;

3) Emotional tendency trend analysis index

[formula not reproduced in the source text]

wherein the term denotes, for the given time period, the proportion of information data with a given emotional tendency among the information data that is related to the monitoring subject and belongs to a given category;

4) Comprehensive trend analysis

Finally, in order to consider the relevance, topic and emotional tendency of the information data together, the invention defines a comprehensive trend index:

[formula not reproduced in the source text]

This index combines the three dimensions and reflects, for a given time period, the comprehensive trend of the information data in a specific topic category and with a specific emotional tendency. It shows how relevance, topic category and emotional tendency interact as they change over time.
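Because the index formulas are not reproduced in this text, the following is only one plausible reading of the definitions above, offered as a sketch. It assumes that the topic index is the share of a topic among all relevant items in a period, that the emotion index is the share of an emotion within that topic, and that the comprehensive index is their product; the function name `trend_indices` and the record layout are hypothetical.

```python
from collections import Counter

def trend_indices(records):
    """records: (period, topic, sentiment) tuples for items already judged relevant."""
    records = list(records)
    per_period = Counter(p for p, _, _ in records)
    per_topic = Counter((p, t) for p, t, _ in records)
    per_sentiment = Counter(records)

    # Assumed topic index: share of each topic among relevant items in the period.
    topic_share = {k: n / per_period[k[0]] for k, n in per_topic.items()}
    # Assumed emotion index: share of each sentiment within (period, topic).
    sentiment_share = {k: n / per_topic[k[:2]] for k, n in per_sentiment.items()}
    # Assumed comprehensive index: product of the two shares.
    combined = {k: topic_share[k[:2]] * sentiment_share[k] for k in per_sentiment}
    return topic_share, sentiment_share, combined

records = [
    ("2024-W01", "price", "negative"),
    ("2024-W01", "price", "positive"),
    ("2024-W01", "quality", "positive"),
    ("2024-W02", "price", "negative"),
]
topic_share, sentiment_share, combined = trend_indices(records)
```

Under these assumptions the comprehensive value for a (period, topic, emotion) triple equals the share of that triple among all relevant items in the period, which makes values comparable across periods of different size.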

By presetting the analysis duration and combining it with the relevance analysis and topic classification models, the application optimizes the duration-sensitive analysis flow. This optimization enables the system to provide more accurate and fine-grained analysis results within the preset time and meets the quick-response requirement of business information analysis. The specific combination and cooperation of the preset analysis duration, relevance analysis and topic classification allows efficient and accurate information analysis within a limited time.

When the visualization method is used for analysis, charts such as bar charts, pie charts and line charts display the statistical results and analysis results of the information data intuitively. Heat maps show the intensity and change of different topics or emotional tendencies within the preset time period. The time sequence data is used to make dynamic charts or videos that show the trends and patterns of information change over time.

After the analysis is completed, the main topics within the preset duration and the overall emotion attitude of the public are summarized, and a comprehensive and deep comprehensive information analysis result is obtained.

Step S500: and generating an information report based on the information analysis results and the comprehensive information analysis results, and storing the information report.

In the embodiment of the application, the information analysis results, including correlation analysis, topic classification, emotion tendency analysis and the like, and comprehensive information analysis results are summarized, so that the integrity and accuracy of data are ensured. And extracting key information such as main topics, emotional tendency, trend change and the like from the summarized data, and providing a basis for information reporting.

Using a preset report template, the extracted analysis results are automatically filled into the corresponding positions in the template, and visual content such as charts and statistical diagrams is automatically generated from the analysis data to make the report easier to read and understand. The generated report is then checked automatically to confirm that the content is complete, the data is accurate and the logic is reasonable. Once the check finds no problems, the generated information report is stored in a designated location, such as a server, cloud storage or another storage medium.
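Template filling followed by an automatic check might look like the sketch below. The template text, the field names and the two validation rules are assumptions chosen for illustration; the source does not specify the report layout or the checks performed.

```python
from string import Template

# Hypothetical report template; placeholders are filled from analysis results.
REPORT_TEMPLATE = Template(
    "Report for $subject ($period)\n"
    "Main topics: $topics\n"
    "Sentiment: positive $pos%, negative $neg%, neutral $neu%\n"
)
REQUIRED = ("subject", "period", "topics", "pos", "neg", "neu")

def build_report(fields):
    """Check completeness and basic consistency, then fill the template."""
    missing = [key for key in REQUIRED if key not in fields]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if abs(fields["pos"] + fields["neg"] + fields["neu"] - 100) > 0.5:
        raise ValueError("sentiment shares do not sum to 100")
    return REPORT_TEMPLATE.substitute(fields)

report = build_report({
    "subject": "AcmeCo", "period": "2024-05",
    "topics": "price, quality", "pos": 40, "neg": 35, "neu": 25,
})
```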

Further, step S100 in the method provided in the application embodiment further includes:

Text data is cleaned using a data cleaning technique, which includes extracting the valid business information text from HTML source code, removing illegal characters from the text, and standardizing the text characters. The data cleaning technique focuses on the original sequence itself rather than simply removing noisy data. Removing information unrelated to the monitoring subject is left to the relevance analysis sub-model, so that each processing step can be optimized for its own task instead of applying a one-size-fits-all approach.

Keywords are extracted from the information data by a preset information abstract generation algorithm, and the context before and after each keyword is captured to generate an information abstract. The algorithm focuses on keywords and, more importantly, captures their context, and it extracts the interaction relations among keywords through a deep-learning-based method. This context-sensitive feature extraction gives the system the ability to understand and analyze network information in depth;

In the embodiment of the application, a suitable crawler framework or parsing library, such as Scrapy or BeautifulSoup, is selected as required. These tools provide convenient data capture and processing functions. To keep the data up to date, a scheduled task is set so that the crawler periodically collects information data from the selected internet platforms. During collection, only public data that contains no sensitive information is captured, relevant laws and regulations are complied with, and user privacy is protected.

When the information data is cleaned, a parsing library is used to extract the valid business information text from the HTML source code. This includes removing extraneous content such as HTML tags and scripts, leaving only plain text. The extracted text is then cleaned to remove illegal characters, special symbols and the like, which ensures that the data is normalized and that subsequent analysis is accurate. Finally, texts in different encodings are converted to a single standard encoding to ensure the consistency and readability of the data.

A suitable information abstract generation algorithm, such as TextRank or LDA, is selected as required. These algorithms are used to extract keywords from text and capture the context of the keywords. The selected algorithm is applied to the cleaned information data to extract keywords, capture their context and generate an information abstract.

When a database is used to store the collected and preprocessed data, a suitable database system, such as MySQL or MongoDB, is first selected according to the data volume and requirements, so that the collected and preprocessed data can be stored and managed efficiently and stably. A reasonable database structure is then designed according to the characteristics and requirements of the information data. This includes determining table structures, fields and indexes to ensure efficient storage and querying. The collected and preprocessed information data is stored in the database, ensuring the security and accessibility of the data. Finally, data is extracted from the database as required to construct the information data set.
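The cleaning steps can be sketched with the Python standard library alone. This is an illustrative sketch rather than the patented procedure: it uses `html.parser` in place of a third-party parsing library, assumes UTF-8 for byte input, and treats Unicode control and format characters as the "illegal characters" to remove. The names `_TextExtractor` and `clean_html` are hypothetical.

```python
import re
import unicodedata
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def clean_html(raw):
    if isinstance(raw, bytes):                      # assumption: input bytes are UTF-8
        raw = raw.decode("utf-8", errors="replace")
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    # Text nodes are joined with a space, which may add spaces around inline tags.
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    # Drop control/format characters (Unicode category C*); emoji are kept.
    text = "".join(ch for ch in text if unicodedata.category(ch)[0] != "C" or ch in "\n\t")
    return re.sub(r"\s+", " ", text).strip()
```

NFKC normalization folds full-width letters and digits into their ASCII forms, which is one way to read "standardizing the text characters".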
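A storage layer of this kind could be sketched as below. SQLite from the Python standard library stands in for MySQL or MongoDB so that the example is self-contained; the table name `info_items`, its columns and the helper functions are assumptions, not details from the source.

```python
import sqlite3

def open_store(path=":memory:"):
    """Create the (hypothetical) table and an index on the publication time."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS info_items (
               id INTEGER PRIMARY KEY,
               source TEXT NOT NULL,
               published_at TEXT NOT NULL,  -- ISO 8601 timestamp
               content TEXT NOT NULL,
               summary TEXT
           )"""
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_published ON info_items(published_at)")
    return conn

def save_items(conn, items):
    """items: iterable of (source, published_at, content, summary) tuples."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO info_items (source, published_at, content, summary) "
            "VALUES (?, ?, ?, ?)",
            items,
        )

def load_dataset(conn, start, end):
    """Return (content, summary) rows published in the half-open range [start, end)."""
    rows = conn.execute(
        "SELECT content, summary FROM info_items "
        "WHERE published_at >= ? AND published_at < ? ORDER BY published_at",
        (start, end),
    )
    return rows.fetchall()
```

Indexing the publication time keeps the later per-duration queries cheap, since the analysis always selects by time window.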

Further, step S200 in the method provided in the application embodiment further includes:

Determining the sample number of each label based on the labeling information of the relevance, topic type and emotion tendency of the monitoring subject;

For each tag whose sample count does not meet the preset requirement, a text data enhancement technique is used to supplement and balance the sample counts; this includes generating supplementary sample data for the under-represented tags so that the sample data of all tags is balanced. The text data enhancement technique exploits two characteristics of Chinese internet language: it is informal, and it uses a large number of emoticons in place of ordinary characters. Using these characteristics, the scarcer topic-category and emotion-category data is enhanced by inserting punctuation marks, microblog topic symbols and emoticon character codes into the text sequence, which improves the effectiveness of model training.

In an embodiment of the application, information data that has been annotated with monitoring-subject relevance, topic type and emotional tendency is collected; this data covers all possible tags. For each tag, the number of corresponding samples is counted. The distribution of sample counts across tags is analyzed to see whether some tags have too few or too many samples.

If the sample count of certain tags does not meet the preset requirement, for example because it is too small, text data enhancement techniques are used to increase the number of samples for these tags. For text data, new samples can be generated by synonym substitution, in which certain words in a sentence are replaced with their synonyms or near-synonyms. Alternatively, the text may be translated into another language and then translated back into the original language. This process may introduce new words and expressions and thereby increase the diversity of the samples.

According to the data enhancement technique described above, new samples are generated for tags with an insufficient number of samples. Meanwhile, the generated new sample is ensured to meet the requirement of the target label while keeping original meaning. The new samples generated are added to the original dataset to increase the number of samples for these tags.

After the new samples are added, the number of samples for each tag is counted again to confirm that the tags are now more balanced. If the sample count of some tags still does not meet the requirement, the data enhancement and supplementation process is repeated until the preset balance requirement is met.
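The count, augment and recount loop can be sketched as follows, using the insertion-style enhancement mentioned earlier (punctuation, topic marks, emoticon codes). The token list `INSERTS`, the function names and the fixed `target` count are illustrative assumptions; a real system would also check that each generated sample still fits its tag.

```python
import random
from collections import Counter

# Hypothetical tokens: punctuation, a microblog topic mark, an emoticon code.
INSERTS = ["，", "！", "#", "[微笑]"]

def augment(text, rng):
    """Insert one token at a random position in the text."""
    pos = rng.randint(0, len(text))
    return text[:pos] + rng.choice(INSERTS) + text[pos:]

def balance(samples, target, seed=0):
    """samples: list of (text, label). Add augmented copies until every
    label has at least `target` samples; originals are kept unchanged."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in samples:
        by_label.setdefault(label, []).append(text)
    balanced = list(samples)
    for label, texts in by_label.items():
        for _ in range(max(0, target - len(texts))):
            balanced.append((augment(rng.choice(texts), rng), label))
    return balanced

samples = [("good", "positive")] * 5 + [("bad", "negative")]
balanced = balance(samples, target=5)
counts = Counter(label for _, label in balanced)
```

With very few source texts the augmented samples can repeat, so a deduplication step may be worth adding in practice.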

Further, step S300 in the method provided in the application embodiment further includes:

building a network framework for the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model, adopting a Transformer model with bidirectional encoder representations as the encoder, and adding a bidirectional gated recurrent unit (BiGRU) network layer and a multi-layer perceptron;

dividing the business information analysis data set by using a ten-fold cross validation technique to obtain a training set and a testing set;

and training each sub-model with the training set and the testing set through supervised fine-tuning to obtain the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model.

In the embodiment of the application, when the network framework is built, a bidirectionally encoded Transformer model such as BERT or RoBERTa is used as the base encoder of each sub-model. On top of the Transformer encoder, a bidirectional gated recurrent unit (BiGRU) network layer is added to further capture sequential dependencies in the sequence. After the BiGRU layer, a multi-layer perceptron layer may be attached to enhance the nonlinear mapping capability of the model. The multi-layer perceptron layer consists of several fully connected layers and can apply more complex transformations to the input data.

The business information analysis data set is divided into 10 equal parts. In each iteration, 9 of them are used as the training set and the remaining 1 as the test set.

The correlation analysis sub-model is trained using the training set. The goal of this model is to predict the degree of correlation between a given text and the monitoring subject. Hyperparameters such as learning rate and batch size are adjusted so that the model achieves the best performance on the training set.

The topic classification sub-model is trained using the training set. The goal of this model is to automatically classify text into predefined topic categories. Likewise, hyperparameters are adjusted so that the model achieves the best topic classification performance on the training set.

The emotion tendency analysis sub-model is trained using the training set. The goal of this model is to identify the emotional tendency expressed in the text, such as positive, negative or neutral. Through supervised fine-tuning, the model can accurately identify emotional tendencies in different texts.

In each iteration, the performance of each sub-model is evaluated using the test set. The evaluation indices may include accuracy, recall and F1 score. The model may be further optimized and adjusted based on the evaluation results. After multiple rounds of iteration and tuning, a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model with stable performance are finally obtained.

Further, the method further comprises:

The business information analysis data set is divided into 10 parts; 9 of them are used in turn for model training, and the remaining one is used for testing.

In the embodiment of the application, the business information analysis data set is randomly shuffled, so that the order of the data points does not influence the subsequent splitting and model evaluation. The shuffled data set is then split into 10 equal parts, each containing the same or a similar number of data points.

In each round, the model is trained using the 9 selected parts as the training set, and the remaining part is used as the test set to evaluate the performance of the model.
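A minimal sketch of the shuffle-and-split procedure is given below. The generator name `k_fold_splits` and the fixed random seed are illustrative choices, not details from the source.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Shuffle once, cut into k near-equal folds, and yield (train, test)
    pairs so that each fold serves as the test set exactly once."""
    items = list(items)
    random.Random(seed).shuffle(items)
    folds = [items[i::k] for i in range(k)]   # fold sizes differ by at most 1
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

splits = list(k_fold_splits(range(25)))
```

Each data point appears in exactly one test fold, so averaging the per-fold scores uses every sample for evaluation once.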

Further, the obtaining the comprehensive information analysis result further includes:

In an embodiment of the application, the specialized vocabulary contains terms and keywords related to the monitoring subject. Based on the monitoring subject's industry characteristics, products or services, relevant technical terms and keywords are collected to form the specialized vocabulary. The stop vocabulary contains words that appear frequently in text but contribute little to its meaning, such as "yes" and "in". An existing stop word list is screened and supplemented to suit the text processing requirements of the specific field.

Word segmentation is the process of segmenting continuous text into individual lexical units. When the collected text is subjected to word segmentation, word segmentation tools such as jieba word segmentation and the like are used, and the special vocabulary and the stop vocabulary are combined for filtering and optimizing.

TF-IDF is a weighting technique used in information retrieval and text mining. The TF and IDF values of each word in the text are calculated and multiplied to obtain the TF-IDF value of each word, and the words with the highest TF-IDF values are selected as keywords. A hash function transforms an input of arbitrary length into an output of fixed length, which is the hash value. A hash is computed over the extracted keyword sequence to generate a fixed-length hash value, namely a value represented by a 16-digit hexadecimal number.
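Keyword selection by TF-IDF can be sketched as below on texts that are already segmented and stop-word filtered. The sketch uses the plain, unsmoothed form IDF = log(N / df); the source does not say which TF-IDF variant is used, so this choice and the function name `tfidf_keywords` are assumptions.

```python
import math
from collections import Counter

def tfidf_keywords(docs, top_k=3):
    """docs: list of token lists. Returns the top_k keywords per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    results = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            tok: (cnt / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in tf.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results

docs = [
    ["phone", "battery", "battery", "price"],
    ["phone", "screen", "price"],
    ["phone", "service", "delay"],
]
keywords = tfidf_keywords(docs, top_k=2)
```

A word such as "phone" that occurs in every document gets an IDF of zero and is never chosen ahead of more specific words.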

The Hamming distance can be used to measure the similarity between texts. The smaller the Hamming distance, the more similar the two texts are; the larger the Hamming distance, the less similar they are. A threshold is set to determine which Hamming distances count as similar. For example, with a threshold of 5, two texts are considered similar when their Hamming distance is 5 or less. After text similarity has been determined, the number of similar texts among all the information texts is counted and divided by the total number of texts to obtain the proportion of similar texts, namely the information proportion.

Based on the information proportion, the quantity proportion, release time and mention-volume trend of the information emotion tendency data are analyzed in depth. The proportions of positive, negative and neutral information are counted to understand the overall distribution of emotional tendency. The release of information in different time periods is analyzed to understand the propagation speed and trend of the information. Time sequence analysis shows the trend of the mention volume and is used to predict how the information will develop in the future. The quantity proportion, release time and mention-volume trend are analyzed together to obtain the comprehensive information analysis result.
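The threshold rule and the resulting proportion can be sketched as follows for hexadecimal fingerprints. Measuring similarity against a single reference fingerprint is an assumption made for the example, since the source does not state what each text is compared with; the function names are hypothetical.

```python
def hamming_distance(hash_a, hash_b):
    """Number of differing bits between two equal-length hex fingerprints."""
    if len(hash_a) != len(hash_b):
        raise ValueError("fingerprints must have equal length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def similar_ratio(fingerprints, reference, threshold=5):
    """Share of fingerprints within `threshold` bits of the reference."""
    if not fingerprints:
        return 0.0
    similar = sum(hamming_distance(fp, reference) <= threshold for fp in fingerprints)
    return similar / len(fingerprints)

reference = "0000000000000000"
fingerprints = ["000000000000000f", "00000000000000ff", "0000000000000000"]
ratio = similar_ratio(fingerprints, reference)
```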

Further, the method further comprises:

[formula not reproduced in the source text]

wherein the term defined is the value at a given position of the hash value;

In the embodiment of the application, a text T is first preprocessed and then segmented with a word segmentation tool, which decomposes it into a set of words. For the segmented word set, the TF and IDF values of each word are calculated and multiplied to obtain the TF-IDF weight of each word. According to the calculated weights, the words with the highest weights are selected as keywords to form a keyword sequence.

For the extracted keyword sequence, the hash value of each keyword is calculated using a hash function. The hash function converts an input of arbitrary length into an output of fixed length, typically a longer hexadecimal string.

For each keyword hash value, a 16-dimensional vector is constructed. All the feature vectors are accumulated to obtain a total feature vector. This total feature vector is expressed as a value of 16 hexadecimal digits and contains the key information of the text T. Finally, the feature vectors of two texts are compared using the Hamming distance algorithm to determine the similarity between the two texts.
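The accumulate-and-compare procedure resembles the well-known SimHash technique, and the sketch below follows that technique rather than the exact formula of the source, which is not reproduced in this text. It assumes a 64-bit fingerprint (16 hexadecimal digits), MD5 as the per-keyword hash function and TF-IDF values as weights; all three are assumptions, as is the function name `simhash`.

```python
import hashlib

def simhash(weighted_keywords, bits=64):
    """weighted_keywords: dict of keyword -> weight (e.g. TF-IDF).
    Returns a fingerprint of bits // 4 hexadecimal digits."""
    totals = [0.0] * bits
    for word, weight in weighted_keywords.items():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        value = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # A 1 bit adds the weight at that position, a 0 bit subtracts it.
            totals[i] += weight if (value >> i) & 1 else -weight
    # Positions with a positive total become 1 bits of the fingerprint.
    fingerprint = sum(1 << i for i, total in enumerate(totals) if total > 0)
    return f"{fingerprint:0{bits // 4}x}"

fp_a = simhash({"battery": 0.5, "price": 0.1, "screen": 0.3})
fp_b = simhash({"battery": 0.5, "price": 0.1, "service": 0.3})
```

Texts that share most of their high-weight keywords tend to produce fingerprints that differ in few bits, which is what makes the Hamming distance a usable similarity measure here.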

Algorithms currently regarded as relatively advanced in the field of natural language processing are selected for comparison, including GloVe, Chinese-BERT-wwm-ext and bert-base-chinese.

The experimental data is derived from a Chinese Internet social platform, data annotation is carried out in a crowdsourcing mode, and strict verification is carried out on the data annotation so as to ensure the data quality.

In performance evaluation, three indices commonly used in classification tasks, precision, recall, and F1 score (F1-score), are employed. These metrics enable a comprehensive evaluation of the performance of the classification model, wherein:

Precision: the proportion of relevant instances correctly identified by the model among all instances that the model identified as relevant.

Recall: the proportion of relevant instances correctly identified by the model among all actual relevant instances.

F1 score: the harmonic mean of precision and recall. It balances the two and is an important index for evaluating the robustness of the model. The indices are calculated as follows, where TP, FP and FN are the numbers of true positives, false positives and false negatives:

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
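For a single positive class, the three metrics can be computed directly from true and predicted labels, as in this short sketch. The function name and the convention of returning 0.0 when a denominator is zero are illustrative choices.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1)
```

For the multi-class topic and emotion tasks, the same function can be applied per class and the results averaged.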

Comparison of the experimental results:

The comparison experiment results for the correlation analysis sub-model are shown in Table 1:

Table 1. Correlation analysis sub-model comparison experiment results

[table not reproduced in the source text]

The comparison experiment results for the topic classification sub-model are shown in Table 2:

Table 2. Topic classification sub-model comparison experiment results

[table not reproduced in the source text]

The comparison experiment results for the emotion tendency analysis sub-model are shown in Table 3:

Table 3. Emotion tendency analysis sub-model comparison experiment results

[table not reproduced in the source text]

The comparative analysis shows that the models of the application outperform the comparison algorithms on precision, recall and F1 score.

Therefore, the information monitoring main body is determined, connected with the Internet to monitor and collect information, and the collected data is preprocessed to construct an information data set; monitoring main body relativity, topic type and emotion tendency labeling are carried out on the information data set, and a business information analysis data set is constructed; based on the business information analysis data set, training a business information analysis model formed by a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model, and carrying out information analysis by utilizing the business information analysis model to obtain an information analysis result; according to the preset analysis duration, acquiring all information data sets in the preset analysis duration and all information analysis results obtained by a business information analysis model, and analyzing by using descriptive statistical analysis, time sequence analysis and a visualization method to obtain comprehensive information analysis results; and generating an information report based on the information analysis results and the comprehensive information analysis results, and storing the information report. The application solves the technical problem that the network comments with complexity and variability are difficult to effectively and quickly process in the prior art, and achieves the technical effects of improving analysis accuracy and analysis efficiency.

Example 2: based on the same inventive concept as the text-based business information analysis method in the foregoing embodiments, as shown in fig. 2, the present application provides a text-based business information analysis system, and the system and method embodiments in the embodiments of the present application are based on the same inventive concept. Wherein the system comprises:

The information data set construction module 11, wherein the information data set construction module 11 determines an information monitoring main body, is connected with the Internet to carry out information monitoring and acquisition on the information monitoring main body, and carries out preprocessing on acquired data, including data cleaning and information abstract generation, so as to construct an information data set;

The business information analysis data set construction module 12, wherein the business information analysis data set construction module 12 labels the information data set with monitoring-subject relevance, topic type and emotional tendency, and constructs a business information analysis data set;

The information analysis result obtaining module 13, wherein the information analysis result obtaining module 13 trains a business information analysis model formed by a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model based on the business information analysis data set, and utilizes the business information analysis model to conduct information analysis to obtain an information analysis result;

The comprehensive information analysis result acquisition module 14, wherein the comprehensive information analysis result acquisition module 14 acquires all information data sets within a preset analysis duration and all information analysis results obtained by the business information analysis model according to the preset analysis duration, and analyzes the information data sets by using descriptive statistical analysis, time sequence analysis and a visualization method to obtain a comprehensive information analysis result;

and an information report generation module 15, wherein the information report generation module 15 generates and stores an information report based on each of the information analysis results and the integrated information analysis results.

Further, the system further comprises:

And carrying out quantity supplement equalization on samples of which the sample quantity of each tag does not meet the preset requirement by using a text data enhancement technology, wherein the quantity supplement equalization comprises the steps of generating supplementary sample data to carry out tag sample quantity supplement so as to equalize the sample data of each tag.

Further, the system further comprises:

[formula not reproduced in the source text]

wherein the term defined is the value at a given position of the hash value;

It should be noted that the sequence of the embodiments of the present application is only for description, and does not represent the advantages and disadvantages of the embodiments. And the foregoing description has been directed to specific embodiments of this specification. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims can be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.

The foregoing description of the preferred embodiments of the application is not intended to limit the application to the precise form disclosed, and any such modifications, equivalents, and alternatives falling within the spirit and scope of the application are intended to be included within the scope of the application.

The specification and figures are merely exemplary illustrations of the present application and are considered to cover any and all modifications, variations, combinations, or equivalents that fall within the scope of the application. It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the scope of the application. Thus, the present application is intended to include such modifications and alterations insofar as they come within the scope of the application or the equivalents thereof.

Claims

1. A method of business information analysis based on text classification, the method comprising:

Determining an information monitoring main body, connecting the information monitoring main body with the Internet to perform information monitoring and acquisition, and preprocessing acquired data, including data cleaning and information abstract generation, to construct an information data set;

labeling the information data set with monitoring-subject relevance, topic type and emotional tendency, and constructing a business information analysis data set;

training a business information analysis model formed by a correlation analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model based on the business information analysis data set, and performing information analysis with the business information analysis model to obtain information analysis results; wherein training the business information analysis model formed by the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model based on the business information analysis data set specifically comprises:

building a network framework for the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model, adopting a pre-trained Transformer language model with bidirectional encoder representations as the encoder, and adding a bidirectional gated recurrent unit network layer and a multi-layer perceptron;

training each sub-model with the training set and the testing set through supervised fine-tuning to obtain the correlation analysis sub-model, the topic classification sub-model and the emotion tendency analysis sub-model;

According to the preset analysis duration, acquiring all information data sets in the preset analysis duration and all information analysis results obtained by the business information analysis model, and analyzing by using descriptive statistical analysis, time sequence analysis and a visualization method to obtain comprehensive information analysis results;

generating an information report based on the information analysis results and the comprehensive information analysis results, and storing the information report;

storing the collected and preprocessed data by using a database, and constructing the information data set;

according to each information analysis result, analyzing the quantity proportion, release time and mention-volume trend of the information emotion tendency data based on the information proportion to obtain the comprehensive information analysis result;

[formula not reproduced in the source text]

wherein the term defined is the value at a given position of the hash value;

2. The text-classification-based business information analysis method of claim 1, wherein monitoring subject relevance, topic type, and emotional tendency labeling of the information dataset to construct a business information analysis dataset comprises:

3. The text-classification-based business information analysis method of claim 1, wherein said segmenting the business information analysis dataset using a ten-fold cross-validation technique comprises:
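Ten-fold segmentation of a data set can be sketched as follows. This is a generic k-fold split under stated assumptions (random shuffling with a fixed seed, the hypothetical helper name `k_fold_splits`); the claim's own segmentation steps are not reproduced in this text.

```python
import random


def k_fold_splits(samples, k=10, seed=0):
    """Yield (train, test) pairs; each sample is held out exactly once."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index starting at offset i.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        held_out = set(folds[i])
        train = [samples[j] for j in indices if j not in held_out]
        test = [samples[j] for j in folds[i]]
        yield train, test


# Hypothetical data set of 50 labeled items, represented here by integers.
samples = list(range(50))
splits = list(k_fold_splits(samples))
```

With 50 samples, each of the ten rounds trains on 45 items and evaluates on the remaining 5.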

4. A system for implementing the text-classification-based business information analysis method according to any one of claims 1 to 3, wherein the system comprises:

The information data set construction module determines an information monitoring main body, is connected with the Internet to carry out information monitoring and acquisition on the information monitoring main body, and carries out preprocessing on acquired data, including data cleaning and information abstract generation, so as to construct an information data set;

The business information analysis data set construction module is used for labeling the information data set with monitoring-subject relevance, topic type and emotion tendency, and constructing a business information analysis data set;

The information analysis result obtaining module is used for training, based on the business information analysis data set, a business information analysis model formed by a relevance analysis sub-model, a topic classification sub-model and an emotion tendency analysis sub-model, and using the business information analysis model to perform information analysis to obtain an information analysis result;

the comprehensive information analysis result acquisition module acquires all information data sets in the preset analysis duration and all information analysis results obtained by the business information analysis model according to the preset analysis duration, and analyzes the information data sets by using descriptive statistical analysis, time sequence analysis and visualization methods to obtain comprehensive information analysis results;

and the information report generation module generates and stores an information report based on the information analysis results and the comprehensive information analysis results.
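Report generation and storage could be sketched as below. The JSON layout, the helper names `build_report` and `store_report`, and the output file name are all assumptions; the claims do not prescribe a report format or a storage medium.

```python
import json
import tempfile
from pathlib import Path


def build_report(item_results, summary):
    """Combine per-item results and the comprehensive summary into one report."""
    return {"items": item_results, "summary": summary}


def store_report(report, path):
    """Write the report as UTF-8 JSON so non-ASCII text stays readable."""
    path.write_text(
        json.dumps(report, ensure_ascii=False, indent=2), encoding="utf-8"
    )


# Hypothetical inputs standing in for real analysis output.
report = build_report(
    item_results=[{"text": "sample", "emotion": "negative"}],
    summary={"negative_share": 1.0},
)
out_path = Path(tempfile.gettempdir()) / "info_report.json"
store_report(report, out_path)
loaded = json.loads(out_path.read_text(encoding="utf-8"))
```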