CN118227736A - Text processing method, text processing device, electronic equipment and readable storage medium - Google Patents
Text processing method, text processing device, electronic equipment and readable storage medium
- Publication number
- CN118227736A (application CN202410345211.2A)
- Authority
- CN
- China
- Prior art keywords
- standardized
- query data
- data
- standardized query
- database
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/338—Presentation of query results
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The disclosure relates to the technical field of data processing, and provides a text processing method, a text processing device, electronic equipment and a readable storage medium. The method comprises the following steps: performing modal alignment processing on at least one standardized database list and standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data; determining a target database list according to a preset selection quantity and the semantic consistency scores; determining target prompt information according to the standardized query data and the target database list through prompt engineering; and performing text generation processing on the standardized query data and the target prompt information based on a large language model to obtain query result data. The method thereby improves the understanding of the input query data, the accuracy of the query result and the robustness of the model, deepens the analysis of the query data, and addresses the problem of inaccurate query results caused by database information and input information differing in form.
Description
Technical Field
The disclosure relates to the technical field of data processing, and in particular relates to a text processing method, a text processing device, electronic equipment and a readable storage medium.
Background
In existing query systems, a user enters a question, the system extracts and identifies the keywords in the question, screens resources in a resource library using similarity as the screening criterion, and displays all screening results for the user to choose from. Such systems, however, do not understand or interpret the question itself, and they lack a data management mode that intelligently understands the question and the scenario requirements and produces an answer by going from natural language text to a query language.
Therefore, the prior art has the problem that the output query result is inaccurate due to the fact that the database information is different from the input information in form.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a text processing method, apparatus, electronic device, and readable storage medium, so as to solve the problem in the prior art that the output query result is inaccurate due to the difference between the database information and the input information.
In a first aspect of an embodiment of the present disclosure, there is provided a text processing method, including: performing modal alignment processing on at least one standardized database list and standardized query data to obtain a semantic consistency score corresponding to at least one standardized query data; determining a target database list according to the preset selection quantity and semantic consistency scores corresponding to at least one standardized query data; determining target prompt information corresponding to standardized query data according to the standardized query data and a target database list through prompt engineering; and carrying out text generation processing on the standardized query data and target prompt information corresponding to the standardized query data based on the large language model to obtain query result data corresponding to the standardized query data.
In a second aspect of the embodiments of the present disclosure, there is provided a text processing apparatus, including: the first processing module is used for carrying out modal alignment processing on at least one standardized database list and standardized query data to obtain a semantic consistency score corresponding to the at least one standardized query data; the first determining module is used for determining a target database list according to the preset selection quantity and semantic consistency scores corresponding to at least one standardized query data; the second determining module is used for determining target prompt information corresponding to the standardized query data according to the standardized query data and the target database list through prompt engineering; the second processing module is used for generating and processing texts of the standardized query data and target prompt information corresponding to the standardized query data based on the large language model to obtain query result data corresponding to the standardized query data.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, there is provided a readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects. Modal alignment processing is performed on at least one standardized database list and standardized query data through a modal alignment model to obtain at least one semantic consistency score corresponding to the standardized query data. A target database list consisting of a plurality of standardized database lists is then obtained by screening according to the semantic consistency scores and a preset selection quantity. Prompt generation processing is performed on the standardized query data and the target database list through prompt engineering to obtain target prompt information corresponding to the standardized query data. The target prompt information is input into a large language model, and query result data corresponding to the standardized query data is obtained through text generation processing. As a result, the understanding of the input query data is improved, the accuracy of the query result is improved, the robustness of the model is enhanced, and the query data is analyzed more thoroughly.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are required for the embodiments or the description of the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort for a person of ordinary skill in the art.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
Fig. 2 is a flow chart of a text processing method according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of a text processing model provided by an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a text processing device according to an embodiment of the present disclosure;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A text processing method and apparatus according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 1, 2 and 3, a server 4 and a network 5.
The terminal devices 1, 2 and 3 may be hardware or software. When the terminal devices 1, 2 and 3 are hardware, they may be various electronic devices having a display screen and supporting communication with the server 4, including but not limited to smart phones, tablet computers, laptop computers, desktop computers, and the like; when the terminal devices 1, 2 and 3 are software, they may be installed in the electronic devices described above. The terminal devices 1, 2 and 3 may be implemented as a plurality of software or software modules, or as a single software or software module, which is not limited by the embodiments of the present disclosure. Further, various applications, such as a data processing application, an instant messaging tool, social platform software, a search application, a shopping application, and the like, may be installed on the terminal devices 1, 2 and 3.
The server 4 may be a server that provides various services, for example, a background server that receives requests sent by the terminal devices with which it has established a communication connection; the background server may receive and analyze such a request and generate a processing result. The server 4 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing service center, which is not limited by the embodiments of the present disclosure.
The server 4 may be hardware or software. When the server 4 is hardware, it may be various electronic devices that provide various services to the terminal devices 1, 2, and 3. When the server 4 is software, it may be a plurality of software or software modules providing various services to the terminal devices 1, 2, and 3, or may be a single software or software module providing various services to the terminal devices 1, 2, and 3, which is not limited by the embodiments of the present disclosure.
The network 5 may be a wired network connected by coaxial cable, twisted pair or optical fiber, or may be a wireless network that interconnects various communication devices without wiring, for example, Bluetooth, near field communication (NFC), infrared, etc., which is not limited by the embodiments of the present disclosure.
The user may establish a communication connection with the server 4 via the network 5 through the terminal devices 1, 2 and 3 to receive or send information. Specifically, the server 4 may perform modal alignment processing on at least one standardized database list and standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data; screen, according to the semantic consistency scores and a preset selection quantity, to obtain a target database list consisting of a plurality of standardized database lists; perform prompt generation processing on the standardized query data and the target database list through prompt engineering to obtain target prompt information corresponding to the standardized query data; and input the target prompt information into a large language model to obtain, through text generation processing, query result data corresponding to the standardized query data.
It should be noted that the specific types, numbers and combinations of the terminal devices 1, 2 and 3, the server 4 and the network 5 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiments of the present disclosure.
Fig. 2 is a flow chart of a text processing method according to an embodiment of the present disclosure. The text processing method of Fig. 2 may be performed by the server of Fig. 1. As shown in Fig. 2, the text processing method includes:
Step 201, performing modal alignment processing on at least one standardized database list and standardized query data to obtain a semantic consistency score corresponding to at least one standardized query data.
Specifically, the at least one standardized database list and the standardized query data may be processed through a modal alignment model to obtain the semantic consistency score corresponding to the at least one standardized query data, which improves the accuracy of information retrieval and the efficiency of data processing and provides support for subsequent processing and analysis. The modal alignment model may be a multi-modal learning model, a cross-modal matching model or the like, and the processing procedure is not limited here. The standardized database list may be a list obtained by standardizing the tables or fields in a database; this standardization ensures the consistency and comparability of the data, so that the list can subsequently undergo modal alignment processing with the standardized query data. The standardized query data may be query data that has undergone preprocessing and standardization steps, and may come from documents, databases or other sources; standardization ensures that it can be compared with the standardized database lists. The semantic consistency score may be a measure of the semantic similarity between the standardized query data and a standardized database list, obtained by comparing the two, and the way the score is calculated is not limited here.
For example, the standardized database lists A, B and C and the standardized query data a are aligned by the modal alignment model, so that the aligned database lists A, B and C and the standardized query data a are mapped into the same semantic space, and the semantic consistency score between the standardized query data a and each of the database lists is calculated to obtain the semantic consistency scores corresponding to the standardized query data. Here, the semantic consistency score of the standardized database list A and the standardized query data a is 0.2, the semantic consistency score of the standardized database list B and the standardized query data a is 0.3, and the semantic consistency score of the standardized database list C and the standardized query data a is 0.5.
Step 202, determining a target database list according to the preset selection quantity and the semantic consistency score corresponding to at least one standardized query data.
Specifically, all the semantic consistency scores may be sorted, and the standardized database lists corresponding to the highest scores, up to the preset selection quantity, may be selected from all the database lists, which improves retrieval efficiency, optimizes resource utilization and provides support for subsequent analysis. The preset selection quantity may be a preset number of lists to be selected from all the database lists. It may be determined according to the specific application scenario and task requirements, and it tells the system or algorithm how many of the highest-scoring database lists to select as candidates for output or further processing once the semantic consistency scores are obtained. The target database list may be the at least one database list that best matches or is most relevant to the standardized query data, selected according to the preset selection quantity and the semantic consistency scores.
For example, if an information retrieval task requires returning the five results with the highest semantic consistency scores, the preset selection quantity may be 5, and the 5 standardized database lists corresponding to the 5 highest semantic consistency scores may form the target database list.
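The selection in Step 202 can be illustrated with a minimal Python sketch. It assumes the semantic consistency scores are already available as a mapping from list name to score; the function name `select_target_lists` and the tie-breaking rule are illustrative choices, not part of the disclosure.

```python
from typing import Dict, List


def select_target_lists(scores: Dict[str, float], preset_count: int) -> List[str]:
    """Return the names of the `preset_count` highest-scoring database lists."""
    if preset_count < 1:
        raise ValueError("preset_count must be at least 1")
    # Sort by descending score; ties are broken by name so the result is deterministic.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:preset_count]]


# Hypothetical scores, reusing the values 0.2, 0.3 and 0.5 from the Step 201 example.
example_scores = {"A": 0.2, "B": 0.3, "C": 0.5}
print(select_target_lists(example_scores, 2))  # ['C', 'B']
```

If fewer lists exist than the preset selection quantity, this sketch simply returns all of them in score order.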
Step 203, determining target prompt information corresponding to the standardized query data according to the standardized query data and the target database list through prompt engineering.
Specifically, the required information may be selected from the standardized query data and the target database list according to a preset prompt template in prompt engineering, and target prompt information corresponding to the standardized query data may be generated, which improves query effectiveness and processing efficiency and promotes the understanding and use of the query information. Prompt engineering may be a method of improving the performance of an artificial intelligence model by designing and optimizing its input prompts. The technique may be used with Transformer-based large language models, including but not limited to the GPT series of language models from OpenAI, and may construct optimized questions or instructions that guide the model to generate valuable answers. Prompt engineering may analyze the semantic content of the standardized query data, the characteristics of the target database list and the degree of matching between the two, and generate corresponding prompts based on this information. For example, the standardized query data may be a question about a specific topic, and prompt engineering may generate prompts containing information from the target database list related to that topic, to help the model understand and answer the question. The target prompt information may provide additional explanation or context for the standardized query data, point to relevant data in the target database list, or direct the large language model to further refine or expand the query, which is not limited here.
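As a rough illustration of filling a preset prompt template, the following Python sketch builds target prompt information from a query and a target database list. The template wording, the name `build_target_prompt` and the sample schema entry are assumptions made for this example; the disclosure does not specify the template.

```python
from typing import List

# Hypothetical preset prompt template; the actual wording is not given in the disclosure.
PROMPT_TEMPLATE = (
    "You are given the following database entries:\n"
    "{schema}\n"
    "Answer the query using only these entries.\n"
    "Query: {query}"
)


def build_target_prompt(query: str, target_lists: List[str]) -> str:
    """Fill the template with the target database list and the standardized query."""
    schema = "\n".join(f"- {entry}" for entry in target_lists)
    return PROMPT_TEMPLATE.format(schema=schema, query=query)


prompt = build_target_prompt(
    "total order amount",
    ["shop.orders(order_id, amount)"],
)
print(prompt)
```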
Step 204, based on the large language model, text generation processing is performed on the standardized query data and the target prompt information corresponding to the standardized query data, so as to obtain query result data corresponding to the standardized query data.
Specifically, query result data corresponding to the standardized query data may be generated by a large language model according to the target prompt information corresponding to the standardized query data, which improves the accuracy and relevance of the query result, enriches the query result, improves processing efficiency and broadens the application range of the text processing method. The large language model may be a deep-learning-based natural language processing model that can understand and generate natural language text. It has a large number of parameters, is trained on a large amount of text data, and learns the inherent rules and patterns of language through a complex neural network structure, so it can capture the complexity of language and contextual information. Given input text, including but not limited to query data or prompt information, the large language model can generate new and reasonable text content. Here, the large language model processes the standardized query data and the corresponding target prompt information to generate the query result data. The query result data may be text related to the standardized query data, such as an answer or more detailed information about the query, and its content depends on the input query data and prompt information, which is not limited here.
According to the technical solution provided by the embodiments of the present disclosure, modal alignment processing is performed on at least one standardized database list and standardized query data through a modal alignment model to obtain at least one semantic consistency score corresponding to the standardized query data. A target database list consisting of a plurality of standardized database lists is then obtained by screening according to the semantic consistency scores and a preset selection quantity. Prompt generation processing is performed on the standardized query data and the target database list through prompt engineering to obtain target prompt information corresponding to the standardized query data. The target prompt information is input into a large language model, and query result data corresponding to the standardized query data is obtained through text generation processing. As a result, the understanding of the input query data is improved, the accuracy of the query result is improved, the robustness of the model is enhanced, and the query data is analyzed more thoroughly.
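The generation step can be sketched in Python by treating the large language model as an opaque callable. This is only an illustration under stated assumptions: `generate_query_result` is a hypothetical name, the way the prompt and query are concatenated is an arbitrary choice, and `stub_llm` is a stand-in that returns a canned answer rather than calling any real model.

```python
from typing import Callable, List


def generate_query_result(query: str, target_prompt: str,
                          llm: Callable[[str], str]) -> str:
    """Combine the prompt and the query, call the model, and tidy the output."""
    model_input = f"{target_prompt}\n\nStandardized query: {query}"
    return llm(model_input).strip()


received: List[str] = []


def stub_llm(text: str) -> str:
    # Hypothetical stand-in for a real model call: records its input and
    # returns a canned answer with stray whitespace.
    received.append(text)
    return "  SELECT SUM(amount) FROM orders;\n"


result = generate_query_result(
    "total order amount",
    "Schema: shop.orders(order_id, amount)",
    stub_llm,
)
print(result)
```

Passing the model in as a callable keeps the sketch independent of any particular model or client library.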
In some embodiments, before performing the modal alignment processing on the at least one standardized database list and the standardized query data to obtain the semantic consistency score corresponding to the at least one standardized query data, the method further includes: acquiring query data and at least one database list; performing field alignment processing on the query data to obtain standardized query data; and carrying out standardized rewrite processing on the at least one database list to obtain the at least one standardized database list.
Specifically, query data and at least one database list may be acquired. Field alignment may be performed on the query data so that the input query data is understood, and standardized query data may be generated through standardized rewrite processing including but not limited to character string preprocessing, query data rewriting and intent prediction. Standardized rewrite processing may also be performed on the at least one database list to obtain the at least one standardized database list, where the rewritten content may include the database name, data table names and field names, which is not limited here.
The query data may be input data that needs to be searched, matched or analyzed. It may take the form of text, numbers, dates or the like, chosen according to the application scenario and requirements, and may include specific information or conditions to be retrieved from a database or other data source. For example, in the search function of an e-commerce website, the keyword or phrase entered by the user may be the query data, and the text processing model may search the database for matching item information according to the query data.
The database list may be a collection of data containing a series of data records, which may be structured, such as tables in a relational database, or unstructured, such as a collection of documents in a document database. The database list may be stored in a database system and may be used to store, manage and retrieve data, which is not limited here. Continuing the foregoing example, the item information of the e-commerce website may be a series of data records stored in the database, and these records constitute the database list. When query data is entered, the system may search the database list for item information that matches the query data.
According to the technical solution provided by the embodiments of the present disclosure, the query data and at least one database list are acquired; field alignment is performed on the query data so that the input query data is understood, and standardized query data is generated through standardized rewrite processing including but not limited to character string preprocessing, query data rewriting and intent prediction; and standardized rewrite processing is performed on the at least one database list to obtain the at least one standardized database list. This improves data quality and query accuracy, makes the query data more comparable, simplifies the data processing flow, improves query matching, and enhances the flexibility and extensibility of the text processing model.
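The character string preprocessing and the rewriting of database, table and field names can be sketched as follows. This Python example covers only simple string normalization; query rewriting and intent prediction are not shown. The function names, the snake_case convention and the `database.table(field, ...)` output format are assumptions chosen for illustration.

```python
import re
from typing import Dict, List


def standardize_query(raw: str) -> str:
    """Lowercase the query, replace punctuation with spaces, collapse whitespace."""
    text = raw.strip().lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def _snake(name: str) -> str:
    # Rewrite a database, table or field name in lowercase snake_case.
    return re.sub(r"[\s\-]+", "_", name.strip()).lower()


def standardize_database_list(database: str,
                              tables: Dict[str, List[str]]) -> List[str]:
    """Rewrite each table as 'database.table(field, ...)' with normalized names."""
    entries = []
    for table, fields in sorted(tables.items()):
        columns = ", ".join(_snake(field) for field in fields)
        entries.append(f"{_snake(database)}.{_snake(table)}({columns})")
    return entries


print(standardize_query("  What is the TOTAL order-amount?? "))
print(standardize_database_list("Shop DB", {"Order Items": ["Order-ID", "Unit Price"]}))
```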
In some embodiments, performing a modal alignment process on the at least one standardized database list and the standardized query data to obtain a semantic consistency score corresponding to the at least one standardized query data, including: extracting features of at least one standardized database list to obtain feature vectors corresponding to the standardized database list; extracting features of the standardized query data to obtain feature vectors corresponding to the standardized query data; and performing space mapping processing on the feature vectors corresponding to the standardized database list and the feature vectors corresponding to the standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data.
Specifically, query data may be converted into the feature vector corresponding to the standardized query data through natural language processing techniques such as term frequency-inverse document frequency (TF-IDF) or a T5 model, so that each word or phrase in the text is mapped to a vector in a high-dimensional space. Feature extraction may likewise be performed on the standardized database list to obtain the feature vector corresponding to the standardized database list. The semantic consistency score corresponding to the at least one standardized query data may then be obtained by methods including but not limited to cosine similarity calculation or Euclidean distance calculation.
The feature vector corresponding to the standardized database list may be a numeric vector, extracted from the standardized database list by a feature extraction technique, that represents the semantic or content information of each item in the list. The feature vector corresponding to the standardized query data may be a numeric vector, extracted from the standardized query data by a feature extraction technique, that represents the semantic or content information of the query data.
For example, the query data may be converted into feature vectors corresponding to the standardized query data through the T5 model, each word or phrase in the text may be mapped into a vector in a high-dimensional space, feature extraction may be performed on the standardized database list to obtain feature vectors corresponding to the standardized database list, and the feature vectors corresponding to the standardized database list and the feature vectors corresponding to the standardized query data may be subjected to cosine similarity calculation to obtain a semantic consistency score corresponding to at least one standardized query data.
According to the technical solution provided by the embodiments of the present disclosure, query data is converted into the feature vector corresponding to the standardized query data through natural language processing techniques, so that each word or phrase in the text is mapped to a vector in a high-dimensional space, and feature extraction is performed on the standardized database list to obtain its feature vector. Through similarity calculation, the two feature vectors are mapped into the same semantic space and at least one semantic consistency score corresponding to the standardized query data is obtained, which improves the accuracy and efficiency of query matching and enhances the extensibility of the text processing model.
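A minimal Python sketch of the TF-IDF and cosine similarity variant of this scoring is given below, using only the standard library. It assumes whitespace-tokenizable text for both the query and the database lists and uses a smoothed IDF formula chosen for this example; a T5 encoder or Euclidean distance, which the text also mentions, are not shown. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors over a shared vocabulary (smoothed IDF)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_consistency_scores(query: str,
                                database_lists: Dict[str, str]) -> Dict[str, float]:
    """Score each database list against the query in one shared vector space."""
    names = list(database_lists)
    vectors = tfidf_vectors([query] + [database_lists[name] for name in names])
    query_vec = vectors[0]
    return {name: cosine(query_vec, vec) for name, vec in zip(names, vectors[1:])}


# Hypothetical standardized inputs.
lists = {
    "orders": "shop orders order id amount customer",
    "products": "shop products product id name price",
}
print(semantic_consistency_scores("total amount per customer", lists))
```

Fitting the query and the database lists together is what places them in the same vector space, so the cosine values are comparable across lists.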
In some embodiments, before performing the modal alignment processing on the at least one standardized database list and the standardized query data to obtain the semantic consistency score corresponding to the at least one standardized query data, the method further includes: acquiring a plurality of standardized query training data, a plurality of standardized database training lists and labels corresponding to the standardized database training lists, wherein the labels corresponding to the standardized database training lists are used for representing the true semantic consistency scores of the standardized database training lists; inputting a plurality of standardized query training data into a modal alignment model for feature extraction to obtain feature vectors corresponding to the standardized query training data; inputting a plurality of standardized database training lists into a modal alignment model for feature extraction to obtain feature vectors corresponding to the standardized database training lists; performing space mapping processing on the feature vectors corresponding to the standardized query training data and the feature vectors corresponding to the standardized database training list to obtain semantic consistency scores corresponding to the standardized query training data; determining loss of a modal alignment model according to semantic consistency scores corresponding to standardized query training data and labels corresponding to standardized database training lists; and updating parameters in the modal alignment model according to the loss of the modal alignment model in a cyclic iteration mode.
Specifically, a plurality of standardized query training data, a plurality of standardized database training lists and the labels corresponding to the database training lists may be obtained, where each label represents the real semantic consistency score of a database training list. The standardized query training data may be input to the modal alignment model for feature extraction to obtain the feature vector corresponding to each query training data, and the standardized database training lists may likewise be input to the modal alignment model for feature extraction to obtain the feature vector corresponding to each database training list. Spatial mapping processing may then be performed on the feature vectors of the query training data and the feature vectors of the database training lists, and a semantic consistency score may be calculated for each query training data from the resulting similarity. The calculated semantic consistency scores may be compared with the real labels corresponding to the database training lists to determine the loss of the modal alignment model; the loss may be calculated by mean square error loss, cross entropy loss, etc. Finally, the parameters in the modal alignment model may be updated by an optimization algorithm according to the calculated loss in a cyclic iteration mode.
The labels corresponding to the standardized database training list can be pre-labeled or generated according to a certain standard, can be used for representing the real semantic consistency score of the standardized database training list and guiding the training process of the modal alignment model, the loss of the modal alignment model can be a loss function, can be used for measuring the difference between the model prediction result and the actual label, and can be used for reflecting the degree of inconsistency between the semantic consistency score calculated by the modal alignment model and the real label, so that parameters of the model can be optimized by minimizing the loss, and the prediction result is close to the real label.
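The loop of scoring, loss calculation and parameter update can be sketched as follows. This is only a toy illustration under stated assumptions: the "model" is a hypothetical diagonal bilinear scorer over pre-computed feature vectors, the loss is mean square error, and the optimizer is plain gradient descent. The disclosure does not specify the architecture or the optimization algorithm of the modal alignment model.

```python
def predict(w, q, d):
    """Score a (query vector, database-list vector) pair: sum_i w_i * q_i * d_i."""
    return sum(wi * qi * di for wi, qi, di in zip(w, q, d))


def mse_loss(w, samples):
    """Mean square error between predicted scores and labelled scores."""
    return sum((predict(w, q, d) - y) ** 2 for q, d, y in samples) / len(samples)


def train(samples, dim, lr=0.1, epochs=200):
    """Update the parameters iteratively by gradient descent on the MSE loss."""
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for q, d, y in samples:
            err = predict(w, q, d) - y
            for i in range(dim):
                # d(err^2)/dw_i = 2 * err * q_i * d_i, averaged over the samples
                grad[i] += 2.0 * err * q[i] * d[i] / len(samples)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Hypothetical training triples: (query vector, list vector, labelled score).
samples = [
    ([1.0, 0.0], [1.0, 0.0], 1.0),
    ([0.0, 1.0], [0.0, 1.0], 1.0),
    ([1.0, 0.0], [0.0, 1.0], 0.0),
]
initial_loss = mse_loss([0.0, 0.0], samples)
w = train(samples, dim=2)
final_loss = mse_loss(w, samples)
```

On this toy data the loss falls from its initial value toward zero as the parameters are updated, which is the behaviour the training procedure aims for.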
According to the technical scheme provided by the embodiment of the disclosure, by acquiring a plurality of standardized query training data, a plurality of standardized database training lists and labels corresponding to the database training lists, the standardized query training data can be input into a modal alignment model to perform feature extraction to obtain feature vectors corresponding to each query training data, and then the standardized database training list can be input into the modal alignment model to perform feature extraction to obtain feature vectors corresponding to each database training list, the feature vectors of the query training data and the feature vectors of the database training lists can be subjected to spatial mapping processing, the similarity calculation result can calculate corresponding semantic consistency scores for each query training data, the loss of the modal alignment model can be determined by comparing the semantic consistency scores obtained by calculation with the real labels corresponding to the database training lists, and parameters in the modal alignment model can be updated by using an optimization algorithm according to the calculated loss in a cyclic iteration mode, so that the modal alignment effect is improved, the performance of the modal alignment model is enhanced, and the modal alignment capability of the modal alignment model is improved.
In some embodiments, determining the target database list according to the preset selection number and the semantic consistency score corresponding to the at least one standardized query data includes: sequencing semantic consistency scores corresponding to at least one standardized query data to obtain a database sequence list corresponding to the semantic consistency scores; and determining a target database list according to the database sequence list corresponding to the preset screening condition and the semantic consistency score.
Specifically, all semantic consistency scores corresponding to each standardized query data may be ranked by value; the higher the score, the higher the degree of semantic matching between the corresponding standardized database list and the standardized query data. A database sequence list may be generated for each query data according to the ranking result, in which the databases are ordered by their semantic consistency scores with respect to the standardized query data. The preset screening condition may include a score threshold, a selection number, etc., which is not limited here. For example, a minimum score threshold may be set and the databases whose semantic consistency scores are higher than the threshold may be selected; that is, the database lists meeting the preset screening condition may be selected from the database sequence list corresponding to each standardized query data to obtain the target database list. Alternatively, the preset number of database lists with the highest scores may be selected as the target database list according to the preset selection number.
The database sequence list corresponding to the semantic consistency score may be a list obtained by sorting the standardized database list according to the semantic consistency score corresponding to the standardized query data, databases in the list may be sorted according to the semantic consistency score, and the preset screening condition may be a preset screening standard and condition before determining the target database list, including but not limited to a score threshold, or a selection number, etc., may be used to screen a database list meeting the condition from the database sequence list corresponding to the semantic consistency score as the target database list. The preset screening conditions can be flexibly set and adjusted according to the requirements of practical applications.
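The ranking and screening described above can be sketched as follows. The function name, parameter names and example scores are hypothetical; both kinds of screening condition (a score threshold and a selection number) are shown.

```python
def select_target_lists(scores, top_k=None, min_score=None):
    """Rank database lists by score (descending), then apply the screening condition."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if min_score is not None:
        # keep only lists whose score is strictly higher than the threshold
        ranked = [(name, s) for name, s in ranked if s > min_score]
    if top_k is not None:
        # keep only the preset selection number of highest-scoring lists
        ranked = ranked[:top_k]
    return [name for name, _ in ranked]


# Hypothetical semantic consistency scores for one query.
scores = {"orders": 0.91, "customers": 0.42, "inventory": 0.77, "logs": 0.10}
top_two = select_target_lists(scores, top_k=2)
above_half = select_target_lists(scores, min_score=0.5)
```

The two conditions can also be combined, in which case the threshold is applied before the selection number.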
According to the technical scheme provided by the embodiment of the disclosure, all the semantic consistency scores corresponding to each standardized query data are ranked, and after the ranking is performed based on the numerical value of the semantic consistency score, the higher the score is, the higher the semantic matching degree between the corresponding standardized database list and the standardized query data is represented, one database sequence list can be generated for each query data according to the ranking result, and the databases in the database sequence list corresponding to the semantic consistency score can be ranked according to the semantic consistency score of the standardized query data, so that the data processing and query efficiency is improved, the query accuracy is improved, and the expandability and the adaptability of the system are enhanced.
In some embodiments, after determining target prompt information corresponding to the standardized query data according to the standardized query data and the target database list through prompt engineering, the method further includes: constraining target prompt information corresponding to standardized query data through a preset foreign key to obtain constraint prompt information corresponding to the standardized query data; and based on the large language model, carrying out text generation processing on the standardized query data and constraint prompt information corresponding to the standardized query data to obtain constraint query result data corresponding to the standardized query data.
Specifically, the target prompt information corresponding to the standardized query data may be constrained by a preset foreign key to obtain the constraint prompt information corresponding to the standardized query data. In database design, a foreign key is used to ensure referential integrity between data; it may be a field or a combination of fields whose value depends on the primary key of another table. The preset foreign key may be a rule or condition, set in advance, for constraining the target prompt information. Constraint processing is performed on the target prompt information according to the preset foreign key rule, which may include checking whether the field values in the target prompt information match the primary key values of other tables or satisfy specific task logic rules. Text generation processing may then be performed, based on the large language model, on the standardized query data and the constraint prompt information corresponding to the standardized query data: the standardized query data and the constraint prompt information are passed to the large language model as input, and the large language model generates the constraint query result data corresponding to the standardized query data from that input.
The preset foreign key may be a rule or a condition which is defined and completed in advance and used for constraining target prompt information, the constraint prompt information corresponding to standardized query data may be target prompt information after constraint processing of the preset foreign key, the constraint prompt information may include a field value and related task logic constraint conforming to the preset foreign key rule and may be used for subsequent text generation processing, and constraint query result data corresponding to standardized query data may be result data obtained after text generation processing is performed on the standardized query data and the constraint prompt information based on a large language model.
For example, the preset foreign key may be used to select the 20 standardized database lists with the highest semantic consistency scores as the target database list, and the screened target database list may be input into the large language model to obtain the constraint query result data corresponding to the standardized query data.
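One way to read the foreign key check described above is sketched below: each field value carried in the prompt information is checked against the primary key values of the referenced table, and a constraint note is appended to the prompt. The function name, field names, table contents and the wording of the note are all hypothetical; the disclosure does not fix how the constraint is expressed in the prompt.

```python
def constrain_prompt(prompt, rows, fk_field, primary_keys):
    """Keep rows whose foreign-key value references an existing primary key,
    and append a note stating the constraint to the prompt text."""
    valid = [row for row in rows if row[fk_field] in primary_keys]
    rejected = [row for row in rows if row[fk_field] not in primary_keys]
    note = f"Constraint: {fk_field} must reference one of {sorted(primary_keys)}."
    return prompt + "\n" + note, valid, rejected


# Hypothetical data: primary keys of a customers table, and order rows
# that are about to be placed into the prompt.
customer_ids = {1, 2}
order_rows = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},  # references no existing customer
]
constrained_prompt, valid_rows, rejected_rows = constrain_prompt(
    "List the orders placed by each customer.",
    order_rows,
    fk_field="customer_id",
    primary_keys=customer_ids,
)
```

Only the rows that pass the check, together with the constrained prompt, would then be passed on to the large language model.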
According to the technical scheme provided by the embodiment of the disclosure, the target prompt information corresponding to the standardized query data is constrained through the preset foreign key, that is, constraint processing is performed on the target prompt information according to the preset foreign key rule, to obtain the constraint prompt information corresponding to the standardized query data. Text generation processing can then be performed on the standardized query data and the corresponding constraint prompt information based on the large language model to obtain the constraint query result data corresponding to the standardized query data. The integrity and accuracy of the target prompt information are thereby guaranteed, the query efficiency and query accuracy are improved, and the flexibility and expandability are enhanced.
In some embodiments, determining target prompt information corresponding to the standardized query data according to the standardized query data and the target database list through prompt engineering includes: acquiring a preset prompt template; and filling the preset prompt template according to the standardized query data and the target database list to obtain target prompt information corresponding to the standardized query data.
Specifically, a preset prompt template may be obtained, and the standardized query data and the target database list may be filled into it according to the form of the template to obtain the target prompt information corresponding to the standardized query data. The preset prompt template may be a set of predefined, reusable text templates for generating standardized prompt information. It may be designed according to a specific application scenario and may include placeholders or variables, which is not limited here, to be filled according to the specific query data and database list. The filling process replaces the placeholders or variables in the preset prompt template with the related information in the standardized query data and the target database list, and includes, but is not limited to, operations such as character string replacement and formatting. A prompt template meeting the requirements may thus be selected according to the attributes of the query data and the characteristics of the target database list, and the field values in the query data, the metadata information of the database list, and the like may be filled into the preset prompt template.
The preset prompt templates can be set according to different application scenes, for example, in the field of electronic commerce, the prompt templates can be designed according to commodity query data and inventory database list, and the prompt templates are used for generating prompts about information such as commodity inventory states, price changes and the like; in the medical field, a prompt template can be designed according to medical record query data and a diagnosis database list for generating prompts about information such as disease diagnosis, treatment scheme and the like.
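Filling a preset prompt template by placeholder replacement can be sketched with Python's standard `string.Template`. The template wording, the placeholder names `$query` and `$lists`, and the inventory example are hypothetical and only illustrate the string replacement step.

```python
from string import Template

# Hypothetical preset prompt template with two placeholders.
PROMPT_TEMPLATE = Template(
    "Question: $query\n"
    "Candidate database lists:\n"
    "$lists\n"
    "Answer using only the lists above."
)


def build_prompt(query, target_lists):
    """Fill the template with the query and one line per target database list."""
    lines = "\n".join(f"- {name}: {fields}" for name, fields in target_lists)
    return PROMPT_TEMPLATE.substitute(query=query, lists=lines)


prompt = build_prompt(
    "How many units of item 42 are in stock?",
    [("inventory", "item_id, quantity, warehouse")],
)
```

`substitute` raises a `KeyError` when a placeholder is left unfilled, which helps catch templates that were only partially populated.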
According to the technical scheme provided by the embodiment of the disclosure, the standardized query data and the target database list are filled into the preset prompt template according to the form of the preset prompt template by acquiring the preset prompt template, so that the target prompt information corresponding to the standardized query data is obtained, the standardization degree of the information is improved, and the processing efficiency is improved.
Fig. 3 is a schematic structural diagram of a text processing model provided in an embodiment of the present disclosure. As shown in fig. 3, the structure of the text processing model includes:
a modal alignment model 301, configured to perform modal alignment processing on at least one standardized database list and standardized query data, so as to obtain a semantic consistency score corresponding to the at least one standardized query data;
a large language model 302, configured to perform text generation processing on the standardized query data and the target prompt information corresponding to the standardized query data, so as to obtain query result data corresponding to the standardized query data.
According to the technical scheme provided by the embodiment of the disclosure, modal alignment processing is performed on at least one standardized database list and the standardized query data through the modal alignment model to obtain the semantic consistency score corresponding to the at least one standardized query data. A target database list formed by a plurality of standardized database lists is then obtained by screening according to the semantic consistency score corresponding to the standardized query data and the preset selection quantity. Prompt generation processing is performed on the standardized query data and the target database list through prompt engineering to obtain the target prompt information corresponding to the standardized query data, the target prompt information is input into the large language model, and the query result data corresponding to the standardized query data is obtained through text generation processing. The understanding capability for the input query data is thereby improved, the accuracy of the query result is improved, the robustness of the model is enhanced, and the analysis degree of the query data is enhanced.
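The overall flow of scoring, screening, prompt construction and text generation can be sketched end to end. Everything below is a stand-in chosen for illustration: `word_overlap` replaces the modal alignment model with a simple word-overlap score, and `echo_model` replaces the large language model with a function that returns its input. Neither reflects the actual models of the disclosure.

```python
def run_text_processing(query, database_lists, score_fn, generate_fn, top_k=2):
    """Score each list, keep the top_k, build a prompt, and generate a result."""
    scores = {name: score_fn(query, text) for name, text in database_lists.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    targets = ranked[:top_k]
    prompt = "Query: " + query + "\nRelevant lists: " + ", ".join(targets)
    return generate_fn(prompt)


def word_overlap(query, text):
    """Stand-in scorer: Jaccard overlap between the word sets of query and text."""
    q, t = set(query.lower().split()), set(text.lower().split())
    union = q | t
    return len(q & t) / len(union) if union else 0.0


def echo_model(prompt):
    """Stand-in for a large language model: returns the prompt it received."""
    return "[generated from] " + prompt


result = run_text_processing(
    "stock quantity for item",
    {
        "inventory": "item quantity warehouse stock",
        "staff": "employee name salary",
        "prices": "item price currency",
    },
    score_fn=word_overlap,
    generate_fn=echo_model,
    top_k=2,
)
```

Passing the scorer and the generator as parameters keeps the two models separate from the screening and prompt steps, mirroring the two-component structure shown in fig. 3.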
Any combination of the above-mentioned optional solutions may be adopted to form an optional embodiment of the present disclosure, which is not described herein in detail.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 4 is a schematic structural diagram of a text processing device according to an embodiment of the present disclosure. As shown in fig. 4, the text processing apparatus includes:
a first processing module 401, configured to perform modal alignment processing on at least one standardized database list and standardized query data, to obtain a semantic consistency score corresponding to the at least one standardized query data;
a first determining module 402, configured to determine a target database list according to a preset selection number and a semantic consistency score corresponding to at least one standardized query data;
a second determining module 403, configured to determine, according to the standardized query data and the target database list, target prompt information corresponding to the standardized query data through prompt engineering;
a second processing module 404, configured to perform text generation processing on the standardized query data and the target prompt information corresponding to the standardized query data based on the large language model, so as to obtain query result data corresponding to the standardized query data.
According to the technical scheme provided by the embodiment of the disclosure, modal alignment processing is performed on at least one standardized database list and the standardized query data through the modal alignment model to obtain the semantic consistency score corresponding to the at least one standardized query data. A target database list formed by a plurality of standardized database lists is then obtained by screening according to the semantic consistency score corresponding to the standardized query data and the preset selection quantity. Prompt generation processing is performed on the standardized query data and the target database list through prompt engineering to obtain the target prompt information corresponding to the standardized query data, the target prompt information is input into the large language model, and the query result data corresponding to the standardized query data is obtained through text generation processing. The understanding capability for the input query data is thereby improved, the accuracy of the query result is improved, the robustness of the model is enhanced, and the analysis degree of the query data is enhanced.
In some embodiments, the text processing device is further configured to obtain query data and at least one database list; performing field alignment processing on the query data to obtain standardized query data; and carrying out standardized rewrite processing on the at least one database list to obtain the at least one standardized database list.
In some embodiments, the first processing module 401 is specifically configured to perform feature extraction on at least one standardized database list to obtain feature vectors corresponding to the standardized database list; extracting features of the standardized query data to obtain feature vectors corresponding to the standardized query data; and performing space mapping processing on the feature vectors corresponding to the standardized database list and the feature vectors corresponding to the standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data.
In some embodiments, the text processing apparatus is further configured to obtain a plurality of standardized query training data, a plurality of standardized database training lists, and labels corresponding to the plurality of standardized database training lists, where the labels corresponding to the standardized database training lists are used to characterize true semantic consistency scores of the standardized database training lists; inputting a plurality of standardized query training data into a modal alignment model for feature extraction to obtain feature vectors corresponding to the standardized query training data; inputting a plurality of standardized database training lists into a modal alignment model for feature extraction to obtain feature vectors corresponding to the standardized database training lists; performing space mapping processing on the feature vectors corresponding to the standardized query training data and the feature vectors corresponding to the standardized database training list to obtain semantic consistency scores corresponding to the standardized query training data; determining loss of a modal alignment model according to semantic consistency scores corresponding to standardized query training data and labels corresponding to standardized database training lists; and updating parameters in the modal alignment model according to the loss of the modal alignment model in a cyclic iteration mode.
In some embodiments, the first determining module 402 is specifically configured to sort the semantic consistency scores corresponding to the at least one standardized query data to obtain a database sequence list corresponding to the semantic consistency scores; and determine a target database list according to the preset screening condition and the database sequence list corresponding to the semantic consistency scores.
In some embodiments, the text processing device is further configured to constrain target prompt information corresponding to the standardized query data through a preset foreign key, so as to obtain constraint prompt information corresponding to the standardized query data; and based on the large language model, carry out text generation processing on the standardized query data and constraint prompt information corresponding to the standardized query data to obtain constraint query result data corresponding to the standardized query data.
In some embodiments, the second determining module 403 is specifically configured to obtain a preset prompt template; and fill the preset prompt template according to the standardized query data and the target database list to obtain target prompt information corresponding to the standardized query data.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not constitute any limitation on the implementation process of the embodiments of the disclosure.
Fig. 5 is a schematic diagram of an electronic device 5 provided by an embodiment of the present disclosure. As shown in fig. 5, the electronic device 5 of this embodiment includes: a processor 501, a memory 502 and a computer program 503 stored in the memory 502 and executable on the processor 501. The steps of the various method embodiments described above are implemented when the processor 501 executes the computer program 503. Alternatively, the processor 501 implements the functions of the modules/units in the above-described apparatus embodiments when executing the computer program 503.
The electronic device 5 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 5 may include, but is not limited to, a processor 501 and a memory 502. It will be appreciated by those skilled in the art that fig. 5 is merely an example of the electronic device 5 and does not limit it; the electronic device 5 may include more or fewer components than shown, or different components.
The processor 501 may be a central processing unit (Central Processing Unit, CPU), or may be another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
The memory 502 may be an internal storage unit of the electronic device 5, for example, a hard disk or a memory of the electronic device 5. The memory 502 may also be an external storage device of the electronic device 5, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash memory card (Flash Card) or the like provided on the electronic device 5. The memory 502 may also include both an internal storage unit and an external storage device of the electronic device 5. The memory 502 is used to store the computer program and other programs and data required by the electronic device 5.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a readable storage medium (e.g., a computer readable storage medium). Based on such understanding, all or part of the flow of the methods of the above-described embodiments of the present disclosure may be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, executable file form or some intermediate form, etc. The computer readable storage medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.
Claims (10)
1. A text processing method, comprising:
performing modal alignment processing on at least one standardized database list and standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data;
determining a target database list according to the preset selection quantity and semantic consistency scores corresponding to at least one standardized query data;
determining target prompt information corresponding to the standardized query data according to the standardized query data and the target database list through prompt engineering;
and based on a large language model, performing text generation processing on the standardized query data and target prompt information corresponding to the standardized query data to obtain query result data corresponding to the standardized query data.
2. The text processing method according to claim 1, further comprising, before performing a modality alignment process on the at least one standardized database list and the standardized query data to obtain a semantic consistency score corresponding to the at least one standardized query data:
acquiring query data and at least one database list;
performing field alignment processing on the query data to obtain the standardized query data;
and carrying out standardized rewrite processing on at least one database list to obtain at least one standardized database list.
3. The text processing method according to claim 1, wherein performing a modality alignment process on at least one standardized database list and standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data includes:
extracting features of at least one standardized database list to obtain feature vectors corresponding to the standardized database list;
extracting features of the standardized query data to obtain feature vectors corresponding to the standardized query data;
and performing space mapping processing on the feature vectors corresponding to the standardized database list and the feature vectors corresponding to the standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data.
4. The text processing method according to claim 1, further comprising, before performing the modality alignment processing on the at least one standardized database list and the standardized query data to obtain the at least one semantic consistency score corresponding to the standardized query data:
acquiring a plurality of standardized query training data, a plurality of standardized database training lists, and a plurality of labels corresponding to the standardized database training lists, wherein each label represents the true semantic consistency score of the corresponding standardized database training list;
inputting the plurality of standardized query training data into a modality alignment model for feature extraction to obtain feature vectors corresponding to the standardized query training data;
inputting the plurality of standardized database training lists into the modality alignment model for feature extraction to obtain feature vectors corresponding to the standardized database training lists;
performing space mapping processing on the feature vectors corresponding to the standardized query training data and the feature vectors corresponding to the standardized database training lists to obtain semantic consistency scores corresponding to the standardized query training data;
determining a loss of the modality alignment model according to the semantic consistency scores corresponding to the standardized query training data and the labels corresponding to the standardized database training lists; and
iteratively updating parameters of the modality alignment model according to the loss of the modality alignment model.
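The training procedure of claim 4 can be sketched in a few lines. The claim leaves the model architecture and loss open, so this sketch makes strong simplifying assumptions: the feature is a fixed token-overlap ratio rather than a learned encoder, the mapping is a two-parameter sigmoid, the loss is mean squared error against the label, and the update rule is plain gradient descent. All names and data are hypothetical.

```python
import math
from typing import List, Tuple


def overlap_feature(query: str, database_list: str) -> float:
    """Jaccard overlap of whitespace tokens (stand-in for learned feature vectors)."""
    q, d = set(query.lower().split()), set(database_list.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def predict(w: float, b: float, x: float) -> float:
    """Map a feature to a semantic consistency score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


def train(pairs: List[Tuple[str, str]], labels: List[float],
          lr: float = 0.5, epochs: int = 500) -> Tuple[float, float, List[float]]:
    """Iteratively update (w, b) by gradient descent on the mean squared error."""
    xs = [overlap_feature(q, d) for q, d in pairs]
    w, b, losses = 0.0, 0.0, []
    n = len(xs)
    for _ in range(epochs):
        loss = grad_w = grad_b = 0.0
        for x, y in zip(xs, labels):
            p = predict(w, b, x)
            loss += (p - y) ** 2
            g = 2.0 * (p - y) * p * (1.0 - p)  # d(loss)/d(pre-activation)
            grad_w += g * x
            grad_b += g
        losses.append(loss / n)
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b, losses


# Hypothetical training pairs; label 1.0 means the list matches the query.
pairs = [
    ("total order amount", "orders order amount date"),
    ("total order amount", "employees name salary"),
    ("employee salary", "employees name salary"),
    ("employee salary", "orders order amount date"),
]
labels = [1.0, 0.0, 1.0, 0.0]
w, b, losses = train(pairs, labels)
```

After training, pairs with higher overlap receive higher predicted scores, and the recorded loss decreases over the iterations.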
5. The text processing method according to claim 1, wherein determining the target database list according to the preset selection number and the at least one semantic consistency score corresponding to the standardized query data includes:
sorting the at least one semantic consistency score corresponding to the standardized query data to obtain a database sequence list corresponding to the semantic consistency scores; and
determining the target database list according to a preset screening condition and the database sequence list corresponding to the semantic consistency scores.
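The selection step of claim 5 amounts to ranking and truncating. The sketch below assumes the scores are held in a dictionary keyed by database list name, treats the preset selection number as a top-k cutoff, and adds an optional `min_score` threshold as one hypothetical form the preset screening condition could take.

```python
from typing import Dict, List


def select_target_lists(scores: Dict[str, float], selection_number: int,
                        min_score: float = 0.0) -> List[str]:
    """Rank database lists by score and keep the best `selection_number` of them."""
    # Sort by score, highest first; this is the "database sequence list".
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    # Apply the hypothetical threshold, then truncate to the preset number.
    kept = [name for name, score in ranked if score >= min_score]
    return kept[:selection_number]


# Hypothetical scores for three database lists.
example_scores = {"orders": 0.82, "employees": 0.10, "customers": 0.55}
targets = select_target_lists(example_scores, selection_number=2)
```

With the example scores, `targets` contains `orders` followed by `customers`.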
6. The text processing method according to claim 1, further comprising, after determining, through prompt engineering, the target prompt information corresponding to the standardized query data according to the standardized query data and the target database list:
constraining the target prompt information corresponding to the standardized query data by means of a preset foreign key to obtain constraint prompt information corresponding to the standardized query data; and
performing, based on the large language model, text generation processing on the standardized query data and the constraint prompt information corresponding to the standardized query data to obtain constrained query result data corresponding to the standardized query data.
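One way to picture the constraint step of claim 6 is to append key relationships to the prompt. The sketch below assumes that the preset key of this claim denotes a database foreign-key relationship and that constraining the prompt means listing the relationships that link the selected database lists; the tuple layout and function name are hypothetical, and the call to the large language model is omitted.

```python
from typing import List, Tuple

# (child list, child column, parent list, parent column); hypothetical layout.
ForeignKey = Tuple[str, str, str, str]


def constrain_prompt(prompt: str, target_lists: List[str],
                     foreign_keys: List[ForeignKey]) -> str:
    """Append the foreign keys that connect two selected database lists."""
    relations = [
        f"- {child}.{child_col} references {parent}.{parent_col}"
        for child, child_col, parent, parent_col in foreign_keys
        if child in target_lists and parent in target_lists
    ]
    if not relations:
        return prompt  # nothing to constrain
    return prompt + "\nForeign key constraints:\n" + "\n".join(relations)


# Hypothetical prompt, selection, and preset keys.
base_prompt = "Query: total order amount per customer"
preset_keys = [
    ("orders", "customer_id", "customers", "customer_id"),
    ("payroll", "employee_id", "employees", "employee_id"),
]
constrained = constrain_prompt(base_prompt, ["orders", "customers"], preset_keys)
```

Only the `orders` to `customers` relationship is appended here, because the `payroll` and `employees` lists were not selected.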
7. The text processing method according to claim 1, wherein determining, through prompt engineering, the target prompt information corresponding to the standardized query data according to the standardized query data and the target database list includes:
acquiring a preset prompt template; and
filling the preset prompt template according to the standardized query data and the target database list to obtain the target prompt information corresponding to the standardized query data.
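The template filling of claim 7 can be sketched with ordinary string formatting. The template wording, the placeholder names, and the example data below are assumptions made for illustration; the claim does not disclose the actual template.

```python
from typing import List

# Hypothetical preset prompt template with two placeholders.
PROMPT_TEMPLATE = (
    "Relevant database lists:\n"
    "{lists}\n\n"
    "Query: {query}\n"
    "Answer the query using only the lists above."
)


def build_prompt(query: str, target_lists: List[str],
                 template: str = PROMPT_TEMPLATE) -> str:
    """Fill the template with the standardized query and the target database lists."""
    formatted_lists = "\n".join(f"- {item}" for item in target_lists)
    return template.format(lists=formatted_lists, query=query)


prompt = build_prompt(
    "total order amount per customer",
    ["orders(order_id, customer_id, total)", "customers(customer_id, name)"],
)
```

The resulting `prompt` string is what would be passed, together with the standardized query data, to the large language model.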
8. A text processing apparatus, comprising:
a first processing module, configured to perform modality alignment processing on at least one standardized database list and standardized query data to obtain at least one semantic consistency score corresponding to the standardized query data;
a first determining module, configured to determine a target database list according to a preset selection number and the at least one semantic consistency score corresponding to the standardized query data;
a second determining module, configured to determine, through prompt engineering, target prompt information corresponding to the standardized query data according to the standardized query data and the target database list; and
a second processing module, configured to perform, based on a large language model, text generation processing on the standardized query data and the target prompt information corresponding to the standardized query data to obtain query result data corresponding to the standardized query data.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.
10. A readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 7.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410345211.2A CN118227736A (en) | 2024-03-22 | 2024-03-22 | Text processing method, text processing device, electronic equipment and readable storage medium |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410345211.2A CN118227736A (en) | 2024-03-22 | 2024-03-22 | Text processing method, text processing device, electronic equipment and readable storage medium |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN118227736A (en) | 2024-06-21 |
Family
ID=91510635
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410345211.2A Pending CN118227736A (en) | 2024-03-22 | 2024-03-22 | Text processing method, text processing device, electronic equipment and readable storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118227736A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN120087366A (en) * | 2025-04-30 | 2025-06-03 | 交通运输部水运科学研究所 | Port text data processing method, device, equipment and medium |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240028651A1 (en) | | System and method for processing documents |
| CN111797214A (en) | | Question screening method, device, computer equipment and medium based on FAQ database |
| US11874798B2 (en) | | Smart dataset collection system |
| US20180330231A1 (en) | | Entity model establishment |
| CN119127913A (en) | | SQL statement generation method, system and device based on large model |
| CN119357408A (en) | | A method for constructing electric power knowledge graph based on large language model |
| CN119046432A (en) | | Data generation method and device based on artificial intelligence, computer equipment and medium |
| CN119599130A (en) | | Self-adaptive sensitive information intelligent identification method, device, equipment, storage medium and product |
| CN119248945A (en) | | Data retrieval method, device, computer equipment and storage medium |
| CN116701593A (en) | | Chinese question answering model training method and related equipment based on GraphQL |
| CN111126073B (en) | | Semantic retrieval method and device |
| CN118227736A (en) | | Text processing method, text processing device, electronic equipment and readable storage medium |
| CN110737824A (en) | | Content query method and device |
| CN119917623A (en) | | Information retrieval method, device, electronic device, storage medium and computer program product |
| CN119646016A (en) | | Data query method, device, electronic device, medium and program product |
| Hsiao et al. | | Creating hardware component knowledge bases with training data generation and multi-task learning |
| CN119128134A (en) | | Data visualization method, system, device and medium based on retrieval-enhanced generation |
| CN119203952A (en) | | A table data processing method, device, computer equipment and storage medium |
| CN118364128A (en) | | Image data identification method, device, electronic equipment and readable storage medium |
| CN118093809A (en) | | Document searching method and device and electronic equipment |
| CN112784046B (en) | | Text clustering method, device, equipment and storage medium |
| Musabeyezu | | Comparative study of annotation tools and techniques |
| CN111368036B (en) | | Method and device for searching information |
| JP2019061522A (en) | | Document recommendation system, document recommendation method and document recommendation program |
| Rybak et al. | | Machine learning-enhanced text mining as a support tool for research on climate change: theoretical and technical considerations |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |