Disclosure of Invention
In view of the current state of research and the problems described above, the invention provides a multi-table association question-answering method for large models based on metadata features and a chain of thought, so as to improve the performance of table screening in multi-table scenarios involving large models.
The invention provides a multi-table association large language model question-answering method based on metadata features and chains of thought, which comprises the following steps:
data table preprocessing: performing data normalization on a plurality of data tables and establishing associations among them;
metadata feature extraction: extracting metadata features of the data tables to obtain a metadata document, the metadata document comprising field information and key-connection relations;
chain-of-thought construction: constructing a question-answering chain of thought based on the association relations among the field information of the data tables;
prompt template construction: constructing a choose-prompt (selection prompt) and a generate-prompt (generation prompt) based on the question-answering chain of thought, wherein the choose-prompt is used to screen out the relevant tables according to the user question and the metadata document;
question-answer prediction: receiving the user question, generating a prompt with the prompt template, inputting the prompt into a pre-trained large language model, and generating the corresponding SQL statement.
Preferably, the data normalization of the plurality of data tables includes one or more of the following steps:
data cleaning: normalizing one or more of missing values, duplicate values, and abnormal values;
data integration: merging related data from different data tables;
data conversion: unifying the data formats of the different data tables, performing standardization, and extracting features from the data tables.
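The three normalization steps above can be sketched in plain Python; the employee records, field names, and fill-with-zero policy below are hypothetical illustrations, not part of the invention:

```python
# Hypothetical raw records with a duplicate row and a missing salary value.
records = [
    {"emp_id": "001", "name": "Alice", "salary": "9500"},
    {"emp_id": "001", "name": "Alice", "salary": "9500"},   # duplicate row
    {"emp_id": "002", "name": "Bob",   "salary": None},     # missing value
]

# 1. Data cleaning: drop duplicates, fill missing salaries with a default of "0".
seen, cleaned = set(), []
for row in records:
    if row["emp_id"] in seen:
        continue
    seen.add(row["emp_id"])
    cleaned.append(dict(row, salary=row["salary"] if row["salary"] is not None else "0"))

# 2. Data integration: merge a related department table on the shared emp_id key.
departments = {"001": "R&D", "002": "Sales"}
integrated = [dict(r, dept=departments.get(r["emp_id"], "unknown")) for r in cleaned]

# 3. Data conversion: unify types (salary string -> int) for later feature extraction.
converted = [dict(r, salary=int(r["salary"])) for r in integrated]
```

In practice each step would be driven by the table's metadata rather than hard-coded keys; the sketch only shows the order and effect of the operations.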
Preferably, the field information comprises the name of the data table, the entity field in the data table and the attribute field corresponding to the entity field.
Preferably, extracting the metadata features of the data table further includes the following steps:
annotating the extracted field information with its meaning;
identifying and acquiring the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and acquiring the primary key-foreign key relations between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relations to obtain the metadata document.
Preferably, constructing the chain of thought includes the following steps:
constructing an entity-field association reasoning process model among the data tables;
constructing an association reasoning process model for the attribute fields corresponding to the entity fields among the data tables;
constructing business rules and logic: consolidating and setting the rules and logical relations of the business domain, and integrating them into the entity-field and attribute-field association reasoning process models.
Preferably, the generate-prompt comprises prompt fields for an input slot and an output slot, and is used to perform the following step:
filling the screened table information, the converted language format, and the user question into the input slot to guide the large language model to generate the corresponding SQL statement.
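A minimal sketch of such a slot-filling generate-prompt, assuming a hypothetical template text and field names (the invention does not prescribe the exact wording):

```python
# Hypothetical generate-prompt template: one input slot per piece of key
# information, with the model's SQL answer serving as the output slot.
GENERATE_PROMPT = (
    "Given the tables:\n{table_info}\n"
    "Answer the question as a single SQL statement.\n"
    "Question: {question}\nSQL:"
)

def build_generate_prompt(table_info, question):
    """Fill the screened table information and the user question into the slots."""
    return GENERATE_PROMPT.format(table_info=table_info, question=question)

prompt = build_generate_prompt(
    "employee(emp_id, name); salary(emp_id, amount)",
    "Whose salary is 10000?",
)
```

The filled prompt string is what gets sent to the pre-trained large language model.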
Compared with the prior art, the multi-table association large language model question-answering method based on metadata features and chains of thought has the following beneficial effects:
The invention parses and understands the NLP task, determining the tables involved in the question and the relations among them. Understanding each table involved and the relevance between tables is critical to solving multi-table queries. The tables to be joined and the join conditions are determined from the relation between the question and the tables, and the join operation is performed according to primary key-foreign key relations or other shared attributes. A choose-prompt and a generate-prompt are constructed according to the execution process of the database-table question-answering chain of thought: the choose-prompt guides the large model to screen the relevant tables from the database according to the user question, while the generate-prompt fills the screened table information, the converted language format, the question, and other key information into the slots so that the model generates the corresponding SQL statement, which is executed and its result returned.
When faced with multi-table queries, the method fully mines the associations among the table data, improves the accuracy of table screening, guides the model to generate the corresponding SQL statements, and improves the accuracy of the answers.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
The following problems must be considered for multi-table query scenarios in the Text2SQL task:
Table association: in a multi-table query task, the association relations between different database tables, including inner joins, outer joins, and the like, must be identified and understood to ensure that the information of multiple tables is correctly integrated.
Join conditions: the model must accurately identify the join conditions in the natural-language question and convert them into the join clauses of the SQL query, covering the join conditions in the ON clause, the filter conditions in the WHERE clause, and so on.
Query complexity: the SQL statements involved in multi-table queries are typically more complex, containing more clauses, operations, and nested structures; the model must process these complex structures efficiently to generate an accurate SQL query.
Semantic reasoning: the multi-table query task requires a model with strong semantic-reasoning capability that can resolve questions involving the relations and attributes of multiple tables, including identifying indirect associations and inferring relations among attributes.
Data sparsity: because multi-table queries involve multiple database tables, data sparsity may arise, i.e., some associations in the input question occur rarely in the training data, making it difficult for the model to learn enough to accurately generate multi-table query statements.
As shown in FIG. 1, the overall multi-table association large model question-answering method based on metadata features and chains of thought provided by the embodiment of the invention first preprocesses local data and then extracts metadata features from the processed table data to generate a metadata document. A database question-answering chain of thought is constructed to guide the large model through the reasoning steps of table screening and SQL generation according to the user question. During table screening, the generation of the SQL statement is optimized by introducing the metadata document. After the database executes the SQL statement generated by the large model, the result is returned and the metadata document is updated according to the result. The method specifically comprises the following steps:
data table preprocessing: performing data normalization on a plurality of data tables and establishing associations among them;
metadata feature extraction: extracting metadata features of the data tables to obtain a metadata document, the metadata document comprising field information and key-connection relations;
chain-of-thought construction: constructing a question-answering chain of thought based on the association relations among the field information of the data tables;
prompt template construction: constructing a choose-prompt and a generate-prompt based on the question-answering chain of thought, wherein the choose-prompt is used to screen the relevant tables according to the user question in combination with the metadata document;
question-answer prediction: in subsequent operation, receiving the user question, obtaining the table structure information with the choose-prompt, injecting it into the generate-prompt to produce the final prompt, inputting the prompt into the pre-trained large language model, and invoking the database chain to answer over the candidate tables, generating the corresponding SQL statement.
In this embodiment, ChatGLM-6B-32K is selected as the general-purpose large language model and text2vec-base-chinese as the embedding model. ChatGLM has advantages such as low deployment cost and fluent output, and compared with its base model, ChatGLM-6B-32K supports in-context learning and multi-turn dialogue over longer sequences. text2vec-base-chinese, trained on a large amount of Chinese text, has strong text-matching performance for multiple languages, especially Chinese.
In one embodiment, as shown in FIG. 1, table data has many forms of presentation and storage, such as csv spreadsheets, html paged tables, and database table-management systems. In practical engineering application and management, local raw table data suffers from missing data, inconsistent data types, inconsistent units of measurement, and the like. The table data is therefore preprocessed in advance and the tables are connected to enable multi-table question answering. Performing data normalization on the plurality of data tables includes one or more of the following steps:
data cleaning: normalizing one or more of missing values, duplicate values, and abnormal values;
data integration: merging related data from different data tables, including consolidating metadata information with the same name into one table, or merging tables with identical or overlapping content into one table, reducing redundancy and the number of tables;
data conversion: unifying the data formats of the different data tables, performing standardization, and extracting features from the data tables.
In a specific execution, the preprocessing steps are as follows:
(1) Data cleaning:
Missing-value handling: detecting and handling missing values in the data table, choosing methods such as filling the missing values or deleting the rows or columns containing them.
Outlier handling: identifying and handling outliers in the data table, detecting them by statistical methods or visualization, and handling them according to the specific situation.
Duplicate handling: finding and removing duplicate records in the data table to ensure the uniqueness of the data.
(2) Data integration:
Table merging: merging related data from different data tables for subsequent analysis and modeling.
(3) Data conversion:
Data-format conversion: converting the data types in the data table to ensure the consistency and comparability of the data.
Feature extraction: extracting the required features from the raw data for the subsequent metadata feature extraction, analysis, and modeling of the data tables.
Data normalization: standardizing or normalizing the data to eliminate scale differences among different data.
(4) Data association:
Establishing links: establishing connections between the data tables according to the association relations between the data (such as primary key-foreign key relations), so as to integrate and associate the data and facilitate multi-table queries. Specifically:
Associations are established according to shared fields or key values in the data tables, including primary keys, foreign keys, string matching, time windows, or geographic locations, and the association operation is performed with SQL statements, for example using the JOIN keyword to associate the data of two tables. The association can also be performed in a programming language such as Python, where common libraries such as Pandas provide rich functionality for processing and merging data for subsequent multi-table queries or analysis.
Ensuring data consistency: ensuring the consistency and integrity of the data during association, and avoiding redundant or erroneous associations.
Through these data preprocessing steps, the data of multiple data tables can be effectively cleaned, integrated, and converted, laying a solid foundation for subsequent data analysis, mining, and modeling, and improving the accuracy and efficiency of data processing.
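The JOIN-based association described above can be illustrated with SQLite from the Python standard library; the table and column names here are hypothetical stand-ins for the primary key-foreign key pair:

```python
import sqlite3

# Two hypothetical tables linked by a primary key-foreign key pair (emp_id),
# associated with a JOIN as in the data-association step.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salary (emp_id INTEGER REFERENCES employee(emp_id), amount INTEGER);
    INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO salary VALUES (1, 10000), (2, 8000);
""")

# JOIN on the shared key field to integrate information from both tables.
rows = con.execute(
    "SELECT e.name, s.amount FROM employee e "
    "JOIN salary s ON e.emp_id = s.emp_id ORDER BY e.emp_id"
).fetchall()
```

The same association could equally be done with `pandas.merge` on the shared column, as the text notes.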
In one embodiment, the field information includes a name of the data table, an entity field in the data table, and an attribute field corresponding to the entity field.
In one embodiment, metadata feature extraction is performed after the local table data is preprocessed. Metadata feature extraction is a key step in data processing and analysis: it helps to better understand the structure, meaning, and association relations of the data tables and provides the basis for question answering. As shown in fig. 2, metadata feature extraction is performed on a table named employee information with 7 fields. The NLP task is parsed and understood, and the tables involved in the question and the relations among them are determined; this requires identifying the keywords and entities in the question and the join methods that may be involved. Each related table is understood, including its table structure, column names, and primary key-foreign key relations. The tables to be joined and the join conditions are determined from the relation between the question and the tables, and the join operation is performed according to primary key-foreign key relations or other shared attributes. Extracting the metadata features of the data table comprises the following steps:
annotating the extracted field information with its meaning;
identifying and acquiring the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and acquiring the primary key-foreign key relations between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relations to obtain the metadata document.
In a specific implementation, the structural information of the table is extracted first: the names of the data tables are obtained to learn the entities or topics they represent; the number of fields in each table is counted to understand the dimensionality of the data; the name of each field is obtained, from which its meaning can be inferred; and the primary keys and foreign keys in the data tables are identified so that association relations between tables can be established.
The meanings of the extracted fields are then annotated. The data dictionary or metadata document is consulted to obtain a detailed description and the meaning of each field. For highly specialized fields, domain experts are consulted to confirm the field's meaning and business logic.
Meanwhile, the data types of the extracted data are standardized and unified. The data type of each field (integer, string, date, etc.) is acquired to ensure the accuracy and consistency of the data; knowing each field's value range helps verify the integrity and validity of the data.
The association relations between the data are then extracted. The primary key-foreign key relations between data tables are identified, which helps to establish associations between them, and a relation diagram is drawn to show the association paths and dependencies among the tables.
Finally, all extracted metadata features are integrated: the table structure information, field meanings, data types, and association relations are consolidated into the metadata document for subsequent use by the question-answering system.
Through this metadata feature extraction process, the structure and meaning of the data tables can be understood in depth and the association relations between the data clarified, providing the metadata support needed to build a question-answering system that answers user questions more accurately.
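One possible way to assemble such a metadata document is to read field names, types, primary keys, and foreign keys directly from the database catalog. The sketch below uses SQLite's PRAGMA interface and hypothetical tables; a real deployment would also attach the annotated field meanings:

```python
import sqlite3

# Hypothetical schema: employee has a foreign key into department.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER,
                           FOREIGN KEY (dept_id) REFERENCES department(dept_id));
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
""")

def metadata_document(con):
    """Integrate table names, field info, and key relations into one document."""
    doc = {}
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        fields = [{"name": c[1], "type": c[2], "pk": bool(c[5])}
                  for c in con.execute(f"PRAGMA table_info({t})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [{"from": fk[3], "to_table": fk[2], "to": fk[4]}
               for fk in con.execute(f"PRAGMA foreign_key_list({t})")]
        doc[t] = {"fields": fields, "foreign_keys": fks}
    return doc

doc = metadata_document(con)
```

The resulting dictionary plays the role of the metadata document: field information plus the key-connection relations used later for table screening.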
In one embodiment, the role of constructing the chain of thought is to build logical relations among knowledge elements based on the content of the data tables and business knowledge, helping the question-answering system understand questions better and provide accurate answers. As shown in fig. 3, constructing the chain of thought includes the following steps:
constructing an entity-field association reasoning process model among the data tables;
constructing an association reasoning process model for the attribute fields corresponding to the entity fields among the data tables;
constructing business rules and logic: consolidating and setting the rules and logical relations of the business domain, and integrating them into the entity-field and attribute-field association reasoning process models.
In a specific execution, the method comprises the following steps:
(1) Associations between entities: in any data table, there may be associations between different entity fields. Taking the table in fig. 2 as an example, the employee ID in the employee information table can be associated with the employee ID in the payroll information table, establishing an association between employees and payroll information. Likewise, the department ID in the department information table can be associated with the department ID in the employee information table to obtain detailed information about an employee's department.
(2) Dependencies between attributes: different attributes in a data table influence and constrain one another. For example, an employee's position affects their payroll level, with different positions corresponding to different levels; an employee's hire date can be used to compute their seniority, which affects promotion opportunities and pay. Building the attribute-dependency model helps the system better understand the relations between attributes and thus answer user questions accurately.
(3) Business rules and logic: when constructing the chain of thought, the rules and logical relations of the specific business domain must be consolidated and integrated into the entity associations and attribute dependencies. This includes understanding business processes, rules, and constraints, ensuring that the system can accurately understand and apply them to answer questions correctly. For example, employees of different genders may have specific needs or restrictions for certain positions, so gender must be considered in employee management.
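A toy model of the entity associations, attribute dependencies, and business rules described above; all table names, key fields, and the position-to-pay-grade rule are hypothetical illustrations:

```python
# Entity associations: edges between tables, keyed by the shared field.
entity_links = {
    ("employee_info", "salary_info"): "emp_id",
    ("employee_info", "department_info"): "dept_id",
}

def join_path(src, dst):
    """Return the shared key field linking two tables, or None if unlinked."""
    return entity_links.get((src, dst)) or entity_links.get((dst, src))

# Attribute dependency expressed as a business rule: position determines pay grade.
pay_grade = {"engineer": 3, "manager": 5}
```

In the invention these relations come from the metadata document and domain experts; here they are hard-coded only to show the shape of the reasoning model.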
In one embodiment, the generate-prompt includes prompt fields for an input slot and an output slot.
The principle of prompt learning is described below:
Prompt learning is a method of directing a language model to perform a particular task by designing natural-language prompts or instructions. Its goal is to convert downstream tasks into the form of the pre-training task through a prompt template. Prompt engineering is one of the core practices of prompt learning; it emphasizes how to construct or design appropriate prompts to guide the model to perform specific tasks, bringing the downstream task closer to the pre-trained model. These prompts may be natural-language questions, declarative instructions, example input-output pairs, or other forms of guidance.
Taking a sentiment classification task as an example, as shown in fig. 5, different prompts may be designed for the same natural-language task, and different prompts may have different effects; the input slot may be one or more. With a well-designed prompt, the model can be effectively guided to perform a specific task, improving its performance and generalization on downstream tasks. In prompt engineering, designing appropriate prompts is critical, as they affect both learning and inference.
Prompt engineering is thus an important method for guiding the model to learn and perform tasks: carefully designed prompts guide learning on domain-specific tasks and improve the model's performance and adaptability. Effective prompt design lets the model better understand task requirements, reduces errors, and improves the overall learning effect.
In this embodiment, as shown in fig. 6, the choose-prompt and the generate-prompt are constructed via the database-table question-answering chain of thought. The English portion of the figure is the machine-executed process, illustrating the execution logic of the chain of thought. The choose-prompt guides the large model to screen the relevant tables from the database according to the question posed by the user. For example, in fig. 6, for the input question "whose wage is 10000", the names of employees earning 10000 cannot be obtained from a single table. When the chain of thought is built, reading the metadata features reveals that the employee ID in the employee information table can be associated with the employee ID in the salary information table, establishing the association between employee names and salary information; therefore the choose-prompt stage selects the employee information table and the salary information table. The generate-prompt then guides the model to generate the corresponding SQL statement by filling the screened table information, the converted language format, the question, and other key information into the slots. After the SQL statement is executed, the corresponding result is returned.
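The two-stage flow of this example can be sketched as follows. The keyword-matching function below is only an illustrative stand-in for the LLM's choose-prompt screening step, and all table and field names are hypothetical:

```python
# Hypothetical metadata document: table name -> field names.
METADATA = {
    "employee_info": ["emp_id", "name", "gender"],
    "salary_info": ["emp_id", "wage"],
    "department_info": ["dept_id", "dept_name"],
}

def choose_tables(question, metadata):
    """Stand-in for choose-prompt screening: keep tables whose fields the question needs."""
    wanted = {"name": "employee_info", "wage": "salary_info"}
    return sorted({t for kw, t in wanted.items() if kw in question})

question = "whose wage is 10000? give the name"
tables = choose_tables(question, METADATA)

# generate-prompt stage: with the two screened tables and the emp_id link from
# the metadata, the large model would be guided toward SQL of this shape:
sql = ("SELECT e.name FROM employee_info e "
       "JOIN salary_info s ON e.emp_id = s.emp_id WHERE s.wage = 10000")
```

The real system delegates both the screening and the SQL generation to the large language model; the sketch only fixes the data flow between the two prompts.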
This embodiment of generative large language model construction combines deep-learning algorithms from the artificial-intelligence field, using the capabilities of automatic text understanding, text generation, and context awareness to automatically extract the key function-point descriptions of each paragraph in a document by analyzing the grammar, semantics, and context of the text, and to output them item by item. The fine-tuned generative large language model outputs normalized function-point items in a fixed format and, based on a prompt-word learning strategy, refines the function points without changing the output format, thereby realizing end-to-end generation of the content and number of function-point items of different categories from the complete document.
In one embodiment, the invention trains a text classification model for identifying function-point categories on the ground-truth portion of the normalized function-point items in a pre-annotated dataset, and obtains the final function-point item estimate by fusing it with the output of the generative large language model, which can be used directly for subsequent cost calculation. The function-point identification technique, based on the attention model of a bidirectional-encoding-representation Transformer (BERT), further classifies the normalized items output by the model and fuses the classification result with the normalized output, achieving more accurate function-point identification. The training and inference process of the text classification model is shown in figure 2.
The input of the text classification training process is the function-point items in the dataset, and the ground-truth output is the category corresponding to each item. The input of the test flow is the function-point items extracted by the generative large language model.
For the function-point items input to the model, this embodiment first splits the text data into words or sub-words (the tokenization step), then extracts an embedding for each token, extracts token features through multiple stacked Transformer encoder modules, and finally performs text classification in the function-point classifier, computing the cross-entropy loss.
The function-point classifier comprises a feed-forward neural network and a softmax layer, which serve as a task-specific output layer after the encoder. When fine-tuning on text, only this output layer needs to be added, without retraining the whole model, so the invention only updates the classifier's parameters using the function-point classification loss.
The above embodiment of BERT model construction for text classification achieves accurate classification of function-point items by fine-tuning BERT, i.e., a text classification technique based on the bidirectional-encoding-representation Transformer attention network. The model uses large-scale pre-training to learn general language representations; compared with traditional machine learning it has richer semantic feature expression, can adapt to specific classification tasks by fine-tuning, converges more easily, and offers better generalization and discrimination.
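A minimal stand-in for such a classifier head (one feed-forward layer followed by softmax, with cross-entropy loss), using toy weights in place of the trained parameters and omitting the BERT encoder itself:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_head(features, weights, bias):
    """One linear (feed-forward) layer over the encoder feature, then softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def cross_entropy(probs, true_class):
    """Loss used to update only the classifier parameters during fine-tuning."""
    return -math.log(probs[true_class])

# Toy 2-dimensional encoder feature, 3 hypothetical function-point categories.
probs = classifier_head([1.0, -0.5],
                        weights=[[0.2, 0.1], [0.9, -0.3], [-0.4, 0.8]],
                        bias=[0.0, 0.1, 0.0])
loss = cross_entropy(probs, true_class=1)
```

In the actual embodiment the feature vector comes from the stacked Transformer encoder and the weights are learned by backpropagating this loss into the classifier only.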
In one embodiment, at the test stage, since the input function-point items come from the output of the generative large language model, that output already contains a category judgment. When the two judgments agree, the shared category is output; when they disagree, the classification confidence of the text classification model is consulted, and the invention replaces the first output with the higher-confidence function-point category, achieving more accurate inference.
For the overall end-to-end realization of software function-point extraction and identification, the two algorithms are integrated into a unified pipeline, so that each text paragraph is cut from the complete document, the function-point descriptions are located in the relevant paragraphs, the content and number of function-point items are extracted from the descriptions, and finally the function points are classified. The general technical scheme is shown in fig. 3.
The multi-table association large language model question-answering method based on metadata features and chains of thought provided by the invention has been described in detail above. Specific examples have been used to explain the principles and implementation of the invention; the description of the embodiments is intended only to help understand the method and its core ideas. Since those of ordinary skill in the art may vary the specific implementation and scope of application according to the ideas of the invention, the content of this description should not be construed as limiting the invention.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.