
CN118245591B - Multi-table association large language model question-answering method based on metadata characteristics and thinking chain - Google Patents

Multi-table association large language model question-answering method based on metadata characteristics and thinking chain

Info

Publication number
CN118245591B
CN118245591B CN202410687924.7A
Authority
CN
China
Prior art keywords
data
prompt
question
association
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410687924.7A
Other languages
Chinese (zh)
Other versions
CN118245591A (en)
Inventor
李孟书
唐海超
秦宇沨
胡勋
罗琪彬
宋浩楠
乔思龙
曲薇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CETC 15 Research Institute
Original Assignee
CETC 15 Research Institute
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by CETC 15 Research Institute filed Critical CETC 15 Research Institute
Priority to CN202410687924.7A priority Critical patent/CN118245591B/en
Publication of CN118245591A publication Critical patent/CN118245591A/en
Application granted granted Critical
Publication of CN118245591B publication Critical patent/CN118245591B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract


The present invention discloses a multi-table association large language model question-answering method based on metadata features and a thinking chain, which performs data normalization on multiple data tables and establishes associations between the data tables; extracts the metadata features of the data tables to obtain a metadata document, including field information and key connection relationships; constructs a question-answering thinking chain based on the association relationships between data table field information; and constructs a choose-prompt selection prompt and a generate-prompt generation prompt based on the question-answering thinking chain, guiding the large language model to screen relevant tables from the database according to the user question, generate the corresponding SQL statement, and return the result after execution. When facing a multi-table query situation, the present invention can fully mine the associations between table data, improve the accuracy of table screening, and thereby guide the model to generate the corresponding SQL statements and improve the accuracy of question answering.

Description

Multi-table association large language model question-answering method based on metadata characteristics and thinking chain
Technical Field
The invention relates to the technical field of natural language processing, in particular to a multi-table association large language model question-answering method based on metadata characteristics and a thinking chain.
Background
In the current landscape of multi-table query models, conventional and advanced models rely primarily on deep learning approaches, such as the Graph Neural Network (GNN) model, the IRNet model, and so on. While these approaches perform well on certain problems, they ignore to some extent the explicit associations between queries and database tables, and show deficiencies in condition value extraction and database schema encoding. In addition, these strategies typically focus on only a single task or a single relationship type, and do not work well in multi-task and multi-relationship-type scenarios. Thus, conventional approaches still suffer from limitations and challenges when faced with the multi-table query situation.
Text2SQL is a Natural Language Processing (NLP) task that aims to translate natural language questions into Structured Query Language (SQL) queries. The goal of this task is to allow the computer to understand the questions posed by the user and generate corresponding SQL queries from the questions to retrieve the required information from the database.
In the Text2SQL task, the input is a natural language question and the output is an SQL query statement that can be directly executed in the database to obtain the answer. The technology has wide application in the fields of information retrieval, database query, dialogue systems and the like.
In a complex database system, a multi-table query scenario refers to a situation in which multiple database tables must be queried jointly in one query statement to obtain the required information. In processing multi-table queries, whether they involve multiple database connections or are limited to a single database environment, it is essential to reduce the problem to a join problem between multiple database tables. Researchers can then study multi-table queries under a unified framework, which improves the universality and extensibility of query generation algorithms: solving the Text2SQL task then amounts to establishing the correct joins between the tables and obtaining the required results.
Common Text2SQL methods include:
Rule-based methods, which use manually designed rules and templates to convert natural language questions into SQL query statements.
Statistics-based methods, which utilize statistical models and machine learning algorithms to learn the mapping between natural language questions and SQL queries.
Neural network-based methods, which use deep learning models (such as recurrent neural networks, attention mechanisms, and Transformers) to learn semantic mappings between natural language and SQL queries.
However, existing methods still face difficulties and challenges when applied to multi-table queries in Text2SQL tasks.
Therefore, how to provide a multi-table association large language model question-answering method based on metadata characteristics and a thinking chain to improve the performance of table screening in multi-table scenarios involving large models is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the current state of research and the existing problems, the invention provides a multi-table association large model question-answering method based on metadata characteristics and a thinking chain, so as to improve the performance of table screening in multi-table scenarios involving large models.
The invention provides a multi-table association large language model question-answering method based on metadata characteristics and thinking chains, which comprises the following steps:
Data table preprocessing: performing data normalization on multiple data tables and establishing associations between the data tables;
Building a metadata document: extracting metadata features of the data tables to obtain a metadata document, which includes field information and key connection relationships;
Constructing a thinking chain: constructing a question-answer thinking chain based on the association relationships between the data table field information;
Constructing a prompt template: constructing a choose-prompt selection prompt and a generate-prompt generation prompt based on the question-answer thinking chain, wherein
the choose-prompt selection prompt is used to screen out relevant tables according to the user question in combination with the metadata document, and the generate-prompt generation prompt is used to receive the metadata document content of the selected tables and, combined with the user question, generate the final prompt;
Question-answer prediction: receiving the user question, generating a prompt with the prompt template, inputting the prompt into a pre-trained large language model, and generating the corresponding SQL statement.
Preferably, the data normalization processing of the plurality of data tables includes one or more of the following steps:
Data cleaning: normalizing one or more of missing values, duplicate values, and outliers;
Data integration: merging related data from different data tables;
Data transformation: unifying the data formats of different data tables, performing standardization, and extracting features from the data tables.
Preferably, the field information comprises the name of the data table, the entity field in the data table and the attribute field corresponding to the entity field.
Preferably, extracting the metadata features of the data tables further includes the following steps:
annotating the meaning of the extracted field information;
identifying and obtaining the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and obtaining the primary key-foreign key relationships between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relationships to obtain the metadata document.
Preferably, the construction of the thinking chain comprises the following steps:
Constructing an entity field association reasoning process model among the data tables;
Constructing an attribute field association reasoning process model corresponding to entity fields among the data tables;
Constructing business rules and logic: integrating the rules and logical relationships of the business domain and incorporating them into the entity field association reasoning process model and the attribute field association reasoning process model.
Preferably, the generate-prompt generation prompt comprises a prompt field, an input card slot, and an output card slot, and is used to perform the following step:
filling the screened table information, the converted language format, and the user question into the input card slot to guide the large language model to generate the corresponding SQL statements.
Compared with the prior art, the multi-table associated large language model question-answering method based on the metadata characteristics and the thinking chain has the following beneficial effects:
The invention parses and understands the NLP task, and determines the tables involved in the question and the relationships among them. Understanding each table involved and the relevance between tables is critical to solving multi-table queries. The tables to be joined and the join conditions are determined according to the relationship between the question and the tables, and the table join operation is performed according to the primary key-foreign key relationships or other common attributes. A choose-prompt selection prompt and a generate-prompt generation prompt are constructed according to the implementation process of the database table question-answer thinking chain: the choose-prompt selection prompt guides the large model to screen relevant tables from the database according to the user question, and the generate-prompt generation prompt aims at generating the corresponding SQL statement by filling the screened table information, the converted language format, the question, and other key information into the card slots, with the result returned after execution.
When facing a multi-table query situation, the method can fully mine the associations between table data, improve the accuracy of table screening, and thereby guide the model to generate the corresponding SQL statements and improve the accuracy of question answering.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is apparent that the drawings in the following description are only embodiments of the present invention, and that those skilled in the art may obtain other drawings from the provided drawings without inventive effort.
FIG. 1 is a flowchart of a multi-table associative large language model question-answering method based on metadata features and thought chains provided by an embodiment of the present invention;
FIG. 2 is a flowchart of table data preprocessing provided in an embodiment of the present invention;
FIG. 3 is a flow chart of metadata feature extraction provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of a database form question-answer thinking chain provided by an embodiment of the invention;
fig. 5 is a schematic diagram of prompt learning provided by an embodiment of the present invention;
Fig. 6 is a flowchart for constructing a prompt template according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The following problems need to be considered when facing a multi-table query scenario in the Text2SQL task:
Table association: in a multi-table query task, the association relationships between different database tables need to be identified and understood, including inner joins, outer joins, and so on, to ensure that the information from multiple tables is correctly integrated.
Join conditions: the model must be able to accurately identify the join conditions in the natural language question and convert them into the join clauses of the SQL query statement, covering the join conditions in the ON clause and the filtering conditions in the WHERE clause.
Query statement complexity: the SQL query statements involved in multi-table queries are typically more complex, containing more clauses, operations, and nested structures; the model needs to be able to efficiently process these complex structures to generate an accurate SQL query statement.
Semantic reasoning: the multi-table query task requires a model with strong semantic reasoning capability that can solve questions involving the relationships and attributes among multiple tables, including identifying indirect associations and inferring relationships among attributes.
Data sparsity: because multi-table queries involve multiple database tables, there may be a data sparsity problem, i.e., some associations in the input question occur infrequently in the training data. This makes it difficult for the model to learn enough information from the training data to accurately generate multi-table query statements.
As shown in FIG. 1, the overall multi-table association large-model question-answering method based on metadata features and thinking chains provided by the embodiment of the invention first preprocesses the local data and then extracts the metadata features of the processed table data to generate a metadata document. A database question-answer thinking chain is constructed to guide the large model through the reasoning steps of table screening and SQL statement generation according to the user question. In the table screening process, the generation of the SQL statement is optimized by introducing the metadata document. After the database executes the SQL statement generated by the large model, the question result is returned and the metadata document is updated according to the result. The method specifically comprises the following steps:
Data table preprocessing: performing data normalization on multiple data tables and establishing associations between the data tables;
Building a metadata document: extracting metadata features of the data tables to obtain a metadata document, which includes field information and key connection relationships;
Constructing a thinking chain: constructing a question-answer thinking chain based on the association relationships between the data table field information;
Constructing a prompt template: constructing a choose-prompt selection prompt and a generate-prompt generation prompt based on the question-answer thinking chain, wherein the choose-prompt selection prompt is used to screen out relevant tables according to the user question in combination with the metadata document;
Question-answer prediction: in subsequent operation, receiving the user question, using the choose-prompt selection prompt to obtain the table structure information, injecting it into the generate-prompt generation prompt to produce the final prompt, inputting the prompt into the pre-trained large language model, invoking the database chain to perform database question answering on the candidate tables, and generating the corresponding SQL statement.
In this embodiment, ChatGLM-6B-32K is selected as the general-purpose large language model, and text2vec-base-chinese is selected as the embedding model. ChatGLM has the advantages of low deployment cost and fluent output; compared with its base model, ChatGLM-6B-32K has longer-sequence in-context learning and multi-turn dialogue capability. text2vec-base-chinese, trained on a large amount of Chinese text, has better text-matching performance for multiple languages, especially Chinese.
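By way of illustration, the following Python sketch outlines the question-answer flow with the choose-prompt and generate-prompt stages; the prompt wording, the metadata document layout, the llm callable, and the use of SQLite are assumptions for demonstration only and do not represent the exact implementation of this embodiment.

```python
# A minimal sketch of the question-answer flow, assuming an `llm(text) -> str` callable
# (e.g. a ChatGLM-6B-32K endpoint) and a metadata document shaped like
# {"tables": {name: schema, ...}}; prompt wording and helper names are illustrative.
import json
import sqlite3

def answer(question: str, metadata: dict, llm, db_path: str):
    # Step 1: choose-prompt -- screen the relevant tables from the metadata document.
    choose_prompt = (
        f"Metadata document:\n{json.dumps(metadata, ensure_ascii=False)}\n"
        f"User question: {question}\n"
        "Return a JSON array with the names of the tables needed to answer the question."
    )
    tables = json.loads(llm(choose_prompt))        # e.g. ["employee_info", "salary_info"]

    # Step 2: generate-prompt -- fill the card slots with the screened table information.
    schemas = {t: metadata["tables"][t] for t in tables}
    generate_prompt = (
        f"Relevant table schemas:\n{json.dumps(schemas, ensure_ascii=False)}\n"
        f"User question: {question}\n"
        "Write a single SQL statement that answers the question."
    )
    sql = llm(generate_prompt).strip()

    # Step 3: execute the generated SQL statement and return the query result.
    with sqlite3.connect(db_path) as conn:
        return conn.execute(sql).fetchall()
```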
In one embodiment, as shown in FIG. 1, table data has a variety of presentation and storage forms, such as csv-format spreadsheets, html-format web page tables, and database-format table management systems. From the perspective of practical engineering application and management, the local original table data has problems such as missing data, inconsistent data types, and inconsistent units of measurement. The table data is therefore preprocessed in advance and the tables are connected so as to realize multi-table question answering. Performing data normalization on the multiple data tables includes one or more combinations of the following steps:
Data cleaning: normalizing one or more of missing values, duplicate values, and outliers;
Data integration: merging related data from different data tables, including integrating metadata information with the same name into one table, or integrating tables with identical or overlapping content into one table, reducing redundancy and the number of tables;
Data transformation: unifying the data formats of different data tables, performing standardization, and extracting features from the data tables.
In concrete execution, the preprocessing steps are as follows:
(1) Data cleaning:
Missing value processing: detecting and handling missing values in the data table, choosing methods such as filling in the missing values or deleting the rows or columns that contain them.
Outlier processing: identifying and handling outliers in the data table, detecting them by statistical methods or visualization, and processing them according to the specific situation.
Duplicate value processing: finding and removing duplicate records in the data table to ensure data uniqueness.
(2) Data integration:
Table merging: merging related data from different data tables for subsequent analysis and modeling.
(3) Data transformation:
Data format conversion: converting the data types in the data table to ensure data consistency and comparability.
Feature extraction: extracting the required features from the original data to serve the subsequent metadata feature extraction, analysis, and modeling of the data tables.
Data normalization: normalizing or standardizing the data to eliminate dimensional differences among different data.
(4) Data association:
Establishing links: establishing connections between the data tables according to the association relationships between the data (such as primary key-foreign key relationships) to realize data integration and association and to facilitate multi-table queries. Specifically:
Associations between data are established according to common fields or key values in the data tables, including primary keys, foreign keys, string matching, time windows, or geographic locations, and the association operation is performed with SQL statements, for example using the JOIN keyword to associate the data of two tables. The association can also be performed in a programming language such as Python: common libraries such as Pandas provide rich functionality to process and merge data for subsequent multi-table querying or analysis.
Ensuring data consistency: ensuring the consistency and integrity of the data during the association process and avoiding redundant or incorrect associations.
Through these data preprocessing steps, the data of multiple data tables can be effectively cleaned, integrated, and transformed, laying a solid foundation for subsequent data analysis, mining, and modeling, and improving the accuracy and efficiency of data processing.
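As an illustration of the preprocessing and association steps described above, the following Python sketch uses Pandas; the file names, column names, and cleaning rules are assumptions for demonstration only and are not part of the claimed method.

```python
# Illustrative preprocessing and association sketch using Pandas; file and column
# names (employee_info.csv, salary_info.csv, employee_id, ...) are assumed examples.
import pandas as pd

# Load two locally stored tables (csv is one of the storage forms mentioned above).
employees = pd.read_csv("employee_info.csv")   # e.g. employee_id, name, department_id, hire_date
salaries = pd.read_csv("salary_info.csv")      # e.g. employee_id, salary

# (1) Data cleaning: handle duplicates, missing values, and obvious outliers.
employees = employees.drop_duplicates(subset="employee_id")
employees["hire_date"] = pd.to_datetime(employees["hire_date"], errors="coerce")
salaries = salaries.dropna(subset=["salary"])
salaries = salaries[salaries["salary"] > 0]    # simple validity/outlier filter

# (3) Data transformation: unify types so that the join key is comparable.
employees["employee_id"] = employees["employee_id"].astype(str)
salaries["employee_id"] = salaries["employee_id"].astype(str)

# (4) Data association: join the tables on the common key (primary/foreign key style).
joined = employees.merge(salaries, on="employee_id", how="inner")
print(joined.head())
```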
In one embodiment, the field information includes a name of the data table, an entity field in the data table, and an attribute field corresponding to the entity field.
In one embodiment, metadata feature extraction is performed after the local table data is preprocessed. Metadata feature extraction is a key step in the data processing and analysis process: it helps to better understand the structure, meaning, and association relationships of the data tables and provides the basis for question answering. As shown in fig. 2, metadata feature extraction is illustrated with a table named "employee information" that has 7 fields. The NLP task is parsed and understood, and the tables involved in the question and the relationships among them are determined. This requires identifying the keywords and entities in the question and the join conditions that may be involved. Each related table is understood in terms of its table structure, column names, primary key-foreign key relationships, and other information. The tables to be joined and the join conditions are then determined according to the relationship between the question and the tables, and the table join operation is performed according to the primary key-foreign key relationships or other common attributes. Extracting the metadata features of the data tables comprises the following steps:
annotating the meaning of the extracted field information;
identifying and obtaining the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and obtaining the primary key-foreign key relationships between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relationships to obtain the metadata document.
In a specific implementation, the structural information of the tables is extracted first. The names of the data tables are obtained, and the entities or topics represented by the tables are identified. The number of fields in each data table is counted to understand the dimensionality of the data. The name of each field is obtained, and the meaning of the field can be inferred from the field name. The primary keys and foreign keys in the data tables are identified so as to establish the association relationships between the data tables.
The meanings of the extracted fields are then annotated. The data dictionary or metadata document is consulted to obtain a detailed description and meaning of each field. If a field is highly specialized, its meaning and business logic need to be confirmed with a business expert in the related field.
At the same time, the data types of the extracted data are normalized and unified. The data type of each field, such as integer, string, or date, is obtained to ensure the accuracy and consistency of the data. Knowing the value range of each field helps verify the integrity and validity of the data.
Next, the association relationships between the data are extracted. The primary key-foreign key relationships between the data tables are identified, which helps establish associations between the data tables. A relationship diagram among the data tables is drawn, showing the association paths and dependency relationships among the data.
Finally, all the extracted metadata features are integrated. The extracted table structure information, field meanings, data types, and association relationships are integrated into the metadata document for subsequent use by the question-answering system.
Through this metadata feature extraction process, the structure and meaning of the data tables can be understood in depth, the association relationships between the data are made clear, and the necessary metadata support is provided for building the question-answering system, so that questions posed by the user can be answered more accurately.
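As an illustration, the following Python sketch shows one possible layout for such a metadata document; the table names, field descriptions, and JSON structure are assumptions for demonstration only.

```python
# Illustrative sketch of assembling a metadata document from table schemas; the
# table/field names, descriptions, and the JSON layout are assumed, not the
# patented format.
import json

metadata_document = {
    "tables": {
        "employee_info": {
            "description": "basic information of each employee",
            "fields": {
                "employee_id": {"type": "string", "meaning": "unique employee ID", "key": "primary"},
                "name": {"type": "string", "meaning": "employee name"},
                "department_id": {"type": "string", "meaning": "department the employee belongs to"},
                "hire_date": {"type": "date", "meaning": "date of joining the company"},
            },
        },
        "salary_info": {
            "description": "salary record of each employee",
            "fields": {
                "employee_id": {"type": "string", "meaning": "employee ID",
                                "key": "foreign", "references": "employee_info.employee_id"},
                "salary": {"type": "integer", "meaning": "monthly salary"},
            },
        },
    },
    # Key connection relationships extracted from primary key-foreign key pairs.
    "joins": [
        {"left": "employee_info.employee_id", "right": "salary_info.employee_id"},
    ],
}

with open("metadata_document.json", "w", encoding="utf-8") as f:
    json.dump(metadata_document, f, ensure_ascii=False, indent=2)
```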
In one embodiment, the role of constructing the thinking chain is to build logical relationships among the various knowledge elements based on the content of the data tables and business knowledge, helping the question-answering system understand questions better and provide accurate answers. As shown in fig. 3, constructing the thinking chain includes the following steps:
Constructing an entity field association reasoning process model among the data tables;
Constructing an attribute field association reasoning process model corresponding to entity fields among the data tables;
Constructing business rules and logic: integrating the rules and logical relationships of the business domain and incorporating them into the entity field association reasoning process model and the attribute field association reasoning process model.
When specifically performed, the method comprises the following steps:
(1) Association between entities: in any data table, there may be associations between different entity fields. Taking the table provided in fig. 2 as an example, the employee ID in the employee information table may be associated with the employee ID in the payroll information table, thereby establishing an association between the employee and the payroll information. As another example, the department ID in the department information table is associated with the department ID in the employee information table in order to acquire detailed information about the department in which an employee works.
(2) Dependency between attributes: there are influence and constraint relationships between different attributes in a data table. For example, an employee's position attribute may affect the payroll level, with different positions corresponding to different payroll levels. The employee's hire date may be used to calculate the length of service, which may affect the employee's promotion opportunities and payroll treatment. Establishing an attribute dependency model helps the system better understand the relationships between attributes, so that questions posed by the user can be answered accurately.
(3) Business rules and logic: when constructing the thinking chain, it is necessary to integrate the rules and logical relationships of the specific business domain and incorporate them into the entity associations and attribute dependencies. This includes understanding business processes, rules, and constraints, ensuring that the system can accurately understand and apply these rules to provide a correct solution to the question. For example, employees of different genders may have specific needs or limitations in certain positions, which requires considering the influence of gender factors in employee management.
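By way of illustration, the following Python sketch shows one possible way to encode the entity associations, attribute dependencies, and business rules of the thinking chain so that they can be rendered into a prompt; the rule wording and table or field names are assumptions for demonstration only.

```python
# Illustrative sketch of encoding the question-answer thinking chain as data that can
# be rendered into the prompt; the rule wording and table/field names are assumed.
from dataclasses import dataclass

@dataclass
class EntityLink:          # entity field association between two tables
    left: str
    right: str

@dataclass
class AttributeRule:       # attribute dependency or business rule
    description: str

thinking_chain = {
    "entity_links": [
        EntityLink("employee_info.employee_id", "salary_info.employee_id"),
        EntityLink("employee_info.department_id", "department_info.department_id"),
    ],
    "attribute_rules": [
        AttributeRule("position determines the payroll level"),
        AttributeRule("hire_date determines length of service, which affects promotion"),
    ],
}

def render_chain(chain: dict) -> str:
    """Render the chain as numbered reasoning steps to be appended to the prompt."""
    steps = [f"Step {i + 1}: join on {link.left} = {link.right}"
             for i, link in enumerate(chain["entity_links"])]
    steps += [f"Rule: {rule.description}" for rule in chain["attribute_rules"]]
    return "\n".join(steps)

print(render_chain(thinking_chain))
```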
In one embodiment, the generate-prompt generation prompt includes a prompt field, an input card slot, and an output card slot.
The principle of prompt learning is described below:
Prompt learning is a method of directing a language model to perform a particular task by designing natural language prompts or instructions. The goal of prompt learning is to convert downstream tasks into the form of pre-training tasks through a prompt template. "Prompt engineering" is one of the core theories of prompt learning; it emphasizes how to build or design appropriate prompts to guide the model to perform specific tasks, bringing the downstream task closer to the pre-trained model. These prompts may be natural language questions, declarative instructions, example input-output pairs, or other forms of guidance.
Taking a sentiment classification task as an example, as shown in fig. 5, different prompts may be designed for the same natural language task. Different prompts may have different effects, and there may be one or more input card slots. Through reasonably designed prompts, the model can be effectively guided to execute specific tasks, improving its performance and generalization capability on downstream tasks. In prompt engineering, designing appropriate prompts is critical, as they affect the effectiveness of model learning and reasoning.
Prompt engineering is an important method for guiding the model to learn and execute tasks; by carefully designing the prompt, learning of the domain-specific task is guided, improving the performance and adaptability of the model. Effective prompt design enables the model to better understand the task requirements, reduces errors, and improves the overall learning effect.
In this embodiment, as shown in fig. 6, the construction of the choose-prompt selection prompt and the generate-prompt generation prompt is achieved through the database table question-answer thinking chain. The English part of the figure is the machine-language execution process and illustrates the execution logic of the question-answer thinking chain. The choose-prompt selection prompt is used to guide the large model to screen relevant tables from the database according to the question posed by the user. For example, in FIG. 6, for the input question "whose wage is 10000", the name of the employee whose wage is 10000 cannot be obtained directly from a single table. When the thinking chain is established, the employee ID in the employee information table may be associated with the employee ID in the salary information table by reading the metadata features, establishing the association between the employee name and the salary information; therefore, the employee information table and the salary information table are selected in the choose-prompt stage. The generate-prompt generation prompt aims at guiding the model to generate the corresponding SQL statement by filling the screened table information, the converted language format, the question, and other key information into the card slot. After the SQL statement is executed, the corresponding result is returned.
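As an illustration of the two prompt templates and their input card slots, the following Python sketch is provided; the template wording, slot names, and the example schema text are assumptions for demonstration, and the SQL shown in the comment is merely the kind of statement the large language model would be expected to produce.

```python
# Illustrative sketch of the two prompt templates with input card slots; wording,
# slot names, and the example schema text are assumed for demonstration only.
CHOOSE_PROMPT = """Database metadata document:
{metadata_document}

User question: {question}

Based on the entity and attribute associations above, list the tables required to
answer the question, one table name per line."""

GENERATE_PROMPT = """Selected table information:
{table_info}

Output language format: {language_format}
User question: {question}

Following the reasoning chain (join employee_info and salary_info on employee_id),
write one executable SQL statement that answers the question."""

# Filling the input card slots for the example question "whose wage is 10000".
prompt = GENERATE_PROMPT.format(
    table_info="employee_info(employee_id, name, ...); salary_info(employee_id, salary)",
    language_format="SQL",
    question="whose wage is 10000",
)
# A large language model would be expected to return something like:
# SELECT e.name FROM employee_info e JOIN salary_info s
#   ON e.employee_id = s.employee_id WHERE s.salary = 10000;
print(prompt)
```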
The above embodiment of generative large language model construction combines deep learning algorithms from the artificial intelligence field and utilizes the capabilities of automatic text understanding, text generation, and context awareness to automatically extract the key function point descriptions of each paragraph in a document by analyzing the grammar, semantics, and context of the text, outputting them item by item. The fine-tuned generative large language model can output normalized function point items in a fixed format, and at the same time, based on a prompt-word learning strategy, the function points are refined without changing the output format. Thus, end-to-end generation of the content and number of items, from the complete document to function points of different categories, is realized.
In one embodiment, the invention trains a text classification model for identifying the function point category based on the ground-truth part of the standardized function point items in a pre-labeled data set, and obtains the final function point item estimation result by fusing the output of the generative large language model, which can be used directly for subsequent cost calculation. Based on a function point identification technique using a bidirectional encoder representation Transformer attention model, the normalized item output of the model is further identified, and the identification result is fused with the normalized output result to realize more accurate function point identification. The text classification model training and reasoning process is shown in figure 2.
The input of the text classification training process is the function point items in the data set, and the ground-truth output is the category corresponding to each item. The input of the test flow is the function point items extracted by the generative large language model.
For the function point items input into the model, this embodiment first divides the text data into words or sub-words, i.e., performs tokenization, then extracts an embedding for each token, extracts token-level text features from the embeddings through a stack of Transformer encoder layers, and finally performs text classification in a function point classifier and computes the cross-entropy loss.
The function point classifier comprises a feed-forward neural network and a softmax layer, which serves as a task-specific output layer after the encoder; only this output layer needs to be added when fine-tuning on text, without retraining the whole model, so the invention only needs to update the parameters of the classifier using the function point classification loss.
The above embodiment of text classification BERT model construction achieves accurate classification of function point items by fine-tuning the BERT model, i.e., a text classification technique based on the bidirectional encoder representation Transformer attention network. The model uses large-scale pre-training to learn general language representations, has stronger semantic feature expression than traditional machine learning, can adapt to a specific classification task through fine-tuning, converges more easily, and has better generalization and discrimination performance.
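By way of illustration, the following Python sketch shows a classifier head of this kind built on a pre-trained encoder using the Hugging Face transformers library; the checkpoint name, the number of categories, and the training details are assumptions for demonstration only, and the softmax is applied implicitly inside the cross-entropy loss.

```python
# Illustrative sketch of a function point classifier head on a frozen pre-trained
# encoder; checkpoint name, label count, and hyperparameters are assumed examples.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel

NUM_CLASSES = 5  # assumed number of function point categories

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

# Feed-forward output layer added after the encoder; softmax is folded into the loss.
classifier = nn.Linear(encoder.config.hidden_size, NUM_CLASSES)
loss_fn = nn.CrossEntropyLoss()

# Freeze the encoder and update only the classifier parameters, as described above.
for p in encoder.parameters():
    p.requires_grad = False
optimizer = torch.optim.AdamW(classifier.parameters(), lr=1e-4)

def train_step(texts, labels):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).last_hidden_state[:, 0]   # [CLS] representation
    logits = classifier(hidden)
    loss = loss_fn(logits, torch.tensor(labels))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```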
In one embodiment, in the test stage, since the input function point items come from the output of the generative large language model, a category judgment is already contained in that output. When the two judgments agree, the agreed category is used as the output; when they disagree, the confidence of the classification result output by the text classification model is consulted, and the invention replaces the first output with the high-confidence function point identification category, so as to realize more accurate inference.
For the overall realization of end-to-end software function point extraction and identification, the two algorithms are integrated into a unified flow: each text paragraph is segmented from the complete document, the function point descriptions are located in the paragraphs relevant to function points, the content and number of the function point items are extracted from the descriptions, and finally the classification and identification of the function points is realized. The general technical scheme is shown in fig. 3.
The multi-table association large language model question-answering method based on metadata characteristics and a thinking chain provided by the invention has been described in detail above. The principle and implementation of the invention are explained herein with specific examples; the description of the examples is only intended to help understand the method and its core ideas. Since those of ordinary skill in the art may vary the specific implementation and application scope according to the ideas of the invention, the content of this description shall not be construed as limiting the invention.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises that element.

Claims (5)

1. A multi-table association large language model question-answering method based on metadata features and a thinking chain, characterized by comprising the following steps:
Data table preprocessing: performing data normalization on multiple data tables and establishing associations between the data tables;
Building a metadata document: extracting metadata features of the data tables to obtain a metadata document, including field information and key connection relationships;
Constructing a thinking chain: constructing a question-answer thinking chain based on the association relationships between the data table field information, comprising the following steps:
constructing an entity field association reasoning process model between the data tables;
constructing an attribute field association reasoning process model corresponding to the entity fields between the data tables;
building business rules and logic, including integrating the rules and logical relationships of the business domain and incorporating them into the entity field association reasoning process model and the attribute field association reasoning process model;
Constructing a prompt template: constructing a selection prompt and a generation prompt based on the question-answer thinking chain, wherein
the selection prompt is used to screen out relevant tables according to the user question in combination with the metadata document; the generation prompt is used to receive the metadata document content corresponding to the relevant tables and, combined with the user question, generate the prompt; the generation prompt guides the model to generate the corresponding SQL statement by filling the screened key information into the card slot, the key information including: the table information, the converted language format, and the question;
Question-answer prediction: receiving the user question, generating a prompt using the prompt template, inputting the prompt into a pre-trained large language model, and generating the corresponding SQL statement.
2. The multi-table association large language model question-answering method based on metadata features and a thinking chain according to claim 1, wherein performing data normalization on the multiple data tables comprises one or more combinations of the following steps:
Data cleaning: normalization of one or more of missing values, duplicate values, and outliers;
Data integration: merging related data from different data tables;
Data transformation: unifying the data formats of different data tables, performing standardization, and extracting features from the data tables.
3. The multi-table association large language model question-answering method based on metadata features and a thinking chain according to claim 1, wherein the field information includes: the name of the data table, the entity fields in the data table, and the attribute fields corresponding to the entity fields.
4. The multi-table association large language model question-answering method based on metadata features and a thinking chain according to claim 1, wherein extracting the metadata features of the data tables further comprises the following steps:
annotating the meaning of the extracted field information;
identifying and obtaining the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and obtaining the primary key-foreign key relationships between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relationships to obtain the metadata document.
5. The multi-table association large language model question-answering method based on metadata features and a thinking chain according to claim 1, wherein the generation prompt includes a prompt field, an input card slot, and an output card slot, and is used to perform the following step:
filling the screened table information, the converted language format, and the user question into the input card slot to guide the large language model to generate the corresponding SQL statement.
CN202410687924.7A 2024-05-30 2024-05-30 Multi-table association large language model question-answering method based on metadata characteristics and thinking chain Active CN118245591B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410687924.7A CN118245591B (en) 2024-05-30 2024-05-30 Multi-table association large language model question-answering method based on metadata characteristics and thinking chain

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410687924.7A CN118245591B (en) 2024-05-30 2024-05-30 Multi-table association large language model question-answering method based on metadata characteristics and thinking chain

Publications (2)

Publication Number Publication Date
CN118245591A (en) 2024-06-25
CN118245591B (en) 2024-11-29

Family

ID=91555086

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410687924.7A Active CN118245591B (en) 2024-05-30 2024-05-30 Multi-table association large language model question-answering method based on metadata characteristics and thinking chain

Country Status (1)

Country Link
CN (1) CN118245591B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118467683B (en) * 2024-07-15 2024-09-24 金现代信息产业股份有限公司 Contract text examination method, system, device and medium based on natural language
CN119668579A (en) * 2024-11-20 2025-03-21 中山大学 LAPACK code library function name recommendation method and system based on thought chain and logical reasoning
CN120632054A (en) * 2025-08-11 2025-09-12 华院计算技术(上海)股份有限公司 Legal thinking chain data construction method, legal thinking chain data construction device and program product

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093621A (en) * 2024-02-20 2024-05-28 上海信投数字科技有限公司 Structured query language generation method and device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11604790B2 (en) * 2020-08-31 2023-03-14 Unscrambl Inc Conversational interface for generating and executing controlled natural language queries on a relational database
CN116991869A (en) * 2023-07-24 2023-11-03 北京泰策科技有限公司 A method to automatically generate database query statements based on NLP language model
CN117076719B (en) * 2023-10-12 2024-04-19 北京枫清科技有限公司 Database joint query method, device and equipment based on large language model
CN117390169B (en) * 2023-12-11 2024-04-12 季华实验室 Table data question and answer method, device, equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118093621A (en) * 2024-02-20 2024-05-28 上海信投数字科技有限公司 Structured query language generation method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN118245591A (en) 2024-06-25

Similar Documents

Publication Publication Date Title
CN118245591B (en) Multi-table association large language model question-answering method based on metadata characteristics and thinking chain
US6941513B2 (en) System and method for text structuring and text generation
CN113032418A (en) Method for converting complex natural language query into SQL (structured query language) based on tree model
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN112328800A (en) System and method for automatically generating programming specification question answers
CN118132579A (en) NL2 SQL-based intelligent medical insurance query method and system
CN113111158B (en) Intelligent data visualization oriented conversational question-answering implementation method
CN118484465B (en) A method and device for generating SQL statements from natural language statements
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN116341569A (en) Professional document intelligent auxiliary reading method based on domain knowledge base
CN117251455A (en) Intelligent report generation method and system based on large model
CN112507089A (en) Intelligent question-answering engine based on knowledge graph and implementation method thereof
Sun A natural language interface for querying graph databases
CN114942981A (en) Question-answer query method and device, electronic equipment and computer readable storage medium
CN116561264A (en) Knowledge graph-based intelligent question-answering system construction method
CN119669530A (en) Knowledge graph generation-assisted teaching question answering method and system based on LLM
CN119646026A (en) A vertical field document question answering method and system based on knowledge graph enhanced big model
CN119990290A (en) An intelligent analysis method of legal text based on legal concept genealogy
CN120012891A (en) A method and system for constructing a standard analysis graph using a large language model
CN119808919A (en) A personalized learning recommendation method based on personalized knowledge graph
CN119226311A (en) SQL generation method, system, device and medium based on large language model
CN117875307A (en) Text parsing method and device for intelligent question and answer
CN117473054A (en) Knowledge graph-based general intelligent question-answering method and device
CN117094390A (en) Knowledge graph construction and intelligent search method oriented to ocean engineering field
Dai et al. QAM: Question answering system based on knowledge graph in the military

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant