Disclosure of Invention
In view of the current state of research and the problems described above, the invention provides a multi-table association question-answering method for large models based on metadata features and a chain of thought, so as to improve the performance of table screening in multi-table scenarios involving large models.
The invention provides a multi-table association large language model question-answering method based on metadata features and chains of thought, which comprises the following steps:
data table preprocessing: performing data normalization on a plurality of data tables and establishing associations among them;
metadata feature extraction: extracting metadata features of the data tables to obtain a metadata document, the metadata document comprising field information and key-connection relations;
chain-of-thought construction: constructing a question-answering chain of thought based on the association relations among the field information of the data tables;
prompt template construction: constructing a choose-prompt (selection prompt) and a generate-prompt (generation prompt) based on the question-answering chain of thought, wherein the choose-prompt is used to screen out the relevant tables according to the user question and the metadata document;
question-answer prediction: receiving the user question, generating a prompt with the prompt template, inputting the prompt into a pre-trained large language model, and generating the corresponding SQL statement.
Preferably, the data normalization of the plurality of data tables includes one or more of the following steps:
data cleaning: normalizing one or more of missing values, duplicate values, and abnormal values;
data integration: merging related data from different data tables;
data conversion: unifying the data formats of the different data tables, performing standardization, and extracting features from the data tables.
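The three normalization steps above can be sketched in plain Python; the employee records, field names, and fill-with-zero policy below are hypothetical illustrations, not part of the invention:

```python
# Hypothetical raw records with a duplicate row and a missing salary value.
records = [
    {"emp_id": "001", "name": "Alice", "salary": "9500"},
    {"emp_id": "001", "name": "Alice", "salary": "9500"},   # duplicate row
    {"emp_id": "002", "name": "Bob",   "salary": None},     # missing value
]

# 1. Data cleaning: drop duplicates, fill missing salaries with a default of "0".
seen, cleaned = set(), []
for row in records:
    if row["emp_id"] in seen:
        continue
    seen.add(row["emp_id"])
    cleaned.append(dict(row, salary=row["salary"] if row["salary"] is not None else "0"))

# 2. Data integration: merge a related department table on the shared emp_id key.
departments = {"001": "R&D", "002": "Sales"}
integrated = [dict(r, dept=departments.get(r["emp_id"], "unknown")) for r in cleaned]

# 3. Data conversion: unify types (salary string -> int) for later feature extraction.
converted = [dict(r, salary=int(r["salary"])) for r in integrated]
```

In practice each step would be driven by the table's metadata rather than hard-coded keys; the sketch only shows the order and effect of the operations.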
Preferably, the field information comprises the name of the data table, the entity field in the data table and the attribute field corresponding to the entity field.
Preferably, extracting the metadata features of the data table further includes the following steps:
annotating the extracted field information with its meaning;
identifying and acquiring the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and acquiring the primary key-foreign key relations between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relations to obtain the metadata document.
Preferably, constructing the chain of thought includes the following steps:
constructing an entity-field association reasoning process model among the data tables;
constructing an association reasoning process model for the attribute fields corresponding to the entity fields among the data tables;
constructing business rules and logic: consolidating and setting the rules and logical relations of the business domain, and integrating them into the entity-field and attribute-field association reasoning process models.
Preferably, the generate-prompt comprises prompt fields for an input slot and an output slot, and is used to perform the following step:
filling the screened table information, the converted language format, and the user question into the input slot to guide the large language model to generate the corresponding SQL statement.
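A minimal sketch of such a slot-filling generate-prompt, assuming a hypothetical template text and field names (the invention does not prescribe the exact wording):

```python
# Hypothetical generate-prompt template: one input slot per piece of key
# information, with the model's SQL answer serving as the output slot.
GENERATE_PROMPT = (
    "Given the tables:\n{table_info}\n"
    "Answer the question as a single SQL statement.\n"
    "Question: {question}\nSQL:"
)

def build_generate_prompt(table_info, question):
    """Fill the screened table information and the user question into the slots."""
    return GENERATE_PROMPT.format(table_info=table_info, question=question)

prompt = build_generate_prompt(
    "employee(emp_id, name); salary(emp_id, amount)",
    "Whose salary is 10000?",
)
```

The filled prompt string is what gets sent to the pre-trained large language model.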
Compared with the prior art, the multi-table association large language model question-answering method based on metadata features and chains of thought has the following beneficial effects:
The invention parses and understands the NLP task, determining the tables involved in the question and the relations among them. Understanding each table involved and the relevance between tables is critical to solving multi-table queries. The tables to be joined and the join conditions are determined from the relation between the question and the tables, and the join operation is performed according to primary key-foreign key relations or other shared attributes. A choose-prompt and a generate-prompt are constructed according to the execution process of the database-table question-answering chain of thought: the choose-prompt guides the large model to screen the relevant tables from the database according to the user question, while the generate-prompt fills the screened table information, the converted language format, the question, and other key information into the slots so that the model generates the corresponding SQL statement, which is executed and its result returned.
When faced with multi-table queries, the method fully mines the associations among the table data, improves the accuracy of table screening, guides the model to generate the corresponding SQL statements, and improves the accuracy of the answers.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the embodiments of the invention without inventive effort fall within the scope of the invention.
The following problems must be considered for multi-table query scenarios in the Text2SQL task:
Table association: in a multi-table query task, the association relations between different database tables, including inner joins, outer joins, and the like, must be identified and understood to ensure that the information of multiple tables is correctly integrated.
Join conditions: the model must accurately identify the join conditions in the natural-language question and convert them into the join clauses of the SQL query, covering the join conditions in the ON clause, the filter conditions in the WHERE clause, and so on.
Query complexity: the SQL statements involved in multi-table queries are typically more complex, containing more clauses, operations, and nested structures; the model must process these complex structures efficiently to generate an accurate SQL query.
Semantic reasoning: the multi-table query task requires a model with strong semantic-reasoning capability that can resolve questions involving the relations and attributes of multiple tables, including identifying indirect associations and inferring relations among attributes.
Data sparsity: because multi-table queries involve multiple database tables, data sparsity may arise, i.e., some associations in the input question occur rarely in the training data, making it difficult for the model to learn enough to accurately generate multi-table query statements.
As shown in FIG. 1, the overall multi-table association large model question-answering method based on metadata features and chains of thought provided by the embodiment of the invention first preprocesses local data and then extracts metadata features from the processed table data to generate a metadata document. A database question-answering chain of thought is constructed to guide the large model through the reasoning steps of table screening and SQL generation according to the user question. During table screening, the generation of the SQL statement is optimized by introducing the metadata document. After the database executes the SQL statement generated by the large model, the result is returned and the metadata document is updated according to the result. The method specifically comprises the following steps:
data table preprocessing: performing data normalization on a plurality of data tables and establishing associations among them;
metadata feature extraction: extracting metadata features of the data tables to obtain a metadata document, the metadata document comprising field information and key-connection relations;
chain-of-thought construction: constructing a question-answering chain of thought based on the association relations among the field information of the data tables;
prompt template construction: constructing a choose-prompt and a generate-prompt based on the question-answering chain of thought, wherein the choose-prompt is used to screen the relevant tables according to the user question in combination with the metadata document;
question-answer prediction: in subsequent operation, receiving the user question, obtaining the table structure information with the choose-prompt, injecting it into the generate-prompt to produce the final prompt, inputting the prompt into the pre-trained large language model, and invoking the database chain to answer over the candidate tables, generating the corresponding SQL statement.
In this embodiment, ChatGLM-6B-32K is selected as the general-purpose large language model and text2vec-base-chinese as the embedding model. ChatGLM has advantages such as low deployment cost and fluent output, and compared with its base model, ChatGLM-6B-32K supports in-context learning and multi-turn dialogue over longer sequences. text2vec-base-chinese, trained on a large amount of Chinese text, has strong text-matching performance for multiple languages, especially Chinese.
In one embodiment, as shown in FIG. 1, table data has many forms of presentation and storage, such as csv spreadsheets, html paged tables, and database table-management systems. In practical engineering application and management, local raw table data suffers from missing data, inconsistent data types, inconsistent units of measurement, and the like. The table data is therefore preprocessed in advance and the tables are connected to enable multi-table question answering. Performing data normalization on the plurality of data tables includes one or more of the following steps:
data cleaning: normalizing one or more of missing values, duplicate values, and abnormal values;
data integration: merging related data from different data tables, including consolidating metadata information with the same name into one table, or merging tables with identical or overlapping content into one table, reducing redundancy and the number of tables;
data conversion: unifying the data formats of the different data tables, performing standardization, and extracting features from the data tables.
In a specific execution, the preprocessing steps are as follows:
(1) Data cleaning:
Missing-value handling: detecting and handling missing values in the data table, choosing methods such as filling the missing values or deleting the rows or columns containing them.
Outlier handling: identifying and handling outliers in the data table, detecting them by statistical methods or visualization, and handling them according to the specific situation.
Duplicate handling: finding and removing duplicate records in the data table to ensure the uniqueness of the data.
(2) Data integration:
Table merging: merging related data from different data tables for subsequent analysis and modeling.
(3) Data conversion:
Data-format conversion: converting the data types in the data table to ensure the consistency and comparability of the data.
Feature extraction: extracting the required features from the raw data for the subsequent metadata feature extraction, analysis, and modeling of the data tables.
Data normalization: standardizing or normalizing the data to eliminate scale differences among different data.
(4) Data association:
Establishing links: establishing connections between the data tables according to the association relations between the data (such as primary key-foreign key relations), so as to integrate and associate the data and facilitate multi-table queries. Specifically:
Associations are established according to shared fields or key values in the data tables, including primary keys, foreign keys, string matching, time windows, or geographic locations, and the association operation is performed with SQL statements, for example using the JOIN keyword to associate the data of two tables. The association can also be performed in a programming language such as Python, where common libraries such as Pandas provide rich functionality for processing and merging data for subsequent multi-table queries or analysis.
Ensuring data consistency: ensuring the consistency and integrity of the data during association, and avoiding redundant or erroneous associations.
Through these data preprocessing steps, the data of multiple data tables can be effectively cleaned, integrated, and converted, laying a solid foundation for subsequent data analysis, mining, and modeling, and improving the accuracy and efficiency of data processing.
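The JOIN-based association described above can be illustrated with SQLite from the Python standard library; the table and column names here are hypothetical stand-ins for the primary key-foreign key pair:

```python
import sqlite3

# Two hypothetical tables linked by a primary key-foreign key pair (emp_id),
# associated with a JOIN as in the data-association step.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE salary (emp_id INTEGER REFERENCES employee(emp_id), amount INTEGER);
    INSERT INTO employee VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO salary VALUES (1, 10000), (2, 8000);
""")

# JOIN on the shared key field to integrate information from both tables.
rows = con.execute(
    "SELECT e.name, s.amount FROM employee e "
    "JOIN salary s ON e.emp_id = s.emp_id ORDER BY e.emp_id"
).fetchall()
```

The same association could equally be done with `pandas.merge` on the shared column, as the text notes.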
In one embodiment, the field information includes a name of the data table, an entity field in the data table, and an attribute field corresponding to the entity field.
In one embodiment, metadata feature extraction is performed after the local table data is preprocessed. Metadata feature extraction is a key step in data processing and analysis: it helps to better understand the structure, meaning, and association relations of the data tables and provides the basis for question answering. As shown in fig. 2, metadata feature extraction is performed on a table named employee information with 7 fields. The NLP task is parsed and understood, and the tables involved in the question and the relations among them are determined; this requires identifying the keywords and entities in the question and the join methods that may be involved. Each related table is understood, including its table structure, column names, and primary key-foreign key relations. The tables to be joined and the join conditions are determined from the relation between the question and the tables, and the join operation is performed according to primary key-foreign key relations or other shared attributes. Extracting the metadata features of the data table comprises the following steps:
annotating the extracted field information with its meaning;
identifying and acquiring the data types of the field information, and standardizing and unifying the data types of field information of the same category;
identifying and acquiring the primary key-foreign key relations between the data tables;
integrating the extracted field information, field meanings, data types, and primary key-foreign key relations to obtain the metadata document.
In a specific implementation, the structural information of the table is extracted first: the names of the data tables are obtained to learn the entities or topics they represent; the number of fields in each table is counted to understand the dimensionality of the data; the name of each field is obtained, from which its meaning can be inferred; and the primary keys and foreign keys in the data tables are identified so that association relations between tables can be established.
The meanings of the extracted fields are then annotated. The data dictionary or metadata document is consulted to obtain a detailed description and the meaning of each field. For highly specialized fields, domain experts are consulted to confirm the field's meaning and business logic.
Meanwhile, the data types of the extracted data are standardized and unified. The data type of each field (integer, string, date, etc.) is acquired to ensure the accuracy and consistency of the data; knowing each field's value range helps verify the integrity and validity of the data.
The association relations between the data are then extracted. The primary key-foreign key relations between data tables are identified, which helps to establish associations between them, and a relation diagram is drawn to show the association paths and dependencies among the tables.
Finally, all extracted metadata features are integrated: the table structure information, field meanings, data types, and association relations are consolidated into the metadata document for subsequent use by the question-answering system.
Through this metadata feature extraction process, the structure and meaning of the data tables can be understood in depth and the association relations between the data clarified, providing the metadata support needed to build a question-answering system that answers user questions more accurately.
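One possible way to assemble such a metadata document is to read field names, types, primary keys, and foreign keys directly from the database catalog. The sketch below uses SQLite's PRAGMA interface and hypothetical tables; a real deployment would also attach the annotated field meanings:

```python
import sqlite3

# Hypothetical schema: employee has a foreign key into department.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee (emp_id INTEGER PRIMARY KEY, name TEXT, dept_id INTEGER,
                           FOREIGN KEY (dept_id) REFERENCES department(dept_id));
    CREATE TABLE department (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
""")

def metadata_document(con):
    """Integrate table names, field info, and key relations into one document."""
    doc = {}
    tables = [r[0] for r in con.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    for t in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        fields = [{"name": c[1], "type": c[2], "pk": bool(c[5])}
                  for c in con.execute(f"PRAGMA table_info({t})")]
        # PRAGMA foreign_key_list rows: (id, seq, table, from, to, ...)
        fks = [{"from": fk[3], "to_table": fk[2], "to": fk[4]}
               for fk in con.execute(f"PRAGMA foreign_key_list({t})")]
        doc[t] = {"fields": fields, "foreign_keys": fks}
    return doc

doc = metadata_document(con)
```

The resulting dictionary plays the role of the metadata document: field information plus the key-connection relations used later for table screening.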
In one embodiment, the role of constructing the chain of thought is to build logical relations among knowledge elements based on the content of the data tables and business knowledge, helping the question-answering system understand questions better and provide accurate answers. As shown in fig. 3, constructing the chain of thought includes the following steps:
constructing an entity-field association reasoning process model among the data tables;
constructing an association reasoning process model for the attribute fields corresponding to the entity fields among the data tables;
constructing business rules and logic: consolidating and setting the rules and logical relations of the business domain, and integrating them into the entity-field and attribute-field association reasoning process models.
In a specific execution, the method comprises the following steps:
(1) Associations between entities: in any data table, there may be associations between different entity fields. Taking the table in fig. 2 as an example, the employee ID in the employee information table can be associated with the employee ID in the payroll information table, establishing an association between employees and payroll information. Likewise, the department ID in the department information table can be associated with the department ID in the employee information table to obtain detailed information about an employee's department.
(2) Dependencies between attributes: different attributes in a data table influence and constrain one another. For example, an employee's position affects their payroll level, with different positions corresponding to different levels; an employee's hire date can be used to compute their seniority, which affects promotion opportunities and pay. Building the attribute-dependency model helps the system better understand the relations between attributes and thus answer user questions accurately.
(3) Business rules and logic: when constructing the chain of thought, the rules and logical relations of the specific business domain must be consolidated and integrated into the entity associations and attribute dependencies. This includes understanding business processes, rules, and constraints, ensuring that the system can accurately understand and apply them to answer questions correctly. For example, employees of different genders may have specific needs or restrictions for certain positions, so gender must be considered in employee management.
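A toy model of the entity associations, attribute dependencies, and business rules described above; all table names, key fields, and the position-to-pay-grade rule are hypothetical illustrations:

```python
# Entity associations: edges between tables, keyed by the shared field.
entity_links = {
    ("employee_info", "salary_info"): "emp_id",
    ("employee_info", "department_info"): "dept_id",
}

def join_path(src, dst):
    """Return the shared key field linking two tables, or None if unlinked."""
    return entity_links.get((src, dst)) or entity_links.get((dst, src))

# Attribute dependency expressed as a business rule: position determines pay grade.
pay_grade = {"engineer": 3, "manager": 5}
```

In the invention these relations come from the metadata document and domain experts; here they are hard-coded only to show the shape of the reasoning model.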
In one embodiment, the generate-prompt includes prompt fields for an input slot and an output slot.
The principle of prompt learning is described below:
Prompt learning is a method of directing a language model to perform a particular task by designing natural-language prompts or instructions. Its goal is to convert downstream tasks into the form of the pre-training task through a prompt template. Prompt engineering is one of the core practices of prompt learning; it emphasizes how to construct or design appropriate prompts to guide the model to perform specific tasks, bringing the downstream task closer to the pre-trained model. These prompts may be natural-language questions, declarative instructions, example input-output pairs, or other forms of guidance.
Taking a sentiment classification task as an example, as shown in fig. 5, different prompts may be designed for the same natural-language task, and different prompts may have different effects; the input slot may be one or more. With a well-designed prompt, the model can be effectively guided to perform a specific task, improving its performance and generalization on downstream tasks. In prompt engineering, designing appropriate prompts is critical, as they affect both learning and inference.
Prompt engineering is thus an important method for guiding the model to learn and perform tasks: carefully designed prompts guide learning on domain-specific tasks and improve the model's performance and adaptability. Effective prompt design lets the model better understand task requirements, reduces errors, and improves the overall learning effect.
In this embodiment, as shown in fig. 6, the choose-prompt and the generate-prompt are constructed via the database-table question-answering chain of thought. The English portion of the figure is the machine-executed process, illustrating the execution logic of the chain of thought. The choose-prompt guides the large model to screen the relevant tables from the database according to the question posed by the user. For example, in fig. 6, for the input question "whose wage is 10000", the names of employees earning 10000 cannot be obtained from a single table. When the chain of thought is built, reading the metadata features reveals that the employee ID in the employee information table can be associated with the employee ID in the salary information table, establishing the association between employee names and salary information; therefore the choose-prompt stage selects the employee information table and the salary information table. The generate-prompt then guides the model to generate the corresponding SQL statement by filling the screened table information, the converted language format, the question, and other key information into the slots. After the SQL statement is executed, the corresponding result is returned.
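The two-stage flow of this example can be sketched as follows. The keyword-matching function below is only an illustrative stand-in for the LLM's choose-prompt screening step, and all table and field names are hypothetical:

```python
# Hypothetical metadata document: table name -> field names.
METADATA = {
    "employee_info": ["emp_id", "name", "gender"],
    "salary_info": ["emp_id", "wage"],
    "department_info": ["dept_id", "dept_name"],
}

def choose_tables(question, metadata):
    """Stand-in for choose-prompt screening: keep tables whose fields the question needs."""
    wanted = {"name": "employee_info", "wage": "salary_info"}
    return sorted({t for kw, t in wanted.items() if kw in question})

question = "whose wage is 10000? give the name"
tables = choose_tables(question, METADATA)

# generate-prompt stage: with the two screened tables and the emp_id link from
# the metadata, the large model would be guided toward SQL of this shape:
sql = ("SELECT e.name FROM employee_info e "
       "JOIN salary_info s ON e.emp_id = s.emp_id WHERE s.wage = 10000")
```

The real system delegates both the screening and the SQL generation to the large language model; the sketch only fixes the data flow between the two prompts.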
This embodiment of generative large language model construction combines deep-learning algorithms from the artificial-intelligence field, using the capabilities of automatic text understanding, text generation, and context awareness to automatically extract the key function-point descriptions of each paragraph in a document by analyzing the grammar, semantics, and context of the text, and to output them item by item. The fine-tuned generative large language model outputs normalized function-point items in a fixed format and, based on a prompt-word learning strategy, refines the function points without changing the output format, thereby realizing end-to-end generation of the content and number of function-point items of different categories from the complete document.
In one embodiment, the invention trains a text classification model for identifying function-point categories on the ground-truth portion of the normalized function-point items in a pre-annotated dataset, and obtains the final function-point item estimate by fusing it with the output of the generative large language model, which can be used directly for subsequent cost calculation. The function-point identification technique, based on the attention model of a bidirectional-encoding-representation Transformer (BERT), further classifies the normalized items output by the model and fuses the classification result with the normalized output, achieving more accurate function-point identification. The training and inference process of the text classification model is shown in figure 2.
The input of the text classification training process is the function-point items in the dataset, and the ground-truth output is the category corresponding to each item. The input of the test flow is the function-point items extracted by the generative large language model.
For the function-point items input to the model, this embodiment first splits the text data into words or sub-words (the tokenization step), then extracts an embedding for each token, extracts token features through multiple stacked Transformer encoder modules, and finally performs text classification in the function-point classifier, computing the cross-entropy loss.
The function-point classifier comprises a feed-forward neural network and a softmax layer, which serve as a task-specific output layer after the encoder. When fine-tuning on text, only this output layer needs to be added, without retraining the whole model, so the invention only updates the classifier's parameters using the function-point classification loss.
The above embodiment of BERT model construction for text classification achieves accurate classification of function-point items by fine-tuning BERT, i.e., a text classification technique based on the bidirectional-encoding-representation Transformer attention network. The model uses large-scale pre-training to learn general language representations; compared with traditional machine learning it has richer semantic feature expression, can adapt to specific classification tasks by fine-tuning, converges more easily, and offers better generalization and discrimination.
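A minimal stand-in for such a classifier head (one feed-forward layer followed by softmax, with cross-entropy loss), using toy weights in place of the trained parameters and omitting the BERT encoder itself:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_head(features, weights, bias):
    """One linear (feed-forward) layer over the encoder feature, then softmax."""
    logits = [sum(w * f for w, f in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)

def cross_entropy(probs, true_class):
    """Loss used to update only the classifier parameters during fine-tuning."""
    return -math.log(probs[true_class])

# Toy 2-dimensional encoder feature, 3 hypothetical function-point categories.
probs = classifier_head([1.0, -0.5],
                        weights=[[0.2, 0.1], [0.9, -0.3], [-0.4, 0.8]],
                        bias=[0.0, 0.1, 0.0])
loss = cross_entropy(probs, true_class=1)
```

In the actual embodiment the feature vector comes from the stacked Transformer encoder and the weights are learned by backpropagating this loss into the classifier only.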
In one embodiment, at the test stage, since the input function-point items come from the output of the generative large language model, that output already contains a category judgment. When the two judgments agree, the shared category is output; when they disagree, the classification confidence of the text classification model is consulted, and the invention replaces the first output with the higher-confidence function-point category, achieving more accurate inference.
For the overall end-to-end realization of software function-point extraction and identification, the two algorithms are integrated into a unified pipeline, so that each text paragraph is cut from the complete document, the function-point descriptions are located in the relevant paragraphs, the content and number of function-point items are extracted from the descriptions, and finally the function points are classified. The general technical scheme is shown in fig. 3.
The multi-table association large language model question-answering method based on metadata features and chains of thought provided by the invention has been described in detail above. Specific examples have been used to explain the principles and implementation of the invention; the description of the embodiments is intended only to help understand the method and its core ideas. Since those of ordinary skill in the art may vary the specific implementation and scope of application according to the ideas of the invention, the content of this description should not be construed as limiting the invention.
In this document, relational terms such as first and second may be used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises that element.