
CN119046313B - Query statement generation method based on relational graph - Google Patents


Info

Publication number: CN119046313B (grant); earlier publication: CN119046313A
Authority: CN (China)
Application number: CN202411505112.2A
Other languages: Chinese (zh)
Inventors: 黄浩, 吴华夫, 姚诗成
Assignee (original and current): Guangzhou Smart Software Co ltd
Legal status: Active (the legal status is an assumption and is not a legal conclusion)


Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2452Query translation
    • G06F16/24522Translation of natural language queries to structured queries
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/243Natural language query formulation
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application relates to a query statement generation method based on a relationship graph. The method acquires a natural query statement and target data table information input by a user, constructs a relationship graph comprising statement nodes and query field nodes, and extracts multidimensional feature representations of the graph's nodes through an encoding layer. These node feature representations are further processed by a pre-trained relationship graph attention network layer to capture higher-level semantic information. Finally, a task decoder built on the constraint relationships among query sub-statements generates an SQL query statement comprising a plurality of query sub-statements from the processed node feature representations and the global feature representation. The method improves the model's understanding of complex relationships and ensures the accuracy and completeness of the generated SQL query statement; in particular, it can correctly handle queries involving multiple subtasks, providing a more efficient and reliable solution for the Text-to-SQL task.

Description

Query statement generation method based on relational graph
Technical Field
The application relates to the technical field of data query, and in particular to a query statement generation method based on a relationship graph.
Background
In the field of Natural Language Processing (NLP), the Text-to-SQL task aims to automatically convert natural query statements posed by users in natural language into executable SQL query statements, thereby enabling effective querying of databases. Existing Text-to-SQL techniques rely primarily on graph-based attention models: they build a graph structure containing question nodes and database schema nodes (e.g., table names, column names) and use deep learning models for feature extraction and SQL statement generation.
However, the prior art focuses mainly on constructing complex relationship graphs and graph attention models and training them with large amounts of annotated data. Although this approach has advanced the state of the art, the generated SQL query statements are still often inaccurate when handling complex user queries, such as those involving multiple SQL subtasks.
Disclosure of Invention
In view of the above, the application aims to provide a query statement generation method based on a relationship graph that combines the global and local features of the relationship graph and accounts for the constraint relationships among the subtasks of a query statement, thereby improving the accuracy of the query statements the model generates.
The query statement generation method based on a relationship graph provided by the embodiments of the application comprises the following steps:
acquiring a natural query statement input by a user and information on the target data table queried by the user, wherein the target data table information comprises a plurality of query fields;
constructing a relationship graph from the natural query statement and the target data table information, wherein the nodes of the relationship graph comprise statement nodes and query field nodes, the statement nodes are obtained by word segmentation of the natural query statement, and the query field nodes are obtained from the query fields matched to the natural query statement;
inputting the relationship graph to an encoding layer to obtain the relationship representations among the nodes, the first feature representation of each node, and the global feature representation over all nodes of the relationship graph;
inputting the relationship representations among the nodes and the first feature representation of each node into a pre-trained relationship graph attention network layer to obtain a second feature representation of each node;
and inputting the second feature representation of each node and the global feature representation into a pre-trained task decoder to obtain the query statement output by the task decoder, wherein the task decoder is constructed based on the constraint relationships among the query sub-statements of the query statement.
Analyzing the prior art, the present inventors found that it often ignores the constraint relationships among the subtasks within an SQL query statement (e.g., SELECT, FROM, WHERE, GROUP BY, HAVING) and how these subtasks cooperate to form a valid SQL query. For example, the condition columns in a HAVING clause must be a subset of the columns selected by the SELECT clause; this constraint is often not fully considered or embodied in the prior art, so the generated SQL statement may fail to execute or return unexpected results.
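The subset constraint just described can be expressed as a simple validity check. The sketch below is illustrative only; the function name and the tuple representation of parsed HAVING conditions are assumptions, not part of the patent.

```python
def having_columns_valid(select_columns, having_conditions):
    """Check the constraint that every condition column referenced in the
    HAVING clause is among the columns produced by the SELECT clause."""
    selected = set(select_columns)
    return all(col in selected for col, _op, _val in having_conditions)

# Valid: SELECT dept, AVG(salary) ... GROUP BY dept HAVING AVG(salary) > 5000
valid = having_columns_valid(
    ["dept", "AVG(salary)"],
    [("AVG(salary)", ">", 5000)],
)

# Invalid: HAVING references a column the SELECT clause did not produce.
invalid = having_columns_valid(
    ["dept"],
    [("AVG(salary)", ">", 5000)],
)
```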
In this regard, the application focuses on the logical relationships and constraints among the different subtasks within an SQL query statement when solving the Text-to-SQL task. Specifically, in SQL queries there are strict dependencies between the result subtask (e.g., the SELECT clause) and the condition subtasks (e.g., the condition columns in the HAVING clause). For example, in data-aggregation scenarios (e.g., using CUBE), the condition columns in the HAVING clause must be a subset of the database columns explicitly selected by the SELECT clause. Based on this insight, the application not only considers the logical relationship between the natural query statement and the database table schema when building the relationship graph, but also employs a specially designed task decoder that understands and exploits the constraints among these subtasks. To this end, the application introduces a relational graph neural network for feature extraction and representation learning; it captures complex relationships between the nodes of the graph structure and generates node feature representations through multi-layer nonlinear transformations. Correspondingly, the task decoder is designed around the structure and rules of SQL and the logical order of the subtasks: through pre-training it learns the feature representation of each node output by the relational graph neural network, combines it with the global feature representation output by the encoding layer, and step by step generates a query statement that satisfies both the constraints among the subtasks and the logic of the query. As a result, when the technical scheme of the application is applied to a natural query statement input by a user, an accurate query statement conforming to SQL grammar can be generated.
In summary, the technical scheme of the application achieves efficient and accurate conversion from natural language queries to SQL query statements by comprehensively considering all aspects of an SQL query statement, including its logical relationships, constraints, grammar rules, and the constraints of the database table schema, providing powerful support for complex query tasks in the natural language processing field.
For a better understanding and implementation, the present application is described in detail below with reference to the drawings.
Drawings
FIG. 1 is a flowchart of a query statement generation method based on a relationship graph according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the steps for pre-training the task decoder and the relationship graph attention network layer according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the steps for constructing a relationship graph from a natural query statement and target data table information in an embodiment of the present application;
FIG. 4 is a schematic diagram of the steps by which the encoding layer obtains the node-to-node relationship representations, the node feature representations, and the global feature representation in an embodiment of the present application;
FIG. 5 is a schematic diagram of the steps by which the node encoding layer obtains the node feature representations and the global feature representation in an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the following detailed description of the embodiments of the present application will be given with reference to the accompanying drawings. Where the following description refers to the accompanying drawings, the same numbers in different drawings refer to the same or similar elements, unless otherwise indicated.
It should be understood that the embodiments described in the examples described below do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with aspects of the application as detailed in the accompanying claims. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application. As used in this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Furthermore, in the description of the present application, unless otherwise indicated, "a plurality" means two or more. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items; for example, "A and/or B" may mean that A exists alone, that both A and B exist, or that B exists alone. The character "/" generally indicates that the associated objects are in an "or" relationship.
It should be appreciated that although the terms first, second, third, etc. may be used herein to describe various information, the information should not be limited by these terms; they are merely used to distinguish between similar objects and do not necessarily describe a particular order or sequence or imply relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art according to the specific circumstances. The word "if" as used herein may be interpreted as "when", "upon", or "in response to determining", depending on the context.
In the field of Natural Language Processing (NLP), the Text-to-SQL task aims to automatically convert natural query statements posed by users in natural language into executable SQL query statements, thereby enabling effective querying of databases. Existing Text-to-SQL techniques rely primarily on graph-based attention models: they build a graph structure containing question nodes and database schema nodes (e.g., table names, column names) and use deep learning models for feature extraction and SQL statement generation.
However, the prior art focuses mainly on constructing complex relationship graphs and graph attention models and training them with large amounts of annotated data. Although this approach has advanced the state of the art, the generated SQL query statements are still often inaccurate when handling complex user queries, such as those involving multiple SQL subtasks.
In this regard, the present inventors analyzed the prior art and found that it often ignores the constraint relationships among the subtasks within an SQL query statement (e.g., SELECT, FROM, WHERE, GROUP BY, HAVING) and how these subtasks cooperate to form a valid SQL query. For example, the condition columns in a HAVING clause must be a subset of the columns selected by the SELECT clause; this constraint is often not fully considered or embodied in the prior art, so the generated SQL statement may fail to execute or return unexpected results.
In view of the above, the application aims to provide a query statement generation method based on a relationship graph that combines the global and local features of the relationship graph and accounts for the constraint relationships among the subtasks of a query statement, thereby improving the accuracy of the query statements the model generates.
Referring to FIG. 1, an embodiment of the present application provides a query statement generation method based on a relationship graph, comprising the following steps:
S101, acquiring the natural query statement input by a user and information on the target data table queried by the user, wherein the target data table information comprises a plurality of query fields;
S102, constructing a relationship graph from the natural query statement and the target data table information, wherein the nodes of the relationship graph comprise statement nodes and query field nodes, the statement nodes are obtained by word segmentation of the natural query statement, and the query field nodes are obtained from the query fields matched to the natural query statement;
S103, inputting the relationship graph to an encoding layer to obtain the relationship representations among the nodes, the first feature representation of each node, and the global feature representation over all nodes of the relationship graph;
S104, inputting the relationship representations among the nodes and the first feature representation of each node into a pre-trained relationship graph attention network layer to obtain a second feature representation of each node;
S105, inputting the second feature representation of each node and the global feature representation into a pre-trained task decoder to obtain the query statement output by the task decoder, wherein the task decoder is constructed based on the constraint relationships among the query sub-statements of the query statement.
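Step S102 above can be sketched in plain Python. The whitespace tokenization and exact-token field matching below are naive stand-ins for whatever word segmentation and matching the patent's implementation actually uses; node and edge labels are illustrative.

```python
def build_relationship_graph(natural_query, table_fields):
    """Build a relationship graph with statement nodes (one per query token)
    and query-field nodes (one per table field mentioned in the query)."""
    tokens = natural_query.lower().split()          # naive word segmentation
    nodes = [("stmt", t) for t in tokens]
    matched = [f for f in table_fields if f.lower() in tokens]
    nodes += [("field", f) for f in matched]

    edges = []
    # adjacency edges between consecutive statement tokens
    for i in range(len(tokens) - 1):
        edges.append((("stmt", tokens[i]), ("stmt", tokens[i + 1]), "adjacent"))
    # matching edges between a token and the field it mentions
    for f in matched:
        edges.append((("stmt", f.lower()), ("field", f), "matches"))
    return {"nodes": nodes, "edges": edges}

graph = build_relationship_graph(
    "show salary of employees by dept",
    ["salary", "dept", "hire_date"],
)
```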
When solving the Text-to-SQL task, the application focuses in particular on the logical relationships and constraints among the different subtasks within an SQL query statement. Specifically, in SQL queries there are strict dependencies between the result subtask (e.g., the SELECT clause) and the condition subtasks (e.g., the condition columns in the HAVING clause). For example, in data-aggregation scenarios (e.g., using CUBE), the condition columns in the HAVING clause must be a subset of the database columns explicitly selected by the SELECT clause. Based on this insight, the application not only considers the logical relationship between the natural query statement and the database table schema when building the relationship graph, but also employs a specially designed task decoder that understands and exploits the constraints among these subtasks. To this end, the application introduces a relational graph neural network for feature extraction and representation learning; it captures complex relationships between the nodes of the graph structure and generates node feature representations through multi-layer nonlinear transformations. Correspondingly, the task decoder is designed around the structure and rules of SQL and the logical order of the subtasks: through pre-training it learns the feature representation of each node output by the relational graph neural network, combines it with the global feature representation output by the encoding layer, and step by step generates a query statement that satisfies both the constraints among the subtasks and the logic of the query. As a result, when the technical scheme of the application is applied to a natural query statement input by a user, an accurate query statement conforming to SQL grammar can be generated.
In summary, the technical scheme of the application achieves efficient and accurate conversion from natural language queries to SQL query statements by comprehensively considering all aspects of an SQL query statement, including its logical relationships, constraints, grammar rules, and the constraints of the database table schema, providing powerful support for complex query tasks in the natural language processing field.
The query statement generation method based on a relationship graph provided by the embodiments of the application is executed by a computer. The structure and key components of the model are described first, followed by the method steps.
The encoding layer, the first key component in the model, is responsible for converting input data (e.g., node features of the relationship graph) into vector representations in a high-dimensional space for subsequent processing and analysis. In one embodiment, the encoding layer may include an embedding layer (Embedding Layer) and an initialization layer (e.g., a Dropout layer or a Normalization layer) that convert discrete data (e.g., node information, labels) into continuous, dense vectors of fixed length. These vectors serve as inputs for subsequent processing (e.g., the relationship graph attention network layer) and contain the core information of the input data.
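A minimal sketch of the embedding part of such an encoding layer, in plain Python rather than a deep-learning framework. The seeded pseudo-random vectors stand in for learned embedding weights, and the Dropout/Normalization initialization layer mentioned above is omitted; all names are illustrative.

```python
import random

class EmbeddingLayer:
    """Map discrete node tokens to fixed-length dense vectors.
    Vectors are seeded pseudo-random stand-ins for learned weights."""
    def __init__(self, dim=8, seed=0):
        self.dim = dim
        self.rng = random.Random(seed)
        self.table = {}                 # token -> vector, grown lazily

    def __call__(self, token):
        if token not in self.table:
            self.table[token] = [self.rng.uniform(-1, 1)
                                 for _ in range(self.dim)]
        return self.table[token]

embed = EmbeddingLayer(dim=8)
v1 = embed("salary")
v2 = embed("salary")   # same token always yields the same vector
```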
The relationship representation, in the context of a relationship graph, refers to a mathematical representation of the connections (or edges) between nodes in the graph. It is used to characterize associations, dependencies or interactions between different nodes. In a relationship graph, relationships may be explicit (e.g., direct connections by directed or undirected edges) or implicit (e.g., indirectly inferred from attribute similarity, distance, or other metrics between nodes). A relationship representation typically contains information such as the type, weight, and direction of the connection between two or more nodes, which is critical for understanding the structure of the data in the graph and for tasks such as node classification and link prediction. In one embodiment, the relationship representations may take a variety of forms, such as elements of an adjacency matrix, edge feature vectors, or representations dynamically generated through the message-passing mechanism of a graph neural network. These representations not only reflect the direct connections between nodes, but may also capture higher-order, complex interaction patterns, providing more comprehensive and richer context information for the task decoder.
The first feature representation refers to feature vectors obtained by preliminary encoding or converting input data (such as nodes in a relationship graph) at an initial stage of data processing or model training. These feature vectors are used as inputs for subsequent processing flows, and include raw information or preliminary processed information of the input data. In one embodiment, the first feature representation corresponds to an initial feature representation of each node in the relationship graph. These characteristics may include the ID, type, attribute (e.g., text description, numerical attribute, etc.) of the node, and the location or structure information (e.g., degree, centrality, etc.) of the node in the graph. These initial features are converted into a fixed length, dense vector form, i.e., a first feature representation, after processing at the coding layer (e.g., the embedding layer). These vectors are further processed and transformed in the subsequent relational graph attention network layer to generate higher-level, more expressive feature representations, ultimately for output generation by the task decoder. It should be noted that the first feature representation may vary depending on the particular application scenario and model design. In practical applications, it is necessary to select an appropriate feature representation method and coding technique according to the characteristics of data and the requirements of tasks.
Global feature representation refers to feature vectors extracted throughout a relationship graph or query context that can reflect global properties or statistical information. In this embodiment, the global feature representation may be obtained by aggregating the features of all nodes or edges in the graph (e.g., summing, averaging, maximum, etc.), or capturing complex interactions and dependencies between nodes in the graph by more complex Graph Neural Network (GNN) techniques. The global feature representation plays an important auxiliary role in the subsequent task decoding process, and helps the model to better understand the overall structure and intent of the query.
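The aggregation mentioned above (summing, averaging, or taking the maximum over node features) can be written directly; mean pooling shown here is just one of the options the text lists, sketched under the assumption that node features are plain lists of floats.

```python
def global_feature(node_features):
    """Aggregate per-node feature vectors into one global feature
    representation by element-wise mean pooling."""
    dim = len(node_features[0])
    n = len(node_features)
    return [sum(vec[d] for vec in node_features) / n for d in range(dim)]

# Three 2-dimensional node features pooled into one global vector.
g = global_feature([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
```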
The second feature representation is a feature vector obtained after further transformation or enhancement; compared with the first feature representation, it contains more context information, richer semantic content, and higher-level abstract concepts. Generally, in the context of a relationship graph or Graph Neural Network (GNN), the second feature representation may be a node feature processed through a graph convolutional network (GCN), a graph attention network (GAT), a graph message-passing mechanism, or another graph neural network layer. In the embodiment of the application, the second feature representation is generated in the relationship graph attention network layer based on the relationship representations among the nodes and the first feature representation of each node: an attention mechanism performs weighted aggregation over the relationships among the nodes and each node's feature information, yielding the second feature representation of each node. In general, the second feature representation includes not only the original information of the node's first feature representation, but also information from its neighboring nodes and structural information of the entire relationship graph.
The relationship graph attention network layer is a neural network layer specifically designed to process relationship graph data. Based on the attention mechanism (Attention Mechanism), the weights of different nodes or edges can be dynamically adjusted as nodes in the relationship graph are processed to focus on more important information for the current task. In one embodiment, the relational graph attention network layer may be implemented by multiple layers of stacked attention modules, each of which provides fine-grained control of messaging and aggregation between nodes. This layer aims to enhance the model's ability to capture complex relationships between nodes in the relationship graph and provide a richer feature representation for subsequent task decoders.
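One attention step of such a layer can be sketched as follows. Scoring neighbors by a raw dot product is a simplification of the learned attention coefficients in a real graph attention module; the function and its interface are illustrative assumptions.

```python
import math

def attention_aggregate(node_vec, neighbor_vecs):
    """One step of graph attention: score each neighbor against the node
    (dot product), softmax-normalize the scores, and return the weighted
    sum of neighbor features as the node's updated representation."""
    scores = [sum(a * b for a, b in zip(node_vec, nb)) for nb in neighbor_vecs]
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]   # softmax attention weights
    dim = len(node_vec)
    return [sum(w * nb[d] for w, nb in zip(weights, neighbor_vecs))
            for d in range(dim)]

# The neighbor aligned with the node receives the larger attention weight.
updated = attention_aggregate([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```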
The task decoder, the last key component in the model, is responsible for converting the feature representations processed by the previous layers into the final output, in this embodiment the SQL query statement. Task decoders typically employ a sequence-to-sequence architecture. In one embodiment, the task decoder includes multiple sublayers or modules, such as a feature fusion layer, a task fusion layer, and multiple linear layers (also known as fully connected layers). The feature fusion layer merges and integrates feature representations from different sources, and the task fusion layer combines the results of the subtasks (such as grouping, field selection, and condition generation) into a complete query statement. The linear layers implement specific computations and conversions, such as predicting condition connectors and computing the number of fields. The task decoder is designed to build and refine the parts of the query statement step by step, ultimately generating a complete query statement that complies with the grammatical and logical rules.
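How such a decoder might assemble subtask outputs into a complete statement can be sketched as below, enforcing both the required clause order and the HAVING-subset-of-SELECT constraint. This is a hypothetical assembly step, not the patent's decoder; all names and the condition-tuple format are assumptions.

```python
def assemble_query(select_cols, table, where=None, group_by=None, having=None):
    """Assemble subtask outputs into an SQL statement in the required
    clause order, rejecting HAVING columns outside the SELECT clause."""
    if having:
        selected = set(select_cols)
        for cond_col, _op, _val in having:
            if cond_col not in selected:
                raise ValueError(f"HAVING column {cond_col!r} not in SELECT")
    parts = [f"SELECT {', '.join(select_cols)}", f"FROM {table}"]
    if where:
        parts.append(f"WHERE {where}")
    if group_by:
        parts.append(f"GROUP BY {group_by}")
    if having:
        parts.append("HAVING " + " AND ".join(
            f"{c} {op} {v}" for c, op, v in having))
    return " ".join(parts)

sql = assemble_query(["dept", "AVG(salary)"], "employees",
                     group_by="dept", having=[("AVG(salary)", ">", 5000)])
```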
Referring to FIG. 2, in an embodiment of the present application, the pre-trained relationship graph attention network layer and task decoder are trained as follows:
Step S201, acquiring a training data set comprising a plurality of training samples, each training sample comprising the relationship representations among the nodes of a relationship graph, the first feature representation of each node, the global feature representation of the relationship graph, and the corresponding query tag information, wherein the query tag information comprises the standard query statement corresponding to the relationship graph;
Step S202, inputting the relationship representations among the nodes of each training sample's relationship graph and the feature representation of each node into the initialized relationship graph attention network layer to obtain the second feature representation of each node of each training sample;
Step S203, inputting the second feature representation of each node of each training sample and the corresponding global feature representation into the initialized task decoder to obtain the output query statement corresponding to each training sample;
Step S204, calculating the loss value between the query statement of each training sample and the corresponding query tag information, and adjusting the parameters of the relationship graph attention network layer and the task decoder according to the loss value until the loss value is smaller than a preset loss threshold or the total number of weight adjustments reaches a preset count, thereby obtaining the trained relationship graph attention network layer and task decoder.
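Steps S201-S204 amount to a standard training loop with two stopping criteria (loss below a threshold, or a budget of weight updates). The toy one-parameter least-squares model below only illustrates that control flow, not the actual networks; every name here is illustrative.

```python
def train(samples, loss_threshold=0.01, max_updates=100, lr=0.1):
    """Skeleton of the S201-S204 loop: forward pass, loss against the
    label, gradient update, stop on threshold or update budget."""
    weight = 0.0                      # stand-in for all model parameters
    updates = 0
    loss = float("inf")
    while loss >= loss_threshold and updates < max_updates:
        loss = 0.0
        grad = 0.0
        for x, label in samples:      # S202/S203: model output per sample
            pred = weight * x
            err = pred - label        # S204: loss between output and label
            loss += err * err
            grad += 2 * err * x
        loss /= len(samples)
        weight -= lr * grad / len(samples)   # back-propagation step
        updates += 1
    return weight, loss, updates

# Learn y = 2x from two samples; stops once the loss threshold is reached.
w, final_loss, n = train([(1.0, 2.0), (2.0, 4.0)])
```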
The effective implementation of the embodiments of the present application relies on the previously trained relationship graph attention network layer and task decoder (for further structural details of both, see the subsequent embodiments). Specifically, step S201 first obtains a data set of training samples, where each sample includes not only the relationship representations among the nodes of a relationship graph, but also the initial (first) feature representation of each node and the global feature representation of the whole graph. Importantly, each training sample also carries the corresponding query tag information, namely the standard query statement for that relationship graph, which is the key basis for evaluating the accuracy of the model's output. In this embodiment, each training sample is derived from a different natural query statement: the method of the embodiments of the application is used to construct a relationship graph from the natural query statement and the corresponding target data table information; the relationship graph is encoded by the encoding layer to obtain the relationship representations among its nodes, the first feature representation of each node, and the global feature representation of the graph; and the standard query statement corresponding to each natural query statement serves as the query tag information of the corresponding training sample. Specific implementation steps can be found in the detailed description of the following embodiments.
Step S202 inputs the relationship diagram data (including the relationship representation between nodes and the feature representation of each node) in the training sample into the initialized relationship diagram attention network layer. The network layer utilizes a graph attention mechanism to extract a second feature representation rich in context information by deep processing of the original node features taking into account interactions and importance between the nodes. This process aims at capturing potentially complex pattern and structure information in the graph data, providing powerful support for subsequent task decoding.
Step S203 inputs the second feature representation of each node obtained in step S202 and the global feature representation of the relationship graph in the training sample into the initialized task decoder. The task decoder uses these integrated features to gradually construct a query statement corresponding to the input relationship graph through a series of decoding operations (e.g., sequence generation, attention distribution, etc.).
Step S204 compares the query statement output by the task decoder with the query tag information in the training sample and calculates a loss value (such as a cross-entropy loss, BLEU score, etc.) between the two to quantify the difference between the model output and the real label. According to the calculated loss value, the parameters in the relationship graph attention network layer and the task decoder are adjusted through a back-propagation algorithm, so that the model performance is continuously optimized. The process is repeated until the loss value falls below a preset threshold or the total number of weight adjustments reaches a preset number. At this point the model is considered fully trained and can be used for subsequent query generation tasks, that is, applied in the query statement generation method based on the relationship graph, where the pre-trained model processes natural query statements newly input by the user.
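The iterative loop of steps S202-S204 can be sketched as follows. This is a toy stand-in, not the patent's actual model: the "decoder" is a single scalar weight with a fixed-size update in place of true back-propagation; only the stopping condition mirrors the text (stop when the loss falls below a preset threshold or the update budget is exhausted).

```python
import math

def cross_entropy(probs, target_index):
    # Loss between the decoder's output distribution and the label token.
    return -math.log(probs[target_index])

def train(samples, loss_threshold=0.1, max_updates=100, lr=0.5):
    weight = 0.0                       # stand-in for all trainable parameters
    avg_loss = float("inf")
    for _ in range(max_updates):
        total = 0.0
        for features, label in samples:
            # toy two-class decoder: softmax over [0, weight * features]
            logits = [0.0, weight * features]
            z = sum(math.exp(x) for x in logits)
            probs = [math.exp(x) / z for x in logits]
            total += cross_entropy(probs, label)
        avg_loss = total / len(samples)
        if avg_loss < loss_threshold:  # preset loss threshold reached
            break
        weight += lr                   # stand-in for a back-propagation update
    return weight, avg_loss

weight, loss = train([(1.0, 1), (2.0, 1)])
print(weight, round(loss, 4))  # 2.0 0.0725
```

In a real implementation the update step would be computed by automatic differentiation over the attention network layer and decoder parameters; the termination logic is the part this sketch illustrates.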
Through the above training process, the combined model of the relationship graph attention network layer and the task decoder provided by the application can efficiently extract key information from complex relationship graph data and accurately generate the expected query statements. The model not only fully exploits the advantages of the graph attention mechanism in capturing complex relationships among nodes, but also realizes the natural conversion from graph to text through a task-specific decoding strategy. Overall, the technical scheme remarkably improves the efficiency and accuracy of the query generation task based on the relationship graph, and provides powerful technical support for fields such as information retrieval and database querying.
For step S101, natural query sentences input by a user and target data table information queried by the user are acquired, wherein the target data table information comprises a plurality of query fields.
This step first obtains the natural language query statement entered by the user, which is a description of the problem that the user wishes to retrieve information from the database. At the same time, it is also necessary to obtain the target data table information of the user query, which typically includes the name of the data table and several query fields contained in the table (i.e., column names in the database). This step is the starting point of the overall query generation process, providing the necessary input data for the subsequent steps.
For step S102, a relation diagram is constructed according to the natural query statement and the target data table information, wherein nodes of the relation diagram comprise statement nodes and query field nodes, the statement nodes are obtained according to word segmentation of the natural query statement, and the query field nodes are obtained according to query fields matched with the natural query statement.
This step constructs a relationship graph according to the natural query statement and the target data table information. The relationship graph is used to represent the association between the query statement and the database table schema. The nodes of the relationship graph comprise statement nodes and query field nodes. Statement nodes are obtained through word segmentation of the natural query statement, with each word or phrase serving as a node that reflects the semantic content of the query statement. The query field nodes are obtained by matching the query fields (i.e., database column names) mentioned in the natural query statement against the data table information; these nodes represent the data columns that the user wants to query. Through the connections between nodes (e.g., edges based on character matching), the relationship graph forms a graph structure that reflects the relationship between the query intent and the database structure.
In the context of the present embodiment, the user's query is directed to a large data table, i.e., the number of query fields included is greater than a predetermined field threshold (e.g., greater than 100). Because the user queries a definite large data table, the target query data table of the SQL query statement (namely the data table of the FROM clause) can be directly determined; when the relationship graph is constructed, the data table name does not need to be added as a relationship graph node, and only the relevant query fields need to be added as relationship graph nodes. Further, in the context of a large data table query, the number of query field nodes in the constructed relationship graph may be large, resulting in heavy computation for encoding and decoding the relationship graph, which consumes significant computing power and affects query efficiency. In this regard, the application provides a solution in a later embodiment; see in particular the description that follows.
Referring to fig. 3, in one embodiment, the target data table information includes a plurality of data sub-tables, each data sub-table including a plurality of query fields;
step S102, the step of constructing a relationship diagram according to the natural query statement and the target data table information, includes:
S1021, carrying out syntactic analysis on the natural query statement, and determining a first dependency relationship among each word segmentation in the natural query statement according to a syntactic analysis result;
step S1022, identifying entity word segmentation of the natural query sentence, if the entity word segmentation is matched with at least one query field in the target data table information, determining a second dependency relationship between the entity word segmentation and the matched query field;
Step S1023, constructing nodes of a relationship graph according to each word segment in the natural query statement and each matched query field, and constructing corresponding edges between the nodes of the relationship graph according to a first dependency relationship among the word segments in the natural query statement, a second dependency relationship between the entity word segment and the matched query field, and a third dependency relationship between any two query fields belonging to the same data sub-table.
In step S1021, the natural query statement is parsed, which generally involves Natural Language Processing (NLP) techniques such as part-of-speech tagging, Named Entity Recognition (NER), and syntactic dependency analysis. The result of the syntactic parsing reveals the grammatical structure and semantic relationships between the individual terms (word segments) in the query statement, the so-called "first dependency". For example, in the query statement "find the highest sales product in 2023", syntactic parsing may identify that "2023" is a temporal modifier, "sales" modifies "product", "highest" is an adjective modifying "sales", and "product" is the object of "find"; the syntactic relationships between these elements constitute the first dependencies. In one embodiment, the spaCy parser may be employed for syntax tree parsing.
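As an illustration of step S1021, the sketch below hard-codes a spaCy-style dependency parse of the example query; a real system would obtain the arcs from an actual parser, and the dependency labels here are assumptions.

```python
tokens = ["find", "the", "highest", "sales", "product", "in", "2023"]
# (dependent_index, head_index, dependency_type) arcs from a hypothetical parse
parse = [
    (4, 0, "obj"),       # "product" is the object of "find"
    (3, 4, "compound"),  # "sales" modifies "product"
    (2, 3, "amod"),      # "highest" is an adjective modifying "sales"
    (6, 0, "obl:tmod"),  # "2023" is a temporal modifier of "find"
]

def first_dependencies(tokens, parse):
    """Turn each parse arc into a typed (dependent, head, type) edge
    between statement nodes of the relationship graph."""
    return [(tokens[d], tokens[h], t) for d, h, t in parse]

edges = first_dependencies(tokens, parse)
print(edges[0])  # ('product', 'find', 'obj')
```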
Step S1022 focuses on identifying entity parts in the natural query statement and determining whether these entities match the query fields in the target data table information. Once a match is found, a "second dependency" between the entity word segment and the query field is established. In addition, if two matched query fields are found to belong to the same data sub-table, such as the sales amount and product ID belonging to a sales record sub-table, a third dependency relationship between the two query fields is further established, indicating their close relationship in data logic.
In step S1023, a relationship graph is constructed based on all the above dependencies. The nodes of the relationship graph may include each word segment in the natural query statement, each matched query field, and even the data sub-table itself (as needed). The edges are determined according to the first, second, and third dependency relationships and represent the connection modes and meanings among the nodes. For example, one edge may point from "2023" to "sales", indicating the constraining effect of the temporal modifier on sales; another edge may point from "sales" to "product" and be connected to "product ID" by a dashed edge (determined by specific design rules), indicating that these fields act together in the query and that "product ID" is logically associated with "product".
The embodiment can effectively convert the natural language query statement into the structured relation diagram, and the conversion process not only keeps the original intention of the query statement, but also clearly determines the data table, the fields and the relation among the data table, the fields involved in the query. The structured representation method greatly simplifies the complexity of subsequent query processing, can more accurately understand the user query, optimizes the query plan and improves the query execution efficiency. Meanwhile, the relation diagram clearly shows the association of each element in the query, and powerful support is provided for subsequent works such as query optimization, error diagnosis and the like.
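The graph assembly of step S1023 can be sketched as a small routine that merges the three dependency types into one typed edge list; the node names and edge-type labels below are illustrative assumptions.

```python
def build_relation_graph(first_deps, second_deps, third_deps):
    """Nodes are word segments and matched query fields; each edge carries
    the dependency type it came from (steps S1021-S1023)."""
    graph = {"nodes": set(), "edges": []}
    for a, b, dep_type in first_deps + second_deps + third_deps:
        graph["nodes"].update([a, b])
        graph["edges"].append((a, b, dep_type))
    return graph

g = build_relation_graph(
    first_deps=[("highest", "sales", "amod")],            # between word segments
    second_deps=[("sales", "sales_amount", "partial")],   # word segment -> field
    third_deps=[("sales_amount", "product_id", "same_subtable")],  # field <-> field
)
print(len(g["nodes"]), len(g["edges"]))  # 4 3
```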
In one embodiment, the syntactic analysis result comprises a word level syntactic relation of the natural query statement, a first dependency relation among the segmented words in the natural query statement comprises first dependency type information among the segmented words, and the first dependency type information among the segmented words is determined according to the word level syntactic relation.
This embodiment further specifies that the inter-word dependency relationships in the natural query statement also include type information. Specifically, when a natural query statement input by a user is parsed by a parser, the individual words (word segments) in the statement are recognized and the syntactic relationships between them, that is, word-level syntactic relationships, are determined. These relationships include, but are not limited to, subject-predicate relationships, verb-object relationships, modifier-head relationships, etc., which together form the grammatical framework of the statement. After the word-level syntactic relationships are obtained, these relationships are further analyzed to extract the first dependencies between individual word segments within the natural query statement. Unlike conventional methods, this embodiment introduces type information into the dependency relationships: specifically, according to the result of the syntactic parsing, a piece of first dependency type information is assigned to the dependency relationship between each pair of word segments. The determination of the dependency type information is based on the word-level syntactic relations identified during syntactic parsing. For example, if there is a subject-predicate relationship between two word segments, the dependency type information between them may be labeled "subject-predicate"; if there is a verb-object relationship, it may be labeled "verb-object". These types of information not only reflect the syntactic links between the word segments, but also provide deep semantic cues as to how they interact in the statement. The introduction of type information not only enriches the representation dimension of the dependency relationships, but also enables the model to capture semantic features and structural information in the natural query statement more finely.
In summary, the embodiment determines the first dependency type information between the word segments according to the word level syntactic relation in the syntactic analysis result, and the addition of these types of information makes the relationship graph contain not only the connection information between the nodes, but also rich semantic and grammar knowledge. This enhanced relational graph representation capability enables subsequent relational graph attention network layers and task decoders to better understand and utilize graph data to generate more accurate, relevant, and semantically rich query statements. Overall, the technical scheme improves the performance and user experience of the query generation system based on the relationship diagram.
In one embodiment, step S1022 includes, before the step of determining the second dependency relationship between the entity word and the matched query field if the entity word matches at least one query field in the target data table information, the steps of:
Step S10221, performing character matching on the entity word segment and each query field in the target data table information, and if the character mismatch degree between the entity word segment and any query field is less than a preset threshold, determining that the entity word segment matches that query field.
This embodiment specifically defines the pre-determination step S10221 that is executed before step S1022, i.e., before determining the second dependency relationship between the entity word segment and the query field in the target data table information. The core of this step is to determine the query fields that match the entity word segment by means of character matching. Specifically, all query fields defined in the target data table information are traversed first and compared at the character level, one by one, with the currently processed entity word segment. A suitable character matching algorithm can be selected for the comparison according to the actual situation; preferably, an algorithm that can tolerate a certain degree of character difference is selected to calculate the character mismatch degree. By calculating the character mismatch degree (e.g., edit distance, similarity score, etc.) between the entity word segment and each query field and comparing it with a preset threshold, it can be determined with which query field(s) the entity word segment has higher character-level similarity. If the character mismatch degree between a query field and the entity word segment is less than the preset threshold, the query field and the entity word segment are considered to have a sufficiently strong association in semantics or representation, so that the matching relationship between them can be determined, and on this basis a second dependency relationship between the entity word segment and the query field is established.
In summary, the embodiment remarkably improves the accuracy and robustness of matching between the entity word segmentation and the query field of the target data table by introducing the character matching mechanism in step S10221. The mechanism not only considers the accurate matching condition between the query field and the entity word segmentation, but also allows a certain degree of character difference through a preset threshold value, so that uncertainty and diversity in user input can be better processed. The flexible matching mode is beneficial to improving the user friendliness and practicability of the whole query generation system, so that the model can more accurately understand the query intention of the user and generate corresponding and effective data query sentences. Meanwhile, because the mismatching condition caused by character difference is reduced, the overall performance and the query efficiency of the model are further improved.
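A minimal sketch of step S10221, taking edit distance as the character mismatch degree; the threshold value, field names, and the `match_type` helper implementing step S1024's complete/partial distinction are illustrative assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance, used here as the character mismatch degree."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def matched_fields(entity, fields, threshold=3):
    """A field matches when its mismatch degree is below the preset threshold."""
    return [f for f in fields if edit_distance(entity, f) < threshold]

def match_type(mismatch_degree):
    """Step S1024: mismatch 0 -> complete match, otherwise partial match."""
    return "complete" if mismatch_degree == 0 else "partial"

print(matched_fields("sales", ["sale", "sales", "product_id"]))  # ['sale', 'sales']
print(match_type(edit_distance("sales", "sale")))                # partial
```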
In one embodiment, the second dependency relationship between the entity word segment and the matched query field comprises second dependency type information between the entity word segment and the matched query field, wherein the second dependency type information comprises a complete matching type and a partial matching type;
the method further comprises the steps of:
Step S1024, if the character mismatch degree is 0, determining that the second dependency type information of the entity word segment and the query field is the complete matching type, and if the character mismatch degree is greater than 0 and less than the preset threshold, determining that the second dependency type information of the entity word segment and the query field is the partial matching type.
In this embodiment, in order to describe the relationship between the entity word and the matched query field more finely, the concept of the second dependency type information is introduced and subdivided into a full match type and a partial match type. The two types of information not only reflect the matching degree between the entity word segmentation and the query field, but also provide richer semantic clues for subsequent query processing.
Specifically, in step S1024, the second dependency type information is determined based on the character mismatch degree calculated previously. If the character mismatch degree is 0, i.e., the entity word segment and the query field are completely consistent at the character level, the second dependency type information between them can be determined to be the complete matching type. This matching type indicates that the entity word segment is highly consistent with the query field semantically, which may prompt the model to use the entity word segment directly when constructing the query statement, without additional conversion or interpretation. On the other hand, if the character mismatch degree is greater than 0 but less than the preset threshold, a certain character difference exists between the entity word segment and the query field, but the degree of difference is within an acceptable range. In this case, the second dependency type information between them is determined to be the partial matching type. The partial matching type indicates that the entity word segment may be semantically similar to but not exactly the same as the query field, which may prompt the model to ensure the accuracy of the query through some additional processing (e.g., synonym substitution, fuzzy querying, etc.).
In summary, the second dependency type information concept is introduced and subdivided into the complete matching type and the partial matching type, so that the logic information in the relationship graph is further enhanced, the representation capability of the relationship graph is also enhanced, and the subsequent relationship graph attention network layer and task decoder can better understand and utilize the graph data, thereby generating more accurate, relevant and semantically rich query sentences. Overall, the technical scheme improves the performance and user experience of the query generation system based on the relationship diagram.
In one embodiment, the target data table information further includes field large class information corresponding to each query field;
The third dependency relationship comprises third dependency type information, wherein the third dependency type information comprises the same field major class type and different field major class types;
the method further comprises the steps of:
Step S1025, if any two query fields belonging to the same data sub-table correspond to the same field large class information, determining that the third dependency type information between the two query fields is the same field large class type; otherwise, determining that the third dependency type information between the two query fields is the different field large class type.
In this embodiment, in order to further improve the accuracy of the neural network when generating the query statement, the target data table information is divided more finely, and the concept of type information is introduced into the third dependency relationship. These improvements aim to provide the neural network with richer contextual information by capturing the semantic and structural relationships between the fields within the data table more comprehensively, thereby guiding it to generate more accurate and logical query statements. Specifically, in the target data table information, field large class information is added in addition to the basic information of each query field. The field large class information is the result of a higher-level classification of the query fields, reflecting the common attribute or category represented by the fields in the data table. For example, a data table may contain a plurality of fields associated with "user information", such as a user name, user ID, user mailbox, etc., and these fields can all be categorized into the large class "user information". Next, the third dependency relationship and its type information are defined. The third dependency relationship mainly concerns the nature of the association between any two query fields in the same data sub-table, and the third dependency type information is used to describe the specific type of that association. Based on the field large class information, the third dependency type information can be subdivided into the same field large class type and the different field large class type. If two query fields belong to the same field large class, the third dependency type information between them is the same field large class type; conversely, if they belong to different field large classes, it is the different field large class type.
In step S1025, the above logic is concretely implemented. For any two query fields in the same data sub-table, it is first checked whether their corresponding field large class information is the same. If it is the same, the third dependency type information between the two fields is determined to be the same field large class type; if it is different, the third dependency type information is determined to be the different field large class type.
By introducing the concepts of field large class information and the third dependency relationship with its type information, and implementing step S1025 in the data processing flow, this embodiment makes the relationship graph contain not only the connection information among the query field nodes but also rich type information between the query fields, enhancing the representation capability of the relationship graph. The introduction of the field large class information enables the neural network to better understand the inherent relations and hierarchical structure among the fields in the data table, so that the user's query intent and the structural characteristics of the data table can be grasped more accurately when the query statement is generated. Meanwhile, the third dependency relationship and its type information provide richer contextual information for the neural network, so that it can analyze the nature of the associations among query fields more finely and generate query statements that better conform to logic and semantics. Overall, the technical scheme improves the performance and user experience of the query generation system based on the relationship graph.
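Step S1025 reduces to a comparison of field large class labels between two fields of the same data sub-table; the table of classes and the field names below are made-up illustrations.

```python
# Hypothetical field large class information from the target data table
field_large_class = {
    "user_name": "user_info",
    "user_email": "user_info",
    "sales_amount": "metrics",
}

def third_dependency_type(field_a, field_b):
    """For two query fields of the same data sub-table, label the edge by
    whether they share a field large class (step S1025)."""
    if field_large_class[field_a] == field_large_class[field_b]:
        return "same_field_large_class"
    return "different_field_large_class"

print(third_dependency_type("user_name", "user_email"))    # same_field_large_class
print(third_dependency_type("user_name", "sales_amount"))  # different_field_large_class
```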
For step S103, the relationship graph is input to the coding layer, and a relationship representation among the nodes, a first feature representation of each node, and a global feature representation of each node of the relationship graph are obtained.
After the relation diagram is built in the steps, the relation diagram is input into a coding layer for processing. The function of the coding layer is to extract the characteristics of the nodes and edges in the relation graph and learn the representation. Through the coding layer, a relational representation (which may specifically be characteristics such as strength and type of inter-node connection) between the nodes, a first characteristic representation of the respective nodes (which may specifically include semantic and contextual characteristics of the nodes themselves), and a global characteristic representation of the respective nodes of the relational graph (i.e., characteristics of the nodes that comprise the entire graph structure) can be generated. These features represent important input information for the attention network and decoder in subsequent steps.
Referring to fig. 4, in one embodiment, the coding layers in step S103 include a relational coding layer and a node coding layer;
Step S103, the step of inputting the relationship graph to the coding layer to obtain a relationship representation among the nodes, a first feature representation of each node, and a global feature representation of each node of the relationship graph, includes:
Step S1031, inputting the relation graph to the relation coding layer to obtain the relation expression among the output nodes;
step S1032, each node of the relation graph is input to the node coding layer, and feature representation of each node and global feature representation of each node of the relation graph are obtained.
The structure of the coding layer is further refined in the embodiment, and the coding layer is divided into a relation coding layer and a node coding layer, so that relation information between nodes in the relation graph and characteristic information of the nodes are processed respectively. This hierarchical approach helps the neural network more efficiently capture complex structures of the relationship graph and generate high quality node representations and global feature representations.
In step S1031, the relationship graph is first input to the relationship encoding layer. The main task of the relationship encoding layer is to analyze and process the connection relationships and dependency information between nodes in the relationship graph, thereby generating relationship representations between the nodes. The relationship representation not only contains direct connection information between nodes, but can also capture indirect associations and more complex dependency relationships between nodes through the iterative process of a graph neural network (such as a graph convolutional network, GCN). Through the processing of the relationship encoding layer, a more comprehensive and deep feature representation of the node relationships (namely, the relationship representation) can be obtained.
Each node of the relationship diagram is input to the node encoding layer at step S1032. The main objective of the node coding layer is to extract the feature information of each node and convert it into a feature representation suitable for subsequent processing. Such a feature representation may contain various types of information, such as attribute information, class labels, text descriptions, etc., of the node, depending on the processing capabilities of the node coding layer employed. Meanwhile, the node coding layer can aggregate and abstract node characteristics of the whole relation graph to generate global characteristic representation of the relation graph. Global feature representation is a high-level overview and description of the entire relationship graph that helps the neural network understand the semantics and structure of the entire relationship graph in subsequent steps.
In summary, the embodiment obviously improves the efficiency and accuracy of the neural network in processing the data of the relation diagram by introducing a layering processing mechanism of the relation coding layer and the node coding layer. The relation coding layer effectively captures complex relation information between nodes and provides rich context support for subsequent query generation or reasoning tasks. The node coding layer is focused on the extraction and representation learning of the characteristics of the node, so that the integrity and accuracy of the node information are ensured. The layering processing mode not only enables the neural network to deeply understand the internal structure of the relation graph, but also improves the expandability and the robustness of the neural network in processing large-scale relation graph data. Finally, through comprehensive relation expression, node characteristic expression, global characteristic expression and other information, the neural network can generate more accurate and logical query sentences or reasoning results, so that a better data service experience is provided for users.
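The neighbour-aggregation idea behind the relationship encoding layer (step S1031) can be illustrated with one unweighted, GCN-style round of message passing; a real layer would use learned weight matrices and multiple iterations.

```python
def aggregate(features, edges):
    """One round of message passing: each node's new feature is the mean of
    its own feature vector and those of its neighbours."""
    neighbours = {n: [n] for n in features}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    dim = len(next(iter(features.values())))
    return {n: [sum(features[m][k] for m in ns) / len(ns) for k in range(dim)]
            for n, ns in neighbours.items()}

feats = {"find": [1.0, 0.0], "sales": [0.0, 1.0], "product": [1.0, 1.0]}
out = aggregate(feats, [("find", "sales"), ("sales", "product")])
print(out["find"])  # [0.5, 0.5]
```

After such a round, each node's representation already mixes in information from its direct neighbours; stacking rounds propagates the indirect associations the text describes.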
Referring to FIG. 5, in one embodiment, the node encoding layer includes an embedding layer and an attention network layer, the embedding layer being connected to the attention network layer;
step S1032, where each node of the relationship graph is input to a node coding layer, and the feature representation of each node and the global feature representation of each node of the relationship graph are obtained, includes:
Step S10321, splicing statement nodes and query field nodes of the relation graph according to a preset format to obtain a relation graph node spliced text, inputting the relation graph node spliced text to the embedded layer to obtain a first embedded vector sequence of the relation graph node spliced text, wherein the first embedded vector sequence comprises embedded vectors corresponding to each statement node and each query field node in the relation graph node spliced text;
Step S10322, inputting the first embedded vector sequence to the attention network layer to obtain an output second embedded vector sequence, wherein the second embedded vector sequence comprises embedded vectors corresponding to each statement node and each query field node in the relation graph node spliced text and global feature vectors of the relation graph node spliced text;
Step S10323, obtaining feature representation corresponding to each sentence node according to the embedded vector corresponding to each sentence node in the relation graph node spliced text, obtaining feature representation corresponding to each query field node according to the embedded vector corresponding to each query field node in the relation graph node spliced text, and obtaining global feature representation of each node of the relation graph according to the global feature vector of the relation graph node spliced text.
In this embodiment, in order to more accurately capture complex associations between nodes in a relationship graph and characteristics of the nodes themselves, a node encoding layer including an embedding layer (Embedding Layer) and an attention network layer (Attention Network Layer) is designed. The design aims at automatically extracting and fusing semantic information among nodes and inside the nodes by means of deep learning so as to generate a characteristic representation with more expressive force.
First, step S10321 splices the statement nodes and the query field nodes in the relationship graph according to a preset format. Such a splice may be a simple text concatenation or a complex format with separators to facilitate subsequent processing. Then, the spliced relationship graph node text is input into the embedding layer. The embedding layer converts the text into a vector representation in a high-dimensional space, i.e., the first embedded vector sequence, using a pre-trained word embedding model (e.g., Word2Vec, BERT, etc.). These vectors not only contain the semantic information of the words, but are also optimized by context so that similar words or nodes are closer together in the vector space.
Next, step S10322 feeds the first embedded vector sequence into the attention network layer. The core idea of the attention mechanism is to allow the model to focus on important parts while processing information, ignoring extraneous or secondary details. In this example, the attention network layer generates a second sequence of embedded vectors by calculating the importance weight of each embedded vector to the whole and adjusting the values of the vectors accordingly. In the process, not only the information of the original embedded vector is reserved, but also the characteristic representation of the key nodes is enhanced in a weighted fusion mode. Meanwhile, the attention network layer also outputs a global feature vector which synthesizes the whole information of the text spliced by the nodes of the relation graph, and provides a basis for the subsequent global feature representation.
Finally, step S10323 obtains the feature representation of each sentence node and each query field node from the embedding vectors in the second embedding vector sequence. These feature representations carry not only the semantic information of the node itself but also, via the attention mechanism, information from other related nodes, achieving effective transfer and integration of information between nodes. In addition, the global feature representation of all nodes of the relationship graph is obtained from the global feature vector.
In summary, by introducing a node encoding layer composed of an embedding layer and an attention network layer, this embodiment achieves deep understanding and feature extraction of each node in the relationship graph and of the interrelations between nodes. On the one hand, the embedding layer converts text nodes into high-dimensional vectors using a pre-trained word embedding model, retaining rich semantic information; on the other hand, the attention network layer dynamically adjusts the weights of the node vectors, strengthening the feature representations of key nodes and generating a global feature vector. This design improves the accuracy and robustness of the node feature representations and lets the model better capture complex structures and latent patterns in the relationship graph, providing strong support for subsequent tasks such as data processing and query optimization.
In one embodiment, the node encoding layer further comprises a pooling layer connected to the attention network layer, and the embedding layer comprises a character word segmentation layer and an embedding calculation layer. The character word segmentation layer performs character-level word segmentation on each query field node in the relationship graph node spliced text, yielding the characters of each query field node; the embedding calculation layer computes the first embedding vectors of those characters; and the embedding vector corresponding to a query field node in the first embedding vector sequence comprises the first embedding vectors of the characters of that query field node.
In step S10323, obtaining the feature representation corresponding to each query field node according to the embedding vector corresponding to that query field node in the relationship graph node spliced text includes:
Step S103231, performing a pooling calculation on the first embedding vectors of the characters of each query field node to obtain a second embedding vector for that node, and taking this second embedding vector as the feature representation corresponding to the query field node.
The character word segmentation layer performs character-level segmentation on each query field node in the relationship graph node spliced text, splitting each query field node into its constituent characters and providing the basis for the subsequent character-based embedding calculation. The advantage of character-level segmentation is that it captures finer semantic and structural information within the query field nodes, improving the accuracy of the feature representation.
The embedding calculation layer computes a first embedding vector for each character of each query field node. These embedding vectors are obtained from a pre-trained character embedding model (e.g., Char2Vec) and carry the semantic information of the character itself, possibly further optimized by context. In this way, the embedding calculation layer produces a character-based, high-dimensional, information-rich embedding vector sequence for each query field node.
The pooling layer receives the output of the attention network layer and performs a pooling calculation on the embedding vector sequence of each query field node. The pooling operation (e.g., max pooling or average pooling) aggregates the key information in the sequence into a fixed-length second embedding vector. This vector retains the important features of the original embedding vectors while reducing computational complexity and noise interference through dimensionality reduction, yielding a more compact and robust feature representation.
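A minimal sketch of the pooling calculation of step S103231, assuming per-character embedding vectors are already available: max pooling keeps the strongest activation per dimension, while average pooling smooths over the characters. Function and parameter names are illustrative, not from the patent.

```python
def pool_char_embeddings(char_vectors, mode="max"):
    # Collapse the per-character embedding vectors of one query field node
    # into a single fixed-length second embedding vector.
    dim = len(char_vectors[0])
    if mode == "max":
        return [max(v[i] for v in char_vectors) for i in range(dim)]
    # average pooling
    return [sum(v[i] for v in char_vectors) / len(char_vectors)
            for i in range(dim)]

field_vec = pool_char_embeddings([[0.1, 0.9], [0.4, 0.2], [0.3, 0.5]])
# max pooling → [0.4, 0.9]
```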
It should be noted that the various layers involved in the embodiments of the present application, including the encoding layer, the character word segmentation layer, the embedding calculation layer, the pooling layer, the relationship graph attention network layer, the task decoder, and so on, may be neural-network-based processing modules that receive input data and output processed data; such capability is usually obtained through training. Unless specifically stated otherwise, some layers may directly adopt existing technology, while others need to be trained for the scenario of the present application. In addition, where specifically described, or where the implementation logic permits, some layers may be processing modules that are not based on neural networks but instead apply custom rules, and these need not be obtained through training; for example, the character word segmentation layer may perform character-level segmentation based on custom rules.
This embodiment concerns a large-data-table query scenario, in which the number of query field nodes in the relationship graph may be huge; encoding and decoding these nodes directly would impose a heavy computational burden and hurt query efficiency. The node encoding layer is therefore optimized accordingly: each query field node is first split into a character sequence by the character word segmentation layer, and an embedding vector is then computed for each character by the embedding calculation layer. This ensures that even long field names, or names containing complex characters, can be efficiently converted into representations in a high-dimensional vector space. The sequence of character embedding vectors is then fed into the attention network layer, which assigns weights to the embedding vectors of the characters to highlight important information and suppress noise. The pooling layer, placed after the attention network layer, compresses the long sequence into a fixed-length second embedding vector by pooling the embedding vector sequence of each query field node. This step markedly reduces the dimensionality of the feature vectors and the amount of computation while retaining the key information, effectively lowering the complexity of encoding and decoding.
Through the above optimization measures, this embodiment achieves both the construction of the relationship graph and an effective optimization of the node encoding layer in the large-data-table query scenario. On the one hand, omitting the data table name node and focusing on the query field nodes simplifies the relationship graph; on the other hand, the combined processing of the character word segmentation layer, embedding calculation layer, attention network layer, and pooling layer achieves efficient extraction and compact encoding of the query field node feature vectors. These optimizations reduce the amount of computation and the compute cost while improving query efficiency, allowing the system to respond more quickly to query requests against large-scale data tables.
For step S104, the relationship representation between the nodes and the first feature representation of each node are input to a pre-trained relationship graph attention network layer, and a second feature representation of each node is obtained.
After obtaining the preliminary feature representations of the nodes in step S103, step S104 inputs these feature representations as well as the relationship representations between the nodes into a pre-trained relationship graph attention network layer for further processing. The relationship graph attention network layer is able to capture important relationships and dependencies between nodes using the attention mechanism and update and optimize the feature representation of the nodes based on these relationships and dependencies. By the step, more accurate and rich node second characteristic representations can be generated, and more powerful support is provided for subsequent generation of query sentences.
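The neighbor-weighted update performed by a graph attention layer can be illustrated with the toy aggregation below. Dot-product scoring and the absence of learned weight matrices are simplifying assumptions; the patent's layer is pre-trained and additionally consumes the relationship representations between nodes.

```python
import math

def graph_attention_update(features, neighbors):
    # One round of attention-weighted neighbor aggregation. neighbors[i]
    # lists the node indices node i attends to (include i itself for a
    # self-loop). Scoring by dot product stands in for a learned scorer.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    updated = []
    for i, nbrs in enumerate(neighbors):
        scores = [dot(features[i], features[j]) for j in nbrs]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        w = [e / z for e in exps]
        dim = len(features[i])
        updated.append([sum(wk * features[j][d] for wk, j in zip(w, nbrs))
                        for d in range(dim)])
    return updated

updated = graph_attention_update(
    [[1.0, 0.0], [0.0, 1.0]],  # first feature representations
    [[0, 1], [1]],             # node 0 attends to both nodes; node 1 only to itself
)
```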
For step S105, the second feature representation of each node and the global feature representation are input to a pre-trained task decoder to obtain the query statement output by the task decoder, where the task decoder is constructed based on the constraint relationships of the query sub-statements of the query statement.
This step inputs the second feature representation of each node, together with the global feature representation, into a pre-trained task decoder. The task decoder is constructed based on the constraint relationships between the individual query sub-statements within a query statement, and through prior training it can understand these constraints and use them to guide the generation of the query statement. During decoding, the task decoder step by step builds an SQL query statement that matches the user's query intent. The resulting statement not only contains the data columns, conditions, and other information the user wants to query, but also follows the rules of SQL syntax and, where relevant, the constraints of the database table schema. Through this step, the natural language query is finally converted accurately into an SQL query statement.
Further, in one embodiment mentioned above, the method of the present application queries a single large data table whose name can be determined directly. The relationship graph therefore does not include a data table name node, nor does the finally generated query statement include the name of the large data table being queried. If the query mechanism of the data query engine requires the query statement to specify the name of the target data table, the target data table information obtained in step S101 naturally also includes the name of the large data table; after the query statement output by the task decoder is obtained in step S105, this data table name is combined with the query statement to produce the query-engine query statement, and the query can then be executed based on it.
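Where the query engine insists on an explicit table name, the final assembly described above might look like the sketch below. The clause dictionary, function name, and SQL template are hypothetical and only illustrate inserting the (fixed) big-table name into the decoder's output.

```python
def attach_table_name(clauses, table_name):
    # The decoder output omits the FROM target, since the big table is
    # fixed; splice the table name in before handing the statement to
    # the query engine.
    return (f"SELECT {clauses['select']} FROM {table_name} "
            f"WHERE {clauses['where']}")

sql = attach_table_name({"select": "order_id", "where": "amount > 100"},
                        "orders_big")
# → "SELECT order_id FROM orders_big WHERE amount > 100"
```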
In one embodiment, the task decoder comprises a feature fusion layer and a task decoding layer, the feature fusion layer being connected to the task decoding layer; the task decoding layer comprises a task fusion layer and a first linear layer, a second linear layer, a third linear layer, a fourth linear layer, a fifth linear layer, a sixth linear layer, a seventh linear layer, an eighth linear layer, a ninth linear layer, a tenth linear layer, an eleventh linear layer, and a twelfth linear layer, each respectively connected to the task fusion layer;
The step S105 of inputting the second feature representation of each node and the global feature representation to a pre-trained task decoder to obtain a query statement output by the task decoder includes:
step S1051, performing feature fusion on the global feature representation and the feature representation of the sentence node, and sequentially inputting the fused feature representation to the first linear layer, the second linear layer, and the third linear layer to obtain a first task result, where the first task result includes the number of fields of the grouped-data filtering condition sub-statement, the number of fields of the field selection sub-statement, and the number of fields of the condition selection sub-statement;
Step S1052, sequentially inputting the feature representation of the sentence node to the fourth linear layer, the fifth linear layer, and the sixth linear layer to obtain a second task result, where the second task result includes the condition connector of the grouped-data filtering condition sub-statement, the condition operator of the grouped-data filtering condition sub-statement, the condition connector of the condition selection sub-statement, and the condition value of the condition selection sub-statement;
Step S1053, performing feature fusion on the embedded representation output by the sixth linear layer and the feature representation of the sentence node, and inputting the fused feature representation to the seventh linear layer to obtain a third task result, where the third task result includes the condition value and the condition column of the condition selection sub-statement in the query statement;
Step S1054, performing feature fusion on the global feature representation and the feature representation of the query field node, and inputting the fused feature representation to the eighth linear layer to obtain a fourth task result, where the fourth task result includes the fields of the field selection sub-statement;
Step S1055, performing feature fusion on the embedded representation output by the eighth linear layer and the feature representation of the query field node, and sequentially inputting the fused feature representation to the ninth linear layer and the tenth linear layer to obtain a fifth task result, where the fifth task result includes the fields of the sorting sub-statement and the fields of the condition selection sub-statement;
Step S1056, sequentially inputting the embedded representation output by the eighth linear layer to the eleventh linear layer and the twelfth linear layer to obtain a sixth task result, where the sixth task result includes the fields of the grouping sub-statement and of the grouped-data filtering condition sub-statement;
Step S1057, inputting the first task result, the second task result, the third task result, the fourth task result, the fifth task result, and the sixth task result into the task fusion layer to obtain the output query statement.
In this embodiment, the task decoder is designed to include a feature fusion layer and a task decoding layer, where the task decoding layer is further subdivided into a plurality of linear layers to process feature representations from different nodes and ultimately generate a complete query statement. The design aims to gradually construct the query statement conforming to SQL grammar and logic structure through layering processing and feature fusion.
First, step S1051 fuses the global feature representation with the feature representation of the sentence node, combining the overall information of the relationship graph with the structure of the specific query statement. The fused feature representation is then passed through the first, second, and third linear layers in sequence to predict the number of fields required in each of the grouped-data filtering condition sub-statement (HAVING clause), the field selection sub-statement (SELECT clause), and the condition selection sub-statement (WHERE clause). This prediction provides a framework for the subsequent generation of the detailed conditions.
Next, step S1052 uses only the feature representation of the sentence node and, through the fourth to sixth linear layers, determines the condition connectors (e.g., AND, OR) and the condition operators (e.g., =, >, <) in the grouped-data filtering condition sub-statement (HAVING clause) and the condition selection sub-statement (WHERE clause). This step focuses on refining the conditional logic of the query statement.
To generate the specific conditions of the condition selection sub-statement, step S1053 again fuses the embedded representation output by the sixth linear layer with the feature representation of the sentence node and feeds the fused representation into the seventh linear layer. The seventh linear layer outputs the condition values and the corresponding condition columns, ensuring that every condition in the query statement is explicit and meaningful.
Step S1054 fuses the global feature representation with the feature representation of the query field node and generates, through the eighth linear layer, the fields to be included in the field selection sub-statement (SELECT clause). This step ensures that the query results contain the fields the user is interested in.
Step S1055 fuses the embedded representation output by the eighth linear layer with the feature representation of the query field node and, through the ninth and tenth linear layers, determines the fields of the sorting sub-statement (ORDER BY clause) and any further fields involved in the condition selection sub-statement (WHERE clause). This helps produce more refined query results.
Step S1056 uses the embedded representation output by the eighth linear layer directly and, through the eleventh and twelfth linear layers, generates the fields involved in the grouping sub-statement (GROUP BY clause) and the grouped-data filtering condition sub-statement (HAVING clause). This step completes the construction of the grouping functionality of the query statement.
Finally, step S1057 inputs the results of the respective subtasks described above (i.e., the first to sixth task results) into the task fusion layer. The task fusion layer is responsible for integrating the task results into a complete query statement conforming to SQL grammar. This step is the final step in the decoding process, which ensures that the generated query statement is both accurate and efficient.
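The multi-head decoding of steps S1051-S1057 can be reduced to the sketch below: feature vectors are fused, and each linear "head" reads a fused vector to predict one property of a SQL sub-statement. The additive fusion, the fixed weights, and the single illustrated head are assumptions; the trained decoder has twelve linear layers plus a task fusion layer.

```python
def linear(v, weights, bias):
    # One linear "head": y = W v + b, written out in pure Python.
    return [sum(wi * vi for wi, vi in zip(row, v)) + b
            for row, b in zip(weights, bias)]

def fuse(a, b):
    # Simple additive feature fusion (an assumption; the patent does not
    # fix the fusion operation).
    return [x + y for x, y in zip(a, b)]

global_feat = [0.2, 0.5]      # global feature representation
sentence_feat = [0.1, 0.3]    # sentence node feature representation
fused = fuse(global_feat, sentence_feat)          # input of step S1051
# head predicting e.g. the field count of the SELECT sub-statement
logit = linear(fused, [[1.0, 1.0]], [0.0])
```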
The embodiment realizes the automatic generation of the complex query statement by introducing a multi-layer linear structure and a feature fusion mechanism. The task decoder can not only predict each component part (such as grouping, field selection, condition and the like) of the query statement according to the node characteristic representation in the relation diagram, but also ensure that each component part is accurate through fine processing. The design not only improves the accuracy and efficiency of query statement generation, but also greatly reduces the difficulty and cost of manually writing complex query statements. Meanwhile, the model is more stable and efficient in processing large-scale data and complex queries in a layering processing and feature fusion mode.
The above examples illustrate only a few embodiments of the application, which are described in detail and are not to be construed as limiting the scope of the application. It should be noted that modifications and improvements can be made by those skilled in the art without departing from the spirit of the application, and the application is intended to encompass such modifications and improvements.

Claims (10)

1.一种基于关系图的查询语句生成方法,其特征在于,包括以下步骤:1. A method for generating a query statement based on a relationship graph, characterized in that it comprises the following steps: 获取用户输入的自然查询语句和用户查询的目标数据表信息;所述目标数据表信息包括若干查询字段;Acquire a natural query statement input by a user and target data table information queried by the user; the target data table information includes a number of query fields; 根据所述自然查询语句以及所述目标数据表信息构建关系图;所述关系图的节点包括语句节点和查询字段节点;所述语句节点根据所述自然查询语句的分词得到;所述查询字段节点根据所述自然查询语句匹配的查询字段得到;A relationship graph is constructed according to the natural query statement and the target data table information; the nodes of the relationship graph include statement nodes and query field nodes; the statement nodes are obtained according to the word segmentation of the natural query statement; the query field nodes are obtained according to the query fields matched by the natural query statement; 将所述关系图输入至编码层,得到各个节点之间的关系表示、各个节点的第一特征表示以及所述关系图各个节点的全局特征表示;Inputting the relationship graph into the encoding layer to obtain the relationship representation between each node, the first feature representation of each node, and the global feature representation of each node in the relationship graph; 将所述各个节点之间的关系表示以及各个节点的第一特征表示输入至预训练的关系图注意力网络层,获得各个节点的第二特征表示;Inputting the relationship representation between the nodes and the first feature representation of each node into the pre-trained relationship graph attention network layer to obtain the second feature representation of each node; 将所述各个节点的第二特征表示以及所述全局特征表示输入至预训练的任务解码器,获得所述任务解码器输出的查询语句;其中,所述任务解码器基于查询语句的各个查询子语句的约束关系构建;Inputting the second feature representation of each node and the global feature representation into a pre-trained task decoder to obtain a query statement output by the task decoder; wherein the task decoder is constructed based on the constraint relationship of each query sub-statement of the query statement; 所述编码层包括节点编码层;所述节点编码层包括嵌入层和注意力网络层;所述嵌入层与所述注意力网络层连接;The encoding layer includes a node encoding layer; the node encoding layer 
includes an embedding layer and an attention network layer; the embedding layer is connected to the attention network layer; 所述将所述关系图输入至编码层,得到各个节点之间的关系表示、各个节点的第一特征表示以及所述关系图各个节点的全局特征表示,包括:The step of inputting the relationship graph into the encoding layer to obtain the relationship representation between each node, the first feature representation of each node, and the global feature representation of each node in the relationship graph includes: 将所述关系图的语句节点以及查询字段节点按照预设格式进行拼接,得到关系图节点拼接文本;将所述关系图节点拼接文本输入至所述嵌入层,得到所述关系图节点拼接文本的第一嵌入向量序列;所述第一嵌入向量序列包括所述关系图节点拼接文本中每个语句节点以及每个查询字段节点对应的嵌入向量;Splicing the statement nodes and query field nodes of the relationship graph according to a preset format to obtain a spliced text of the relationship graph nodes; inputting the spliced text of the relationship graph nodes into the embedding layer to obtain a first embedding vector sequence of the spliced text of the relationship graph nodes; the first embedding vector sequence includes the embedding vector corresponding to each statement node and each query field node in the spliced text of the relationship graph nodes; 将所述第一嵌入向量序列输入至所述注意力网络层,获得输出的第二嵌入向量序列;所述第二嵌入向量序列包括所述关系图节点拼接文本中每个语句节点和每个查询字段节点对应的嵌入向量,以及所述关系图节点拼接文本的全局特征向量;Inputting the first embedding vector sequence into the attention network layer to obtain an output second embedding vector sequence; the second embedding vector sequence includes the embedding vector corresponding to each sentence node and each query field node in the relationship graph node concatenation text, and the global feature vector of the relationship graph node concatenation text; 根据所述关系图节点拼接文本中每个语句节点对应的嵌入向量,得到每个语句节点对应的特征表示;根据所述关系图节点拼接文本中每个查询字段节点对应的嵌入向量,得到每个查询字段节点对应的特征表示;根据所述关系图节点拼接文本的全局特征向量,得到所述关系图各个节点的全局特征表示。According to the embedding vector corresponding to each sentence node in the text of the relationship graph node splicing, the feature representation corresponding to each sentence node is obtained; according to the embedding vector corresponding to each 
query field node in the text of the relationship graph node splicing, the feature representation corresponding to each query field node is obtained; according to the global feature vector of the text of the relationship graph node splicing, the global feature representation of each node of the relationship graph is obtained. 2.根据权利要求1所述的基于关系图的查询语句生成方法,其特征在于,所述目标数据表信息包括若干数据小表,每个数据小表包括若干查询字段;2. The query statement generation method based on the relationship graph according to claim 1 is characterized in that the target data table information includes a plurality of small data tables, and each small data table includes a plurality of query fields; 所述根据所述自然查询语句以及所述目标数据表信息构建关系图,包括:The constructing a relationship graph according to the natural query statement and the target data table information includes: 对所述自然查询语句进行句法解析,根据句法解析结果确定所述自然查询语句内部各个分词之间的第一依赖关系;Performing syntactic analysis on the natural query statement, and determining a first dependency relationship between each word segment in the natural query statement according to the syntactic analysis result; 识别所述自然查询语句的实体分词,若所述实体分词与所述目标数据表信息中的至少一个查询字段匹配,确定所述实体分词与匹配的所述查询字段之间的第二依赖关系;若任意两个匹配的所述查询字段属于同一所述数据小表,确定两个匹配的查询字段之间的第三依赖关系;Identify the entity participle of the natural query statement, and if the entity participle matches at least one query field in the target data table information, determine a second dependency relationship between the entity participle and the matched query field; if any two matched query fields belong to the same data tablelet, determine a third dependency relationship between the two matched query fields; 根据所述自然查询语句内部各个分词以及匹配的各个查询字段构建关系图的节点,并根据所述自然查询语句内部各个分词之间的第一依赖关系、所述实体分词与匹配的所述查询字段之间的第二依赖关系以及属于同一所述数据小表的任意两个查询字段之间的第三依赖关系,构建关系图各个节点之间对应的边。The nodes of the relationship graph are constructed according to the individual word segments and the matched query fields within the natural query statement, and the corresponding edges between the nodes of the relationship graph are constructed according to 
the first dependency relationship between the individual word segments within the natural query statement, the second dependency relationship between the entity word segment and the matched query field, and the third dependency relationship between any two query fields belonging to the same data tablelet. 3.根据权利要求2所述的基于关系图的查询语句生成方法,其特征在于,所述句法解析结果包括所述自然查询语句的词级别句法关系;所述自然查询语句内部各个分词之间的第一依赖关系包括各个分词之间的第一依赖类型信息;所述各个分词之间的第一依赖类型信息根据所述词级别句法关系确定。3. According to the query statement generation method based on the relationship graph according to claim 2, it is characterized in that the syntactic parsing result includes the word-level syntactic relationship of the natural query statement; the first dependency relationship between each word within the natural query statement includes the first dependency type information between each word; the first dependency type information between each word is determined according to the word-level syntactic relationship. 4.根据权利要求2所述的基于关系图的查询语句生成方法,其特征在于,所述若所述实体分词与所述目标数据表信息中的至少一个查询字段匹配,确定所述实体分词与匹配的所述查询字段之间的第二依赖关系之前,包括步骤:4. The query statement generation method based on the relationship graph according to claim 2 is characterized in that if the entity segmentation matches at least one query field in the target data table information, before determining the second dependency relationship between the entity segmentation and the matching query field, it includes the steps of: 将所述实体分词与所述目标数据表信息中的各个查询字段进行字符匹配,若所述实体分词与任一所述查询字段的字符不匹配度小于预设阈值,则确定所述实体分词与所述查询字段匹配。The entity segmentation is character matched with each query field in the target data table information. If the character mismatch degree between the entity segmentation and any query field is less than a preset threshold, it is determined that the entity segmentation matches the query field. 5.根据权利要求4所述的基于关系图的查询语句生成方法,其特征在于,所述实体分词与匹配的所述查询字段之间的第二依赖关系包括所述实体分词与匹配的所述查询字段之间的第二依赖类型信息;所述第二依赖类型信息包括完全匹配类型和部分匹配类型;5. 
The query statement generation method based on the relationship graph according to claim 4 is characterized in that the second dependency relationship between the entity participle and the matching query field includes second dependency type information between the entity participle and the matching query field; the second dependency type information includes a complete match type and a partial match type; 所述方法还包括步骤:The method further comprises the steps of: 若所述字符不匹配度为0,确定所述实体分词与所述查询字段的第二依赖类型信息为完全匹配类型;若所述字符不匹配度大于0且小于预设阈值,确定所述实体分词与所述查询字段的第二依赖类型信息为部分匹配类型。If the character mismatch degree is 0, it is determined that the entity segmentation and the second dependent type information of the query field are a complete match type; if the character mismatch degree is greater than 0 and less than a preset threshold, it is determined that the entity segmentation and the second dependent type information of the query field are a partial match type. 6.根据权利要求2所述的基于关系图的查询语句生成方法,其特征在于,所述目标数据表信息还包括各个查询字段对应的字段大类信息;6. The query statement generation method based on the relationship graph according to claim 2 is characterized in that the target data table information also includes field category information corresponding to each query field; 所述第三依赖关系包括第三依赖类型信息;所述第三依赖类型信息包括相同字段大类类型和不同字段大类类型;The third dependency relationship includes third dependency type information; the third dependency type information includes the same field category type and different field category types; 所述方法还包括步骤:The method further comprises the steps of: 若属于同一数据小表的任意两个查询字段对应相同的字段大类信息,确定所述两个查询字段之间的第三依赖类型信息为相同字段大类类型,否则确定所述两个查询字段之间的第三依赖类型信息为不同字段大类类型。If any two query fields belonging to the same data subtable correspond to the same field category information, the third dependency type information between the two query fields is determined to be the same field category type; otherwise, the third dependency type information between the two query fields is determined to be different field category types. 
7.根据权利要求1所述的基于关系图的查询语句生成方法,其特征在于,所述编码层包括关系编码层;7. The query statement generation method based on the relationship graph according to claim 1, characterized in that the encoding layer includes a relationship encoding layer; 所述将所述关系图输入至编码层,得到各个节点之间的关系表示、各个节点的第一特征表示以及所述关系图各个节点的全局特征表示,包括:The step of inputting the relationship graph into the encoding layer to obtain the relationship representation between each node, the first feature representation of each node, and the global feature representation of each node in the relationship graph includes: 将所述关系图输入至所述关系编码层,得到输出的各个节点之间的关系表示。The relationship graph is input into the relationship encoding layer to obtain the relationship representation between the output nodes. 8.根据权利要求7所述的基于关系图的查询语句生成方法,其特征在于,所述节点编码层还包括与所述注意力网络层连接的池化层;所述嵌入层包括字符分词层和嵌入计算层;所述字符分词层用于对所述关系图节点拼接文本中每个查询字段节点进行字符分词,得到每个查询字段节点对应的若干字符;所述嵌入计算层用于计算每个查询字段节点对应的若干字符的第一嵌入向量;所述第一嵌入向量序列中所述查询字段节点对应的嵌入向量包括所述查询字段节点的各个字符的第一嵌入向量;8. According to the query statement generation method based on the relationship graph of claim 7, it is characterized in that the node encoding layer also includes a pooling layer connected to the attention network layer; the embedding layer includes a character segmentation layer and an embedding calculation layer; the character segmentation layer is used to perform character segmentation on each query field node in the relationship graph node splicing text to obtain a number of characters corresponding to each query field node; the embedding calculation layer is used to calculate the first embedding vector of the several characters corresponding to each query field node; the embedding vector corresponding to the query field node in the first embedding vector sequence includes the first embedding vector of each character of the query field node; 所述根据所述关系图节点拼接文本中每个查询字段节点对应的嵌入向量,得到每个查询字段节点对应的特征表示,包括:The step of concatenating the embedding vector corresponding to each query field node in the text according to the relationship graph nodes to obtain 
the feature representation corresponding to each query field node includes: 根据每个查询字段节点的各个字符的第一嵌入向量进行池化计算,得到每个查询字段节点的第二嵌入向量;确定每个查询字段节点对应的第二嵌入向量为每个查询字段节点对应的特征表示。Pooling calculation is performed based on the first embedding vectors of each character of each query field node to obtain a second embedding vector of each query field node; and the second embedding vector corresponding to each query field node is determined as a feature representation corresponding to each query field node. 9.根据权利要求1所述的基于关系图的查询语句生成方法,其特征在于,所述任务解码器包括特征融合层和任务解码层;所述特征融合层与所述任务解码层连接;所述任务解码层包括任务融合层以及分别与所述任务融合层连接的第一线性层、第二线性层、第三线性层、第四线性层、第五线性层、第六线性层、第七线性层、第八线性层、第九线性层、第十线性层、第十一线性层和第十二线性层;9. The query statement generation method based on the relationship graph according to claim 1 is characterized in that the task decoder includes a feature fusion layer and a task decoding layer; the feature fusion layer is connected to the task decoding layer; the task decoding layer includes a task fusion layer and a first linear layer, a second linear layer, a third linear layer, a fourth linear layer, a fifth linear layer, a sixth linear layer, a seventh linear layer, an eighth linear layer, a ninth linear layer, a tenth linear layer, an eleventh linear layer and a twelfth linear layer respectively connected to the task fusion layer; 所述将所述各个节点的第二特征表示以及所述全局特征表示输入至预训练的任务解码器,获得所述任务解码器输出的查询语句,包括:The step of inputting the second feature representation of each node and the global feature representation into a pre-trained task decoder to obtain a query statement output by the task decoder includes: 将所述全局特征表示与所述语句节点的特征表示进行特征融合,并将融合特征表示依次输入至所述第一线性层、所述第二线性层以及所述第三线性层,获得第一任务结果,所述第一任务结果包括分组数据过滤条件子语句的字段数量、字段选择子语句的字段数量以及条件选择子语句的字段数量;Performing feature fusion on the global feature representation and the feature representation of the statement node, and inputting the fused feature representation to the first linear layer, the second linear layer, and the third linear layer in sequence, to obtain a first task result, wherein the 
first task result includes the number of fields of the grouped-data filtering condition sub-statement, the number of fields of the field selection sub-statement, and the number of fields of the condition selection sub-statement;
inputting the feature representation of the statement node to the fourth linear layer, the fifth linear layer, and the sixth linear layer in sequence to obtain a second task result, wherein the second task result includes the condition connector of the grouped-data filtering condition sub-statement, the condition operator of the grouped-data filtering condition sub-statement, the condition connector of the condition selection sub-statement, and the condition value of the condition selection sub-statement;
fusing the embedding representation output by the sixth linear layer with the feature representation of the statement node, and inputting the fused feature representation to the seventh linear layer to obtain a third task result, wherein the third task result includes the condition value and the condition column of the condition selection sub-statement in the query statement;
fusing the global feature representation with the feature representation of the query field nodes, and inputting the fused feature representation to the eighth linear layer to obtain a fourth task result, wherein the fourth task result includes the fields of the field selection sub-statement;
fusing the embedding representation output by the eighth linear layer with the feature representation of the query field nodes, and inputting the fused feature representation to the ninth linear layer and the tenth linear layer in sequence to obtain a fifth task result, wherein the fifth task result includes the fields of the sorting sub-statement and the fields of the condition selection sub-statement;
inputting the embedding representation output by the eighth linear layer to the eleventh linear layer and the twelfth linear layer in sequence to obtain a sixth task result, wherein the sixth task result includes the fields of the grouping sub-statement and of the grouped-data filtering condition sub-statement; and
inputting the first through sixth task results into the task fusion layer to obtain the output query statement.

10. The query statement generation method based on the relationship graph according to claim 1, characterized in that the pre-trained relationship graph attention network layer and the task decoder are trained as follows:
acquiring a training data set, wherein the training data set includes a plurality of training samples, each training sample including the relationship representation between the nodes of a relationship graph, the first feature representation of each node, the global feature representation of the relationship graph, and corresponding query label information, the query label information including the standard query statement corresponding to the relationship graph;
inputting the relationship representation between the nodes of the relationship graph of each training sample and the feature representation of each node into the initialized relationship graph attention network layer to obtain the output second feature representation of each node of each training sample;
inputting the second feature representation of each node of each training sample and the corresponding global feature representation into the initialized task decoder to obtain the output query statement corresponding to each training sample; and
calculating a loss value between the query statement of each training sample and the corresponding query label information, and adjusting the parameters of the relationship graph attention network layer and of the task decoder according to the loss value until the loss value is less than a preset loss threshold or the total number of weight adjustments reaches a preset number of adjustments, thereby obtaining the trained relationship graph attention network layer and task decoder.
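The node encoding of claim 8 (character segmentation, per-character first embedding vectors, pooling into a second embedding vector) can be illustrated with a minimal sketch. The field names, character vocabulary, embedding width, and the choice of mean pooling are assumptions for illustration only — the claim says only "pooling calculation" — and `embed_table` is a random stand-in for the embedding calculation layer.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8  # assumed embedding width

# Hypothetical query field nodes from the concatenated relationship-graph node text.
fields = ["user_id", "order_date"]
vocab = {ch: i for i, ch in enumerate(sorted(set("".join(fields))))}
embed_table = rng.normal(size=(len(vocab), EMB_DIM))  # stand-in embedding calculation layer

def encode_field_node(field: str) -> np.ndarray:
    """Character segmentation -> first embedding vectors -> pooled second embedding."""
    chars = list(field)                                  # character segmentation layer
    first_vecs = embed_table[[vocab[c] for c in chars]]  # one first embedding per character
    return first_vecs.mean(axis=0)                       # pooling layer (mean is an assumption)

# The second embedding vector serves as the field node's feature representation.
features = {f: encode_field_node(f) for f in fields}
```

Any permutation-invariant pooling (max, sum, attention-weighted) would fit the claim equally well; mean pooling is simply the most common default.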
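The task decoder of claim 9 can be sketched as stacked linear heads over fused feature representations. Concatenation as the fusion operation, the feature width, and the rounding/thresholding of head outputs are assumptions not fixed by the claim, and only two of the six task heads (the first and fourth) are shown.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16  # assumed feature width

def linear(in_dim: int, out_dim: int):
    """A random linear layer standing in for a trained one."""
    W, b = rng.normal(size=(in_dim, out_dim)), np.zeros(out_dim)
    return lambda x: x @ W + b

global_feat = rng.normal(size=D)        # global feature representation
stmt_feat = rng.normal(size=D)          # statement-node feature representation
field_feats = rng.normal(size=(3, D))   # one row per query field node

# First task: fuse global + statement features, then linear layers 1-3 in sequence
# to predict the field counts of the three sub-statements named in the claim.
l1, l2, l3 = linear(2 * D, D), linear(D, D), linear(D, 3)
fused = np.concatenate([global_feat, stmt_feat])   # fusion = concatenation (assumption)
counts = np.round(l3(l2(l1(fused)))).astype(int)   # first task result: three field counts

# Fourth task: fuse the global features with each query field node and score it
# via linear layer 8 to pick the fields of the field selection sub-statement.
l8 = linear(2 * D, 1)
scores = np.array([l8(np.concatenate([global_feat, f]))[0] for f in field_feats])
selected = scores > 0                   # fourth task result: selected fields (threshold assumed)
```

The remaining heads (layers 4-7 and 9-12) follow the same pattern, and a task fusion layer would assemble the six results into one query statement.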
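Claim 10's stopping rule (stop when the loss falls below a preset threshold or the total number of weight adjustments reaches a preset count) can be sketched with a toy linear model standing in for the graph attention layer and task decoder. The squared-error loss and plain gradient step are assumptions, since the claim fixes neither the loss function nor the optimizer.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(32, 4))             # stands in for encoded training samples
y = X @ np.array([1.0, -2.0, 0.5, 3.0])  # stands in for the standard-query labels
w = np.zeros(4)                          # stands in for the model parameters

LOSS_THRESHOLD = 1e-4     # preset loss threshold
MAX_ADJUSTMENTS = 10_000  # preset total number of weight adjustments

adjustments = 0
loss = float("inf")
while loss >= LOSS_THRESHOLD and adjustments < MAX_ADJUSTMENTS:
    pred = X @ w
    loss = float(np.mean((pred - y) ** 2))  # loss vs. query label information
    grad = 2 * X.T @ (pred - y) / len(X)
    w -= 0.05 * grad                        # one weight adjustment
    adjustments += 1
# Training ends when either stopping criterion of claim 10 is met.
```

On this well-conditioned toy problem the loss criterion fires long before the adjustment cap; in practice the cap guards against a threshold that is never reached.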
CN202411505112.2A 2024-10-28 2024-10-28 Query statement generation method based on relational graph Active CN119046313B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411505112.2A CN119046313B (en) 2024-10-28 2024-10-28 Query statement generation method based on relational graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411505112.2A CN119046313B (en) 2024-10-28 2024-10-28 Query statement generation method based on relational graph

Publications (2)

Publication Number Publication Date
CN119046313A CN119046313A (en) 2024-11-29
CN119046313B true CN119046313B (en) 2025-01-21

Family

ID=93587106

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411505112.2A Active CN119046313B (en) 2024-10-28 2024-10-28 Query statement generation method based on relational graph

Country Status (1)

Country Link
CN (1) CN119046313B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119938701B (en) * 2025-01-08 2025-08-05 马鞍山市大数据资产运营有限公司 A method for combining SQL component nodes in multi-engine development

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117973517A (en) * 2024-01-08 2024-05-03 西北工业大学 A knowledge graph link prediction method based on heuristic information search
CN118747177A (en) * 2024-07-10 2024-10-08 香港科技大学(广州) Data query method, device, equipment, computer storage medium and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112447300B (en) * 2020-11-27 2024-02-09 平安科技(深圳)有限公司 Medical query method and device based on graph neural network, computer equipment and storage medium
CN113553414B (en) * 2021-06-30 2023-08-25 北京百度网讯科技有限公司 Intelligent dialogue method, device, electronic equipment and storage medium
US20240062011A1 (en) * 2022-08-22 2024-02-22 Oracle International Corporation Techniques for using named entity recognition to resolve entity expression in transforming natural language to a meaning representation language


Also Published As

Publication number Publication date
CN119046313A (en) 2024-11-29

Similar Documents

Publication Publication Date Title
CN114547329B (en) Method for establishing pre-trained language model, semantic parsing method and device
CN108519890B (en) A Robust Code Summary Generation Method Based on Self-Attention Mechanism
JP5128629B2 (en) Part-of-speech tagging system, part-of-speech tagging model training apparatus and method
CN110321563B (en) Text Sentiment Analysis Method Based on Mixed Supervision Model
RU2628436C1 (en) Classification of texts on natural language based on semantic signs
RU2679988C1 (en) Extracting information objects with the help of a classifier combination
CN112541337B (en) Document template automatic generation method and system based on recurrent neural network language model
CN114138989B (en) Correlation prediction model training method, device and correlation prediction method
CN119046313B (en) Query statement generation method based on relational graph
CN113065358A (en) Text-to-semantic matching method based on multi-granularity alignment for bank consultation service
CN112487190A (en) Method for extracting relationships between entities from text based on self-supervision and clustering technology
CN115204143B (en) Method and system for calculating text similarity based on prompt
CN119293164A (en) Method, device, equipment and storage medium for constructing knowledge base question answering system
CN119557424B (en) Data analysis method, system and storage medium
CN118095292A (en) A text generation method and system based on prompt engineering and fine-tuning technology
CN120277223B (en) Dynamic vector knowledge base construction and retrieval method based on multi-mode large model
CN111831624A (en) Data table creating method and device, computer equipment and storage medium
CN115374258A (en) Knowledge base query method and system combining semantic understanding with question template
CN118965192A (en) A generative AI service website identification method based on multimodal fusion learning
CN119539094B (en) A knowledge reasoning method, system and device for knowledge-intensive tasks
CN119557407B (en) Intelligent question-answering method and system based on semantic vector optimization and dynamic prompt
CN119378563B (en) An implicit discourse relation recognition method enhanced by large language model generation data
CN119938846A (en) Method and device for generating question and answer based on knowledge graph
CN119692476A (en) Ultra-high-speed optical module digital manufacturing scenario knowledge reasoning method, device, equipment and medium based on large model technology
CN119202144A (en) A text structured extraction method, system, terminal and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant