CN119691035A

CN119691035A - Metadata generation method and device

Info

Publication number: CN119691035A
Application number: CN202411750516.8A
Authority: CN
Inventors: 张颖; 王武军
Original assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Current assignee: Suzhou Metabrain Intelligent Technology Co Ltd
Priority date: 2024-11-30
Filing date: 2024-11-30
Publication date: 2025-03-25

Abstract

The embodiments of the present application provide a metadata generation method and device. The method includes: obtaining a pre-created meta-model and full basic information extracted from at least one data source, wherein the meta-model includes at least two sub-data structures, the full basic information includes at least two granularities, and each sub-data structure corresponds to one of the granularities; encapsulating the full basic information with the meta-model to obtain full data instances; and, based on a preset cluster merging mode, clustering and merging the full data instances through a retrieval enhancement generation model to generate target metadata, wherein the retrieval enhancement generation model clusters the full data instances according to the preset cluster merging mode, and the preset cluster merging mode includes grouping according to the description information of the full data instances. The method thereby achieves the technical effect of automatically generating metadata and solves the problems in the related art that metadata generation is inefficient and data fusion quality is difficult to guarantee.

Description

Metadata generation method and device

Technical Field

The embodiment of the application relates to the field of computers, in particular to a metadata generation method and device.

Background

Data fusion is the logical or physical consolidation of data with different sources, formats, and characteristics, so as to provide comprehensive data sharing for enterprises. A consistently designed data model that ensures the integrity, accuracy, and consistency of the data is the key to data fusion, and metadata is the foundation and link for constructing the data model and sharing data. At present, metadata is generated manually.

The manual generation of unified metadata has the defects of large workload, difficult normalization, high error rate, poor maintainability and the like, so that the metadata generation efficiency is low, and the data fusion quality is difficult to guarantee.

Disclosure of Invention

The embodiment of the application provides a metadata generation method and device, which at least solve the problems of low metadata generation efficiency and difficult data fusion quality guarantee in the related technology.

According to one embodiment of the application, a metadata generation method is provided, which comprises: obtaining a pre-created meta-model and full-volume basic information extracted from at least one data source, wherein the meta-model comprises at least two sub-data structures, the full-volume basic information comprises at least two granularities, and each sub-data structure corresponds to one of the at least two granularities; packaging the full-volume basic information by using the meta-model to obtain full-volume data instances, wherein the full-volume data instances represent the full-volume basic information stored according to rules indicated by the meta-model; and carrying out cluster merging on the full-volume data instances through a retrieval enhancement generation model based on a preset cluster merging mode to generate target metadata, wherein the retrieval enhancement generation model is used for clustering the full-volume data instances according to the preset cluster merging mode, and the preset cluster merging mode comprises grouping according to description information of the full-volume data instances.

In an exemplary embodiment, the method for generating the target metadata by carrying out cluster merging on the full data instance through the retrieval enhancement generation model based on a preset cluster merging mode comprises the steps of carrying out cluster merging on the full data instance through the retrieval enhancement generation model based on a preset cluster merging mode, determining a target aggregation instance, wherein the target aggregation instance comprises an instance corresponding to table-level metadata, the at least two sub-data structures comprise the table-level metadata, and generating the target metadata based on the target aggregation instance, wherein the target metadata is set to be assembled according to a target machine language.

In an exemplary embodiment, carrying out cluster merging on the full data instances through the retrieval enhancement generation model based on a preset cluster merging mode to determine a target aggregation instance includes: setting the preset cluster merging mode; storing the preset cluster merging mode in a knowledge base corresponding to the retrieval enhancement generation model; and automatically carrying out cluster merging on the full data instances through the retrieval enhancement generation model with reference to the knowledge base, so as to determine the target aggregation instance.
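As a minimal sketch of this embodiment, the following Python fragment stores preset cluster merging rules in a small in-memory knowledge base and retrieves them when a prompt for the model is assembled. The class `RuleKnowledgeBase`, its methods, the rule texts, and the prompt layout are all hypothetical; the application does not specify how the knowledge base is stored or queried, and no actual model call is made here.

```python
class RuleKnowledgeBase:
    """Hypothetical in-memory store for preset cluster merging rules."""

    def __init__(self):
        self._rules = []  # list of (metadata level, rule text) pairs

    def store(self, level, rule_text):
        self._rules.append((level, rule_text))

    def retrieve(self, level):
        # Return every rule registered for the given metadata level.
        return [text for lvl, text in self._rules if lvl == level]


def build_prompt(kb, level, instances):
    """Combine retrieved rules with data instances into one prompt string.

    A real system would pass this prompt to the generation model; that call
    is left out because the source does not describe it.
    """
    rules = kb.retrieve(level)
    lines = ["Rules:"] + [f"- {rule}" for rule in rules]
    lines += ["Instances:"] + [f"- {inst}" for inst in instances]
    return "\n".join(lines)


kb = RuleKnowledgeBase()
kb.store("table", "Group tables whose descriptions refer to the same business object.")
kb.store("field", "Remove fields that are semantically duplicated.")

prompt = build_prompt(
    kb, "table", ["t_order: customer orders", "order_info: order records"]
)
print(prompt)
```

Keeping the rules outside the model, as in this sketch, means the merging behavior can be changed by editing the knowledge base rather than the model itself.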

In an exemplary embodiment, carrying out cluster merging on the full-volume data instances through the retrieval enhancement generation model based on a preset cluster merging mode to determine a target aggregation instance comprises: obtaining, through the retrieval enhancement generation model, description information corresponding to each table-level metadata in the full-volume data instances; grouping the full-volume data instances according to the description information to obtain at least two groups of data instances respectively corresponding to the at least two sub-data structures; and, for each group of data instances, merging the data instances according to attribute parameters of the corresponding sub-data structure to obtain the target aggregation instance.

In an exemplary embodiment, merging each group of the data instances according to attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance includes performing the following operations on each group of the data instances in the case that the sub-data structure corresponds to the table-level metadata, wherein the group of data instances on which the operations are performed is regarded as a first group of data instances: additively filling the table-level data source attribute parameters of each data instance in the first group of data instances to obtain target table-level data source attribute parameters; filling the table-level identification attribute parameters of each data instance in the first group of data instances according to a first language service text to obtain target table-level identification attribute parameters corresponding to the target aggregate instance; filling the table-level description attribute parameters of each data instance in the first group of data instances according to a second language service text to obtain target table-level description attribute parameters, wherein the first language service text and the second language service text use different languages; and generating the target aggregate instance based on the target table-level data source attribute parameters, the target table-level identification attribute parameters, and the target table-level description attribute parameters.
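The table-level merge can be illustrated with a short Python sketch. It is an illustration under assumptions, not the claimed implementation: the function name `merge_table_level` and the dictionary keys are hypothetical, "additive filling" is interpreted as collecting every distinct data source, and the two service texts are passed in as arguments because the source does not say how they are produced. Using English for the identifier and Chinese for the description is likewise only an example of two different languages.

```python
def merge_table_level(group, identifier_text, description_text):
    """Merge a group of table-level instances into one aggregate instance.

    group            -- list of dicts, each with a "data_source" key
    identifier_text  -- service text in the first language (assumed given)
    description_text -- service text in the second language (assumed given)
    """
    sources = []
    for inst in group:
        # Additive fill: keep every distinct source, in first-seen order.
        if inst["data_source"] not in sources:
            sources.append(inst["data_source"])
    return {
        "data_source": sources,
        "identifier": identifier_text,
        "description": description_text,
    }


group = [
    {"data_source": "mysql.shop", "identifier": "t_order", "description": "订单表"},
    {"data_source": "oracle.erp", "identifier": "ORDER_INFO", "description": "order records"},
    {"data_source": "mysql.shop", "identifier": "t_order_bak", "description": "订单备份"},
]

merged = merge_table_level(group, "customer_order", "客户订单")
print(merged)
```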

In an exemplary embodiment, merging each group of the data instances according to attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance includes performing the following operations on each group of the data instances in the case that the sub-data structure corresponds to field-level metadata, wherein the group of data instances on which the operations are performed is regarded as a second group of data instances: performing semantic deduplication on the field-level description attribute parameter and the field-level data item attribute parameter of each data instance in the second group of data instances; filling the deduplicated field-level description attribute parameter according to a second language service text to obtain a target field-level description attribute parameter; filling the field-level identification attribute parameter of each data instance in the second group of data instances according to a first language service text to obtain a target field-level identification attribute parameter, wherein the first language service text and the second language service text use different languages; aggregating the field-level description attribute parameters in the second group of data instances according to the field-level description attribute parameter of each data instance; and generating the target aggregate instance according to the target field-level attribute parameters.

In an exemplary embodiment, merging each group of the data instances according to attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance includes performing the following operations on each group of the data instances in the case that the sub-data structure corresponds to dictionary-level metadata, wherein the group of data instances on which the operations are performed is regarded as a third group of data instances: performing semantic deduplication on the dictionary-level value attribute parameters of each data instance in the third group of data instances to obtain target dictionary-level value attribute parameters; filling the dictionary-level description attribute parameters and the dictionary-level data type attribute parameters of each data instance in the third group of data instances according to their original values to obtain target dictionary-level description attribute parameters and target dictionary-level data type attribute parameters; and generating the target aggregate instance based on the target dictionary-level value attribute parameters, the target dictionary-level description attribute parameters, and the target dictionary-level data type attribute parameters.
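A rough Python sketch of the dictionary-level merge follows. Two simplifications are assumptions of this sketch and not statements about the application: semantic deduplication is approximated by comparing values after trimming whitespace and ignoring case, and when the instances in a group carry different descriptions or data types, the values of the first instance are kept. The function name and dictionary keys are hypothetical.

```python
def merge_dictionary_level(group):
    """Merge dictionary-level instances: deduplicate values, keep originals."""
    seen = set()
    values = []
    for inst in group:
        for value in inst["values"]:
            # Stand-in for semantic deduplication: normalized string equality.
            key = value.strip().casefold()
            if key not in seen:
                seen.add(key)
                values.append(value.strip())
    first = group[0]
    return {
        "values": values,
        # Description and data type are filled with their original values.
        "description": first["description"],
        "data_type": first["data_type"],
    }


group = [
    {"values": ["Paid", "To pay"], "description": "order status", "data_type": "string"},
    {"values": ["paid ", "Refunded"], "description": "order status", "data_type": "string"},
]

merged = merge_dictionary_level(group)
print(merged["values"])
```

A production system would replace the normalization step with a genuinely semantic comparison, for example one delegated to the retrieval enhancement generation model.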

According to another embodiment of the application, a metadata generation device is provided, which comprises an acquisition module, a packaging module, and a clustering module. The acquisition module is used for acquiring a pre-created meta-model and full-volume basic information extracted from at least one data source, wherein the meta-model comprises at least two sub-data structures, the full-volume basic information comprises at least two granularities, and each sub-data structure corresponds to one of the granularities. The packaging module is used for packaging the full-volume basic information by using the meta-model to obtain full-volume data instances, wherein the full-volume data instances represent the full-volume basic information stored according to rules indicated by the meta-model. The clustering module is used for carrying out cluster merging on the full-volume data instances through a retrieval enhancement generation model based on a preset cluster merging mode to generate target metadata, wherein the retrieval enhancement generation model is used for clustering the full-volume data instances according to the preset cluster merging mode, and the preset cluster merging mode comprises grouping according to description information of the full-volume data instances.

According to a further embodiment of the application, there is also provided a computer readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

According to a further embodiment of the application there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In the present application, first, the meta-model is designed to include at least two sub-data structures, such as table-level metadata and field-level metadata, while the full-volume basic information includes at least two granularities, such as the database level and the table level. This design ensures that each sub-data structure corresponds to one granularity of basic information, so that data is encapsulated accurately. Next, the full-volume basic information is encapsulated by the meta-model to generate full-volume data instances. In this step, the full-volume basic information is organized and stored according to the rules of the meta-model, forming structured data instances that provide a standardized basis for subsequent data processing and analysis. Finally, based on a preset cluster merging mode, a retrieval enhancement generation model (RAG model) is used to cluster and merge the full-volume data instances. In this process, the RAG model groups the full-volume data instances according to their description information, so that data is logically consolidated and integrated and the target metadata is generated automatically. Data with different sources and formats can thus be managed and fused in a unified manner, which achieves the technical effect of automatically generating metadata and solves the problems in the related art that metadata generation is inefficient and data fusion quality is difficult to guarantee.

Drawings

FIG. 1 is a hardware configuration block diagram of a server device for a metadata generation method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of generating metadata according to an embodiment of the present application;

FIG. 3 is a flow chart of a method of generating metadata according to an embodiment of the present application;

FIG. 4 is a block diagram of a metadata generation device according to an embodiment of the present application.

Detailed Description

Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with the embodiments.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be executed in a server device or a similar computing device. Taking execution on a server device as an example, FIG. 1 is a block diagram of the hardware structure of a server device for a metadata generation method according to an embodiment of the present application. As shown in FIG. 1, the server device may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. The server device may further include a transmission device 106 for communication functions and an input-output device 108. It will be appreciated by those of ordinary skill in the art that the architecture shown in FIG. 1 is merely illustrative and is not intended to limit the architecture of the server device described above. For example, the server device may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the metadata generation method in an embodiment of the present application. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, that is, it implements the above-described method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the server device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of a server device. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.

In this embodiment, a metadata generation method is provided. FIG. 2 is a flowchart of a metadata generation method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:

Step S202, obtaining a pre-created meta-model and full basic information extracted from at least one data source, wherein the meta-model comprises at least two sub-data structures, the full basic information comprises at least two granularities, and each sub-data structure corresponds to one of the granularities;

Step S204, packaging the full-quantity basic information by using a meta-model to obtain a full-quantity data instance, wherein the full-quantity data instance represents the full-quantity basic information stored according to rules indicated by the meta-model;

Step S206, carrying out cluster combination on the full-volume data instances through a retrieval enhancement generation model based on a preset cluster combination mode to generate target metadata, wherein the retrieval enhancement generation model is used for carrying out cluster on the full-volume data instances according to the preset cluster combination mode, and the preset cluster combination mode comprises grouping according to description information of the full-volume data instances.

The main execution body of the above steps may be a server, a terminal, or the like, but is not limited thereto.

The execution order of step S202 and step S204 may be interchanged, i.e. step S204 may be executed first and then step S202 may be executed.

Optionally, in the embodiment of the present application, step S202 refers to the process of acquiring a pre-created meta-model and full basic information extracted from at least one data source. The meta-model comprises at least two sub-data structures, which are the data descriptions of different levels and types that make up the meta-model, such as table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata. These sub-data structures correspond to different data granularities; for example, table-level metadata corresponds to tables in the database, and field-level metadata corresponds to fields in a table. The full basic information includes at least two granularities, which refer to the level of detail of the information extracted from the data source, such as database names, table names, and field names, together with their attributes, such as field length and field precision. Each sub-data structure corresponds to one granularity, which ensures that the meta-model can fully describe and encapsulate the basic information of the data source.
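One possible way to represent such a meta-model in code is sketched below in Python. The class names, the attribute lists, and the lookup method are hypothetical and chosen only for illustration; the application names the metadata levels but does not prescribe a concrete data layout.

```python
from dataclasses import dataclass, field


@dataclass
class SubDataStructure:
    level: str        # e.g. "table", "field", "element", "dictionary"
    granularity: str  # granularity of basic information this level covers
    attributes: list = field(default_factory=list)


@dataclass
class MetaModel:
    sub_structures: list

    def __post_init__(self):
        # The method requires at least two sub-data structures.
        if len(self.sub_structures) < 2:
            raise ValueError("a meta-model needs at least two sub-data structures")

    def structure_for(self, granularity):
        # Each sub-data structure corresponds to exactly one granularity.
        for sub in self.sub_structures:
            if sub.granularity == granularity:
                return sub
        raise KeyError(granularity)


meta_model = MetaModel([
    SubDataStructure("table", "table", ["data_source", "identifier", "description"]),
    SubDataStructure("field", "field", ["identifier", "description", "data_type", "length"]),
])

print(meta_model.structure_for("field").attributes)
```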

Optionally, in the embodiment of the present application, step S204 refers to a process of encapsulating the full-scale basic information using the meta-model, so as to obtain a full-scale data instance. The full-volume data instance represents full-volume base information stored according to rules indicated by the metamodel. This process involves organizing and storing basic information extracted from the data sources according to the structure and requirements of the metamodel to form data instances in a uniform format. These data instances can reflect the full view of the data source, including table structure, field attributes, data content, etc., which are the basis for subsequent data fusion and metadata automatic generation.

Optionally, in the embodiment of the present application, step S206 refers to the process of carrying out cluster merging on the full data instances through the retrieval enhancement generation model, based on a preset cluster merging mode, so as to generate the target metadata. The retrieval enhancement generation model is used for clustering the full data instances according to the preset cluster merging mode. It combines retrieval and generation techniques and can logically or physically consolidate the data instances according to rules in a knowledge base. The preset cluster merging mode comprises grouping according to the description information of the full data instances; that is, data instances with the same or similar characteristics are grouped according to descriptive information such as their business semantics and attribute features, and are then merged to generate the target metadata, which is used to guide subsequent data fusion and application development.
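The grouping step can be sketched in Python as follows. Note the substitution: in place of the retrieval enhancement generation model, this sketch uses a simple token-overlap (Jaccard) score between descriptions, purely so that the example is self-contained and runnable. The function names, the threshold of 0.5, and the sample descriptions are assumptions of this sketch.

```python
def similarity(a, b):
    """Jaccard overlap of lowercase word sets (a stand-in for the RAG model)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def group_by_description(instances, threshold=0.5):
    """Greedily place each instance into the first sufficiently similar group."""
    groups = []
    for inst in instances:
        for group in groups:
            # Compare against the first member, used as the group's representative.
            if similarity(inst["description"], group[0]["description"]) >= threshold:
                group.append(inst)
                break
        else:
            groups.append([inst])
    return groups


instances = [
    {"table_name": "t_order", "description": "customer order records"},
    {"table_name": "order_info", "description": "customer order records archive"},
    {"table_name": "sku", "description": "product catalogue"},
]

groups = group_by_description(instances)
print([[inst["table_name"] for inst in group] for group in groups])
```

Here the first two descriptions share three of four distinct words (0.75), so they fall into one group, while the third shares none and starts its own group.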

It should be noted that, in the process of obtaining the pre-created meta-model and the full basic information extracted from at least one data source, the sub-data structures of the meta-model and the granularities of the full basic information may be combined in a variety of ways. For example, the sub-data structures of the meta-model may include table-level metadata and field-level metadata, and the granularities of the full basic information may be the database level (database names and table names) and the field level (field names and field types). Each sub-data structure corresponds to one granularity; for example, table-level metadata corresponds to the information of the database and table, and field-level metadata corresponds to the detailed information of the field. In addition, the meta-model may also contain element-level metadata and dictionary-level metadata, corresponding to the granularity of the data item content and the dictionary content, respectively. The application is not limited in this regard.

In the process of packaging the full-volume basic information by using the meta-model to obtain the full-volume data instance, the storage rule and format of the full-volume data instance can be changed according to different requirements and scenes. For example, full-volume data instances may be stored in JSON format, containing detailed information for table-level, field-level, element-level, and dictionary-level metadata. Such information may include the name of the data source, the description of the table, the type and size of the fields, the content and type of the data item, the content and type of the dictionary, etc. The full data instance can also contain additional information such as version information, creation time, modification time and the like of the data so as to meet the requirements of data management and tracking. The application is not limited in this regard.
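The JSON-based encapsulation described above might look like the following Python sketch. The template keys, the `encapsulate` function, and the sample table are hypothetical; the application states only that instances may be stored in JSON and may carry additional information such as a creation time.

```python
import json

# Hypothetical template derived from a meta-model: which attributes are kept
# at the table level and at the field level.
TEMPLATE = {
    "table": ["data_source", "table_name", "description"],
    "field": ["field_name", "data_type", "length"],
}


def encapsulate(base_info, template=TEMPLATE, created_at=None):
    """Fill the template with extracted basic information.

    Attributes missing from the source become None; attributes the template
    does not list are dropped.
    """
    table = {key: base_info.get(key) for key in template["table"]}
    fields = [
        {key: column.get(key) for key in template["field"]}
        for column in base_info.get("fields", [])
    ]
    instance = {"table": table, "fields": fields}
    if created_at is not None:
        instance["created_at"] = created_at  # optional tracking information
    return instance


raw = {
    "data_source": "mysql.shop",
    "table_name": "t_order",
    "description": "customer orders",
    "fields": [
        {"field_name": "order_id", "data_type": "bigint"},
        {"field_name": "status", "data_type": "varchar", "length": 16},
    ],
    "row_count": 1200,  # not part of the template, so it is dropped
}

instance = encapsulate(raw, created_at="2024-11-30T00:00:00Z")
print(json.dumps(instance, indent=2))
```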

In the process of carrying out cluster merging on the full data instances through the retrieval enhancement generation model, based on a preset cluster merging mode, to generate target metadata, the preset cluster merging mode can be customized according to different business logic and data characteristics. For example, the grouping may be by the business domain of the data source, such as merging all finance-related tables into one metadata group, or by the update frequency of the data, such as processing data updated in real time separately from data not updated in real time. In addition, the cluster merging can also be based on the access pattern of the data or the query habits of users, so as to optimize the efficiency of data retrieval and access. The result of cluster merging can be a new metadata model, or an update and extension of an existing metadata model. The application is not limited in this regard.

Taking an e-commerce platform data fusion scenario as an example:

S1, firstly, a pre-created meta-model and full basic information extracted from at least one data source are required to be acquired. In this scenario, the meta-model includes at least two sub-data structures, such as table-level metadata and field-level metadata. The full base information includes at least two granularities, such as a field granularity and a record granularity. The table-level metadata corresponds to table information in the database, the field-level metadata corresponds to field information in the table, the field granularity provides information such as field names, field types and the like, and the record granularity provides specific data record contents.

S2, the full basic information is packaged by using the meta-model to obtain full data instances. In this process, the full basic information is organized into unified data instances according to the rules of the meta-model. For example, the meta-model defines a template in JSON format that contains the structure of the table-level metadata and the field-level metadata. The data in the full basic information is filled into the template to form the full data instances. The instances include the user order data, product information data, and the like of the e-commerce platform, and each data instance follows the structure of the meta-model, which ensures the consistency and understandability of the data.

S3, cluster merging is carried out on the full data instances through the retrieval enhancement generation model, based on a preset cluster merging mode, to generate target metadata. In this step, the preset cluster merging mode may include grouping according to the business description information of the full data instances. For example, the order data in different tables may be clustered together according to the status of the order (e.g., "to pay", "paid") to generate a new metadata model that contains a unified view of all related orders. The retrieval enhancement generation model plays a key role in this process: it not only clusters according to preset rules, but can also understand and process natural language descriptions, which makes the cluster merging more intelligent and accurate.

Through the flow, the automatic generation of metadata is realized, and the efficiency and quality of data fusion are obviously improved. In the E-commerce platform scene, the method can rapidly integrate the user order information from different data sources, reduce the complexity and error rate of manual operation, and improve the speed and accuracy of data processing. In addition, due to the automatic generation of metadata, the consistency and maintainability of the data are ensured, and a solid foundation is provided for subsequent data analysis and business decision. The method has good expansibility, can adapt to different business requirements and data structure changes, and provides a flexible and efficient solution for data fusion.

As an optional embodiment, carrying out cluster merging on the full data instances through the retrieval enhancement generation model, based on a preset cluster merging mode, to generate target metadata comprises the following steps:

carrying out cluster merging on the full data instances through the retrieval enhancement generation model based on the preset cluster merging mode, and determining a target aggregation instance, wherein the target aggregation instance comprises an instance corresponding to the table-level metadata, and the at least two sub-data structures comprise the table-level metadata;

generating target metadata based on the target aggregation instance, wherein the target metadata is configured to be assembled in a target machine language.

Optionally, in the embodiment of the present application, the foregoing manner of merging based on preset clusters refers to a method for logically or physically centralizing the full-volume data instance according to a specific rule or standard. This approach includes, but is not limited to, grouping according to descriptive information of business semantics, attribute features, etc. of the data instance. For example, in the data fusion scenario of the e-commerce platform, the preset cluster merging manner may aggregate the order data in different data sources according to the service description information of the order, such as "order status" or "product category". Such cluster merging helps to form a more complete and consistent view of the data, facilitating subsequent data analysis and decision support.

Optionally, in the embodiment of the present application, the target aggregate instance refers to an aggregated data instance obtained after processing by the retrieval enhancement generation model. These instances include instances corresponding to table-level metadata, and the at least two sub-data structures include the table-level metadata. Target aggregate instances may include, but are not limited to, table-level metadata, field-level metadata, and dictionary-level metadata. For example, in the data fusion of a travel platform, target aggregate instances may include travel order information from different data sources that is aggregated into a unified view, where the table-level metadata may contain the basic information of the order, the field-level metadata contains specific field information of the order, such as the order number and customer name, and the dictionary-level metadata may contain the enumerated values of the order status, such as "to pay" and "paid".

Optionally, in an embodiment of the present application, the target metadata refers to metadata generated based on the target aggregate instance, and the metadata is configured to be assembled according to a target machine language. Target metadata is data that describes other data and provides a framework and rules for managing and using that data. For example, in building a data warehouse, the target metadata may include table definitions, the data types and lengths of fields, index information, and the like. This metadata may be used to generate SQL statements that create tables and fields in a database, or to guide the flow of data during data migration and transformation. The target machine language may be SQL, a NoSQL query language, or any other programming language for data manipulation and storage. In this way, the target metadata not only provides a structured description of the data, but also ensures that the data can be efficiently migrated and used between different systems and platforms.
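As an illustration of assembling target metadata in SQL, the Python sketch below renders a `CREATE TABLE` statement from a small metadata dictionary. The dictionary layout and the function `to_create_table` are hypothetical, the emitted SQL is generic rather than tied to a particular database product, and the table and column names are assumed to be trusted because no quoting or escaping is performed.

```python
def to_create_table(meta):
    """Render target metadata as a generic SQL CREATE TABLE statement.

    Names are inserted as-is; a real system would validate or quote them.
    """
    columns = []
    for column in meta["fields"]:
        col_type = column["data_type"]
        if column.get("length") is not None:
            col_type += f"({column['length']})"
        columns.append(f"  {column['name']} {col_type}")
    body = ",\n".join(columns)
    return f"CREATE TABLE {meta['table']} (\n{body}\n);"


target_metadata = {
    "table": "orders",
    "fields": [
        {"name": "order_id", "data_type": "BIGINT"},
        {"name": "customer_name", "data_type": "VARCHAR", "length": 64},
    ],
}

ddl = to_create_table(target_metadata)
print(ddl)
```

Supporting another target language would amount to adding a second rendering function over the same metadata dictionary.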

It should be noted that, in the process of carrying out cluster merging on the full data instances through the retrieval enhancement generation model based on the preset cluster merging mode, the preset cluster merging mode can be designed in a diversified manner according to different service requirements and data characteristics. For example, the cluster merging mode can be performed according to the source of the data, the service attribute of the data or the time stamp of the data. In the e-commerce platform, order data can be clustered and combined according to dimensions such as commodity category, order state or customer region, and the like, so that a target aggregation instance is formed. In the financial field, transaction records can be clustered and combined according to the dimensions of transaction types, monetary ranges, account states and the like. In the medical health field, medical records can be clustered and combined according to dimensions such as patient diagnosis, treatment stage or medicine category. The application is not limited in this regard.

It should be further noted that, the generation of the target aggregate instance is not limited to table-level metadata, but may also include various sub-data structures such as field-level metadata, element-level metadata, dictionary-level metadata, and the like. For example, in building an integrated customer management system, the target aggregate instance may include the customer's basic information (table-level metadata), the customer's purchase history (field-level metadata), the customer's specific interaction record (element-level metadata), and the customer satisfaction survey's options (dictionary-level metadata). The design enables the target aggregation instance to comprehensively reflect the characteristics of the data and provide rich information for subsequent data applications. The application is not limited in this regard.

Finally, it should be noted that, when target metadata is generated based on the target aggregate instance, the assembly process can adopt different target machine languages according to the technical platform and application scenario. For example, on a cloud service platform, the target metadata may need to be assembled in the style of a RESTful API to be compatible with the cloud service interface; in a traditional relational database, the target metadata may need to be assembled in SQL to create table structures and indexes in the database; and on a big data processing platform, the target metadata may need to be assembled according to the language specifications of the Hadoop ecosystem to work with components such as HDFS and MapReduce. This flexibility enables the target metadata to accommodate different technical environments and meet diverse data processing requirements. The application is not limited in this regard.

According to the embodiment of the application, a metadata self-generating data fusion method based on a RAG (Retrieval-Augmented Generation) model is adopted. The full data instances are processed according to a preset cluster merging rule and are clustered and merged using the retrieval-augmented generation capability of the RAG model, and the target aggregate instance is determined. This ensures the comprehensiveness and consistency of data fusion and achieves the purpose of organically concentrating, logically or physically, data with different sources, formats, and characteristics, thereby realizing the technical effect of automated and intelligent data fusion.

Further, target metadata is generated based on the target aggregate instance, wherein the target metadata is configured to be assembled in a target machine language. Through an automatic metadata generation flow, the aggregate instance is converted into a machine-readable metadata format, so that the target metadata can be used directly to create database tables and define data fields. This improves metadata generation efficiency and ensures that the generated metadata is highly compatible with the target machine language, which in turn supports automation of the data fusion process and seamless integration among systems. The purposes of improving data processing efficiency, reducing human error, and enhancing data consistency are thereby achieved, realizing a highly automated and precise data fusion technique.

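As an optional scheme, performing cluster merging on the full data instances through the retrieval enhancement generation model based on the preset cluster merging mode and determining the target aggregate instance includes: setting the preset cluster merging mode; storing the preset cluster merging mode into a knowledge base corresponding to the retrieval enhancement generation model; and referencing the knowledge base through the retrieval enhancement generation model to automatically cluster and merge the full data instances and determine the target aggregate instance.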

Optionally, in the embodiment of the present application, setting the preset cluster merging mode refers to defining a set of rules or criteria that guide how the retrieval enhancement generation model logically or physically concentrates the full data instances. This includes, but is not limited to, determining which data should be aggregated together and how the data should be organized to form meaningful information units. For example, in medical data analysis, the preset cluster merging mode may be set according to the patient's diagnostic results, treatment type, or drug usage, so that similar medical events are clustered together. In the business intelligence field, clustering rules may be set according to product categories, sales areas, or customer groups. These preset cluster merging modes help extract valuable information from large volumes of complex data and support decision making. Storing the preset cluster merging mode into the knowledge base corresponding to the retrieval enhancement generation model means that these rules or criteria are stored in a structured database that the model can access and use. This knowledge base provides the context information that the retrieval enhancement generation model needs to perform cluster merging tasks more accurately. The data in the knowledge base may include, but is not limited to, classification rules, attribute weights, historical clustering results, and the like. For example, in a customer relationship management system, the knowledge base may contain classification rules for customer feedback, set based on the customer's purchase history, service interactions, and feedback content, so that the model can automatically identify and classify new customer feedback.
Referencing the knowledge base through the retrieval enhancement generation model to automatically cluster and merge the full data instances and determine the target aggregate instance means using the retrieval capability of the model to search for and reference related information in the knowledge base, and using its generation capability to create a new data structure. This process involves analyzing and understanding the full data instances and integrating the data according to the preset cluster merging mode. For example, in social media analysis, the model may reference the topic classification rules in the knowledge base, automatically cluster related posts and comments, and form a comprehensive view of a particular event or trend. In the financial field, the model may automatically aggregate transaction records into an activity overview of a customer account based on the transaction type and monetary range rules in the knowledge base. Such automated cluster merging not only improves the efficiency of data processing, but also enhances the readability and usability of the data.

It should be noted that, setting the preset cluster merging mode can be designed in a diversified manner according to different service requirements and data characteristics. For example, in the field of electronic commerce, cluster merging may be set according to the dimensions of commodity category, customer rating, sales area, etc., so as to aggregate similar commodity or customer feedback. In the medical health field, clustering rules can be designed according to the age of patients, the disease types, the treatment results and other dimensions, so that analysis and research on similar cases can be facilitated. In the financial industry, a cluster merging mode can be set according to the dimensions of transaction amount, transaction time, transaction type and the like, so that risk assessment and fraud detection can be conveniently carried out on transaction data. The application is not limited in this regard.

Storing the preset cluster merging mode into the knowledge base corresponding to the retrieval enhancement generation model means that the rules can be customized and updated according to different scenarios and requirements. For example, in the customer service field, the knowledge base may contain emotion analysis rules based on customer feedback so that the model can identify and classify the emotional tendency of the customer. In supply chain management, the knowledge base may contain clustering rules based on logistics information so that the model can track and optimize the flow of goods. In the educational field, the knowledge base may contain clustering rules based on student performance so that the model can identify the learning patterns and needs of students. These knowledge bases may be updated and maintained dynamically to accommodate changing data and business environments. The application is not limited in this regard.
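Illustratively, storing cluster merging rules in a knowledge base and looking them up again can be sketched as follows. This is a simplified stand-in and not the retrieval mechanism of the application: the classes `ClusterRule` and `RuleKnowledgeBase` are hypothetical, and keyword overlap is used here in place of whatever retrieval the retrieval enhancement generation model actually performs.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class ClusterRule:
    """A hypothetical cluster merging rule stored in the knowledge base."""
    name: str
    text: str
    keywords: Set[str] = field(default_factory=set)


class RuleKnowledgeBase:
    """In-memory stand-in for the knowledge base of a RAG model."""

    def __init__(self) -> None:
        self._rules: List[ClusterRule] = []

    def store(self, rule: ClusterRule) -> None:
        # Replace a rule with the same name so rules can be updated over time.
        self._rules = [r for r in self._rules if r.name != rule.name]
        self._rules.append(rule)

    def retrieve(self, query: str, top_k: int = 1) -> List[ClusterRule]:
        # Rank rules by keyword overlap with the query; drop rules with no overlap.
        tokens = set(query.lower().split())
        scored = [(len(r.keywords & tokens), r) for r in self._rules]
        scored = [pair for pair in scored if pair[0] > 0]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [rule for _, rule in scored[:top_k]]


kb = RuleKnowledgeBase()
kb.store(ClusterRule("orders", "Group tables whose description mentions orders.",
                     {"order", "orders"}))
kb.store(ClusterRule("users", "Group tables whose description mentions users.",
                     {"user", "users", "customer"}))

matches = kb.retrieve("e-commerce platform order data")
print([rule.name for rule in matches])
```

Because `store` replaces any rule with the same name, the rule set can be updated dynamically, in line with the dynamic maintenance described above.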

The process of automatically clustering and merging the full data instances by having the retrieval enhancement generation model reference the knowledge base, and thereby determining the target aggregate instance, can involve a variety of data processing techniques and algorithms. For example, in text data processing, the model may use natural language processing techniques to understand and classify text content. In the field of image recognition, the model may employ computer vision techniques to identify and cluster image data. In structured data analysis, the model may employ machine learning algorithms to discover patterns and associations in the data. Applying these techniques enables the model to perform cluster merging tasks accurately and generate meaningful target aggregate instances, thereby providing a powerful tool for decision support and data analysis. The application is not limited in this regard.

According to the embodiment of the application, the preset cluster merging mode is stored in the knowledge base corresponding to the retrieval enhancement generation model (RAG model), so that the full data instances can be clustered and merged automatically. The RAG model uses its retrieval capability to reference the cluster merging rules in the knowledge base and automatically identifies and processes the data instances, thereby determining the target aggregate instance. The data fusion process therefore no longer depends on manual operation; data is classified and merged automatically according to the preset rules. This reduces manual intervention and the errors it introduces, improves the efficiency and accuracy of data processing, and improves the quality and automation level of data fusion. It also improves the flexibility and scalability of data fusion and provides a more accurate and comprehensive data basis for subsequent data analysis and decision support.

As an optional scheme, the method includes: obtaining, through the retrieval enhancement model, the description information corresponding to each table-level metadata in the full data instances; grouping the full data instances according to the description information to obtain at least two groups of data instances respectively corresponding to at least two sub-data structures; and, for each group of data instances, merging according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance.

Optionally, in an embodiment of the present application, the above-mentioned retrieval enhancement model refers to a model that combines retrieval and generation technologies, specifically RAG (Retrieval-Augmented Generation). The model generates answers or content by referring to information in a knowledge base, and offers strong interpretability and customizability. It is suitable for many natural language processing tasks, such as question answering systems, document generation, and intelligent assistants. The advantages of the RAG model include strong generality, the ability to update knowledge in real time, and, with end-to-end evaluation, more efficient and accurate information services. In the application, the RAG model is used to process each table-level metadata in the full data instances to obtain the corresponding description information, which is a key input to the subsequent data fusion and automatic metadata generation processes.

Optionally, in the embodiment of the present application, the above description information refers to the schema attribute value included in the table-level metadata, which provides the business description of a table, that is, the business meaning and use of the table. For example, for a table of e-commerce platform user order data, the schema may be described as "e-commerce platform user order data", while for a table of travel platform user order information, the schema may be described as "travel platform user order information". These descriptions are key to understanding and distinguishing the different data tables. They help the RAG model identify and understand the content and context of the data tables, so that similar or related data tables are correctly grouped and merged during the data fusion process.

Optionally, in the embodiment of the present application, the grouping refers to logically classifying the data tables with similar or identical service semantics according to the description information of the table-level metadata in the full-volume data instance. For example, all of the data tables associated with the order may be grouped together, while the data tables associated with the user information may be grouped together. The grouping is based on the description information of the service logic and the data table, so that the data tables with the same or similar service attributes can be combined into a new aggregation instance in the subsequent data fusion process, thereby realizing the organic concentration and sharing of the data.
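Illustratively, grouping tables by their description information can be sketched as follows. This is only a sketch under stated assumptions: the topic keywords in `TOPIC_KEYWORDS`, the function `group_by_description` and the table names are hypothetical, and substring matching stands in for the semantic judgement that the RAG model would make.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical business topics and the keywords that signal them.
TOPIC_KEYWORDS: Dict[str, List[str]] = {
    "order": ["order"],
    "user": ["user profile", "account"],
}


def group_by_description(tables: List[Dict[str, str]]) -> Dict[str, List[str]]:
    """Group table names by the business topic found in their description."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for table in tables:
        description = table["description"].lower()
        # Take the first topic whose keyword occurs in the description.
        topic = next(
            (name for name, words in TOPIC_KEYWORDS.items()
             if any(word in description for word in words)),
            "ungrouped",
        )
        groups[topic].append(table["name"])
    return dict(groups)


tables = [
    {"name": "shop_orders", "description": "E-commerce platform user order data"},
    {"name": "trip_orders", "description": "Travel platform user order information"},
    {"name": "members", "description": "Platform account registration data"},
]

groups = group_by_description(tables)
print(groups)
```

In this sketch both order tables land in the same group even though they come from different platforms, which is the behaviour the grouping step is meant to achieve.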

Optionally, in an embodiment of the present application, the foregoing sub-data structure refers to a basic element that forms a data table, including table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata. These sub-data structures together define the structure and contents of a data table. For example, table-level metadata defines the basic information of a data table, field-level metadata defines the attributes of each field in the table, element-level metadata describes the data contents in the table, and dictionary-level metadata defines the dictionary entries used in the table. In the present application, for each group of data instances, merging is performed according to the attribute parameters of these sub-data structures to generate the target aggregate instance.
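Illustratively, the four sub-data structures can be modelled as nested record types. The sketch below is one possible layout, not the meta-model defined by the application: every class name and attribute name (`TableMeta`, `FieldMeta`, `ElementMeta`, `DictionaryEntry`, `source`, `identifier`, and so on) is an assumption made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DictionaryEntry:
    """Dictionary-level metadata: one coded value and its business meaning."""
    code: str
    label: str


@dataclass
class ElementMeta:
    """Element-level metadata: a description of data content in the table."""
    identifier: str
    description: str


@dataclass
class FieldMeta:
    """Field-level metadata: attributes of a single field."""
    identifier: str
    description: str
    data_type: str
    size: int
    dictionary: List[DictionaryEntry] = field(default_factory=list)


@dataclass
class TableMeta:
    """Table-level metadata plus the nested sub-data structures."""
    source: str
    identifier: str
    description: str
    fields: List[FieldMeta] = field(default_factory=list)
    elements: List[ElementMeta] = field(default_factory=list)


status = FieldMeta(
    identifier="order_status",
    description="Current state of the order",
    data_type="VARCHAR",
    size=16,
    dictionary=[DictionaryEntry("1", "paid"), DictionaryEntry("2", "shipped")],
)
orders = TableMeta(
    source="database A",
    identifier="orders",
    description="E-commerce platform user order data",
    fields=[status],
)
print(orders.fields[0].dictionary[1].label)
```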

Optionally, in an embodiment of the present application, the target aggregate instance refers to a new data instance obtained through the grouping and merging operation, where the new data instance includes merged table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata. The aggregate instance is the result of data fusion, which logically or physically aggregates data of multiple sources, formats, and characteristics to form a unified data model. For example, if there are two different e-commerce platform order data tables, they may be combined by the method of the present application into one target aggregate instance that contains the order data for both platforms while maintaining the integrity, accuracy and consistency of the data. Such an aggregated instance provides an enterprise with a comprehensive view of data sharing, supporting more efficient data analysis and decision making.

It should be noted that the process of obtaining, through the retrieval enhancement model, the description information corresponding to each table-level metadata in the full data instances may involve a plurality of different data sources and data types. For example, the data sources may include relational databases, non-relational databases, data warehouses, or data lakes, while the data types may encompass structured data, semi-structured data such as JSON, and unstructured data such as text documents. The application is not limited in this regard.

Further, when the full data instances are grouped according to the description information, the classification of the description information may be based on different business logic, data attributes, or data sources. For example, the data may be grouped by industry domain (e.g., financial, medical, educational), by sensitivity of the data (e.g., public data, internal data, secret data), or by update frequency of the data (e.g., real-time data, daily update data, historical archive data). The application is not limited in this regard.

Finally, for each group of data instances, merging is performed according to the attribute parameters of the corresponding sub-data structure, where the attribute parameters of the sub-data structure may include, but are not limited to, data format, version, quality, timeliness, and the like. For example, data instances having the same data format (e.g., CSV, XML) may be merged, data instances from the same version control system may be merged, or data instances meeting the same data quality criteria (e.g., accuracy, integrity) may be merged. In addition, merging can be performed according to the timeliness of the data (such as real-time data and historical data). The application is not limited in this regard. In this way, the data instances can be aggregated and fused flexibly according to different requirements and scenarios to generate target aggregate instances that meet specific requirements.

According to the embodiment of the application, the RAG (Retrieval-Augmented Generation) model is adopted to enhance data retrieval and generation. First, the description information corresponding to each table-level metadata in the full data instances is obtained through the retrieval enhancement model; this step uses the retrieval capability of the RAG model to identify and extract the key metadata descriptions, providing accurate input for data fusion. Then, the full data instances are grouped according to the description information to obtain at least two groups of data instances corresponding to at least two sub-data structures; this step is based on the business semantics in the description information and places data tables with similar business attributes in the same group, realizing the logical concentration and preprocessing of the data. Finally, for each group of data instances, merging is performed according to the attribute parameters of the corresponding sub-data structure, such as the attributes of the field-level metadata and dictionary-level metadata, to obtain the target aggregate instance; this step realizes automatic generation of metadata and physical concentration of data through automatic cluster merging rules. Through this series of processing steps, manual intervention is reduced and the efficiency and accuracy of data processing are improved, thereby improving the quality and automation level of data fusion.

As an optional scheme, merging, for each group of data instances, according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance includes: in the case that the sub-data structure corresponds to table-level metadata, performing the following operations on each group of data instances, where each group of data instances is regarded as a first group of data instances: additively filling the table-level data source attribute parameters of the data instances in the first group to obtain a target table-level data source attribute parameter; setting the table-level identification attribute parameters of the data instances in the first group to be filled according to a first-language business text to obtain a target table-level identification attribute parameter corresponding to the target aggregate instance; setting the table-level description attribute parameters of the data instances in the first group to be filled according to a second-language business text to obtain a target table-level description attribute parameter, where the first-language business text and the second-language business text use different languages; and generating the target aggregate instance based on the target table-level data source attribute parameter, the target table-level identification attribute parameter, and the target table-level description attribute parameter.

Optionally, in the embodiment of the present application, the target table-level data source attribute parameter refers to an attribute parameter obtained by merging source attributes of table-level metadata in each set of data instances in the data fusion process. This parameter typically contains all the data source information of the merged data instance. For example, if order tables from different databases are combined, the target table-level data source attribute parameters may include the names or identifications of all of these databases, separated by commas, such as "database A, database B, database C".

Optionally, in the embodiment of the present application, the target table-level identification attribute parameter refers to an attribute parameter obtained by merging the identification attributes of the table-level metadata in each group of data instances in the data fusion process. This parameter is typically used to provide a business identifier for the merged data tables. For example, if multiple order tables are combined, the target table-level identification attribute parameter may be set to "Order", which is an identifier filled in with first-language (e.g., English) business text that succinctly describes the main business content of the combined data table.

Optionally, in the embodiment of the present application, the target table-level description attribute parameter refers to an attribute parameter obtained by merging the description attributes of the table-level metadata in each group of data instances in the data fusion process. This parameter provides a detailed business description of the merged data table. For example, if multiple order tables are combined, the target table-level description attribute parameter may be set to a text meaning "includes order information from different platforms", which is a description filled in with second-language (e.g., Chinese) business text that details the data content and business context of the combined data table.

It should be noted that, in the case where the sub-data structure corresponds to table-level metadata, the operations performed on each group of data instances may be diversified according to different business requirements and data characteristics, which is not limited by the present application. For example, the additive filling of table-level data source attribute parameters may cover dimensions such as the name, version number, and data format of the data source, such as merging data source names from different versions of a database to form "database V1, database V2". The operation can accommodate the merging of data sources of different versions and can also be adjusted according to different data format requirements.

Further, for the setting of the table-level identification attribute parameter and the table-level description attribute parameter, diversified processing can be performed according to different service scenes and language requirements. For example, the first language business text may be English for identifying attribute parameters, such as unifying the identification of multiple Order forms as "Order", while the second language business text may be Chinese for describing attribute parameters, such as setting the description of the Order form as "Order information containing e-commerce platform and travel platform". The setting is not only suitable for data fusion of different language environments, but also can be customized according to different service fields and data contents. The target aggregation instance generated based on the target table level data source attribute parameter, the target table level identification attribute parameter and the target table level description attribute parameter can be generated in a diversified mode according to different data fusion rules and business logic. For example, aggregation can be performed according to the dimensions of the business importance of the data, the update frequency of the data, the access rights of the data, and the like, so as to generate an aggregation instance meeting the specific business requirements. The aggregation instance can be used for different data analysis, report generation or decision support systems, so that the flexibility and the practicability of data fusion are improved.
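Illustratively, the table-level merge described above can be sketched as follows. This is a minimal sketch under stated assumptions: the function `merge_table_level` and the key names are hypothetical, the two business texts are passed in by the caller although in the described scheme they would be produced by the model, and removing repeated source names during additive filling is an assumption of this example.

```python
from typing import Dict, List


def merge_table_level(
    instances: List[Dict[str, str]],
    identifier_text: str,
    description_text: str,
) -> Dict[str, str]:
    """Merge the table-level attributes of one group of data instances."""
    # Additive fill: keep every distinct source once, in first-seen order
    # (dropping repeats is an assumption of this sketch).
    sources: List[str] = []
    for instance in instances:
        if instance["source"] not in sources:
            sources.append(instance["source"])
    return {
        "source": ", ".join(sources),
        "identifier": identifier_text,    # first-language business text
        "description": description_text,  # second-language business text
    }


first_group = [
    {"source": "database A", "identifier": "t_order", "description": "orders"},
    {"source": "database B", "identifier": "order_info", "description": "orders"},
    {"source": "database C", "identifier": "orders", "description": "orders"},
]

merged = merge_table_level(
    first_group,
    identifier_text="Order",
    # Chinese text meaning "contains order information from different platforms".
    description_text="包含不同平台的订单信息",
)
print(merged["source"])
```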

Illustratively, consider a data fusion scenario for an e-commerce platform that requires integration of sales data from different regional subsidiaries in order to conduct global sales analysis. The following is a specific implementation flow:

S1. Design a unified meta-model. First, a meta-model is defined that includes table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata. This meta-model serves as the basis for data fusion.

S2. Extract the full basic information of the data sources. A data source exploration technique such as DataX or FlinkX is used to extract the full basic information from the databases of all regional subsidiaries, including fine-grained basic information on databases, tables, fields, records, dictionaries, and the like.

S3. Construct unified full data source instances. The extracted full basic information of the data sources is stored according to the definitions and rules of the meta-model to form unified data source instances.

S4. Define the meta-model instance cluster merging rules. Rules are defined for organically concentrating, logically or physically, data with different sources, formats, and characteristics. For example, all sales data tables are grouped according to business semantics.

S5. Store the rules in the RAG knowledge base. The defined cluster merging rules are stored in the RAG knowledge base so that the RAG model can process the data according to these rules.

S6. Perform cluster merging through the RAG model. The RAG model performs cluster merging on the encapsulated full meta-model instances according to the rules in the knowledge base and generates new aggregate instances.

S7. Additively fill the table-level data source attribute parameters. The table-level data source attribute parameters of the data instances in the first group are added together. For example, if the data comes from "North America" and "Europe", the target table-level data source attribute parameter may be filled as "North America, Europe".

S8. Set the table-level identification attribute parameter. The table-level identification attribute parameters of the data instances in the first group are set to be filled according to first-language (e.g., English) business text, for example "Sales Data".

S9. Set the table-level description attribute parameter. The table-level description attribute parameters of the data instances in the first group are set to be filled according to second-language (e.g., Chinese) business text, for example a text meaning "includes sales data of the North American and European subsidiaries".

S10. Generate the target aggregate instance. A target aggregate instance containing the sales data of all regional subsidiaries is generated based on the target table-level data source attribute parameter, the target table-level identification attribute parameter, and the target table-level description attribute parameter.

Through the flow, the aim of effectively integrating sales data distributed in different regional subsidiaries is fulfilled. The process not only improves the efficiency of data processing and reduces errors of manual operation, but also enhances the consistency and accuracy of data. Through automatic metadata generation and data fusion, enterprises can quickly obtain sales analysis at a global view angle, and more accurate business decisions are supported. In addition, due to the adoption of multi-language business text filling, the target aggregation instance can better adapt to language habits of different areas, and the readability and usability of data are improved. Finally, the method improves the automation and intelligence level of data fusion, and provides an overall, accurate and consistent data view for enterprises.

As an optional scheme, merging, for each group of data instances, according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance includes: in the case that the sub-data structure corresponds to field-level metadata, performing the following operations on each group of data instances, where each group of data instances is regarded as a second group of data instances: performing semantic deduplication on the field-level description attribute parameters and field-level data item attribute parameters of the data instances in the second group; setting the deduplicated field-level description attribute parameters to be filled according to a second-language business text to obtain target field-level description attribute parameters; setting the field-level identification attribute parameters of the data instances in the second group to be filled according to a first-language business text to obtain target field-level identification attribute parameters, where the first-language business text and the second-language business text use different languages; filling the field-level data type attribute parameters of the data instances in the second group according to their original values to obtain target field-level data type attribute parameters; filling the field-level size attribute parameters of the data instances in the second group according to the maximum value to obtain target field-level size attribute parameters; and generating the target aggregate instance based on the target field-level description attribute parameters, the target field-level identification attribute parameters, the target field-level data type attribute parameters, and the target field-level size attribute parameters.

Optionally, in the embodiment of the present application, the above-mentioned field-level metadata refers to a data structure that holds the detailed description of each field in a data table. The field-level metadata includes attributes such as the name, type, size, and precision of the field, which define the data characteristics and storage requirements of the field. For example, one piece of field-level metadata may contain a field name "customer_id", a field type "integer", a size "10", a precision "0", and a field description "unique identifier of the customer". Field-level metadata is an important component of a data table structure and provides the information needed for the storage, retrieval, and management of data.

Optionally, in the embodiment of the present application, semantic deduplication refers to the process of identifying and merging field descriptions with the same or similar business meaning in the data fusion process. This process ensures the consistency and accuracy of the field descriptions after data fusion. For example, when merging order information from different databases, two fields may both describe the customer name but have different field names, such as "customer_name" in one table and "name" in the other. Semantic deduplication recognizes these two fields as having the same meaning, and only one is retained in the target aggregate instance.
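Illustratively, semantic deduplication of fields can be sketched as follows. The sketch replaces the semantic judgement with a fixed lookup table, which is an assumption of this example rather than the approach of the application; the mapping `CANONICAL` and the function `dedupe_fields` are hypothetical.

```python
from typing import Dict, List

# Hypothetical synonym table mapping raw field names to one canonical concept.
# In the described scheme this judgement would come from the model, not a lookup.
CANONICAL: Dict[str, str] = {
    "customer_name": "customer_name",
    "name": "customer_name",
    "cust_name": "customer_name",
}


def dedupe_fields(fields: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one field per business meaning, preserving first-seen order."""
    kept: Dict[str, Dict[str, str]] = {}
    for item in fields:
        concept = CANONICAL.get(item["identifier"], item["identifier"])
        # setdefault keeps the first field seen for each concept.
        kept.setdefault(concept, {**item, "identifier": concept})
    return list(kept.values())


fields = [
    {"identifier": "customer_name", "description": "Name of the customer"},
    {"identifier": "name", "description": "Customer name"},
    {"identifier": "order_date", "description": "Date the order was placed"},
]

unique = dedupe_fields(fields)
print([item["identifier"] for item in unique])
```

A lookup table only covers names known in advance; comparing the field descriptions themselves, for example with text embeddings, would be closer to the semantic comparison the text describes.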

Optionally, in the embodiment of the present application, the first language service text and the second language service text refer to the texts, in different languages, used to fill the field-level identification attribute parameter and the field-level description attribute parameter during data fusion. These texts reflect the business meaning of the fields. For example, the first language service text may be English and used for the identification attribute parameter, such as "Order Date" for the field "order_date", and the second language service text may be Chinese and used for the description attribute parameter, such as the Chinese rendering of "Order Date" for the same field.

Optionally, in an embodiment of the present application, the field-level data type attribute parameter refers to metadata defining a field data type. This parameter specifies the type of data that the field may store, such as integers, floating point numbers, strings, etc. For example, one field level data type attribute parameter may be a "character type" corresponding to a field in which text data may be stored.

Optionally, in an embodiment of the present application, the field-level size attribute parameter refers to metadata defining a maximum storage size of a field. This parameter specifies the maximum number of characters or bytes that the field can store. For example, a field level size attribute parameter may be "255", indicating that the maximum character length that the field may store is 255 characters.

It should be noted that, when the sub-data structure corresponds to field-level metadata, the operations on the second group of data instances may vary with data characteristics and service requirements, which is not limited by the present application. For example, semantic deduplication of the field-level description attribute parameters and field-level data item attribute parameters may be considered along three dimensions: data content, data format, and data source. For data content, the semantic similarity of field descriptions may need to be compared; for data format, the field naming conventions of different data sources may need to be unified; and for data source, identical fields from different systems or departments may need to be identified and merged.

When the field-level identification attribute parameters are set, processing may vary along three dimensions: language, region, and business domain. For language, the first language service text may be English and the second language service text may be Chinese, to suit the reading habits of users in different regions; for region, the text may be adjusted to the language preference of the place where a subsidiary is located; and for business domain, the field identification text may be customized for different business scenarios, such as finance, healthcare, or education.

Filling of the field-level data type attribute parameters may be considered along three dimensions: data storage requirements, query efficiency, and data consistency. For storage requirements, a data type appropriate to the actual data needs to be selected; for query efficiency, choosing an appropriate data type can optimize the query performance of the database; and for data consistency, keeping data types consistent reduces errors during data fusion.

Filling of the field-level size attribute parameters may be considered along three dimensions: the maximum length of the field, performance optimization, and data integrity. For maximum length, the maximum value is selected so that all data can be stored; for performance optimization, a reasonable size can improve the efficiency of database operations; and for data integrity, the field size must accommodate all possible data values.

The target aggregate instance generated based on the target field-level description attribute parameter, the target field-level identification attribute parameter, the target field-level data type attribute parameter, and the target field-level size attribute parameter may be considered along three dimensions: availability, consistency, and business adaptability of the data. For availability, the aggregate instance provides a unified data view that users can conveniently access and analyze; for consistency, the aggregate instance safeguards the accuracy and reliability of the data; and for business adaptability, the aggregate instance can be customized and optimized for different service requirements. Through these varied operations, target aggregate instances that meet the requirements of different business scenarios can be generated, which improves the flexibility and effectiveness of data processing.

According to the embodiment of the application, a target aggregate instance that integrates information from a plurality of data sources can be generated based on the target field-level description attribute parameter, the target field-level identification attribute parameter, the target field-level data type attribute parameter, and the target field-level size attribute parameter. This instance provides a unified view of the data for data analysis and reporting while taking into account the needs of users of different languages, which improves the usability and accessibility of the data.
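The field-level filling rules can be sketched as follows, assuming the instances passed in have already been judged to describe the same field. The function name `merge_field_group` and the keys `name`, `schema`, `type`, and `size` are assumptions modeled on the attribute names used in this description. The English identifier and Chinese description would in practice be produced by the model and are passed in as arguments here. Taking the type from the first instance is also an assumption, since the description only says the original value is kept.

```python
def merge_field_group(instances: list[dict], name_en: str, schema_zh: str) -> dict:
    """Merge field instances that were judged to describe the same field."""
    return {
        "name": name_en,              # identification: first-language (English) text
        "schema": schema_zh,          # description: second-language (Chinese) text
        "type": instances[0]["type"],  # data type: original value carried over
        "size": max(inst["size"] for inst in instances),  # size: maximum value
    }


instances = [
    {"name": "customer_name", "schema": "客户名称", "type": "varchar", "size": 64},
    {"name": "name", "schema": "客户姓名", "type": "varchar", "size": 255},
]
merged = merge_field_group(instances, name_en="customer_name", schema_zh="客户名称")
print(merged)
```

Choosing the maximum size means every value from either source still fits in the merged field.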

As an alternative scheme, combining the attribute parameters of the corresponding sub-data structures for each group of data instances to obtain a target aggregate instance, wherein the method comprises the steps of executing the following operation on each group of data instances under the condition that the sub-data structures correspond to dictionary-level metadata, regarding one group of data instances each time as a third group of data instances, performing semantic deduplication on dictionary-level value parameters of each data instance in the third group of data instances to obtain target dictionary-level value attribute parameters, filling dictionary-level description attribute parameters and dictionary-level data type attribute parameters of each data instance in the third group of data instances according to original values to obtain target dictionary-level description attribute parameters and target dictionary-level data type attribute parameters, and generating the target aggregate instance based on the target dictionary-level value attribute parameters, the target dictionary-level description attribute parameters and the target dictionary-level data type attribute parameters.

Optionally, in an embodiment of the present application, the dictionary-level metadata refers to metadata related to dictionary type fields in the data table. Dictionary level metadata contains a collection of dictionary entries, each corresponding to a particular value and description. For example, in an order management system, the order status may be a dictionary field, and the dictionary-level metadata may include states of "to pay", "paid", "shipped", "completed", and the like, each corresponding to an integer value. Dictionary-level metadata is an important component of a data model that provides standardized classification and coding of data, facilitating consistency and comparative analysis of the data.

Optionally, in an embodiment of the present application, the dictionary-level value parameter refers to the value of a dictionary item, which is typically unique and is used to identify a particular dictionary item in a database. For example, in the order status dictionary field described above, each status, such as "to pay", may correspond to an integer value, such as 1, and this value is the dictionary-level value parameter. In the data fusion process, semantic deduplication of dictionary-level value parameters means that different values with the same business meaning are identified and merged, which ensures the uniqueness and consistency of dictionary items after data fusion.

Optionally, in an embodiment of the present application, the dictionary-level description attribute parameter refers to a text description of a dictionary item, and provides a detailed explanation and business meaning of the dictionary item. For example, for dictionary item "1" of the order status, its dictionary level description attribute parameter may be "to pay", this description helping the user understand the specific meaning of each dictionary value. Maintaining the original values of dictionary level description attribute parameters during the data fusion process means that specific descriptions of the dictionary entry in each data source are preserved to ensure that business meaning of the data is not lost.

Optionally, in an embodiment of the present application, the dictionary-level data type attribute parameter refers to metadata defining a data type of the dictionary entry value. This parameter specifies the type of storage of dictionary entry values, such as integers, strings, etc. For example, if the dictionary level value parameters of the order status are stored in integer form, then their dictionary level data type attribute parameters are "integer". Filling the dictionary data type attribute parameters according to the original values in the data fusion process means that the data type of each dictionary item value is kept unchanged in the generated target aggregation instance so as to ensure the compatibility and consistency of the data.

It should be noted that, when the sub-data structure corresponds to dictionary-level metadata, the operations on the third group of data instances may vary with business rules, data specifications, and system requirements, which is not limited by the present application. For example, semantic deduplication of dictionary-level value parameters may take business meaning, data format, and system compatibility into account. For business meaning, the same or similar business concepts in different data instances need to be identified; for example, the different codes that represent the "completed" state in different systems are unified into one standard value. For data format, codes in different formats need to be converted into a unified format to facilitate data comparison and processing. For system compatibility, the deduplicated dictionary-level value parameters need to be correctly identified and processed in each system.

When the dictionary-level description attribute parameters are filled, processing may vary along three dimensions: level of detail, language difference, and cultural difference. For level of detail, how detailed the description should be is determined by the service requirement; in some cases more detailed dictionary item descriptions may be needed. For language difference, the dictionary item descriptions need to be localized according to the language habits of different regions. For cultural difference, the same dictionary item may be understood and expressed differently against different cultural backgrounds.

Filling of the dictionary-level data type attribute parameters may be considered along three dimensions: data storage efficiency, processing performance, and data security. For storage efficiency, an appropriate data type needs to be selected to optimize the use of storage space; for processing performance, the processing efficiency of the data type needs to be considered, for example, integers are generally processed faster than character strings; and for data security, the choice of data type must not introduce security risks, for example, sensitive information should not be stored in plaintext.

According to the embodiment of the application, based on the target dictionary-level value attribute parameter, the target dictionary-level description attribute parameter, and the target dictionary-level data type attribute parameter, a unified target aggregate instance containing the deduplicated dictionary items together with their original service descriptions and data types can be generated. This instance provides a standardized dictionary reference for data analysis and reporting while ensuring the accuracy and usability of the data model after data fusion.
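As a rough sketch of the dictionary-level merge, the example below deduplicates dictionary items by value and keeps the description and data type as found in the source. Exact value matching stands in for the semantic deduplication performed by the model, and the function name `merge_dictionary_group` and the keys `value`, `schema`, and `type` are assumptions made for this illustration.

```python
def merge_dictionary_group(items: list[dict]) -> list[dict]:
    """Deduplicate dictionary items by value; keep description and type as found."""
    merged: dict = {}
    for item in items:
        # First occurrence wins; its original description and type are preserved.
        merged.setdefault(item["value"], dict(item))
    return list(merged.values())


items = [
    {"value": 1, "schema": "to pay", "type": "integer"},
    {"value": 2, "schema": "paid", "type": "integer"},
    {"value": 1, "schema": "to pay", "type": "integer"},  # same item from a second source
    {"value": 3, "schema": "shipped", "type": "integer"},
]
target_items = merge_dictionary_group(items)
print(target_items)
```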

The application is further illustrated by the following examples:

In the course of data sharing, data provided by different users may come from different channels, so its content, format, and quality vary widely. Sometimes the data format cannot be converted at all, or information is lost after conversion, which seriously hinders the flow and sharing of data across departments and software systems.

A data warehouse is a structured data environment that serves as the data source for decision support systems and online analytical applications. Data warehousing studies and solves the problem of obtaining information from databases. A data warehouse is integrated: its data comes from scattered operational data, and the required data is extracted from the original data and can enter the warehouse only after being processed, integrated, unified, and synthesized.

Data fusion organically brings together, logically or physically, data with different sources, formats, and characteristics, so as to provide comprehensive data sharing for enterprises. Designing a consistent data model and ensuring the integrity, accuracy, and consistency of the data are the key to data fusion, and metadata is the basis of and the link for building data models and sharing data.

RAG (retrieval-augmented generation, referred to in this application as the retrieval enhanced generation model) is a model that combines retrieval and generation techniques. It generates answers or content by referring to information in a knowledge base, offers strong interpretability and customizability, and is suitable for many natural language processing tasks such as question-answering systems, document generation, and intelligent assistants. The advantages of the RAG model include broad applicability, the ability to update knowledge immediately, and, through end-to-end evaluation, more efficient and accurate information services.

The application provides a RAG-based data fusion method with self-generated metadata. First, through a DSE (Data Source Exploration) process, the full basic information of the data sources is extracted from the data to be accessed, which differs in source, format, and characteristics, and this basic information is encapsulated with uniformly designed metadata (table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata), which completes the basic preparation of the metadata. Second, the metadata clustering and merging rules are defined and stored in the RAG knowledge base, which completes the definition of the metadata rules. Finally, the encapsulated full metadata instances are clustered and merged through the RAG model and the result is output, which completes the automatic generation of metadata.

The method integrates technologies such as DSE and RAG and uses artificial intelligence to turn the traditional manual process of generating metadata into an automatic one. Its implementation principle is described as follows:

1. Design a unified meta-model; extract the full basic information of the data sources; construct unified full data source instances based on the meta-model; define the meta-model instance clustering and merging rules and store them in the RAG knowledge base; cluster and merge the full meta-model instances through the RAG model to generate aggregate instances; and automatically generate metadata based on the aggregate instances.

Fig. 3 is a flowchart of a metadata generation method according to an embodiment of the present application, and as shown in fig. 3, a specific implementation procedure of the present application is described in detail as follows:

S302, designing a unified meta-model. The meta-model is defined and stored in JSON format and comprises table-level metadata, field-level metadata, element-level metadata, and dictionary-level metadata information. The meta-model is specifically shown as follows:

S304, extracting the full basic information of the data source. A data source exploration tool such as DataX or FlinkX is used to acquire the full basic information of the data source, including fine-grained basic information such as database names, table descriptions, field names, field lengths, field precision, field descriptions, data item contents, data item types, data item descriptions, dictionary contents, dictionary types, and dictionary descriptions.
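As a hedged illustration of what such exploration collects, the sketch below reads table and field information from an in-memory SQLite database using the Python standard library. It is a stand-in chosen for this example and does not use or represent the DataX or FlinkX interfaces; the function name `explore` and the output layout are assumptions.

```python
import sqlite3


def explore(conn: sqlite3.Connection) -> list[dict]:
    """Collect table and column information from a SQLite database."""
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    info = []
    for (table,) in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        info.append({
            "table": table,
            "fields": [{"name": c[1], "type": c[2]} for c in columns],
        })
    return info


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_name VARCHAR(64))")
basic_info = explore(conn)
print(basic_info)
```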

S306, constructing a unified full-volume data source instance based on the meta-model. And storing the extracted data source total basic information according to the definition and rules of the metamodel. Examples are as follows:
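As a rough illustration only (not the listing from the filing), encapsulation can be pictured as wrapping the extracted information in a dictionary shaped like the meta-model. The key names `table_infos`, `table_fields`, `table_elements`, and `table_dicts` are modeled on the names used in the rules of this description, and the inner keys and the function `encapsulate` are assumptions.

```python
import json


def encapsulate(source: str, table: dict, fields: list[dict],
                elements: list[dict], dicts: list[dict]) -> dict:
    """Wrap extracted basic information in a meta-model-shaped dictionary."""
    return {
        "table_infos": {"source": source, "name": table["name"],
                        "schema": table["description"]},
        "table_fields": [
            {"name": f["name"], "schema": f["description"],
             "type": f["type"], "size": f["size"]}
            for f in fields
        ],
        "table_elements": [{"field": e["field"], "data": e["data"]} for e in elements],
        "table_dicts": [
            {"value": d["value"], "schema": d["description"], "type": d["type"]}
            for d in dicts
        ],
    }


instance = encapsulate(
    source="db_a",
    table={"name": "t_order", "description": "订单表"},
    fields=[{"name": "status", "description": "订单状态", "type": "integer", "size": 2}],
    elements=[{"field": "status", "data": [1, 2]}],
    dicts=[{"value": 1, "description": "to pay", "type": "integer"}],
)
print(json.dumps(instance, ensure_ascii=False, indent=2))
```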

S308, defining the meta-model instance clustering and merging rules and storing them in the RAG knowledge base. The purpose of defining the meta-model clustering and merging rules is to organically bring together, logically or physically, data with different sources, formats, and characteristics; in other words, multiple tables with the same business attribute are merged into one table, and the metadata of that table is what needs to be created. Rule definition can be understood as a form of procedural prompt engineering. After the rules are defined, they are stored in the RAG knowledge base. Specific examples are as follows:

(1) Perform business-semantic grouping according to the schema value of the table-level metadata table_infos of the full meta-model instances.

(2) For each group, generate new meta-model table-level metadata table_infos, where the source value is filled by appending the source values of all instances in the group, separated by '', the name value is filled with business English words, and the schema value is filled with business Chinese words.

(3) For each group, generate new meta-model field-level metadata tables_fields, where semantic deduplication is performed according to the schema value of each field in tables_fields of the meta-model instances in the group and the data value of each element in the corresponding table_elements; the name value is filled with business English words, the schema value is filled with business Chinese words, the type value is filled with the original value, and the size value is filled with the maximum value.

(4) For each group, generate new meta-model dictionary-level metadata tables_dicts, where semantic deduplication is performed according to the value of each dictionary in tables_dicts of the meta-model instances in the group, and the type and schema values are filled with the original values.
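Rules (1) and (2) above can be sketched in code as follows. The semantic grouping is performed by the RAG model in the application; here a hypothetical callable `concept_of` (a plain lookup table in the example) stands in for that judgment. The separator default `","` is also an assumption, as are the function names.

```python
from collections import defaultdict
from typing import Callable


def group_by_schema(instances: list[dict], concept_of: Callable[[str], str]) -> dict:
    """Rule (1): group instances by the business concept of table_infos.schema."""
    groups = defaultdict(list)
    for inst in instances:
        groups[concept_of(inst["table_infos"]["schema"])].append(inst)
    return dict(groups)


def merge_table_infos(group: list[dict], name_en: str, schema_zh: str,
                      sep: str = ",") -> dict:
    """Rule (2): append all source values; fill name and schema with business words."""
    sources = [inst["table_infos"]["source"] for inst in group]
    return {"source": sep.join(sources), "name": name_en, "schema": schema_zh}


# Hypothetical lookup standing in for the model's semantic judgment.
concepts = {"订单表": "order", "订单信息": "order", "客户表": "customer"}
instances = [
    {"table_infos": {"source": "db_a.t_order", "name": "t_order", "schema": "订单表"}},
    {"table_infos": {"source": "db_b.order_info", "name": "order_info", "schema": "订单信息"}},
    {"table_infos": {"source": "db_a.t_customer", "name": "t_customer", "schema": "客户表"}},
]
groups = group_by_schema(instances, concepts.get)
merged = merge_table_infos(groups["order"], name_en="order", schema_zh="订单")
print(merged)
```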

S310, clustering and merging the full meta-model instances through the RAG model to generate new aggregate instances. RAG combines retrieval and generation and produces its output by referring to information in the knowledge base. The full meta-model instances are loaded into the RAG model, which performs the clustering and merging operation according to the rules defined in the knowledge base and generates new aggregate instances. Examples of meta-models 01 and 02 after aggregation in the third step are as follows:
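As a rough illustration of this step (not the aggregated listing itself), the sketch below retrieves the most relevant rules from a small in-memory knowledge base and assembles them with the instances into a prompt. The keyword-overlap retrieval, the rule wording, and the names `KNOWLEDGE_BASE`, `retrieve`, and `build_prompt` are all assumptions. No model is called; the resulting prompt would be passed to whichever generation model is deployed.

```python
import json
import re

KNOWLEDGE_BASE = [
    "Group instances by the schema value of table_infos.",
    "For each group, append all source values and fill name in English and schema in Chinese.",
    "Deduplicate fields semantically; keep the original type and the maximum size.",
    "Deduplicate dictionary values semantically; keep the original type and schema.",
]


def tokens(text: str) -> set[str]:
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def retrieve(query: str, rules: list[str], top_k: int = 2) -> list[str]:
    """Rank rules by the number of words shared with the query (toy retrieval)."""
    ranked = sorted(rules, key=lambda r: len(tokens(query) & tokens(r)), reverse=True)
    return ranked[:top_k]


def build_prompt(query: str, instances: list[dict]) -> str:
    """Combine retrieved rules and serialized instances into one prompt string."""
    lines = ["Rules:"] + [f"- {rule}" for rule in retrieve(query, KNOWLEDGE_BASE)]
    lines += ["Instances:", json.dumps(instances, ensure_ascii=False)]
    lines += ["Task: " + query]
    return "\n".join(lines)


query = "merge fields and keep maximum size"
prompt = build_prompt(query, [{"table_infos": {"source": "db_a", "schema": "订单表"}}])
print(prompt)
```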

S312, automatically generating metadata based on the aggregate instance. The specific steps are as follows:

(1) Assemble an SQL statement from the content of table_infos and automatically create the table;

(2) Assemble an SQL statement from the content of table_fields and automatically create the fields;

(3) Assemble an SQL statement from the content of table_dicts and automatically create the dictionary.
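Steps (1) and (2) can be sketched as follows, with SQLite used only so that the assembled statement can be executed in the example. The function name `build_create_table`, the layout of the aggregate instance, and the SQL dialect are assumptions. A production implementation would also validate identifiers before interpolating them into SQL.

```python
import sqlite3


def build_create_table(aggregate: dict) -> str:
    """Assemble a CREATE TABLE statement from table_infos and table_fields."""
    table = aggregate["table_infos"]["name"]
    columns = ", ".join(
        f'"{f["name"]}" {f["type"]}({f["size"]})' for f in aggregate["table_fields"]
    )
    return f'CREATE TABLE "{table}" ({columns})'


aggregate = {
    "table_infos": {"source": "db_a,db_b", "name": "orders", "schema": "订单"},
    "table_fields": [
        {"name": "order_id", "schema": "订单编号", "type": "INTEGER", "size": 10},
        {"name": "customer_name", "schema": "客户名称", "type": "VARCHAR", "size": 255},
    ],
}
sql = build_create_table(aggregate)
print(sql)

# Execute the statement to confirm that it is well formed.
conn = sqlite3.connect(":memory:")
conn.execute(sql)
```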

The application provides a data fusion method with self-generated metadata and a metadata-based data fusion design approach. It generates metadata automatically, can significantly improve the efficiency and quality of data fusion, and effectively overcomes the shortcomings of the manual approach.

From the description of the above embodiments, it will be clear to a person skilled in the art that the method according to the above embodiments may be implemented by means of software plus the necessary general hardware platform, but of course also by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) comprising several instructions for causing a terminal device (which may be a mobile phone, a computer, a server, or a network device, etc.) to perform the above-mentioned methods of the various embodiments of the present application.

The embodiment also provides a metadata generating device, which is used to implement the foregoing embodiments and preferred implementations; what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. While the device described in the following embodiments is preferably implemented in software, implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 4 is a block diagram of a metadata generation apparatus according to an embodiment of the present application, as shown in fig. 4, the apparatus including:

An obtaining module 402, configured to obtain a pre-created meta-model and full basic information extracted from at least one data source, where the meta-model includes at least two sub-data structures, the full basic information includes at least two granularities, and each sub-data structure has a granularity corresponding to it;

A packaging module 404, configured to use the meta-model to package the full-scale basic information to obtain a full-scale data instance, where the full-scale data instance represents the full-scale basic information stored according to a rule indicated by the meta-model;

A clustering module 406, configured to cluster and merge the full-volume data instances through a retrieval enhancement generation model based on a preset clustering and merging mode to generate target metadata, where the retrieval enhancement generation model is configured to cluster the full-volume data instances according to the preset clustering and merging mode, and the preset clustering and merging mode includes grouping according to description information of the full-volume data instances.

As an alternative scheme, the device is configured to cluster and merge the full-volume data instances through the retrieval enhancement generation model based on the preset clustering and merging mode to generate the target metadata in the following manner: clustering and merging the full-volume data instances through the retrieval enhancement generation model based on the preset clustering and merging mode to determine a target aggregate instance, where the target aggregate instance includes an instance corresponding to the table-level metadata and the at least two sub-data structures include the table-level metadata; and generating the target metadata based on the target aggregate instance, where the target metadata is set to be assembled according to a target machine language.

As an alternative scheme, the device is configured to cluster and merge the full-volume data instances through the retrieval enhancement generation model based on the preset clustering and merging mode to determine the target aggregate instance in the following manner: setting the preset clustering and merging mode and storing it in a knowledge base corresponding to the retrieval enhancement generation model, where the knowledge base is provided for reference by the retrieval enhancement generation model; and automatically clustering and merging the full-volume data instances to determine the target aggregate instance.

As an alternative scheme, the device is configured to cluster and merge the full-volume data instances through the retrieval enhancement generation model based on the preset clustering and merging mode to determine the target aggregate instance in the following manner: obtaining, through the retrieval enhancement generation model, the description information corresponding to each table-level metadata in the full-volume data instances; grouping the full-volume data instances according to the description information to obtain at least two groups of data instances respectively corresponding to the at least two sub-data structures; and merging each group of data instances according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance.

As an alternative scheme, the device is configured to merge each group of data instances according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance in the following manner: when the sub-data structure corresponds to the table-level metadata, performing the following operations on each group of data instances, where the group being processed at a given time is regarded as a first group of data instances: appending the table-level data source attribute parameters of the data instances in the first group of data instances to obtain a target table-level data source attribute parameter; filling the table-level identification attribute parameter of each data instance in the first group of data instances with first language service text to obtain a target table-level identification attribute parameter corresponding to the target aggregate instance; filling the table-level description attribute parameter of each data instance in the first group of data instances with second language service text to obtain a target table-level description attribute parameter; and generating the target aggregate instance based on the target table-level data source attribute parameter, the target table-level identification attribute parameter, and the target table-level description attribute parameter.

As an alternative scheme, the device is configured to merge each group of data instances according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance in the following manner: when the sub-data structure corresponds to field-level metadata, performing the following operations on each group of data instances, where the group being processed at a given time is regarded as a second group of data instances: performing semantic deduplication on the field-level description attribute parameter and the field-level data item attribute parameter of each data instance in the second group of data instances, and filling the deduplicated field-level description attribute parameter with second language service text to obtain a target field-level description attribute parameter; filling the field-level identification attribute parameter of each data instance in the second group of data instances with first language service text to obtain a target field-level identification attribute parameter, where the first language service text and the second language service text use different languages; filling the field-level data type attribute parameter of each data instance in the second group of data instances with its original value to obtain a target field-level data type attribute parameter; filling the field-level size attribute parameter of each data instance in the second group of data instances with the maximum value to obtain a target field-level size attribute parameter; and generating the target aggregate instance based on the target field-level description attribute parameter, the target field-level identification attribute parameter, the target field-level data type attribute parameter, and the target field-level size attribute parameter.

As an alternative, the device is used for merging attribute parameters of the corresponding sub-data structures for each group of data instances to obtain the target aggregate instance, and executing the following operation on each group of data instances under the condition that the sub-data structures correspond to dictionary-level metadata, wherein one group of data instances which execute the following operation each time are regarded as a third group of data instances, semantic deduplication is carried out on dictionary-level value parameters of each data instance in the third group of data instances to obtain target dictionary-level value attribute parameters, dictionary-level description attribute parameters and dictionary-level data type attribute parameters of each data instance in the third group of data instances are filled according to original values to obtain target dictionary-level description attribute parameters and target dictionary-level data type attribute parameters, and the target aggregate instance is generated based on the target dictionary-level value attribute parameters, the target dictionary-level description attribute parameters and the target dictionary-level data type attribute parameters.

It should be noted that each of the above modules may be implemented by software or hardware, and the latter may be implemented by, but not limited to, the above modules all being located in the same processor, or each of the above modules being located in different processors in any combination.

Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.

In an exemplary embodiment, the computer readable storage medium may include, but is not limited to, a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium in which a computer program can be stored.

An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.

In an exemplary embodiment, the electronic device may further include a transmission device connected to the processor, and an input/output device connected to the processor.

Specific examples in this embodiment may refer to the examples described in the foregoing embodiments and the exemplary implementation, and this embodiment is not described herein.

It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than that shown or described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple modules or steps of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

The above describes only preferred embodiments of the present application and is not intended to limit the present application; those skilled in the art may make various modifications and variations to it. Any modification, equivalent replacement, or improvement made within the principles of the present application shall fall within the protection scope of the present application.

Claims

1. A method for generating metadata, characterized in that the method comprises:

obtaining a pre-created meta-model and full basic information extracted from at least one data source, wherein the meta-model comprises at least two sub-data structures, the full basic information comprises at least two granularities, and each sub-data structure has a granularity corresponding thereto;

encapsulating the full basic information using the meta-model to obtain full data instances, wherein the full data instances represent the full basic information stored according to the rules indicated by the meta-model; and

clustering and merging the full data instances through a retrieval enhanced generation model, based on a preset clustering and merging method, to generate target metadata, wherein the retrieval enhanced generation model is used to cluster the full data instances according to the preset clustering and merging method, and the preset clustering and merging method includes grouping according to the description information of the full data instances.
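The encapsulation step can be illustrated with a minimal sketch. It assumes a meta-model with two hypothetical sub-data structures (`TableInstance` and `FieldInstance`) and a `granularity` key on each extracted record; none of these names come from the application, and the clustering step performed by the retrieval enhanced generation model is not shown.

```python
from dataclasses import dataclass

# Hypothetical sub-data structures of the meta-model; attribute names are illustrative.
@dataclass
class TableInstance:
    identifier: str
    description: str
    data_source: str

@dataclass
class FieldInstance:
    identifier: str
    description: str
    data_type: str
    size: int

# Each granularity of the basic information maps to one sub-data structure.
META_MODEL = {"table": TableInstance, "field": FieldInstance}

def encapsulate(basic_info):
    """Wrap each raw record in the sub-data structure matching its granularity."""
    instances = []
    for record in basic_info:
        record = dict(record)  # copy so the caller's records are not mutated
        structure = META_MODEL[record.pop("granularity")]
        instances.append(structure(**record))
    return instances

raw = [
    {"granularity": "table", "identifier": "t_user",
     "description": "user table", "data_source": "crm"},
    {"granularity": "field", "identifier": "name",
     "description": "user name", "data_type": "varchar", "size": 32},
]
full_data_instances = encapsulate(raw)
```

Keeping the granularity-to-structure mapping in one dictionary means a further sub-data structure (for example, a dictionary-level one) could be added without changing `encapsulate`.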

2. The method according to claim 1, characterized in that

the step of clustering and merging the full data instances through the retrieval enhanced generation model, based on the preset clustering and merging method, to generate the target metadata comprises:

clustering and merging the full data instances through the retrieval enhanced generation model, based on the preset clustering and merging method, to determine a target aggregate instance, wherein the target aggregate instance includes an instance corresponding to table-level metadata, and the at least two sub-data structures include the table-level metadata; and

generating the target metadata based on the target aggregate instance, wherein the target metadata is configured to be assembled according to a target machine language.

3. The method according to claim 2, characterized in that

the step of clustering and merging the full data instances through the retrieval enhanced generation model, based on the preset clustering and merging method, to determine the target aggregate instance comprises:

setting the preset clustering and merging method;

storing the preset clustering and merging method in a knowledge base corresponding to the retrieval enhanced generation model; and

automatically clustering and merging the full data instances by having the retrieval enhanced generation model reference the knowledge base, so as to determine the target aggregate instance.

4. The method according to claim 1, characterized in that:

acquiring, through the retrieval enhanced generation model, the description information corresponding to each table-level metadata in the full data instances;

grouping the full data instances according to the description information to obtain at least two groups of data instances corresponding respectively to the at least two sub-data structures; and

for each group of data instances, merging the data instances according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance.
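The grouping step can be sketched as follows. In the claimed method the grouping is carried out by the retrieval enhanced generation model; here a plain case and whitespace normalization stands in for that semantic matching, which is an assumption of this sketch, as are the `level` and `description` keys.

```python
from collections import defaultdict

def normalize(text):
    # Stand-in for the model's semantic matching: case and whitespace folding only.
    return " ".join(text.casefold().split())

def group_instances(instances):
    """Group instances by (sub-data structure, normalized description)."""
    groups = defaultdict(list)
    for inst in instances:
        groups[(inst["level"], normalize(inst["description"]))].append(inst)
    return dict(groups)

instances = [
    {"level": "table", "description": "User Table", "data_source": "crm"},
    {"level": "table", "description": "user  table", "data_source": "erp"},
    {"level": "field", "description": "User name", "data_source": "crm"},
]
groups = group_instances(instances)
```

Each resulting group can then be merged with the rule that applies to its sub-data structure.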

5. The method according to claim 4, characterized in that

the step of merging each group of data instances according to the attribute parameters of the corresponding sub-data structure to obtain the target aggregate instance comprises:

in the case where the sub-data structure corresponds to the table-level metadata, performing the following operations on each group of data instances, wherein the group of data instances on which the operations are performed at any one time is regarded as a first group of data instances:

adding and filling in the table-level data source attribute parameters of each data instance in the first group of data instances to obtain a target table-level data source attribute parameter;

setting the table-level identification attribute parameter of each data instance in the first group of data instances to be filled in according to a first-language business text to obtain a target table-level identification attribute parameter corresponding to the target aggregate instance, and setting the table-level description attribute parameter of each data instance in the first group of data instances to be filled in according to a second-language business text to obtain a target table-level description attribute parameter, wherein the first-language business text and the second-language business text use different languages; and

generating the target aggregate instance based on the target table-level data source attribute parameter, the target table-level identification attribute parameter, and the target table-level description attribute parameter.
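A minimal sketch of the table-level merge, under stated assumptions: "adding and filling in" is read here as accumulating the distinct data sources of the group, and the two business texts are passed in by the caller rather than produced by a model. The function name, the dictionary keys, and the choice of English and Chinese as the two languages are all hypothetical.

```python
def merge_table_group(group, identifier_text, description_text):
    """Merge one group of table-level instances into a single aggregate."""
    sources = []
    for inst in group:
        # Accumulate each distinct data source, preserving first-seen order.
        if inst["data_source"] not in sources:
            sources.append(inst["data_source"])
    return {
        "identifier": identifier_text,    # first-language business text (assumed English)
        "description": description_text,  # second-language business text (assumed Chinese)
        "data_source": sources,
    }

first_group = [
    {"identifier": "t_user", "description": "user table", "data_source": "crm"},
    {"identifier": "usr", "description": "User Table", "data_source": "erp"},
    {"identifier": "users", "description": "user table", "data_source": "crm"},
]
table_aggregate = merge_table_group(first_group, "user_info", "用户信息表")
```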

6. The method according to claim 4, characterized in that

in the case where the sub-data structure corresponds to field-level metadata, performing the following operations on each group of data instances, wherein the group of data instances on which the operations are performed at any one time is regarded as a second group of data instances:

semantically deduplicating the field-level description attribute parameters and the field-level data item attribute parameters of each data instance in the second group of data instances, and setting the deduplicated field-level description attribute parameters to be filled in according to a second-language business text to obtain a target field-level description attribute parameter;

setting the field-level identification attribute parameter of each data instance in the second group of data instances to be filled in according to a first-language business text to obtain a target field-level identification attribute parameter, wherein the first-language business text and the second-language business text use different languages;

filling in the field-level data type attribute parameter of each data instance in the second group of data instances with its original value to obtain a target field-level data type attribute parameter;

filling in the field-level size attribute parameter of each data instance in the second group of data instances with the maximum value to obtain a target field-level size attribute parameter; and

generating the target aggregate instance based on the target field-level description attribute parameter, the target field-level identification attribute parameter, the target field-level data type attribute parameter, and the target field-level size attribute parameter.
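The field-level rules (deduplicate descriptions, keep the original data type, take the maximum size) can be sketched as below. This is an illustration under assumptions: semantic deduplication is approximated by case and whitespace normalization, the "original value" of the data type is taken from the first instance in the group, the data item attribute is omitted, and all names are hypothetical.

```python
def merge_field_group(group, identifier_text):
    """Merge one group of field-level instances into a single aggregate."""
    seen, descriptions = set(), []
    for inst in group:
        # Approximate semantic deduplication with a normalized-text comparison.
        key = " ".join(inst["description"].casefold().split())
        if key not in seen:
            seen.add(key)
            descriptions.append(inst["description"])
    return {
        "identifier": identifier_text,
        "description": descriptions,
        "data_type": group[0]["data_type"],            # original value (first instance, assumed)
        "size": max(inst["size"] for inst in group),   # largest declared size wins
    }

second_group = [
    {"description": "User name", "data_type": "varchar", "size": 32},
    {"description": "user  NAME", "data_type": "varchar", "size": 64},
]
field_aggregate = merge_field_group(second_group, "user_name")
```

Taking the maximum size means the merged field can hold any value that fit in one of the source fields.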

7. The method according to claim 4, characterized in that

in the case where the sub-data structure corresponds to dictionary-level metadata, performing the following operations on each group of data instances, wherein the group of data instances on which the operations are performed at any one time is regarded as a third group of data instances:

performing semantic deduplication on the dictionary-level value attribute parameter of each data instance in the third group of data instances to obtain a target dictionary-level value attribute parameter;

filling in the dictionary-level description attribute parameter and the dictionary-level data type attribute parameter of each data instance in the third group of data instances with their original values to obtain a target dictionary-level description attribute parameter and a target dictionary-level data type attribute parameter; and

generating the target aggregate instance based on the target dictionary-level value attribute parameter, the target dictionary-level description attribute parameter, and the target dictionary-level data type attribute parameter.
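A comparable sketch for the dictionary-level merge follows. It assumes each instance carries a list of enumerated values under a hypothetical `values` key, approximates semantic deduplication by trimming and case folding, and takes the "original" description and data type from the first instance in the group.

```python
def merge_dictionary_group(group):
    """Merge one group of dictionary-level instances into a single aggregate."""
    seen, values = set(), []
    for inst in group:
        for value in inst["values"]:
            key = value.strip().casefold()  # stand-in for semantic comparison
            if key not in seen:
                seen.add(key)
                values.append(value)
    return {
        "values": values,
        "description": group[0]["description"],  # original value kept (assumed: first instance)
        "data_type": group[0]["data_type"],      # original value kept (assumed: first instance)
    }

third_group = [
    {"values": ["Male", "Female"], "description": "gender", "data_type": "string"},
    {"values": ["male ", "FEMALE", "Unknown"], "description": "gender", "data_type": "string"},
]
dictionary_aggregate = merge_dictionary_group(third_group)
```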

8. A metadata generation device, characterized in that the device comprises:

an acquisition module, configured to acquire a pre-created meta-model and full basic information extracted from at least one data source, wherein the meta-model includes at least two sub-data structures, the full basic information includes at least two granularities, and each sub-data structure has a granularity corresponding thereto;

an encapsulation module, configured to encapsulate the full basic information using the meta-model to obtain full data instances, wherein the full data instances represent the full basic information stored according to the rules indicated by the meta-model; and

a clustering module, configured to cluster and merge the full data instances through a retrieval enhanced generation model, based on a preset clustering and merging method, to generate target metadata, wherein the retrieval enhanced generation model is used to cluster the full data instances according to the preset clustering and merging method, and the preset clustering and merging method includes grouping according to the description information of the full data instances.

9. A computer-readable storage medium, characterized in that:

The computer-readable storage medium stores a computer program, wherein the computer program implements the steps of the method described in any one of claims 1 to 7 when executed by a processor.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that:

When the processor executes the computer program, the steps of the method described in any one of claims 1 to 7 are implemented.