CN120676067A - Multi-protocol extensible data acquisition system architecture and method with unified interface standard - Google Patents

Multi-protocol extensible data acquisition system architecture and method with unified interface standard

Info

Publication number
CN120676067A
Authority
CN
China
Prior art keywords
data
protocol
module
integrity
coefficient
Legal status
Pending
Application number
CN202511170946.7A
Other languages
Chinese (zh)
Inventor
陈斌
聂子林
夏冬冬
Current Assignee
Shenzhen Hualei Xuntou Technology Co ltd
Original Assignee
Shenzhen Hualei Xuntou Technology Co ltd
Application filed by Shenzhen Hualei Xuntou Technology Co ltd
Priority to CN202511170946.7A
Publication of CN120676067A
Status: Pending

Landscapes

  • Computer And Data Communications (AREA)

Abstract

The invention relates to the technical field of data acquisition and discloses a multi-protocol extensible data acquisition system architecture and method with a unified interface standard. The architecture comprises a data source management module that stores data source configuration information and assigns protocol identifiers; a protocol adaptation module comprising a parsing unit and a matching unit that perform data format conversion and protocol compatibility verification; a data conversion module that generates target data through format conversion and semantic mapping; modules that monitor data integrity based on an integrity algorithm and threshold comparison and evaluate data quality through a consistency model and threshold comparison; and a result output module that transmits early warning information to a target system. The method comprises the steps of data source configuration, protocol adaptation, data conversion, integrity and quality monitoring, and early warning. The invention improves the extensibility and data reliability of the system and is suitable for multi-scenario data acquisition.

Description

Multi-protocol extensible data acquisition system architecture and method with unified interface standard
Technical Field
The invention relates to the technical field of data acquisition, in particular to a multi-protocol extensible data acquisition system architecture and a method with unified interface standard.
Background
With information technology evolving rapidly, data has become a key production element, and the importance of its collection, processing and utilization is increasingly prominent. However, the data acquisition field currently faces a number of challenges that severely limit the efficient use of data and the expansion capability of systems.
From the viewpoint of data source diversity, with the wide application of technologies such as the Internet of Things, cloud computing and big data, data sources are growing explosively and are complex and varied in type. Different data sources may use completely different communication protocols, such as the Modbus and OPC UA protocols common in industry, HTTP and RESTful interfaces on the Internet, and ZigBee and MQTT protocols in sensor networks. A conventional data acquisition system is difficult to make compatible with multiple protocols, and an adaptation module often has to be developed separately for each protocol, so development cost is high, the development period is long, and maintenance is extremely difficult. Meanwhile, the configuration information of data sources is managed chaotically and lacks a unified storage and management mechanism, so the system struggles to quickly identify and invoke the configuration information of different data sources, seriously affecting the efficiency and accuracy of data acquisition.
In terms of data format conversion, the data formats produced by different data sources differ greatly, spanning structured data such as JSON, XML and CSV, and unstructured data such as text, images, audio and video. A traditional data acquisition system lacks unified standards and specifications for format conversion, and the conversion process is realized by manually writing code, which is inefficient and error-prone. In addition, understanding and extracting data semantics is also difficult: different data sources may define and interpret the same data differently, so semantic ambiguity arises when data is shared and exchanged, and the true meaning of the data cannot be accurately conveyed.
Data quality assurance is another key issue in the data acquisition process. During data collection, transmission and processing, problems such as missing data, erroneous data and inconsistent data may occur, seriously affecting the usability and reliability of the data. A traditional data acquisition system lacks a complete data quality monitoring and early warning mechanism, so data quality problems cannot be found and handled in time, low-quality data enters subsequent processing and analysis stages, and decision accuracy suffers.
The lack of scalability and compatibility is also an important challenge for current data acquisition systems. As business develops and demands change, data sources and data types keep increasing and updating, and conventional data acquisition systems have difficulty adapting quickly to these changes because of architectural limitations. When a new data source needs to be accessed or a new protocol supported, the whole system often has to be modified and upgraded on a large scale, which is costly, may require system downtime, and affects business continuity.
The traditional data acquisition system has obvious defects in the aspects of multi-protocol compatibility, data format conversion, data quality assurance, system expandability and the like, and a novel data acquisition system architecture and method capable of unifying interface standards, supporting multi-protocol expansion and guaranteeing data quality are urgently needed to meet the diversified demands of the modern data acquisition field.
Disclosure of Invention
The present invention is directed to a system architecture and method for multi-protocol scalable data acquisition with unified interface standard, so as to solve the problems set forth in the background art.
To achieve the above purpose, the invention provides the following technical solution: a multi-protocol extensible data acquisition system architecture with a unified interface standard, comprising:
The data source management module stores different data source configuration information into a data source knowledge base according to the data source type, and distributes a unique protocol identifier for each data source;
the protocol adaptation module comprises a protocol analysis unit and a protocol matching unit, wherein the protocol analysis unit calls a corresponding protocol analysis library according to a protocol identifier after receiving external data, converts original data into a standard intermediate format, and the protocol matching unit verifies protocol compatibility and confirms data access after verification;
The data conversion module comprises a format conversion unit and a semantic mapping unit, wherein the format conversion unit converts intermediate format data into a target data structure by utilizing a preset rule, and the semantic mapping unit extracts data semantic tags according to a metadata model;
the data checking module is used for calculating a data integrity coefficient through a data integrity algorithm based on the data semantic tag of the data conversion module;
the abnormality detection module is used for judging the data integrity state based on the data integrity coefficient of the data verification module and triggering an early warning rule on an abnormality result;
The data quality monitoring module is used for calculating a data consistency coefficient through a data consistency model based on the data semantic tag of the data conversion module;
And the data quality early warning module is used for judging the data quality state based on the data consistency coefficient of the data quality monitoring module and triggering an early warning rule on the abnormal result.
Preferably, in the data source management module, different data source configuration information is stored in a data source knowledge base according to a data source type, and the configuration information includes a communication protocol type and a data acquisition frequency.
Preferably, in the protocol adaptation module, the protocol parsing unit invokes the corresponding protocol parsing library according to the protocol identifier and converts the original data into a standard intermediate format; the system performs protocol compatibility verification on the data source and outputs the corresponding protocol configuration from the data source knowledge base; the protocol matching unit verifies whether the data format conforms to the target specification and supports dynamic expansion, and confirms data access after the verification passes.
Preferably, in the data conversion module, the format conversion unit converts the intermediate format data into the target data structure using preset rules, which include field mapping rules, data type conversion rules and timestamp alignment rules, and the semantic mapping unit extracts data semantic tags, which include entity identifiers, attribute relationship graphs and data dependency paths, according to the metadata model.
Preferably, in the data verification module, based on the data semantic tag transmitted by the data conversion module, the data integrity coefficient is calculated by the data integrity algorithm as follows:
Step S01, generating a check reference value for the field set of the target data structure by a hash digest algorithm, and determining the number of missing fields $N_{miss}$ based on the check reference value;
Step S02, calculating the actual data field missing rate:
$R_{miss} = N_{miss} / N_{total}$
where $R_{miss}$ denotes the field missing rate, $N_{miss}$ the number of missing fields, and $N_{total}$ the total number of fields;
Step S03, calculating the data integrity coefficient:
$C_{int} = 1 - R_{miss}$
where $C_{int}$ denotes the data integrity coefficient.
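As an illustration of steps S01 to S03, the following is a minimal Python sketch of the integrity calculation. The use of SHA-256 as the hash digest, the field names and the helper names are assumptions made for the example and are not prescribed by the invention.

```python
import hashlib
from typing import Iterable

def check_reference_value(fields: dict) -> str:
    """Step S01: generate a check reference value (hash digest) over the field set."""
    combined = "|".join(f"{k}={fields[k]}" for k in sorted(fields))
    return hashlib.sha256(combined.encode("utf-8")).hexdigest()

def integrity_coefficient(actual_fields: Iterable[str],
                          required_fields: Iterable[str]) -> float:
    """Steps S02-S03: C_int = 1 - N_miss / N_total."""
    required = set(required_fields)
    missing = required - set(actual_fields)             # N_miss
    total = len(required)                               # N_total
    miss_rate = len(missing) / total if total else 0.0  # R_miss
    return 1.0 - miss_rate                              # C_int

# Example: 1 of 5 expected fields is missing -> C_int = 0.8
record = {"device_id": "SN-123", "value": 25.5,
          "timestamp": "2023-10-01T12:00:00Z", "unit": "C"}
print(check_reference_value(record))
print(integrity_coefficient(record.keys(),
                            ["device_id", "value", "timestamp", "unit", "status"]))
```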
Preferably, the anomaly detection module receives the data integrity coefficient $C_{int}$ transmitted by the data verification module and compares $C_{int}$ with a preset integrity threshold $T_{int}$ to judge the data integrity state: if $C_{int} < T_{int}$, a data-missing early warning is triggered; if $C_{int} \ge T_{int}$, the data is judged to be complete.
Preferably, in the data quality monitoring module, based on the semantic tag transmitted by the data conversion module, the data consistency coefficient is calculated by the data consistency model as follows:
Step S01, calculating the entity attribute conflict rate:
$R_{conf} = N_{conf} / N_{attr}$
where $R_{conf}$ denotes the attribute conflict rate, $N_{conf}$ the number of conflicting attributes, and $N_{attr}$ the total number of associated attributes;
Step S02, calculating the data consistency coefficient:
$C_{con} = 1 - R_{conf}$
where $C_{con}$ denotes the data consistency coefficient.
Preferably, the data quality early warning module receives the data consistency coefficient $C_{con}$ transmitted by the data quality monitoring module and compares $C_{con}$ with a preset consistency threshold $T_{con}$ to judge the data quality state: if $C_{con} < T_{con}$, a data-conflict early warning is triggered; if $C_{con} \ge T_{con}$, the data is judged to be consistent.
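A minimal sketch of the consistency calculation and threshold comparison described in the two preceding clauses, assuming the consistency coefficient is the complement of the attribute conflict rate; the threshold value and function names are illustrative.

```python
def consistency_coefficient(conflicting_attrs: int, total_attrs: int) -> float:
    """C_con = 1 - N_conf / N_attr (complement of the attribute conflict rate)."""
    if total_attrs == 0:
        return 1.0
    return 1.0 - conflicting_attrs / total_attrs

def judge_quality(c_con: float, t_con: float = 0.9) -> str:
    """Compare the consistency coefficient with a preset consistency threshold T_con."""
    if c_con < t_con:
        return "data-conflict early warning"   # triggers the early warning rule
    return "data consistent"

# Example: 2 conflicting attributes out of 20 associated attributes -> C_con = 0.9
c = consistency_coefficient(2, 20)
print(c, judge_quality(c))                     # 0.9 data consistent
```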
Preferably, the architecture further comprises:
and the result output module is used for receiving the early warning information transmitted by the abnormality detection module and the data quality early warning module and transmitting the early warning information to the target application system through the API interface.
Preferably, the present invention further provides a multi-protocol extensible data acquisition method with a unified interface standard, applied to the above multi-protocol extensible data acquisition system architecture, the method comprising the following steps (an illustrative end-to-end sketch follows the steps below):
step S1, storing different data source configuration information into a data source knowledge base according to the data source type, and distributing a unique protocol identifier for each data source;
S2, calling a corresponding protocol analysis library according to the protocol identifier, verifying protocol compatibility, and confirming data access after verification;
S3, converting the intermediate format data into a target data structure by using a preset rule, and extracting a data semantic tag according to the metadata model;
Step S4, calculating to obtain a data integrity coefficient through a data integrity algorithm based on the data semantic tag transmitted by the data conversion module;
S5, judging the data integrity state according to the data integrity coefficient, and triggering an early warning rule on an abnormal result;
S6, calculating a data consistency coefficient through a data consistency model based on the data semantic tags transmitted by the data conversion module;
And S7, judging the data quality state according to the data consistency coefficient, and triggering an early warning rule for an abnormal result.
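Read as a linear pipeline, steps S1 to S7 can be sketched as follows. The function and parameter names are hypothetical, and the parsing and conversion stages are passed in as callables because their internals are described in the later embodiments.

```python
from typing import Callable

def run_pipeline(raw: bytes, protocol_id: str, kb: dict,
                 parse: Callable, convert: Callable,
                 t_int: float = 0.9, t_con: float = 0.9) -> dict:
    """Illustrative flow for steps S1-S7; real modules would replace the callables."""
    cfg = kb[protocol_id]                    # S1: configuration lookup via protocol identifier
    mid = parse(raw, cfg)                    # S2: protocol parsing + compatibility confirmation
    target, tags = convert(mid, cfg)         # S3: format conversion + semantic tag extraction
    required = cfg["required_fields"]
    c_int = 1 - len(set(required) - set(target)) / len(required)   # S4: integrity coefficient
    if c_int < t_int:                        # S5: integrity judgment and early warning
        print(f"[WARN] data missing from {protocol_id}: C_int={c_int:.2f}")
    conflicts = tags.get("conflicting_attributes", 0)
    total = tags.get("associated_attributes", 1)
    c_con = 1 - conflicts / total            # S6: consistency coefficient
    if c_con < t_con:                        # S7: quality judgment and early warning
        print(f"[WARN] data conflict in {protocol_id}: C_con={c_con:.2f}")
    return target
```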
Compared with the prior art, the invention has the beneficial effects that:
the data source management module stores different data source configuration information into the data source knowledge base according to the data source type, and distributes a unique protocol identifier for each data source, so that centralized management and unified identification of various data source configuration information are realized. The system can quickly identify and call configuration information of different data sources, and the efficiency and accuracy of data source management are greatly improved. When a new data source needs to be accessed, the data source can be quickly accessed only by adding configuration information into a data source knowledge base and distributing a protocol identifier, and the expansion cost and difficulty of a system are obviously reduced.
The protocol adaptation module can call a corresponding protocol analysis library according to the protocol identifier through the design of the protocol analysis unit and the protocol matching unit, convert the original data into a standard intermediate format, and verify whether the protocol compatibility and the data format meet the target specification and support dynamic expansion. The design effectively solves the problem of compatibility of multiple protocols of the traditional data acquisition system, so that the system can seamlessly access data sources of multiple different protocols, an adaptation module is not required to be independently developed for each protocol, and development cost and period are greatly reduced. Meanwhile, the protocol matching unit verifies the data format, so that the standardization and consistency of the access data are ensured, and a good foundation is laid for subsequent data processing.
The format conversion unit of the data conversion module converts the intermediate format data into a target data structure by utilizing a preset field mapping rule, a data type conversion rule and a time stamp alignment rule, so that efficient conversion among different data formats is realized. The semantic mapping unit extracts data semantic tags comprising entity identifiers, attribute relationship patterns and data dependency paths according to the metadata model, so that the problem of data semantic ambiguity is solved, the true meaning of the data can be accurately conveyed in the sharing and interaction processes, and the readability and usability of the data are improved.
The data integrity coefficient is calculated by the data verification module through a data integrity algorithm, so that the data missing condition can be accurately detected. The anomaly detection module judges the data integrity state based on the data integrity coefficient, and triggers an early warning rule for an anomaly result, so that real-time monitoring and early warning of the data integrity are realized, the problem of data missing is found in time, corresponding measures are taken, and the data integrity and reliability are ensured.
The data quality monitoring module calculates a data consistency coefficient through the data consistency model, and the data quality early warning module judges the data quality state and triggers the early warning rule based on the data consistency coefficient, so that the data consistency is effectively monitored and managed. The problem of data collision is found in time, low-quality data is prevented from entering a subsequent processing link, the quality and usability of the data are improved, and a reliable basis is provided for analysis and decision of the data.
The result output module transmits the early warning information transmitted by the abnormality detection module and the data quality early warning module to the target application system through the API interface, so that the early warning information is transmitted and processed in time, related personnel can respond to the data quality problem rapidly, and the intelligent and automatic level of the system is improved.
Drawings
FIG. 1 is a schematic diagram of the architecture of a unified interface standard multi-protocol extensible data acquisition system according to the present invention;
FIG. 2 is a schematic diagram of a data verification module;
FIG. 3 is a design diagram of the abnormality detection module.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1-3, the system comprises a data source management module, a protocol adaptation module, a data conversion module, a data verification module, an anomaly detection module, a data quality monitoring module and a data quality early warning module. The specific implementation mode is as follows:
The data source management module stores different data source configuration information into a data source knowledge base according to the data source type, and distributes a unique protocol identifier for each data source. The protocol adaptation module comprises a protocol analysis unit and a protocol matching unit, wherein the protocol analysis unit calls a corresponding protocol analysis library according to a protocol identifier after receiving external data, converts the original data into a standard intermediate format, and the protocol matching unit verifies protocol compatibility and confirms data access after verification.
The data conversion module comprises a format conversion unit and a semantic mapping unit, wherein the format conversion unit converts intermediate format data into a target data structure by using a preset rule, and the semantic mapping unit extracts data semantic tags according to the metadata model. The data check module calculates a data integrity coefficient through a data integrity algorithm based on the data semantic tag of the data conversion module. The abnormality detection module judges the data integrity state based on the data integrity coefficient of the data verification module, and triggers an early warning rule for an abnormality result. The data quality monitoring module calculates a data consistency coefficient through a data consistency model based on the data semantic tag of the data conversion module. The data quality early warning module judges the data quality state based on the data consistency coefficient of the data quality monitoring module, and triggers an early warning rule for an abnormal result.
The invention is described in further detail below in connection with specific examples:
Example 1:
In this embodiment, the core function of the data source management module is to implement standardized configuration management and unique identifier allocation for different types of data sources, ensure that the system can orderly identify and access various heterogeneous data sources, and provide a basic support for the subsequent data processing flow.
The data source management module performs classified management of data sources. Data sources are classified into types such as sensor data sources, database data sources, file data sources and network interface data sources according to their physical form, data generation mode and communication characteristics. For example, sensor data sources may be further subdivided into industrial sensors, environmental monitoring sensors and the like, whose data is typically transmitted as real-time streams over a particular industrial protocol; database data sources include relational databases (e.g., MySQL, Oracle) and non-relational databases (e.g., MongoDB), with data exchanged over a database connection protocol; file data sources cover files in CSV, Excel, JSON and other formats, acquired through a file system interface or a network file transfer protocol; and network interface data sources include API interfaces that provide data services over HTTP, WebSocket and other protocols. Through clear type division, the module can formulate differentiated configuration strategies for the characteristics of different types of data sources, improving management efficiency.
The module stores the configuration information of different data sources in a data source knowledge base. The configuration information mainly comprises two core parameters, the communication protocol type and the data acquisition frequency, and can also extensibly include auxiliary information such as the data source name, physical address, owning system, update period and authentication credentials. The communication protocol type is the key basis for data interaction between the data source and the system: a sensor data source may adopt industrial protocols such as Modbus RTU, Modbus TCP or OPC UA, a database data source needs connection protocols such as JDBC or ODBC, a file data source needs transmission protocols such as FTP, SFTP or HTTP, and a network interface data source needs interface protocols such as RESTful or GraphQL. The data acquisition frequency is set according to the characteristics of the data source and business requirements: for industrial sensors with extremely high real-time requirements (such as temperature and pressure sensors on a production line) it can be set to millisecond or second level so that equipment running-state data is collected in time, for a batch-updated business database (such as a transaction system database archived in the early morning every day) it can be set to once per day, and for an unstructured log file data source a timed acquisition task can be set according to the file generation period (such as hourly or daily).
When the configuration information is stored, the data source knowledge base adopts a structured storage mode, for example, a data table is constructed based on a relational database, the data source type is used as a classification index, each data source corresponds to a record, and the record contains fields of each configuration parameter. Through the structured storage, the module can quickly search and update the data source configuration information, and simultaneously support configuration management for a system administrator through a visual interface, such as adding new data sources, modifying the configuration of the existing data sources, deleting invalid data sources and the like. In addition, the knowledge base needs to be provided with a data backup and recovery mechanism, so that the safety and reliability of configuration information are ensured, and data loss caused by system faults or manual misoperation is avoided.
Assigning a unique protocol identifier to each data source is one of the key functions of the data source management module. The identifier is generated according to a specific coding rule, is guaranteed to be unique across the whole system, and reflects the type and protocol characteristics of the data source. For example, the identifier may be composed of a data source type code, a protocol type code and a serial number, where "S" represents a sensor data source, "DB" a database data source, "F" a file data source and "API" a network interface data source; the protocol type code may be an abbreviation of the protocol name, such as "MOD" for Modbus, "OPC" for OPC UA and "JDBC" for the JDBC protocol; and the serial number distinguishes different data sources under the same type and protocol, using incremental numeric encoding. For example, the identifier of a temperature sensor data source using Modbus TCP may be defined as "S-MOD-TCP-001", and the identifier of a MySQL database data source based on the JDBC protocol may be defined as "DB-JDBC-MySQL-002".
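A minimal sketch of the identifier scheme and a knowledge-base record along the lines described above; the helper name, field names and example values are illustrative assumptions.

```python
import itertools

_seq = itertools.count(1)

def make_protocol_identifier(source_type: str, protocol: str, variant: str = "") -> str:
    """Compose <type code>-<protocol code>[-<variant>]-<serial>, e.g. 'S-MOD-TCP-001'."""
    parts = [source_type, protocol] + ([variant] if variant else [])
    return "-".join(parts + [f"{next(_seq):03d}"])

# A knowledge-base record for one data source (structured storage, keyed by identifier)
sensor_id = make_protocol_identifier("S", "MOD", "TCP")        # -> "S-MOD-TCP-001"
knowledge_base = {
    sensor_id: {
        "source_type": "sensor",
        "protocol": "Modbus TCP",          # communication protocol type
        "acquisition_interval_s": 1,       # data acquisition frequency
        "address": "192.168.1.50:502",     # auxiliary info: physical address
        "backup_protocol": "MQTT",         # optional standby protocol
    }
}
print(sensor_id, knowledge_base[sensor_id]["protocol"])
```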
The unique protocol identifier functions throughout the data acquisition process. In the protocol adaptation stage, the protocol analysis unit can rapidly locate the corresponding protocol analysis library through the identifier, and can automatically complete protocol matching without manual intervention, in the data routing and processing process, the system can identify the type and the protocol characteristics of the data source according to the identifier and call corresponding processing logic and resources, and in the system maintenance and expansion, the identifier provides convenient indexes for the management and tracking of the data source, for example, configuration information, data flow and processing state of the related data source can be rapidly located by searching the specific protocol identifier.
In addition, the data source management module needs a data source state monitoring function, tracking the connection state of each data source and the execution of acquisition tasks in real time. For a sensor data source, the module can periodically send heartbeat packets to detect whether the device is online; if no response is received after several attempts, the device is marked as offline and an early warning is triggered. For a database data source, indicators such as connection pool usage and query response time can be monitored to find abnormal connections or performance bottlenecks in time. For a file data source, whether files are generated on time and whether file sizes are normal can be monitored, avoiding acquisition failures caused by missing or damaged files. Through state monitoring, the module can discover problems at the data source layer in time, guaranteeing the stability and reliability of the system.
In the aspect of dynamic adjustment of data acquisition frequency, the module supports automatic or manual adjustment of acquisition frequency according to service requirements or system load conditions. For example, during peak hours of industrial production, to monitor the operating state of the equipment more densely, the acquisition frequency of the sensor data sources can be temporarily increased from once every minute to once every second, and when the system detects that the connection load of a database data source is too high, the acquisition frequency can be automatically reduced so as to relieve the pressure of the database server. The dynamic adjustment mechanism is realized by being linked with a task scheduling module and a resource monitoring module of the system, so that the balance between the efficiency and the resource consumption of data acquisition work is ensured.
To meet the requirement of multi-protocol compatibility, the data source management module supports configuring multiple standby protocols for the same data source, to cope with situations such as a main protocol failure or a system upgrade. For example, a sensor data source can be configured with both the Modbus TCP protocol and the MQTT protocol; when the main protocol (Modbus TCP) cannot communicate because of a network port failure, the module automatically switches to the standby protocol (MQTT) to collect data, ensuring continuity of data transmission. The configuration and switching logic of the standby protocol is predefined in the data source knowledge base and interfaces seamlessly with the protocol parsing library of the protocol adaptation module.
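A sketch of the primary/standby protocol switch just described, assuming each configured protocol exposes an acquisition callable that raises a connection or timeout error on failure; the names are illustrative.

```python
def collect_with_failover(read_fns: dict, order=("Modbus TCP", "MQTT")):
    """Try the primary protocol first, then fall back to the configured standby protocol."""
    last_error = None
    for protocol in order:
        if protocol not in read_fns:
            continue
        try:
            return protocol, read_fns[protocol]()       # successful acquisition
        except (ConnectionError, TimeoutError) as exc:   # e.g. network port failure
            last_error = exc                             # record the failure, try the standby
    raise RuntimeError(f"all configured protocols failed: {last_error}")

# Usage: read_fns maps each configured protocol to its acquisition callable, e.g.
# protocol_used, data = collect_with_failover({"Modbus TCP": read_modbus, "MQTT": read_mqtt})
```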
Example 2:
in this embodiment, the protocol adaptation module is used as a key hub of the data acquisition system, and bears the tasks of protocol analysis and compatibility verification during external data access, and its core function is to convert the original data of different protocols into a standard intermediate format that can be processed by the system, and ensure the normalization and reliability of data access.
The operation of the protocol adaptation module starts with the receipt of external data by the protocol parsing unit. When a data source sends data to the system through a physical interface (such as a serial port or network port) or a network protocol (such as TCP/IP or HTTP), the protocol parsing unit first obtains the original format of the data and the corresponding protocol identifier, which was pre-assigned by the data source management module and uniquely identifies the communication protocol adopted by the data source (such as Modbus, OPC UA or MQTT). Based on the identifier, the protocol parsing unit invokes the corresponding parsing rule from the protocol parsing library built into the system. The protocol parsing library is a predefined rule set that gives a standardized definition of the syntax structure, data frame format and encoding mode of each protocol; for example, for the Modbus protocol the parsing rule needs to identify the slave address, function code, data field and check code in the data frame, and for API data in JSON format the parsing rule needs to extract key-value information according to the JSON Schema definition.
The core task of the protocol parsing unit is to convert the raw data into a standard intermediate format. The standard intermediate format is a unified data structure defined by the system that shields the low-level differences between protocols; for example, key-value pairs or a JSON-like structure are used to store data fields, including generic fields such as the timestamp, equipment identifier, data type and value. During conversion, the protocol parsing unit decodes, extracts and reorganizes the original data according to the protocol parsing rules. Taking the Modbus protocol as an example, the original data frame contains a raw byte stream; the parsing unit judges the data type (such as coil state or register value) according to the function code, extracts the corresponding start address and data values, and maps them to the device address and measured value fields of the intermediate format. For XML data returned by an HTTP interface, the parsing unit extracts the node content with an XML parser and converts it into structured fields of the intermediate format.
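To illustrate the conversion into the intermediate format, the sketch below parses a simplified Modbus read-holding-registers response (without the CRC check bytes) into the generic fields named above; the frame layout shown and the helper names are assumptions for the example.

```python
import struct
import time

def parse_modbus_frame(frame: bytes, device_map: dict) -> dict:
    """Map a simplified Modbus response frame onto the standard intermediate format."""
    slave_addr, function_code, byte_count = frame[0], frame[1], frame[2]
    payload = frame[3:3 + byte_count]
    # Function code 0x03: read holding registers -> 16-bit big-endian register values
    values = list(struct.unpack(f">{byte_count // 2}H", payload))
    return {
        "timestamp": time.time(),                    # generic field: acquisition time
        "device_id": device_map.get(slave_addr),     # generic field: equipment identifier
        "data_type": f"function_{function_code:#04x}",
        "values": values,                            # generic field: measured values
    }

# Example: slave 0x01, function 0x03, 2 data bytes holding register value 0x00EB (235)
frame = bytes([0x01, 0x03, 0x02, 0x00, 0xEB])
print(parse_modbus_frame(frame, {0x01: "S-MOD-TCP-001"}))
```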
After the data format conversion is completed, the system enters a protocol compatibility verification link. Protocol compatibility verification is performed by the protocol matching unit, which is aimed at ensuring that the protocol type, version and configuration parameters of the data source are consistent with the scope supported by the system. In the verification process, the system firstly invokes the protocol configuration information corresponding to the data source from the data source knowledge base, wherein the protocol configuration information comprises a protocol type, a version number, transmission parameters (such as baud rate, data bits and stop bits), security authentication information (such as a user name, a password and a certificate path) and the like. Then, the protocol matching unit checks:
Protocol type compatibility: check whether the protocol used by the data source is within the list of protocols supported by the system. The list is preconfigured by the developer, e.g., industrial protocols (Modbus, OPC UA, CANopen), Internet-of-Things protocols (MQTT, CoAP) and database protocols (JDBC, ODBC); protocols not in the list are denied access.
Protocol version compatibility: for a protocol supporting multiple versions (such as Modbus TCP v1.1 and v1.2), verify whether the protocol version used by the data source is compatible with the system parsing library, avoiding parsing errors caused by version differences.
Transmission parameter matching: for a serial communication protocol (such as Modbus RTU), verify whether parameters such as baud rate, data bits, stop bits and check mode are consistent with the system configuration; for a network protocol, verify whether the IP address, port number and timeout are correct.
Security authentication validity: if the protocol requires identity authentication (such as SSL certificate verification for HTTPS or user permission verification for a database), the protocol matching unit invokes the corresponding authentication interface to verify whether the identity credential of the data source is valid, preventing unauthorized devices from accessing the system.
After the protocol compatibility verification is passed, the protocol matching unit further executes data format verification to ensure that the converted intermediate-format data conforms to the target specification preset by the system. The target specification covers the integrity, format correctness and logical consistency of the data fields (a minimal validation sketch follows this list):
Field integrity: check whether the intermediate format data contains the mandatory fields required by the system, such as a timestamp, a data source identifier and at least one valid data field; data missing a mandatory field is considered invalid.
Format correctness: verify whether each data field conforms to the preset rules, for example, the timestamp must be an ISO 8601 string, numeric fields must be integers or floating-point numbers, and enumeration fields must match a preset list of enumeration values.
Logical consistency: check whether the logical relationships between data fields are reasonable, for example, the value of the "start time" field cannot be later than that of the "end time" field, and the device status field (such as "running" or "fault") must match the value ranges of related parameters (such as current and voltage).
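A minimal validation sketch covering the three kinds of checks above; the mandatory field list and error messages are illustrative assumptions, not the target specification itself.

```python
from datetime import datetime

MANDATORY = ("timestamp", "device_id", "values")

def verify_intermediate(record: dict) -> list:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    # Field integrity: every mandatory field must be present
    errors += [f"missing field: {f}" for f in MANDATORY if f not in record]
    # Format correctness: timestamp must be ISO 8601, values must be numeric
    ts = record.get("timestamp")
    if isinstance(ts, str):
        try:
            datetime.fromisoformat(ts.replace("Z", "+00:00"))
        except ValueError:
            errors.append("timestamp is not ISO 8601")
    if not all(isinstance(v, (int, float)) for v in record.get("values", [])):
        errors.append("non-numeric value field")
    # Logical consistency: start time must not be later than end time
    if "start_time" in record and "end_time" in record \
            and record["start_time"] > record["end_time"]:
        errors.append("start_time later than end_time")
    return errors

print(verify_intermediate({"timestamp": "2023-10-01T12:00:00Z",
                           "device_id": "S-MOD-TCP-001", "values": [23.5]}))  # []
```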
In addition to static verification, the protocol adaptation module also supports a dynamic extension mechanism to accommodate protocol types or data formats that may be added in the future. The dynamic extension mechanism is realized in the following ways:
Pluggable design of the protocol parsing library: the parsing library adopts a modular architecture, allowing developers to extend the range of supported protocols by adding new protocol parsing plug-ins (such as DLL files or Python scripts). A plug-in must follow the unified interface specification and implement core functions such as data parsing and field mapping, and can be integrated without modifying the underlying system code (a sketch of such a plug-in interface follows this list).
Dynamic loading of data format templates: for custom protocols or non-standard data formats, the system supports defining data parsing templates through a visual interface or configuration files. A template contains field extraction rules, data type conversion rules and protocol parameter configuration; for example, for a certain private protocol a rule such as "extract 4 bytes starting at byte 3 of the byte stream as the device ID, and take bytes 7 to 10 as the temperature value (with a hexadecimal-to-decimal conversion)" can be defined, and the template takes effect in real time once saved.
Compatibility across protocol version upgrades: when the existing protocol parsing library needs to be upgraded to support a new protocol version, the system allows parsing rules for multiple versions to be retained simultaneously, with the version in use designated through the data source configuration information, so that old data sources do not become inaccessible because of a version upgrade.
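A sketch of what such a unified plug-in interface could look like in Python, with a registry standing in for the pluggable parsing library; the class and method names, and the byte layout of the private protocol, are assumptions for the example.

```python
from abc import ABC, abstractmethod
from typing import Dict

class ProtocolParserPlugin(ABC):
    """Unified interface a protocol parsing plug-in could implement."""
    protocol_name: str = "unknown"

    @abstractmethod
    def parse(self, raw: bytes, config: dict) -> dict:
        """Convert raw protocol data into the standard intermediate format."""

_REGISTRY: Dict[str, ProtocolParserPlugin] = {}

def register_plugin(plugin: ProtocolParserPlugin) -> None:
    """Add support for a new protocol without touching the core system code."""
    _REGISTRY[plugin.protocol_name] = plugin

class PrivateProtocolParser(ProtocolParserPlugin):
    protocol_name = "PRIVATE-V1"

    def parse(self, raw: bytes, config: dict) -> dict:
        # Template-style rule: bytes 2..6 are the device ID, bytes 6..10 the value
        return {"device_id": raw[2:6].hex(), "value": int.from_bytes(raw[6:10], "big")}

register_plugin(PrivateProtocolParser())
print(_REGISTRY["PRIVATE-V1"].parse(bytes(range(12)), {}))
```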
In the whole data access process, the protocol adaptation module needs to record detailed log information, including data receiving time, data source identification, protocol type, key steps in the analysis process, verification results and the like. The log information is used for system debugging, fault checking and performance analysis, for example, when a certain data source frequently has protocol analysis errors, the log positioning is that the protocol analysis rule is wrong, the data format sent by the data source is abnormal, or the data is incomplete due to packet loss in the network transmission process.
In addition, the protocol adaptation module needs to have an error handling mechanism, and corresponding countermeasures are adopted for different types of abnormal conditions:
Protocol parsing failure: if the original data does not conform to the protocol syntax rules (such as a Modbus frame check error or a JSON syntax error), the system discards the data frame and records an error log; a retry mechanism can also be configured to re-request the data source to send the data after a set time interval.
Compatibility verification failure: if the data source protocol version is incompatible or the transmission parameters are wrong, the system refuses to access the data source and sends early warning information to the administrator through the result output module, prompting a check of the data source configuration or an upgrade of the protocol parsing library.
Data format verification failure: for data with missing fields or format errors, the system may choose to discard the data, populate a default value, or return error information to the data source; for example, data with a missing timestamp may be automatically populated with the current system time and marked as "refilled" for identification by the subsequent data verification module.
Example 3:
in this embodiment, the data conversion module serves as a bridge for connecting the protocol adaptation module with a subsequent data processing link, and has the core functions of further converting the standard intermediate format data output by the protocol adaptation module into a data structure meeting the requirements of a target system, extracting semantic tags of the data, and providing structured and semantic input for links such as data verification, quality monitoring and the like.
The core task of the format conversion unit is to convert the intermediate format data into a target data structure by using preset rules. The preset rules are conversion logic predefined according to the data model and business requirements of the target system, and mainly comprise field mapping rules, data type conversion rules and time stamp alignment rules. The field mapping rule is used for establishing the corresponding relation between the intermediate format data field and the target data structure field, and solves the difference problem of field naming, meaning and hierarchical structure among different systems. For example, the "device_id" field in the intermediate format may need to be mapped to the "device unique identification" field in the target data structure, and if the target system adopts a hierarchical data structure (e.g., nested JSON objects), the field mapping rules may also need to define the nesting hierarchy of fields, for example, the "sensor temperature" field in the intermediate format is mapped to the "device status sensor data temperature value" in the target structure.
The data type conversion rule is used for processing the difference of data types among different systems, so that the data can be correctly analyzed and used in the target system. Common data type conversion scenarios include converting string type timestamps (e.g., "2023-10-01 12:00") in an intermediate format to timestamp values (e.g., Unix timestamps) required by the target system, converting Boolean type "run status" fields (true/false) to enumeration values required by the target system ("0" for stop, "1" for run), and rounding or truncating floating point type temperature data (e.g., 23.5 ℃) according to the accuracy requirements of the target system. The data type conversion rule needs to strictly follow the data type definition of the target system, so that data storage errors or calculation anomalies caused by type mismatch are avoided.
The time stamp alignment rule is designed for a multi-source data fusion scene and is used for solving the problem that the time references of different data sources are inconsistent. For example, some sensor data sources may generate timestamps at device local time, while database data sources may use UTC time, and the format conversion unit needs to uniformly convert the timestamps of all data sources into system global time (e.g., utc+8 standard time). The time stamp alignment process comprises the steps of identifying the format (such as ISO8601, unix time stamp, custom format) of the original time stamp, analyzing the time value, adjusting the time zone offset and the like, so that all data points in the same data set are ensured to have comparability in the time dimension, and a basis is provided for subsequent operations such as time sequence analysis, data aggregation and the like.
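The sketch below combines the three kinds of rules: a field mapping table that builds the nested target structure, a precision rule as an example of type conversion, and a timestamp alignment helper that normalizes to the UTC+8 system time mentioned above. The mapping entries and function names are illustrative assumptions.

```python
from datetime import datetime, timezone, timedelta

FIELD_MAP = {"device_id": ("device", "unique_id"),                       # field mapping rules
             "sensor_temperature": ("device", "status", "temperature_value")}

def to_system_time(ts: str, source_tz_hours: int = 0) -> int:
    """Timestamp alignment: parse the source time, apply its offset, return a Unix timestamp."""
    dt = datetime.fromisoformat(ts).replace(tzinfo=timezone(timedelta(hours=source_tz_hours)))
    return int(dt.astimezone(timezone(timedelta(hours=8))).timestamp())  # UTC+8 system time

def convert(intermediate: dict) -> dict:
    """Apply field mapping and type conversion rules to build the nested target structure."""
    target: dict = {}
    for fld, value in intermediate.items():
        path = FIELD_MAP.get(fld, (fld,))
        node = target
        for key in path[:-1]:
            node = node.setdefault(key, {})        # build the nesting hierarchy
        if fld == "sensor_temperature":
            value = round(float(value), 1)         # type conversion: precision rule
        node[path[-1]] = value
    return target

print(convert({"device_id": "SN-123", "sensor_temperature": "23.54"}))
print(to_system_time("2023-10-01 12:00:00"))
```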
The function of the semantic mapping unit is to extract data semantic tags according to the metadata model, wherein the tags are used for describing business meaning, entity relation and dependency path of the data, so that the data is converted from simple numerical values or character strings to information with business readability. Metadata models are abstract modeling of knowledge in the data field, and generally include content such as entity definitions, attribute relationships, business rules, and the like. For example, in an industrial internet of things scenario, a metadata model may define that a "sensor" entity has properties such as "device number", "type", "installation location", etc., and that an "contain" relationship exists between a "production line" entity and a "sensor" entity.
The semantic tags extracted by the semantic mapping unit mainly comprise entity identifiers, attribute relationship maps and data dependency paths. The entity identifier is used for uniquely identifying a physical entity or a logical entity corresponding to the data, for example, the entity identifier in the sensor data may be a physical number (such as "SN-2023001") of the sensor, and the entity identifier in the service data may be an order number, a user ID, and the like. The accuracy of the entity identifier directly affects the traceability of the data in cross-system interactions, for example, when an anomaly occurs in certain sensor data, the corresponding physical device can be quickly located through the entity identifier.
The attribute relation graph is a graphical description of association relation among entities, and the dependency, inclusion, association and other relation among the entities and the attributes thereof are represented by the form of nodes and edges. For example, in a smart manufacturing scenario, an attribute relationship graph may represent that a "device" entity is connected to a "production line" entity by a "mount-in" relationship, and a "sensor" entity is connected to a "device" entity by a "belong-to" relationship, with the attributes (e.g., device model, sensor accuracy) of each entity stored as attribute values for the nodes. The attribute relationship graph not only helps to understand the business context of the data, but also can be used for data checksum quality monitoring, for example, when the "device number" attribute value of certain sensor data cannot find the corresponding record in the device entity, it can be determined as invalid data.
The data dependence path records the source and conversion history of the data in the process of generation, transmission and processing, for example, a certain piece of temperature data can be derived from the original measured value of a sensor A, and the data is generated after processing steps such as protocol analysis, format conversion, unit conversion and the like. The extraction of the data dependent path needs to combine the protocol identifier of the data source management module, the analysis log of the protocol adaptation module and the rule application record of the format conversion unit to form a complete data tracing chain. The function of the data dependence path is that when the data quality is in problem, a problem link can be located through tracing analysis, for example, unit conversion rule errors of a certain batch of data can be found, and the corresponding rule configuration errors in the format conversion unit can be traced back.
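A possible in-memory representation of the semantic tag, holding the entity identifier, the attribute relationship edges and the data dependency path; the class layout and example values are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SemanticTag:
    """Semantic label attached to a converted record (names are illustrative)."""
    entity_id: str                                                        # entity identifier
    relations: List[Tuple[str, str, str]] = field(default_factory=list)   # (subject, relation, object) edges
    lineage: List[str] = field(default_factory=list)                      # data dependency path

tag = SemanticTag(
    entity_id="SN-2023001",
    relations=[("SN-2023001", "belongs_to", "device-42"),
               ("device-42", "installed_in", "line-A")],
    lineage=["raw:S-MOD-TCP-001", "protocol-parse", "format-convert", "unit-convert"],
)
print(tag.entity_id, len(tag.relations), tag.lineage[-1])
```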
In practical applications, the format conversion unit and the semantic mapping unit generally need to cooperate. For example, when converting intermediate format data into a target structure, the format conversion unit needs to determine the hierarchical structure of the fields according to the entity relationship in the semantic tag, if a field belongs to the "position" attribute of the "equipment" entity, then the field is placed under the path of the "equipment information and position" in the target structure, and when extracting the attribute relationship map, the semantic mapping unit needs to refer to the field mapping result after format conversion to ensure that the tag is consistent with the target data structure. The collaboration mechanism of the two is realized through a shared metadata model, and the metadata model not only defines the semantic rules of the data, but also prescribes the organization mode of the data structure, so that the format conversion and semantic extraction process follow the unified business logic.
The data conversion module also needs to have a rule management function, so that a system administrator is allowed to configure and modify preset rules through a visual interface. The rule management interface typically includes modules such as a field mapping table, a data type conversion dictionary, a timestamp format library, and the like. The field mapping table shows the corresponding relation between the intermediate field and the target field in a table form, supports batch import/export and online editing, the data type conversion dictionary lists all supported type conversion rules, such as 'character string to integer', 'enumeration value mapping', and the like, an administrator can select proper conversion rules for the fields through a drop-down menu, and the timestamp format library predefines common time format templates (such as RFC3339 and Unix timestamps) and supports custom format expressions (such as matching specific time strings through regular expressions).
To ensure the accuracy of data conversion, a verification mechanism is required to be built in the module. After the rule configuration is completed, the system can automatically generate test cases, and the sample data is used for verifying whether the conversion result meets the expectations. For example, intermediate format data including "device_id: SN-123, timestamp:2023-10-01T12:00:00Z, value:25.5" is entered, and it is verified whether the converted target data correctly maps fields, converts the timestamp format, and retains numerical accuracy. If errors are found in the verification process, the system prompts an administrator to correct the errors, and the error rules are prevented from being applied to the actual data processing flow.
In addition, the data conversion module needs to support dynamic rule loading, when the data model of the target system is changed (such as adding fields and adjusting field types), an administrator can update preset rules without restarting the system, and the new rules take effect in real time. The dynamic rule loading mechanism is implemented by a hot deployment technique, for example, the rule is stored in a JSON format configuration file, the module periodically scans the configuration file for changes and reloads the rule, or the module receives a rule update request through an API interface and takes effect immediately.
When large-scale data are processed, the data conversion module adopts pipeline architecture design, and splits operations such as field mapping, type conversion, time stamp alignment, semantic extraction and the like into a plurality of processing nodes, each node is responsible for a specific conversion task, and the data sequentially pass through each node in a stream processing mode. The pipeline architecture can improve processing efficiency, support parallel computing and load balancing, for example, distribute data of different data sources to different processing threads, and avoid that the processing delay of a single data source affects the whole flow.
Example 4:
in this embodiment, the data checking module and the anomaly detection module together form a data integrity management system, and the core function of the data integrity management system is to quantitatively evaluate the data integrity through an algorithm based on the semantic tag output by the data conversion module and trigger an early warning for an anomaly state.
The core of the data verification module is the execution of a data integrity algorithm, and the algorithm realizes the integrity assessment of the target data structure through field level verification. The execution of the algorithm depends on semantic tags provided by the data conversion module, in particular information such as entity identifiers, attribute relationship graphs and the like, and is used for locating the association relationship and the verification range of the data fields. The algorithm specifically comprises the following steps:
First, a check reference value is generated for the field set of the target data structure using a hash digest algorithm. A hash digest algorithm is a one-way function that converts input data of arbitrary length into a digest value of fixed length (e.g., MD5 generates a 128-bit digest, SHA-256 a 256-bit digest). In this step, the algorithm combines all fields of the target data structure (e.g., device number, measurement value, timestamp) into one input string and generates a unique check reference value through the hash function. The check reference value represents the complete state of the field set; if a field is subsequently added, deleted or modified, the digest value changes significantly. For example, for data containing the three fields "device number", "temperature value" and "acquisition time", the combined string may be "device number_12345 temperature value_23.5 acquisition time_2023-10-01 12:00:00", and the digest value generated by the SHA-256 algorithm serves as the reference for complete data.
Next, the number of missing fields is calculated based on the check reference value. The system identifies fields not included in the actual data by comparing the actual data fields with the field list defined by the target data structure. The field list of the target data structure is predefined by the metadata model, which specifies the mandatory and optional fields each entity should contain. For example, the target structure of sensor data may specify "device number", "measurement value" and "acquisition time" as mandatory fields and "status description" as an optional field. During verification, if the actual data lacks any mandatory field (for example, "acquisition time" was not transmitted), it is counted as a missing record; missing optional fields are generally not counted, although whether they participate in the verification can be configured according to business requirements. The number of missing fields is denoted $N_{miss}$; this value directly reflects the integrity defect of the data fields.
The actual data field missing rate is then calculated as the ratio of the number of missing fields to the total number of fields:
$R_{miss} = N_{miss} / N_{total}$
where $R_{miss}$ denotes the field missing rate and $N_{total}$ denotes the total number of fields defined by the target data structure (including the mandatory fields and the optional fields that participate in the verification). For example, if the total number of fields is 10 and 2 mandatory fields are missing, the missing rate is 20%. The field missing rate is a value between 0 and 1; larger values indicate more serious data integrity problems.
Finally, the data integrity coefficient is calculated from the field missing rate:
$C_{int} = 1 - R_{miss}$
where $C_{int}$ denotes the data integrity coefficient, which takes a value in the range 0 to 1. When $C_{int} = 1$, the data fields are fully complete; when $C_{int} = 0$, the data fields are missing entirely. For example, if the field missing rate is 20%, the integrity coefficient is 0.8, indicating a relatively high level of data integrity.
The data integrity coefficient $C_{int}$ generated by the data verification module is transmitted to the anomaly detection module in real time, and that module performs the integrity state judgment. The anomaly detection module is configured with a preset integrity threshold $T_{int}$, set according to business demands and data criteria, typically a value between 0.8 and 0.95 (e.g. 0.9). The judgment logic is as follows:
If it is And judging that the data integrity does not reach the standard, and triggering data missing early warning. The early warning information comprises data source identification, integrity coefficient, missing field list and other contents, such as 'data source SN-001 integrity coefficient 0.75, missing field: acquisition time and state description'.
If it isAnd judging that the data is complete, and not triggering early warning.
In practical applications, the integrity threshold T supports dynamic configuration. A system administrator may set differentiated thresholds for different types of data sources through a background interface, such as a higher threshold (0.95) for real-time monitoring data (e.g., industrial sensor data) and a lower threshold (0.8) for non-real-time business data (e.g., log file data). Threshold configuration should take into account the importance, acquisition frequency and historical integrity performance of the data, so that an overly strict threshold does not cause frequent early warnings and an overly loose threshold does not allow invalid data to flow into the system.
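The threshold judgment and per-category configuration could look roughly like the following sketch; the category names and warning payload keys are assumptions, while the example threshold values follow the text.

```python
from typing import Optional, Set

# Integrity thresholds per data-source category (values follow the examples above).
INTEGRITY_THRESHOLDS = {
    "realtime_sensor": 0.95,        # e.g. industrial sensor data
    "non_realtime_business": 0.80,  # e.g. log file data
    "default": 0.90,
}

def check_integrity(source_id: str, category: str, coefficient: float,
                    missing: Set[str]) -> Optional[dict]:
    """Return a warning payload if the coefficient falls below the category threshold."""
    threshold = INTEGRITY_THRESHOLDS.get(category, INTEGRITY_THRESHOLDS["default"])
    if coefficient < threshold:
        return {
            "source": source_id,
            "integrity_coefficient": coefficient,
            "missing_fields": sorted(missing),
        }
    return None  # data judged complete, no early warning

print(check_integrity("SN-001", "realtime_sensor", 0.75,
                      {"acquisition_time", "status_description"}))
```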
The early warning trigger mechanism of the abnormality detection module is integrated with the system's message notification mechanism and supports multiple early warning modes (a minimal dispatcher sketch follows this list):

Log recording: the early warning information is written into the system log, with the time, data source, anomaly type and coefficient value recorded in detail for subsequent audit and fault tracing.

Interface alarm: the abnormal data source is highlighted in a conspicuous color (e.g., red) on the system monitoring interface, and a prompt box pops up to display the early warning details.

Message push: an early warning notification is sent to the designated administrator through email, SMS and instant messaging tools (such as WeChat Work and DingTalk) to ensure a timely response.

API push: the early warning information is pushed to an external system (e.g., an operations and maintenance management platform) through a preset API interface, triggering automated handling workflows (e.g., data re-acquisition, launching fault-troubleshooting scripts).
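A hedged sketch of how the four warning channels above might be fanned out; the three helper functions are placeholders for integrations (dashboard, messaging, external API) that the architecture leaves to the target systems.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("integrity-warning")

def mark_source_on_dashboard(source_id: str) -> None:
    """Placeholder: highlight the abnormal data source on the monitoring interface."""

def send_message(payload: dict, channels) -> None:
    """Placeholder: email / SMS / instant-messaging notification."""

def post_to_external_api(path: str, payload: dict) -> None:
    """Placeholder: push to an external system such as an O&M platform."""

def dispatch_warning(warning: dict) -> None:
    log.info("early warning: %s", json.dumps(warning))        # 1. log recording
    mark_source_on_dashboard(warning["source"])               # 2. interface alarm
    send_message(warning, channels=("email", "sms", "im"))    # 3. message push
    post_to_external_api("/ops/alerts", warning)              # 4. API push

dispatch_warning({"source": "SN-001", "integrity_coefficient": 0.75,
                  "missing_fields": ["acquisition_time"]})
```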
The whole flow of data checking and anomaly detection needs to keep a detailed processing log (a minimal log-entry sketch follows this list), whose content includes:

Data processing time: the specific time at which the data enters the verification module.

Data source identifier: the unique identifier assigned by the data source management module, such as "API-HTTP-003".

Check reference value: the field-set digest value generated by the hash algorithm, used for comparison to judge whether the data has changed.

Missing field details: field names, the entities they belong to, and whether the fields are mandatory.

Integrity coefficient calculation process: the specific values of M, N, R and C are recorded.

Threshold comparison result: the numerical comparison between C and T and the resulting decision.
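As referenced above, a minimal log-entry sketch; the field names and types are illustrative rather than a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List

@dataclass
class VerificationLogEntry:
    """One processing-log record for the check / anomaly-detection flow."""
    processed_at: datetime          # time the data entered the verification module
    source_id: str                  # e.g. "API-HTTP-003"
    check_reference: str            # hash digest of the field set
    missing_fields: List[dict] = field(default_factory=list)  # name, owning entity, mandatory flag
    m: int = 0                      # missing-field count
    n: int = 0                      # total checked fields
    r: float = 0.0                  # missing rate
    c: float = 1.0                  # integrity coefficient
    threshold: float = 0.9          # integrity threshold T in effect
    triggered_warning: bool = False

entry = VerificationLogEntry(
    datetime.now(), "API-HTTP-003", "ab12...",
    [{"name": "acquisition_time", "entity": "sensor_record", "mandatory": True}],
    m=1, n=10, r=0.1, c=0.9, triggered_warning=False,
)
print(entry)
```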
In order to meet the real-time verification requirements of large-scale data, the data verification module adopts a distributed computing architecture. Multiple check nodes process data from different data sources in parallel, and tasks are distributed through a load balancer to avoid overloading a single node. Each check node independently executes hash calculation, field comparison and coefficient calculation, and the results are aggregated at a central controller, which performs unified threshold judgment and early warning distribution. The distributed architecture significantly improves processing efficiency, supports real-time verification of tens of thousands of data records per second, and meets the performance requirements of scenarios such as the industrial Internet of Things and real-time data analysis.
In the dynamic extension scenario of data fields, the data verification module supports automatic adjustment of the check rules according to updates of the metadata model. For example, when a "geographic location" mandatory field is newly added to the target data structure, the system automatically updates the total field count N and incorporates the new field into the missing-field check. This mechanism is realized through real-time synchronization between the metadata model and the check logic, ensures that the check rules remain consistent with the data structure definition, and avoids check failures caused by field changes.
In addition, for data containing nested structures or complex relationships (e.g., hierarchical data in JSON format), the data verification module traverses all subfields using a recursive parsing scheme. For example, for a "device-sensor-measurement" three-level nesting structure, the system checks the mandatory fields of each level layer by layer, ensuring the integrity of the nested fields. The recursive parsing process uses the attribute relationship graph in the semantic tags to determine the hierarchical attribution and dependency relations of the fields, so that deeply nested fields are not omitted.
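A minimal recursive check over a nested structure, assuming a simple schema dictionary that marks mandatory leaves; a real module would derive this hierarchy from the attribute relationship graph rather than a hand-written schema.

```python
def check_nested(record: dict, schema: dict, path: str = "") -> list:
    """Recursively collect missing mandatory fields in a nested structure.

    `schema` maps a field name either to True (mandatory leaf) or to a nested
    schema dict for sub-objects; the schema shape is an illustrative assumption.
    """
    missing = []
    for name, rule in schema.items():
        full = f"{path}.{name}" if path else name
        if name not in record:
            missing.append(full)
        elif isinstance(rule, dict):            # descend one nesting level
            missing.extend(check_nested(record[name], rule, full))
    return missing

schema = {"device_number": True,
          "sensor": {"sensor_id": True,
                     "measurement": {"value": True, "unit": True}}}
data = {"device_number": "SN-123",
        "sensor": {"sensor_id": "T-01", "measurement": {"value": 23.5}}}
print(check_nested(data, schema))   # ['sensor.measurement.unit']
```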
The design of the data integrity algorithm balances computational efficiency and accuracy. The hash digest algorithm is selected according to data volume and security requirements: for lightweight data (e.g., a single sensor record) a computationally efficient algorithm such as MD5 is adopted, while for large-scale data sets (e.g., batch file data) a stronger algorithm such as SHA-256 is adopted. The algorithm also supports an incremental check mode: when the data fields have not changed, the historical check reference value can be reused directly, the repeated calculation steps are skipped, and processing efficiency is improved.
Example 5:
In this embodiment, the data quality monitoring module and the data quality early warning module together construct a data consistency management system. Its core function is to quantitatively evaluate data consistency through a model based on the semantic tags output by the data conversion module, and to trigger early warning for conflict states.
The core of the data quality monitoring module is the operation of a data consistency model, and the model realizes the evaluation of the data logic consistency through entity attribute conflict analysis. The model operates in dependence upon semantic tags provided by the data transformation module, particularly attribute relationship maps and data dependent paths therein, for identifying possible attribute associations and conflicts between entities. The model specifically comprises the following steps:
The entity attribute conflict rate is calculated first. An entity attribute conflict means that the same entity has inconsistent attribute values across different data sources or different processing links; for example, the "manufacturer" attribute of the same device is manufacturer A in the sensor data and manufacturer B in the business system data. The conflict rate is determined by the ratio of the number of conflicting attributes to the total number of associated attributes, and the calculation formula is:

P = K / A

wherein P represents the attribute conflict rate, K represents the number of conflicting attributes detected, and A represents the total number of associated attributes involved in the consistency check. Associated attributes are attributes whose business association is defined in the metadata model, such as "device number" and "device model", or "order number" and "order time". During the check, the system traverses all associated attribute pairs through the attribute relationship graph and judges whether the attribute values are consistent based on the business rules. For example, for the device whose "device number" is "SN-123", its "device model" attribute should be consistent between the sensor data and the asset management system data; if a discrepancy occurs, it is counted as a conflict.
The data consistency coefficient is then calculated from the attribute conflict rate, and the calculation formula is:

Q = 1 - P

wherein Q represents the data consistency coefficient, with a value range of 0 to 1. When Q = 1, all associated attribute values are completely consistent; when Q = 0, all associated attribute values are in conflict. For example, if the total number of associated attributes is 50 pairs and 5 conflicting pairs are detected, the conflict rate is 10% and the consistency coefficient is 0.9, indicating that data consistency is at a high level.
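A minimal sketch of the conflict-rate and consistency-coefficient computation, using exact equality as the conflict rule purely for illustration (the configurable rule types are described later in this embodiment).

```python
def consistency_coefficient(attribute_pairs):
    """Compute conflict rate P = K / A and consistency coefficient Q = 1 - P.

    `attribute_pairs` is an iterable of (attribute_name, value_in_source_a, value_in_source_b).
    """
    pairs = list(attribute_pairs)
    a_total = len(pairs)
    k = sum(1 for _, va, vb in pairs if va != vb)   # exact-equality conflict rule
    p = k / a_total if a_total else 0.0
    return k, p, 1.0 - p

pairs = [("device_model", "ModelX", "ModelY"),
         ("manufacturer", "A", "A")]
print(consistency_coefficient(pairs))  # K=1, P=0.5, Q=0.5
```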
The data consistency coefficient Q generated by the data quality monitoring module is transmitted in real time to the data quality early warning module, which performs the consistency state judgment. The data quality early warning module is configured with a preset consistency threshold S; the threshold is set according to business rules and data criteria, typically a value between 0.8 and 0.95 (e.g., 0.9). The judgment logic is as follows:

If Q < S, a data consistency conflict is judged to exist, and a data conflict early warning is triggered. The early warning information includes the data source identifier, the consistency coefficient, the list of conflicting attributes and the specific conflicting values, for example: "data source DB-002, consistency coefficient 0.85, conflicting attribute: device model (sensor data 'ModelX', business data 'ModelY')".

If Q >= S, the data is judged to be consistent, and no early warning is triggered.
The consistency threshold S supports differentiated configuration; a system administrator may set different thresholds based on the trust level of the data source. For example, a higher threshold (0.95) is set for homologous data (e.g., different tables of the same database) and a lower threshold (0.85) for heterologous data (e.g., sensor data and third-party system data). Threshold configuration should also consider the fault tolerance of the business scenario: for example, financial transaction data requires extremely high consistency (threshold 0.99), while log analysis data can tolerate a certain degree of conflict (threshold 0.8).
The early warning trigger mechanism of the data quality early warning module is integrated with the event management flow of the system and supports a multi-level response strategy (a minimal level-mapping sketch follows this list):

Primary early warning: when Q is close to the threshold S (e.g., only slightly above it), a yellow warning is triggered, and the administrator is prompted through an in-system notification to pay attention to data fluctuation.

Intermediate early warning: when Q < S but the serious-conflict level has not been reached (for example, the conflict rate is below 30%), an orange warning is triggered, the relevant personnel are notified by email and SMS, and a conflict data list is generated automatically.

Advanced early warning: when Q falls below a serious-conflict threshold (e.g., 0.5), a red warning is triggered, data access is suspended, an automatic repair process is started (e.g., retrying data acquisition or calling the data cleaning interface), and the technical team is notified to intervene.
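The tiered response could be mapped to levels roughly as follows; the width of the "near threshold" band is an assumption, while the 0.5 serious-conflict value follows the example in the text, and the orange band is simplified to everything between the two thresholds.

```python
def warning_level(q: float, s: float, severe: float = 0.5, band: float = 0.05) -> str:
    """Map a consistency coefficient Q to the tiered response described above."""
    if q < severe:
        return "red"      # suspend access, start automatic repair, notify the technical team
    if q < s:
        return "orange"   # email/SMS notification, generate conflict data list
    if q < s + band:
        return "yellow"   # in-system reminder about data fluctuation
    return "none"

for value in (0.96, 0.92, 0.85, 0.4):
    print(value, warning_level(value, s=0.9))
```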
The whole flow of data quality monitoring and early warning needs to record detailed audit logs, and log contents comprise:
Data processing time: the specific time at which the data enters the monitoring module.

Data source identifier: the unique identifier associated with the data source management module, such as "F-SFTP-001".

Associated attribute check details: the attribute pair names, the value from data source A, the value from data source B, and the conflict judgment rule applied (e.g., "exact string match", "fuzzy match threshold 80%").

Consistency coefficient calculation process: the specific values of K, A, P and Q are recorded.

Threshold comparison result: the numerical comparison between Q and S and the resulting early warning level.
To improve the consistency check efficiency for large-scale data, the data quality monitoring module adopts distributed in-memory computing technology (such as Spark Streaming). By loading the attribute relationship graph into distributed memory, each computing node can quickly query entity association relations and execute attribute value comparisons in parallel. For example, for millions of device records, the system can hash the device numbers to different nodes, each node independently checks the device attribute consistency of its corresponding bucket, and the final results are aggregated by an aggregation node, achieving responses within seconds.
When processing cross-entity association data (e.g., the association of order data with inventory data), the data quality monitoring module supports time-window-based deferred consistency checks. For example, the "stock quantity" attribute of certain order data needs to be consistent with the inventory system data within the same time window (e.g., 10 minutes). If a temporary conflict is caused by data synchronization delay, the system can set a buffer period (e.g., 30 minutes) and automatically retry the check within that period, avoiding data conflicts being judged erroneously because of network delay.
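A simplified sketch of such a buffered retry, assuming caller-supplied fetch functions for the two systems; the buffer length follows the example in the text, and the polling interval is an assumption.

```python
import time
from datetime import datetime, timedelta

def deferred_consistency_check(fetch_a, fetch_b,
                               buffer_minutes: int = 30,
                               retry_seconds: int = 60) -> bool:
    """Retry a cross-system comparison inside a buffer window before declaring a conflict."""
    deadline = datetime.now() + timedelta(minutes=buffer_minutes)
    while True:
        if fetch_a() == fetch_b():
            return True              # values converged, no conflict
        if datetime.now() >= deadline:
            return False             # still inconsistent after the buffer period: real conflict
        time.sleep(retry_seconds)    # wait and retry; tolerates synchronization delay

# Trivial usage: values already match, so the check passes immediately.
print(deferred_consistency_check(lambda: 5, lambda: 5, buffer_minutes=0))
```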
The conflict determination rules of the data consistency model support custom configuration to accommodate diversified business logic, as illustrated in the sketch after this list. For example:

Exact matching rule: the attribute values must be identical (e.g., ID card number, unique device identifier); suitable for primary-key attributes.

Fuzzy matching rule: a certain degree of difference in attribute values is allowed (e.g., simplified versus traditional forms of a name, different expressions of an address); the matching degree is calculated by a similarity algorithm (e.g., edit distance, cosine similarity), and a conflict is judged when the matching degree is below a preset threshold (e.g., 80%).

Business rule matching: consistency is judged on the basis of industry standards or enterprise-defined rules (e.g., the date format must be YYYY-MM-DD, a numerical value must lie in a specified interval); for example, an "age" attribute value must be greater than 0 and less than 150, otherwise a conflict is judged.
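Minimal examples of the three rule types referenced in the list above; difflib's similarity ratio stands in for the edit-distance or cosine-similarity algorithms named in the text, and the specific thresholds and rules are illustrative.

```python
from datetime import datetime
from difflib import SequenceMatcher

def exact_match(a: str, b: str) -> bool:
    return a == b                                               # primary-key style attributes

def fuzzy_match(a: str, b: str, threshold: float = 0.8) -> bool:
    return SequenceMatcher(None, a, b).ratio() >= threshold     # similarity-based comparison

def business_rule_age(value) -> bool:
    return 0 < int(value) < 150                                 # enterprise-defined interval rule

def business_rule_date(value: str) -> bool:
    try:
        datetime.strptime(value, "%Y-%m-%d")                    # date format must be YYYY-MM-DD
        return True
    except ValueError:
        return False

print(exact_match("SN-123", "SN-123"),
      fuzzy_match("No. 5 West Street", "No.5 West St."),
      business_rule_age(42),
      business_rule_date("2023-10-01"))
```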
In addition, the data quality monitoring module supports historical data comparison analysis: gradual conflicts (e.g., historical data inconsistencies caused by a slowly changing device model) are identified by tracking time-series changes in the attribute values of the same entity. The system traces the source of an attribute value through the data dependency path, compares the values at different points in time, and generates a conflict evolution trend report, providing a decision basis for data management.
The data quality early warning module is linked with the result output module and pushes early warning information to the target application system through the API interface. For example, when a consistency conflict occurs between the order data and the logistics data of an e-commerce platform, the early warning information can trigger the order system to automatically mark the abnormal order, the logistics system to generate an abnormal work order, and the customer service system to synchronously send a notification to the user, forming a cross-system collaborative processing flow.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (10)

1. A multi-protocol extensible data acquisition system architecture with a unified interface standard, characterized by comprising:
a data source management module, which stores configuration information of different data sources in a data source knowledge base according to the data source type and assigns a unique protocol identifier to each data source;
a protocol adaptation module, comprising a protocol parsing unit and a protocol matching unit, wherein after receiving external data the protocol parsing unit calls the corresponding protocol parsing library according to the protocol identifier to convert the raw data into a standard intermediate format, and the protocol matching unit verifies protocol compatibility and confirms data access after the verification passes;
a data conversion module, comprising a format conversion unit and a semantic mapping unit, wherein the format conversion unit converts the intermediate-format data into the target data structure using preset rules, and the semantic mapping unit extracts data semantic tags according to the metadata model;
a data verification module, which calculates a data integrity coefficient through a data integrity algorithm based on the data semantic tags of the data conversion module;
an anomaly detection module, which judges the data integrity state based on the data integrity coefficient of the data verification module and triggers warning rules for abnormal results;
a data quality monitoring module, which calculates a data consistency coefficient through a data consistency model based on the data semantic tags of the data conversion module;
a data quality warning module, which judges the data quality state based on the data consistency coefficient of the data quality monitoring module and triggers warning rules for abnormal results.

2. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: in the data source management module, configuration information of different data sources is stored in the data source knowledge base according to the data source type, the configuration information including the communication protocol type and the data acquisition frequency.

3. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: in the protocol adaptation module, the protocol parsing unit calls the corresponding protocol parsing library according to the protocol identifier and converts the raw data into a standard intermediate format; the system performs protocol compatibility verification on the data source and outputs the corresponding protocol configuration from the data source knowledge base; the protocol matching unit verifies whether the data format meets the target specification and whether dynamic extension is supported, and confirms data access after the verification passes.

4. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: in the data conversion module, the preset rules used by the format conversion unit to convert the intermediate-format data into the target data structure include field mapping rules, data type conversion rules and timestamp alignment rules, and the data semantic tags extracted by the semantic mapping unit according to the metadata model include entity identifiers, attribute relationship graphs and data dependency paths.

5. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: in the data verification module, the data integrity coefficient is calculated through the data integrity algorithm based on the data semantic tags transmitted by the data conversion module, specifically as follows:
step S01: generating a check reference value for the field set of the target data structure using a hash digest algorithm, and calculating the number of missing fields M based on the check reference value;
step S02: calculating the actual data field missing rate according to the formula R = M / N, wherein R represents the field missing rate, M represents the number of missing fields, and N represents the total number of fields;
step S03: calculating the data integrity coefficient according to the formula C = 1 - R, wherein C represents the data integrity coefficient.

6. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: the anomaly detection module receives the data integrity coefficient C transmitted by the data verification module and compares C with a preset integrity threshold T to judge the data integrity state; if C is less than T, a data missing warning is triggered; if C is greater than or equal to T, the data is judged to be complete.

7. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: in the data quality monitoring module, the data consistency coefficient is calculated through the data consistency model based on the semantic tags transmitted by the data conversion module, specifically as follows:
step S01: calculating the entity attribute conflict rate according to the formula P = K / A, wherein P represents the attribute conflict rate, K represents the number of conflicting attributes, and A represents the total number of associated attributes;
step S02: calculating the data consistency coefficient according to the formula Q = 1 - P, wherein Q represents the data consistency coefficient.

8. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that: the data quality warning module receives the data consistency coefficient Q transmitted by the data quality monitoring module and compares Q with a preset consistency threshold S to judge the data quality state; if Q is less than S, a data conflict warning is triggered; if Q is greater than or equal to S, the data is judged to be consistent.

9. The multi-protocol extensible data acquisition system architecture with a unified interface standard according to claim 1, characterized in that the architecture further comprises:
a result output module for receiving the warning information transmitted by the anomaly detection module and the data quality warning module and transmitting the warning information to the target application system through an API interface.

10. A multi-protocol extensible data acquisition method with a unified interface standard, applied to the multi-protocol extensible data acquisition system architecture with a unified interface standard according to any one of claims 1 to 9, characterized by comprising the following steps:
step S1: storing configuration information of different data sources in a data source knowledge base according to the data source type, and assigning a unique protocol identifier to each data source;
step S2: calling the corresponding protocol parsing library according to the protocol identifier, verifying protocol compatibility, and confirming data access after the verification passes;
step S3: converting the intermediate-format data into the target data structure using preset rules, and extracting data semantic tags according to the metadata model;
step S4: calculating a data integrity coefficient through the data integrity algorithm based on the data semantic tags transmitted by the data conversion module;
step S5: judging the data integrity state according to the data integrity coefficient, and triggering warning rules for abnormal results;
step S6: calculating a data consistency coefficient through the data consistency model based on the data semantic tags transmitted by the data conversion module;
step S7: judging the data quality state according to the data consistency coefficient, and triggering warning rules for abnormal results.
CN202511170946.7A 2025-08-21 2025-08-21 Multi-protocol extensible data acquisition system architecture and method with unified interface standard Pending CN120676067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202511170946.7A CN120676067A (en) 2025-08-21 2025-08-21 Multi-protocol extensible data acquisition system architecture and method with unified interface standard

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202511170946.7A CN120676067A (en) 2025-08-21 2025-08-21 Multi-protocol extensible data acquisition system architecture and method with unified interface standard

Publications (1)

Publication Number Publication Date
CN120676067A true CN120676067A (en) 2025-09-19

Family

ID=97047682

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202511170946.7A Pending CN120676067A (en) 2025-08-21 2025-08-21 Multi-protocol extensible data acquisition system architecture and method with unified interface standard

Country Status (1)

Country Link
CN (1) CN120676067A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination