
CN120257967B - Automatic verification method and device for deviation of clinical test scheme and computer equipment - Google Patents

Automatic verification method and device for deviation of clinical test scheme and computer equipment

Info

Publication number
CN120257967B
CN120257967B CN202510746150.5A CN202510746150A CN120257967B CN 120257967 B CN120257967 B CN 120257967B CN 202510746150 A CN202510746150 A CN 202510746150A CN 120257967 B CN120257967 B CN 120257967B
Authority
CN
China
Prior art keywords
data
consistency
result
language model
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202510746150.5A
Other languages
Chinese (zh)
Other versions
CN120257967A (en
Inventor
曹晓春
闻增玉
单彬
明鸣
苏宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Tigermed Consulting Co ltd
Original Assignee
Hangzhou Tigermed Consulting Co ltd
Filing date
Publication date
Application filed by Hangzhou Tigermed Consulting Co ltd filed Critical Hangzhou Tigermed Consulting Co ltd
Priority to CN202510746150.5A priority Critical patent/CN120257967B/en
Publication of CN120257967A publication Critical patent/CN120257967A/en
Application granted granted Critical
Publication of CN120257967B publication Critical patent/CN120257967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a method, a device and computer equipment for automatically verifying clinical trial protocol deviations. The method comprises: obtaining PD records and PD classification and grading data from a CTMS system and an EDC system, yielding two data sets for the same project, one from each system; analyzing the two data sets with a large language model to obtain two analysis results; performing a consistency check on the two data sets according to the two analysis results and comparing their differences to obtain a detection result; generating a report containing the consistency check and difference comparison according to the detection result; and outputting the report. Implementing the method can significantly improve verification efficiency, accuracy and system compatibility, reduce manual intervention and learning costs, and ensure seamless integration with existing workflows.

Description

Automatic verification method and device for deviation of clinical test scheme and computer equipment
Technical Field
The present invention relates to data processing methods, and more particularly to methods, apparatus and computer devices for automated verification of deviations from clinical trial protocols.
Background
In clinical trials, a protocol deviation (PD) is any instance in which the trial is not carried out according to the predetermined clinical trial protocol. For example, a subject misses a scheduled visit, receives an incorrect dose, or does not complete a required assessment. These deviations may be caused by investigators, subjects or other parties involved, and may affect the integrity of the trial, the reliability of the data or the safety of the subjects. Accurate recording and reporting of protocol deviations is therefore critical to ensuring trial compliance, data integrity and subject safety.
In clinical trial management, the CTMS (Clinical Trial Management System) and EDC (Electronic Data Capture) systems are core tools, and the protocol deviation data recorded in each differ in source and function. The clinical trial management system is software for managing the operation of clinical trials, focusing on trial administration and project management. Its main functions are: trial planning and tracking (managing the trial schedule, milestones and resource allocation); site management (tracking information on research sites and investigators); subject management (monitoring the progress of subject recruitment, screening and visits); document management (storing trial-related documents such as informed consent forms and ethics approvals); and finance and compliance (managing budgets and payments and ensuring compliance with regulatory requirements). The CTMS provides a global view that helps research teams optimize operations and keep the trial on schedule. The electronic data capture system is used to collect and manage clinical trial data, replacing traditional paper-based data collection. Its core functions are: directly collecting subjects' clinical data, such as medical history and laboratory results; applying built-in data validation rules to ensure data accuracy; monitoring data in real time to reduce data entry errors; and providing data security, regulatory compliance and an audit trail. The EDC system aims to improve data quality and collection efficiency, ensuring that data are reliable and traceable.
CTMS and EDC systems emphasize different functions, but they overlap. The CTMS focuses on managing trial operations (e.g., site information, subject status), while the EDC system focuses on collecting and validating clinical data. Because the two systems differ in function and in how data are entered, PD records may be inconsistent, missing or misclassified. These inconsistencies may lead to missing or erroneous data, affecting the reliability of the trial results and the validity of the statistical analysis. Inconsistent records may also cause compliance problems and increase regulatory risk. It is therefore important to verify the differences between the protocol deviation records in the CTMS and EDC systems.
Currently, protocol deviations in clinical trials are verified in two broad ways: manually and semi-automatically. Manual verification is the conventional approach: data related to protocol deviations are exported from the CTMS and EDC systems separately, and a data manager then compares the records by hand to check for discrepancies, omissions or duplicates. The discrepancies are consolidated into a report that notes the type of problem and its possible cause. This approach is labor-intensive and inefficient, particularly in large multi-center trials, and is prone to human error such as missed records or misjudgments. Manual verification also makes real-time monitoring difficult, which may affect trial progress. Semi-automated verification uses technical tools to extract data from the CTMS and EDC systems and convert them into a unified format, then applies simple rules for preliminary screening to identify discrepancies. It is more efficient than manual review and reduces repetitive work, but it still requires some technical skill, and writing and maintaining the scripts adds work. Descriptive text and complex protocol deviations remain difficult to process automatically, so manual review cannot be fully replaced.
Therefore, it is necessary to devise a new method that achieves significant improvements in verification efficiency, accuracy, and system compatibility, reduces human intervention and learning costs, and ensures seamless integration with existing workflows.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a method, a device and computer equipment for automatically checking deviation of clinical test schemes.
To achieve the above purpose, the invention adopts the following technical scheme: an automatic verification method for clinical trial protocol deviations, comprising:
acquiring PD records and PD classification and grading data from a CTMS system and an EDC system to obtain two data sets for the same project, one from the CTMS system and one from the EDC system;
analyzing the two data sets by using a large language model to obtain two analysis results;
performing a consistency check on the two data sets according to the two analysis results and comparing the differences to obtain a detection result;
generating a report containing the consistency check and difference comparison according to the detection result;
and outputting the report.
In a further technical scheme, analyzing the two data sets by using a large language model to obtain two analysis results comprises:
classifying the two data sets by using a large language model to obtain deviation record data and classification and grading data;
automatically mapping the fields of the different systems for the deviation record data and the classification and grading data, and parsing unstructured PD descriptions to extract key information;
and converting the key information into a unified internal data structure to obtain two analysis results.
In a further technical scheme, classifying the two data sets by using a large language model to obtain deviation record data and classification and grading data comprises:
dividing the two data sets into data blocks, classifying each data block by using the zero-shot classification capability of the large language model, and determining a final class with a voting mechanism to obtain deviation record data and classification and grading data.
In a further technical scheme, automatically mapping the fields of the different systems for the deviation record data and the classification and grading data and parsing unstructured PD descriptions to extract key information comprises:
constructing a prompt;
inputting the deviation record data and the classification and grading data, together with the prompt, into a large language model to parse the unstructured PD descriptions and obtain an analysis result in JSON format;
and deserializing the analysis result in JSON format into an object in a programming language to obtain the key information.
In a further technical scheme, performing the consistency check on the two data sets according to the two analysis results and comparing the differences to obtain a detection result comprises:
checking the internal consistency of the two analysis results by means of a large language model to obtain a consistency result;
comparing the two analysis results by subject and time, marking missing records and evaluating the consistency of the two sets of deviation record data to determine missing-record marks and the reasons for differences;
wherein the detection result comprises the consistency result, the missing-record marks and the reasons for differences.
In a further technical scheme, checking the internal consistency of the two analysis results by means of a large language model to obtain a consistency result comprises:
constructing a prompt according to the two analysis results;
inputting the two analysis results and the prompt into a large language model for internal consistency analysis to obtain a consistency result;
wherein the consistency result comprises a consistency verdict, a confidence level, the reason for any inconsistency, and a suggestion.
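The consistency-check step above can be illustrated with a minimal Python sketch. The patent does not disclose its prompt wording or the exact shape of the model's reply, so the JSON key names (`consistent`, `confidence`, `reason`, `suggestion`), the prompt text, and the helper names below are all assumptions made for illustration only.

```python
import json
from dataclasses import dataclass


@dataclass
class ConsistencyResult:
    """The four parts of a consistency result; field names are hypothetical."""
    consistent: bool
    confidence: float
    reason: str
    suggestion: str


def build_consistency_prompt(ctms_result: dict, edc_result: dict) -> str:
    """Assemble a prompt from the two analysis results (hypothetical wording)."""
    return (
        "Check the internal consistency of the two parsed PD data sets below.\n"
        'Reply with JSON only, using the keys "consistent", "confidence", '
        '"reason" and "suggestion".\n'
        f"CTMS: {json.dumps(ctms_result, ensure_ascii=False)}\n"
        f"EDC: {json.dumps(edc_result, ensure_ascii=False)}"
    )


def parse_consistency_reply(reply: str) -> ConsistencyResult:
    """Validate the model's JSON reply and convert it into a typed object."""
    data = json.loads(reply)
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return ConsistencyResult(
        consistent=bool(data["consistent"]),
        confidence=confidence,
        reason=str(data.get("reason", "")),
        suggestion=str(data.get("suggestion", "")),
    )
```

Validating the reply before use is a design choice of this sketch: a model may return malformed or out-of-range values, and rejecting them early keeps bad values out of the final report.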
In a further technical scheme, comparing the two analysis results by subject and time, marking missing records and evaluating the consistency of the two sets of deviation record data to determine missing-record marks and the reasons for differences comprises:
Grouping the two analysis results according to the screening number of the subject, and classifying records under the same subject according to the PD type to obtain a plurality of groups of data;
sequencing each group of data according to the occurrence time of the event to obtain a sequencing result;
constructing a comparison prompt word;
and inputting the sequencing result and the comparison prompt word into a large language model to obtain a comparison result, wherein the comparison result comprises a mark missing condition and a difference reason.
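The grouping and sorting steps above can be sketched in Python as follows. This is an illustrative sketch only: the record keys (`screening_no`, `pd_type`, `event_date`), the added `source` tag, and the assumption that event dates are ISO-formatted strings are not taken from the patent. The final comparison by the large language model is left out.

```python
from collections import defaultdict
from datetime import date


def group_and_sort(ctms_records, edc_records):
    """Group records from both systems by (screening number, PD type),
    then sort each group by event date."""
    groups = defaultdict(list)
    for source, records in (("CTMS", ctms_records), ("EDC", edc_records)):
        for rec in records:
            tagged = dict(rec, source=source)  # remember which system the record came from
            groups[(rec["screening_no"], rec["pd_type"])].append(tagged)
    return {
        key: sorted(recs, key=lambda r: date.fromisoformat(r["event_date"]))
        for key, recs in groups.items()
    }


# Usage with made-up records:
ctms = [
    {"screening_no": "S001", "pd_type": "visit", "event_date": "2024-03-05"},
    {"screening_no": "S001", "pd_type": "visit", "event_date": "2024-01-10"},
]
edc = [{"screening_no": "S001", "pd_type": "visit", "event_date": "2024-02-01"}]
grouped = group_and_sort(ctms, edc)
```

Each sorted group, together with a comparison prompt, would then be passed to the model, which reports missing records and the reasons for differences.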
In a further technical scheme, generating the report containing the consistency check and difference comparison according to the detection result comprises:
generating, according to the detection result, a structured Excel file containing the consistency check and difference comparison to obtain the report.
The invention also provides an automatic verification device for clinical trial protocol deviations, comprising:
The acquisition unit is used for acquiring PD records and PD classification grading data from the CTMS system and the EDC system so as to obtain two pieces of data of the same item of the CTMS system and the EDC system;
the analysis unit is used for analyzing the two data by using the large language model to obtain two analysis results;
The PD checking unit is used for carrying out consistency check on the two data according to the two analysis results and comparing the differences to obtain a detection result;
the report generation unit is used for generating a report containing consistency check and difference comparison according to the detection result;
and the output unit is used for outputting the report.
The invention also provides a computer device which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the method when executing the computer program.
Compared with the prior art, the invention has the following advantages: PD records and their classification and grading data are obtained from the CTMS system and the EDC system and analyzed with a large language model, so that the system can perform the consistency check and difference comparison efficiently and automatically. This significantly improves verification efficiency and accuracy, ensures compatibility between different systems, and reduces manual intervention and learning costs. By generating a report containing the consistency check and difference comparison, the system integrates seamlessly with existing workflows, automating and optimizing them and greatly improving working efficiency and quality.
The invention is further described below with reference to the drawings and specific embodiments.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an application scenario of a deviation automatic verification method for a clinical trial scenario provided by an embodiment of the present invention;
FIG. 2 is a flow chart of a method for automatically verifying deviations of a clinical trial protocol according to an embodiment of the present invention;
FIG. 3 is a schematic illustration of a sub-flowchart of a method for automatically verifying deviation of a clinical trial protocol according to an embodiment of the present invention;
FIG. 4 is a schematic flow chart of a clinical trial plan deviation automatic checking method according to an embodiment of the present invention;
FIG. 5 is a schematic flow chart of a clinical trial plan deviation automatic checking method according to an embodiment of the present invention;
FIG. 6 is a schematic illustration of a sub-flowchart of a method for automatically verifying deviation of a clinical trial protocol according to an embodiment of the present invention;
FIG. 7 is a schematic flow chart of a clinical trial plan deviation automatic checking method according to an embodiment of the present invention;
FIG. 8 is a schematic block diagram of a clinical trial plan deviation automatic checking apparatus provided by an embodiment of the present invention;
Fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be understood that the terms "comprises" and "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
Referring to FIG. 1 and FIG. 2, FIG. 1 is a schematic diagram of an application scenario of the automatic verification method for clinical trial protocol deviations provided by an embodiment of the present invention, and FIG. 2 is a schematic flow chart of the method. The method is applied on a server. The server exchanges data with a terminal, and applying the large language model significantly improves verification efficiency, accuracy and system compatibility. The method reduces manual intervention by automatically parsing data from the CTMS and EDC systems and performing consistency checks and difference comparison. Using the large language model to classify data, map fields automatically and parse unstructured data improves the accuracy of data processing and ensures consistency between the data sets. In addition, the method can automatically generate reports and export them in a structured format, which simplifies report generation, reduces learning costs, ensures seamless integration with existing workflows, and raises the level of automation and intelligence of verification.
FIG. 2 is a flow chart of a method for automatically verifying deviation of a clinical trial protocol according to an embodiment of the present invention. As shown in fig. 2, the method includes the following steps S110 to S150.
S110, acquiring PD records and PD classification grading data from the CTMS system and the EDC system to obtain two pieces of data of the same item of the CTMS system and the EDC system.
In this embodiment, the two data sets for the same project are the PD records and PD classification and grading data from the CTMS system, and the PD records and PD classification and grading data from the EDC system.
Data relating to the deviation is obtained from a clinical trial management system and an electronic data acquisition system.
CTMS is a system for managing and monitoring clinical trials. Here, visit data for subjects and deviations during the trial are recorded in the CTMS system.
The system extracts data from the CTMS system through an API interface or file export. The extracted data include:
Structured fields, such as subject ID (subject_id), visit date (visit_date) and deviation type (deviation_type).
Unstructured data, such as the descriptive text of a PD (e.g., "subject missed visit 3 due to traffic problem").
EDC is a system for recording and managing data in clinical trials, typically for collecting patient data and various recordings during the trial.
Like the CTMS system, the EDC system also provides an API interface or supports file export functions to obtain PD-related data, the content and structure of which are typically different from the data format in CTMS.
The data extracted from the EDC system also includes structured fields (e.g., subject ID, date of visit) and unstructured descriptive text (e.g., PD descriptive text).
The types of acquired data include:
PD records: deviations or events associated with subjects during the clinical trial.
PD classification and grading data: the classification of PD records (e.g., "visit delay", "dosing error") and their grading information (e.g., "minor", "significant").
Accessing data through the interface of CTMS and EDC typically requires logging in and authentication. Through the interface, the system can automatically pull the data of the item, including PD record and classification information.
If the API interface cannot be used, the data can be acquired by importing files in formats such as CSV or Excel. In this case, the file must conform to a prescribed structure so that the data can be parsed correctly. This removes the need to adapt to each CTMS and EDC system individually.
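As an illustration of the file-import path, the following Python sketch loads PD rows from CSV text and rejects files that lack the expected columns. The patent does not specify the prescribed structure, so the column names in `REQUIRED_COLUMNS` and the function name `load_pd_rows` are hypothetical.

```python
import csv
import io

# Hypothetical prescribed structure; the patent does not list the required columns.
REQUIRED_COLUMNS = {"subject_id", "visit_date", "deviation_type", "description"}


def load_pd_rows(text_stream):
    """Read PD rows from a CSV text stream, failing early if columns are missing."""
    reader = csv.DictReader(text_stream)
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"file is missing required columns: {sorted(missing)}")
    return [dict(row) for row in reader]


# Usage with an in-memory file containing one made-up record:
sample = io.StringIO(
    "subject_id,visit_date,deviation_type,description\n"
    "S001,2024-01-10,missed visit,subject missed visit 3 due to traffic problem\n"
)
rows = load_pd_rows(sample)
```

For a file on disk, the same function can be given a handle opened with `open(path, newline="", encoding="utf-8")`.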
This step is critical in the overall system, as it is the basis for subsequent data parsing and verification. Any data acquisition and quality issues directly affect the subsequent analysis and verification results.
Accuracy and integrity in data acquisition ensures successful execution of subsequent steps (e.g., data parsing, verification, etc.).
Step S110 is the first link in the whole automation verification system, and aims to acquire two pieces of data about PD records through API interfaces or file import modes of CTMS and EDC systems. The data provides a basis for a subsequent intelligent analysis and verification module, and ensures the consistency and accuracy of the data.
In this embodiment, the PD record data is sensitive privacy data inside the enterprise, and for data security, the data is not allowed to be transmitted to the cloud. Therefore, the large model deployed by the external cloud cannot be used for data processing, and all data processing must be completed on an internal server or workstation of the enterprise, so as to reduce the risk of data leakage. Thus, enterprises need to privately deploy large models in their internal environments.
Ollama is an open-source tool designed for running large language models efficiently in a local environment; it helps enterprises achieve safe, efficient and low-cost AI deployment. Placing Ollama behind an Nginx reverse proxy ensures that the large model's interface cannot be accessed without authorization. The specific flow for deploying Ollama with an Nginx reverse proxy is as follows:
First, the existing hardware environment needs to be evaluated to ensure that it can support the subsequent steps. The evaluation includes checking whether key resources such as the processor, memory, storage space and network connection are sufficient, and adjusting or upgrading the hardware as necessary to meet the deployment requirements.
After confirming that the hardware environment meets the requirements, the next step is to download and install "Ollama" software. This process typically involves accessing an official website or designated download link, retrieving the latest installation package, and completing the installation of the software according to the installation instructions.
After Ollama is installed, the next step is to download and run the "qwen2.5 32B" large model. This typically means that pre-trained model files are obtained from a particular model library or platform, and then the model is loaded and launched using Ollama or other related tools for use in subsequent applications or experiments.
To optimize the network communications of the system, to increase its security and stability, it is next necessary to install and configure Nginx as a reverse proxy server. The process comprises the steps of downloading an Nginx installation package, executing an installation command, and modifying the configuration file of the Nginx according to actual requirements so as to realize functions of load balancing, SSL encryption and the like and ensure the efficient operation and stability of the system.
Finally, to further enhance the security of the system, an HTTP basic authentication mechanism needs to be configured for Nginx. This can be done by editing the Nginx configuration file and adding the corresponding authentication directives and user credentials. Once this is set up, only clients that pass authentication can access the protected resources, which improves the security of the system.
Through the steps, enterprises can ensure that large language models are deployed and operated efficiently and safely in a local privateization environment, meanwhile, the risk of data leakage is effectively avoided, and the control and the safety of system access are ensured.
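From the client side, a deployment like the one described above might be called as in the following Python sketch. It assumes the proxy forwards to Ollama's `/api/generate` endpoint and that the model was pulled under the tag `qwen2.5:32b`; the host name, credentials and helper names are hypothetical placeholders.

```python
import base64
import json
import urllib.request


def build_generate_request(base_url, user, password, prompt, model="qwen2.5:32b"):
    """Build a POST request for Ollama's /api/generate endpoint, carrying the
    HTTP basic authentication header that the Nginx proxy checks."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False})
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/generate",
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )


def generate(request, timeout=120):
    """Send the request and return the generated text from the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]


# Building the request does not touch the network; generate() does.
req = build_generate_request(
    "https://llm.example.internal/", "pd_checker", "change-me", "Classify this row"
)
```

Because basic authentication sends credentials in a reversibly encoded form, the proxy should terminate TLS, in line with the SSL encryption configured in the Nginx step above.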
S120, analyzing the two data by using a large language model to obtain two analysis results.
In this embodiment, the two analysis results refer to the key information corresponding to the PD record data and the PD classification data obtained after classification and analysis, respectively.
In one embodiment, referring to fig. 3, the step S120 may include steps S121 to S123.
S121, classifying the two data by using a large language model to obtain deviation record data and classification data.
In this embodiment, the two data sets are divided into data blocks, each data block is classified using the zero-shot classification capability of the large language model, and a voting mechanism determines the final class, yielding deviation record data and classification and grading data.
The two input data sets (typically from different CTMS and EDC systems) are partitioned into blocks, and each block is classified using the zero-shot classification capability of the large language model. In this way, the model can identify the type of each data block, such as PD record data or PD classification data.
First, the two data sets are split into multiple data blocks (chunks). Each data block contains several rows of data; the specific number (e.g., 5 to 10 rows) is typically determined by the input constraints of the model. Each data block is processed by the large language model as an independent input.
Next, each data block is classified using the large language model. Because of the zero-shot learning capability of large language models, the system can identify the class of a data block from a prompt, without explicit training. For example, a data block may be labeled "PD record", "PD classification" or "other".
To improve classification accuracy, a voting mechanism is employed. Specifically, the classification results of the multiple data blocks are aggregated using the inference capabilities of the large language model to determine the final class of each data block. Each data block returns a classification result and its confidence, and the final class is determined from the aggregated confidences.
Through these steps, the system is able to automatically identify and sort PD record data and sort classification data, eliminating the need for manual sorting or traditional hard-coded sorting.
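The chunking and voting steps can be sketched in Python as below. Two caveats apply. First, the patent aggregates block results using the large language model itself; this sketch substitutes a plain confidence-weighted sum, which is only one possible reading of the voting mechanism. Second, `classify_chunk` stands in for the model call and is hypothetical, as are the other function names.

```python
from collections import defaultdict


def split_into_chunks(rows, chunk_size=5):
    """Split rows into consecutive blocks of at most chunk_size rows."""
    return [rows[i:i + chunk_size] for i in range(0, len(rows), chunk_size)]


def vote(chunk_results):
    """Pick the label with the highest summed confidence.

    chunk_results is a list of (label, confidence) pairs, one per block.
    """
    if not chunk_results:
        raise ValueError("no chunk results to aggregate")
    totals = defaultdict(float)
    for label, confidence in chunk_results:
        totals[label] += confidence
    return max(totals, key=totals.get)


def classify_dataset(rows, classify_chunk, chunk_size=5):
    """Classify every block with classify_chunk (a stand-in for the LLM call)
    and aggregate the per-block results into one final label."""
    results = [classify_chunk(chunk) for chunk in split_into_chunks(rows, chunk_size)]
    return vote(results)
```

Summing confidences rather than counting votes lets a few high-confidence blocks outweigh several uncertain ones, which matters when some blocks contain only header or filler rows.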
S122, automatically mapping different system fields to the deviation record data and the classification grading data and analyzing unstructured PD descriptions to extract key information.
In one embodiment, referring to fig. 4, the step S122 may include steps S1221 to S1223.
S1221, constructing a prompt;
S1222, inputting the deviation record data and the classification and grading data, together with the prompt, into a large language model to parse the unstructured PD descriptions and obtain an analysis result in JSON format;
S1223, deserializing the analysis result in JSON format into an object in a programming language to obtain the key information.
In the present embodiment, the key information refers to key fields extracted from PD record data and PD classification data, such as subject ID, deviation type, classification code, and the like.
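Steps S1221 to S1223 can be illustrated with the following Python sketch, which builds a prompt and deserializes a JSON reply into objects. The instruction text and the key names `subject_id`, `deviation_type` and `deviation_reason` are assumptions; the patent does not disclose its prompt or its JSON schema, and the model call itself is omitted.

```python
import json
from types import SimpleNamespace

# Hypothetical instruction text; the patent does not disclose its prompt wording.
INSTRUCTIONS = (
    "You are given protocol deviation (PD) rows exported from a CTMS or EDC system.\n"
    "Return a JSON array only. Each element must have the keys "
    '"subject_id", "deviation_type" and "deviation_reason".\n'
    "Rows:\n"
)


def build_prompt(rows_text: str) -> str:
    """S1221: combine the fixed instructions with the raw rows."""
    return INSTRUCTIONS + rows_text


def parse_reply(reply: str):
    """S1223: deserialize the JSON reply so that fields become attributes."""
    return json.loads(reply, object_hook=lambda d: SimpleNamespace(**d))


# Usage with a made-up model reply:
reply = (
    '[{"subject_id": "S001", "deviation_type": "missed visit", '
    '"deviation_reason": "traffic problem"}]'
)
items = parse_reply(reply)
```

In this sketch each JSON object becomes a `SimpleNamespace`, so `items[0].subject_id` reads the extracted subject ID.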
In this embodiment, data fields obtained from different CTMS and EDC systems are automatically mapped to a unified data structure and unstructured PD descriptions are parsed to extract key information. The process aims to solve the difference of field names and data formats among different systems and ensure the consistency of data.
Since field names in CTMS and EDC systems may differ (e.g., "subject_id" versus "event_id"), the field names of the different systems need to be mapped automatically. A large language model can recognize these differences and map the fields through its semantic understanding capability. For example, "event_id" is mapped to "subject_id" so that data from different systems can be understood uniformly.
In addition to structured fields, a PD record may also contain an unstructured description (e.g., "subject missed visit 3 due to traffic problems"). With its semantic understanding capability, the large language model can extract key information from such text, such as the deviation type and the cause of the deviation. This step is critical for processing complex, irregular text data.
Through these mapping and parsing processes, the system can ensure that data from different systems is standardized and structured, facilitating subsequent processing and analysis.
S123, converting the key information into a unified internal data structure to obtain two analysis results.
In this embodiment, the classified and parsed key information is converted into a unified internal data structure for subsequent data processing, verification and analysis.
In this embodiment, two main data structures are defined:
The PD record data structure (PD_RECORD) includes fields such as "subject_id", visit date, category, descriptive text and level.
The PD classification and grading data structure (PD_TYPE_LEVEL) includes fields such as main classification code, sub-classification code, classification name and grade.
The extracted key information is converted into the unified data structure through the design of the prompt given to the large language model. The process automatically selects the corresponding structure according to the type of the input data (PD record or classification and grading data) and fills the parsed field values into that structure. The resulting structured data can be expressed in JSON format, which ensures that the data are standardized.
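The two structures could be modeled in Python as in the sketch below. Only the structure names PD_RECORD and PD_TYPE_LEVEL and the listed field meanings come from the text; the attribute names, the string types, and the `to_structure` helper are assumptions.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class PDRecord:
    """PD_RECORD: one protocol deviation record (attribute names are assumed)."""
    subject_id: str
    visit_date: str
    category: str
    description: str
    level: str


@dataclass
class PDTypeLevel:
    """PD_TYPE_LEVEL: one classification and grading entry (attribute names are assumed)."""
    main_code: str
    sub_code: str
    name: str
    level: str


STRUCTURES = {"PD_RECORD": PDRecord, "PD_TYPE_LEVEL": PDTypeLevel}


def to_structure(kind: str, fields: dict):
    """Select the structure matching the data type and fill in the parsed values."""
    return STRUCTURES[kind](**fields)


# Usage with made-up values:
record = to_structure("PD_RECORD", {
    "subject_id": "S001",
    "visit_date": "2024-01-10",
    "category": "missed visit",
    "description": "subject missed visit 3 due to traffic problem",
    "level": "minor",
})
as_json = json.dumps(asdict(record), ensure_ascii=False)
```

Passing an unexpected or missing field raises a `TypeError`, which gives an early signal that the model's output did not match the expected structure.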
After this step, the system will output two structured data results that facilitate consistency checking and comparative analysis by the subsequent PD checking module.
Through this series of operations, step S120 uses the semantic understanding and adaptive capabilities of the large language model to resolve the data format differences between the CTMS and EDC systems, automatically classifying, parsing, and structuring the data of the different systems. Specifically, step S121 performs data blocking and classification, S122 performs automatic field mapping and parsing of unstructured descriptions, and S123 converts the key information into a unified data structure, ensuring that the two pieces of data can be parsed and unified efficiently. In this way the system achieves efficient, automatic PD record checking and data integration, and improves the efficiency and accuracy of data processing.
Specifically, the semantic understanding capability of a large language model (LLM) is used to intelligently parse and structure the protocol deviation (PD) records and classification hierarchical data in the clinical trial management system (CTMS) and the electronic data capture (EDC) system. The module overcomes the limitations of conventional methods in processing heterogeneous data and improves the automation level, adaptability, and maintainability of the system.
In clinical trial data management, CTMS and EDC systems typically store PD records and classification data in different formats. Although these data are fairly regular (i.e., their field structures are fairly canonical), there are subtle format differences, such as different field names (e.g., "subject_id" and "event_id"), inconsistent date formats (e.g., "YYYY-MM-DD" and "DD/MM/YYYY"), and different expressions of the classification level (e.g., "slight deviation" and "Minor Deviation"). The conventional method adapts to each CTMS and EDC system one by one through hard coding: specific parsing logic, field-name mappings, and format conversions must be written for every system, which increases the amount and complexity of code. When a CTMS or EDC system is updated (e.g., an API change or a field adjustment), the code must be modified and the system redeployed, which is time consuming, labor intensive, and scales poorly. Likewise, whenever a new system is added, a new adapter must be developed, so diverse data sources cannot be accommodated quickly.
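To make the maintenance burden of the conventional approach concrete, the following is a minimal sketch of a hard-coded adapter. The per-system tables, the "visit_dt" field name, and the function name `adapt` are hypothetical and are not taken from any real CTMS or EDC product; only the "event_id" to "subject_id" mapping and the two date formats come from the examples above.

```python
from datetime import datetime

# Hypothetical per-system lookup tables (illustrative only). Every new or
# updated source system requires editing these tables and redeploying.
FIELD_MAP = {
    "CTMS": {"subject_id": "subject_id", "visit_date": "visit_date"},
    "EDC": {"event_id": "subject_id", "visit_dt": "visit_date"},
}
DATE_FORMAT = {"CTMS": "%Y-%m-%d", "EDC": "%d/%m/%Y"}


def adapt(row: dict, system: str) -> dict:
    """Rename fields and normalise dates for one known source system."""
    mapping = FIELD_MAP[system]  # raises KeyError for a system with no adapter
    out = {}
    for name, value in row.items():
        target = mapping.get(name)
        if target is None:
            continue  # fields missing from the table are silently dropped
        if target == "visit_date":
            parsed = datetime.strptime(value, DATE_FORMAT[system])
            value = parsed.strftime("%Y-%m-%d")
        out[target] = value
    return out
```

In this sketch a renamed field or a new source system is handled only after the tables are changed by hand, which is the limitation the LLM-based parsing is intended to remove.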
To solve these problems, the present embodiment uses the semantic understanding capability and adaptivity of the LLM to automatically parse and structure PD records and classification hierarchical data, thereby fundamentally overcoming the limitations of the conventional method.
Specifically, PD records (including structured fields and unstructured descriptions) and classification hierarchical data are identified from the input data.
Using the semantic understanding capability of the LLM, the heterogeneous data formats of different CTMS and EDC systems are adapted to automatically, without hard-coded adapters.
Field semantics are identified and the field names of different systems are mapped automatically (e.g., "event_id" to "subject_id"), including mapping between Chinese and English field names.
The unstructured PD description is parsed and key information (e.g., type of deviation, cause) is extracted.
The LLM is guided by prompts to convert the parsed data into a unified internal data structure.
Automatic search locates the PD classification hierarchical data and the PD record data within the two pieces of data provided. Formally, automatic search can be regarded as a text classification process: an input data table is divided into data blocks (chunks) of appropriate size, each data block is classified using the zero-shot classification capability of the large language model, and the final class is determined by a voting mechanism, thereby locating the required data. Four aspects are described below: data preprocessing, the classification process, the voting mechanism, and result output.
Let the input data be divided into K data tables, and let one of the data tables be D, containing N rows of data:
$D = \{d_1, d_2, \dots, d_N\}$;
To improve classification efficiency and the model's understanding, every k consecutive rows are combined into one data block (chunk), forming a data block set C:
$C = \{c_1, c_2, \dots, c_M\}, \quad M = \lceil N/k \rceil$;
where $c_j$ is the j-th data block, containing k rows of data (the last data block may contain fewer than k rows). k is selected according to the data size and the model's input constraints (e.g., the maximum number of tokens), and is typically 5 to 10 rows.
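The blocking step can be sketched as follows. This is a minimal illustration that assumes each data row has already been rendered as a text string; the function name `make_chunks` is hypothetical.

```python
from typing import List, Sequence


def make_chunks(rows: Sequence[str], k: int = 5) -> List[List[str]]:
    """Split N rows into blocks of k consecutive rows.

    The last block may hold fewer than k rows, so the number of blocks
    is ceil(N / k).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return [list(rows[i:i + k]) for i in range(0, len(rows), k)]
```

For example, 12 rows with k = 5 give three blocks of 5, 5, and 2 rows.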
Each data block $c_j$ is classified using the zero-shot classification capability of the large language model to determine whether it contains PD classification data or PD record data. The class label set L = {"PD record", "PD class hierarchy", "other"} represents the three possible classes of a data block. Through prompt design, each data block is input into the large language model for classification. The model returns a classification result (e.g., "PD record") together with a confidence P.
Because the output of a large model can fluctuate, an individual classification result may be inaccurate. The classification results are therefore aggregated by a voting mechanism to determine the final class of the data table. For the data block set C of each data table D, the result $(L_j, P_j)$ of each data block $c_j$ is obtained, the confidences are summed per class, and the class with the largest confidence sum is selected as the class of the data table, i.e., $\hat{L} = \arg\max_{l \in L} \sum_{j:\, L_j = l} P_j$.
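The confidence-sum vote can be sketched as follows. The per-block results are assumed to have been obtained from the large language model already; the function name `vote`, the handling of unexpected labels, and the fallback to "other" for an empty table are assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

LABELS = ("PD record", "PD class hierarchy", "other")


def vote(results: Iterable[Tuple[str, float]]) -> str:
    """Return the label whose summed confidence over all blocks is largest.

    `results` holds one (label, confidence) pair per data block of a table.
    """
    totals: Dict[str, float] = defaultdict(float)
    for label, confidence in results:
        if label not in LABELS:
            label = "other"  # assumption: unexpected labels count as "other"
        totals[label] += confidence
    if not totals:
        return "other"  # assumption: a table with no blocks is "other"
    return max(totals, key=totals.get)
```

Summing confidences rather than counting votes lets a few high-confidence blocks outweigh several low-confidence ones, which matches the aggregation described above.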
The PD classification hierarchical data structure is designed as follows:
json
{
"main_category_code": "string", // main classification code
"main_category_name": "string", // main category name
"sub_category_code": "string", // sub-category code
"sub_category_name": "string", // sub-category name
"level": "string" // level (grading)
}
The PD record data structure is designed as follows:
json
{
"subject_id": "string", // subject ID
"visit_date": "string", // date of visit (format: YYYY-MM-DD)
"main_category_name": "string", // main category name
"sub_category_name": "string", // sub-category name
"description": "string", // PD description text (e.g. "subject missed 3rd visit due to traffic problem")
"level": "string", // level (grading)
"source": "string" // data source ("CTMS" or "EDC")
}
After automatic search has determined the class of the data table (e.g., PD record or classification hierarchical data) and the unified data structure has been defined, the data parsing module uses the semantic understanding capability of the large language model (LLM) to automatically parse the contents of the data table and map the required fields into the unified data structure. The process uses JSON as an intermediary to keep the data structured and consistent during parsing and mapping. The steps are as follows:
A prompt is constructed and input into the large language model. The prompt is constructed as follows:
The text of the data line is { text };
From the { type } data table;
Please try to resolve it into a JSON object array of the { struct } structure; if a line cannot be resolved, please output a JSON object of the UNKNOWN string.
Here text is the text of a data line, type is the type of the data table (either a PD classification hierarchy table or a PD record table), and struct is the structure determined by type: if type is a PD classification hierarchy table, struct is the PD_TYPE_LEVEL structure; if type is a PD record table, struct is the PD_RECORD structure.
The JSON string returned by the large language model is then deserialized into objects in the programming language, which completes the data parsing.
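The prompt construction and deserialization steps might look like the following sketch. It covers only the PD_RECORD case; the class name `PDRecord`, the helper names, the type strings used as dictionary keys, and the treatment of any non-object array element as an unresolved line are assumptions of this sketch, and the visit-date field is written as `visit_date`. The call to the large language model itself is omitted.

```python
import json
from dataclasses import dataclass, fields
from typing import List, Optional

# Assumed spellings of the two table types.
STRUCT_BY_TYPE = {
    "PD classification hierarchy table": "PD_TYPE_LEVEL",
    "PD record table": "PD_RECORD",
}

PROMPT_TEMPLATE = (
    "The text of the data line is {text};\n"
    "From the {type} data table;\n"
    "Please try to resolve it into a JSON object array of the {struct} structure; "
    "if a line cannot be resolved, please output a JSON object of the UNKNOWN string."
)


@dataclass
class PDRecord:
    subject_id: str = ""
    visit_date: str = ""
    main_category_name: str = ""
    sub_category_name: str = ""
    description: str = ""
    level: str = ""
    source: str = ""


def build_prompt(text: str, table_type: str) -> str:
    """Fill the prompt template; the structure name follows from the table type."""
    return PROMPT_TEMPLATE.format(
        text=text, type=table_type, struct=STRUCT_BY_TYPE[table_type]
    )


def parse_records(reply: str) -> List[Optional[PDRecord]]:
    """Deserialize the model's JSON reply into PDRecord objects.

    Assumption: anything that is not a JSON object (for example the
    unresolved-line marker) becomes None.
    """
    items = json.loads(reply)
    if not isinstance(items, list):
        items = [items]
    known = {f.name for f in fields(PDRecord)}
    records: List[Optional[PDRecord]] = []
    for item in items:
        if not isinstance(item, dict):
            records.append(None)
            continue
        # Keep only the fields defined by the unified structure.
        kept = {k: str(v) for k, v in item.items() if k in known}
        records.append(PDRecord(**kept))
    return records
```

Dropping unknown keys keeps the unified structure stable even when the model returns extra fields; a stricter implementation could reject such replies instead.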
S130, carrying out consistency check on the two data according to the two analysis results, and comparing differences to obtain a detection result.
In this embodiment, the detection result refers to the final output generated based on the consistency check and the data comparison, including the comparison result of the PD records in the two systems, the mark of the missing record, and the cause of the difference found in the comparison process. The detection result provides basis for data cleaning and quality control, and helps to identify and correct inconsistent or missing problems in the data.
Consistency checks and comparison checks are performed on the protocol deviation (PD) records in the CTMS (clinical trial management system) and the EDC (electronic data capture system) by means of a large language model (LLM). Specifically, the semantic understanding and reasoning capability of the LLM is used to check the consistency of the data, identify missing data, and analyze the differences between the two systems.
This step performs the consistency check and difference comparison on the two data analysis results extracted from the CTMS and the EDC, and finally obtains the detection result. The detection result comprises the consistency assessment of the data, the marks of missing records, and the analysis of the reasons for differences.
In one embodiment, referring to fig. 5, the step S130 may include steps S131 to S132.
S131, checking the internal consistency of the two analysis results through a large language model to obtain a consistency result.
In this embodiment, the consistency result refers to an evaluation result obtained by performing semantic analysis on PD records of the same subject and the same visit in the CTMS and EDC systems through a large language model, and then determining whether elements such as data description, category, time and the like in the two systems are matched. If the data of the two systems are consistent in the semantic level, judging that the data are consistent, otherwise, judging that the data are inconsistent, and giving out a specific reason of the inconsistency.
In one embodiment, referring to fig. 6, the step S131 may include steps S1311 to S1312.
S1311, constructing a prompt word according to the two analysis results;
S1312, inputting the two analysis results and the prompt word into a large language model for internal consistency analysis to obtain a consistency result;
Wherein the consistency result comprises a consistency result, a confidence level, a non-consistency reason and a suggestion.
In this embodiment, first, the internal consistency check is performed on the two analysis results. The method specifically comprises the following substeps:
A prompt is constructed from the data in the CTMS and EDC systems. The prompt includes information such as "description", "main_category_name", and "sub_category_name", which is provided to the LLM for further analysis.
The constructed prompt and the analysis results of the CTMS and the EDC are input into the large language model for the consistency check. Relying on its deep semantic understanding capability, the large language model automatically evaluates whether the description and the category are consistent. For example, the model checks whether the "visit delay" category is semantically compatible with "subject missed visit 3 due to traffic problems".
The result includes the following aspects:
Consistency result (whether the description and category are consistent);
Confidence (between 0 and 1, indicating how reliable the consistency judgment is);
Reason for the inconsistency and suggestion (if there is an inconsistency, the model infers the reason and makes a suggestion).
S132, comparing the two analysis results according to the subjects and time, marking the missing record and evaluating the consistency of the two deviation record data so as to determine the missing mark and the difference reason;
The detection results comprise a consistency result, a deletion mark and a difference reason.
In one embodiment, referring to fig. 7, the step S132 may include steps S1321 to S1324.
S1321, grouping the two analysis results according to the screening number of the subjects, and classifying records under the same subject according to the PD type to obtain a plurality of groups of data;
S1322, sorting each group of data according to the occurrence time of the event to obtain a sorting result;
S1323, constructing a comparison prompt;
S1324, inputting the sorting result and the comparison prompt into a large language model to obtain a comparison result, wherein the comparison result comprises the marked missing records and the difference reasons.
In this embodiment, step S132 focuses on comparing the PD records of the same subject and the same visit in the CTMS and EDC systems and on identifying data differences and missing records. The process comprises the following substeps:
PD records in the CTMS and the EDC are grouped according to the screening number (e.g., unique ID) of the subject. This ensures that each group represents data from the same subject.
Within the data of the same subject, records are further classified according to PD type. This ensures that the records being compared are of the same kind and avoids comparison errors caused by records of different categories.
Each group of data is sorted by event occurrence time. Sorting in time order ensures that the same event (e.g., a visit delay) in the two systems is aligned correctly.
A comparison prompt is constructed. Its content includes the PD record lists of the same subject and the same visit in the CTMS and EDC systems, and it asks the LLM to compare the records according to description, time, and other factors and to identify the differences between them. The comparison results are divided into three categories:
both systems have the record and the records are consistent;
CTMS has the record but EDC does not (labeled "missing");
EDC has the record but CTMS does not (labeled "missing").
The output of the LLM includes the marked missing records and an analysis of the cause of each difference (e.g., data entry errors or system synchronization problems).
In this embodiment, the final detection result includes the following key elements:
Consistency result: whether the two pieces of data match semantically, as evaluated by the internal consistency check.
Missing markers: the PD records in the two systems are compared and the records missing from one system are marked.
Difference reasons: the reasons for data inconsistency between the two systems, such as data errors or system synchronization problems, are analyzed and explained.
The embodiment realizes the comprehensive automatic check of the PD records of the CTMS and EDC system by fully utilizing the powerful semantic analysis and reasoning capability of the large language model. The method comprises the following technical measures:
Intelligent prompt generation: prompts are constructed from the actual data and passed to the large language model for checking;
Automatic consistency check and comparison check: efficient consistency checking and difference analysis are performed on large volumes of data without manual intervention;
In-depth reasoning and correction advice: inconsistent records are analyzed and suggestions are made, providing support for data cleaning.
The method not only improves the checking efficiency and accuracy, but also reduces manual intervention, and remarkably improves the intelligence and compliance of data management.
In the method of this embodiment, each PD record in the CTMS and EDC systems is checked for internal consistency, ensuring that the PD category (e.g., "visit delay") is semantically consistent with the PD description (e.g., "subject missed visit 3 due to traffic problems"). Using the zero-shot text classification capability of the large language model (LLM), inconsistent records are identified and marked, and correction suggestions are provided. This provides the necessary support for subsequent data cleaning and quality control.
Differences between the two systems (e.g., missing records, misclassification, descriptive inconsistencies, etc.) are identified by comparing PD records of the same subject, the same visit in the CTMS and EDC systems. With the deep reasoning capabilities of LLM, the module can not only discover differences, but also infer their root cause (e.g., data entry errors or system synchronization problems), and evaluate the severity of differences, thereby helping to improve data consistency.
The method of the embodiment plays the advantages of the large language model in semantic analysis and reasoning, realizes the full-automatic process from consistency detection to difference comparison, remarkably improves the verification efficiency and accuracy, and reduces the requirement of manual intervention. In addition, the module design has flexible expansibility, can adapt to clinical trials of different scales and diversified CTMS/EDC systems, and provides an intelligent data management and supervision compliance solution.
Specifically, each PD record of the CTMS and the EDC is traversed, and its main_category_name, sub_category_name, and description fields are extracted.
A prompt is constructed with reference to the following structure:
description: { description };
category: { main_category_name }, { sub_category_name };
Are the description and category consistent? Give the consistency result (consistent/inconsistent), the confidence, and, when inconsistent, the reason and a suggestion.
The LLM returns the result as text; the returned text is parsed, and the consistency result ("consistent" or "inconsistent"), confidence (0-1), reason for the inconsistency, and suggestion are extracted. To improve efficiency, prompts may be constructed in batches.
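The following sketch shows one way the consistency prompt could be built and the textual reply parsed. The source does not specify the reply layout, so the four labelled reply lines ("result", "confidence", "reason", "suggestion"), the class name `ConsistencyResult`, and both function names are assumptions of this sketch. The model call is not shown.

```python
from dataclasses import dataclass


@dataclass
class ConsistencyResult:
    consistent: bool
    confidence: float
    reason: str = ""
    suggestion: str = ""


def build_consistency_prompt(description: str, main_category_name: str,
                             sub_category_name: str) -> str:
    """Build the prompt; the requested reply layout is an assumption."""
    return (
        f"description: {description};\n"
        f"category: {main_category_name}, {sub_category_name};\n"
        "Are the description and category consistent? Reply with four lines:\n"
        "result: consistent or inconsistent\n"
        "confidence: a number between 0 and 1\n"
        "reason: reason when inconsistent\n"
        "suggestion: suggestion when inconsistent"
    )


def parse_consistency_reply(reply: str) -> ConsistencyResult:
    """Extract the four fields from 'key: value' lines of the reply text."""
    values = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            values[key.strip().lower()] = value.strip()
    try:
        confidence = float(values.get("confidence", ""))
    except ValueError:
        confidence = 0.0  # a missing or unparsable confidence counts as 0
    confidence = min(1.0, max(0.0, confidence))  # clamp to the 0-1 range
    return ConsistencyResult(
        consistent=values.get("result", "").lower() == "consistent",
        confidence=confidence,
        reason=values.get("reason", ""),
        suggestion=values.get("suggestion", ""),
    )
```

A reply that omits the result line is treated as inconsistent with confidence 0, so malformed replies surface for manual review rather than passing silently.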
The comparison check of PD records between the CTMS and the EDC proceeds as follows:
PD records in the CTMS and the EDC are grouped according to subject screening number.
PD records under each subject screening number are grouped by PD type.
PD records of the same type under the same subject screening number are sorted chronologically.
A prompt is constructed with reference to the following structure:
The PD record list from CTMS is { CTMS_PD_RECORDS };
The PD record list from EDC is { EDC_PD_RECORDS };
Please compare the records in the CTMS PD record list and the EDC PD record list one by one, infer from the description and time whether they are the same PD, and divide the comparison results into three categories:
Both systems have the record and the records are consistent.
CTMS has the record but EDC does not: mark as "missing".
EDC has the record but CTMS does not: mark as "missing".
The LLM returns the result as text; the returned text is parsed and the required data structure objects are generated from the parsing result.
The prompt needs to be tuned to the specific situation.
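The grouping, sorting, and prompt construction steps above can be sketched as follows. The sketch assumes that records are dictionaries shaped like the PD record structure with the visit date under the key "visit_date" in YYYY-MM-DD form, and that "PD type" corresponds to main_category_name; the function names are hypothetical, and the model call and reply parsing are omitted.

```python
import json
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Dict[str, str]
GroupKey = Tuple[str, str]  # (subject_id, PD type)


def group_and_sort(records: List[Record]) -> Dict[GroupKey, List[Record]]:
    """Group by subject and PD type, then sort each group chronologically."""
    groups: Dict[GroupKey, List[Record]] = defaultdict(list)
    for rec in records:
        # Assumption: main_category_name stands in for the PD type.
        groups[(rec["subject_id"], rec["main_category_name"])].append(rec)
    for recs in groups.values():
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        recs.sort(key=lambda r: r["visit_date"])
    return dict(groups)


def build_comparison_prompts(ctms: List[Record],
                             edc: List[Record]) -> Dict[GroupKey, str]:
    """Build one comparison prompt per (subject, PD type) group."""
    ctms_groups = group_and_sort(ctms)
    edc_groups = group_and_sort(edc)
    prompts: Dict[GroupKey, str] = {}
    for key in sorted(set(ctms_groups) | set(edc_groups)):
        ctms_list = json.dumps(ctms_groups.get(key, []), ensure_ascii=False)
        edc_list = json.dumps(edc_groups.get(key, []), ensure_ascii=False)
        prompts[key] = (
            f"The PD record list from CTMS is {ctms_list};\n"
            f"The PD record list from EDC is {edc_list};\n"
            "Please compare the records in the two lists one by one, infer from "
            "the description and time whether they are the same PD, and classify "
            "each result as: both systems have the record and are consistent; "
            'CTMS has the record but EDC does not ("missing"); '
            'EDC has the record but CTMS does not ("missing").'
        )
    return prompts
```

Taking the union of the group keys means a subject or PD type that appears in only one system still produces a prompt with an empty list on the other side, so records missing from one system are not skipped.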
Implemented with the above techniques, these steps allow the data checking work to be completed efficiently and accurately, further improving the quality and compliance of clinical trial data management.
And S140, generating a report containing consistency check and difference comparison according to the detection result.
In this embodiment, the report refers to a structured Excel file generated based on the detection result, including consistency check of PD records and difference comparison between CTMS and EDC system, and aims to provide detailed check result and correction advice.
Specifically, a structured Excel file containing the consistency check and the difference comparison is generated according to the detection result to obtain the report.
The result recording module is a core output part of the PD record checking system and is responsible for storing and displaying the checking result in a structured Excel file form, so that a user can conveniently check, analyze and archive the checking result. The main functions of the module are as follows:
Recording a checking result:
CTMS check result: records the internal consistency check result of the PD records in the CTMS system, i.e., whether the PD category matches the description.
EDC check result: records the internal consistency check result of the PD records in the EDC system, likewise checking the consistency of the PD category and the description.
CTMS vs. EDC comparison result: records the result of comparing the PD records in the CTMS and EDC systems and identifies the differences between them.
Generating a structured report:
The check results are output as an Excel file divided into three independent worksheets (sheets), corresponding to the three parts above.
A clear field structure is provided so that users can quickly understand the check results.
The Excel file generated by the result recording module comprises three worksheets, each worksheet corresponds to a part of checking result, and the Excel file has a clear structure and reasonable field design. The following is a detailed structure of each part:
Overall structure of the Excel file:
File name: PD_verification_result_[Timestamp].xls;
Worksheets:
CTMS_Consistency: internal consistency check results of CTMS PD records.
EDC_Consistency: internal consistency check results of EDC PD records.
CTMS_vs_EDC_Differences: comparison check results of CTMS vs. EDC PD records.
Field structure of the CTMS_Consistency worksheet:
Record_id: unique identifier of the PD record (consisting of subject ID and visit number).
Visit_date: visit date.
Deviation_type: PD category.
Description: PD description text.
Consistency_result: consistency check result (labeled "consistent" or "inconsistent").
Confidence_score: confidence of the consistency check (range 0-1, based on zero-shot classification by the large language model).
Inconsistency_details: if there is an inconsistency, a detailed description is provided (e.g., "described as a visit delay, but category labeled as dosing error").
Suggestion: correction advice (e.g., "update the category to 'visit delay'").
EDC_Consistency worksheet:
The field structure of this worksheet is the same as that of CTMS_Consistency, but the data originates from the EDC system.
CTMS_vs_EDC_Differences worksheet:
Rows in this worksheet are grouped by subject ID so that each subject's situation can be tracked.
Record_id: unique identifier of the PD record.
Visit_date: visit date (the dates in CTMS and EDC may differ).
Deviation_type_ctms: PD category in CTMS.
Deviation_type_edc: PD category in EDC.
Description_ctms: PD description in CTMS.
Description_edc: PD description in EDC.
Difference_Type: difference type (e.g., "missing", "value inconsistency", "semantic inconsistency").
Missing: one system has the record and the other does not.
Value inconsistency: a structured field (e.g., category) differs between the two systems.
Semantic inconsistency: the descriptions differ in meaning.
Similarity_score: semantic similarity of the descriptions (0-1, based on semantic vector comparison by the large language model).
Severity: severity of the difference (e.g., "slight" or "significant").
Details of the difference (e.g., "record present in CTMS, absent in EDC").
Suggestion: correction advice (e.g., "add the missing record in EDC").
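The three-worksheet layout can be sketched in memory as follows, leaving the actual writing of the Excel file to whichever spreadsheet library is used. The column spellings are normalised from the field lists above; the column name "Difference_details", the timestamp format, and the function names are assumptions of this sketch.

```python
from datetime import datetime
from typing import Dict, List

CONSISTENCY_COLUMNS = [
    "Record_id", "Visit_date", "Deviation_type", "Description",
    "Consistency_result", "Confidence_score", "Inconsistency_details",
    "Suggestion",
]
DIFFERENCE_COLUMNS = [
    "Record_id", "Visit_date", "Deviation_type_ctms", "Deviation_type_edc",
    "Description_ctms", "Description_edc", "Difference_Type",
    "Similarity_score", "Severity",
    "Difference_details",  # hypothetical name for the difference details field
    "Suggestion",
]


def report_filename(now: datetime) -> str:
    """File name with a timestamp; the timestamp format is an assumption."""
    return f"PD_verification_result_{now:%Y%m%d_%H%M%S}.xls"


def _table(columns: List[str], rows: List[dict]) -> List[list]:
    """Header row followed by one row per result; absent values become ''."""
    return [list(columns)] + [[row.get(c, "") for c in columns] for row in rows]


def build_sheets(ctms_rows: List[dict], edc_rows: List[dict],
                 diff_rows: List[dict]) -> Dict[str, List[list]]:
    """Lay out the three worksheets as lists of rows, keyed by sheet name."""
    # Record_id starts with the subject ID, so sorting on it keeps each
    # subject's differences together (assumption about the ID layout).
    diff_rows = sorted(diff_rows, key=lambda r: str(r.get("Record_id", "")))
    return {
        "CTMS_Consistency": _table(CONSISTENCY_COLUMNS, ctms_rows),
        "EDC_Consistency": _table(CONSISTENCY_COLUMNS, edc_rows),
        "CTMS_vs_EDC_Differences": _table(DIFFERENCE_COLUMNS, diff_rows),
    }
```

Each value in the returned dictionary is a header row followed by data rows, which maps directly onto appending rows to a worksheet in a spreadsheet library.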
S150, outputting the report.
In this embodiment, the report is output to the terminal.
The method of this embodiment obtains PD (protocol deviation) records by interfacing with the CTMS (clinical trial management system) and EDC (electronic data capture) systems, or by file import. The semantic understanding capability of the large language model is used to intelligently parse the acquired PD records, automatically adapting to the heterogeneous data formats and description styles of different systems, converting the original unstructured text data into structured data, and unifying its format, so that consistent input data is provided for subsequent verification. On the basis of the parsed data, the PD records are checked for consistency, and the PD records in the CTMS and EDC systems are compared. Inconsistent records (such as missing records, classification errors, or semantic inconsistencies) are identified by exactly matching the structured fields and semantically analyzing the unstructured descriptions, and are classified according to the type and priority of the difference, ensuring that the check results are accurate and actionable. A detailed check report is then generated, which clearly presents the difference details and the semantic analysis results and provides corresponding correction suggestions.
Because it is developed on the Microsoft Office ecosystem, the system runs in an environment familiar to the user, which lowers the barrier to operation, allows seamless integration with existing workflows, and improves compatibility and convenience. By using the semantic understanding capability of the large language model, the data parsing module automatically adapts to the PD record formats of different CTMS and EDC systems, intelligently parses and unifies multiple heterogeneous data formats and description styles, resolves the structural and expression differences between systems, and supports cross-platform data integration, thereby improving checking efficiency and consistency. Relying on the semantic understanding and zero-shot text classification capabilities of the large language model, the system checks the consistency between PD categories and descriptions in PD records, automatically finds and corrects classification mismatches (for example, a visit delay mistakenly labeled as a dosing error), and adapts to different PD description scenarios without additional training data. Using the deep reasoning capability of the large language model, the system performs a difference comparison check on the PD records in the CTMS and EDC systems, jointly analyzes structured fields and unstructured descriptions, identifies subtle differences between records (such as semantic inconsistencies, classification errors, or missing data), infers potential root causes, and provides accurate check results and improvement suggestions, thereby improving data consistency and making the checking process intelligent.
Conventional methods (e.g., manual auditing or semi-automated processes) typically require extensive manual intervention, which is time consuming and inefficient. For example, in a large multi-center clinical trial involving thousands of PD records, manual comparison of data in CTMS and EDC often takes days, while the method of this embodiment shortens the required time to several minutes through automated verification, and the total time after manual confirmation can be reduced to within 1 hour.
When processing unstructured PD descriptions, the conventional method often relies on manual judgment or simple rules, and has difficulty accurately identifying records whose wording is similar but whose semantics differ, so the verification results are not accurate enough. The method of this embodiment checks the data more accurately by using the semantic analysis of the large language model.
The method of the embodiment fully utilizes the wide application of the Microsoft Office environment, and adopts the familiar operation interface of the user, thereby reducing the invasiveness of the system and ensuring that the system can be seamlessly integrated into the existing workflow. The design greatly reduces the learning cost of the user and improves the compatibility and the deployment convenience of the system.
According to the clinical test scheme deviation automatic checking method, the PD records and the classified and classified data thereof are obtained from the CTMS system and the EDC system, the data are analyzed by using the large language model, the system can efficiently and automatically process consistency check and difference comparison of the data, the checking efficiency and accuracy are remarkably improved, compatibility among different systems is ensured, and manual intervention and learning cost are reduced. By generating the report containing consistency check and difference comparison, the system can seamlessly integrate the existing workflow, realize automation and optimization of the workflow, and greatly improve the working efficiency and quality.
Fig. 9 is a schematic block diagram of a clinical trial plan deviation automatic checking apparatus 300 provided by an embodiment of the present invention. As shown in FIG. 9, the present invention also provides a clinical trial deviation automatic checking apparatus 300 corresponding to the above clinical trial deviation automatic checking method. The clinical trial deviation automatic checking apparatus 300, which includes means for performing the above-described clinical trial deviation automatic checking method, may be configured in a server. Specifically, referring to fig. 9, the clinical trial deviation automatic checking apparatus 300 includes an acquisition unit 301, an analysis unit 302, a PD checking unit 303, a report generating unit 304, and an output unit 305.
The system comprises an acquisition unit 301 for acquiring PD records and PD classification hierarchical data from a CTMS system and an EDC system to obtain two pieces of data of the same item of the CTMS system and the EDC system, an analysis unit 302 for analyzing the two pieces of data by using a large language model to obtain two analysis results, a PD checking unit 303 for carrying out consistency check on the two pieces of data according to the two analysis results and comparing differences to obtain detection results, a report generating unit 304 for generating a report containing consistency check and difference comparison according to the detection results, and an output unit 305 for outputting the report.
In one embodiment, the parsing unit 302 includes:
The parsing unit 302 comprises a classification subunit, an information extraction subunit, and a transformation subunit, wherein the classification subunit is used for classifying the two pieces of data by using a large language model to obtain deviation record data and classification hierarchical data; the information extraction subunit is used for automatically mapping the fields of different systems for the deviation record data and the classification hierarchical data and for parsing unstructured PD descriptions to extract key information; and the transformation subunit is used for converting the key information into a unified internal data structure to obtain two analysis results.
In an embodiment, the classification subunit is configured to divide the two pieces of data into data blocks, classify each data block by using the zero-shot classification capability of a large language model, and determine the final class in combination with a voting mechanism to obtain the deviation record data and the classification hierarchical data.
In an embodiment, the information extraction subunit comprises:
The information extraction subunit comprises a first construction module, a parsing module, and a deserialization module, wherein the first construction module is used for constructing a prompt; the parsing module is used for inputting the deviation record data and the classification hierarchical data, together with the prompt, into a large language model to parse the unstructured PD description and obtain an analysis result in JSON format; and the deserialization module is used for deserializing the analysis result in JSON format into objects in the programming language to obtain the key information.
In an embodiment, the PD checking unit 303 includes:
The PD checking unit 303 comprises an internal consistency checking subunit and a comparison subunit, wherein the internal consistency checking subunit is used for checking the internal consistency of the two analysis results through a large language model to obtain a consistency result; the comparison subunit is used for comparing the two analysis results according to subject and time, marking missing records, and evaluating the consistency of the two pieces of deviation record data to determine missing marks and difference reasons; and the detection result comprises the consistency result, the missing marks, and the difference reasons.
In one embodiment, the internal consistency check subunit comprises:
a second construction module, used for constructing a prompt according to the two analysis results; and an analysis module, used for inputting the two analysis results and the prompt into a large language model for internal consistency analysis to obtain a consistency result;
Wherein the consistency result comprises a consistency determination (whether the results are consistent), a confidence level, an inconsistency reason and a suggestion.
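One possible shape for the consistency result and its validation is sketched below. The JSON key names, the 0 to 1 confidence range, the prompt wording and the stubbed model reply used in testing are assumptions; the source only states which four items the result contains.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class ConsistencyResult:
    consistent: bool   # consistency determination
    confidence: float  # assumed to lie in [0, 1]
    reason: str        # inconsistency reason, empty when consistent
    suggestion: str


def check_consistency(result_a: Dict[str, str], result_b: Dict[str, str],
                      llm: Callable[[str], str]) -> ConsistencyResult:
    """Ask the model to compare two analysis results and validate its JSON reply."""
    prompt = ("Compare the two protocol deviation records and answer as JSON with "
              "keys consistent (true/false), confidence (0 to 1), reason, suggestion.\n"
              f"Record A: {json.dumps(result_a, ensure_ascii=False)}\n"
              f"Record B: {json.dumps(result_b, ensure_ascii=False)}")
    data = json.loads(llm(prompt))
    if not isinstance(data["consistent"], bool):
        raise ValueError("'consistent' must be a JSON boolean")
    confidence = float(data["confidence"])
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be between 0 and 1")
    return ConsistencyResult(data["consistent"], confidence,
                             str(data.get("reason", "")),
                             str(data.get("suggestion", "")))
```

Rejecting a non-boolean `consistent` value guards against a reply such as the string "false", which would otherwise be truthy in Python.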
In one embodiment, the comparison subunit comprises:
A grouping module, a sorting module, a third construction module and a comparison module, wherein the grouping module is used for grouping the two analysis results according to the subject screening number and classifying the records under the same subject according to PD type to obtain a plurality of groups of data; the sorting module is used for sorting each group of data according to event occurrence time to obtain a sorting result; the third construction module is used for constructing a comparison prompt word; and the comparison module is used for inputting the sorting result and the comparison prompt word into a large language model to obtain a comparison result, wherein the comparison result comprises the marked missing records and the difference reasons.
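The grouping, sorting and prompt-construction modules can be sketched as below. The record keys (`screening_no`, `pd_type`, `occurred_on`, `description`), the ISO date format and the prompt wording are assumptions for illustration; the final model call is left out.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, List, Tuple

Record = Dict[str, str]
GroupKey = Tuple[str, str]  # (subject screening number, PD type)


def group_and_sort(ctms: List[Record],
                   edc: List[Record]) -> Dict[GroupKey, Dict[str, List[Record]]]:
    """Group records by subject and PD type, then sort each side by event date."""
    groups: Dict[GroupKey, Dict[str, List[Record]]] = defaultdict(
        lambda: {"CTMS": [], "EDC": []})
    for source, records in (("CTMS", ctms), ("EDC", edc)):
        for rec in records:
            groups[(rec["screening_no"], rec["pd_type"])][source].append(rec)
    for sides in groups.values():
        for recs in sides.values():
            recs.sort(key=lambda r: date.fromisoformat(r["occurred_on"]))
    return dict(groups)


def build_comparison_prompt(key: GroupKey, sides: Dict[str, List[Record]]) -> str:
    """Construct the comparison prompt word for one (subject, PD type) group."""
    lines = [f"Subject {key[0]}, PD type {key[1]}.",
             "Mark records missing from either system and explain any differences."]
    for source in ("CTMS", "EDC"):
        lines.append(f"{source}:")
        lines.extend(f"- {r['occurred_on']} {r['description']}" for r in sides[source])
    return "\n".join(lines)
```

Sorting both sides chronologically before prompting means that records for the same event appear at matching positions, which should make a record missing from one system easier for the model to spot.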
In an embodiment, the report generating unit 304 is configured to generate, according to the detection result, a structured Excel file containing the consistency check and the difference comparison, so as to obtain the report.
It should be noted that, as those skilled in the art can clearly understand, for the specific implementation of the above clinical trial protocol deviation automatic checking device 300 and each of its units, reference may be made to the corresponding descriptions in the foregoing method embodiments; for convenience and brevity, the details are not repeated here.
The above clinical trial protocol deviation automatic checking device 300 may be implemented in the form of a computer program that can run on a computer device as shown in fig. 9.
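A structured Excel report of this kind might be produced as sketched below. The column layout and the keys of each detection result are assumptions, and the sketch relies on the third-party `openpyxl` package, which the source does not mention; it is imported only inside `write_xlsx` so that the row-building step has no external dependency.

```python
from typing import Dict, List

# Hypothetical column layout; the source does not define the report columns.
HEADER = ["Subject", "PD type", "Consistent", "Confidence",
          "Missing in", "Difference reason"]


def build_report_rows(results: List[Dict[str, object]]) -> List[List[object]]:
    """Flatten detection results into a header row plus one row per result."""
    rows: List[List[object]] = [HEADER]
    for r in results:
        rows.append([r["subject"], r["pd_type"],
                     "Yes" if r["consistent"] else "No",
                     r["confidence"],
                     r.get("missing_in", ""),
                     r.get("difference_reason", "")])
    return rows


def write_xlsx(rows: List[List[object]], path: str) -> None:
    """Write the rows to a single-sheet workbook (requires openpyxl)."""
    from openpyxl import Workbook  # third-party dependency, imported lazily
    workbook = Workbook()
    sheet = workbook.active
    sheet.title = "PD check"
    for row in rows:
        sheet.append(row)
    workbook.save(path)
```

Keeping row construction separate from file writing lets the report content be checked without touching the file system.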
Referring to fig. 9, fig. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 500 may be a server, where the server may be a stand-alone server or may be a server cluster formed by a plurality of servers.
With reference to FIG. 9, the computer device 500 includes a processor 502, memory, and a network interface 505 connected by a system bus 501, where the memory may include a non-volatile storage medium 503 and an internal memory 504.
The non-volatile storage medium 503 may store an operating system 5031 and a computer program 5032. The computer program 5032 includes program instructions that, when executed, cause the processor 502 to perform a clinical trial protocol deviation automatic checking method.
The processor 502 is used to provide computing and control capabilities to support the operation of the overall computer device 500.
The internal memory 504 provides an environment for the execution of the computer program 5032 in the non-volatile storage medium 503; when executed by the processor 502, the computer program 5032 causes the processor 502 to perform a clinical trial protocol deviation automatic checking method.
The network interface 505 is used for network communication with other devices. It will be appreciated by those skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting of the computer device 500 to which the present inventive arrangements may be implemented, as a particular computer device 500 may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
Wherein the processor 502 is configured to execute a computer program 5032 stored in a memory to implement the steps of:
Obtaining PD records and PD classification grading data from a CTMS system and an EDC system to obtain two pieces of data of the same project from the CTMS system and the EDC system; analyzing the two pieces of data by using a large language model to obtain two analysis results; performing a consistency check on the two pieces of data according to the two analysis results and comparing differences to obtain a detection result; generating a report containing the consistency check and the difference comparison according to the detection result; and outputting the report.
In one embodiment, when the step of parsing the two pieces of data using the large language model to obtain two parsing results is implemented by the processor 502, the following steps are specifically implemented:
The method comprises the steps of classifying the two data by using a large language model to obtain deviation record data and classification data, automatically mapping different system fields to the deviation record data and the classification data and analyzing unstructured PD descriptions to extract key information, and converting the key information into a unified internal data structure to obtain two analysis results.
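Converting records from both systems into a unified internal data structure could look like the sketch below. Note the swapped-in technique: the source describes the field mapping as automatic, whereas this sketch uses a fixed lookup table for clarity. The column names on both sides and the fields of `UnifiedPD` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class UnifiedPD:
    """Hypothetical unified internal structure for one protocol deviation."""
    source: str
    screening_no: str
    pd_type: str
    occurred_on: str
    description: str


# Hypothetical source column names mapped to unified field names.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "CTMS": {"Subject No.": "screening_no", "Deviation Category": "pd_type",
             "Date of Deviation": "occurred_on", "Details": "description"},
    "EDC": {"SUBJID": "screening_no", "DVCAT": "pd_type",
            "DVSTDTC": "occurred_on", "DVTERM": "description"},
}


def to_unified(source: str, row: Dict[str, str]) -> UnifiedPD:
    """Rename the source-specific columns and build the unified record."""
    fields = {target: row[column].strip()
              for column, target in FIELD_MAPS[source].items()}
    return UnifiedPD(source=source, **fields)
```

Once both exports share one structure, the later consistency check and comparison steps no longer need to know which system a field came from.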
In one embodiment, when the step of classifying the two data using the large language model to obtain the deviation record data and the classification data is implemented by the processor 502, the following steps are specifically implemented:
Dividing the two data into data blocks, classifying each data block by using the zero-shot classification capability of a large language model, and determining a final class in combination with a voting mechanism to obtain the deviation record data and the classification data.
In one embodiment, when the step of automatically mapping the deviation record data and the classification hierarchical data to different system fields and parsing unstructured PD descriptions to extract key information is implemented by the processor 502, the following steps are specifically implemented:
Constructing a prompt word; inputting the deviation record data and the classification grading data, together with the prompt word, into a large language model to analyze the unstructured PD descriptions and obtain an analysis result in JSON format; and deserializing the JSON-format analysis result into an object of a programming language to obtain the key information.
In an embodiment, when the processor 502 performs the step of performing consistency check on the two data according to the two analysis results and comparing differences to obtain a detection result, the following steps are specifically implemented:
The method comprises the steps of obtaining two analysis results, checking internal consistency of the two analysis results through a large language model to obtain consistency results, comparing the two analysis results according to subjects and time, marking missing records and evaluating consistency of two deviation record data to determine missing marks and difference reasons, wherein the detection results comprise consistency results, missing marks and difference reasons.
In one embodiment, when the step of checking the internal consistency of the two parsing results through the large language model to obtain a consistency result is implemented by the processor 502, the following steps are specifically implemented:
Inputting the two analysis results and the prompt word into a large language model for internal consistency analysis to obtain a consistency result;
Wherein the consistency result comprises a consistency determination (whether the results are consistent), a confidence level, an inconsistency reason and a suggestion.
In one embodiment, when the processor 502 implements the step of comparing the two analysis results according to subject and time, marking missing records and evaluating the consistency of the two deviation record data to determine the missing marks and the difference reasons, the following steps are specifically implemented:
Grouping the two analysis results according to the subject screening number, and classifying the records under the same subject according to PD type to obtain a plurality of groups of data; sorting each group of data according to event occurrence time to obtain a sorting result; constructing a comparison prompt word; and inputting the sorting result and the comparison prompt word into a large language model to obtain a comparison result, wherein the comparison result comprises the marked missing records and the difference reasons.
In one embodiment, when the step of generating the report including consistency check and difference comparison according to the detection result is implemented by the processor 502, the following steps are specifically implemented:
Generating, according to the detection result, a structured Excel file containing the consistency check and the difference comparison, so as to obtain the report.
It should be appreciated that, in embodiments of the present application, the processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
Those skilled in the art will appreciate that all or part of the flow in a method embodying the above described embodiments may be accomplished by computer programs instructing the relevant hardware. The computer program comprises program instructions, and the computer program can be stored in a storage medium, which is a computer readable storage medium. The program instructions are executed by at least one processor in the computer system to implement the flow steps of the embodiments of the method described above.
Accordingly, the present invention also provides a storage medium. The storage medium may be a computer readable storage medium. The storage medium stores a computer program which, when executed by a processor, causes the processor to perform the steps of:
Obtaining PD records and PD classification grading data from a CTMS system and an EDC system to obtain two pieces of data of the same project from the CTMS system and the EDC system; analyzing the two pieces of data by using a large language model to obtain two analysis results; performing a consistency check on the two pieces of data according to the two analysis results and comparing differences to obtain a detection result; generating a report containing the consistency check and the difference comparison according to the detection result; and outputting the report.
In one embodiment, when the processor executes the computer program to implement the step of parsing the two pieces of data using a large language model to obtain two parsing results, the steps are specifically implemented as follows:
The method comprises the steps of classifying the two data by using a large language model to obtain deviation record data and classification data, automatically mapping different system fields to the deviation record data and the classification data and analyzing unstructured PD descriptions to extract key information, and converting the key information into a unified internal data structure to obtain two analysis results.
In one embodiment, when the processor executes the computer program to implement the step of classifying the two data using a large language model to obtain deviation record data and classification data, the steps are specifically implemented as follows:
Dividing the two data into data blocks, classifying each data block by using the zero-shot classification capability of a large language model, and determining a final class in combination with a voting mechanism to obtain the deviation record data and the classification data.
In one embodiment, when the processor executes the computer program to automatically map the different system fields to the deviation record data and the classification data and parse unstructured PD descriptions to extract key information, the processor specifically implements the following steps:
Constructing a prompt word; inputting the deviation record data and the classification grading data, together with the prompt word, into a large language model to analyze the unstructured PD descriptions and obtain an analysis result in JSON format; and deserializing the JSON-format analysis result into an object of a programming language to obtain the key information.
In an embodiment, when the processor executes the computer program to perform the step of performing consistency check on the two data according to the two analysis results and comparing differences to obtain a detection result, the following steps are specifically implemented:
Checking the internal consistency of the two analysis results through a large language model to obtain a consistency result;
Comparing the two analysis results according to subject and time, marking missing records and evaluating the consistency of the two deviation record data to determine the missing marks and the difference reasons;
The detection result comprises the consistency result, the missing marks and the difference reasons.
In one embodiment, when the processor executes the computer program to implement the step of checking the internal consistency of the two parsing results through a large language model to obtain a consistency result, the processor specifically implements the following steps:
Constructing a prompt word according to the two analysis results; and inputting the two analysis results and the prompt word into a large language model for internal consistency analysis to obtain a consistency result, wherein the consistency result comprises a consistency determination (whether the results are consistent), a confidence level, an inconsistency reason and a suggestion.
In one embodiment, when the processor executes the computer program to perform the steps of comparing the two analysis results by subject and time, marking the missing record and evaluating the consistency of the two deviation record data to determine the missing mark and the cause of the difference, the steps are specifically implemented as follows:
Grouping the two analysis results according to the subject screening number, and classifying the records under the same subject according to PD type to obtain a plurality of groups of data; sorting each group of data according to event occurrence time to obtain a sorting result; constructing a comparison prompt word; and inputting the sorting result and the comparison prompt word into a large language model to obtain a comparison result, wherein the comparison result comprises the marked missing records and the difference reasons.
In one embodiment, when the processor executes the computer program to implement the step of generating a report including consistency check and difference comparison according to the detection result, the method specifically includes the following steps:
Generating, according to the detection result, a structured Excel file containing the consistency check and the difference comparison, so as to obtain the report.
The storage medium may be a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, an optical disk, or any other computer-readable storage medium that can store program code.
Those of ordinary skill in the art will appreciate that the elements and algorithm steps described in connection with the embodiments disclosed herein may be embodied in electronic hardware, in computer software, or in a combination of the two, and that the elements and steps of the examples have been generally described in terms of function in the foregoing description to clearly illustrate the interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the device embodiments described above are merely illustrative. For example, the division of each unit is only one logic function division, and there may be another division manner in actual implementation. For example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed.
The steps in the method of the embodiment of the invention can be sequentially adjusted, combined and deleted according to actual needs. The units in the device of the embodiment of the invention can be combined, divided and deleted according to actual needs. In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a storage medium. Based on such understanding, the technical solution of the present invention in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a terminal, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention.
While the invention has been described with reference to certain preferred embodiments, those skilled in the art will understand that various changes and equivalent substitutions may be made without departing from the scope of the invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (7)

1. A method for automatically verifying deviations from a clinical trial protocol, comprising:
Acquiring PD records and PD classification grading data from a CTMS system and an EDC system to obtain two pieces of data of the same project from the CTMS system and the EDC system;
analyzing the two data by using a large language model to obtain two analysis results;
Consistency checking is carried out on the two data according to the two analysis results, and differences are compared to obtain a detection result;
generating a report containing consistency check and difference comparison according to the detection result;
outputting the report;
Analyzing the two data by using the large language model to obtain two analysis results, wherein the analyzing comprises the following steps:
Classifying the two data by using a large language model to obtain deviation record data and classification data;
Automatically mapping different system fields to the deviation record data and the classification grading data and analyzing unstructured PD descriptions to extract key information;
converting the key information into a unified internal data structure to obtain two analysis results;
the classifying the two pieces of data using a large language model to obtain deviation record data and classification data includes:
dividing the two data into data blocks, classifying each data block by using the zero-shot classification capability of a large language model, and determining a final class in combination with a voting mechanism to obtain deviation record data and classification data;
The automatically mapping the deviation record data and classification data into different system fields and parsing unstructured PD descriptions to extract key information includes:
constructing a prompt word;
Inputting the deviation record data and the classification grading data into a large language model by combining the prompt words so as to analyze unstructured PD description and obtain an analysis result in a JSON format;
Deserializing the analysis result in the JSON format into an object of a programming language to obtain the key information.
2. The method of claim 1, wherein the performing a consistency check on the two data based on the two analysis results and comparing differences to obtain a detection result comprises:
Checking the internal consistency of the two analysis results through a large language model to obtain a consistency result;
comparing the two analysis results according to the subjects and time, marking the missing record and evaluating the consistency of the two deviation record data so as to determine the missing mark and the difference reason;
The detection result comprises the consistency result, the missing mark and the difference reason.
3. The method of claim 2, wherein the checking the internal consistency of the two analysis results through a large language model to obtain a consistency result comprises:
Constructing a prompt word according to the two analysis results;
Inputting the two analysis results and the prompt word into a large language model for internal consistency analysis to obtain a consistency result;
Wherein the consistency result comprises a consistency result, a confidence level, a non-consistency reason and a suggestion.
4. The method of claim 3, wherein the comparing the two analysis results according to the subjects and time, marking the missing record and evaluating the consistency of the two deviation record data to determine the missing mark and the difference reason comprises:
Grouping the two analysis results according to the screening number of the subject, and classifying records under the same subject according to the PD type to obtain a plurality of groups of data;
sequencing each group of data according to the occurrence time of the event to obtain a sequencing result;
constructing a comparison prompt word;
and inputting the sequencing result and the comparison prompt word into a large language model to obtain a comparison result, wherein the comparison result comprises the marked missing records and the difference reasons.
5. The method of claim 1, wherein the generating a report containing consistency check and difference comparison according to the detection result comprises:
Generating, according to the detection result, a structured Excel file containing the consistency check and the difference comparison, so as to obtain the report.
6. A clinical trial deviation automatic checking apparatus, wherein the apparatus uses the clinical trial deviation automatic checking method according to any one of claims 1 to 5, comprising:
The acquisition unit is used for acquiring PD records and PD classification grading data from the CTMS system and the EDC system so as to obtain two pieces of data of the same project from the CTMS system and the EDC system;
the analysis unit is used for analyzing the two data by using the large language model to obtain two analysis results;
The PD checking unit is used for carrying out consistency check on the two data according to the two analysis results and comparing the differences to obtain a detection result;
the report generation unit is used for generating a report containing consistency check and difference comparison according to the detection result;
and the output unit is used for outputting the report.
7. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-5.
CN202510746150.5A 2025-06-05 Automatic verification method and device for deviation of clinical test scheme and computer equipment Active CN120257967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202510746150.5A CN120257967B (en) 2025-06-05 Automatic verification method and device for deviation of clinical test scheme and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202510746150.5A CN120257967B (en) 2025-06-05 Automatic verification method and device for deviation of clinical test scheme and computer equipment

Publications (2)

Publication Number Publication Date
CN120257967A CN120257967A (en) 2025-07-04
CN120257967B true CN120257967B (en) 2025-10-17

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
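Optimization of consistency checking of serious adverse events in clinical trials; Zhou Bei; Yu Hao; Chinese Journal of Clinical Pharmacology and Therapeutics; 2018-04-26; Vol. 23 (No. 04); main text pp. 1-6 *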

Similar Documents

Publication Publication Date Title
US12265918B2 (en) Systems and methods for enriching modeling tools and infrastructure with semantics
CN102713834B (en) Management accounts format information
US8166000B2 (en) Using a data mining algorithm to generate format rules used to validate data sets
Halkidi et al. Data mining in software engineering
US20170109636A1 (en) Crowd-Based Model for Identifying Executions of a Business Process
CN118245441B (en) An industrial and commercial digital archive management system capable of automatic classification
US10318481B2 (en) System and method to determine quality of a document screening process
CN117725122A (en) An order synchronization method for business management platform
US20220405235A1 (en) System and method for reference dataset management
CN115547466A (en) Medical institution registration and review system and method based on big data
US11816112B1 (en) Systems and methods for automated process discovery
CN119987829A (en) Automated code review method, computer and storage medium
CN119204016A (en) Scientific dataset naming standard checking and automatic updating of model training methods and systems
WO2025019581A1 (en) Data digitization via custom integrated machine learning ensembles
US12159203B1 (en) Creation and execution of portable software for execution on one or more remote computers
CN120257967B (en) Automatic verification method and device for deviation of clinical test scheme and computer equipment
CN117311777A (en) Automatic operation and maintenance platform and method
CN120257967A (en) Method, device and computer equipment for automatic verification of deviation from clinical trial protocol
Fumagalli et al. Liveschema: A gateway towards learning on knowledge graph schemas
CN120297382A (en) A method, device and equipment for automatically constructing a survey and design specification mapping map
CN120763062A (en) Software testing task management method and device based on interventional instruction knowledge base
CN120672296A (en) Low-code-based intelligent process construction and dynamic optimization method
CN120162353A (en) A method and system for classifying and retrieving technical achievements
WO2025149849A1 (en) Machine learning techniques for improving context and understanding of user interaction-based data
WO2024127060A1 (en) Method for verifying a defined classification of a log file of a drilling operation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant