
Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines

Abstract

Background

To evaluate the impact of an annotation guideline on the performance of large language models (LLMs) in extracting data from stroke computed tomography (CT) reports.

Methods

The performance of GPT-4o and Llama-3.3-70B in extracting ten imaging findings from stroke CT reports was assessed in two datasets from a single academic stroke center. Dataset A (n = 200) was a stratified cohort including various pathological findings, whereas dataset B (n = 100) was a consecutive cohort. Initially, an annotation guideline providing clear data extraction instructions was designed based on a review of cases with inter-annotator disagreements in dataset A. For each LLM, data extraction was performed under two conditions: with the annotation guideline included in the prompt and without it.

Results

GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. Across both models and both datasets, incorporating the annotation guideline into the LLM input resulted in higher precision rates, while recall rates largely remained stable. In dataset B, the precision of GPT-4o and Llama-3.3-70B improved from 0.83 to 0.95 and from 0.65 to 0.74, respectively. Overall classification performance with and without the annotation guideline differed significantly in five out of six conditions.

Conclusion

GPT-4o and Llama-3.3-70B show promising performance in extracting imaging findings from stroke CT reports, although GPT-4o consistently outperformed Llama-3.3-70B. We also provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.

Relevance statement

Annotation guidelines can improve the accuracy of LLMs in extracting findings from radiological reports, potentially optimizing data extraction for specific downstream applications.

Key Points

  • LLMs have utility in data extraction from radiology reports, but the role of annotation guidelines remains underexplored.

  • Data extraction accuracy from stroke CT reports by GPT-4o and Llama-3.3-70B improved when well-defined annotation guidelines were incorporated into the model prompt.

  • Well-defined annotation guidelines can improve the accuracy of LLMs in extracting imaging findings from radiological reports.

Background

Computed tomography (CT) imaging in suspected acute stroke, including non-enhanced CT (NECT), CT angiography, and CT perfusion, provides critical insights into the type of stroke (ischemic or hemorrhagic), the presence of vessel occlusions, and the extent of brain damage. The findings recorded in radiological reports are pivotal in determining eligibility for intravenous thrombolysis or mechanical thrombectomy [1]. Importantly, the data contained in these reports hold enormous value beyond their utility in clinical decision-making, enabling studies on epidemiology [2], pathophysiology [3], treatment efficacy [4, 5], and patient outcomes [6, 7]. Key variables can further be utilized as labels for training machine learning algorithms for tasks such as detecting large vessel occlusion [8] or automatic evaluation of the Alberta Stroke Program Early CT Score (ASPECTS) [9]. Imaging findings also play a crucial role in national stroke registries aiming to monitor and improve the quality of stroke care [10, 11].

Yet, given that most radiology reports still consist of prose and lack standardized terminology, analysis of reported findings has previously necessitated labor-intensive manual annotation by experts, limiting scalability [12]. Natural language processing systems based on machine learning have shown promising results in automating information extraction from radiology reports but have been limited by the scarcity of annotated training data and the variability of reports [13,14,15].

Recently, large language models (LLMs) have demonstrated great potential in overcoming these limitations. In radiology, LLMs have shown significant promise in tasks such as report generation [16, 17], report translation [18], and differential diagnosis (DDx) [19, 20]. The data extraction performance of LLMs has been evaluated on reports from various modalities, ranging from X-ray to interventional angiography. Notably, both proprietary [21,22,23] and open-source LLMs [24, 25] have been assessed, with open-source models offering the advantage of local data processing, which enhances patient data privacy. For models of both categories, accuracies above 90% of correctly extracted parameters have been reported [22, 24], demonstrating their potential utility.

However, methodological inconsistencies in these types of studies pose challenges. For instance, many studies relied on manual annotations by only a single annotator [21, 22, 26], making the reference standard prone to subjective bias and human error. In addition, a scoping review by Reichenpfader et al [27] pointed out that only 9% (3/34) of studies reported annotation guidelines, unveiling a frequent lack of standardization and transparency in the annotation process. Moreover, many studies modeled findings in radiology reports as simple binary variables, which fail to capture the nuanced levels of diagnostic uncertainty expressed in the textual descriptions [24, 27, 28].

The aim of this study was to evaluate the performance of LLMs in data extraction from stroke CT reports, with or without annotation guidelines supplied as additional input.

Methods

This retrospective study was approved by the Institutional Review Board of the Technical University of Munich (TUM), and the need for informed consent was waived.

Datasets

This study employed two datasets from a single German academic institution with a comprehensive stroke center. Both included patients who underwent CT examination for suspected acute stroke, comprising either non-enhanced CT with CT angiography alone or with an additional CT perfusion. Reports were available in German.

Dataset A (n = 200) was a stratified cohort comprising five purposively sampled subgroups of 40 reports each (exam dates ranging from June 14, 2022, to July 14, 2024): ischemic stroke of the anterior circulation, ischemic stroke of the posterior circulation, extracranial pathology (e.g., carotid stenosis), intracranial hemorrhage, and miscellaneous pathologies. Covering various pathologies, this dataset served as the basis for creating a comprehensive annotation guideline. Dataset B (n = 100) comprised a consecutive cohort collected chronologically between August 1, 2024, and September 14, 2024.

Prior to conducting the LLM queries, reports from both datasets were manually reviewed and curated by S.H.K. and J.W. to remove potentially identifying information (e.g., names, examination dates).

A formal sample size calculation was not performed, given the exploratory nature of the evaluation and the absence of prior literature providing comparable effect sizes.

Data extraction parameters

A template with the following ten imaging findings was created and represented in JavaScript Object Notation (JSON) format (Fig. 1): intracerebral hemorrhage (ICH), epidural hemorrhage, subdural hemorrhage, subarachnoid hemorrhage (SAH), infarct demarcation, vascular occlusion, vascular stenosis, aneurysm, dissection, and perfusion deficit.
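
For illustration, a minimal version of such a JSON template could look like the following sketch. The English field names are our assumptions for readability; the actual schema (shown in Fig. 1 and in the authors' repository) was applied to German reports.

{
  "intracerebral_hemorrhage": "Yes/No",
  "epidural_hemorrhage": "Yes/No",
  "subdural_hemorrhage": "Yes/No",
  "subarachnoid_hemorrhage": "Yes/No",
  "infarct_demarcation": "Yes/No",
  "vascular_occlusion": "Yes/No",
  "vascular_stenosis": "Yes/No",
  "aneurysm": "Yes/No",
  "dissection": "Yes/No",
  "perfusion_deficit": "Yes/No"
}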

Fig. 1

Study design. Initially, two raters annotated dataset A using a preliminary annotation guideline with a few general instructions. Based on the guideline deficiencies uncovered by a review of cases with inter-rater disagreement, an addendum was appended to the original document, forming the final annotation guideline. The data extraction performance of GPT-4o and Llama-3.3-70B with and without the annotation guideline was evaluated in dataset A and additionally in another dataset (dataset B) that was not used to formulate the annotation guideline. At the bottom, a fictional CT report in English is shown along with the data parameters extracted from it in JSON format to illustrate the methodology

Manual annotations and annotation guidelines

A prototypical user interface (provided by Smart Reporting GmbH, Munich, Germany) was used to perform manual data entries. One radiology resident with two years of dedicated neuroradiology experience (S.H.K.) and one fourth-year medical student (J.W.) independently annotated dataset A (n = 200). Both annotators followed a brief annotation guideline with general instructions (e.g., handling of missing data) defined by S.H.K. During the annotation process of dataset A, ambiguous instances were identified and recorded by the raters.

S.H.K. and D.H. (a board-certified neuroradiologist with 10 years of experience) reviewed cases with inter-rater disagreement in dataset A and added an addendum to the original annotation guideline addressing the identified edge cases. The resulting annotation guideline included example phrases and their correct labels. Such approaches, where LLMs are provided with a few examples of a task to guide the model’s output, are also known as ‘few-shot prompting’ [29]. Manual annotations of dataset A were revised according to the final guideline. Manual annotations of dataset B were conducted by S.H.K. and J.W. according to the same final guideline, and cases of inter-rater disagreement were resolved by D.H. Findings not mentioned in the report were considered absent. This reasoning principle, whereby anything not explicitly stated is assumed to be false or absent, is also known as the “closed-world assumption” [30]. Findings with uncertainty descriptors (“possible”, “DDx”, etc.) indicating no clear positive or negative tendency were labeled as “unknown” by annotators and omitted from the LLM data extraction analysis.
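
To make this labeling logic concrete, the following minimal Python sketch applies the closed-world assumption and the handling of uncertainty descriptors. It is illustrative only: the trigger phrases are simplified English assumptions, whereas the actual guideline operated on German report text and contained additional edge-case rules (Table 2).

def label_finding(mention):
    """Map a report mention (or its absence) to an annotation label."""
    if mention is None:
        # Closed-world assumption: findings not mentioned are absent.
        return "No"
    text = mention.lower()
    # Equivocal descriptors without a clear tendency -> "unknown";
    # these data points were excluded from the LLM analysis.
    if any(term in text for term in ("possible", "ddx")):
        return "unknown"
    # Uncertain findings with a clear negative tendency -> "No".
    if "unlikely" in text or "no evidence of" in text:
        return "No"
    # Explicit or likely-positive mentions -> "Yes".
    return "Yes"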

LLM infrastructure

Generative pre-trained transformer 4 omni (GPT-4o; ‘gpt-4o-2024-08-06’) by OpenAI [31] and Large Language Model Meta AI (Llama) 3.3-70B (‘Llama-3.3-70B-Instruct’) by Meta [32] were chosen as representative state-of-the-art proprietary and open-source LLMs, respectively, at the time of the study. GPT-4o was accessed via OpenAI’s application programming interface (API) (https://platform.openai.com/docs/models#gpt-4o). Llama-3.3-70B was deployed in a local environment using the Python library “llama-cpp-python”, which provides compressed, less memory-intensive LLM instances (‘quantization’). A quantization factor of Q4_K_M was chosen to allow full GPU offloading. A single NVIDIA Quadro RTX 8000 with 48 GB of video memory was used for local inference.

For both GPT-4o and Llama-3.3-70B, the model temperature was set to 0.0 to ensure deterministic results. To explore the impact of temperature settings on data extraction performance, GPT-4o queries were additionally run with the default temperature setting of 1. Our scripts for executing both models are publicly available in our repository at: https://github.com/shk03/stroke_llm_data_extraction.
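
As a minimal sketch of this local deployment (not the authors' exact script, which is available in the repository; the GGUF file name and context size are assumptions for illustration):

from llama_cpp import Llama

# Load a Q4_K_M-quantized Llama-3.3-70B model; n_gpu_layers=-1 offloads
# all layers to the GPU, which is feasible on a 48-GB card at this
# quantization level.
llm = Llama(
    model_path="Llama-3.3-70B-Instruct-Q4_K_M.gguf",  # assumed file name
    n_gpu_layers=-1,
    n_ctx=4096,  # assumed context window
)

prompt = "Extract the information provided in the radiological report ..."
response = llm.create_chat_completion(
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic output
)
print(response["choices"][0]["message"]["content"])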

LLM queries

For both models, queries were performed with and without annotation guidelines. The base prompt was defined as follows (translated from German to English):

Extract the information provided in the radiological report in the format of a JSON file.

Each of the ten parameters should be evaluated as ‘Yes’ or ‘No.’ Findings that are not mentioned are considered absent and should be evaluated as ‘No.’

Please take the following guidelines into account: // only included in the ‘with guideline’ group

{annotation_guidelines} // only included in the ‘with guideline’ group

The JSON file should have the following structure:

{json_schema}

Here is the report from which the information should be extracted:

{report_text}

Queries with a temperature of 0 were conducted only once, assuming that this setting would lead to almost fully deterministic results. In contrast, GPT-4o queries with a temperature of 1 were repeated three times each to account for probabilistic variations. Execution times for LLM queries were recorded.
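
A condensed Python sketch of this querying procedure is shown below. It is not the authors' published script (see the repository for that); in particular, enforcing JSON output via the API's response_format parameter is our assumption, and the prompt is abbreviated.

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def query_gpt4o(report_text, json_schema, guideline=None,
                temperature=0.0, n_runs=1):
    """Query GPT-4o once (temperature 0) or repeatedly (temperature 1)."""
    # The guideline block is inserted only in the 'with guideline' condition.
    guideline_block = (
        "Please take the following guidelines into account:\n"
        + guideline + "\n" if guideline else ""
    )
    prompt = (
        "Extract the information provided in the radiological report in "
        "the format of a JSON file. Each of the ten parameters should be "
        "evaluated as 'Yes' or 'No.' Findings that are not mentioned are "
        "considered absent and should be evaluated as 'No.'\n"
        + guideline_block
        + "The JSON file should have the following structure:\n"
        + json_schema
        + "\nHere is the report from which the information should be "
        "extracted:\n" + report_text
    )
    results = []
    for _ in range(n_runs):  # n_runs=3 for the temperature-1 condition
        resp = client.chat.completions.create(
            model="gpt-4o-2024-08-06",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature,
            response_format={"type": "json_object"},  # force valid JSON
        )
        results.append(json.loads(resp.choices[0].message.content))
    return results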

GPT-4o's performance in diagnostic certainty assessment

In an additional experiment, the ability of GPT-4o to correctly assess the diagnostic uncertainty of report content was evaluated in dataset B, using the default temperature setting of 1. GPT-4o was instructed to classify the ten parameters into one of the following five categories: ‘certainly absent’, ‘unlikely’, ‘possible’, ‘likely’, ‘certainly present’. Accuracy was rated against manual annotations by a single annotator (J.W.).

Analysis

Statistical analyses and data visualizations were performed using Python (version 3.11.8). To calculate accuracy metrics for GPT-4o queries with a temperature of 1, the mode of labels across three repetitions was used. The probabilistic variation was quantified as the percentage of cases producing consistent results across repetitions. For all conditions, precision (= positive predictive value), recall (= sensitivity), F1 score (= harmonic mean of precision and recall), specificity, and negative predictive value were reported. Aggregated metrics across all parameters were computed using micro-averaging, which consolidates true positives, false positives, and false negatives globally. Confidence intervals for precision and recall were determined using the Wilson score method [33]. The resulting lower and upper bounds were used to approximate the confidence interval of the F1 score. Overall classification performance was compared between groups with and without the annotation guideline using the McNemar test. A p-value < 0.05 was considered statistically significant. Correction for multiple testing was not performed, given the exploratory nature of the study.
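
A minimal Python sketch of the core computations (our own illustrative functions, not the authors' analysis code):

import math
from collections import Counter


def micro_metrics(y_true, y_pred):
    """Micro-averaged precision, recall, and F1 ("Yes" = positive class),
    pooling counts across all ten findings."""
    tp = sum(t == "Yes" and p == "Yes" for t, p in zip(y_true, y_pred))
    fp = sum(t == "No" and p == "Yes" for t, p in zip(y_true, y_pred))
    fn = sum(t == "Yes" and p == "No" for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def wilson_ci(successes, n, z=1.96):
    """95% Wilson score confidence interval for a proportion."""
    p = successes / n
    center = (p + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
            / (1 + z**2 / n))
    return center - half, center + half


def mode_label(labels):
    """Majority label across the three repeated queries (temperature = 1)."""
    return Counter(labels).most_common(1)[0][0]

The McNemar test on the paired correct/incorrect classifications can be computed, for instance, with the mcnemar function from statsmodels.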

Results

Cohort overview

An overview of the patient cohorts is presented in Table 1. Patients in both datasets had a median age of 79 years (dataset A: interquartile range [IQR] 72–85; dataset B: IQR 65–84) and a balanced sex distribution (50.0% and 51.0% females, respectively). Due to its purposive selection, dataset A exhibited a higher proportion of pathological findings than the consecutive dataset B, with particularly high occurrences of vascular occlusions (A: 43.0%, B: 21.0%), stenoses (A: 38.5%, B: 26.0%), and perfusion deficits (A: 42.5%, B: 25.0%). Intracranial hemorrhages were relatively rare, with ICHs being most common (A: 8.5%, B: 4.0%). Similarly, aneurysms (A: 4.5%, B: 4.0%) and arterial dissections (A: 1.5%, B: 1.0%) were found only on rare occasions.

Table 1 Cohort overview

Annotation guidelines

The annotation guideline is presented in Table 2. The original guideline was used by the raters during initial annotation of dataset A, and contained general instructions on handling descriptive findings, indicators of diagnostic uncertainty, and contradictions within the report.

Table 2 Annotation guideline

Annotators agreed on 96.0% (1,920/2,000) of data points in dataset A (Cohen’s kappa κ = 0.852). A thorough review of cases with inter-annotator disagreement (4.0%; 80/2,000) revealed that 1.3% (26/2,000) of discrepancies originated from unclear guideline instructions, as opposed to 2.7% (54/2,000) resulting from careless labeling errors by the annotators. To address the guideline deficiencies identified from the review of data points with inter-rater disagreement in dataset A, an addendum was created that contained both general instructions (e.g., handling previous findings) and directions on classifying specific edge cases (e.g., not counting pseudoaneurysms as aneurysms). Using a few-shot prompting approach, potentially ambiguous instructions in the annotation guideline were complemented with one or more example expressions, along with the correct label. For instance, one instruction stated: “A finding is only deemed present if it is explicitly mentioned.” In addition, the following example was included to illustrate the instruction: “Hypodensity in the parenchyma is not equivalent to infarct demarcation.” The final annotation guideline, containing the original instructions and the addendum, was provided to GPT-4o and Llama-3.3-70B.

Model performance

We excluded 1.7% (34/2,000) and 1.4% (14/1,000) of data points from dataset A and dataset B, respectively, as the report text indicated diagnostic uncertainty without a clear positive or negative tendency (e.g., “possible”, “DDx”) (Supplements 1 and 2).

Overall, GPT-4o consistently demonstrated superior performance over Llama-3.3-70B under identical conditions, with micro-averaged precision (= positive predictive value) ranging from 0.83 to 0.95 for GPT-4o and from 0.65 to 0.86 for Llama-3.3-70B. In dataset B, GPT-4o and Llama-3.3-70B (temperature = 0) yielded a precision of 0.95 and 0.74, respectively, in the presence of the annotation guideline, while both exhibited equal recall values (= sensitivity; both 0.98).

Across all conditions, higher precision rates were observed when the annotation guideline was included in the prompt. The precision of GPT-4o (temperature = 1) improved from 0.87 to 0.94 in dataset A (p < 0.001) and from 0.83 to 0.95 in dataset B (p = 0.006). When using a temperature of 0, GPT-4o’s precision increased from 0.86 to 0.95 in dataset A (p < 0.001) and from 0.86 to 0.93 in dataset B, although the latter difference was not significant (p = 0.390). Similarly, the precision of Llama-3.3-70B (temperature = 0) rose from 0.78 to 0.86 (p < 0.001) in dataset A and from 0.65 to 0.74 in dataset B (p = 0.001). In contrast, recall rates largely remained stable, with values ranging from 0.98 to 0.99 in all conditions (Fig. 2).

Fig. 2

Data extraction performance of GPT-4o and Llama-3.3-70B across all parameters. Metrics for GPT-4o queries with a temperature setting of 1 were calculated based on the mode across three repetitions, whereas the remaining queries were run only once (with a temperature setting of 0). Error bars indicate 95% confidence intervals. A total of 1.7% (34/2,000) and 1.4% (14/1,000) of data points were excluded from datasets A and B, respectively, as the report text indicated diagnostic uncertainty without a clear positive or negative tendency (expressions such as “possible” or “DDx”). Precision: positive predictive value. Recall: sensitivity. F1 score: harmonic mean of precision and recall

Temperature settings of GPT-4o had only a minor impact on data extraction performance when the remaining conditions were equal. The largest difference in precision was seen in dataset B in the absence of guidelines, where an increase in the temperature resulted in a small drop in precision from 0.86 (temperature = 0) to 0.83 (temperature = 1).

Detailed variable-level metrics of GPT-4o in dataset B (temperature = 1) are presented in Table 3. Metrics varied widely, particularly for findings with low prevalence in the given dataset. The greatest increases in precision through guideline adoption were seen in infarct demarcation (from 0.59 to 1.00), subdural hematoma (SDH) (from 0.67 to 1.00), and vascular stenosis (from 0.84 to 0.96). Exemplary cases in which the annotation guideline influenced the LLM ratings are shown in Table 4. A review of data points where the LLM ratings were incorrect in both conditions showed that, in some cases, the instruction to exclude previous findings (e.g., “status post”, “previously described […]”) was not followed. Granular variable-level metrics of the remaining groups, along with specificity and negative predictive value, are provided in Supplements 3–13.

Table 3 Data extraction performance of GPT-4o (temperature = 1) with and without annotation guideline in dataset B (n = 100)
Table 4 Exemplary data extraction cases influenced by the annotation guideline (GPT-4o, dataset B, temperature = 1)

Processing time and test-retest reliability

Mean processing time for GPT-4o (accessed via the API) was 465.1 s per 100 reports, compared to 1,441.4 s per 100 reports for Llama-3.3-70B (operated on a local GPU). The mean time for manual annotation (measured in dataset B) was considerably longer than for both models (9,302.0 s per 100 reports, i.e., roughly 93 s per report versus approximately 5 s and 14 s for GPT-4o and Llama-3.3-70B, respectively). GPT-4o with a temperature setting of 1 showed very high test-retest reliability, with identical ratings across three repetitions in 97.6% of data points.

Diagnostic certainty assessment

GPT-4o accurately classified the diagnostic certainty level of 90.0% (900/1,000) of data points in dataset B. Yet, its performance was considerably lower for uncertain findings (categories ‘unlikely’, ‘possible’, ‘likely’), with only 35.0% (7/20) of data points classified correctly. In contrast, it performed markedly better for certain findings (categories ‘certainly absent’, ‘certainly present’), achieving 91.1% accuracy (893/980).

Discussion

In this study, we evaluated the performance of GPT-4o and Llama-3.3-70B in extracting data parameters from stroke CT reports with and without comprehensive annotation guidelines.

In summary, we demonstrate the promising performance of GPT-4o and Llama-3.3-70B in extracting findings from stroke CT reports. Our work extends previous studies illustrating the utility of LLMs in extracting data from mechanical thrombectomy reports [22, 25], brain MRI reports [24], and more. Although GPT-4o invariably outperformed Llama-3.3-70B under identical conditions, Llama-3.3-70B showed great potential, with overall precision and recall scores of up to 0.86 and 0.99, respectively. This is in accordance with several recent studies highlighting that open-source models are rapidly catching up with proprietary models [34,35,36]. Notably, numerous advantages of open-source models for clinical use have been pointed out, including enhanced data privacy, greater control over updates and customization, transparency, and stronger community collaboration [25, 37, 38]. Hence, it is reasonable to expect continued interest in and support for open-source models among the medical community, even though their local implementation demands considerable technical expertise and an advanced hardware infrastructure.

To explore the role of LLM temperature, a hyperparameter influencing output randomness and creativity, we performed GPT-4o queries with two different settings (0 and 1), but observed only negligible differences in data extraction metrics, consistent with prior research reporting stable data extraction performance across a temperature range from 0 to 1.5 [39]. Generally, higher temperature values increase output variability and creativity. In contrast, lower temperatures yield more deterministic and consistent results, as well as fewer hallucinations (i.e., factually incorrect information), which might be preferable in medical applications where reliability and reproducibility are critical [40].

Crucially, this study emphasizes the impact of a comprehensive annotation guideline on LLM data extraction performance. In both models and both datasets evaluated, the inclusion of a detailed annotation guideline led to an increase in precision while retaining very high recall rates. The annotation guideline, which was equally adopted by the human annotators defining the reference standard, included definitions of the individual data variables, along with instructions for specific edge cases. A granular analysis at the variable level reveals that the improvement in data extraction metrics was primarily driven by a more precise and narrower definition of several key parameters, including ‘infarct demarcation’ and ‘vascular stenosis’.

The guideline additionally provided directions on handling diagnostic uncertainty of findings, an inherent limitation of diagnostic interpretations. A wide variability in phrases conveying certainty levels in radiology reports has been reported previously [41, 42]. While the uncertainty of findings cannot be eliminated, the annotation process requires categorization into a binary variable. To resolve this issue, our guideline specified that uncertain findings with a clear positive or negative tendency be classified as “Yes” or “No”, respectively, though equivocal findings (e.g., “possible”, “DDx”) were manually excluded for the purpose of the analysis.

In a complementary experiment, we observed that GPT-4o displayed high accuracy in classifying findings on a 5-point certainty scale when findings were definitive (‘certainly present’, ‘certainly absent’) but struggled to correctly assign uncertain findings (‘likely’, ‘possible’, and ‘unlikely’), suggesting a potential weakness.

It is worth noting that the annotation guideline in this study was defined based on a meticulous review of cases with inter-annotator disagreement in one of the two datasets. This approach uncovered numerous edge cases and rating ambiguities that had not been anticipated in advance. When applying LLM-based data extraction in real-world use cases, annotation guidelines should be carefully designed to reflect the intended downstream use of the extracted data. For example, a more restrictive definition of variables leading to higher precision might be sensible if the accurate identification of certainly positive cases is decisive (e.g., in a retrospective study with strict inclusion criteria), whereas higher sensitivity (recall) should be prioritized if the primary goal is not to miss any true positives (e.g., identifying critical incidents).

Despite the fast-paced advancement of LLM capabilities, data extraction from unstructured radiology reports is constrained by their lack of standardization of content and terminology. Classifications such as ASPECTS [43], which are frequently assessed in study settings, cannot be meaningfully analyzed if not routinely reported. Extracting the location of a finding is complicated by its variable description (e.g., in terms of adjacent structures, vascular territories, or brain lobes). Furthermore, findings that are not explicitly stated introduce another layer of ambiguity. In our study, findings were presumed absent unless explicitly mentioned, although in some cases, lacking mentions might be indicative of findings missed by the radiologist. The impact of this ambiguity on clinical communication was exemplified in a survey study, where half of the referring clinicians believed the radiologist might not have evaluated a particular finding if not explicitly documented [44].

The increased availability of LLMs for reliable and accurate data extraction from radiology reports has broad implications for the healthcare community. In contrast to the previously widespread practice of manual text annotation, LLMs offer a far more efficient and scalable approach to converting unstructured narrative reports into structured, machine-readable data. This capability facilitates the seamless integration of radiology findings into population health databases [45], supports the training of machine learning algorithms for the detection and classification of radiological abnormalities [46], and enables robust analytics for quality assurance and outcome monitoring [47].

Importantly, implementing a patient privacy-preserving workflow for LLM-powered data extraction from radiology reports would require either a secure cloud-based infrastructure or on-premise hardware capable of running open-weight LLMs locally. While the study demonstrates the feasibility of using LLMs for automated data extraction, future work should address how this capability can be incorporated into existing radiology workflows. Potential integration pathways include embedding LLM services within radiology information systems (RIS), picture archiving and communication systems (PACS), or electronic health records (EHRs). However, clinical deployment would also require careful consideration of regulatory and ethical frameworks, including compliance with data protection regulations (e.g., the General Data Protection Regulation [GDPR] and the Health Insurance Portability and Accountability Act [HIPAA]), as well as measures to mitigate bias and ensure patient safety.

Several limitations of this study need to be acknowledged. First, the single-center nature of this study necessitates further validation to confirm the generalizability of our findings across multiple institutions. Second, due to the relatively small sample size of the consecutive cohort and the low occurrence of some findings in the dataset, the variable-level analysis of data extraction metrics was underpowered. This may also explain why the improvement in precision in GPT-4o (temperature = 0) did not reach statistical significance, although the difference was notable (from 0.86 to 0.93). Third, this study utilized only German reports, and the influence of language on LLM performance has not been explicitly assessed. While language imbalances in the training data of the utilized LLMs are likely, only minor differences in model performance have been observed between high-resource languages such as English and German. For instance, GPT-4 demonstrated an accuracy of 85.5% on the Massive Multitask Language Understanding (MMLU) benchmark in English, compared to 83.7% in German [48]. Therefore, it is unlikely that a similar dataset in English would lead to substantially different results. Finally, the performance of guideline-enhanced LLMs in dataset A needs to be interpreted with caution, given that the annotation guideline was derived from ambiguous cases within the same dataset.

In conclusion, our results demonstrate the potential of GPT-4o and Llama-3.3-70B in extracting key imaging findings from stroke CT reports, with GPT-4o consistently exceeding the performance of Llama-3.3-70B. We further provide evidence that well-defined annotation guidelines can enhance LLM data extraction accuracy.

Data availability

The code for running the LLM queries is provided at: https://github.com/shk03/stroke_llm_data_extraction.

Abbreviations

ASPECTS: Alberta Stroke Program Early CT Score

DDx: Differential diagnosis

GPT-4o: Generative pre-trained transformer 4 omni

JSON: JavaScript Object Notation

Llama: Large Language Model Meta AI

LLM: Large language model

References

  1. Mokin M, Ansari SA, McTaggart RA et al (2019) Indications for thrombectomy in acute ischemic stroke from emergent large vessel occlusion (ELVO): report of the SNIS Standards and Guidelines Committee. J Neurointerv Surg 11:215–220. https://doi.org/10.1136/neurintsurg-2018-014640

  2. Li MD, Lang M, Deng F et al (2021) Analysis of stroke detection during the COVID-19 pandemic using natural language processing of radiology reports. AJNR Am J Neuroradiol 42:429–434. https://doi.org/10.3174/AJNR.A6961

  3. Ginsberg MD (2018) The cerebral collateral circulation: relevance to pathophysiology and treatment of stroke. Neuropharmacology 134:280–292. https://doi.org/10.1016/j.neuropharm.2017.08.003

  4. Caruso P, Naccarato M, Furlanis G et al (2018) Wake-up stroke and CT perfusion: effectiveness and safety of reperfusion therapy. Neurol Sci 39:1705–1712. https://doi.org/10.1007/s10072-018-3486-z

  5. Moftakhar P, English JD, Cooke DL et al (2013) Density of thrombus on admission CT predicts revascularization efficacy in large vessel occlusion acute ischemic stroke. Stroke 44:243–245. https://doi.org/10.1161/STROKEAHA.112.674127

  6. Douglas VC, Johnston CM, Elkins J et al (2003) Head computed tomography findings predict short-term stroke risk after transient ischemic attack. Stroke 34:2894–2898. https://doi.org/10.1161/01.STR.0000102900.74360.D9

  7. Cabral Frade H, Wilson SE, Beckwith A, Powers WJ (2022) Comparison of outcomes of ischemic stroke initially imaged with cranial computed tomography alone vs computed tomography plus magnetic resonance imaging. JAMA Netw Open 5:E2219416. https://doi.org/10.1001/jamanetworkopen.2022.19416

  8. Temmen SE, Becks MJ, Schalekamp S et al (2023) Duration and accuracy of automated stroke CT workflow with AI-supported intracranial large vessel occlusion detection. Sci Rep 13:12551. https://doi.org/10.1038/s41598-023-39831-x

  9. Adamou A, Beltsios ET, Bania A et al (2023) Artificial intelligence-driven ASPECTS for the detection of early stroke changes in non-contrast CT: a systematic review and meta-analysis. J Neurointerv Surg 15:E298–E304. https://doi.org/10.1136/jnis-2022-019447

  10. Cadilhac DA, Kim J, Lannin NA et al (2016) National stroke registries for monitoring and improving the quality of hospital care: a systematic review. Int J Stroke 11:28–40. https://doi.org/10.1177/1747493015607523

  11. Fasugba O, Sedani R, Mikulik R et al (2024) How registry data are used to inform activities for stroke care quality improvement across 55 countries: a cross-sectional survey of Registry of Stroke Care Quality (RES-Q) hospitals. Eur J Neurol 31:e16024. https://doi.org/10.1111/ene.16024

  12. Brady AP (2018) Radiology reporting-from Hemingway to HAL? Insights Imaging 9:237–246. https://doi.org/10.1007/s13244-018-0596-3

  13. Cai T, Giannopoulos AA, Yu S et al (2016) Natural language processing technologies in radiology research and clinical applications. Radiographics 36:176–191. https://doi.org/10.1148/rg.2016150080

  14. Hassanpour S, Langlotz CP (2016) Information extraction from multi-institutional radiology reports. Artif Intell Med 66:29–39. https://doi.org/10.1016/j.artmed.2015.09.007

  15. Linna N, Kahn CE (2022) Applications of natural language processing in radiology: a systematic review. Int J Med Inform 163:104779. https://doi.org/10.1016/j.ijmedinf.2022.104779

  16. Ziegelmayer S, Marka AW, Lenhart N et al (2023) Evaluation of GPT-4’s chest x-ray impression generation: a reader study on performance and perception. J Med Internet Res 25:e50865. https://doi.org/10.2196/50865

  17. Tanno R, Barrett DGT, Sellergren A et al (2024) Collaboration between clinicians and vision–language models in radiology report generation. Nat Med 31:599–608. https://doi.org/10.1038/s41591-024-03302-1

  18. Meddeb A, Lüken S, Busch F et al (2024) Large language model ability to translate CT and MRI free-text radiology reports into multiple languages. Radiology 313:e241736. https://doi.org/10.1148/radiol.241736

  19. Kottlors J, Bratke G, Rauen P et al (2023) Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 308:e231167. https://doi.org/10.1148/radiol.231167

  20. Schramm S, Preis S, Metz M-C et al (2024) Impact of multimodal prompt elements on diagnostic performance of GPT-4(V) in challenging brain MRI cases. Radiology 314:e240689. https://doi.org/10.1148/radiol.240689

  21. Fink MA, Bischoff A, Fink CA et al (2023) Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 308:e231362. https://doi.org/10.1148/RADIOL.231362

  22. Lehnen NC, Dorn F, Wiest IC et al (2024) Data extraction from free-text reports on mechanical thrombectomy in acute ischemic stroke using ChatGPT: a retrospective analysis. Radiology 311:e232741. https://doi.org/10.1148/RADIOL.232741

  23. Park HJ, Huh JY, Chae G, Choi MG (2024) Extraction of clinical data on major pulmonary diseases from unstructured radiologic reports using a large language model. PLoS One 19:e0314136. https://doi.org/10.1371/journal.pone.0314136

  24. Le Guellec B, Lefèvre A, Geay C et al (2024) Performance of an open-source large language model in extracting information from free-text radiology reports. Radiol Artif Intell 6:e230364. https://doi.org/10.1148/RYAI.230364

  25. Meddeb A, Ebert P, Bressem KK et al (2024) Evaluating local open-source large language models for data extraction from unstructured reports on mechanical thrombectomy in patients with ischemic stroke. J Neurointerv Surg. https://doi.org/10.1136/jnis-2024-022078

  26. Hu D, Liu B, Zhu X et al (2024) Zero-shot information extraction from radiological reports using ChatGPT. Int J Med Inform 183:105321. https://doi.org/10.1016/j.ijmedinf.2023.105321

  27. Reichenpfader D, Müller H, Denecke K (2024) A scoping review of large language model based approaches for information extraction from radiology reports. NPJ Digit Med 7:222. https://doi.org/10.1038/s41746-024-01219-0

  28. Adams LC, Truhn D, Busch F et al (2023) Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology 307:e230725. https://doi.org/10.1148/RADIOL.230725

  29. Dang H, Mecke L, Lehmann F et al (2022) How to prompt? Opportunities and challenges of zero- and few-shot learning for human-AI interaction in creative applications of generative models. Preprint at https://doi.org/10.48550/arXiv.2209.01390.

  30. Minker J (1982) On indefinite databases and the closed world assumption. In: International conference on automated deduction. Springer Verlag, pp 292–308

  31. OpenAI (2024) OpenAI Platform. https://platform.openai.com/docs/models/gp. Accessed 19 Dec 2024

  32. Hugging Face (2024) meta-llama/Llama-3.3-70B-Instruct. https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct. Accessed 19 Dec 2024

  33. Bender R (2001) Calculating confidence intervals for the number needed to treat. Control Clin Trials 22:102–110. https://doi.org/10.1016/S0197-2456(00)00134-3

  34. Dorfner FJ, Jürgensen L, Donle L et al (2024) Is open-source there yet? A comparative study on commercial and open-source LLMs in their ability to label chest X-ray reports. Radiology 313:e241139. https://doi.org/10.1148/radiol.241139

  35. Wu S, Koo M, Blum L et al (2024) Benchmarking open-source large language models, GPT-4 and Claude 2 on multiple-choice questions in nephrology. NEJM AI 1:AIdbp2300092. https://doi.org/10.1056/aidbp2300092

  36. Shoham OB, Rappoport N (2024) MedConceptsQA: open source medical concepts QA benchmark. Comput Biol Med 182:109089. https://doi.org/10.1016/j.compbiomed.2024.109089

  37. Riedemann L, Labonne M, Gilbert S (2024) The path forward for large language models in medicine is open. NPJ Digit Med 7:339. https://doi.org/10.1038/s41746-024-01344-w

  38. Zhang G, Jin Q, Zhou Y et al (2024) Closing the gap between open source and commercial large language models for medical evidence summarization. NPJ Digit Med 7:239. https://doi.org/10.1038/s41746-024-01239-w

  39. Windisch P, Dennstädt F, Koechli C et al (2024) The impact of temperature on extracting information from clinical trial publications using large language models. Cureus 16:e75748. https://doi.org/10.7759/cureus.75748

  40. Azamfirei R, Kudchadkar SR, Fackler J (2023) Large language models and the perils of their hallucinations. Crit Care 27:1–2. https://doi.org/10.1186/S13054-023-04393-X

  41. Shinagare AB, Lacson R, Boland GW et al (2019) Radiologist preferences, agreement, and variability in phrases used to convey diagnostic certainty in radiology reports. J Am Coll Radiol 16:458–464. https://doi.org/10.1016/j.jacr.2018.09.052

  42. Callen AL, Dupont SM, Price A et al (2020) Between always and never: evaluating uncertainty in radiology reports using natural language processing. J Digit Imaging 33:1194–1201. https://doi.org/10.1007/s10278-020-00379-1

  43. Pexman JHW, Barber PA, Hill MD et al (2001) Use of the Alberta Stroke Program Early CT Score (ASPECTS) for assessing CT scans in patients with acute stroke. AJNR Am J Neuroradiol 22:1534–1542

  44. Bosmans JML, Weyler JJ, De Schepper AM, Parizel PM (2011) The radiology report as seen by radiologists and referring clinicians: results of the COVER and ROVER surveys. Radiology 259:184–195. https://doi.org/10.1148/radiol.10101045

  45. Gilbert M, Crutchfield A, Luo B et al (2024) Using a large language model (LLM) for automated extraction of discrete elements from clinical notes for creation of cancer databases. Int J Radiat Oncol Biol Phys 120:e625. https://doi.org/10.1016/j.ijrobp.2024.07.1375

  46. Al Mohamad F, Donle L, Dorfner F et al (2025) Open-source large language models can generate labels from radiology reports for training convolutional neural networks. Acad Radiol. https://doi.org/10.1016/j.acra.2024.12.028

  47. Kanemaru N, Yasaka K, Fujita N et al (2024) The fine-tuned large language model for extracting the progressive bone metastasis from unstructured radiology reports. J Imaging Inform Med 38:865–872. https://doi.org/10.1007/s10278-024-01242-3

  48. OpenAI, Achiam J, Adler S et al (2023) GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774

Funding

Open Access funding enabled and organized by Projekt DEAL.

Author information

Contributions

SS, ER, CB, and MG curated the dataset. PW and FP provided the user interface for manual data extraction. JW, SS, and SHK analyzed and visualized the data. LCA, KKB, BW, and DH supervised the project. BW, JSK, and CZ provided the computational resources. JW and SHK drafted the manuscript. All authors critically revised the manuscript.

Corresponding author

Correspondence to Su Hwan Kim.

Ethics declarations

Ethics approval and consent to participate

This retrospective study was approved by the Institutional Review Board of the TUM.

Consent for publication

The need for informed consent was waived (2024-125-S-NP, May 29, 2024).

Competing interests

FP is a full-time employee of Smart Reporting GmbH, a provider of radiology reporting software. PW is a consultant for Smart Reporting GmbH.

Additional information

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplemental material

Supplement 1: Parameters excluded from dataset A (translated from German to English).
Supplement 2: Parameters excluded from dataset B (translated from German to English).
Supplement 3: Data extraction performance of GPT-4o (temperature = 1) with and without annotation guideline in dataset A (n = 200). Metrics for GPT-4o were calculated based on the mode across three repetitions.
Supplement 4: Data extraction performance of GPT-4o (temperature = 1) with and without annotation guideline in dataset A (n = 200). Metrics for GPT-4o were calculated based on the mode across three repetitions.
Supplement 5: Data extraction performance of Llama-3.3-70B (temperature = 0) with and without annotation guideline in dataset A (n = 200).
Supplement 6: Data extraction performance of Llama-3.3-70B (temperature = 0) with and without annotation guideline in dataset A (n = 200).
Supplement 7: Data extraction performance of Llama-3.3-70B (temperature = 0) with and without annotation guideline in dataset B (n = 100).
Supplement 8: Data extraction performance of Llama-3.3-70B (temperature = 0) with and without annotation guideline in dataset B (n = 100).
Supplement 9: Data extraction performance of GPT-4o (temperature = 0) with and without annotation guideline in dataset A (n = 200).
Supplement 10: Data extraction performance of GPT-4o (temperature = 0) with and without annotation guideline in dataset A (n = 200).
Supplement 11: Data extraction performance of GPT-4o (temperature = 0) with and without annotation guideline in dataset B (n = 100).
Supplement 12: Data extraction performance of GPT-4o (temperature = 0) with and without annotation guideline in dataset B (n = 100).
Supplement 13: Data extraction performance of GPT-4o (temperature = 1) with and without annotation guideline in dataset B (n = 100).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Wihl, J., Rosenkranz, E., Schramm, S. et al. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines. Eur Radiol Exp 9, 61 (2025). https://doi.org/10.1186/s41747-025-00600-2
