Introduction

Despite the rapid development of Artificial Intelligence (AI) models, a discernible gap remains in the realm of medical data processing. Historically, AI models have predominantly focused on individual data modalities—either visual or linguistic. This approach starkly contrasts with the intrinsically multimodal practice of physicians, who rely on a confluence of imaging studies and textual electronic medical data for informed decision-making. By understanding diverse data types and their interrelationships, multimodal AI would facilitate more accurate diagnoses, personalized treatment development, and a reduction in medical errors by providing a comprehensive view of patient data. For example, in the field of radiation oncology, which is one of the clinical fields well suited to evaluating the potential of multimodal AI applications and the main focus of this article, the integration of multiple modalities holds great importance1.

For modern intensity-modulated radiation therapy and its inverse planning, two critical components are needed: organs-at-risk (OARs) and the target volume where the dose is prescribed. OARs are defined as the radiosensitive organs susceptible to damage by ionizing radiation during radiation therapy. Traditionally, they were either manually delineated by human experts or automatically contoured using atlas-based autocontouring algorithms. However, with the advent of deep learning-based AI models, such tasks have been efficiently accomplished2,3. Therefore, these OARs can be contoured “as they appear” in the planning computed tomography (CT) images.

However, in contrast to OAR segmentation, target volume delineation remains crucial for treatment planning: the target volume must also be contoured on the planning CT images, but it often requires consideration of clinical information beyond the visual features, and its delineation has traditionally been the responsibility of experienced radiation oncologists. This task is perceived as more challenging due to its intrinsic need for the integration of multimodal knowledge. Although a multitude of segmentation models have been proposed and explored to enhance the precision and efficacy of this task over the last few years4,5,6, a conspicuous gap in research persists, particularly regarding multimodal target delineation3.

This is because the delineation of the radiation therapy target extends beyond the mere consideration of visual elements, such as the gross tumor volume (GTV)7, and necessitates the incorporation of a myriad of factors, including tumor stage, histological diagnosis, the extent of metastasis, and gene mutations. These factors critically influence the potential for occult metastases, which may compromise the survival outcome of a patient. Areas at elevated risk for such metastatic growth are often treated electively, necessitating clinical consideration that is deeply rooted in a comprehensive understanding of various data modalities. Furthermore, additional factors, such as a patient’s performance status and age, which collectively contribute to the general condition, also exert an impact on treatment target delineation. Given the imperative nature of considering information beyond imaging in target volume delineation, the application of a multimodal approach is not merely beneficial but essential for the tasks of radiation oncology8. This is particularly substantiated by the necessity to incorporate textual clinical data, which can significantly influence the identification and subsequent treatment of regions susceptible to occult metastases.

Recently, large language models (LLMs)—AI models proficient in processing and generating text, code, and other data types—have witnessed remarkable advancements9,10,11. Trained on extensive datasets of text and code, these models discern relationships among varied data types and generate new data, adhering to learned patterns. Furthermore, multimodal data such as images, signals, etc., can be easily integrated into LLMs through adapters and generative models for vision understanding and generation, respectively. Consequently, these models have demonstrated promise in a myriad of medical tasks, including multimodal medical report generation, medical question answering, and multimodal segmentation with medical images like chest X-rays12,13,14,15.

Inspired by the multimodal integration capability of LLMs and the need for multimodal information in tumor target delineation, here we present a 3-dimensional (3D) multimodal clinical target volume (CTV) delineation model, LLMSeg, which integrates clinical information through an LLM to condition a segmentation model. Specifically, by leveraging the textual knowledge of a well-trained LLM through simple prompt tuning, our cross-attention-based segmentation model adeptly integrates text-based clinical information into the target volume contouring task. More specifically, as illustrated in Fig. 1a, we introduce an interactive alignment framework that uses both self-attention and cross-attention mechanisms in a bidirectional manner (text-to-image and image-to-text features), following the concept of promptable segmentation from the Segment Anything Model (SAM)16. To further improve the quality of the features, we apply this interactive alignment between all the skip-connected image encoder features and the LLM feature. These layer-wise multimodal features are then combined to jointly predict the target labels through the multimodal decoder. In this way, we ensure that the image encoder efficiently extracts meaningful text-related representations and vice versa. Finally, to transfer the LLM’s knowledge while keeping its parameters frozen and to achieve superior performance in various downstream tasks17,18,19, we adopt the idea of light-weight learnable text prompts to fully leverage the strong linguistic capability of the LLM within the proposed multimodal AI framework.

Fig. 1: Overview of our proposed LLMSeg.

a Illustration comparing the concept between the traditional vision-only AI and the multimodal AI in the context of radiotherapy target volume delineation. b Quantitative comparison of CTV contouring performance in the Dice metric. The Dice metric for each trial is presented with whiskers representing the range from minimum to maximum values. The center line indicates the median, the bounds of the box represent the interquartile range (from the lower quartile to the upper quartile), and the x mark indicates the mean. n denotes the number of patients. The p values indicate the statistically significant superiority of the proposed multimodal LLMSeg. All statistical tests were two-sided. c Visual assessment of each concept. Source data are provided as a Source Data file.

In this work, we apply LLMSeg to the breast cancer target volume delineation task to evaluate its context-aware radiotherapy target delineation performance compared to a unimodal AI. Additionally, we expand its application to prostate cancer cases. By utilizing a well-curated, large-scale dataset from three institutions for development and external validation, we verify its capability to integrate pivotal clinical information, such as tumor stage, surgery type, and laterality. Experimental results confirm that the model not only demonstrates significantly enhanced target contouring performance compared to existing unimodal segmentation models but also contours targets in accordance with the provided clinical information. Notably, the model exhibits superior performance enhancement on an external dataset and shows stable performance gains in data-insufficient settings, demonstrating generalizability and data efficiency that are not only apt for the characteristics of medical domain data but also align well with the perspective of clinical experts.

Results

Accurate and robust CTV delineation performance of multimodal model

Figure 1b presents a comparative analysis between the vision-only model and our proposed multimodal model for CTV delineation in breast cancer patients for all the validation sets. For internal validation, both methods showed promising performance above 0.8 in the Dice metric, with a substantial improvement observed in ours. However, in the two external settings, the vision-only model showed drastic performance drops, reaching 0.73 and 0.44 in the Dice metric, respectively. Specifically, in the case of external set #2, where the manufacturer of the acquisition modality differs from that of the internal and external set #1, the vision-only model completely failed to perform CTV delineation. Despite encountering visually shifted data distributions, our multimodal model demonstrated notable stability by consistently maintaining performance across all experimental conditions.

We qualitatively compare the two approaches in Fig. 1c. In general, CTVs for breast cancer radiation therapy can be categorized into two primary types: one that involves treatment of the breast or chest wall alone, and the other that electively treats the regional lymph nodal area (including the axillary, supraclavicular, and internal mammary lymph nodes (LNs)) in addition to the aforementioned areas, given the frequent metastasis of breast cancer to these regions. On the left side of Fig. 1c, despite the ground truth label placing the CTV on both the breast and regional LNs, the vision-only model contours the breast alone. Moreover, as the vision-only model lacks information about the laterality of the breast diagnosed with cancer, partial segmentation masks are observed on the opposite breast. In contrast, the multimodal model accurately contours the breast and regional LNs that need to be treated as the CTV. On the right side of Fig. 1c, despite the early breast cancer case requiring treatment of the breast only, the vision-only model incorrectly includes the regional LNs in the CTV. Moreover, the CTV is extended to the opposite breast. On the other hand, the multimodal model that integrates the clinical information accurately contours only the requisite treatment area, aligning with the ground truth.

We further compared our method with other diverse vision-only and multimodal methods in Table 1. Our proposed context-aware segmentation differs from traditional vision-language segmentation20,21 in that the given textual information is not explicitly visible as an actual object in the input image. Therefore, we adapted publicly available 2D text-driven multimodal segmentation frameworks from various segmentation categories as our baseline models22,23,24. Furthermore, we conducted comparisons with two advanced visual backbones25,26 to justify our selection of the 3D residual U-Net as the visual backbone. As shown in Table 1, HIPIE22 and LISA23, considered SOTA models for 2D referring and reasoning segmentation, respectively, showed suboptimal performance in 3D context-aware segmentation. On the other hand, ConTEXTualNet24, capable of handling 3D images as inputs, showed promising performance. Nevertheless, our approach demonstrated SOTA performance across all evaluation metrics in various validation settings.

Table 1 Comparison of 3D CTV delineation performance for breast cancer patients

Performance evaluation by expert reveals superiority of multimodal model

The assessment of the target volume should not be based on mere metric evaluations such as the Dice, but rather on appropriate clinical rationale. In the context of breast contouring, this involves considerations such as whether the target volume has been contoured on the treated side of the breast, whether the contouring has been performed on the breast or the chest wall depending on the type of surgery (breast-conserving surgery (BCS) or mastectomy), and whether the regional LNs have been included. Therefore, the appropriateness of target contouring should be evaluated by a board-certified radiation oncologist, ensuring a clinically relevant perspective in the assessment. To this end, five rubrics (laterality, surgery type, volume definition, coverage, integrity) were suggested by the board-certified radiation oncologists to objectively and specifically evaluate the target volume, with differentiated scoring reflecting their importance. Detailed descriptions of these rubrics are available in Supplementary Table 1 with Supplementary Fig. 1.

When evaluated using the proposed rubrics as indicated in Table 2, the multimodal model exhibited superior performance, achieving total scores up to twice as high as those of the vision-only model. Importantly, the model exhibited notably larger gains in rubrics like laterality and volume definition, where incorporation of the clinical context is crucial to achieve accurate results, than in metrics indicative of contouring quality, such as coverage and integrity. This performance gain was particularly pronounced in the external validation, notably in external set #2, where differences in the image acquisition setting were noted. This demonstrates the multimodal model’s robustness and clinical relevance across varied datasets and potential diverse clinical scenarios.

Table 2 Expert evaluation of CTV delineation performance for breast cancer patients

Data efficiency and robustness of the multimodal model

During the training process of clinical specialists, learning is expedited when textual clinical information is integrated alongside imaging studies, as opposed to focusing on target volume in images alone. This approach facilitates a more rapid assimilation of tendencies and principles of target volume contouring, enabling effective learning even with fewer cases. We sought to determine whether this efficiency of learning through the integration of textual clinical information could be applied to our multimodal approach.

We observed the performance of each concept in target volume contouring by progressively reducing the size of the training dataset. As illustrated in Fig. 2a, our multimodal model demonstrated its data efficiency by maintaining stable performance above 0.8 in the Dice metric even with only 40% of the data available. This starkly contrasts with the vision-only model, whose performance dropped from an initial Dice of 0.8 to 0.7. When utilizing only 20% of the training dataset, the multimodal model’s performance decreased slightly below 0.8 in the Dice metric, while the vision-only model completely failed to contour the CTV in this limited dataset scenario. This performance gap was particularly evident in the external validation results. For external validation #1, the initial discrepancy between the two models was ~0.1 in the Dice metric. However, as the size of the training dataset decreased, the discrepancy doubled. For external validation #2, notable overfitting issues were observed in the vision-only model. On the contrary, our multimodal model achieved robust performance even when trained with a reduced dataset of less than 40%. Qualitative analysis, as depicted in Fig. 2b, also supports these results. Detailed quantitative results for all metrics are further provided in Supplementary Table 2.

Fig. 2: Comparison of target contouring performance based on varying training dataset sizes.

a Quantitative comparison for all the validation sets. The Dice metric for each trial is presented as mean values (center lines) with 95th percentile of confidence intervals calculated with the non-parametric bootstrap method (shaded areas). n denotes the number of patients. b Visual comparison for external validation #1. Source data are provided as a Source Data file.

Differential target contouring based on varied textual inputs

To validate the hypothesis that our multimodal model genuinely performs CTV delineation based on textual clinical information, we conducted an experiment to assess whether altering the textual clinical information alone would yield different delineation results, even for the same CT, as illustrated in Fig. 3a.

Fig. 3: Analysis of clinical data alignment for target contouring.

a Illustration of modification of the input clinical data, given the same CT scan. Red font indicates modified input text. b, c Visual assessment of radiotherapy target contouring with modified input clinical data.

As depicted in Fig. 3b, c, the model contoured different targets for the same CT, contingent on the provided clinical data. In Fig. 3b, for a patient with left breast cancer at stage T1N0M0, upstaging the T stage or N stage led to the inclusion of regional LNs, and altering the tumor’s laterality from left to right resulted in contouring on the opposite side. Interestingly, when the type of surgery was changed from BCS to total mastectomy, the previously spared skin was no longer spared, and the target volume was expanded to include the chest wall. For another patient with right breast cancer at stage T2N1M0, as exemplified in Fig. 3c, downstaging the N stage led to the omission of regional LNs from the designated target volume, and changing the type of surgery to BCS resulted in a strategic shift to sparing the skin and excluding the pectoralis muscle from the treatment volume. These qualitative results align precisely with the decision policy of radiation oncologists, and substantiate that our model contours the target volume by strongly referencing the textual clinical information as well as the imaging features.

Exploring textual clinical information provision methods in the multimodal model

To demonstrate the necessity of the LLM as our textual clinical information provision method, we conducted an ablation study by replacing our textual module with a simple numeric category method and with a CLIP text encoder trained on a relatively smaller textual dataset than the LLM27. As indicated in Table 3a, the numeric category method, which represents each piece of clinical information as a categorized number, exhibited promising performance and showed relatively marginal performance drops relative to our method in the internal validation setting. However, in the two external validations, the performance gaps increased to up to 0.1 in the Dice metric and considerably more in the HD-95 metric, by up to 10 cm. Moreover, when replacing the textual module with the CLIP ViT-B/16 while maintaining our proposed multiple text prompt tuning method, large performance gaps were observed compared to our method, of up to 0.3 in the Dice metric and up to 10 cm in the HD-95 metric. These findings indicate that the effectiveness of the proposed multimodal model originates from leveraging the LLM.

Table 3 Ablation studies on network components

Specifically, the numeric category method exhibited the second-most promising performance and showed relatively marginal performance drops relative to our method in the internal validation setting. However, in the two external validation settings, the performance gaps increased. Hence, we qualitatively examined the source of the performance gap in Fig. 4a. In Case #1, where a patient underwent total mastectomy for T2N1M0 cancer in the left breast, our method accurately contoured the surgically treated breast with an implant, including the regional nodal area in the target volume. However, the numeric category method generated segmentation masks for both breasts, with more mask generation observed on the opposite breast. Similarly, in Case #2, where a patient underwent breast-conserving surgery for T2N1M0 cancer in the left breast, our method accurately included the breast and regional nodes in the target volume while sparing the skin and chest wall. In contrast, the numeric category method only included the breast area in the target volume, excluding the regional nodes, and included parts of the skin and chest wall similar to the mastectomy case. Moreover, it partially generated segmentation masks on the opposite breast, demonstrating incomplete reflection of the clinical context.

Fig. 4: Qualitative comparison of different multimodal methods with omitted clinical data components.

a Comparison with numeric category method: Case 1 (left breast, T2N1M0, post-mastectomy) and Case 2 (left breast, T2N1M0, post-breast conservation surgery) show our method (LLMSeg) accurately includes surgically treated areas and regional nodes, while the numeric category method inaccurately segments both breasts, missing clinical context. b Omission experiment for tumor information: For right breast T1aN0M0 cancer, our method segments accurately without omission. Omitting T stage, N stage, or laterality causes incorrect regional node inclusion or opposite breast contours. The competing method is inaccurate regardless of omission. c Omission experiment for surgery information: In left breast T1cN1M0 cancer post-mastectomy, our method without surgery information mimics breast-conserving surgery. The competing method inaccurately contours the opposite breast irrespective of surgery information.

We further ablated our method of introducing clinical data by replacing it with various alternative methodologies. These include utilizing single or multiple text prompts through prompt tuning, low-rank adaptation (LoRA) fine-tuning28, and directly employing a pre-trained LLM without tuning. As indicated in Table 3b, our proposed text prompt tuning method consistently outperformed those using LoRA fine-tuning and the no-tuning strategy. Moreover, employing multiple learnable text prompts showed improved performance compared to using a single text prompt. These results indicate that the introduced learnable text prompts were optimized to efficiently fine-tune the LLM for the target volume contouring task.

Ablation study of input clinical data components

We further conducted an ablation study by omitting each piece of input clinical information and compared the differences between a competing method (Numeric Category) and our method (LLMSeg) in Fig. 4b, c. Firstly, without omission, as shown in Fig. 4b, our method accurately segmented only the right breast area as the target volume for a patient with T1aN0M0 cancer who underwent BCS. However, when the information on the T stage was removed, the model included some regional nodes in the target range, similar to cases with higher stages. This trend was similarly observed when omitting the N stage information, where the model included regional nodes as in cases with nodal metastasis such as N1 or N2. Likewise, without information about laterality, the model inaccurately contoured the opposite breast. On the contrary, the competing model showed inaccurate results, such as contouring the opposite breast, even without omission. Moreover, regardless of the presence or absence of omission, there was either little change in target contouring (e.g., laterality), or target contouring changed in patterns unrelated to the omitted information (e.g., contouring on the opposite side when omitting T stage or N stage information). These results indicate that the competing model, which receives the clinical context in a simpler manner, failed to effectively incorporate such information and performed CTV delineation unrelated to the provided information. Similarly, in another case of T1cN1M0 breast cancer in the left breast treated with total mastectomy, as shown in Fig. 4c, when surgery information was not provided, our method misidentified the surgery type and produced segmentation results resembling BCS, sparing the skin and chest wall. In contrast, the competing model instead contoured the opposite breast, which was irrelevant to the surgical method.

In Table 4, we further assessed these ablation results quantitatively. For our method, the exclusion of information regarding laterality, which influences the decision on which breast to contour, resulted in the most significant decrease in performance. This was followed by similar degrees of performance decrease upon excluding information related to surgery type and N stage, which impact the inclusion of the skin, chest wall, and the regional nodes. Although excluding T stage information did result in a decrease in performance, it was the least significant, which is rational considering the minimal impact of T stage information on target volume delineation.

Table 4 Ablation of input clinical data components for two different multimodal methods

Overall, these comparative results suggest that our model considers the clinical context provided in text and is hindered in accurate target volume delineation if any component is missing. That is, excluding any one component results in lower performance compared to using all available information, indicating that every component contributes to the model’s performance.

Exploring other cancer types

We further evaluated the proposed multimodal target volume contouring for prostate cancer patients. For prostate cancer, clinical data were directly curated from the electronic medical record (EMR), as detailed in Supplementary Table 3. This curated EMR data, along with each patient’s age, were then summarized as input clinical data. Similar to the breast cancer study, we observed the superiority of our multimodal approach over the vision-only approach, with a notable performance gain of up to 0.05 in the Dice metric across all the validation settings, as shown in Table 5.

Table 5 Comparison of 3D CTV delineation performance for prostate cancer patients

Similar to breast cancer, an expert evaluation was conducted for prostate cancer. The rubric-based analysis of the expert evaluation in Table 6 clearly showed the effectiveness of our method. In particular, these benefits became unequivocally evident in the external validation setting, showing more than double the difference in total scores. The rubrics necessitating in-depth reference to clinical information for precise scoring—notably, the delineation of the primary site (assessing prostate volume coverage, including the seminal vesicle) and the volume definition (evaluating regional node irradiation appropriateness)—exhibited significantly larger differences compared to the vision-only model. Details on the rubrics used for prostate cancer can be found in Supplementary Fig. 2 and Supplementary Table 4.

Table 6 Expert evaluation of CTV delineation performance for prostate cancer patients

Discussion

Despite the promising outcomes demonstrated by AI models in various studies, a notable limitation prevalent in the field of medical AI has been the predominant development of models tailored for singular, specialized tasks29. For instance, models have been specifically designed and trained to excel in a single task, such as segmentation4,6, diagnosis30,31, or prognosis prediction32,33, without the adaptability to transition across various tasks. While these specialized models perform commendably within their designated task, they lack the flexibility to navigate the complex challenges of the medical domain, where the ability to integrate and concurrently process diverse tasks is crucial.

In the nascent stages of applying vision-language models to the medical domain, initial research endeavors predominantly focused on the simplest form of vision-text paired data, such as chest radiographs34. These studies have explored various tasks, including zero-shot classification35, report generation36,37, and text-guided segmentation15,24. However, the field of radiation oncology emerges as a particularly potent application area for such models8. Radiation oncology exemplifies a robust case for the adoption of multimodality, underpinned by two fundamental factors1. Firstly, decision-making in radiation oncology, especially in determining treatment scope and dose, extends beyond imaging to include a plethora of clinical information, such as surgical notes, pathology reports, and electronic medical records, which can be conveyed textually. Secondly, the integration of prior knowledge, including standard treatment guidelines and radiation oncology textbooks, is vital for informed treatment decision-making, with these guidelines also being expressible in textual formats. Consequently, the necessity for multimodality is markedly emphasized in radiation oncology (see Supplementary Fig. 3).

Consequently, we have applied LLMs in our research. Our model introduces several aspects of substantial clinical value and has demonstrated commendable results by accurately segmenting the radiation therapy target volume based on clinical information, with the multimodal model surpassing the vision-only model in absolute performance. It also exhibits a pronounced performance differential in external validation settings and demonstrates data efficiency in data-insufficient settings. This resonates intriguingly with the clinical implications, especially mirroring the learning trajectory and characteristics of clinical experts. In the clinical training of experts, reliance is placed on multimodal information; learning is not confined to either images or text but is rather a confluence of both, facilitating the inference of text-image relationships and enabling effective learning even with relatively fewer cases. This data-efficient aspect of the clinical learning paradigm aligns seamlessly with our proposed multimodal model.

The decrease in generalization performance of classical AI-driven delineation is often attributed to variations in image acquisition settings and the characteristics of devices from different vendors, among other factors. Nonetheless, the ability of clinical experts to perform target contouring is scarcely influenced by external factors such as CT scanning conditions. This is because the linguistic concepts embodied in textual clinical information are independent of such acquisition settings. Therefore, it is plausible that our model, which learns in conjunction with such textual clinical information by leveraging the strong linguistic capability of LLMs, demonstrates particularly commendable performance in external validation settings. This characteristic is particularly well suited to the medical domain, where training data are often limited and stable generalization performance is a prerequisite across varied external settings, thereby heralding a promising future for the application of multimodal models in medical AI.

Furthermore, we have demonstrated the necessity of incorporating clinical information into target volume contouring, particularly in cases such as breast cancer where the GTV may not be clearly visible in the planning CT image. This necessity is highlighted through diverse and comprehensive qualitative comparisons. In Fig. 1c, where the inclusion of clinical context is crucial for both cases, the multimodal target contouring reflects comprehensive consideration of the clinical context. This necessity becomes even more evident where the absence of clinical context in the vision-only model results in clear failure cases. Additionally, in the detailed rubric comparison between the vision-only model and our multimodal model presented in Tables 2 and 6, the largest gains are observed in metrics that can be achieved through clinical considerations, such as laterality and volume definition. These results further emphasize the value of our multimodal approach.

Our study has several limitations. First, our evaluation is confined to patients at their initial diagnosis, leaving scope for further exploration into varied patient scenarios and treatment stages, which can potentially influence the model’s applicability and performance. Second, the model does not incorporate considerations for radiation therapy doses in target volume contouring, presenting an opportunity to explore how dose-related variables could be integrated to enhance delineation and treatment planning in future studies. Third, while the model utilizes refined, rather than raw, clinical data, future research can explore mechanisms for automating the data refinement process or further develop capabilities to process raw clinical data, thereby reducing the need for manual intervention and potentially uncovering additional insights from unstructured clinical reports. Fourth, although our research scope covers both breast and prostate cancers to confirm applicability to various cancer types, these cancer types are categorized as having relatively standardized target volumes. This suggests the necessity for further validation of our method’s generalizability across a wider range of cancer types, which demand more challenging and intricate clinical considerations for accurate target volume delineation. Fifth, in our work, we focus on CTV contouring to clearly demonstrate the advantages of our multimodal model. However, GTV delineation, which involves contouring visually apparent areas, is crucial in clinical practice due to its importance in boost techniques for increased dose administration in many cancer types.

Additionally, in cancer types where the target volume is primarily determined based on the GTV, such as lung cancer38, the benefits of integrating clinical information through our method may be relatively limited. It is therefore necessary to validate whether our method still offers utility in such cancer types, where the emphasis is on the GTV for target volume definition, and future studies should expand to encompass GTV contouring, thereby improving clinical utility. Last, but not least, the black-box nature of AI may hinder clinicians’ direct utilization. Therefore, our proposed model should provide explainable results, such as a confidence map, in clinical practice, as shown in Supplementary Fig. 4. These visual clues enable clinicians to interpret the model output by referencing the level of confidence for each segment of the contour.

Despite the aforementioned limitations, our research serves as a pivotal step towards multimodal models in the field of radiation oncology, verifying their clinical utility and emphasizing the significance of intertwining textual clinical data with medical imaging. The model proposes a pathway for crafting more adaptable and clinically pertinent AI models in medical imaging and treatment planning. Future research should refine and broaden such models, moving closer to harnessing the full potential of the multimodal framework in elevating clinical decision-making and patient care.

Methods

Ethics committee approval

The hospital data deliberately collected for this study were ethically approved by the Institutional Review Boards of the Department of Radiation Oncology at Yonsei Cancer Center, the Department of Radiation Oncology at Yongin Severance Hospital, and the Department of Radiation Oncology at Gangnam Severance Hospital (approval numbers 4-2023-0179, 9-2023-0161, and 3-2023-0396, respectively). The requirement for informed consent was waived due to the retrospective nature of the study.

Schematic comparison of the workflows of radiology and radiation oncology

Supplementary Fig. 3 delineates the clinical workflows in radiology and radiation oncology. In radiology, while the patient’s history, previous diagnoses, past treatments, and previous imaging results are comprehensively considered, the most crucial element remains the findings visible in the current images, so the workflow relies heavily on the visual information of the current imaging study. Conversely, in radiation oncology, determining the treatment target volume and prescribing doses necessitates a more comprehensive consideration of the patient’s history, pre- and post-operative imaging results, surgical pathology findings, laboratory results, and other clinical information, resulting in relatively less reliance on the current simulation CT images.

Additionally, the integration of prior knowledge, including standard treatment guidelines and radiation oncology textbooks, is crucial for informed treatment decision-making and can also be expressed in textual formats. Therefore, the significance of a multimodal approach is notably enhanced in radiation oncology compared to radiology.

Definition of task

In radiation oncology, the treatment target volumes are categorized into the GTV, CTV, and Planning Target Volume (PTV). The GTV corresponds to the visible tumor and aligns with traditional segmentation’s objective of delineating visible image portions. The CTV, while occasionally derived directly from the GTV in the presence of a gross tumor, often also includes regions prone to microscopic disease. This necessitates the incorporation of diverse clinical factors, such as tumor type, histological findings, cancer stage (TNM classification), patient age, and performance status in specific cases. The PTV further expands upon the CTV to include margins that account for uncertainties in patient setup and positioning. Consequently, achieving accurate target volume delineation in radiation oncology goes beyond the scope of traditional segmentation tasks, necessitating the incorporation of various clinical contexts as well as the structures visible on the CT scan.

Taking breast cancer as an example, in early-stage cases (e.g., stage I) where there is no regional LNs metastasis, often only the whole breast is included in the radiation therapy target volume. On the other hand, in advanced stages (e.g., stage IIIB), where regional LN metastasis is identified during surgery, there is often a need for elective nodal irradiation across all regional nodal areas. However, such distinctions are not discernible during the CT simulation for post-operative radiation therapy planning and require acquisition through other forms of information. Consequently, we aimed to develop a model that can consider clinical information such as primary tumor type, stage, age, and performance status in a manner akin to an experienced radiation oncologist by providing such data in the form of textual information to a multimodal model.

Among the primary cancer types, we initially targeted breast cancer. This was predicated on the fact that breast cancer presents with relatively uniform guidelines for target delineation according to clinical information including primary tumor location, size, and the presence of nodal metastasis. Furthermore, the inter-observer variability in target delineation for breast cancer is expected to be small compared with other cancer types. Within the task of radiation therapy target delineation for breast cancer, we exclusively incorporated cases of patients at their initial diagnosis of breast cancer. This decision was based on the understanding that treatments with aims such as salvage or palliation often exhibit significant variability according to the preferences of the physicians and the patients, as well as other circumstances.

Details of clinical target volume

For breast cancer, the CTV for early breast cancer (Tis-T2) without nodal metastasis at initial diagnosis is limited to the whole breast. For those with nodal metastasis or in cases of locally advanced breast cancer (T3-4), as well as T2 cases with adverse features without proper axillary dissection, regional node irradiation was primarily considered. The delineation of regional nodes, especially the level of inclusion for the supraclavicular lymph node, is defined according to the Radiation Therapy Oncology Group guidelines for cases identified with N2 or more nodal metastasis, and by the European Society for Radiotherapy and Oncology guidelines for instances with N1 or less nodal involvement.

For prostate cancer, the definition of the CTV involved a more complex consideration of factors. In the presence of pelvic LN metastasis, regional node irradiation was performed in conjunction with prostate bed radiation. The decision to perform elective nodal irradiation on the pelvic LNs was based on the National Comprehensive Cancer Network risk groups, taking into account a combination of factors such as T stage, Prostate-Specific Antigen (PSA) levels, and Gleason score, particularly for those classified within the very-high- and high-risk groups. However, in individuals aged 80 and over, consideration of age led to the omission of pelvic LN irradiation. In cases where pathologic or imaging findings confirmed seminal vesicle invasion, contouring was performed to include the prostate and extend to the seminal vesicles within the CTV.

Details of datasets

For model development and internal validation, we acquired data from 981 patients treated at the Department of Radiation Oncology at Yonsei Cancer Center between September 2021 and October 2023. These patients had been initially diagnosed with breast cancer and underwent radiation therapy after curative surgery with the primary objective of preventing recurrence. To better reflect real clinical application, the ideal approach for external validation requires patient data acquired under different conditions and with equipment from a different vendor. Therefore, we utilized data from 206 patients treated at the Department of Radiation Oncology at Yongin Severance Hospital. We further utilized data from 204 patients treated at the Department of Radiation Oncology at Gangnam Severance Hospital. We confirmed that the external cohorts did not overlap with those included in model development or internal validation.

Supplementary Table 5 presents the characteristics of the breast cancer patients for each dataset. Across the train, internal, and external validation sets, the distributions of factors such as location and T stage were observed to be consistent. The proportion of patients with LN metastasis and those undergoing total mastectomy was higher in the train and internal validation sets than in the external validation sets. Furthermore, due to the more advanced stages of disease, the proportion of patients who underwent neoadjuvant chemotherapy prior to surgery was higher in the train and internal validation sets compared to the external validation sets. Consequently, the percentage of patients receiving irradiation to the chest wall and regional LNs was also higher in the train and internal validation sets compared to the external validation sets. When compared to the training and internal validation sets, external set #1 exhibited similar imaging equipment and conditions. However, external set #2 presented differences in image acquisition conditions, such as vendor, filter type, and slice thickness.

For evaluating the proposed method for other cancer types, we further acquired data from 943 prostate cancer patients from Yonsei Cancer Center and 141 prostate cancer patients from Yongin Severance Hospital. We confirmed that the external cohort did not overlap with those included in model development or internal validation. Supplementary Table 6 presents the characteristics of the prostate cancer patients for each dataset. In terms of the distribution of T and N stages, as well as Gleason scores, the training, internal validation, and external validation sets demonstrated a relatively uniform distribution. However, the initial PSA levels were found to be higher in the training and internal validation sets compared to the external validation set. Additionally, the proportion of individuals undergoing prostatectomy was also higher in the training and internal validation sets, which consequently led to a higher percentage of patients receiving radiotherapy with a definitive aim in the external validation set, while those in the training and internal validation sets were more likely to receive radiotherapy with a salvage aim. There were no significant differences in image acquisition settings across the datasets.

We not only utilized patients’ simulation CT images and CTVs for radiation therapy, but also incorporated text-based clinical information that is essential for precise target delineation. This additional information included the location of the primary cancer, type of surgery undertaken, disease stage, and the status of nodal metastasis. For breast cancer, the input clinical data were prepared in a tabular format derived from the raw clinical data, as shown in Supplementary Table 3a. The resulting clinical context was then curated using custom criteria. Initially, these criteria were devised by a board-certified radiation oncologist. Subsequent refinement was achieved through ablation studies on the components to construct the most effective clinical information, and the resulting examples of input texts are illustrated in the right-most column.

In contrast to breast cancer, for which the clinical data were curated by clinicians into a tabular structure, for prostate cancer we directly curated the input clinical information from EMR data by utilizing a 10-shot in-context learning strategy with a pre-trained LLM, as shown in Supplementary Table 3b. The curated EMR data and each patient’s age were then summarized as the input clinical data in the right-most column. In future studies, a similar in-context learning approach could be applied to the breast cancer study to achieve an automated framework.
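For illustration, a minimal sketch of how such a 10-shot in-context prompt could be assembled is given below; the exemplar texts, function names, and prompt wording are hypothetical and do not reproduce the prompts actually used in this study.

```python
# Hypothetical sketch of few-shot prompt construction for EMR curation.
# The exemplar pairs and wording are placeholders, not the study's actual prompts.
EXEMPLARS = [
    ("Raw EMR excerpt 1 ...", "Curated clinical summary 1 ..."),
    ("Raw EMR excerpt 2 ...", "Curated clinical summary 2 ..."),
    # ... in the study, 10 clinician-prepared (EMR, summary) pairs would be listed here.
]

def build_incontext_prompt(raw_emr: str) -> str:
    """Concatenate the exemplar pairs and the new record into a single LLM prompt."""
    shots = "\n\n".join(f"EMR:\n{e}\nSummary:\n{s}" for e, s in EXEMPLARS)
    return f"{shots}\n\nEMR:\n{raw_emr}\nSummary:\n"

# The completion returned by a pre-trained LLM for this prompt, together with the
# patient's age, would form the input clinical text for the prostate cancer experiments.
print(build_incontext_prompt("Prostate cancer, s/p prostatectomy, initial PSA 12.3, GS 4+3 ..."))
```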

Details of implementation

The schematic of our multimodal AI is illustrated in Fig. 1. For the image encoder/decoder and the LLM, we employed the 3D Residual U-Net39 and the pre-trained Llama2-7B-chat10 model, respectively. For the interactive alignment modules, we utilized the two-way transformer modules of SAM16. The detailed multimodal AI framework is further illustrated in Supplementary Fig. 5. We introduce three key components: (a) text prompt tuning, (b) multimodal interactive alignment, and (c) CTV delineation.

(a) Text prompt tuning

To efficiently fine-tune the LLM, we introduce N text prompts \({{{\mathcal{V}}}}=\{{v}^{n}{| }_{n=1}^{N}\}\) as illustrated in Supplementary Fig. 5a, where each \({v}^{n}\in {{\mathbb{R}}}^{M\times D}\) consists of M vectors with dimension D, which is the same as the embedding dimension of the LLM. These learnable vectors are randomly initialized and then consistently prepended to each of the tokenized clinical data, denoted as [TEXT] tokens. We additionally append a token, denoted as [SEG], which is intended to attend to all the aforementioned vectors and tokens. Here, the final prompted text input t can be formulated as follows:

$$t=\{{v}_{1}^{n},{v}_{2}^{n},...,{v}_{M}^{n},[\,{\mbox{TEXT}}\,],[\,{\mbox{SEG}}\,]\}.$$
(1)

Then, using the prompted text input t, the frozen LLM produces the context embeddings \(g\in {{\mathbb{R}}}^{N\times D}\) as the output embeddings corresponding to the inputted [SEG] token.
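As a concrete illustration of this step, a minimal PyTorch sketch is provided below, assuming a HuggingFace-style frozen LLM interface; the class and variable names, as well as the treatment of the [SEG] token as a learnable embedding, are our assumptions for illustration rather than the released implementation.

```python
import torch
import torch.nn as nn

class TextPromptTuner(nn.Module):
    """Illustrative sketch of text prompt tuning: N learnable prompts, each of M vectors
    with the LLM embedding dimension D, are prepended to the embedded clinical text
    ([TEXT]); a [SEG] token is appended and its last hidden state yields g (N x D)."""

    def __init__(self, frozen_llm, n_prompts=4, m_vectors=8, dim=4096):
        super().__init__()
        self.llm = frozen_llm.eval()                      # e.g., Llama2-7B-chat, kept frozen
        for p in self.llm.parameters():
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(n_prompts, m_vectors, dim) * 0.02)
        self.seg_token = nn.Parameter(torch.randn(1, 1, dim) * 0.02)  # assumed learnable here

    def forward(self, text_embeds: torch.Tensor) -> torch.Tensor:
        # text_embeds: (1, T, D), the tokenized clinical data embedded by the LLM's embedding layer
        g = []
        for v in self.prompts:                            # loop over the N text prompts
            seq = torch.cat([v.unsqueeze(0), text_embeds, self.seg_token], dim=1)
            out = self.llm(inputs_embeds=seq, output_hidden_states=True)
            g.append(out.hidden_states[-1][:, -1])        # hidden state at the [SEG] position
        return torch.stack(g, dim=1)                      # context embeddings g: (1, N, D)
```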

(b) Multimodal interactive alignment

To align the context embeddings g with the image embeddings \({f}_{l}\in {{\mathbb{R}}}^{{H}_{l}{W}_{l}{S}_{l}\times {C}_{l}}\), where fl is the lth layer output of the 3D image encoder, Hl, Wl, and Sl correspond to the height, width, and slice dimensions of the image embeddings, and Cl is the intermediate channel dimension of each lth layer output, we first project g to have the identical dimension to that of each fl through a layer-wise linear layer. As illustrated in Supplementary Fig. 5b, the linearly projected context embeddings \({\bar{g}}_{l}\) are then self-attended and cross-attended with the image embeddings fl to produce the context-aligned image embeddings \({{f}_{l}}^{\!\!*}\). Detailed specifications of each lth layer embedding and the interactive alignment module are listed in Supplementary Table 7.
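The following self-contained PyTorch sketch illustrates one such layer-wise alignment block using standard multi-head attention; it conveys the bidirectional text-to-image and image-to-text attention but is not the exact SAM two-way transformer module used in the paper.

```python
import torch
import torch.nn as nn

class InteractiveAlignment(nn.Module):
    """Illustrative bidirectional alignment of context embeddings g with image tokens f_l."""

    def __init__(self, llm_dim: int, img_dim: int, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(llm_dim, img_dim)                      # layer-wise projection of g
        self.self_attn = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)
        self.txt2img = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(img_dim, n_heads, batch_first=True)

    def forward(self, g: torch.Tensor, f_l: torch.Tensor) -> torch.Tensor:
        # g: (B, N, D) context embeddings; f_l: (B, H*W*S, C_l) flattened image embeddings
        q = self.proj(g)                                             # projected context tokens
        q = q + self.self_attn(q, q, q)[0]                           # self-attention on text tokens
        q = q + self.txt2img(q, f_l, f_l)[0]                         # text queries attend to image
        f_star = f_l + self.img2txt(f_l, q, q)[0]                    # image tokens attend to text
        return f_star                                                # context-aligned embeddings f_l*

# toy usage with random tensors
align = InteractiveAlignment(llm_dim=4096, img_dim=64)
f_star = align(torch.randn(1, 4, 4096), torch.randn(1, 24 * 24 * 8, 64))
```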

(c) CTV delineation

After the multimodal interactive alignment, the context-aligned image embeddings \({{f}_{l}}^{ \!\!*}\) become inputs for the 3D image decoder. As illustrated in Supplementary Fig. 5c, for the final predicted output \(\hat{y}\), we calculate the combination of the cross-entropy (CE) loss and the Dice coefficient (Dice) loss as follows:

$$\begin{array}{rcl}&&{\min }_{{{{\mathcal{D}}}},{{{\mathcal{V}}}}}{{{\mathcal{L}}}}={\lambda }_{{{{\rm{ce}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{ce}}}}}(\hat{y},y)+{\lambda }_{{{{\rm{dice}}}}}{{{{\mathcal{L}}}}}_{{{{\rm{dice}}}}}(\hat{y},y),\\ &&\,{\mbox{where}}\,\,{{{\mathcal{L}}}}(\hat{y},y)=-{{\mathbb{E}}}_{x \sim {P}_{X}}\left[{y}_{i}\log p({\hat{y}}_{i})\right],\end{array}$$
(2)

where \({{{\mathcal{D}}}}\) denotes our proposed LLMSeg, \({{{\mathcal{V}}}}\) denotes the multiple text prompts, and λce and λdice are hyper-parameters for the CE loss and the Dice loss, respectively. \(y\in {{\mathbb{R}}}^{B\times HWS}\) is the 3D ground-truth CTV mask, where B denotes the batch size, and H, W, and S correspond to the height, width, and slice dimensions of the ground-truth CTV mask. \(p({\hat{y}}_{i})\) denotes the softmax probability of the ith pixel within the final predicted output \(\hat{y}\in {{\mathbb{R}}}^{B\times HWS}\), which is defined as:

$$\hat{y}={{{\mathcal{D}}}}(x,t)$$
(3)

where \(x\in {{\mathbb{R}}}^{B\times HWS}\) is the input 3D CT scan and t is the prompted clinical data corresponding to the input CT scan x with the text prompts \({{{\mathcal{V}}}}\).
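A minimal re-implementation of this objective for binary CTV masks is sketched below; the tensor shapes and the smoothing constant are assumptions, and λce = λdice = 1.0 follows the training details given later.

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits: torch.Tensor, target: torch.Tensor,
                 lambda_ce: float = 1.0, lambda_dice: float = 1.0,
                 eps: float = 1e-5) -> torch.Tensor:
    """Binary cross-entropy + Dice loss for a predicted CTV mask (illustrative sketch).

    logits: raw network output of shape (B, 1, H, W, S); target: binary mask of the same shape.
    """
    prob = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target)         # cross-entropy term
    inter = (prob * target).sum(dim=(1, 2, 3, 4))
    union = prob.sum(dim=(1, 2, 3, 4)) + target.sum(dim=(1, 2, 3, 4))
    dice = 1.0 - (2.0 * inter + eps) / (union + eps)                 # soft Dice term
    return lambda_ce * ce + lambda_dice * dice.mean()

# toy usage with random tensors
loss = dice_ce_loss(torch.randn(2, 1, 32, 32, 16),
                    torch.randint(0, 2, (2, 1, 32, 32, 16)).float())
```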

Details of network training

When pre-processing the data, all the chest CT images and CTVs were initially re-sampled to an identical voxel spacing of 1.0 × 1.0 × 3.0 mm3. The image intensity values were truncated between −1000 and 1000 Hounsfield units and linearly normalized to a range between 0 and 1.0. When training the network, a 3D patch with a size of 384 × 384 × 128 pixels was randomly cropped to cover the entire breast, along with its paired clinical data, with a batch size of 2. When evaluating the trained network, the entire 3D CT image was tested using sliding windows with a 3D patch size of 384 × 384 × 128 pixels. We set the optimal hyper-parameters as listed in Supplementary Table 8. During training, we kept the entire LLM frozen, while the image encoder/decoder modules, the interactive alignment modules, their corresponding linear layers, and the text prompts were trainable parameters.
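For reference, the pre-processing described above can be expressed as a MONAI transform pipeline such as the sketch below; the dictionary keys and the particular transform classes are our assumptions, not the released training code.

```python
from monai.transforms import (
    Compose, LoadImaged, EnsureChannelFirstd, Spacingd,
    ScaleIntensityRanged, RandSpatialCropd,
)

# Illustrative pre-processing pipeline following the description above.
train_transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
    # re-sample CT and CTV mask to 1.0 x 1.0 x 3.0 mm voxel spacing
    Spacingd(keys=["image", "label"], pixdim=(1.0, 1.0, 3.0), mode=("bilinear", "nearest")),
    # truncate HU values to [-1000, 1000] and linearly normalize to [0, 1]
    ScaleIntensityRanged(keys=["image"], a_min=-1000, a_max=1000, b_min=0.0, b_max=1.0, clip=True),
    # randomly crop a 3D patch of 384 x 384 x 128 pixels
    RandSpatialCropd(keys=["image", "label"], roi_size=(384, 384, 128), random_size=False),
])
```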

As the loss function, we computed both the binary CE loss and the Dice loss, with a weight value of 1.0 for each loss. The network parameters were optimized using the AdamW40 optimizer with a learning rate of 0.0001 for 100 training epochs. We implemented the network using the open-source library MONAI. All the experiments were conducted with PyTorch41 in Python using CUDA 11.4 on an NVIDIA RTX A6000 48 GB GPU. We further describe the backbones for each model and compare the training complexity in Supplementary Table 9.
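The optimizer setup and the sliding-window evaluation can be sketched as follows; the stand-in model and dummy volume are placeholders for LLMSeg and a pre-processed CT scan, so this is an illustrative configuration rather than the actual training script.

```python
import torch
import torch.nn as nn
from monai.inferers import sliding_window_inference

model = nn.Conv3d(1, 1, kernel_size=3, padding=1)                # placeholder standing in for LLMSeg
trainable = [p for p in model.parameters() if p.requires_grad]   # LLM parameters stay frozen
optimizer = torch.optim.AdamW(trainable, lr=1e-4)                # AdamW, learning rate 0.0001

# At evaluation, the whole CT volume is processed with a 384 x 384 x 128 sliding window.
ct_volume = torch.zeros(1, 1, 384, 384, 160)                     # dummy pre-processed CT scan
with torch.no_grad():
    logits = sliding_window_inference(
        inputs=ct_volume, roi_size=(384, 384, 128), sw_batch_size=1, predictor=model,
    )
```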

Rationales of selecting baseline models

Our baselines, ConTEXTualNet24, LISA23, and HIPIE22, along with our proposed model, LLMSeg, are designed to extract characteristics from an input sentence that are not explicitly visible in the image, as categorized in Supplementary Table 10. For example, tasks may include identifying the food item richest in Vitamin C from an image and generating a segmentation mask, or recognizing medical conditions and treatment plans (like cT2, N1mi, breast-conserving surgery, and left-side procedures). These tasks necessitate a deep understanding of the sentence context and the ability to infer answers for context-aware or reasoning/referring-based segmentation. ConTEXTualNet, LISA, and HIPIE, like our model, leverage text embeddings derived from a language model to facilitate multimodal segmentation.

Additionally, for a meaningful comparative study, it is crucial to retrain the baseline models with our 3D CT training data. ConTEXTualNet, being a CNN-based network designed for end-to-end training, allows us to adapt the original 2D model into a 3D model suitable for retraining with our 3D data. On the other hand, recent SOTA multimodal foundation models for segmentation, such as LISA42 and HIPIE22, utilize 2D SAM16 or CLIP27-based cross-attention modules. Adapting these models to process 3D volumes as a whole would require retraining the 2D foundation model with 3D data, which is not feasible given our constraints. Consequently, to preserve their transfer learning mechanism based on the frozen 2D foundation model, we retrain these models by converting 3D CT scans to 2D slices as inputs. This highlights a limitation of current 2D vision-language models when adapting to 3D images, resulting in the loss of volumetric context for clinical information-guided multimodal segmentation and yielding suboptimal performance.

The reason for not including traditional open-vocabulary segmentation models in our study is that they are designed for semantic segmentation of visually discernible objects in an image, such as walls, chairs, windows, floors, and ceilings, as depicted in Supplementary Table 10. This capability stems from their use of pre-trained 2D vision-language foundation models, which serve as their frozen backbones for feature extraction. These models leverage pre-aligned word-image features for semantic segmentation; thus, they are not appropriate baselines for our medical context-aware segmentation purposes, as the radiotherapy target volumes in CT images are not visually identifiable.

Details of evaluation

To quantitatively evaluate the CTV delineation performance, we calculated Dice coefficient (Dice), Intersection over Union (IoU), and the 95th percentile of Hausdorff Distance (95-HD)43 to measure spatial distances between the ground-truth and the predicted contours. When calculating the 95-HD, all the measured distances in the pixel unit are converted with respect to the original pixel resolution, and the results are expressed in centimeters (cm).
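As a reference implementation of these metrics, a short sketch is given below; the Dice/IoU functions are a straightforward re-implementation, and the 95-HD call uses MONAI's compute_hausdorff_distance on one-hot tensors (the conversion of pixel distances to centimeters described above is omitted here).

```python
import numpy as np
import torch
from monai.metrics import compute_hausdorff_distance

def dice_iou(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    """Dice and IoU for binary 3D masks (illustrative re-implementation)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    dice = 2.0 * inter / (pred.sum() + gt.sum() + eps)
    iou = inter / (np.logical_or(pred, gt).sum() + eps)
    return dice, iou

# toy example with synthetic masks
pred = np.zeros((64, 64, 32), dtype=np.uint8); pred[10:30, 10:30, 5:20] = 1
gt = np.zeros_like(pred); gt[12:32, 10:30, 5:20] = 1
print(dice_iou(pred, gt))

# 95th-percentile Hausdorff distance (in pixel units) via MONAI on (B, C, H, W, S) tensors
pred_t = torch.from_numpy(pred)[None, None].float()
gt_t = torch.from_numpy(gt)[None, None].float()
hd95 = compute_hausdorff_distance(y_pred=pred_t, y=gt_t, include_background=True, percentile=95)
```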

Details of clinical evaluation

To accurately assess the performance of the model, we conducted clinical evaluations by the board-certified radiation oncologist with over 5 years of experience. To provide a more detailed evaluation of the model’s performance and establish an objective criterion for assessment, we employed rubrics proposed by the radiation oncologists. For breast cancer, these rubrics included laterality (right, left, or bilateral—1 point), type of surgery (whether the case was post-BCS or mastectomy—1 point), volume definition (accurate definition of breast or chest wall, inclusion of regional LNs—1.5 points), coverage (ensuring the target volume was adequately covered without encompassing unnecessary areas), and integrity (absence of incomplete or distorted segmentation output), constituting a total of 5 points. Detailed criteria for each rubric and illustrative examples are provided in Supplementary Fig. 1 and Supplementary Table 1.

For prostate cancer, the criteria included primary site (accuracy in defining the treatment scope for the prostate, including seminal vesicles), volume definition (appropriate inclusion of the prostate and regional nodes), coverage, and integrity, totaling 4 points. The rubrics of laterality, surgery type, volume definition, and primary site were established to assess the appropriateness of the underlying concepts in defining the scope of the target area. Conversely, the criteria for coverage and integrity were specifically designed to evaluate the quality of the contouring. Detailed criteria for each rubric and illustrative examples are provided in Supplementary Fig. 2 and Supplementary Table 4.

Utilizing these evaluation criteria, to ensure fairness, the same board-certified radiation oncologists conducted assessments of the segmentation outputs by comparing them to the ground truth and considering the clinical context, all while being blinded to whether the outputs were generated by a vision-only model or a multimodal model.

Statistics and reproducibility

For statistical analysis, we used the non-parametric bootstrap method to calculate the confidence interval (CI) for each metric. We randomly resampled, with replacement, datasets of the same size as the original dataset 1000 times. Then, the mean values and the 95% CIs were estimated from the relative frequency distribution of the bootstrap trials. A two-tailed Student’s paired t-test was used for the statistical comparison between the two groups. No statistical method was used to predetermine sample size. No data were excluded from the analyses; the experiments were not randomized; the investigator was not blinded to allocation during experiments and outcome assessment.
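A minimal sketch of the bootstrap procedure is shown below, assuming per-patient metric values as input; the variable names are illustrative.

```python
import numpy as np

def bootstrap_ci(per_patient_scores, n_boot: int = 1000, alpha: float = 0.05, seed: int = 0):
    """Non-parametric bootstrap of the mean metric with a 95% confidence interval."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_patient_scores, dtype=float)
    means = np.array([
        rng.choice(scores, size=scores.size, replace=True).mean()  # resample with replacement
        for _ in range(n_boot)
    ])
    lower, upper = np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lower, upper)

# toy example: Dice scores for a hypothetical validation cohort of 100 patients
print(bootstrap_ci(np.random.default_rng(1).uniform(0.7, 0.9, size=100)))
```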

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.