MedSigLIP

MedSigLIP is a variant of SigLIP (Sigmoid Loss for Language Image Pre-training) that is trained to encode medical images and text into a common embedding space. Developers can use MedSigLIP to accelerate the development of healthcare AI applications.

MedSigLIP was trained on a variety of de-identified medical image and text pairs, including chest X-rays, dermatology images, ophthalmology images, histopathology slides, and slices of CT and MRI volumes, along with associated descriptions or reports. MedSigLIP contains a 400M-parameter vision encoder and a 400M-parameter text encoder. It supports 448x448 image resolution with up to 64 text tokens.
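For illustration, here is a minimal sketch of encoding an image and a text prompt into the shared embedding space with Hugging Face Transformers. The checkpoint ID google/medsiglip-448 and the input file name are assumptions; check the model card for the exact identifier.

```python
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "google/medsiglip-448"  # assumed checkpoint ID; see the model card
model = AutoModel.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical local file
texts = ["pleural effusion present"]

# SigLIP-style models expect max-length padding (up to 64 text tokens here).
inputs = processor(text=texts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# L2-normalize so dot products between image and text embeddings are cosine
# similarities (a no-op if the model already returns unit-norm embeddings).
image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
print(image_embeds.shape, text_embeds.shape)  # (num_images, dim), (num_texts, dim)
```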

MedSigLIP is recommended for medical image interpretation applications that do not require text generation, such as data-efficient classification, zero-shot classification, and semantic image retrieval. For medical applications that require text generation, MedGemma is recommended.

Common use cases

The following sections present some common use cases for the model. You're free to pursue any use case, as long as it adheres to the Health AI Developer Foundations terms of use.

Data-efficient medical image classification

MedSigLIP's pre-training makes it a good starting point for adaptation to medical image classification, including radiology, fundus, and skin images.

With a small amount of labeled data, you can train a classifier model on top of MedSigLIP embeddings. Furthermore, the embedding for each image only needs to be generated once and can then be reused as input for a variety of different classifiers with very little additional compute. See our example notebook for how to use MedSigLIP in data-efficient classification.
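As an illustration, here is a minimal linear-probe sketch, assuming image embeddings have already been extracted with MedSigLIP (for example, as in the snippet above) and cached to disk. The file names and label encoding are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical cached data: one MedSigLIP embedding per image plus a binary label.
train_embeddings = np.load("train_embeddings.npy")  # shape (n_train, dim)
train_labels = np.load("train_labels.npy")          # shape (n_train,)
test_embeddings = np.load("test_embeddings.npy")    # shape (n_test, dim)

# Only this small classifier is trained; the MedSigLIP encoder stays frozen.
classifier = LogisticRegression(max_iter=1000)
classifier.fit(train_embeddings, train_labels)

# Probability of the positive class for each held-out image.
scores = classifier.predict_proba(test_embeddings)[:, 1]
```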

MedSigLIP was also pre-trained on digital pathology images, but we still recommend that developers start with Path Foundation for data-efficient classification of pathology images, due to its reduced compute requirements and similar performance.

Zero-shot classification

By using MedSigLIP's text encoder, users can get a classification score through textual prompts, without any additional training data. Zero-shot classification works by measuring the relative distance of the image embedding from a positive text prompt (e.g., "pleural effusion present") and a negative text prompt (e.g., "normal X-ray"). The use cases are the same as for data-efficient classification but don't require training data. The zero-shot method will outperform data-efficient classification at low levels of training data, while data-efficient classification will tend to exceed zero-shot performance with larger amounts of data. See the ELIXR paper for more details.
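Here is a minimal sketch of this scoring, assuming model and processor are loaded as in the earlier snippet; the prompt pair mirrors the example above, and the input file name is hypothetical.

```python
import torch
from PIL import Image

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical local file
prompts = ["pleural effusion present", "normal X-ray"]  # positive, negative

inputs = processor(text=prompts, images=image, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image has shape (num_images, num_texts). A softmax over the
# positive and negative prompt logits gives a relative classification score.
probs = outputs.logits_per_image.softmax(dim=-1)
print(f"P(pleural effusion): {probs[0, 0]:.3f}")
```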

See the getting started Colab notebook for an example of how to use zero-shot classification.

Semantic image retrieval

By using MedSigLIP's text encoder, users can rank a set of medical images against a search query. Similar to zero-shot classification, language-based image retrieval relies on the distance between the embeddings of the images in the set and the text embedding of the search query.
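A minimal retrieval sketch under the same assumptions (model and processor loaded as above, hypothetical file names and query): embed the query once, embed each image once, and sort by cosine similarity.

```python
import torch
from PIL import Image

paths = ["case_001.png", "case_002.png", "case_003.png"]  # hypothetical files
images = [Image.open(p).convert("RGB") for p in paths]
query = "skin lesion with irregular borders"

inputs = processor(text=[query], images=images, padding="max_length", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

image_embeds = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity of each image to the query, highest first.
similarity = (image_embeds @ text_embeds.T).squeeze(-1)
for idx in similarity.argsort(descending=True).tolist():
    print(paths[idx], f"{similarity[idx]:.3f}")
```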

See the getting started Colab notebook for an example of how to score an image against a keyword. To adapt this to search, rank multiple images by their scores against that zero-shot keyword.

Fine-tuning

MedSigLIP can be fine-tuned for improved performance on the tasks it was trained on, or to add new tasks to its repertoire. For an example of how to fine-tune MedSigLIP, see this notebook.
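As a rough illustration, here is a sketch of a contrastive fine-tuning step, assuming model and processor are loaded as above and a dataloader yields hypothetical batches of matched (image, text) pairs. The pairwise sigmoid loss is written out explicitly; it follows the SigLIP formulation but is not taken from the notebook.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()

for images, texts in dataloader:  # hypothetical loader of (PIL images, report strings)
    inputs = processor(text=list(texts), images=list(images),
                       padding="max_length", truncation=True, return_tensors="pt")
    outputs = model(**inputs)

    # SigLIP's pairwise sigmoid loss: matched image-text pairs (the diagonal)
    # get label +1, all mismatched pairs get label -1.
    logits = outputs.logits_per_text                 # shape (batch, batch)
    labels = 2 * torch.eye(logits.size(0)) - 1
    loss = -F.logsigmoid(labels * logits).sum() / logits.size(0)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```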

Next steps