For MobileCLIP and MobileCLIP2 models and demos, see the ml-mobileclip repository.
This repository contains the data generation code used in the following papers to improve and augment multi-modal and single-modal datasets. The code runs distributed inference at scale, computing the output of an ML model on a large-scale dataset and storing the results efficiently in a new dataset. It supports processing petabytes of data and billions of samples on 10,000 GPUs and more, and it supports elasticity: processing can start with a minimum number of workers and gradually scale up.
- MobileCLIP2: Improving Multi-Modal Reinforced Training. (TMLR August 2025 Featured) Fartash Faghri, Pavan Kumar Anasosalu Vasu, Cem Koc, Vaishaal Shankar, Alexander Toshev, Oncel Tuzel, Hadi Pouransari.
- MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training. (CVPR 2024) Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.
- Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement. (ICCV 2023) Fartash Faghri, Hadi Pouransari, Sachin Mehta, Mehrdad Farajtabar, Ali Farhadi, Mohammad Rastegari, Oncel Tuzel.
The repository contains the RayGen package, which simplifies data generation using the distributed processing framework Ray.
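As a rough illustration of the pattern RayGen builds on, the following minimal sketch uses plain Ray actors to process batches in parallel. This is generic Ray, not the RayGen API; the worker class and batch contents are placeholders.

```python
# Generic Ray actor-pool sketch, illustrative only (not the RayGen API).
import ray

ray.init()  # connects to a running cluster, or starts a local one

@ray.remote  # add num_gpus=1 to reserve one GPU per actor
class BatchWorker:
    def process(self, batch):
        # A real worker would run model inference on the batch here.
        return [len(sample) for sample in batch]

workers = [BatchWorker.remote() for _ in range(4)]
batches = [["a", "bb"], ["ccc"], ["dddd", "e"], ["ff", "ggg"]]
# Distribute batches over the workers round-robin and gather the results.
futures = [workers[i % len(workers)].process.remote(b) for i, b in enumerate(batches)]
print(ray.get(futures))
```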
The training code and models are accessible in the following repositories:
- Dataset Reinforcement (ml-dr): Single modality training on reinforced image classification datasets.
- MobileCLIP (ml-mobileclip): Multi-modal reinforced training on image-text datasets.
For local development, we recommend installing a Miniforge-based Conda environment with Python 3.10 as the base environment. This code is developed and tested on an Ubuntu 20.04 x86_64 compute environment.
Create and activate a Conda environment:
conda create -n raygen python=3.10
conda activate raygen
Run the setup script:
bash scripts/setup_gpu_worker.sh # or setup_cpu_worker.sh for CPU-only
Install the raygen Python package into the local environment using:
pip install -e .
To run RayGen scripts locally on a CPU or GPU machine, first follow the installation instructions above. At least one GPU is needed for the data generation scripts, but not for dataset conversion.
Additionally, you will want to install the "dev" dependencies, which include Ray.
pip install -e ".[dev]"
Once everything is installed, you can start a local Ray cluster via:
ray start --head --port=6379
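You can verify the cluster is up and see its available resources with:
ray status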
Next, run the generation code on a toy dataset:
INPUT=s3://some_bucket/some_path/
OUTPUT=s3://some_bucket/some_path/
GPU_COUNT=$(nvidia-smi -L | wc -l)
BATCH_SIZE=256
python3 scripts/gen.py \
--input $INPUT \
--output $OUTPUT \
--batch-size $BATCH_SIZE \
--min-actors $GPU_COUNT \
--verbose \
--local
If a batch size of 256 is too large for the local GPU, you can reduce it. This example was run on a host with 16 CPU cores.
The example script run_capgen.py generates synthetic captions with a CoCa model for a dataset in JSONL format and saves the result as a new dataset in JSONL format.
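The core operation resembles the following sketch, which uses open_clip's CoCa support to caption a single image; the specific model, checkpoint tag, and file names here are illustrative assumptions, not necessarily what run_capgen.py uses.

```python
# Illustrative CoCa captioning sketch with open_clip; the model name,
# checkpoint tag, and image path are assumptions for demonstration.
import open_clip
import torch
from PIL import Image

model, _, transform = open_clip.create_model_and_transforms(
    "coca_ViT-L-14", pretrained="mscoco_finetuned_laion2b_s13b_b90k"
)
model.eval()

image = transform(Image.open("example.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    generated = model.generate(image)
caption = open_clip.decode(generated[0])
caption = caption.split("<end_of_text>")[0].replace("<start_of_text>", "").strip()
print(caption)
```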
The next example script, run_embgen.py, generates CLIP embeddings. We generate CLIP image-text embeddings from an ensemble of two CLIP models loaded from HuggingFace. The number of image augmentations is set to 2, and we specify a regular expression to select a subset of the synthetic captions for computing synthetic text embeddings. The image augmentation parameters are also specified.
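A rough sketch of the underlying computation is shown below; the HuggingFace model ids, input files, and single deterministic transform are illustrative assumptions rather than the exact run_embgen.py configuration.

```python
# Illustrative ensemble CLIP embedding sketch with open_clip; model ids
# and inputs are assumptions for demonstration.
import open_clip
import torch
from PIL import Image

model_ids = [
    "hf-hub:laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
    "hf-hub:laion/CLIP-ViT-L-14-laion2B-s32B-b82K",
]
image = Image.open("example.jpg").convert("RGB")
caption = "a photo of a dog"

image_embs, text_embs = [], []
for model_id in model_ids:
    model, _, transform = open_clip.create_model_and_transforms(model_id)
    tokenizer = open_clip.get_tokenizer(model_id)
    model.eval()
    with torch.no_grad():
        # The real pipeline encodes multiple augmented crops per image;
        # a single deterministic transform is used here for brevity.
        img = model.encode_image(transform(image).unsqueeze(0))
        txt = model.encode_text(tokenizer([caption]))
    image_embs.append(img / img.norm(dim=-1, keepdim=True))
    text_embs.append(txt / txt.norm(dim=-1, keepdim=True))
```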
The code supports datasets as input and output in JSONL format (a collection of .jsonl files with a single .jsonl file as the manifest) as well as the WebDataset format (a collection of .tar files passed with a regular expression). Additional file formats such as .jsonl.zstd, .jsonl.gz, and .tfrecord are also supported as inputs but not as outputs.
To pass an input dataset, one can pass a path like {bucket}/path/{00000000..00001023}.tar. Alternatively, one may pass an info.json file that contains a list of all the shards in the dataset. The minimal content of such a file is {"lengths": {"shard1.jsonl": 0, "shard2.jsonl": 0, …}}. As the generation code does not utilize the value of lengths, the lengths can be set to zero. If the output dataset format is JSONL, a new info.json will be written that contains additional information such as the totalLength of the dataset.
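For example, a minimal hand-written manifest (with placeholder shard names) could look like:

```json
{
  "lengths": {
    "shard1.jsonl": 0,
    "shard2.jsonl": 0
  }
}
```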
The code saves a checkpoint of the generation process and by default resumes from a checkpoint if one is available in the path. The checkpoint is a checkpoint.json file containing information about the input and output paths processed so far, together with the number of processed samples.
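As a purely illustrative sketch, such a checkpoint might record something like the following; the field names below are hypothetical, so inspect an actual checkpoint.json for the real schema.

```json
{
  "processed_inputs": ["shard1.jsonl", "shard2.jsonl"],
  "processed_outputs": ["shard1.jsonl", "shard2.jsonl"],
  "num_samples": 20000
}
```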
This software and accompanying data and models have been released under the following licenses:
- Code: Apple Sample Code License (ASCL)
- ML models: Apple ML Research Model TOU
- Data: CC-BY-NC-ND Deed
Our codebase is built using multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.
If you found this code useful, please cite the following papers:
@article{faghri2025mobileclip2,
title={Mobile{CLIP}2: Improving Multi-Modal Reinforced Training},
author={Fartash Faghri and Pavan Kumar Anasosalu Vasu and Cem Koc and Vaishaal Shankar and Alexander T Toshev and Oncel Tuzel and Hadi Pouransari},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2025},
url={https://openreview.net/forum?id=WeF9zolng8},
note={Featured Certification}
}
@InProceedings{vasu2024mobileclip,
author = {Vasu, Pavan Kumar Anasosalu and Pouransari, Hadi and Faghri, Fartash and Vemulapalli, Raviteja and Tuzel, Oncel},
title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2024},
}
@InProceedings{faghri2023reinforce,
author = {Faghri, Fartash and Pouransari, Hadi and Mehta, Sachin and Farajtabar, Mehrdad and Farhadi, Ali and Rastegari, Mohammad and Tuzel, Oncel},
title = {Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2023},
}