
RayGen: Multi-Modal Dataset Reinforcement for MobileCLIP

For MobileCLIP and MobileCLIP2 models and demos, see the ml-mobileclip repository.

This repository contains the data generation code used in the following papers to improve and augment multi-modal and single-modal datasets. The code can run distributed inference at scale, computing the output of an ML model on a large-scale dataset and storing the results efficiently in a new dataset. It supports processing petabytes of data and billions of samples on 10,000 or more GPUs. It also supports elasticity: processing can start with a minimum number of workers and gradually scale up.

The repository contains RayGen, a package that simplifies data generation using the distributed processing framework Ray.

The training code and models are available in the ml-mobileclip repository linked above.

Installation

For local development we recommend installing a Miniforge-based Conda environment with Python 3.10 as the base environment. This code is developed and tested on an Ubuntu 20.04 x86_64 compute environment.

Create a Conda environment and install the environment dependencies.

conda create -n raygen python=3.10
conda activate raygen

Run the setup script:

bash scripts/setup_gpu_worker.sh  # or setup_cpu_worker.sh for cpu only

Install the raygen Python package into the local environment using:

pip install -e .

Local Development

To run RayGen scripts locally on a CPU or GPU machine, first follow the installation instructions above. At least one GPU is needed for the data generation scripts, but not for dataset conversion.

Additionally, install the "dev" dependencies, which include Ray.

pip install -e ".[dev]"

Once everything is installed, you can then start the local Ray cluster via:

ray start --head --port=6379
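
To confirm that the local cluster is up and see which CPU and GPU resources it exposes, you can use the standard Ray CLI (these commands are part of Ray itself, not this repository):

ray status   # summary of nodes and available CPU/GPU resources
ray stop     # shut the local cluster down when finished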

Next, run the generation code on a toy dataset:

INPUT=s3://some_bucket/some_path/
OUTPUT=s3://some_bucket/some_path/
GPU_COUNT=$(nvidia-smi -L | wc -l)
BATCH_SIZE=256

python3 scripts/gen.py \
    --input $INPUT \
    --output $OUTPUT \
    --batch-size $BATCH_SIZE \
    --min-actors $GPU_COUNT \
    --verbose \
    --local

If a batch size of 256 is too large for the local GPU, you can reduce it. This example was run on a host with 16 CPU cores.

Example Commands

The example script run_capgen.py generates synthetic captions with a CoCa model for a dataset in JSONL format and saves the result as a new JSONL dataset.
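
A possible invocation is sketched below. The flag names are assumptions modeled on the gen.py example above; consult scripts/run_capgen.py --help for the actual interface.

# Hypothetical caption-generation run; flag names are assumptions, not the verified interface.
INPUT=s3://some_bucket/jsonl_dataset/
OUTPUT=s3://some_bucket/jsonl_with_synthetic_captions/

python3 scripts/run_capgen.py \
    --input $INPUT \
    --output $OUTPUT \
    --batch-size 256 \
    --verbose \
    --local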

The next example script, run_embgen.py, generates CLIP embeddings. It computes CLIP image-text embeddings from an ensemble of two CLIP models loaded from HuggingFace. The number of image augmentations is set to 2, a regular expression selects a subset of the synthetic captions for computing synthetic text embeddings, and the image augmentation parameters are also specified.
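
As a rough sketch, such a run might look like the following. The flag names (--models, --num-image-augmentations, --caption-regex, --image-augmentation-scale) and the model identifiers are assumptions introduced for illustration; check scripts/run_embgen.py --help for the real options.

# Hypothetical embedding-generation run with an ensemble of two CLIP models loaded
# from HuggingFace; all flag names and model names below are illustrative assumptions.
python3 scripts/run_embgen.py \
    --input $INPUT \
    --output $OUTPUT \
    --models hf-hub:clip_model_one hf-hub:clip_model_two \
    --num-image-augmentations 2 \
    --caption-regex "syn_cap_.*" \
    --image-augmentation-scale 0.08 1.0 \
    --batch-size 256 \
    --verbose \
    --local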

Supported sharded dataset formats

The code supports datasets as input and output in JSONL format (a collection of .jsonl files with a single .jsonl file as the manifest) as well as the WebDataset format (a collection of .tar files passed as a path pattern). Additional file formats such as .jsonl.zstd, .jsonl.gz, and .tfrecord are also supported as inputs, but not as outputs.

To pass an input dataset, one can pass a path like {bucket}/path/{00000000..00001023}.tar. Alternatively, one may pass an info.json file that contains a list of all the shards in the dataset. The minimal content of such a file is {"lengths": {"shard1.jsonl": 0, "shard2.jsonl": 0, …}}. As the generation code does not use the values of lengths, they can be set to zero. If the output dataset format is JSONL, a new info.json will be written that contains additional information such as the totalLength of the dataset.
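
For example, a minimal manifest for a two-shard JSONL dataset can be written as follows (the file name info.json and the lengths key follow the description above; the zero values are placeholders, since the generation code ignores them):

# Write a minimal manifest listing two JSONL shards; lengths are unused and set to zero.
cat > info.json <<'EOF'
{"lengths": {"shard1.jsonl": 0, "shard2.jsonl": 0}}
EOF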

Checkpointing

The code saves a checkpoint of the generation process and, by default, resumes from a checkpoint if one is available in the path. The checkpoint is a checkpoint.json file containing the input and output paths processed so far, together with the number of processed samples.
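
The exact schema of checkpoint.json is not documented here; the snippet below only illustrates the kind of record described above, and every field name in it is an assumption.

# Illustrative, assumed shape of a checkpoint entry: which input shard was processed,
# where its output was written, and how many samples were handled.
cat checkpoint.json
# {"processed": [{"input": "00000000.jsonl", "output": "00000000.jsonl", "num_samples": 10000}]}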

License

This software and the accompanying data and models have been released under the license terms provided with this repository.

Acknowledgements

Our codebase is built using multiple open-source contributions; please see ACKNOWLEDGEMENTS for more details.

Citation

If you found this code useful, please cite the following papers:

@article{faghri2025mobileclip2,
  title={Mobile{CLIP}2: Improving Multi-Modal Reinforced Training},
  author={Fartash Faghri and Pavan Kumar Anasosalu Vasu and Cem Koc and Vaishaal Shankar and Alexander T Toshev and Oncel Tuzel and Hadi Pouransari},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2025},
  url={https://openreview.net/forum?id=WeF9zolng8},
  note={Featured Certification}
}


@InProceedings{vasu2024mobileclip,
  author = {Vasu, Pavan Kumar Anasosalu and Pouransari, Hadi and Faghri, Fartash and Vemulapalli, Raviteja and Tuzel, Oncel},
  title = {MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2024},
}


@InProceedings{faghri2023reinforce,
    author    = {Faghri, Fartash and Pouransari, Hadi and Mehta, Sachin and Farajtabar, Mehrdad and Farhadi, Ali and Rastegari, Mohammad and Tuzel, Oncel},
    title     = {Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
}
