Fine-tune Mistral AI's multimodal Pixtral-12B model on custom vision-language instruction datasets using LoRA adapters and the Hugging Face 🤗 ecosystem.
- 🧠 Lightweight LoRA tuning (~3% trainable params; see the sketch below)
- 🎯 Supports multimodal JSON with `[IMG]` token injection
- 📦 Self-contained `train.py` script powered by 🤗 PEFT + `Trainer`
- 🚀 Compatible with Flash-Attn 2 for faster training (optional)
- 🧩 Easily pluggable into the Hugging Face Hub
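The adapter setup is handled by 🤗 PEFT. As a rough sketch of how LoRA gets attached (the rank, alpha, and target modules below are illustrative assumptions, not necessarily the values used in `scripts/train.py`):

```python
# Minimal LoRA setup sketch; rank, alpha, and target modules are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("mistral-community/pixtral-12b")

lora_config = LoraConfig(
    r=16,                                 # adapter rank (assumed)
    lora_alpha=32,                        # LoRA scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the trainable-parameter fraction
```

Because only the small adapter matrices are trained while the 12B base stays frozen, the trainable fraction lands in the low single digits, which is where the ~3% figure comes from.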
You can install all dependencies using the provided `environment.yml` file (recommended for conda users):

```bash
# Step 1: Create conda environment from YAML
conda env create -f environment.yml

# Step 2: Activate the environment
conda activate pixtral-ft
```
If you prefer pip, use `requirements.txt` instead:

```bash
pip install -r requirements.txt
```
Each file is a JSON list of conversations, and every message can contain `text` and/or `image` parts.
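The exact schema is defined by `scripts/train.py`; as a sketch (the field names here are illustrative assumptions), an entry might look like:

```json
[
  {
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "image", "image": "data/images/cat.jpg" },
          { "type": "text", "text": "What animal is shown here?" }
        ]
      },
      {
        "role": "assistant",
        "content": [
          { "type": "text", "text": "A tabby cat sitting on a windowsill." }
        ]
      }
    ]
  }
]
```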
Place your JSON files under `data/` and point `--train_json` / `--eval_json` to them.
```bash
# Clone the repo
git clone https://github.com/<your-handle>/pixtral-finetune.git
cd pixtral-finetune

# Set up the environment (full spec)
conda env create -f environment.yml
conda activate pixtral-ft
# └─ or: pip install -r requirements.txt  # minimal spec

# Launch training
python scripts/train.py \
  --model_id mistral-community/pixtral-12b \
  --train_json data/train.json \
  --eval_json data/val.json \
  --output_dir out/pixtral-ft
```
Here are the most commonly used arguments in `train.py`:
| Argument | Description | Example |
|---|---|---|
| `--model_id` | Base model to fine-tune | `mistral-community/pixtral-12b` |
| `--train_json` | Path to training dataset JSON | `data/train.json` |
| `--eval_json` | Path to validation dataset JSON | `data/val.json` |
| `--output_dir` | Where to save checkpoints and adapters | `out/pixtral-ft` |
| `--epochs` | Number of training epochs | `3` |
| `--lr` | Learning rate | `3e-5` |
| `--batch_size` | Per-device batch size | `3` |
| `--gradient_accumulation_steps` | Steps to accumulate gradients (useful for small VRAM) | `4` |
| `--flash_attn` | Enable Flash-Attn 2 for faster attention (if available) | (flag only, no value needed) |
| `--push_to_hub` | Push the final model to the Hugging Face Hub | (flag only) |
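For instance, a memory-constrained single-GPU run might combine several of these flags; the values shown are illustrative, not recommended settings:

```bash
python scripts/train.py \
  --model_id mistral-community/pixtral-12b \
  --train_json data/train.json \
  --eval_json data/val.json \
  --output_dir out/pixtral-ft \
  --batch_size 1 \
  --gradient_accumulation_steps 8 \
  --flash_attn
```

Here the effective batch size is `batch_size × gradient_accumulation_steps = 8`, so gradient accumulation recovers a larger batch without the extra VRAM cost.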
To see the full list of arguments at any time:

```bash
python scripts/train.py --help
```
A runnable demo lives in `examples/inference.py`:

```bash
python examples/inference.py \
  --adapter_path out/pixtral-ft \
  --image demo.jpg \
  --prompt "Describe this image."
```
| Issue | Hint |
|---|---|
| CUDA out of memory | Lower `--batch_size`, increase `--gradient_accumulation_steps`, or enable Flash-Attn 2 |
| Image token error | Ensure images are RGB and ≤ 4096 px on the long side |
| Sequence too long | Shorten prompts or raise `--max_seq_len` |
PRs & issues welcome! 🎉 Please follow these steps to contribute:
- Fork the repository.
- Create a new branch (
git checkout -b feature-branch
). - Make your changes.
- Commit your changes (
git commit -m 'Add some feature'
). - Push to the branch (
git push origin feature-branch
). - Open a pull request.
Created by Ojasva Goyal. Feel free to reach out at ojasvagoyal9@gmail.com with any questions or feedback.