This repository contains the official implementation of our paper "Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation".
We propose Nyx, a unified mixed-modal retriever tailored for universal RAG (URAG) scenarios, and construct NyxQA, a large-scale mixed-modal QA dataset. Our framework includes:
- A four-stage automated pipeline for generating realistic multimodal QA pairs.
- A two-stage training framework combining pre-training on NyxQA and supervised fine-tuning with VLM feedback.
- Strong performance on both text-only RAG benchmarks and vision-language URAG tasks.
We recommend using Conda for package management:

```bash
conda create -n nyx python=3.11
conda activate nyx
pip install -r requirements.txt
```
Our implementation uses `torch==2.4.0`, `faiss-cpu==1.8.0`, and `transformers==4.52.2`. Please note that `faiss-cpu` and `transformers` might have `numpy` version conflicts. We prefer keeping `numpy` at version `1.26.4` (the version compatible with `faiss-cpu`), so you may need to uninstall any newer `numpy` versions.
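If a newer `numpy` has already been pulled in by another package, one way to pin it back to the compatible version is the following (a minimal sketch; adjust to your environment):

```bash
# Replace any newer numpy with the version that works with faiss-cpu==1.8.0
pip uninstall -y numpy
pip install numpy==1.26.4
```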
Suggested installation order: PyTorch → faiss-cpu → transformers → accelerate → deepspeed
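For reference, that order might look like the commands below. Only the versions listed above are pinned; the `accelerate` and `deepspeed` versions are left unpinned here, so treat this as a sketch rather than the exact commands we used:

```bash
# Install in the suggested order to reduce dependency conflicts
pip install torch==2.4.0
pip install faiss-cpu==1.8.0
pip install transformers==4.52.2
pip install accelerate
pip install deepspeed
```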
The core implementation of this project is built upon VLM2Vec. We extend our sincere gratitude to the original authors for their foundational work.
We also want to acknowledge and thank the developers of these essential tools that made our work possible:
- vLLM for efficient LLM inference
- FlashAttention for optimized attention computation
- DeepSpeed for distributed training acceleration
Our work stands on the shoulders of these remarkable open-source projects and the generous research community.
Finally, the logo at the top of this README is adapted from the character Nyx in the game Hades by Supergiant Games.