We present LARP, a novel video tokenizer inherently aligned with autoregressive (AR) generative models. LARP employs a holistic tokenization approach, leveraging learned queries to capture global and semantic representations. Moreover, it integrates a lightweight AR transformer during training, structuring the latent space so that it is optimized for both video reconstruction and autoregressive generation. LARP outperforms prior state-of-the-art methods on standard video generation benchmarks.
- Install PyTorch 2.4.0 and torchvision 0.19.0:

  ```bash
  pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124
  ```
- Install the remaining dependencies:

  ```bash
  pip install -r requirements.txt
  ```
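After installation, an optional sanity check confirms that the CUDA build of PyTorch was picked up (`torch.cuda.is_available()` is standard PyTorch API):

```bash
# Should print the torch version and "True" on a CUDA machine with the cu124 wheel.
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```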
Set up the datasets using `set_datasets.sh`, which prepares the UCF101 and Kinetics-600 datasets. Download the datasets you want to use and set their paths in the script; it then creates the necessary symbolic links so that the code can find the data.

```bash
bash set_datasets.sh
```

After setting up the datasets, verify that all paths in the CSV files located in `data/metadata` are accessible.
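One way to run that check is a short shell loop. This is a sketch only: it assumes the video path is stored in the first CSV column and that each file starts with a header row — adjust the column index to the actual metadata layout:

```bash
# Report any video path listed in the metadata CSVs that does not resolve.
# Assumes column 1 holds the path and row 1 is a header (adjust if needed).
for csv in data/metadata/*.csv; do
  tail -n +2 "$csv" | cut -d, -f1 | while read -r p; do
    [ -e "$p" ] || echo "missing: $p (listed in $csv)"
  done
done
```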
We provide pretrained models for the LARP tokenizer, the LARP AR model, and the LARP AR frame prediction model.
| Model | #params | FVD | 🤗 HuggingFace |
|---|---|---|---|
| LARP-L-Long-tokenizer | 173M | 20 (recon.) | link |
| LARP-L-Long-AR | 632M | 57 (gen.) | link |
| LARP-L-Long-AR-FP | 632M | 5.1 (FP) | link |
Please refer to the sampling and evaluation section for details on how to use these models.
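The checkpoints can also be fetched ahead of time with the standard `huggingface-cli download` command from the `huggingface_hub` package (optional; the repo IDs below match the ones used in the sampling commands later in this README):

```bash
# Optional: pre-download the checkpoints into the local HuggingFace cache.
huggingface-cli download hywang66/LARP-L-long-tokenizer
huggingface-cli download hywang66/LARP-L-long-AR
huggingface-cli download hywang66/LARP-L-long-AR-FP
```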
We provide scripts for training the LARP tokenizer, the LARP AR model, and the LARP AR frame prediction model using a single GPU:

```bash
bash scripts/train_larp_tokenizer.sh
bash scripts/train_larp_ar.sh
bash scripts/train_larp_ar_fp.sh
```
To reproduce the pretrained models released on HuggingFace, refer to the following training scripts:

```bash
scripts/train_larp_tokenizer_reproduce.sh
scripts/train_larp_ar_reproduce.sh
scripts/train_larp_ar_fp_reproduce.sh
```
The `sample.py` script can be used to sample videos from the LARP AR model and the LARP AR frame prediction model; it also computes the Fréchet Video Distance (FVD) score against the real videos. The `eval/eval_larp_tokenizer.py` script can be used to evaluate the reconstruction performance of the LARP tokenizer.
Unless specified otherwise, all commands in this section are meant to be run on a single-GPU machine.
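On a machine with several GPUs, a single-GPU run can be pinned to one device with the standard `CUDA_VISIBLE_DEVICES` environment variable, e.g.:

```bash
# Run on GPU 0 only (arguments elided; see the full commands below).
CUDA_VISIBLE_DEVICES=0 python3 sample.py ...
```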
The following command samples 10,000 videos from the LARP AR model trained on the UCF101 dataset and computes the FVD score against the real videos. The videos are generated class-conditionally, i.e., each video is conditioned on a single class label. Note that the UCF101 dataset is required to run this script.

This command reproduces the UCF101 generation FVD results reported in Table 1 of the paper.
```bash
python3 sample.py \
    --ar_model hywang66/LARP-L-long-AR \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --output_dir samples/ucf_reproduce \
    --num_samples 10000 \
    --sample_batch_size 64 \
    --cfg_scale 1.25 \
    --dtype bfloat16 \
    --dataset_csv ucf101_train.csv \
    --dataset_split_seed 42
```
The FVD score will be displayed at the end of the script and also appended to the `fvd_report.csv` file in the project directory.
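Since `fvd_report.csv` is a plain CSV, past runs can be inspected directly from the shell (a convenience one-liner, not part of the provided scripts; the exact column layout may vary):

```bash
# Pretty-print the last few FVD entries.
column -s, -t < fvd_report.csv | tail -n 5
```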
The following command predicts the next 11 frames conditioned on the previous 5 frames using the LARP AR frame prediction model trained on the Kinetics-600 dataset. 50,000 samples are generated and used to compute the FVD score against the real videos. Note that the Kinetics-600 dataset is required to run this script.

This command reproduces the Kinetics-600 frame prediction FVD results reported in Table 1 of the paper.
```bash
python3 sample.py \
    --fp --num_cond_frames 5 \
    --ar_model hywang66/LARP-L-long-AR-FP \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --output_dir samples/k600_FP_reproduce \
    --num_samples 50000 \
    --sample_batch_size 64 \
    --dtype bfloat16 \
    --dataset_csv k600_val.csv \
    --dataset_split_seed 42
```
The FVD score will be displayed at the end of the script and also appended to the `fvd_report.csv` file in the project directory.
When multiple GPUs are available, `sample.py` can be run in parallel to accelerate the sampling process. Set the `--num_samples` argument to specify the per-GPU number of samples, and use the `--num_samples_total` argument to define the total number of samples. Importantly, set the `--starting_index` argument to specify the starting index for each process, ensuring that it samples videos from `--starting_index` to `--starting_index + --num_samples` (exclusive).
Example commands:
```bash
python3 sample.py \
    --ar_model hywang66/LARP-L-long-AR \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --output_dir samples/ucf_reproduce \
    --num_samples 128 \
    --num_samples_total 10000 \
    --starting_index 0 \
    --sample_batch_size 64 \
    --cfg_scale 1.25 \
    --dtype bfloat16 \
    --dataset_csv ucf101_train.csv \
    --dataset_split_seed 42
```
```bash
python3 sample.py \
    --ar_model hywang66/LARP-L-long-AR \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --output_dir samples/ucf_reproduce \
    --num_samples 128 \
    --num_samples_total 10000 \
    --starting_index 128 \
    --sample_batch_size 64 \
    --cfg_scale 1.25 \
    --dtype bfloat16 \
    --dataset_csv ucf101_train.csv \
    --dataset_split_seed 42
```
...
```bash
python3 sample.py \
    --ar_model hywang66/LARP-L-long-AR \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --output_dir samples/ucf_reproduce \
    --num_samples 16 \
    --num_samples_total 10000 \
    --starting_index 9984 \
    --sample_batch_size 64 \
    --cfg_scale 1.25 \
    --dtype bfloat16 \
    --dataset_csv ucf101_train.csv \
    --dataset_split_seed 42
```
Ensure there is no overlap in sample indices across processes, and assign each process to a different GPU. Once all processes have completed (in any order), the FVD score will be automatically calculated and appended to the `fvd_report.csv` file in the project directory.
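To avoid writing out the index arithmetic by hand, the per-process commands can be generated with a small launcher loop. This is a hypothetical convenience script, not part of the repository: it splits the 10,000 UCF101 samples into contiguous, non-overlapping ranges and pins one process per GPU.

```bash
#!/usr/bin/env bash
# Hypothetical launcher: shard sampling across NUM_GPUS parallel processes.
NUM_GPUS=4
TOTAL=10000
PER_GPU=$(( (TOTAL + NUM_GPUS - 1) / NUM_GPUS ))  # ceil(TOTAL / NUM_GPUS)
for i in $(seq 0 $((NUM_GPUS - 1))); do
  START=$(( i * PER_GPU ))
  # The last shard may be smaller so the ranges exactly cover [0, TOTAL).
  COUNT=$(( TOTAL - START < PER_GPU ? TOTAL - START : PER_GPU ))
  CUDA_VISIBLE_DEVICES=$i python3 sample.py \
      --ar_model hywang66/LARP-L-long-AR \
      --tokenizer hywang66/LARP-L-long-tokenizer \
      --output_dir samples/ucf_reproduce \
      --num_samples $COUNT \
      --num_samples_total $TOTAL \
      --starting_index $START \
      --sample_batch_size 64 \
      --cfg_scale 1.25 \
      --dtype bfloat16 \
      --dataset_csv ucf101_train.csv \
      --dataset_split_seed 42 &
done
wait  # block until every shard finishes; FVD is then computed automatically
```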
The following command evaluates the LARP tokenizer on the UCF101 dataset. The script computes the reconstruction FVD (rFVD) and other related metrics. Note that the UCF101 dataset is required to run this script.

This command reproduces the LARP tokenizer reconstruction FVD results reported in Table 1 of the paper.
```bash
python3 eval/eval_larp_tokenizer.py \
    --tokenizer hywang66/LARP-L-long-tokenizer \
    --dataset_csv ucf101_train.csv \
    --use_amp --det
```
If you find this code useful in your research, please consider citing:
```bibtex
@article{larp,
  title={LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior},
  author={Hanyu Wang and Saksham Suri and Yixuan Ren and Hao Chen and Abhinav Shrivastava},
  year={2024},
  eprint={2410.21264},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2410.21264},
}
```