Official PyTorch implementation of our paper at the ICCV 2025 1st Multimodal Sign Language Recognition (MSLR) Workshop
This repository presents a point-supervised temporal localization pipeline for Japanese fingerspelling. We enhance HR-Pro with three key components:
- A transformer-based encoder (VideoMAE v2)
- SimCLR-style point-supervised contrastive learning (Point-Sup. CL; a loss sketch follows below)
- Joint angle features derived from MediaPipe Hands (an angle-feature sketch follows below)
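As a rough illustration of the Point-Sup. CL objective, the sketch below implements a SimCLR-style (NT-Xent) loss in which clip embeddings sharing the same point-annotation label are treated as positives. The function name, positive-sampling scheme, and temperature are illustrative assumptions rather than the paper's exact formulation; in practice the embeddings would come from the VideoMAE v2 encoder (typically followed by a projection head).

```python
# Sketch of a point-supervised, SimCLR-style (NT-Xent) objective.
# Assumption: each row of `embeddings` is a clip embedding centred on a
# point annotation, and clips sharing a label are treated as positives.
# The paper's exact sampling, projection head, and temperature may differ.
import torch
import torch.nn.functional as F

def point_supervised_nt_xent(embeddings: torch.Tensor,
                             labels: torch.Tensor,
                             temperature: float = 0.1) -> torch.Tensor:
    """embeddings: (N, D) clip embeddings; labels: (N,) point-annotation classes."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / temperature                                   # (N, N) similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask

    # Softmax denominator runs over all other samples (anchor excluded).
    sim = sim.masked_fill(self_mask, float("-inf"))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)                 # avoid -inf * 0

    # Mean negative log-probability of the positives for each anchor.
    pos_counts = pos_mask.sum(dim=1)
    valid = pos_counts > 0                                          # anchors with >= 1 positive
    loss = -(log_prob * pos_mask.float()).sum(dim=1)
    return (loss[valid] / pos_counts[valid]).mean()
```

The angular features can be illustrated in a similar spirit. The sketch below assumes joint angles are measured at the finger joints of MediaPipe's 21-landmark hand model; the exact angle set and normalization used in the paper may differ, and `joint_angles` is a hypothetical helper name.

```python
# Illustrative sketch only: per-frame joint angles from MediaPipe Hands'
# 21-landmark output (0 = wrist; 1-4 thumb; 5-8 index; 9-12 middle;
# 13-16 ring; 17-20 pinky). The paper's exact angle definition may differ.
import numpy as np

# (parent, joint, child) triplets: the angle is measured at `joint`.
_ANGLE_TRIPLETS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4),          # thumb
    (0, 5, 6), (5, 6, 7), (6, 7, 8),          # index
    (0, 9, 10), (9, 10, 11), (10, 11, 12),    # middle
    (0, 13, 14), (13, 14, 15), (14, 15, 16),  # ring
    (0, 17, 18), (17, 18, 19), (18, 19, 20),  # pinky
]

def joint_angles(landmarks: np.ndarray) -> np.ndarray:
    """landmarks: (21, 3) x/y/z hand landmarks for one frame -> (15,) angles in radians."""
    angles = []
    for parent, joint, child in _ANGLE_TRIPLETS:
        v1 = landmarks[parent] - landmarks[joint]
        v2 = landmarks[child] - landmarks[joint]
        cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.asarray(angles, dtype=np.float32)
```

Stacking such per-frame vectors over time yields an angular feature stream that can be combined with the video features.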
```
.
├── feature_extraction   # Feature extraction pipelines (Angular / I3D)
├── hrpro                # HR-Pro-based temporal localization with point annotations
└── videomae             # Point-supervised contrastive learning with VideoMAE v2
```
Each directory includes its own README.md
with detailed instructions for setup and execution.
Prerequisites:
- uv (or your preferred Python package manager)
- CUDA >= 12.4
- OpenCV
We have released ub-MOJI, a Japanese fingerspelling video dataset with point-level annotations, on Hugging Face.
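A minimal download sketch using `huggingface_hub` is shown below; `"<org>/ub-MOJI"` is a placeholder, so substitute the actual repository id listed on the dataset's Hugging Face page.

```python
# Hypothetical download sketch; "<org>/ub-MOJI" is a placeholder for the
# actual Hugging Face dataset repository id listed on the dataset page.
from huggingface_hub import snapshot_download

local_path = snapshot_download(repo_id="<org>/ub-MOJI", repo_type="dataset")
print(local_path)  # directory containing the downloaded videos and annotations
```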
Localization performance (mean Average Precision across tIoU thresholds) on the ub-MOJI dataset:
Model | mAP@0.1–0.5 | mAP@0.3–0.7 | mAP@0.1–0.7 |
---|---|---|---|
I3D (RGB + Flow) | 57.6% | 50.8% | 52.9% |
I3D + Angular | 90.8% | 80.4% | 84.0% |
VideoMAE v2 | 62.9% | 56.5% | 58.6% |
VideoMAE v2 + Point-Sup. CL | 93.4% | 78.9% | 83.6% |
VideoMAE v2 + Point-Sup. CL + Angular | 90.9% | 79.6% | 83.7% |
Download our trained checkpoints from Hugging Face.
For questions or collaborations, feel free to open an Issue or Pull Request.
This code is released under the MIT License.
Please refer to the dataset repository for dataset-specific licensing terms.
Contributors:
- Ryota Murai (Code Owner)
- Naoto Tsuta
- Duk Shin
- Yousun Kang
TBD