M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization

[Figure: M-GRPO method overview diagram (m_grpo.drawio)]

Motivation: existing self-supervised reinforcement learning (RL) methods are prone to "policy collapse."

Self-supervised reinforcement learning (RL) presents a promising avenue for enhancing the reasoning capabilities of LLMs without reliance on expensive human-annotated data. However, existing methods are prone to "policy collapse," a phenomenon where the learning process becomes unstable during extended training, leading to a sharp degradation in both reward and task performance. This paper diagnoses this instability, attributing it to the lack of a stable target in self-rewarding systems. To address this, we introduce M-GRPO, a momentum-anchored method that leverages a slowly evolving momentum model to provide a consistent and reliable training signal, stabilizing the generation of pseudo-labels for policy optimization. Our experiments, conducted on the MATH dataset, demonstrate that M-GRPO effectively prevents policy collapse, maintaining a stable training reward and consistently high validation accuracy.
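
The momentum anchor described above is typically realized as an exponential moving average (EMA) of the policy weights, with the anchor model scoring sampled completions to produce pseudo-labels for a GRPO-style update. The sketch below is a minimal illustration of that idea under these assumptions; the function names, the momentum coefficient, and the group-relative advantage normalization are illustrative and do not reproduce the repository's actual training loop.

```python
import torch


@torch.no_grad()
def ema_update(momentum_model: torch.nn.Module,
               policy_model: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Momentum-anchor update: the anchor slowly tracks the policy via an EMA.

    A momentum close to 1 keeps the anchor nearly fixed between steps, so the
    pseudo-labeling signal it produces changes slowly and stays consistent.
    """
    for m_param, p_param in zip(momentum_model.parameters(),
                                policy_model.parameters()):
        m_param.mul_(momentum).add_(p_param, alpha=1.0 - momentum)


def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: z-normalize rewards within each group.

    `rewards` has shape (num_prompts, group_size); each row holds the rewards
    of the completions sampled for one prompt.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)


# Hypothetical wiring of one training step (scoring/loss helpers assumed):
# momentum_model = copy.deepcopy(policy_model).eval()         # frozen anchor
# rewards = score_with_momentum_model(momentum_model, prompts, completions)
# loss = grpo_policy_loss(policy_model, completions, grpo_advantages(rewards))
# loss.backward(); optimizer.step(); optimizer.zero_grad()
# ema_update(momentum_model, policy_model)                     # anchor drifts slowly
```

Because the anchor only drifts slowly toward the policy, the reward signal it provides cannot chase the policy's own recent outputs, which is the intuition behind preventing policy collapse.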

How to use

First, download the MATH dataset and prepare it using the following Python script:

python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B

Then, run the following command to start training (first set WANDB_KEY in the math_intuitor.sh script to your own Weights & Biases API key):

bash math_intuitor.sh

📚 References

This project builds upon the following open-source repositories:
