M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
Self-supervised reinforcement learning (RL) presents a promising avenue for enhancing the reasoning capabilities of LLMs without reliance on expensive human-annotated data. However, existing methods are prone to "policy collapse," a phenomenon where the learning process becomes unstable during extended training, leading to a sharp degradation in both reward and task performance. This paper diagnoses this instability, attributing it to the lack of a stable target in self-rewarding systems. To address this, we introduce M-GRPO, a momentum-anchored method that leverages a slowly evolving momentum model to provide a consistent and reliable training signal, stabilizing the generation of pseudo-labels for policy optimization. Our experiments, conducted on the MATH dataset, demonstrate that M-GRPO effectively prevents policy collapse, maintaining a stable training reward and consistently high validation accuracy.
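To illustrate the momentum-anchor idea described above, the sketch below shows one common way such an anchor can be maintained: as an exponential moving average (EMA) of the policy weights, with the anchor used only to produce the pseudo-label signal. This is a minimal sketch under assumptions, not the authors' implementation; it assumes a PyTorch policy, and the names `update_momentum_model` and `ema_decay` are illustrative. The pseudo-label scoring and the GRPO update itself are elided.

import copy
import torch

@torch.no_grad()
def update_momentum_model(policy: torch.nn.Module,
                          momentum: torch.nn.Module,
                          ema_decay: float = 0.999) -> None:
    """Slowly move the momentum (anchor) weights toward the current policy.

    A decay close to 1.0 keeps the anchor evolving slowly, which is what
    provides the stable pseudo-label target described in the abstract.
    """
    for p_policy, p_momentum in zip(policy.parameters(), momentum.parameters()):
        p_momentum.mul_(ema_decay).add_(p_policy, alpha=1.0 - ema_decay)

# Illustrative usage: the anchor starts as a frozen copy of the policy and is
# updated once per optimization step, after the policy update.
policy = torch.nn.Linear(8, 8)  # stand-in for the LLM policy
momentum = copy.deepcopy(policy).requires_grad_(False)
# ... score rollouts / build pseudo-label rewards with `momentum`,
#     then take a GRPO-style gradient step on `policy` ...
update_momentum_model(policy, momentum, ema_decay=0.999)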
First, download the MATH dataset and prepare it using the following Python script:
python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
Then, run the following command to start training (first change the WANDB_KEY in the math_intuitor.sh script to your own WANDB key):
bash math_intuitor.sh
This project builds upon the following open-source repositories:
- intuitor (License: Apache License 2.0)
- open-r1 (License: Apache License 2.0)
- verl (License: Apache License 2.0)