M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization
Self-supervised reinforcement learning (RL) presents a promising avenue for enhancing the reasoning capabilities of LLMs without reliance on expensive human-annotated data. However, existing methods are prone to "policy collapse," a phenomenon where the learning process becomes unstable during extended training, leading to a sharp degradation in both reward and task performance. This paper diagnoses this instability, attributing it to the lack of a stable target in self-rewarding systems. To address this, we introduce M-GRPO, a momentum-anchored method that leverages a slowly evolving momentum model to provide a consistent and reliable training signal, stabilizing the generation of pseudo-labels for policy optimization. Our experiments, conducted on the MATH dataset, demonstrate that M-GRPO effectively prevents policy collapse, maintaining a stable training reward and consistently high validation accuracy.
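To illustrate the momentum-anchor idea described above, the sketch below shows one common way such an anchor can be maintained: as an exponential moving average (EMA) of the policy weights, with the anchor used only to produce the pseudo-label signal. This is a minimal sketch under assumptions, not the authors' implementation; it assumes a PyTorch policy, and the names `update_momentum_model` and `ema_decay` are illustrative. The pseudo-label scoring and the GRPO update itself are elided.

import copy
import torch

@torch.no_grad()
def update_momentum_model(policy: torch.nn.Module,
                          momentum: torch.nn.Module,
                          ema_decay: float = 0.999) -> None:
    """Slowly move the momentum (anchor) weights toward the current policy.

    A decay close to 1.0 keeps the anchor evolving slowly, which is what
    provides the stable pseudo-label target described in the abstract.
    """
    for p_policy, p_momentum in zip(policy.parameters(), momentum.parameters()):
        p_momentum.mul_(ema_decay).add_(p_policy, alpha=1.0 - ema_decay)

# Illustrative usage: the anchor starts as a frozen copy of the policy and is
# updated once per optimization step, after the policy update.
policy = torch.nn.Linear(8, 8)  # stand-in for the LLM policy
momentum = copy.deepcopy(policy).requires_grad_(False)
# ... score rollouts / build pseudo-label rewards with `momentum`,
#     then take a GRPO-style gradient step on `policy` ...
update_momentum_model(policy, momentum, ema_decay=0.999)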
First, download the MATH dataset and prepare it using the following Python script:
python examples/data_preprocess/math_dataset_ours.py --model Qwen2.5-3B
Then, run the following command to start training (first change the WANDB_KEY in the math_intuitor.sh script to your own WANDB key):
bash math_intuitor.sh
This project builds upon the following open-source repositories:
- intuitor (License: Apache License 2.0)
- open-r1 (License: Apache License 2.0)
- verl (License: Apache License 2.0)