DeepSeek-R1 has disrupted the AI community by providing a competitive, completely open-source alternative to OpenAI's original reasoning models. Before the DeepSeek team released their report, the methods and techniques used to enable long, reflective reasoning capabilities in language models had been largely speculative.
While it's still not clear exactly how private labs like OpenAI achieved their own reasoning capabilities, DeepSeek fully outlines their technique in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.
Within the paper they outline the entire training pipeline for DeepSeek-R1, along with their breakthrough reinforcement learning technique, Group Relative Policy Optimization (GRPO), originally introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
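As a quick preview before we cover the algorithm in detail below: GRPO's core idea is that, rather than training a separate value (critic) model as PPO does, it samples a group of completions for each prompt and normalizes each completion's reward against the group's mean and standard deviation. The following is a minimal sketch of just that normalization step; the function name and the four-completion example are our own illustration, not code from the paper.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize each completion's reward against the group's mean and
    standard deviation. This group-relative advantage is what lets GRPO
    skip the learned value model that PPO requires."""
    # Small epsilon guards against division by zero when all rewards are equal.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Example: rewards for 4 completions sampled from the same prompt.
rewards = torch.tensor([1.0, 0.0, 0.5, 1.0])
print(group_relative_advantages(rewards))
# Completions scored above the group average get positive advantages,
# those below get negative ones.
```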
In this notebook we'll show how GRPO is applied by covering:
- How the Algorithm Works
- The DeepSeek-R1 Training Pipeline
- Applying GRPO Training Ourselves to Qwen2.5-3B-Instruct