Unlocking Reasoning with Group Relative Policy Optimization (GRPO)

DeepSeek-R1 has disrupted the AI community by providing a competitive, completely open-source alternative to OpenAI's original reasoning models. Before the DeepSeek team released their report, the methods and techniques used to enable long, reflective reasoning in language models had been largely speculative.

While it's still not clear exactly how the private labs like OpenAI have achieved their own reasoning, DeepSeek fully outlines their technique in the paper DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

Within the paper they outline the entire training pipeline for DeepSeek-R1 along with their breakthrough using a new reinforcement learning technique, Group Relative Policy Optimization (GRPO), originally outlined in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models.
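As a preview of the algorithm (covered in detail in the notebook), the core idea from the DeepSeekMath paper is to drop PPO's learned value-function critic: for each prompt, a group of $G$ completions is sampled, each completion is scored with a reward $r_i$, and the advantage is simply each reward standardized against its own group:

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \ldots, r_G\})}{\operatorname{std}(\{r_1, \ldots, r_G\})}$$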

In this notebook we'll show how GRPO is applied by covering:

  1. How the Algorithm Works
  2. The DeepSeek-R1 Training Pipeline
  3. Applying GRPO Training Ourselves to Qwen-2.5-3B-Instruct (a minimal sketch of this step follows below)
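To preview step 3: below is a minimal, hypothetical sketch of GRPO fine-tuning using Hugging Face's `trl` library, which ships a `GRPOTrainer`. The dataset and the toy length-based reward here are illustrative placeholders, not the exact setup used in the notebook.

```python
# Minimal GRPO fine-tuning sketch (assumes trl >= 0.14, which added GRPOTrainer).
# The dataset and reward function below are illustrative stand-ins only.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Any prompt-style dataset works; trl's tldr dataset is used here as a stand-in.
dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy rule-based reward: prefer completions near 200 characters.
    return [-abs(len(c) - 200) / 200 for c in completions]

training_args = GRPOConfig(
    output_dir="qwen2.5-3b-grpo",
    num_generations=8,              # group size G sampled per prompt
    max_completion_length=256,
    per_device_train_batch_size=8,  # effective batch must divide evenly into groups
    logging_steps=10,
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

In R1-style training the reward functions would instead score answer correctness and adherence to the reasoning output format, rather than completion length.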

