Break up grpo_train to be more functional/composable to enable R&D
Implementing e.g. REINFORCE then becomes a matter of swapping out the loss function and the rollout config.
It also lets us easily experiment with ideas like expert demonstrations, curriculum learning, etc. The goal is a set of composable primitives that we can assemble into GRPO, REINFORCE, PPO, …
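To make the idea concrete, here is a rough sketch of what "swapping out the loss function and rollout config" could look like. Every name in it (`RolloutConfig`, `Recipe`, `group_normalized_advantages`, `raw_reward_advantages`, `policy_gradient_loss`) is hypothetical and not taken from the current `grpo_train` code. It is pure Python on plain floats so it runs standalone, and it leaves out the ratio clipping and KL terms that a full GRPO or PPO objective would include.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence

# Hypothetical primitive: maps one group's rewards to per-sample advantages.
AdvantageFn = Callable[[Sequence[float]], List[float]]


@dataclass(frozen=True)
class RolloutConfig:
    group_size: int  # completions sampled per prompt (assumed knob)


@dataclass(frozen=True)
class Recipe:
    """An algorithm expressed as a bundle of swappable pieces."""
    rollout: RolloutConfig
    advantage_fn: AdvantageFn


def group_normalized_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style: standardize rewards within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def raw_reward_advantages(rewards: Sequence[float]) -> List[float]:
    """REINFORCE without a baseline: the advantage is the reward itself."""
    return list(rewards)


def policy_gradient_loss(logprobs: Sequence[float], rewards: Sequence[float],
                         advantage_fn: AdvantageFn) -> float:
    """Surrogate loss -mean(advantage * logprob) over one group."""
    advantages = advantage_fn(rewards)
    return -mean([a * lp for a, lp in zip(advantages, logprobs)])


# Two recipes assembled from the same primitives.
GRPO_LIKE = Recipe(RolloutConfig(group_size=8), group_normalized_advantages)
REINFORCE = Recipe(RolloutConfig(group_size=1), raw_reward_advantages)
```

With this shape, a shared training loop would only take a `Recipe`; expert demonstrations or a curriculum could plug in as a different source of rollouts without touching the loss code.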