Workshop 4: Learning from Experience

Fine-tuning with Reinforcement Learning (HIL-SERL)

What We'll Cover

  • The limits of imitation learning
  • LeRobot's RL workflow: HIL-SERL
  • Reward design: sparse vs. dense rewards
  • The Actor-Learner architecture
  • Pro-tips for RL training
  • Capstone: the GR00T recipe

The Limits of Imitation

Problem: Your imitation-trained policy is good, but not perfect. It only knows what you showed it. It struggles with new situations and can't discover better ways to do the task.

Solution: Reinforcement Learning (RL). Let the robot learn from its own trial-and-error to improve and perfect a skill.

LeRobot's RL Workflow: HIL-SERL

Human-in-the-Loop Sample-Efficient Reinforcement Learning

It's a hybrid approach! We don't start from scratch. We use your imitation learning dataset as a starting point, and then a human (you!) can intervene during live training to guide the robot, making the process much faster and safer.

How a Robot *Really* Learns

The RL Loop: observe the state → take an action → receive a reward → update the policy → repeat.

The Reward Function is how YOU define the goal.
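Here is a minimal sketch of that loop in Python, using the gymnasium API with a toy environment as a stand-in for a real robot; the random action is a placeholder for a learned policy, not a LeRobot API.

```python
import gymnasium as gym

# Toy stand-in environment; on a real robot this would be your robot env.
env = gym.make("Pendulum-v1")
obs, info = env.reset()

for step in range(1000):
    # A trained policy would choose the action from the observation;
    # here we sample randomly as a placeholder.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    # `reward` is the number YOUR reward function returns each step.
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```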

Anatomy of a Reward Function

Sparse Reward

`return 1` if task is complete, `0` otherwise.

Like telling a child "you get a cookie if you clean your room" but giving no other instructions. The robot may never succeed by chance, so it never receives any learning signal.

Dense Reward

`reward = -distance_to_cube`

Gives continuous feedback, guiding the robot effectively. "You're getting warmer!"
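The two styles side by side, as an illustrative Python sketch; the function names and arguments are invented for this workshop, not LeRobot APIs.

```python
import numpy as np

def sparse_reward(task_complete: bool) -> float:
    # All-or-nothing: no signal at all until the robot already succeeds.
    return 1.0 if task_complete else 0.0

def dense_reward(gripper_pos: np.ndarray, cube_pos: np.ndarray) -> float:
    # Continuous feedback: reward rises as the gripper approaches the cube.
    return -float(np.linalg.norm(gripper_pos - cube_pos))
```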

The Actor-Learner Architecture

Training is split into two processes running at the same time; a toy sketch follows the lists below:

1. The Actor

  • Lives on the robot.
  • Executes the policy, takes actions.
  • You can interrupt it at any time!
  • Sends experience to the Learner.

2. The Learner

  • Lives on a powerful GPU.
  • Receives experience from the Actor.
  • Updates the policy's network.
  • Sends the improved policy back.
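The toy sketch below uses Python threads and queues purely to illustrate the data flow between the two processes; LeRobot's actual Actor and Learner run as separate processes, and all the names here are placeholders.

```python
import queue
import random
import threading
import time

experience_q = queue.Queue()  # Actor -> Learner: collected transitions
policy_q = queue.Queue()      # Learner -> Actor: updated policy parameters

def actor(steps: int = 50) -> None:
    policy = 0.0  # stand-in for network weights
    for t in range(steps):
        try:
            policy = policy_q.get_nowait()  # pick up the newest policy, if any
        except queue.Empty:
            pass
        transition = (t, random.random())   # stand-in for (obs, action, reward, next_obs)
        experience_q.put(transition)        # ship experience to the Learner
        time.sleep(0.02)                    # robot control period

def learner(steps: int = 50) -> None:
    policy = 0.0
    for _ in range(steps):
        transition = experience_q.get()     # receive experience from the Actor
        policy += 0.01 * transition[1]      # stand-in for a gradient update
        policy_q.put(policy)                # push the improved policy back

threading.Thread(target=learner, daemon=True).start()
actor()
```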

Interactive Session: Reward Design

Task: Teach the robot to open a drawer.

What are the components of a good reward? (A sketch combining them follows this list.)

  • Reward for moving hand to handle.
  • Reward for grasping handle.
  • Reward for pulling in the correct direction.
  • Big reward for the drawer being fully open.
  • Small time penalty (to encourage efficiency).
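One way to combine those terms into a single shaped reward, as an illustrative sketch; every weight, threshold, and argument name here is a made-up workshop value, not a tuned constant or a LeRobot API.

```python
import numpy as np

def drawer_reward(gripper_pos, handle_pos, grasping, open_fraction, fully_open):
    reward = 0.0
    reward -= 0.1 * float(np.linalg.norm(gripper_pos - handle_pos))  # move hand toward handle
    if grasping:
        reward += 0.5                                                # grasping the handle
    reward += 2.0 * open_fraction                                    # pulling in the correct direction (0..1)
    if fully_open:
        reward += 10.0                                               # big bonus for a fully open drawer
    reward -= 0.01                                                   # small per-step time penalty
    return reward
```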

Pro-Tips for RL Training

Three settings from the docs that have a big impact (an example snippet follows the list):

  • temperature_init: Controls exploration. Tip: Start low (`1e-2`).
  • policy_parameters_push_frequency: How often the Learner sends the new policy to the Actor. Tip: `2-4` seconds is a good balance.
  • storage_device: Where the Learner keeps the policy. Tip: Set to `"cuda"` if you have enough GPU memory.
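An illustrative way to write those settings down together; only the three parameter names come from the tips above, while the surrounding structure is invented for this workshop, so check the LeRobot docs for the exact config layout.

```python
# Illustrative only: the dict layout is made up for this workshop;
# the three parameters are the ones discussed above.
rl_training_config = {
    "temperature_init": 1e-2,               # start with little random exploration
    "storage_device": "cuda",               # keep the Learner's policy on the GPU
    "policy_parameters_push_frequency": 4,  # seconds between pushes of the new policy to the Actor
}
```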

Capstone: The GR00T Recipe

Source: GR00T Paper & Model Card

What it is: An open foundation model for generalist humanoid robots.

The Recipe: GR00T uses the concepts you've learned today, but at massive scale:

  1. Heterogeneous Data: Learns from real robot data, human videos, and synthetic data (advanced Imitation Learning).
  2. Vision-Language Model: Understands instructions with a powerful VLM (like we did with Prompt Engineering).
  3. Diffusion Transformer: Generates actions using an advanced policy, which is then fine-tuned (Reinforcement Learning).

The skills you're learning today are the building blocks for the frontier of robotics research.

Q&A

Any questions about Reinforcement Learning?