Workshop 4: Learning from Experience

Fine-tuning with Reinforcement Learning (HIL-SERL)

What We'll Cover

  • The limits of imitation learning
  • LeRobot's RL workflow: HIL-SERL
  • Reward design: sparse vs. dense rewards
  • The Actor-Learner architecture
  • Pro-tips for RL training
  • Capstone: the GR00T recipe

The Limits of Imitation

Problem: Your imitation-trained policy is good, but not perfect. It only knows what you showed it. It struggles with new situations and can't discover better ways to do the task.

Solution: Reinforcement Learning (RL). Let the robot learn from its own trial-and-error to improve and perfect a skill.

LeRobot's RL Workflow: HIL-SERL

Human-in-the-Loop Sample-Efficient Reinforcement Learning

It's a hybrid approach! We don't start from scratch. We use your imitation learning dataset as a starting point, and then a human (you!) can intervene during live training to guide the robot, making the process much faster and safer.

How a Robot *Really* Learns

The RL Loop: observe the state → take an action → receive a reward → update the policy → repeat.

The Reward Function is how YOU define the goal.
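Here is a minimal sketch of that loop in Python, using the gymnasium API with a toy environment as a stand-in for a real robot; the random action is a placeholder for a learned policy, not a LeRobot API.

```python
import gymnasium as gym

# Toy stand-in environment; on a real robot this would be your robot env.
env = gym.make("Pendulum-v1")
obs, info = env.reset()

for step in range(1000):
    # A trained policy would choose the action from the observation;
    # here we sample randomly as a placeholder.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)
    # `reward` is the number YOUR reward function returns each step.
    if terminated or truncated:
        obs, info = env.reset()

env.close()
```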

Anatomy of a Reward Function

Sparse Reward

`return 1` if task is complete, `0` otherwise.

Like telling a child "you get a cookie if you clean your room" but giving no other instructions. The robot may never succeed by chance, so it never receives any learning signal.

Dense Reward

`reward = -distance_to_cube`

Gives continuous feedback, guiding the robot effectively. "You're getting warmer!"
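The two styles side by side, as an illustrative Python sketch; the function names and arguments are invented for this workshop, not LeRobot APIs.

```python
import numpy as np

def sparse_reward(task_complete: bool) -> float:
    # All-or-nothing: no signal at all until the robot already succeeds.
    return 1.0 if task_complete else 0.0

def dense_reward(gripper_pos: np.ndarray, cube_pos: np.ndarray) -> float:
    # Continuous feedback: reward rises as the gripper approaches the cube.
    return -float(np.linalg.norm(gripper_pos - cube_pos))
```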

The Actor-Learner Architecture

Training is split into two processes running at the same time; a toy sketch follows the lists below:

1. The Actor

  • Lives on the robot.
  • Executes the policy, takes actions.
  • You can interrupt it at any time!
  • Sends experience to the Learner.

2. The Learner

  • Lives on a powerful GPU.
  • Receives experience from the Actor.
  • Updates the policy's network.
  • Sends the improved policy back.
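The toy sketch below uses Python threads and queues purely to illustrate the data flow between the two processes; LeRobot's actual Actor and Learner run as separate processes, and all the names here are placeholders.

```python
import queue
import random
import threading
import time

experience_q = queue.Queue()  # Actor -> Learner: collected transitions
policy_q = queue.Queue()      # Learner -> Actor: updated policy parameters

def actor(steps: int = 50) -> None:
    policy = 0.0  # stand-in for network weights
    for t in range(steps):
        try:
            policy = policy_q.get_nowait()  # pick up the newest policy, if any
        except queue.Empty:
            pass
        transition = (t, random.random())   # stand-in for (obs, action, reward, next_obs)
        experience_q.put(transition)        # ship experience to the Learner
        time.sleep(0.02)                    # robot control period

def learner(steps: int = 50) -> None:
    policy = 0.0
    for _ in range(steps):
        transition = experience_q.get()     # receive experience from the Actor
        policy += 0.01 * transition[1]      # stand-in for a gradient update
        policy_q.put(policy)                # push the improved policy back

threading.Thread(target=learner, daemon=True).start()
actor()
```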

Interactive Session: Reward Design

Task: Teach the robot to open a drawer.

What are the components of a good reward? (A sketch combining them follows this list.)

  • Reward for moving hand to handle.
  • Reward for grasping handle.
  • Reward for pulling in the correct direction.
  • Big reward for the drawer being fully open.
  • Small time penalty (to encourage efficiency).
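One way to combine those terms into a single shaped reward, as an illustrative sketch; every weight, threshold, and argument name here is a made-up workshop value, not a tuned constant or a LeRobot API.

```python
import numpy as np

def drawer_reward(gripper_pos, handle_pos, grasping, open_fraction, fully_open):
    reward = 0.0
    reward -= 0.1 * float(np.linalg.norm(gripper_pos - handle_pos))  # move hand toward handle
    if grasping:
        reward += 0.5                                                # grasping the handle
    reward += 2.0 * open_fraction                                    # pulling in the correct direction (0..1)
    if fully_open:
        reward += 10.0                                               # big bonus for a fully open drawer
    reward -= 0.01                                                   # small per-step time penalty
    return reward
```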

Pro-Tips for RL Training

Three settings from the docs that have a big impact (an example snippet follows the list):

  • temperature_init: Controls exploration. Tip: Start low (`1e-2`).
  • policy_parameters_push_frequency: How often the Learner sends the new policy to the Actor. Tip: `2-4` seconds is a good balance.
  • storage_device: Where the Learner keeps the policy. Tip: Set to `"cuda"` if you have enough GPU memory.
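An illustrative way to write those settings down together; only the three parameter names come from the tips above, while the surrounding structure is invented for this workshop, so check the LeRobot docs for the exact config layout.

```python
# Illustrative only: the dict layout is made up for this workshop;
# the three parameters are the ones discussed above.
rl_training_config = {
    "temperature_init": 1e-2,               # start with little random exploration
    "storage_device": "cuda",               # keep the Learner's policy on the GPU
    "policy_parameters_push_frequency": 4,  # seconds between pushes of the new policy to the Actor
}
```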

Capstone: The GR00T Recipe

Source: GR00T Paper & Model Card

What it is: An open foundation model for generalist humanoid robots.

The Recipe: GR00T uses the concepts you've learned today, but at massive scale:

  1. Heterogeneous Data: Learns from real robot data, human videos, and synthetic data (advanced Imitation Learning).
  2. Vision-Language Model: Understands instructions with a powerful VLM (like we did with Prompt Engineering).
  3. Diffusion Transformer: Generates actions using an advanced policy, which is then fine-tuned (Reinforcement Learning).

The skills you're learning today are the building blocks for the frontier of robotics research.

Q&A

Any questions about Reinforcement Learning?