protomotions.agents.ppo.utils module

Utility functions for PPO algorithm.

This module provides helper functions for PPO, including advantage computation using Generalized Advantage Estimation (GAE).

Key Functions:
  • discount_values: Compute GAE advantages from rewards and values

protomotions.agents.ppo.utils.discount_values(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau)[source]

Compute Generalized Advantage Estimation (GAE) advantages.

Computes advantages using GAE-Lambda, which trades off bias and variance in advantage estimation. Advantages are accumulated by iterating backward through the collected rollout and bootstrapping from the value predictions at each step.

Parameters:
  • mb_fdones – Done flags (num_steps, num_envs). 1.0 = episode ended.

  • mb_values – Value predictions at each timestep (num_steps, num_envs).

  • mb_rewards – Rewards received at each timestep (num_steps, num_envs).

  • mb_next_values – Value predictions for next states (num_steps, num_envs).

  • gamma – Discount factor for future rewards (typically 0.99).

  • tau – GAE lambda parameter for bias-variance tradeoff (typically 0.95).

Returns:

Tensor of advantages with shape (num_steps, num_envs).

Example

>>> advantages = discount_values(dones, values, rewards, next_values, 0.99, 0.95)
>>> returns = advantages + values  # Can compute returns from advantages
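For reference, the following is a minimal sketch of the backward GAE recursion this signature suggests. It is an illustration only, not the library's implementation; it assumes PyTorch tensors of shape (num_steps, num_envs) and that mb_next_values already accounts for terminal states. The name gae_advantages_sketch is hypothetical.

import torch

def gae_advantages_sketch(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau):
    # Illustrative re-implementation only, not the library's code.
    # Inputs are assumed to be tensors of shape (num_steps, num_envs);
    # mb_next_values is assumed to already handle terminal-state bootstrapping.
    advantages = torch.zeros_like(mb_rewards)
    last_gae = torch.zeros_like(mb_rewards[0])
    for t in reversed(range(mb_rewards.shape[0])):
        not_done = 1.0 - mb_fdones[t]  # 0.0 where the episode ended at step t
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = mb_rewards[t] + gamma * mb_next_values[t] - mb_values[t]
        # GAE recursion, cut at episode boundaries by the not_done mask
        last_gae = delta + gamma * tau * not_done * last_gae
        advantages[t] = last_gae
    return advantages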

Note

GAE-Lambda balances bias and variance in the advantage estimates: a lower lambda reduces variance at the cost of higher bias, while a higher lambda reduces bias at the cost of higher variance. Lambda=0 reduces to the one-step TD error; lambda=1 recovers Monte Carlo returns (minus the value baseline).
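In the notation of the reference below, with tau playing the role of lambda, the computed recursion is:

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)
\hat{A}_t = \delta_t + \gamma \lambda \hat{A}_{t+1}, with \hat{A}_{t+1} = 0 across episode boundaries.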

Reference:

Schulman et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (2015)