protomotions.agents.ppo.utils module
Utility functions for the PPO algorithm.
This module provides helper functions for PPO, including advantage computation using Generalized Advantage Estimation (GAE).
- Key Functions:
discount_values: Compute GAE advantages from rewards and values
- protomotions.agents.ppo.utils.discount_values(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau)[source]
Compute Generalized Advantage Estimation (GAE) advantages.
Computes advantages using GAE-Lambda, which trades off bias against variance in advantage estimation. The function iterates backwards through the rollout, bootstrapping from the next-state value predictions at each step; a sketch of this recursion is given at the end of this section.
- Parameters:
mb_fdones – Done flags (num_steps, num_envs). 1.0 = episode ended.
mb_values – Value predictions at each timestep (num_steps, num_envs).
mb_rewards – Rewards received at each timestep (num_steps, num_envs).
mb_next_values – Value predictions for next states (num_steps, num_envs).
gamma – Discount factor for future rewards (typically 0.99).
tau – GAE lambda parameter for bias-variance tradeoff (typically 0.95).
- Returns:
Tensor of advantages with shape (num_steps, num_envs).
Example
>>> advantages = discount_values(dones, values, rewards, next_values, 0.99, 0.95)
>>> returns = advantages + values  # Can compute returns from advantages
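A more concrete version of the example above, with dummy data. This is an illustrative snippet, assuming PyTorch tensors of the documented (num_steps, num_envs) shape; the final normalization step is a common PPO convention, not part of discount_values.
>>> import torch
>>> num_steps, num_envs = 32, 4
>>> rewards = torch.randn(num_steps, num_envs)
>>> values = torch.randn(num_steps, num_envs)
>>> next_values = torch.randn(num_steps, num_envs)
>>> dones = torch.zeros(num_steps, num_envs)
>>> dones[-1] = 1.0  # mark the final step of the rollout as terminal
>>> advantages = discount_values(dones, values, rewards, next_values, 0.99, 0.95)
>>> returns = advantages + values  # regression targets for the value function
>>> advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)  # common normalization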
Note
GAE-Lambda balances bias against variance: a low lambda (tau in this function) gives lower variance but more bias, while a high lambda gives less bias but more variance. lambda=0 reduces to the one-step TD residual; lambda=1 recovers Monte Carlo-style returns (minus the value baseline).
- Reference:
Schulman et al. “High-Dimensional Continuous Control Using Generalized Advantage Estimation” (2015)
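For intuition, here is a minimal sketch of the standard GAE-Lambda backward recursion described above. It is an illustrative re-implementation based on the docstring, not the module's actual source; it assumes PyTorch tensors with the documented (num_steps, num_envs) shapes and applies the terminal mask to both the bootstrap term and the advantage accumulation (the real function may handle terminal bootstrapping differently via mb_next_values).
import torch

def gae_sketch(mb_fdones, mb_values, mb_rewards, mb_next_values, gamma, tau):
    # Hypothetical illustration only; not the code behind discount_values.
    num_steps = mb_rewards.shape[0]
    advantages = torch.zeros_like(mb_rewards)
    last_gae = torch.zeros_like(mb_rewards[0])
    for t in reversed(range(num_steps)):
        not_done = 1.0 - mb_fdones[t]  # 0.0 where the episode ended at step t
        # One-step TD residual: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = mb_rewards[t] + gamma * mb_next_values[t] * not_done - mb_values[t]
        # GAE recursion: A_t = delta_t + gamma * tau * (1 - done_t) * A_{t+1}
        last_gae = delta + gamma * tau * not_done * last_gae
        advantages[t] = last_gae
    return advantages
With tau=0 the recursion collapses to the one-step TD residual; with tau=1 it accumulates the full discounted return minus the value baseline, matching the Note above.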