protomotions.agents.ppo.model module

PPO model implementation with actor-critic architecture.

This module implements the neural network models for Proximal Policy Optimization. The actor outputs a Gaussian policy distribution, and the critic estimates state values.

Key Classes:
  • PPOActor: Policy network with Gaussian action distribution

  • PPOModel: Complete actor-critic model for PPO
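The following is a minimal sketch of the diagonal Gaussian policy the module description refers to: the actor predicts a mean per action dimension and keeps a (typically fixed) log standard deviation, from which an action and its negative log probability follow analytically. The batch and action sizes and the random stand-in mean are illustrative assumptions, not values taken from PPOActorConfig.

  import math
  import torch

  act_dim = 8
  mean_action = torch.randn(4, act_dim)      # in the real actor this comes from the mu network
  logstd = torch.zeros(act_dim)              # fixed log standard deviation
  std = logstd.exp()

  # Sample from the diagonal Gaussian N(mean_action, std^2).
  action = mean_action + std * torch.randn_like(mean_action)

  # Analytic negative log probability, summed over action dimensions.
  neglogp = (
      0.5 * (((action - mean_action) / std) ** 2).sum(-1)
      + 0.5 * act_dim * math.log(2 * math.pi)
      + logstd.sum()
  )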

class protomotions.agents.ppo.model.PPOActor(*args, **kwargs)[source]

Bases: TensorDictModuleBase

PPO policy network (actor).

Self-contained policy that computes distribution parameters, samples actions, and computes log probabilities all in a single forward pass.

Parameters:

config (PPOActorConfig) – Actor configuration including network architecture and initial log std.

logstd

Log standard deviation parameter (typically fixed during training).

mu

Neural network that outputs action means.

in_keys

List of input keys, taken from the mu model.

out_keys

List of output keys (action, mean_action, neglogp).

__init__(config)[source]
forward(tensordict)[source]

Forward pass: compute mu/std, sample action, compute neglogp.

The forward pass is self-contained: no separate sampling or log-probability calls are needed.

Parameters:

tensordict (TensorDict) – TensorDict containing observations.

Returns:

TensorDict with action, mean_action, and neglogp added.

Return type:

TensorDict
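As a rough illustration of the in_keys/out_keys contract described above, the sketch below mimics what a forward pass of this kind writes into a TensorDict. It assumes the tensordict package; the observation key "obs" and the single linear layer are stand-ins, not the keys or network defined by the actual mu model.

  import torch
  from torch import nn
  from tensordict import TensorDict

  mu = nn.Linear(32, 8)                      # stand-in for the configured mu network
  logstd = torch.zeros(8)                    # fixed log standard deviation

  td = TensorDict({"obs": torch.randn(4, 32)}, batch_size=[4])

  mean_action = mu(td["obs"])
  dist = torch.distributions.Normal(mean_action, logstd.exp())
  action = dist.sample()

  td["action"] = action                      # stochastic action
  td["mean_action"] = mean_action            # deterministic action (mean)
  td["neglogp"] = -dist.log_prob(action).sum(-1)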

class protomotions.agents.ppo.model.PPOModel(*args, **kwargs)[source]

Bases: BaseModel

Complete PPO model with actor and critic networks.

A pure forward function that computes all model outputs in the TensorDict. The forward pass adds the action distribution parameters and value estimates.

Parameters:

config (PPOModelConfig) – Model configuration specifying actor and critic architectures.

_actor

Policy network.

_critic

Value network.

config: PPOModelConfig
__init__(config)[source]
forward(tensordict)[source]

Forward pass through actor and critic.

This is the main interface for the model. Computes all outputs:

  • action: Sampled action

  • mean_action: Deterministic action (mean)

  • neglogp: Negative log probability of sampled action

  • value: State value estimate

Parameters:

tensordict (TensorDict) – TensorDict containing observations.

Returns:

TensorDict with all model outputs added.

Return type:

TensorDict
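A hedged usage sketch of the model-level forward: on top of the actor outputs shown earlier, the critic contributes a scalar value per sample. Construction of PPOModelConfig is omitted because its fields are not documented here; the critic head, observation key, and pre-filled actor outputs are illustrative stand-ins for what an actual call like td = model(td) would produce.

  import torch
  from torch import nn
  from tensordict import TensorDict

  critic = nn.Linear(32, 1)                  # stand-in for the configured value network

  td = TensorDict(
      {
          "obs": torch.randn(4, 32),
          "action": torch.randn(4, 8),       # pretend the actor half already ran
          "mean_action": torch.randn(4, 8),
          "neglogp": torch.rand(4),
      },
      batch_size=[4],
  )

  td["value"] = critic(td["obs"]).squeeze(-1)  # state value estimate, shape [4]

  # During rollouts, "action" and "neglogp" feed the PPO update, while
  # "mean_action" supports deterministic evaluation.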