ppo


class PPOPolicy(*, actor: Module, critic: Module, optim: Optimizer, dist_fn: Callable[[...], Distribution], action_space: Space, eps_clip: float = 0.2, dual_clip: float | None = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: float | None = None, gae_lambda: float = 0.95, max_batchsize: int = 256, discount_factor: float = 0.99, reward_normalization: bool = False, deterministic_eval: bool = False, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip', 'tanh'] | None = 'clip', lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of Proximal Policy Optimization. arXiv:1707.06347.

Parameters:
  • actor – the actor network following the rules in BasePolicy. (s -> logits)

  • critic – the critic network. (s -> V(s))

  • optim – the optimizer for the actor and critic networks.

  • dist_fn – distribution class for computing the action.

  • action_space – the environment's action space.

  • eps_clip – \(\epsilon\) in \(L_{CLIP}\) in the original paper.

  • dual_clip – a parameter c mentioned in arXiv:1912.09729 Eq. 5, where c > 1 is a constant indicating the lower bound. Set to None to disable dual-clip PPO.

  • value_clip – a parameter mentioned in arXiv:1811.02553v3 Sec. 4.1.

  • advantage_normalization – whether to do per mini-batch advantage normalization.

  • recompute_advantage – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5.

  • vf_coef – weight for value loss.

  • ent_coef – weight for entropy loss.

  • max_grad_norm – the maximum norm to which gradients are clipped during backpropagation.

  • gae_lambda – in [0, 1], param for Generalized Advantage Estimation.

  • max_batchsize – the maximum size of the batch when computing GAE.

  • discount_factor – the discount factor \(\gamma\), in [0, 1].

  • reward_normalization – normalize estimated values to have std close to 1.

  • deterministic_eval – if True, use deterministic evaluation.

  • observation_space – the space of the observation.

  • action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Only used if the action_space is continuous.

  • action_bound_method – method to bound action to range [-1, 1].

  • lr_scheduler – if not None, will be called in policy.update().
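The roles of eps_clip, dual_clip and advantage_normalization listed above can be illustrated with a small sketch. This is a plain-Python illustration of the clipped surrogate from arXiv:1707.06347 and the dual-clip variant from arXiv:1912.09729, not the library's code; the function names ppo_clip_term and normalize_advantages are hypothetical, and the use of the population standard deviation is an assumption.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a mini-batch of advantages to zero mean and roughly unit std.

    Assumption: population standard deviation; a library may use another estimator.
    """
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    std = math.sqrt(var)
    return [(a - mean) / (std + eps) for a in advantages]


def ppo_clip_term(ratio, advantage, eps_clip=0.2, dual_clip=None):
    """Per-sample clipped surrogate objective (to be maximized).

    ratio is pi_new(a|s) / pi_old(a|s); advantage is the advantage estimate.
    """
    # Restrict the probability ratio to [1 - eps_clip, 1 + eps_clip].
    clipped_ratio = min(max(ratio, 1.0 - eps_clip), 1.0 + eps_clip)
    # Pessimistic bound: take the smaller of the unclipped and clipped terms.
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    # Dual clip: for negative advantages, bound the term from below by c * A.
    if dual_clip is not None and advantage < 0:
        surrogate = max(surrogate, dual_clip * advantage)
    return surrogate


print(ppo_clip_term(1.5, 1.0))                  # ratio above 1 + eps_clip is clipped
print(ppo_clip_term(5.0, -1.0))                 # unbounded below without dual clip
print(ppo_clip_term(5.0, -1.0, dual_clip=3.0))  # bounded by dual_clip * advantage
```

In a training loop the negative mean of this term over a mini-batch would serve as the policy loss to minimize.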

See also

Please refer to BasePolicy for more detailed explanation.

learn(batch: RolloutBatchProtocol, batch_size: int | None, repeat: int, *args: Any, **kwargs: Any) → TPPOTrainingStats

Update policy with a given batch of data.

Returns:

A dataclass object, including the data needed to be logged (e.g., loss).

Note

In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.

Warning

If you use torch.distributions.Normal and torch.distributions.Categorical to calculate log_prob, be careful about the shape: a Categorical distribution gives shape [batch_size], while a Normal distribution gives shape [batch_size, 1]. Automatic broadcasting in torch tensor operations will amplify this error silently instead of raising an exception.
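The shape mismatch described in this warning can be reproduced with a few lines of torch. The sketch below assumes a batch of 4 samples, 3 discrete actions and a one-dimensional continuous action. Wrapping Normal in torch.distributions.Independent is shown as one possible remedy, not necessarily the approach the library recommends.

```python
import torch
from torch.distributions import Categorical, Independent, Normal

batch_size = 4

# Discrete case: Categorical.log_prob returns shape [batch_size].
cat = Categorical(logits=torch.zeros(batch_size, 3))
cat_logp = cat.log_prob(torch.zeros(batch_size, dtype=torch.long))

# Continuous case with a 1-D action: Normal.log_prob keeps the trailing axis,
# so the result has shape [batch_size, 1].
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_logp = normal.log_prob(torch.zeros(batch_size, 1))

# Multiplying by a [batch_size] advantage vector broadcasts silently
# to [batch_size, batch_size] instead of raising an error.
advantages = torch.ones(batch_size)
broadcasted = normal_logp * advantages

# One possible remedy: treat the last axis as the event dimension, so that
# log_prob sums over it and returns shape [batch_size].
fixed_logp = Independent(normal, 1).log_prob(torch.zeros(batch_size, 1))

print(cat_logp.shape, normal_logp.shape, broadcasted.shape, fixed_logp.shape)
```

Checking the shape of log_prob once, before it enters the loss, is a cheap way to catch this class of bug.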

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) → LogpOldProtocol

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
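The formula above can be evaluated with a single backward pass over the rewards. The following is a minimal plain-Python sketch of that recursion, assuming the data are stored in time order and that a done flag marks the last step of each episode; discounted_returns is a hypothetical helper, not the library's implementation.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1}.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Step t ends an episode: nothing from later steps is carried over.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes of two steps each, with gamma = 0.5.
print(discounted_returns([1.0, 1.0, 1.0, 1.0], [False, True, False, True], gamma=0.5))
```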

Parameters:
  • batch – a data batch which contains several episodes of data in sequential order. Note that the end of each finished episode in batch should be marked by the done flag; unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – the location of batch in buffer, i.e. batch is equal to buffer[indices].
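The gae_lambda parameter of PPOPolicy refers to Generalized Advantage Estimation, which generalizes the Monte Carlo return described for process_fn. The sketch below is a plain-Python illustration of the standard GAE recursion, under the assumption that value estimates for the current and next states are already available as lists; gae_advantages is a hypothetical helper and does not reflect how the library batches the computation (for example via max_batchsize).

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Generalized Advantage Estimation computed by a backward recursion.

    values[t] is V(s_t) and next_values[t] is V(s_{t+1}); dones[t] marks the
    final step of an episode.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step temporal-difference error; no bootstrap past an episode end.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages


zeros = [0.0, 0.0, 0.0]
# With gae_lambda = 1 and all values zero, the result equals the discounted return.
print(gae_advantages([1.0, 1.0, 1.0], zeros, zeros, [False, False, True], gamma=0.5, gae_lambda=1.0))
```

Setting gae_lambda to 0 reduces the estimate to the one-step TD error, while values closer to 1 trade lower bias for higher variance.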

class PPOTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: tianshou.data.stats.SequenceSummaryStats, clip_loss: tianshou.data.stats.SequenceSummaryStats, vf_loss: tianshou.data.stats.SequenceSummaryStats, ent_loss: tianshou.data.stats.SequenceSummaryStats)
clip_loss: SequenceSummaryStats
ent_loss: SequenceSummaryStats
loss: SequenceSummaryStats
vf_loss: SequenceSummaryStats
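The four loss fields relate to the vf_coef and ent_coef parameters of PPOPolicy. A minimal sketch of how such terms are commonly combined in PPO follows. It assumes that ent_loss holds the mean policy entropy, so that it is subtracted as a bonus; this sign convention is an assumption, and total_loss is a hypothetical helper rather than a library function.

```python
def total_loss(clip_loss, vf_loss, ent_loss, vf_coef=0.5, ent_coef=0.01):
    """Combine the policy, value and entropy terms into one scalar to minimize.

    Assumption: ent_loss is the mean entropy, so a larger value lowers the loss.
    """
    return clip_loss + vf_coef * vf_loss - ent_coef * ent_loss


print(total_loss(clip_loss=1.0, vf_loss=2.0, ent_loss=0.5))
```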