class NPGPolicy(*, actor: Module | ActorProb | Actor, critic: Module | Critic | Critic, optim: Optimizer, dist_fn: Callable[[tuple[Tensor, Tensor]], Distribution] | Callable[[Tensor], Categorical], action_space: Space, optim_critic_iters: int = 5, actor_step_size: float = 0.5, advantage_normalization: bool = True, gae_lambda: float = 0.95, max_batchsize: int = 256, discount_factor: float = 0.99, reward_normalization: bool = False, deterministic_eval: bool = False, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip', 'tanh'] | None = 'clip', lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of Natural Policy Gradient.

  • actor – the actor network, which must follow these rules: if self.action_type == "discrete", it maps (s -> action_values_BA); if self.action_type == "continuous", it maps (s -> dist_input_BD).

  • critic – the critic network. (s -> V(s))

  • optim – the optimizer for actor and critic network.

  • dist_fn – distribution class for computing the action.

  • action_space – env’s action space

  • optim_critic_iters – Number of times to optimize critic network per update.

  • actor_step_size – step size for actor update in natural gradient direction.

  • advantage_normalization – whether to do per mini-batch advantage normalization.

  • gae_lambda – in [0, 1], param for Generalized Advantage Estimation.

  • max_batchsize – the maximum size of the batch when computing GAE.

  • discount_factor – the discount factor, in [0, 1].

  • reward_normalization – normalize estimated values to have std close to 1.

  • deterministic_eval – if True, use deterministic evaluation.

  • observation_space – the space of the observation.

  • action_scaling – if True, scale the action from [-1, 1] to the range of action_space. Only used if the action_space is continuous.

  • action_bound_method – method used to bound the action to the range [-1, 1]: 'clip', 'tanh', or None.

  • lr_scheduler – if not None, will be called in policy.update().
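To make the roles of gae_lambda and discount_factor concrete, here is a minimal pure-Python sketch of the standard Generalized Advantage Estimation recursion. It is not this library's implementation; the function name `gae_advantages` and its list-based arguments are assumptions chosen for illustration.

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Illustrative GAE recursion (hypothetical helper, not part of the library).

    rewards[t], values[t] = V(s_t), next_values[t] = V(s_{t+1}), dones[t] = episode ended at t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step can reuse the advantage of the following step.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t).
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of TD errors, cut off at episode boundaries.
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages
```

With gae_lambda = 0 the result reduces to the one-step TD errors; with gae_lambda = 1 it becomes the discounted sum of all later TD errors within the episode.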

learn(batch: Batch, batch_size: int | None, repeat: int, **kwargs: Any) -> TNPGTrainingStats

Update policy with a given batch of data.


Returns: a dataclass object containing the data to be logged (e.g., loss).


Note: to distinguish the collecting, updating, and testing states, check the policy state via self.training and self.updating. Refer to "States for policy" for a more detailed explanation.


Warning: if you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, be careful about the shape: a Categorical distribution gives a "[batch_size]" shape, while a Normal distribution gives a "[batch_size, 1]" shape. Automatic broadcasting in torch tensor operations will amplify this mismatch.
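The shape mismatch can be reproduced with a short sketch. It assumes a one-dimensional continuous action whose Normal parameters are shaped [batch_size, 1]; the variable names and the `advantages` tensor are illustrative and not taken from the library.

```python
import torch
from torch.distributions import Categorical, Normal

batch_size = 4

# Discrete case: logits over 3 actions per sample; log_prob has shape [batch_size].
cat = Categorical(logits=torch.zeros(batch_size, 3))
cat_logp = cat.log_prob(torch.zeros(batch_size, dtype=torch.long))

# Continuous case (assumed 1-D action): parameters shaped [batch_size, 1],
# so log_prob keeps the shape [batch_size, 1].
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_logp = normal.log_prob(torch.zeros(batch_size, 1))

advantages = torch.ones(batch_size)  # illustrative per-sample weights, shape [batch_size]

# [batch_size, 1] * [batch_size] silently broadcasts to [batch_size, batch_size].
mismatched = normal_logp * advantages

# Dropping the trailing dimension first keeps the product at [batch_size].
aligned = normal_logp.squeeze(-1) * advantages
```

No error is raised for the mismatched product, so a loss computed from it would average over batch_size × batch_size entries instead of batch_size.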

process_fn(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) -> BatchWithAdvantagesProtocol

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
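The formula above can be evaluated with a single backward pass over the rewards. The following is a minimal pure-Python sketch, not the library's implementation. The helper name `discounted_returns` is hypothetical, and it assumes a trailing unfinished episode is simply cut off without bootstrapping.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t = sum_{i=t}^{T} gamma^(i-t) * r_i for sequentially stored episodes.

    Hypothetical helper: dones[t] marks the last step of an episode.
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Step t ends an episode, so nothing from later steps may leak in.
            running = 0.0
        # G_t = r_t + gamma * G_{t+1} within the same episode.
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

For example, with rewards [1, 1, 1], a done flag on the last step only, and gamma = 0.5, the returns are [1.75, 1.5, 1.0].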

  • batch – a data batch that contains several episodes of data in sequential order. The end of each finished episode in the batch should be marked by the done flag; unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – gives the batch's location in the buffer; batch is equal to buffer[indices].

class NPGTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: SequenceSummaryStats, vf_loss: SequenceSummaryStats, kl: SequenceSummaryStats)
actor_loss: SequenceSummaryStats
kl: SequenceSummaryStats
vf_loss: SequenceSummaryStats