reinforce


class LossSequenceTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: tianshou.data.stats.SequenceSummaryStats)

Bases: TrainingStats

loss: SequenceSummaryStats

class SimpleLossTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: float)

Bases: TrainingStats

loss: float

class ProbabilisticActorPolicy(*, actor: AbstractContinuousActorProbabilistic | AbstractDiscreteActor | ActionReprNet, dist_fn: Callable[[tuple[Tensor, Tensor]], Distribution] | Callable[[Tensor], Distribution], deterministic_eval: bool = False, action_space: Space, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip', 'tanh'] | None = 'clip')

Bases: Policy

A policy that outputs (representations of) probability distributions from which actions can be sampled.

Parameters:
  • actor – the actor network following the rules: If self.action_type == “discrete”: (s_B -> action_values_BA). If self.action_type == “continuous”: (s_B -> dist_input_BD).

  • dist_fn – the function/type which creates a distribution from the actor output, i.e. it maps the tensor(s) generated by the actor to a torch distribution. For continuous action spaces, the output is typically a pair of tensors (mean, std) and the distribution is a Gaussian distribution. For discrete action spaces, the output is typically a tensor of unnormalized log probabilities (“logits” in PyTorch terminology) or a tensor of probabilities which can serve as the parameters of a Categorical distribution. Note that if the actor uses softmax activation in its final layer, it will produce probabilities, whereas if it uses no activation, it can be considered as producing “logits”. As a user, you are responsible for ensuring that the distribution is compatible with the output of the actor model and the action space.

  • deterministic_eval – flag indicating whether the policy should use deterministic actions (using the mode of the action distribution) instead of stochastic ones (using random sampling) during evaluation. When enabled, the policy will always select the most probable action according to the learned distribution during evaluation phases, while still using stochastic sampling during training. This creates a clear distinction between exploration (training) and exploitation (evaluation) behaviors. Deterministic actions are generally preferred for final deployment and reproducible evaluation as they provide consistent behavior, reduce variance in performance metrics, and are more interpretable for human observers. Note that this parameter only affects behavior when the policy is not within a training step. When collecting rollouts for training, actions remain stochastic regardless of this setting to maintain proper exploration behavior.

  • action_space – the environment’s action space.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
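The interplay of action_bound_method and action_scaling described above can be sketched in plain Python. This is a minimal illustration, not the library's code: the helpers `bound_action` and `scale_action` are hypothetical names, and scalar floats stand in for action tensors.

```python
import math
from typing import Optional


def bound_action(raw: float, method: Optional[str] = "clip") -> float:
    """Bound a raw actor output to [-1, 1] (hypothetical helper)."""
    if method == "clip":
        # Hard boundary: values outside [-1, 1] are cut off.
        return max(-1.0, min(1.0, raw))
    if method == "tanh":
        # Smooth squashing into (-1, 1).
        return math.tanh(raw)
    if method is None:
        # No bounding: the actor is assumed to produce bounded outputs itself.
        return raw
    raise ValueError(f"unknown bounding method: {method!r}")


def scale_action(bounded: float, low: float, high: float) -> float:
    """Linearly map a value in [-1, 1] to [low, high] (hypothetical helper)."""
    return low + (high - low) * (bounded + 1.0) / 2.0


low, high = 0.0, 10.0  # stand-ins for action_space.low / action_space.high
clipped = scale_action(bound_action(3.0, "clip"), low, high)
squashed = scale_action(bound_action(3.0, "tanh"), low, high)
centered = scale_action(bound_action(0.0, "clip"), low, high)
```

With clipping, the out-of-range value 3.0 lands exactly on the upper bound of the action space; with tanh it lands slightly below it, and a raw output of 0 maps to the midpoint of the range in both cases.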

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None) -> DistBatchProtocol

Compute action over the given batch data by applying the actor.

Will sample from the distribution created by dist_fn, if appropriate. Returns a new object representing the processed batch data (contrary to other methods that modify the input batch in place).
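For the continuous case, the difference between sampling and taking the mode can be illustrated without any library code. The sketch below assumes the actor has produced a (mean, std) pair for a one-dimensional Gaussian; `gaussian_action` is a hypothetical helper and the `deterministic` argument stands in for "deterministic_eval is enabled and the policy is not in a training step".

```python
import random


def gaussian_action(mean: float, std: float, deterministic: bool, rng: random.Random) -> float:
    """Pick an action from a Gaussian parameterized by (mean, std)."""
    if deterministic:
        # The mode of a Gaussian is its mean.
        return mean
    # Stochastic action: draw a sample from the distribution.
    return rng.gauss(mean, std)


rng = random.Random(0)
eval_action = gaussian_action(0.5, 0.2, deterministic=True, rng=rng)
train_actions = [gaussian_action(0.5, 0.2, deterministic=False, rng=rng) for _ in range(1000)]
train_mean = sum(train_actions) / len(train_actions)
```

The deterministic action is always the mean, while the sampled actions scatter around it and average out close to it.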

class DiscreteActorPolicy(*, actor: AbstractDiscreteActor | ActionReprNet, dist_fn: Callable[[Tensor], Distribution] = <function dist_fn_categorical_from_logits>, deterministic_eval: bool = False, action_space: Space, observation_space: Space | None = None)

Bases: ProbabilisticActorPolicy

Parameters:
  • actor – the actor network following the rules: (s_B -> dist_input_BD).

  • dist_fn – the function/type which creates a distribution from the actor output, i.e. it maps the tensor(s) generated by the actor to a torch distribution. For discrete action spaces, the output is typically a tensor of unnormalized log probabilities (“logits” in PyTorch terminology) or a tensor of probabilities which serve as the parameters of a Categorical distribution. Note that if the actor uses softmax activation in its final layer, it will produce probabilities, whereas if it uses no activation, it can be considered as producing “logits”. As a user, you are responsible for ensuring that the distribution is compatible with the output of the actor model and the action space.

  • deterministic_eval – flag indicating whether the policy should use deterministic actions (using the mode of the action distribution) instead of stochastic ones (using random sampling) during evaluation. When enabled, the policy will always select the most probable action according to the learned distribution during evaluation phases, while still using stochastic sampling during training. This creates a clear distinction between exploration (training) and exploitation (evaluation) behaviors. Deterministic actions are generally preferred for final deployment and reproducible evaluation as they provide consistent behavior, reduce variance in performance metrics, and are more interpretable for human observers. Note that this parameter only affects behavior when the policy is not within a training step. When collecting rollouts for training, actions remain stochastic regardless of this setting to maintain proper exploration behavior.

  • action_space – the environment’s (discrete) action space.

  • observation_space – the environment’s observation space.
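As a rough illustration of what a categorical distribution built from logits implies for action selection, the following stdlib-only sketch converts logits to probabilities and then either takes the mode or samples. The helpers `softmax` and `select_action` are hypothetical and operate on plain lists rather than tensors; they are not the library's dist_fn.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized log probabilities (logits) into probabilities."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_action(logits, deterministic, rng):
    """Return the most probable action, or a sample from the categorical."""
    probs = softmax(logits)
    if deterministic:
        # Mode of the categorical distribution: index of the largest probability.
        return max(range(len(probs)), key=probs.__getitem__)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [0.1, 2.0, -1.0]
probs = softmax(logits)
greedy = select_action(logits, True, random.Random(0))
rng = random.Random(0)
sampled = [select_action(logits, False, rng) for _ in range(500)]
```

If the actor already applies softmax in its final layer, the `softmax` step would have to be skipped, which is the compatibility issue the dist_fn description warns about.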

class DiscountedReturnComputation(gamma: float = 0.99, return_standardization: bool = False)

Bases: object

Parameters:
  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • return_standardization – whether to standardize episode returns by subtracting the running mean and dividing by the running standard deviation. Note that this is known to be detrimental to performance in many cases!
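Standardizing with running statistics can be sketched as follows. This is an assumption-laden illustration: `RunningStats` is a hypothetical class using Welford's online update, and the small `eps` added to the denominator is an assumed safeguard against division by zero; the library's own bookkeeping may differ.

```python
import math


class RunningStats:
    """Running mean and standard deviation via Welford's update (hypothetical helper)."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, values) -> None:
        for x in values:
            self.count += 1
            delta = x - self.mean
            self.mean += delta / self.count
            self._m2 += delta * (x - self.mean)

    @property
    def std(self) -> float:
        # Population standard deviation of everything seen so far.
        return math.sqrt(self._m2 / self.count) if self.count else 0.0

    def standardize(self, values, eps: float = 1e-8):
        # Subtract the running mean and divide by the running std.
        return [(x - self.mean) / (self.std + eps) for x in values]


stats = RunningStats()
returns = [1.0, 2.0, 3.0, 4.0]
stats.update(returns)
standardized = stats.standardize(returns)
```

Because the statistics accumulate across batches, the same raw return can map to different standardized values over the course of training.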

add_discounted_returns(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) -> BatchWithReturnsProtocol

Compute the discounted returns (Monte Carlo estimates) for each transition.

They are added to the batch under the field returns. Note: this function will modify the input batch!

\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]

where \(T\) is the terminal time step and \(\gamma \in [0, 1]\) is the discount factor.

Parameters:
  • batch – a data batch which contains several episodes of data in sequential order. The end of each finished episode in the batch must be marked by a done flag; unfinished (still collecting) episodes are recognized via buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices – the location of the batch in the buffer, such that batch is equal to buffer[indices].
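The return formula \(G_t = \sum_{i=t}^T \gamma^{i-t}r_i\) can be evaluated with a single backward pass over the rewards. The sketch below is a simplified stand-in, not the method itself: `discounted_returns` is a hypothetical function working on plain lists, and it assumes every episode in the input is finished, so it ignores the buffer and the handling of unfinished episodes.

```python
def discounted_returns(rewards, dones, gamma=0.99):
    """Compute G_t for sequentially stored episodes (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that G_t = r_t + gamma * G_{t+1} within an episode.
    for t in reversed(range(len(rewards))):
        if dones[t]:
            # Step t ends an episode: later rewards belong to another episode.
            running = 0.0
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes stored back to back: steps 0-2 and steps 3-4.
rewards = [1.0, 1.0, 1.0, 2.0, 2.0]
dones = [False, False, True, False, True]
returns = discounted_returns(rewards, dones, gamma=0.5)
# returns == [1.75, 1.5, 1.0, 3.0, 2.0]
```

Resetting the accumulator at each done flag is what keeps rewards of the second episode from leaking into the returns of the first.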

class Reinforce(*, policy: ProbabilisticActorPolicy, gamma: float = 0.99, return_standardization: bool = False, optim: OptimizerFactory)

Bases: OnPolicyAlgorithm[ProbabilisticActorPolicy]

Implementation of the REINFORCE (a.k.a. vanilla policy gradient) algorithm.

Parameters:
  • policy – the policy to be optimized, which must be a ProbabilisticActorPolicy.

  • optim – the optimizer factory for the policy’s model.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • return_standardization – if True, will scale/standardize returns by subtracting the running mean and dividing by the running standard deviation. Can be detrimental to performance!
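The page does not spell out the loss that is minimized. As a point of orientation, the textbook REINFORCE objective is the negative mean of \(\log \pi(a_t \mid s_t) \, G_t\) over the batch; the sketch below computes it for a discrete policy given as logits. The helpers `log_softmax` and `reinforce_loss` are hypothetical and use plain lists, so this shows the quantity only, without gradients, and may differ in detail from the library's implementation.

```python
import math


def log_softmax(logits):
    """Log probabilities of a categorical distribution given its logits."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_total = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_total for x in logits]


def reinforce_loss(batch_logits, actions, returns):
    """Negative mean of log pi(a_t | s_t) * G_t over the batch."""
    terms = [
        log_softmax(logits)[a] * g
        for logits, a, g in zip(batch_logits, actions, returns)
    ]
    return -sum(terms) / len(terms)


actions = [0, 1]
returns = [1.0, 3.0]
uniform_loss = reinforce_loss([[0.0, 0.0], [0.0, 0.0]], actions, returns)
# Raising the logit of each taken action (both had positive returns) lowers the loss.
improved_loss = reinforce_loss([[1.0, 0.0], [0.0, 1.0]], actions, returns)
```

Minimizing this quantity pushes probability mass toward actions that were followed by high returns, which is why the scale of the returns (and hence return_standardization) directly affects the size of the updates.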