algorithm_base#


class TrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>)[source]#

Bases: DataclassPPrintMixin

train_time: float = 0.0#

The time for learning models.

smoothed_loss: dict#

The smoothed loss statistics of the policy learn step.

get_loss_stats_dict() dict[str, float][source]#

Return loss statistics as a dict for logging.

Returns a dict with all fields except train_time and smoothed_loss. Moreover, fields with value None are excluded, and instances of SequenceSummaryStats are replaced by their mean.
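
The filtering rules above can be sketched as follows. This is a minimal illustration, not the library's implementation: ExampleStats, SummaryStats and loss_stats_dict are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass, fields
from statistics import mean
from typing import Optional


@dataclass
class SummaryStats:
    """Hypothetical stand-in for SequenceSummaryStats; only `mean` matters here."""

    mean: float

    @classmethod
    def from_sequence(cls, values):
        return cls(mean=mean(values))


@dataclass
class ExampleStats:
    """Hypothetical stats object (assumption); not the library's TrainingStats."""

    train_time: float = 0.0
    smoothed_loss: Optional[dict] = None
    actor_loss: Optional[float] = None
    critic_loss: Optional[SummaryStats] = None
    alpha: Optional[float] = None


def loss_stats_dict(stats):
    """Collect loggable loss values following the documented rules."""
    result = {}
    for f in fields(stats):
        if f.name in ("train_time", "smoothed_loss"):
            continue  # these two fields are never reported as losses
        value = getattr(stats, f.name)
        if value is None:
            continue  # fields with value None are excluded
        if isinstance(value, SummaryStats):
            value = value.mean  # summaries are replaced by their mean
        result[f.name] = value
    return result


stats = ExampleStats(
    train_time=1.5,
    actor_loss=0.25,
    critic_loss=SummaryStats.from_sequence([1.0, 2.0, 3.0]),
)
loss_stats = loss_stats_dict(stats)
print(loss_stats)
```

Here alpha is dropped because it is None, and critic_loss is reported as a single float.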

class TrainingStatsWrapper(wrapped_stats: TrainingStats)[source]#

Bases: TrainingStats

In this particular case, super().__init__() should be called LAST in the subclass init.

property wrapped_stats: TrainingStats#

class Policy(action_space: Space, observation_space: Space | None = None, action_scaling: bool = False, action_bound_method: Literal['clip', 'tanh'] | None = 'clip')[source]#

Bases: Module, ABC

Represents a policy, which provides the fundamental mapping from observations to actions.

Parameters:
  • action_space – the environment’s action_space.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.

is_within_training_step#

flag indicating whether we are currently within a training step, which encompasses data collection for training (in online RL algorithms) and the policy update (gradient steps).

It can be used, for example, to decide whether a flag requesting deterministic evaluation should actually be applied: within a training step, we typically want stochastic evaluation even if such a flag is enabled, as well as stochastic action computation for Q-targets (e.g. in SAC-based algorithms).

This flag should normally remain False and should be set to True only by the algorithm which performs training steps. This is done automatically by the Trainer classes. If a policy is used outside of a Trainer, the user should ensure that this flag is set correctly.
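
When a policy is used outside of a Trainer, one way to keep the flag consistent is a small context manager that sets it and always restores the previous value. The sketch below is an assumption for illustration: DummyPolicy, within_training_step and use_deterministic_eval are hypothetical names, not part of the documented API.

```python
from contextlib import contextmanager


class DummyPolicy:
    """Hypothetical stand-in exposing only the flag discussed above."""

    def __init__(self):
        self.is_within_training_step = False


@contextmanager
def within_training_step(policy):
    """Set the flag for the duration of the block and always restore it."""
    previous = policy.is_within_training_step
    policy.is_within_training_step = True
    try:
        yield policy
    finally:
        policy.is_within_training_step = previous


def use_deterministic_eval(policy, deterministic_eval):
    """Apply deterministic evaluation only outside of training steps."""
    return deterministic_eval and not policy.is_within_training_step


policy = DummyPolicy()
outside = use_deterministic_eval(policy, deterministic_eval=True)
with within_training_step(policy):
    inside = use_deterministic_eval(policy, deterministic_eval=True)
```

The try/finally block ensures that the flag returns to its previous value even if the training step raises an exception.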

property action_type: Literal['discrete', 'continuous']#

map_action(act: Tensor | ndarray) ndarray[source]#

Map raw network output to action range in gym’s env.action_space.

This function is called in collect() and only affects the action sent to the environment. The remapped action is not stored in the buffer and can thus be viewed as part of the environment (a black-box action transformation).

Action mapping comprises two standard procedures: bounding and scaling. Bounding expects the original action range to be (-inf, inf) and maps it to [-1, 1], while scaling expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. Bounding is applied first.

Parameters:

act – a data batch or numpy.ndarray which is the action taken by policy.forward.

Returns:

the action in the same form as the input “act”, but remapped to the target action space.
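
The two procedures can be sketched in plain Python as follows. This is an illustrative re-implementation under stated assumptions (per-component lists instead of arrays, and a standalone function with hypothetical low/high arguments), not the library's code.

```python
import math


def map_action(act, low, high, action_scaling=True, action_bound_method="clip"):
    """Bound each component to [-1, 1], then scale it to [low, high]."""
    out = []
    for a, lo, hi in zip(act, low, high):
        # Step 1: bounding (applied first)
        if action_bound_method == "clip":
            a = max(-1.0, min(1.0, a))
        elif action_bound_method == "tanh":
            a = math.tanh(a)
        # Step 2: linear scaling from [-1, 1] to [lo, hi]
        if action_scaling:
            a = lo + (hi - lo) * (a + 1.0) / 2.0
        out.append(a)
    return out


low, high = [0.0, -2.0], [10.0, 2.0]
clipped = map_action([3.0, -0.5], low, high)  # first component saturates at 1
squashed = map_action([0.0, 0.0], low, high, action_bound_method="tanh")
print(clipped, squashed)
```

With "clip", the out-of-range value 3.0 is mapped to the upper bound of the action space; with "tanh", a raw output of 0 lands at the midpoint of each range.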

map_action_inverse(act: Tensor | ndarray) ndarray[source]#

Inverse operation to map_action().

This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.

Parameters:

act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().

Returns:

action remapped.
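
A sketch of the inverse direction, under the same assumptions as a plain-Python illustration (hypothetical standalone function, per-component lists): scaling is undone first, and for "tanh" bounding the inverse hyperbolic tangent is applied. Clipping has no inverse, so in that case only the scaling is reverted.

```python
import math


def map_action_inverse(act, low, high, action_scaling=True, action_bound_method="clip"):
    """Map actions from [low, high] back to the policy's raw output range."""
    out = []
    for a, lo, hi in zip(act, low, high):
        if action_scaling:
            a = (a - lo) * 2.0 / (hi - lo) - 1.0  # [lo, hi] -> [-1, 1]
        if action_bound_method == "tanh":
            a = math.atanh(a)  # undefined at exactly -1 and 1
        out.append(a)
    return out


low, high = [0.0, -2.0], [10.0, 2.0]
unscaled = map_action_inverse([5.0, 1.0], low, high)
raw = map_action_inverse([5.0, 1.0], low, high, action_bound_method="tanh")
print(unscaled, raw)
```

Applying tanh to the second result recovers the normalized values, which is the sense in which this operation inverts the forward mapping.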

compute_action(obs: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], info: dict[str, Any] | None = None, state: dict | BatchProtocol | ndarray | None = None) ndarray | int[source]#

Get action as int (for discrete env’s) or array (for continuous ones) from an env’s observation and info.

Parameters:
  • obs – observation from the gym’s env.

  • info – information given by the gym’s env.

  • state – the hidden state of the policy, used for recurrent (RNN) policies.

Returns:

action as int (for discrete env’s) or array (for continuous ones).

add_exploration_noise(act: _TArrOrActBatch, batch: ObsBatchProtocol) _TArrOrActBatch[source]#

(Optionally) adds noise to the actions computed by the policy’s forward method for exploration purposes.

NOTE: The base implementation does not add any noise, but subclasses can override this method to add appropriate mechanisms for adding noise.

Parameters:
  • act – a data batch or numpy.ndarray containing actions computed by the policy’s forward method.

  • batch – the corresponding input batch that was passed to forward; provided for advanced usage.

Returns:

actions in the same format as the input act but with added exploration noise (if implemented - otherwise returns act unchanged).
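
The override pattern might look like the following sketch. BasePolicy and GaussianNoisePolicy are hypothetical classes written for this example (using plain lists and the standard library's random module); they only mirror the method name and are not the library's classes.

```python
import random


class BasePolicy:
    """Hypothetical base: returns the actions unchanged, as described above."""

    def add_exploration_noise(self, act, batch=None):
        return act


class GaussianNoisePolicy(BasePolicy):
    """Hypothetical subclass adding zero-mean Gaussian noise to each action."""

    def __init__(self, sigma, seed=0):
        self.sigma = sigma
        self.rng = random.Random(seed)  # seeded for reproducibility

    def add_exploration_noise(self, act, batch=None):
        return [a + self.rng.gauss(0.0, self.sigma) for a in act]


actions = [0.1, -0.3]
unchanged = BasePolicy().add_exploration_noise(actions)
noisy = GaussianNoisePolicy(sigma=0.2).add_exploration_noise(actions)
```

The output keeps the format of the input, so callers do not need to know whether noise was added.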

class LaggedNetworkAlgorithmMixin[source]#

Bases: ABC

Base class for an algorithm mixin which adds support for lagged networks (target networks) whose weights are updated periodically.

class LaggedNetworkFullUpdateAlgorithmMixin[source]#

Bases: LaggedNetworkAlgorithmMixin

Algorithm mixin which adds support for lagged networks (target networks) where weights are updated by fully copying the weights of the source network to the target network.

class LaggedNetworkPolyakUpdateAlgorithmMixin(tau: float)[source]#

Bases: LaggedNetworkAlgorithmMixin

Algorithm mixin which adds support for lagged networks (target networks) where weights are updated via Polyak averaging (soft update using a convex combination of the parameters of the source and target networks with weight tau and 1-tau respectively).

Parameters:

tau – the fraction with which to use the source network’s parameters, the inverse 1-tau being the fraction with which to retain the target network’s parameters.
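
The convex combination can be illustrated with plain dictionaries standing in for network parameters. This is a sketch only; polyak_update is a hypothetical helper name, and real networks hold tensors rather than floats.

```python
def polyak_update(source, target, tau):
    """Blend in place: target <- tau * source + (1 - tau) * target."""
    for name, src_value in source.items():
        target[name] = tau * src_value + (1.0 - tau) * target[name]


source = {"w": 1.0, "b": 0.0}
target = {"w": 0.0, "b": 2.0}
polyak_update(source, target, tau=0.25)
print(target)
```

With tau=1.0, the same function reduces to a full copy of the source parameters; small values of tau make the target network change slowly.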

class Algorithm(*, policy: TPolicy)[source]#

Bases: Module, Generic[TPolicy, TTrainerParams], ABC

The base class for reinforcement learning algorithms in Tianshou.

An algorithm critically defines how to update the parameters of neural networks based on a batch of data, optionally applying pre-processing and post-processing to the data. The actual update step is highly algorithm-specific and thus is defined in subclasses.

Parameters:

policy – the policy

class Optimizer(optim: Optimizer, module: Module, max_grad_norm: float | None = None)[source]#

Bases: object

Wrapper for a torch optimizer that optionally performs gradient clipping.

Parameters:
  • optim – the optimizer

  • module – the module whose parameters are being affected by optim

  • max_grad_norm – the maximum L2 norm threshold for gradient clipping. When not None, gradients will be rescaled to ensure that their L2 norm does not exceed this value. This prevents exploding gradients and stabilizes training by limiting the magnitude of parameter updates. Set to None to disable gradient clipping.

step(loss: Tensor, retain_graph: bool | None = None, create_graph: bool = False) None[source]#

Performs an optimizer step, optionally applying gradient clipping (if configured at construction).

Parameters:
  • loss – the loss to backpropagate

  • retain_graph – passed on to backward

  • create_graph – passed on to backward
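
The effect of max_grad_norm can be sketched with a flat list of gradient values. This illustrates norm-based rescaling in general, assuming a single joint L2 norm over all gradients; clip_grad_norm is a hypothetical helper and not the wrapper's actual code.

```python
import math


def clip_grad_norm(grads, max_grad_norm):
    """Rescale `grads` so that their joint L2 norm is at most `max_grad_norm`."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if max_grad_norm is None or total_norm <= max_grad_norm:
        return list(grads), total_norm  # clipping disabled or not needed
    scale = max_grad_norm / total_norm
    return [g * scale for g in grads], total_norm


clipped, norm_before = clip_grad_norm([3.0, 4.0], max_grad_norm=1.0)
norm_after = math.sqrt(sum(g * g for g in clipped))
print(norm_before, norm_after)
```

Rescaling preserves the direction of the gradient and only shrinks its length, which distinguishes it from clipping each component separately.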

state_dict() dict[source]#

Returns the state_dict of the wrapped optimizer.

load_state_dict(state_dict: dict) None[source]#

Loads the given state_dict into the wrapped optimizer.

state_dict(*args, destination=None, prefix='', keep_vars=False)[source]#

Returns a dictionary containing references to the whole state of the module.

Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.

Note

The returned object is a shallow copy. It contains references to the module’s parameters and buffers.

Warning

Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.

Warning

Please avoid the use of argument destination as it is not designed for end-users.

Parameters:
  • destination (dict, optional) – if provided, the state of the module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.

  • prefix (str, optional) – a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.

  • keep_vars (bool, optional) – by default, the Tensors returned in the state dict are detached from autograd. If set to True, detaching will not be performed. Default: False.

Returns:

a dict containing the whole state of the module.

Example:

>>> module.state_dict().keys()
['bias', 'weight']

load_state_dict(state_dict: Mapping[str, Any], strict: bool = True, assign: bool = False) _IncompatibleKeys[source]#

Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.

Warning

If assign is True the optimizer must be created after the call to load_state_dict.

Parameters:
  • state_dict (dict) – a dict containing parameters and persistent buffers.

  • strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True.

  • assign (bool, optional) – whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them in place into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved; when True, the properties of the tensors in the state dict are preserved. Default: False.

Returns:
NamedTuple with missing_keys and unexpected_keys fields:
  • missing_keys is a list of str containing the missing keys

  • unexpected_keys is a list of str containing the unexpected keys

Note:

If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.

static value_mask(buffer: ReplayBuffer, indices: ndarray) ndarray[source]#

Value mask determines whether the obs_next of buffer[indices] is valid.

For instance, the “obs_next” following a “done” flag is usually considered invalid: its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. However, if the “done” flag was generated because the episode reached its time limit (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” is valid instead. The value mask is typically used to assist in calculating the correct q/advantage value.

Parameters:
  • buffer – the corresponding replay buffer.

  • indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.

Returns:

A numpy.ndarray of type bool with the same shape as indices. “True” means that the “obs_next” of the corresponding entry of buffer[indices] is valid.
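
The rule described above can be sketched on plain lists of flags. This is an illustration under the assumption that a done flag and a time-limit truncation flag are available per transition; the standalone function below is hypothetical and takes flags instead of a buffer.

```python
def value_mask(done, truncated):
    """True where obs_next is valid: not done, or done only due to a time limit."""
    return [(not d) or t for d, t in zip(done, truncated)]


done = [False, True, True]
truncated = [False, False, True]
mask = value_mask(done, truncated)

# Typical use: zero out the bootstrap term where obs_next is invalid.
gamma = 0.5
rewards = [1.0, 1.0, 1.0]
next_values = [0.5, 0.5, 0.5]
targets = [r + gamma * v * m for r, v, m in zip(rewards, next_values, mask)]
print(mask, targets)
```

Only the second transition, which ended without a time-limit truncation, loses its bootstrap term.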

static compute_episodic_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, v_s_: ndarray | Tensor | None = None, v_s: ndarray | Tensor | None = None, gamma: float = 0.99, gae_lambda: float = 0.95) tuple[ndarray, ndarray][source]#

Compute returns over given batch.

Uses an implementation of the Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch. Returns are calculated as advantage + value, which is exactly equivalent to using \(TD(\lambda)\) for estimating returns.

Setting v_s_ and v_s to None (or all zeros) and gae_lambda to 1.0 calculates the discounted return-to-go (Monte Carlo return).

Parameters:
  • batch – a data batch which contains several episodes of data in sequential order. Note that the end of each finished episode in the batch should be marked by the done flag; unfinished (or still collecting) episodes will be recognized by buffer.unfinished_index().

  • buffer – the corresponding replay buffer.

  • indices – tells the batch’s location in buffer, batch is equal to buffer[indices].

  • v_s_ – the value function of all next states \(V(s')\). If None, it will be set to an array of 0.

  • v_s – the value function of all current states \(V(s)\). If None, it is set based upon v_s_ rolled by 1.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • gae_lambda – the lambda parameter in [0, 1] for generalized advantage estimation (GAE). Controls the bias-variance tradeoff in advantage estimates, acting as a weighting factor for combining different n-step advantage estimators. Higher values (closer to 1) reduce bias but increase variance by giving more weight to longer trajectories, while lower values (closer to 0) reduce variance but increase bias by relying more on the immediate TD error and value function estimates. At λ=0, GAE becomes equivalent to the one-step TD error (high bias, low variance); at λ=1, it becomes equivalent to Monte Carlo advantage estimation (low bias, high variance). Intermediate values create a weighted average of n-step returns, with exponentially decaying weights for longer-horizon returns. Typically set between 0.9 and 0.99 for most policy gradient methods.

Returns:

two numpy arrays (returns, advantage) with each shape (bsz, ).
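
The backward recursion behind GAE can be sketched in plain Python. This is a simplified illustration, assuming plain lists, explicit episode-end flags, and values of next states that are already zeroed where bootstrapping is not allowed; gae_returns is a hypothetical helper, not the library function.

```python
def gae_returns(rewards, v_s, v_s_, end_flags, gamma=0.99, gae_lambda=0.95):
    """Return (returns, advantages) for sequentially ordered transitions.

    end_flags[t] is True when step t is the last step of its episode.
    """
    n = len(rewards)
    advantages = [0.0] * n
    gae = 0.0
    for t in reversed(range(n)):
        delta = rewards[t] + gamma * v_s_[t] - v_s[t]  # one-step TD error
        not_end = 0.0 if end_flags[t] else 1.0  # do not leak across episodes
        gae = delta + gamma * gae_lambda * not_end * gae
        advantages[t] = gae
    returns = [a + v for a, v in zip(advantages, v_s)]
    return returns, advantages


# lambda = 1 with all-zero values: discounted return-to-go
mc_returns, _ = gae_returns(
    [1.0, 1.0, 1.0], [0.0] * 3, [0.0] * 3, [False, False, True],
    gamma=0.5, gae_lambda=1.0,
)
# lambda = 0: advantages reduce to the one-step TD errors
td_returns, td_adv = gae_returns(
    [1.0, 1.0], [0.5, 0.5], [0.5, 0.0], [False, True],
    gamma=0.5, gae_lambda=0.0,
)
print(mc_returns, td_adv)
```

The two calls show the limiting cases described for gae_lambda: Monte Carlo returns at 1 and one-step TD errors at 0.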

static compute_nstep_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, target_q_fn: Callable[[ReplayBuffer, ndarray], Tensor], gamma: float = 0.99, n_step: int = 1) BatchWithReturnsProtocol[source]#

Computes the n-step return for Q-learning targets, adds it to the batch and returns the resulting batch.

\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]

where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), \(d_t\) is the done flag of step \(t\).

Parameters:
  • batch – a data batch, which is equal to buffer[indices].

  • buffer – the data buffer.

  • indices – tells the batch’s location in the buffer.

  • target_q_fn – a function which computes the target Q value of “obs_next” given data buffer and wanted indices (n_step steps ahead).

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • n_step – the number of estimation steps; should be an int greater than 0.

Returns:

a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
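
The idea of an n-step target can be sketched for a single trajectory in plain Python. The sketch makes simplifying assumptions that may differ in detail from the formula above and from the library: the reward of a terminating step is still counted, bootstrapping is skipped after termination, and the horizon is shortened at the end of the available data. nstep_return and q_next are hypothetical names.

```python
def nstep_return(rewards, terminated, q_next, t, gamma=0.99, n_step=1):
    """n-step target for index t of one trajectory.

    q_next[i] is the target value of the state reached after step i.
    """
    g = 0.0
    discount = 1.0
    last = t
    for i in range(t, min(t + n_step, len(rewards))):
        g += discount * rewards[i]
        discount *= gamma
        last = i
        if terminated[i]:
            return g  # no bootstrapping past a terminal state
    return g + discount * q_next[last]


rewards = [1.0, 1.0, 1.0, 1.0]
terminated = [False, False, False, True]
q_next = [10.0, 10.0, 10.0, 10.0]

g0 = nstep_return(rewards, terminated, q_next, t=0, gamma=0.5, n_step=2)
g2 = nstep_return(rewards, terminated, q_next, t=2, gamma=0.5, n_step=2)
print(g0, g2)
```

For t=0, two rewards are accumulated and the target value is added with weight gamma squared; for t=2, the episode terminates within the horizon, so only rewards contribute.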

abstract create_trainer(params: TTrainerParams) Trainer[source]#

run_training(params: TTrainerParams) InfoStats[source]#

class OnPolicyAlgorithm(*, policy: TPolicy)[source]#

Bases: Algorithm[TPolicy, OnPolicyTrainerParams], Generic[TPolicy], ABC

Base class for on-policy RL algorithms.

Parameters:

policy – the policy

create_trainer(params: OnPolicyTrainerParams) OnPolicyTrainer[source]#

update(buffer: ReplayBuffer, batch_size: int | None, repeat: int) TrainingStats[source]#

class OffPolicyAlgorithm(*, policy: TPolicy)[source]#

Bases: Algorithm[TPolicy, OffPolicyTrainerParams], Generic[TPolicy], ABC

Base class for off-policy RL algorithms.

Parameters:

policy – the policy

create_trainer(params: OffPolicyTrainerParams) OffPolicyTrainer[source]#

update(buffer: ReplayBuffer, sample_size: int | None) TrainingStats[source]#

class OfflineAlgorithm(*, policy: TPolicy)[source]#

Bases: Algorithm[TPolicy, OfflineTrainerParams], Generic[TPolicy], ABC

Base class for offline RL algorithms.

Parameters:

policy – the policy

process_buffer(buffer: TBuffer) TBuffer[source]#

Pre-process the replay buffer to prepare for offline learning, e.g. to add new keys.

run_training(params: OfflineTrainerParams) InfoStats[source]#

create_trainer(params: OfflineTrainerParams) OfflineTrainer[source]#

update(buffer: ReplayBuffer, sample_size: int | None) TrainingStats[source]#

class OnPolicyWrapperAlgorithm(wrapped_algorithm: OnPolicyAlgorithm[TPolicy])[source]#

Bases: OnPolicyAlgorithm[TPolicy], Generic[TPolicy], ABC

Base class for an on-policy algorithm that is a wrapper around another algorithm.

It applies the wrapped algorithm’s pre-processing and post-processing methods and chains the update method of the wrapped algorithm with the wrapper’s own update method.

Parameters:

wrapped_algorithm – the on-policy algorithm to wrap

class OffPolicyWrapperAlgorithm(wrapped_algorithm: OffPolicyAlgorithm[TPolicy])[source]#

Bases: OffPolicyAlgorithm[TPolicy], Generic[TPolicy], ABC

Base class for an off-policy algorithm that is a wrapper around another algorithm.

It applies the wrapped algorithm’s pre-processing and post-processing methods and chains the update method of the wrapped algorithm with the wrapper’s own update method.

Parameters:

wrapped_algorithm – the off-policy algorithm to wrap

class RandomActionPolicy(action_space: Space)[source]#

Bases: Policy

Parameters:
  • action_space – the environment’s action_space.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: Any) ActStateBatchProtocol[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

episode_mc_return_to_go(rewards: ndarray, gamma: float = 0.99) ndarray[source]#

Calculates discounted Monte Carlo returns-to-go from the rewards of a single episode.

Parameters:
  • rewards – rewards of a single episode. Assumed to be a 1-dim array from reset till the end of the episode.

  • gamma – discount factor

Returns:

a numpy array of shape (len(rewards), ).
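
The computation amounts to a single backward pass over the rewards. The following plain-Python sketch illustrates it with lists instead of numpy arrays; mc_return_to_go is a hypothetical name chosen to keep it distinct from the library function.

```python
def mc_return_to_go(rewards, gamma=0.99):
    """Discounted return-to-go G_t = r_t + gamma * G_{t+1} for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


episode_returns = mc_return_to_go([1.0, 1.0, 1.0], gamma=0.5)
print(episode_returns)
```

Iterating backwards reuses the return of the following step, so the whole episode is processed in linear time.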