algorithm_base#
Source code: tianshou/algorithm/algorithm_base.py
- class TrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>)[source]#
Bases:
DataclassPPrintMixin
- train_time: float = 0.0#
The time for learning models.
- smoothed_loss: dict#
The smoothed loss statistics of the policy learn step.
- class TrainingStatsWrapper(wrapped_stats: TrainingStats)[source]#
Bases:
TrainingStats
In this particular case, super().__init__() should be called LAST in the subclass init.
- property wrapped_stats: TrainingStats#
- class Policy(action_space: Space, observation_space: Space | None = None, action_scaling: bool = False, action_bound_method: Literal['clip', 'tanh'] | None = 'clip')[source]#
Bases:
Module, ABC
Represents a policy, which provides the fundamental mapping from observations to actions.
- Parameters:
action_space – the environment’s action_space.
observation_space – the environment’s observation space.
action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.
action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
- is_within_training_step#
flag indicating whether we are currently within a training step, which encompasses data collection for training (in online RL algorithms) and the policy update (gradient steps).
It can be used, for example, to decide whether a flag requesting deterministic evaluation should actually be honored: within a training step, we typically want stochastic action computation even if such a flag is enabled, including stochastic action computation for q-targets (e.g. in SAC-based algorithms).
This flag should normally remain False and should be set to True only by the algorithm which performs training steps. This is done automatically by the Trainer classes. If a policy is used outside of a Trainer, the user should ensure that this flag is set correctly.
- property action_type: Literal['discrete', 'continuous']#
- map_action(act: Tensor | ndarray) ndarray[source]#
Map raw network output to action range in gym’s env.action_space.
This function is called in collect() and only affects the actions sent to the env. The remapped action is not stored in the buffer and can thus be viewed as part of the env (a black-box action transformation).
Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be [-1, 1] and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.
- Parameters:
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
- Returns:
the action in the same form as the input “act”, remapped to the target action space.
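The two procedures described above (bounding first, then scaling) can be sketched on plain NumPy arrays. This is a minimal illustration, not Tianshou's implementation: the function name map_action_sketch and its arguments low, high, bound_method and scaling are hypothetical stand-ins for the policy's configuration.

```python
import numpy as np


def map_action_sketch(act, low, high, bound_method="clip", scaling=True):
    """Hypothetical stand-in for the bounding and scaling steps (not Tianshou's code)."""
    act = np.asarray(act, dtype=np.float64)
    # Step 1: bounding, maps (-inf, inf) to [-1, 1].
    if bound_method == "clip":
        act = np.clip(act, -1.0, 1.0)
    elif bound_method == "tanh":
        act = np.tanh(act)
    # Step 2: scaling, maps [-1, 1] linearly to [low, high].
    if scaling:
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


low, high = np.array([0.0, -2.0]), np.array([10.0, 2.0])
raw = np.array([3.0, -0.5])  # the first component exceeds the [-1, 1] range
mapped = map_action_sketch(raw, low, high, bound_method="clip")
```

With clipping, the out-of-range component lands exactly on the upper bound of the action space, while the in-range component is mapped linearly.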
- map_action_inverse(act: Tensor | ndarray) ndarray[source]#
Inverse operation to map_action().
This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.
- Parameters:
act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().
- Returns:
action remapped.
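As a rough illustration of the inverse direction, the sketch below inverts only the linear scaling step and checks that a round trip recovers the original environment action. The helper names scale_to_env and scale_from_env are hypothetical, and any inversion of the bounding step is left out.

```python
import numpy as np


def scale_to_env(act, low, high):
    """Map a normalized action in [-1, 1] to [low, high] (hypothetical helper)."""
    return low + (high - low) * (act + 1.0) / 2.0


def scale_from_env(act, low, high):
    """Map an environment action in [low, high] back to [-1, 1] (hypothetical helper)."""
    return 2.0 * (act - low) / (high - low) - 1.0


low, high = np.array([0.0, -2.0]), np.array([10.0, 2.0])
env_act = np.array([2.5, 1.0])  # e.g. a sample drawn from the action space
normalized = scale_from_env(env_act, low, high)
roundtrip = scale_to_env(normalized, low, high)
```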
- compute_action(obs: _SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes], info: dict[str, Any] | None = None, state: dict | BatchProtocol | ndarray | None = None) ndarray | int[source]#
Get action as int (for discrete env’s) or array (for continuous ones) from an env’s observation and info.
- Parameters:
obs – observation from the gym’s env.
info – information given by the gym’s env.
state – the hidden state of RNN policy, used for recurrent policy.
- Returns:
action as int (for discrete env’s) or array (for continuous ones).
- add_exploration_noise(act: _TArrOrActBatch, batch: ObsBatchProtocol) _TArrOrActBatch[source]#
(Optionally) adds noise to actions computed by the policy’s forward method for exploration purposes.
NOTE: The base implementation does not add any noise, but subclasses can override this method to add appropriate mechanisms for adding noise.
- Parameters:
act – a data batch or numpy.ndarray containing actions computed by the policy’s forward method.
batch – the corresponding input batch that was passed to forward; provided for advanced usage.
- Returns:
actions in the same format as the input act but with added exploration noise (if implemented - otherwise returns act unchanged).
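One mechanism a subclass might implement is additive Gaussian noise. The sketch below shows the idea on plain NumPy arrays; the function name add_gaussian_noise and the parameters sigma and rng are assumptions for illustration and are not part of the documented interface.

```python
import numpy as np


def add_gaussian_noise(act, sigma, rng):
    """Return act plus zero-mean Gaussian noise; sigma == 0 leaves act unchanged."""
    act = np.asarray(act, dtype=np.float64)
    if sigma == 0.0:
        return act
    return act + rng.normal(loc=0.0, scale=sigma, size=act.shape)


rng = np.random.default_rng(0)
act = np.array([[0.1, -0.3], [0.5, 0.2]])  # a batch of two 2-dim actions
noisy = add_gaussian_noise(act, sigma=0.1, rng=rng)
unchanged = add_gaussian_noise(act, sigma=0.0, rng=rng)
```

The output keeps the shape of the input, matching the requirement that the result is in the same format as act.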
- class LaggedNetworkAlgorithmMixin[source]#
Bases:
ABC
Base class for an algorithm mixin which adds support for lagged networks (target networks) whose weights are updated periodically.
- class LaggedNetworkFullUpdateAlgorithmMixin[source]#
Bases:
LaggedNetworkAlgorithmMixin
Algorithm mixin which adds support for lagged networks (target networks) where weights are updated by fully copying the weights of the source network to the target network.
- class LaggedNetworkPolyakUpdateAlgorithmMixin(tau: float)[source]#
Bases:
LaggedNetworkAlgorithmMixin
Algorithm mixin which adds support for lagged networks (target networks) where weights are updated via Polyak averaging (soft update using a convex combination of the parameters of the source and target networks with weight tau and 1-tau respectively).
- Parameters:
tau – the fraction with which to use the source network’s parameters, the complement 1-tau being the fraction with which to retain the target network’s parameters.
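The convex combination can be written out on plain arrays as follows. This is a sketch of the update rule only: the function polyak_update and the dictionaries of NumPy arrays are hypothetical stand-ins for network parameters, not the mixin's actual code.

```python
import numpy as np


def polyak_update(target, source, tau):
    """In place: target <- tau * source + (1 - tau) * target, per parameter."""
    for name, src in source.items():
        target[name] = tau * src + (1.0 - tau) * target[name]


source = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
target = {"w": np.array([0.0, 0.0]), "b": np.array([0.0])}
polyak_update(target, source, tau=0.1)

# With tau = 1.0 the rule degenerates to a full copy of the source parameters.
full_copy = {"w": np.array([0.0, 0.0]), "b": np.array([0.0])}
polyak_update(full_copy, source, tau=1.0)
```

Small values of tau make the target network track the source slowly, which is the usual motivation for soft updates.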
- class Algorithm(*, policy: TPolicy)[source]#
Bases:
Module, Generic[TPolicy, TTrainerParams], ABC
The base class for reinforcement learning algorithms in Tianshou.
An algorithm critically defines how to update the parameters of neural networks based on a batch of data, optionally applying pre-processing and post-processing to the data. The actual update step is highly algorithm-specific and thus is defined in subclasses.
- Parameters:
policy – the policy
- class Optimizer(optim: Optimizer, module: Module, max_grad_norm: float | None = None)[source]#
Bases:
object
Wrapper for a torch optimizer that optionally performs gradient clipping.
- Parameters:
optim – the optimizer
module – the module whose parameters are being affected by optim
max_grad_norm – the maximum L2 norm threshold for gradient clipping. When not None, gradients will be rescaled to ensure their L2 norm does not exceed this value. This prevents exploding gradients and stabilizes training by limiting the magnitude of parameter updates. Set to None to disable gradient clipping.
- step(loss: Tensor, retain_graph: bool | None = None, create_graph: bool = False) None[source]#
Performs an optimizer step, optionally applying gradient clipping (if configured at construction).
- Parameters:
loss – the loss to backpropagate
retain_graph – passed on to backward
create_graph – passed on to backward
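The rescaling implied by max_grad_norm can be illustrated with a small NumPy sketch of clipping by global L2 norm. The function clip_by_global_norm is a hypothetical illustration of the idea, and it is an assumption that the wrapper clips the joint norm over all parameters in this way; PyTorch provides torch.nn.utils.clip_grad_norm_ for this kind of clipping.

```python
import numpy as np


def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients jointly so that their combined L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g**2) for g in grads)))
    if total_norm <= max_norm:
        # Already within the threshold: gradients are left as they are.
        return [g.copy() for g in grads], total_norm
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm


grads = [np.array([3.0, 0.0]), np.array([4.0])]  # combined L2 norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g**2) for g in clipped)))
```

The direction of the overall gradient is preserved; only its magnitude is reduced.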
- state_dict(*args, destination=None, prefix='', keep_vars=False)[source]#
Returns a dictionary containing references to the whole state of the module.
Both parameters and persistent buffers (e.g. running averages) are included. Keys are corresponding parameter and buffer names. Parameters and buffers set to None are not included.
Note
The returned object is a shallow copy. It contains references to the module’s parameters and buffers.
Warning
Currently state_dict() also accepts positional arguments for destination, prefix and keep_vars in order. However, this is being deprecated and keyword arguments will be enforced in future releases.
Warning
Please avoid the use of argument destination as it is not designed for end-users.
- Args:
- destination (dict, optional): If provided, the state of module will be updated into the dict and the same object is returned. Otherwise, an OrderedDict will be created and returned. Default: None.
- prefix (str, optional): a prefix added to parameter and buffer names to compose the keys in state_dict. Default: ''.
- keep_vars (bool, optional): by default the Tensors returned in the state dict are detached from autograd. If it’s set to True, detaching will not be performed. Default: False.
- Returns:
- dict:
a dictionary containing a whole state of the module
Example:
>>> # xdoctest: +SKIP("undefined vars")
>>> module.state_dict().keys()
['bias', 'weight']
- load_state_dict(state_dict: Mapping[str, Any], strict: bool = True, assign: bool = False) _IncompatibleKeys[source]#
Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
Warning
If assign is True the optimizer must be created after the call to load_state_dict.
- Args:
- state_dict (dict): a dict containing parameters and persistent buffers.
- strict (bool, optional): whether to strictly enforce that the keys in state_dict match the keys returned by this module’s state_dict() function. Default: True
- assign (bool, optional): whether to assign items in the state dictionary to their corresponding keys in the module instead of copying them inplace into the module’s current parameters and buffers. When False, the properties of the tensors in the current module are preserved while when True, the properties of the Tensors in the state dict are preserved. Default: False
- Returns:
NamedTuple with missing_keys and unexpected_keys fields:
missing_keys is a list of str containing the missing keys
unexpected_keys is a list of str containing the unexpected keys
- Note:
If a parameter or buffer is registered as None and its corresponding key exists in state_dict, load_state_dict() will raise a RuntimeError.
- static value_mask(buffer: ReplayBuffer, indices: ndarray) ndarray[source]#
The value mask determines whether the obs_next of buffer[indices] is valid.
For instance, the “obs_next” following a “done” flag is usually considered invalid: its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag was generated because of a time limit on the episode length (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” is instead valid. The value mask is typically used to assist in calculating the correct q/advantage value.
- Parameters:
buffer – the corresponding replay buffer.
indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.
- Returns:
A numpy.ndarray of type bool with the same shape as indices. “True” means that the “obs_next” of the corresponding buffer[indices] entry is valid.
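The rule described above can be sketched on plain flag arrays. This is an illustration under assumptions: the function value_mask_sketch and the separate done and truncated arrays are hypothetical, and the actual method reads its information from the replay buffer.

```python
import numpy as np


def value_mask_sketch(done, truncated):
    """obs_next is valid unless the episode ended for a reason other than a time limit."""
    done = np.asarray(done, dtype=bool)
    truncated = np.asarray(truncated, dtype=bool)
    return ~done | truncated


done = np.array([False, True, True, False])
truncated = np.array([False, False, True, False])  # step 2 ended by time limit only
mask = value_mask_sketch(done, truncated)

# Typical use: zero out bootstrapped values where obs_next is invalid.
v_next = np.array([1.0, 2.0, 3.0, 4.0])
masked_v_next = v_next * mask
```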
- static compute_episodic_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, v_s_: ndarray | Tensor | None = None, v_s: ndarray | Tensor | None = None, gamma: float = 0.99, gae_lambda: float = 0.95) tuple[ndarray, ndarray][source]#
Compute returns over the given batch.
Uses an implementation of the Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch. Returns are calculated as advantage + value, which is exactly equivalent to using \(TD(\lambda)\) for estimating returns.
Setting v_s_ and v_s to None (or all zeros) and gae_lambda to 1.0 calculates the discounted return-to-go (Monte Carlo return).
- Parameters:
batch – a data batch which contains several episodes of data in sequential order. Mind that the end of each finished episode in the batch should be marked by the done flag; unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().
buffer – the corresponding replay buffer.
indices – tells the batch’s location in the buffer; batch is equal to buffer[indices].
v_s_ – the value function of all next states \(V(s')\). If None, it will be set to an array of 0.
v_s – the value function of all current states \(V(s)\). If None, it is set based upon v_s_ rolled by 1.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
gae_lambda – the lambda parameter in [0, 1] for generalized advantage estimation (GAE). Controls the bias-variance tradeoff in advantage estimates, acting as a weighting factor for combining different n-step advantage estimators. Higher values (closer to 1) reduce bias but increase variance by giving more weight to longer trajectories, while lower values (closer to 0) reduce variance but increase bias by relying more on the immediate TD error and value function estimates. At λ=0, GAE becomes equivalent to the one-step TD error (high bias, low variance); at λ=1, it becomes equivalent to Monte Carlo advantage estimation (low bias, high variance). Intermediate values create a weighted average of n-step returns, with exponentially decaying weights for longer-horizon returns. Typically set between 0.9 and 0.99 for most policy gradient methods.
- Returns:
two numpy arrays (returns, advantage), each of shape (bsz, ).
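The GAE recursion behind this computation can be sketched for a flat sequence of transitions. This is a simplified illustration, not the library's implementation: the function gae_sketch and the end_flag array (1.0 where an episode ends) are assumptions, and v_s_ is expected to be zeroed already wherever the next observation is invalid.

```python
import numpy as np


def gae_sketch(rewards, v_s, v_s_, end_flag, gamma=0.99, gae_lambda=0.95):
    """Return (returns, advantages) computed with the backward GAE recursion."""
    rewards = np.asarray(rewards, dtype=np.float64)
    v_s = np.asarray(v_s, dtype=np.float64)
    v_s_ = np.asarray(v_s_, dtype=np.float64)
    end_flag = np.asarray(end_flag, dtype=np.float64)
    # One-step TD errors.
    delta = rewards + gamma * v_s_ - v_s
    # The accumulated advantage does not carry over across episode boundaries.
    discount = (1.0 - end_flag) * gamma * gae_lambda
    advantages = np.zeros_like(rewards)
    gae = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        gae = delta[t] + discount[t] * gae
        advantages[t] = gae
    return advantages + v_s, advantages


# Zero value estimates and gae_lambda = 1 give the discounted return-to-go.
returns, advantages = gae_sketch(
    rewards=[1.0, 1.0, 1.0],
    v_s=np.zeros(3),
    v_s_=np.zeros(3),
    end_flag=[0.0, 0.0, 1.0],
    gamma=0.5,
    gae_lambda=1.0,
)
```

In this special case the returns reduce to 1 + 0.5 + 0.25, 1 + 0.5 and 1 for the three steps.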
- static compute_nstep_return(batch: RolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray, target_q_fn: Callable[[ReplayBuffer, ndarray], Tensor], gamma: float = 0.99, n_step: int = 1) BatchWithReturnsProtocol[source]#
Computes the n-step return for Q-learning targets, adds it to the batch and returns the resulting batch.
\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]
where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), and \(d_t\) is the done flag of step \(t\).
- Parameters:
batch – a data batch, which is equal to buffer[indices].
buffer – the data buffer.
indices – tells the batch’s location in the buffer.
target_q_fn – a function which computes the target Q value of “obs_next” given data buffer and wanted indices (n_step steps ahead).
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
n_step – the number of estimation steps, should be an int greater than 0.
- Returns:
a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
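For a single trajectory, an n-step target can be sketched as below. The sketch follows one common convention, which is an assumption about how the done-flag factors are to be read: rewards are accumulated up to and including the first terminal step in the window, and the bootstrap term is dropped once a terminal step is encountered. The function nstep_return_sketch and the array target_q (holding a hypothetical target Q value for the state reached after each step) are illustrative names only.

```python
import numpy as np


def nstep_return_sketch(rewards, dones, target_q, t, gamma=0.99, n_step=1):
    """n-step target for step t; the caller ensures the window stays inside the arrays."""
    g = 0.0
    for k in range(n_step):
        i = t + k
        g += gamma**k * rewards[i]
        if dones[i]:
            # Episode ended inside the window: no bootstrapping beyond this step.
            return g
    # No terminal step in the window: bootstrap from the state reached after n steps.
    return g + gamma**n_step * target_q[t + n_step - 1]


rewards = np.array([1.0, 2.0, 3.0, 4.0])
dones = np.array([False, False, True, False])
target_q = np.array([10.0, 20.0, 30.0, 40.0])  # hypothetical target values

g0 = nstep_return_sketch(rewards, dones, target_q, t=0, gamma=0.5, n_step=2)
g1 = nstep_return_sketch(rewards, dones, target_q, t=1, gamma=0.5, n_step=2)
```

Here g0 bootstraps from the target value after two steps, while g1 stops at the terminal step and uses rewards only.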
- class OnPolicyAlgorithm(*, policy: TPolicy)[source]#
Bases:
Algorithm[TPolicy, OnPolicyTrainerParams], Generic[TPolicy], ABC
Base class for on-policy RL algorithms.
- Parameters:
policy – the policy
- create_trainer(params: OnPolicyTrainerParams) OnPolicyTrainer[source]#
- update(buffer: ReplayBuffer, batch_size: int | None, repeat: int) TrainingStats[source]#
- class OffPolicyAlgorithm(*, policy: TPolicy)[source]#
Bases:
Algorithm[TPolicy, OffPolicyTrainerParams], Generic[TPolicy], ABC
Base class for off-policy RL algorithms.
- Parameters:
policy – the policy
- create_trainer(params: OffPolicyTrainerParams) OffPolicyTrainer[source]#
- update(buffer: ReplayBuffer, sample_size: int | None) TrainingStats[source]#
- class OfflineAlgorithm(*, policy: TPolicy)[source]#
Bases:
Algorithm[TPolicy, OfflineTrainerParams], Generic[TPolicy], ABC
Base class for offline RL algorithms.
- Parameters:
policy – the policy
- process_buffer(buffer: TBuffer) TBuffer[source]#
Pre-process the replay buffer to prepare for offline learning, e.g. to add new keys.
- run_training(params: OfflineTrainerParams) InfoStats[source]#
- create_trainer(params: OfflineTrainerParams) OfflineTrainer[source]#
- update(buffer: ReplayBuffer, sample_size: int | None) TrainingStats[source]#
- class OnPolicyWrapperAlgorithm(wrapped_algorithm: OnPolicyAlgorithm[TPolicy])[source]#
Bases:
OnPolicyAlgorithm[TPolicy], Generic[TPolicy], ABC
Base class for an on-policy algorithm that is a wrapper around another algorithm.
It applies the wrapped algorithm’s pre-processing and post-processing methods and chains the update method of the wrapped algorithm with the wrapper’s own update method.
- Parameters:
policy – the policy
- class OffPolicyWrapperAlgorithm(wrapped_algorithm: OffPolicyAlgorithm[TPolicy])[source]#
Bases:
OffPolicyAlgorithm[TPolicy], Generic[TPolicy], ABC
Base class for an off-policy algorithm that is a wrapper around another algorithm.
It applies the wrapped algorithm’s pre-processing and post-processing methods and chains the update method of the wrapped algorithm with the wrapper’s own update method.
- Parameters:
policy – the policy
- class RandomActionPolicy(action_space: Space)[source]#
Bases:
Policy
- Parameters:
action_space – the environment’s action_space.
observation_space – the environment’s observation space.
action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.
action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
- forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: Any) ActStateBatchProtocol[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- episode_mc_return_to_go(rewards: ndarray, gamma: float = 0.99) ndarray[source]#
Calculates discounted Monte Carlo returns-to-go from the rewards of a single episode.
- Parameters:
rewards – rewards of a single episode. Assumed to be a 1-dim array from reset till the end of the episode.
gamma – discount factor
- Returns:
a numpy array of shape (len(rewards), ).
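The computation amounts to a backward accumulation over the episode's rewards. The following is a minimal sketch of that idea and not necessarily how the library implements it; the function name mc_return_to_go_sketch is hypothetical.

```python
import numpy as np


def mc_return_to_go_sketch(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    rewards = np.asarray(rewards, dtype=np.float64)
    returns = np.zeros_like(rewards)
    running = 0.0
    for t in range(len(rewards) - 1, -1, -1):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns_to_go = mc_return_to_go_sketch([1.0, 1.0, 1.0], gamma=0.5)
```

The result has one entry per reward, matching the documented shape (len(rewards), ).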