tianshou.policy
Base

class tianshou.policy.BasePolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None, action_scaling: bool = False, action_bound_method: str = '')
Bases: abc.ABC, torch.nn.modules.module.Module
The base class for any RL policy.
Tianshou aims to modularize RL algorithms, and it provides several classes of policies. All policy classes must inherit from BasePolicy.
A policy class typically has the following parts:
__init__(): initialize the policy, including copying the target network and so on;
forward(): compute the action for a given observation;
process_fn(): preprocess data from the replay buffer (this function can interact with the replay buffer);
learn(): update the policy with a given batch of data;
post_process_fn(): update the replay buffer from the learning process (e.g., a prioritized replay buffer needs to update the weights);
update(): the main interface for training, i.e., process_fn -> learn -> post_process_fn.
Most policies need a neural network to predict the action and an optimizer to optimize the policy. The rules for self-defined networks are:
Input: observation “obs” (may be a numpy.ndarray, a torch.Tensor, a dict, or any other type), hidden state “state” (for RNN usage), and other information “info” provided by the environment.
Output: some “logits”, the next hidden state “state”, and the intermediate result during the policy forwarding procedure “policy”. The “logits” could be a tuple instead of a torch.Tensor, depending on how the policy processes the network output. For example, in PPO, the return of the network might be (mu, sigma), state for a Gaussian policy. The “policy” can be a Batch of torch.Tensor or other things, which will be stored in the replay buffer and can be accessed in the policy update process (e.g., in “policy.learn()”, the “batch.policy” is what you need).
Since BasePolicy inherits torch.nn.Module, you can use BasePolicy almost the same as torch.nn.Module, for instance, loading and saving the model:
torch.save(policy.state_dict(), "policy.pth")
policy.load_state_dict(torch.load("policy.pth"))

exploration_noise(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch]
Modify the action from policy.forward with exploration noise.
 Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
 Returns
action in the same form of input “act” but with added exploration noise.

soft_update(tgt: torch.nn.modules.module.Module, src: torch.nn.modules.module.Module, tau: float) → None
Softly update the parameters of the target module towards the parameters of the source module.
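A soft update is commonly realized as Polyak averaging, where each target parameter moves a fraction tau towards the matching source parameter. The sketch below illustrates that rule on plain Python lists; the list-based function is a hypothetical stand-in for illustration, not the library's implementation, which operates on torch modules.

```python
def soft_update(tgt, src, tau):
    """Move every target value a fraction tau towards its source value, in place.

    tgt and src are plain lists standing in for module parameters (assumption).
    """
    for i, (t, s) in enumerate(zip(tgt, src)):
        tgt[i] = tau * s + (1.0 - tau) * t


target_params = [0.0, 10.0]
source_params = [1.0, 20.0]
soft_update(target_params, source_params, tau=0.1)
print(target_params)
```

With tau = 1 the target becomes a copy of the source, and with tau = 0 it is left unchanged.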

abstract forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
Returns
A Batch which MUST have the following keys:
act: a numpy.ndarray or a torch.Tensor, the action over the given batch data.
state: a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy, None by default.
Other keys are user-defined and depend on the algorithm. For example,
# some code
return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,
# some code
return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
# and in the sampled data batch, you can directly use
# batch.policy.log_prob to get your data.
Note
In continuous action space, you should do another step “map_action” to get the real action:
act = policy(batch).act  # doesn't map to the target action range
act = policy.map_action(act)

map_action(act: Union[tianshou.data.batch.Batch, numpy.ndarray]) → Union[tianshou.data.batch.Batch, numpy.ndarray]
Map raw network output to action range in gym’s env.action_space.
This function is called in collect() and only affects the action sent to the env. The remapped action will not be stored in the buffer and thus can be viewed as a part of the env (a black box action transformation).
Action mapping includes 2 standard procedures: bounding and scaling. The bounding procedure expects the original action range to be (-inf, inf) and maps it to [-1, 1], while the scaling procedure expects the original action range to be (-1, 1) and maps it to [action_space.low, action_space.high]. The bounding procedure is applied first.
Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
Returns
action in the same form as the input “act” but remapped to the target action space.
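The two procedures can be sketched on plain floats as follows. This is a minimal illustration under stated assumptions: the function name, the list-based arguments and the linear scaling formula are chosen for the example and are not taken from the library source; only the option names “clip” and “tanh” come from the parameter descriptions on this page.

```python
import math


def map_action_sketch(act, low, high, bound_method="clip", scaling=True):
    """Bound each raw action to [-1, 1], then scale it to [low, high]."""
    mapped = []
    for a, lo, hi in zip(act, low, high):
        # Step 1: bounding, (-inf, inf) -> [-1, 1].
        if bound_method == "clip":
            a = max(-1.0, min(1.0, a))
        elif bound_method == "tanh":
            a = math.tanh(a)
        # Step 2: scaling, [-1, 1] -> [lo, hi] (linear map, assumed here).
        if scaling:
            a = lo + (hi - lo) * (a + 1.0) / 2.0
        mapped.append(a)
    return mapped


result = map_action_sketch([-3.0, 0.0, 0.5],
                           low=[0.0, 0.0, -2.0],
                           high=[10.0, 10.0, 2.0])
print(result)  # [0.0, 5.0, 1.0]
```

The first raw action, -3.0, is clipped to -1 and therefore lands on the lower bound of its range.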

map_action_inverse(act: Union[tianshou.data.batch.Batch, List, numpy.ndarray]) → Union[tianshou.data.batch.Batch, List, numpy.ndarray]
Inverse operation to map_action().
This function is called in collect() for random initial steps. It scales [action_space.low, action_space.high] to the value ranges of policy.forward.
Parameters
act – a data batch, list or numpy.ndarray which is the action taken by gym.spaces.Box.sample().
 Returns
action remapped.
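As a rough illustration of the inverse direction, the sketch below undoes a linear scaling and a tanh bounding on plain floats. The function names and the scaling formula are assumptions made for this example, not the library's code. Clipping loses information and cannot be inverted, so the sketch leaves “clip” as a no-op in the inverse.

```python
import math


def scale_to_env(a, lo, hi):
    """Forward scaling assumed for this sketch: [-1, 1] -> [lo, hi]."""
    return lo + (hi - lo) * (a + 1.0) / 2.0


def map_action_inverse_sketch(act, low, high, bound_method="clip", scaling=True):
    """Map env-range actions back to the value range of the raw policy output."""
    raw = []
    for a, lo, hi in zip(act, low, high):
        if scaling:
            # Undo the linear scaling: [lo, hi] -> [-1, 1].
            a = (a - lo) * 2.0 / (hi - lo) - 1.0
        if bound_method == "tanh":
            # Undo tanh squashing; atanh is undefined at exactly -1 and 1.
            a = math.atanh(a)
        raw.append(a)
    return raw


env_actions = [2.5, 1.0]
raw_actions = map_action_inverse_sketch(env_actions, low=[0.0, -2.0], high=[10.0, 2.0])
print(raw_actions)  # [-0.5, 0.5]
```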

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch
Preprocess the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.

abstract learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, Any]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

post_process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → None
Post-process the data from the provided replay buffer.
Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

update(sample_size: int, buffer: Optional[tianshou.data.buffer.base.ReplayBuffer], **kwargs: Any) → Dict[str, Any]
Update the policy network and replay buffer.
It includes 3 function steps: process_fn, learn, and post_process_fn. In addition, this function will change the value of self.updating: it will be False before this function and will be True when executing update(). Please refer to States for policy for more detailed explanation.
Parameters
sample_size (int) – 0 means it will extract all the data from the buffer, otherwise it will sample a batch with given sample_size.
buffer (ReplayBuffer) – the corresponding replay buffer.
 Returns
A dict, including the data needed to be logged (e.g., loss) from policy.learn().
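The control flow described for update() can be sketched with a toy class. ToyPolicy, its list-based buffer and the calls log are hypothetical names invented for this illustration; sampling with replacement via random.choices is also an assumption, not a statement about the library.

```python
import random


class ToyPolicy:
    """Hypothetical stand-in that only mimics the update() control flow."""

    def __init__(self):
        self.updating = False
        self.calls = []
        self.updating_during_learn = None

    def process_fn(self, batch, buffer, indices):
        self.calls.append("process_fn")
        return batch

    def learn(self, batch):
        self.calls.append("learn")
        self.updating_during_learn = self.updating
        return {"loss": 0.0}

    def post_process_fn(self, batch, buffer, indices):
        self.calls.append("post_process_fn")

    def update(self, sample_size, buffer):
        # sample_size == 0 takes the whole buffer, otherwise draw a batch.
        if sample_size == 0:
            indices = list(range(len(buffer)))
        else:
            indices = random.choices(range(len(buffer)), k=sample_size)
        batch = [buffer[i] for i in indices]
        self.updating = True
        try:
            batch = self.process_fn(batch, buffer, indices)
            result = self.learn(batch)
            self.post_process_fn(batch, buffer, indices)
        finally:
            # The flag is reset even if one of the steps raises.
            self.updating = False
        return result


policy = ToyPolicy()
stats = policy.update(sample_size=2, buffer=[10, 11, 12, 13])
print(policy.calls, stats)
```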

static value_mask(buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → numpy.ndarray
Value mask determines whether the obs_next of buffer[indices] is valid.
For instance, the “obs_next” after a “done” flag is usually considered invalid; its q/advantage value can provide meaningless (even misleading) information and should be set to 0 by hand. But if the “done” flag is generated because of the time limit on game length (info[“TimeLimit.truncated”] is set to True in gym’s settings), “obs_next” will instead be valid. The value mask is typically used to assist in calculating the correct q/advantage value.
 Parameters
buffer (ReplayBuffer) – the corresponding replay buffer.
indices (numpy.ndarray) – indices of replay buffer whose “obs_next” will be judged.
 Returns
A bool type numpy.ndarray with the same shape as indices. “True” means “obs_next” of that buffer[indices] is valid.

static compute_episodic_return(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray, v_s_: Optional[Union[numpy.ndarray, torch.Tensor]] = None, v_s: Optional[Union[numpy.ndarray, torch.Tensor]] = None, gamma: float = 0.99, gae_lambda: float = 0.95) → Tuple[numpy.ndarray, numpy.ndarray]
Compute returns over given batch.
Use the implementation of Generalized Advantage Estimator (arXiv:1506.02438) to calculate the q/advantage value of the given batch.
 Parameters
batch (Batch) – a data batch which contains several episodes of data in sequential order. Mind that the end of each finished episode of batch should be marked by done flag, unfinished (or collecting) episodes will be recognized by buffer.unfinished_index().
indices (numpy.ndarray) – tell batch’s location in buffer, batch is equal to buffer[indices].
v_s_ (np.ndarray) – the value function of all next states \(V(s')\).
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
gae_lambda (float) – the parameter for Generalized Advantage Estimation, should be in [0, 1]. Default to 0.95.
 Returns
two numpy arrays (returns, advantage), each with shape (bsz, ).
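The following is a minimal sketch of Generalized Advantage Estimation on plain Python lists, under the assumption that v_s[t] holds V(s_t), v_s_[t] holds V(s'_t), and the returns are obtained as advantage plus V(s). The function name and the list-based interface are hypothetical; the library version works on batches and buffers and also handles unfinished episodes.

```python
def gae_sketch(rew, done, v_s, v_s_, gamma=0.99, gae_lambda=0.95):
    """Return (returns, advantage) for one sequentially ordered trajectory batch."""
    advantage = [0.0] * len(rew)
    gae = 0.0
    for t in reversed(range(len(rew))):
        not_done = 0.0 if done[t] else 1.0
        # One-step TD error; the next-state value is dropped at episode ends.
        delta = rew[t] + gamma * v_s_[t] * not_done - v_s[t]
        # Exponentially weighted sum of TD errors, reset at episode ends.
        gae = delta + gamma * gae_lambda * not_done * gae
        advantage[t] = gae
    returns = [a + v for a, v in zip(advantage, v_s)]
    return returns, advantage


returns, advantage = gae_sketch(
    rew=[1.0, 1.0, 1.0],
    done=[False, False, True],
    v_s=[0.0, 0.0, 0.0],
    v_s_=[0.0, 0.0, 0.0],
    gamma=0.5,
    gae_lambda=1.0,
)
print(returns)  # [1.75, 1.5, 1.0]
```

With gae_lambda = 1 and all value estimates at zero, the advantage reduces to the plain discounted return, which makes the example easy to check by hand.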

static compute_nstep_return(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray, target_q_fn: Callable[[tianshou.data.buffer.base.ReplayBuffer, numpy.ndarray], torch.Tensor], gamma: float = 0.99, n_step: int = 1, rew_norm: bool = False) → tianshou.data.batch.Batch
Compute the n-step return for Q-learning targets.
\[G_t = \sum_{i = t}^{t + n - 1} \gamma^{i - t}(1 - d_i)r_i + \gamma^n (1 - d_{t + n}) Q_{\mathrm{target}}(s_{t + n})\]
where \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\), \(d_t\) is the done flag of step \(t\).
 Parameters
batch (Batch) – a data batch, which is equal to buffer[indice].
buffer (ReplayBuffer) – the data buffer.
target_q_fn (function) – a function which computes the target Q value of “obs_next” given the data buffer and the wanted indices.
gamma (float) – the discount factor, should be in [0, 1]. Default to 0.99.
n_step (int) – the number of estimation steps, should be an int greater than 0. Default to 1.
rew_norm (bool) – normalize the reward to Normal(0, 1). Default to False.
 Returns
a Batch. The result will be stored in batch.returns as a torch.Tensor with the same shape as target_q_fn’s return tensor.
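A common reading of the n-step target is: sum up to n discounted rewards, stop at the first terminal step, and bootstrap from the target Q value only if no terminal step was reached. The sketch below implements that reading on plain lists. The function name, the convention that target_q[i] is the target value of the “obs_next” of transition i, and the choice to still count the reward of a terminal step are assumptions for illustration.

```python
def nstep_return_sketch(rew, done, target_q, t, gamma=0.99, n_step=1):
    """n-step return for the transition at index t of a single trajectory."""
    ret = 0.0
    # Last transition inside the n-step window (capped at the data end).
    last = min(t + n_step, len(rew)) - 1
    for i in range(t, last + 1):
        ret += gamma ** (i - t) * rew[i]
        if done[i]:
            # Episode ended inside the window: no bootstrapping.
            return ret
    # No terminal step: bootstrap from the state reached after the window.
    return ret + gamma ** (last - t + 1) * target_q[last]


rew = [1.0, 1.0, 1.0, 1.0]
done = [False, False, False, True]
target_q = [10.0, 10.0, 10.0, 10.0]

g0 = nstep_return_sketch(rew, done, target_q, t=0, gamma=0.5, n_step=2)
g2 = nstep_return_sketch(rew, done, target_q, t=2, gamma=0.5, n_step=2)
g3 = nstep_return_sketch(rew, done, target_q, t=3, gamma=0.5, n_step=2)
print(g0, g2, g3)  # 4.0 1.5 1.0
```

For t = 0 the window holds two rewards and no terminal step, so the result is 1 + 0.5 + 0.25 * 10 = 4.0.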

training: bool

class tianshou.policy.RandomPolicy(observation_space: Optional[gym.spaces.space.Space] = None, action_space: Optional[gym.spaces.space.Space] = None, action_scaling: bool = False, action_bound_method: str = '')
Bases: tianshou.policy.base.BasePolicy
A random agent used in multi-agent learning.
It randomly chooses an action from the legal actions.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute the random action over the given batch data.
The input should contain a mask in batch.obs, with “True” to be available and “False” to be unavailable. For example, batch.obs.mask == np.array([[False, True, False]]) means with batch size 1, action “1” is available but actions “0” and “2” are unavailable.
Returns
A Batch with “act” key, containing the random action.
See also
Please refer to forward() for more detailed explanation.
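Choosing uniformly among the legal actions of each batch row can be sketched as below. The helper name and the nested-list mask are hypothetical stand-ins for batch.obs.mask, and uniform sampling is an assumption made for this example.

```python
import random


def random_legal_actions(mask, rng=random):
    """Pick one available action index per row of a boolean mask."""
    actions = []
    for row in mask:
        legal = [index for index, available in enumerate(row) if available]
        actions.append(rng.choice(legal))
    return actions


# Batch size 1: only action 1 is available, so it is always chosen.
single = random_legal_actions([[False, True, False]])
print(single)  # [1]

# Batch size 2 with several legal actions per row.
several = random_legal_actions([[True, False, True], [False, True, True]])
```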

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Since a random agent learns nothing, it returns an empty dict.

training: bool

Model-free
DQN Family

class tianshou.policy.DQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, is_double: bool = True, **kwargs: Any)
Bases: tianshou.policy.base.BasePolicy
Implementation of Deep Q Network. arXiv:1312.5602.
Implementation of Double Q-Learning. arXiv:1509.06461.
Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
is_double (bool) – use double dqn. Default to True.
See also
Please refer to BasePolicy for more detailed explanation.

train(mode: bool = True) → tianshou.policy.modelfree.dqn.DQNPolicy
Set the module in training mode, except for the target network.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch
Compute the n-step return for Q-learning targets.
More details can be found at compute_nstep_return().

compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
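One way to combine raw Q values with an action mask is to shift every unavailable action below the smallest Q value, so that a greedy argmax can never select it. The sketch below shows this idea on a plain list; the function name and the specific offset are assumptions for illustration rather than a description of the library's code.

```python
def mask_q_values(q_values, mask):
    """Push the Q values of unavailable actions below every available one."""
    # Adding this offset to any value gives a result below min(q_values).
    offset = min(q_values) - max(q_values) - 1.0
    return [q if available else q + offset
            for q, available in zip(q_values, mask)]


q_values = [3.0, 1.0, 2.0]
masked = mask_q_values(q_values, [False, True, False])
greedy_action = max(range(len(masked)), key=masked.__getitem__)
print(masked, greedy_action)  # [0.0, 1.0, -1.0] 1
```

Action 0 has the largest raw Q value, but because it is masked out the greedy choice falls on action 1.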

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

exploration_noise(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch]
Modify the action from policy.forward with exploration noise.
 Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
 Returns
action in the same form of input “act” but with added exploration noise.
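For discrete actions, exploration noise is often realized as epsilon-greedy: with probability eps the greedy action is replaced by a random legal one. The sketch below illustrates this on plain lists; the function name, the explicit mask argument and the rng parameter are hypothetical and not part of the documented signature.

```python
import random


def epsilon_greedy_sketch(act, mask, eps, rng):
    """Replace each action by a random legal action with probability eps."""
    noisy = []
    for action, row in zip(act, mask):
        if rng.random() < eps:
            legal = [index for index, available in enumerate(row) if available]
            action = rng.choice(legal)
        noisy.append(action)
    return noisy


rng = random.Random(0)
greedy = [1, 2]
mask = [[True, True, False], [False, True, True]]

unchanged = epsilon_greedy_sketch(greedy, mask, eps=0.0, rng=rng)
explored = epsilon_greedy_sketch(greedy, mask, eps=1.0, rng=rng)
print(unchanged)  # [1, 2]
```

With eps = 0 the actions pass through untouched, and with eps = 1 every action is resampled from its row's legal set.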

training: bool

class tianshou.policy.C51Policy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of Categorical Deep Q-Network. arXiv:1707.06887.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.
v_min (float) – the value of the smallest atom in the support set. Default to -10.0.
v_max (float) – the value of the largest atom in the support set. Default to 10.0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
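In a categorical (C51-style) value distribution, the network outputs a probability for each atom of a fixed support, and the scalar Q value of an action is the expectation over that support. The sketch below shows this for a single action with plain lists; both helper names are hypothetical, and evenly spaced atoms between v_min and v_max are the usual construction, assumed here.

```python
def support_atoms(v_min=-10.0, v_max=10.0, num_atoms=51):
    """Evenly spaced atom values from v_min to v_max (inclusive)."""
    step = (v_max - v_min) / (num_atoms - 1)
    return [v_min + index * step for index in range(num_atoms)]


def expected_q(probabilities, support):
    """Scalar Q value: the mean of the categorical value distribution."""
    return sum(p * z for p, z in zip(probabilities, support))


support = support_atoms(v_min=-10.0, v_max=10.0, num_atoms=5)
q_value = expected_q([0.0, 0.0, 0.5, 0.5, 0.0], support)
print(support, q_value)  # [-10.0, -5.0, 0.0, 5.0, 10.0] 2.5
```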

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool

class tianshou.policy.RainbowPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_atoms: int = 51, v_min: float = -10.0, v_max: float = 10.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.c51.C51Policy
Implementation of Rainbow DQN. arXiv:1710.02298.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_atoms (int) – the number of atoms in the support set of the value distribution. Default to 51.
v_min (float) – the value of the smallest atom in the support set. Default to -10.0.
v_max (float) – the value of the largest atom in the support set. Default to 10.0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to C51Policy for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool

class tianshou.policy.QRDQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of Quantile Regression Deep Q-Network. arXiv:1710.10044.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to DQNPolicy for more detailed explanation.

compute_q_value(logits: torch.Tensor, mask: Optional[numpy.ndarray]) → torch.Tensor
Compute the q value based on the network’s raw output and action mask.
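In quantile regression, the network predicts the value at num_quantiles quantile midpoints of the return distribution, and a scalar Q value can be read off as the mean of those predictions. The sketch below illustrates both steps for a single action; the helper names are hypothetical, and the midpoint formula (2i + 1) / (2N) is the standard construction, assumed here.

```python
def quantile_midpoints(num_quantiles):
    """Midpoints of num_quantiles equal-width slices of the interval [0, 1]."""
    return [(2 * index + 1) / (2 * num_quantiles) for index in range(num_quantiles)]


def q_from_quantiles(quantile_values):
    """Scalar Q value: the average of the predicted quantile values."""
    return sum(quantile_values) / len(quantile_values)


taus = quantile_midpoints(4)
q_value = q_from_quantiles([1.0, 2.0, 3.0, 6.0])
print(taus, q_value)  # [0.125, 0.375, 0.625, 0.875] 3.0
```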

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool

class tianshou.policy.IQNPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, sample_size: int = 32, online_sample_size: int = 8, target_sample_size: int = 8, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of Implicit Quantile Network. arXiv:1806.06923.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
sample_size (int) – the number of samples for policy evaluation. Default to 32.
online_sample_size (int) – the number of samples for online model in training. Default to 8.
target_sample_size (int) – the number of samples for target model in training. Default to 8.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to QRDQNPolicy for more detailed explanation.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool

class tianshou.policy.FQFPolicy(model: tianshou.utils.net.discrete.FullQuantileFunction, optim: torch.optim.optimizer.Optimizer, fraction_model: tianshou.utils.net.discrete.FractionProposalNetwork, fraction_optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_fractions: int = 32, ent_coef: float = 0.0, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)
Bases: tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of Fully-parameterized Quantile Function. arXiv:1911.02140.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
fraction_model (FractionProposalNetwork) – a FractionProposalNetwork for proposing fractions/quantiles given state.
fraction_optim (torch.optim.Optimizer) – a torch.optim for optimizing the fraction model above.
discount_factor (float) – in [0, 1].
num_fractions (int) – the number of fractions to use. Default to 32.
ent_coef (float) – the coefficient for entropy loss. Default to 0.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to QRDQNPolicy for more detailed explanation.

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'model', input: str = 'obs', fractions: Optional[tianshou.data.batch.Batch] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs. For example, if we have an environment that has three actions “0/1/2”:
batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
Returns
A Batch which has 3 keys:
act: the action.
logits: the network’s raw output.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool
On-policy

class tianshou.policy.PGPolicy(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], discount_factor: float = 0.99, reward_normalization: bool = False, action_scaling: bool = True, action_bound_method: str = 'clip', lr_scheduler: Optional[torch.optim.lr_scheduler.LambdaLR] = None, deterministic_eval: bool = False, **kwargs: Any)
Bases: tianshou.policy.base.BasePolicy
Implementation of REINFORCE algorithm.
Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1]. It can currently be “clip” (for simply clipping the action), “tanh” (for applying tanh squashing), or an empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.

process_fn(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch
Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
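The discounted return can be computed in one backward pass over the data, resetting the running sum at each episode boundary. The sketch below assumes the transitions are stored in sequential order with a done flag marking the last step of each episode; the function name and list-based interface are hypothetical.

```python
def discounted_returns(rew, done, gamma=0.99):
    """G_t = sum over i >= t of gamma^(i - t) * r_i, up to the episode end."""
    returns = [0.0] * len(rew)
    running = 0.0
    for t in reversed(range(len(rew))):
        if done[t]:
            # Step t ends an episode: nothing after it contributes.
            running = 0.0
        running = rew[t] + gamma * running
        returns[t] = running
    return returns


# Two episodes: steps 0..2 and step 3.
result = discounted_returns(
    rew=[1.0, 1.0, 1.0, 2.0],
    done=[False, False, True, True],
    gamma=0.5,
)
print(result)  # [1.75, 1.5, 1.0, 2.0]
```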

forward(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch
Compute action over the given batch data.
Returns
A Batch which has 4 keys:
act: the action.
logits: the network’s raw output.
dist: the action distribution.
state: the hidden state.
See also
Please refer to forward() for more detailed explanation.

learn(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]]
Update policy with a given batch of data.
Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training: bool

class tianshou.policy.NPGPolicy(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], advantage_normalization: bool = True, optim_critic_iters: int = 5, actor_step_size: float = 0.5, **kwargs: Any)
Bases: tianshou.policy.modelfree.a2c.A2CPolicy
Implementation of Natural Policy Gradient.
https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf
Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per minibatch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
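The interplay of action_bound_method and action_scaling can be sketched in plain Python. This is an illustration under stated assumptions, not Tianshou's implementation: the helper names bound_action and scale_action are hypothetical, and the scaling is assumed to be the usual linear map from [-1, 1] to [low, high].

```python
import math


def bound_action(act, method="clip"):
    """Bound a raw scalar action to [-1, 1] (hypothetical helper)."""
    if method == "clip":
        return max(-1.0, min(1.0, act))
    if method == "tanh":
        return math.tanh(act)
    if method == "":
        return act  # no bounding
    raise ValueError(f"unknown bound method: {method!r}")


def scale_action(act, low, high):
    """Linearly map an action from [-1, 1] to [low, high] (assumed convention)."""
    return low + (high - low) * (act + 1.0) / 2.0


low, high = 0.0, 10.0
clipped = scale_action(bound_action(3.0, "clip"), low, high)   # 3.0 is clipped to 1.0 first
squashed = scale_action(bound_action(0.0, "tanh"), low, high)  # tanh(0) = 0 maps to the midpoint
print(clipped, squashed)
```

Bounding comes first so that the scaled result can never leave the environment's action range.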

process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).
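The discounted return is usually computed backwards through the episode with the recursion G_t = r_t + gamma * G_{t+1}. The sketch below assumes a single finished episode given as a plain list of rewards; the function name is illustrative and not part of Tianshou.

```python
def discounted_returns(rewards, gamma):
    """Return [G_0, ..., G_T] for one episode, computed from the last step backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```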

learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
A2CPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: Optional[float] = None, gae_lambda: float = 0.95, max_batchsize: int = 256, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.pg.PGPolicy
Implementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to
BasePolicy
for more detailed explanation.
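The gae_lambda parameter controls Generalized Advantage Estimation. A minimal pure-Python sketch of the standard GAE recursion is given below; it assumes per-step lists of rewards, value estimates, next-state value estimates and done flags, and the function and argument names are hypothetical rather than Tianshou's API.

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Compute GAE advantages backwards over a trajectory."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error; bootstrapping stops at episode ends.
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages


zeros = [0.0, 0.0, 0.0]
dones = [False, False, True]
# With zero value estimates and gamma = lambda = 1, GAE reduces to the undiscounted return.
full = gae_advantages([1.0, 1.0, 1.0], zeros, zeros, dones, gamma=1.0, gae_lambda=1.0)
# With lambda = 0, GAE reduces to the one-step TD error.
one_step = gae_advantages([1.0, 1.0, 1.0], zeros, zeros, dones, gamma=1.0, gae_lambda=0.0)
print(full, one_step)
```

Values of gae_lambda between these two extremes trade bias (low lambda) against variance (high lambda).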
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
TRPOPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], max_kl: float = 0.01, backtrack_coeff: float = 0.8, max_backtracks: int = 10, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.npg.NPGPolicy
Implementation of Trust Region Policy Optimization. arXiv:1502.05477.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
advantage_normalization (bool) – whether to do per minibatch advantage normalization. Default to True.
optim_critic_iters (int) – Number of times to optimize critic network per update. Default to 5.
max_kl (float) – max KL-divergence used to constrain each actor network update. Default to 0.01.
backtrack_coeff (float) – Coefficient to be multiplied by step size when constraints are not met. Default to 0.8.
max_backtracks (int) – Max number of backtracking times in line search. Default to 10.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1. Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
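The roles of max_kl, backtrack_coeff and max_backtracks can be illustrated with a toy backtracking line search. This is a sketch, not Tianshou's code: evaluate is a hypothetical callback returning the KL divergence and the surrogate improvement for a candidate step size, and the quadratic KL model in the usage lines is made up for the example.

```python
def backtracking_line_search(evaluate, full_step, max_kl=0.01,
                             backtrack_coeff=0.8, max_backtracks=10):
    """Shrink the step until the KL constraint holds and the objective improves."""
    step = full_step
    for _ in range(max_backtracks):
        kl, improvement = evaluate(step)
        if kl <= max_kl and improvement > 0:
            return step
        step *= backtrack_coeff  # constraint violated: try a smaller step
    return 0.0  # no acceptable step found; keep the old parameters


def toy_evaluate(step):
    # Made-up model: KL grows quadratically, improvement linearly with the step.
    return step ** 2, step


accepted = backtracking_line_search(toy_evaluate, full_step=0.2)
print(accepted)
```

In this toy setting the initial step violates the KL bound and is shrunk four times before being accepted.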

learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
PPOPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], eps_clip: float = 0.2, dual_clip: Optional[float] = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.a2c.A2CPolicy
Implementation of Proximal Policy Optimization. arXiv:1707.06347.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
discount_factor (float) – in [0, 1]. Default to 0.99.
eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.
dual_clip (Optional[float]) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used).
value_clip (bool) – a parameter mentioned in arXiv:1811.02553v3 Sec. 4.1. Default to False.
advantage_normalization (bool) – whether to do per minibatch advantage normalization. Default to True.
recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to
BasePolicy
for more detailed explanation.
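The effect of eps_clip and dual_clip on a single sample can be sketched as follows. This is a scalar illustration of the clipped surrogate from the cited papers, written under the assumption that the objective is maximized; the function name is hypothetical and the code is not taken from Tianshou.

```python
def ppo_surrogate(ratio, advantage, eps_clip=0.2, dual_clip=None):
    """Per-sample PPO surrogate with optional dual clipping.

    ratio is pi_new(a|s) / pi_old(a|s).
    """
    clipped_ratio = max(1.0 - eps_clip, min(1.0 + eps_clip, ratio))
    surrogate = min(ratio * advantage, clipped_ratio * advantage)
    if dual_clip is not None and advantage < 0:
        # Dual clip: bound the surrogate from below by c * A when A < 0.
        surrogate = max(surrogate, dual_clip * advantage)
    return surrogate


# Positive advantage: gains from a large ratio are capped at (1 + eps_clip) * A.
capped = ppo_surrogate(1.5, 1.0)
# Negative advantage with a very large ratio: unbounded without dual clip ...
unbounded = ppo_surrogate(10.0, -1.0)
# ... and bounded by c * A with dual clip.
bounded = ppo_surrogate(10.0, -1.0, dual_clip=5.0)
print(capped, unbounded, bounded)
```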
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Compute the discounted returns for each transition.
\[G_t = \sum_{i=t}^T \gamma^{i-t}r_i\]
where \(T\) is the terminal time step, \(\gamma\) is the discount factor, \(\gamma \in [0, 1]\).

learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶
Off-policy¶

class
tianshou.policy.
DDPGPolicy
(actor: Optional[torch.nn.modules.module.Module], actor_optim: Optional[torch.optim.optimizer.Optimizer], critic: Optional[torch.nn.modules.module.Module], critic_optim: Optional[torch.optim.optimizer.Optimizer], tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, reward_normalization: bool = False, estimation_step: int = 1, action_scaling: bool = True, action_bound_method: str = 'clip', **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))
critic_optim (torch.optim.Optimizer) – the optimizer for critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
estimation_step (int) – the number of steps to look ahead. Default to 1.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
See also
Please refer to
BasePolicy
for more detailed explanation.
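The tau parameter governs the soft (Polyak) update of the target network. A minimal sketch on plain lists of floats follows; real networks hold tensors, and the function name is an illustrative assumption rather than a Tianshou API.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Move each target parameter a fraction tau towards its source parameter."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target_params, source_params)]


target = soft_update([0.0, 2.0], [1.0, 2.0], tau=0.5)
print(target)  # [0.5, 2.0]
```

A small tau such as the default 0.005 makes the target network trail the online network slowly, which stabilizes the bootstrapped Q-targets.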
set_exp_noise
(noise: Optional[tianshou.exploration.random.BaseNoise]) → None[source]¶ Set the exploration noise.

train
(mode: bool = True) → tianshou.policy.modelfree.ddpg.DDPGPolicy[source]¶ Set the module in training mode, except for the target network.

process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Preprocess the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, model: str = 'actor', input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
 Returns
A Batch which has 2 keys:
act
the action.
state
the hidden state.
See also
Please refer to
forward()
for more detailed explanation.

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Modify the action from policy.forward with exploration noise.
 Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
 Returns
action in the same form of input “act” but with added exploration noise.
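Adding exploration noise to an action can be sketched with the standard library alone. The helper below is hypothetical and only mirrors the idea of Gaussian noise with standard deviation sigma; it returns the action in the same form (a list here) and leaves any bounding to a later step.

```python
import random


def add_gaussian_noise(act, sigma=0.1, rng=random):
    """Return a copy of `act` with independent zero-mean Gaussian noise on each entry."""
    return [a + rng.gauss(0.0, sigma) for a in act]


rng = random.Random(0)  # seeded generator so the example is reproducible
action = [0.0, 0.5, -1.0]
noisy = add_gaussian_noise(action, sigma=0.1, rng=rng)
print(noisy)
```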

training
: bool¶

class
tianshou.policy.
TD3Policy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = <tianshou.exploration.random.GaussianNoise object>, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.ddpg.DDPGPolicy
Implementation of TD3, arXiv:1802.09477.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
exploration_noise (BaseNoise) – the exploration noise, added to the action. Default to GaussianNoise(sigma=0.1).
policy_noise (float) – the noise used in updating policy network. Default to 0.2.
update_actor_freq (int) – the update frequency of actor network. Default to 2.
noise_clip (float) – the clipping range used in updating policy network. Default to 0.5.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
See also
Please refer to
BasePolicy
for more detailed explanation.
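The TD3-specific parameters policy_noise, noise_clip and update_actor_freq correspond to target policy smoothing and delayed actor updates in the cited paper. The scalar sketch below illustrates those ideas under simplifying assumptions; every function name is hypothetical and none of this is Tianshou's code.

```python
import random


def smoothed_target_action(target_act, policy_noise=0.2, noise_clip=0.5, rng=random):
    """Add clipped Gaussian noise to the target actor's action."""
    noise = rng.gauss(0.0, policy_noise)
    noise = max(-noise_clip, min(noise_clip, noise))
    return target_act + noise


def td3_target(reward, gamma, q1_next, q2_next, done):
    """Clipped double-Q target: bootstrap from the smaller of the two target critics."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min(q1_next, q2_next)


def should_update_actor(step, update_actor_freq=2):
    """The actor is updated once every `update_actor_freq` critic updates."""
    return step % update_actor_freq == 0


rng = random.Random(0)
# Even with a very large policy_noise, the perturbation never exceeds noise_clip.
perturbed = smoothed_target_action(0.0, policy_noise=10.0, noise_clip=0.5, rng=rng)
target_q = td3_target(1.0, 0.5, q1_next=2.0, q2_next=4.0, done=False)
schedule = [should_update_actor(step) for step in range(4)]
print(perturbed, target_q, schedule)
```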
train
(mode: bool = True) → tianshou.policy.modelfree.td3.TD3Policy[source]¶ Set the module in training mode, except for the target network.

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶

actor
: torch.nn.Module¶

actor_optim
: torch.optim.Optimizer¶

critic
: torch.nn.Module¶

critic_optim
: torch.optim.Optimizer¶

class
tianshou.policy.
SACPolicy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, exploration_noise: Optional[tianshou.exploration.random.BaseNoise] = None, deterministic_eval: bool = True, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.ddpg.DDPGPolicy
Implementation of Soft Actor-Critic. arXiv:1812.05905.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha (float or Tuple[float, torch.Tensor, torch.optim.Optimizer]) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
exploration_noise (BaseNoise) – add a noise to action for exploration. Default to None. This is useful when solving hard-exploration problems.
deterministic_eval (bool) – whether to use deterministic action (mean of Gaussian policy) instead of stochastic action sampled by the policy. Default to True.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action) or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
See also
Please refer to
BasePolicy
for more detailed explanation.
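The tuple form of alpha, (target_entropy, log_alpha, alpha_optim), can be built as sketched below. The action dimensionality, the learning rate, the "negative action dimension" heuristic for target_entropy and the made-up log-probabilities are all assumptions for illustration; the update step shows the standard SAC temperature loss rather than a confirmed excerpt of Tianshou's code.

```python
import torch

action_dim = 2  # hypothetical action dimensionality

# Common heuristic (an assumption here): target entropy = -|A|.
target_entropy = -float(action_dim)
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# This tuple is what would be passed as the `alpha` argument.
alpha = (target_entropy, log_alpha, alpha_optim)

# One illustrative tuning step with made-up log-probabilities of sampled actions.
log_prob = torch.tensor([-1.0, -3.0])
alpha_loss = -(log_alpha * (log_prob + target_entropy).detach()).mean()
alpha_optim.zero_grad()
alpha_loss.backward()
alpha_optim.step()

print(log_alpha.item(), log_alpha.exp().item())
```

Here the policy's entropy estimate (2.0) exceeds the target (-2.0), so the step lowers log_alpha and therefore the entropy bonus.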
actor
: torch.nn.Module¶

actor_optim
: torch.optim.Optimizer¶

train
(mode: bool = True) → tianshou.policy.modelfree.sac.SACPolicy[source]¶ Set the module in training mode, except for the target network.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
 Returns
A Batch which has 2 keys:
act
the action.
state
the hidden state.
See also
Please refer to
forward()
for more detailed explanation.

training
: bool¶

critic
: torch.nn.Module¶

critic_optim
: torch.optim.Optimizer¶

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

class
tianshou.policy.
DiscreteSACPolicy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, reward_normalization: bool = False, estimation_step: int = 1, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.sac.SACPolicy
Implementation of SAC for Discrete Action Settings. arXiv:1910.07207.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s -> Q(s))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s -> Q(s))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha (float or Tuple[float, torch.Tensor, torch.optim.Optimizer]) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to
BasePolicy
for more detailed explanation.
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
 Returns
A Batch which has 2 keys:
act
the action.
state
the hidden state.
See also
Please refer to
forward()
for more detailed explanation.

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Modify the action from policy.forward with exploration noise.
 Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
 Returns
action in the same form of input “act” but with added exploration noise.

training
: bool¶

actor
: torch.nn.Module¶

actor_optim
: torch.optim.Optimizer¶

critic
: torch.nn.Module¶

critic_optim
: torch.optim.Optimizer¶
Imitation¶

class
tianshou.policy.
ImitationPolicy
(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of vanilla imitation learning.
 Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> a)
optim (torch.optim.Optimizer) – for optimizing the model.
action_space (gym.Space) – env’s action space.
See also
Please refer to
BasePolicy
for more detailed explanation.
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
 Returns
A Batch which MUST have the following keys:
act
a numpy.ndarray or a torch.Tensor, the action over the given batch data.
state
a dict, a numpy.ndarray or a torch.Tensor, the internal state of the policy; None by default.
Other keys are user-defined and depend on the algorithm. For example,
    # some code
    return Batch(logits=..., act=..., state=None, dist=...)
The keyword policy is reserved and the corresponding data will be stored into the replay buffer. For instance,
    # some code
    return Batch(..., policy=Batch(log_prob=dist.log_prob(act)))
    # and in the sampled data batch, you can directly use
    # batch.policy.log_prob to get your data.
Note
In continuous action space, you should do another step “map_action” to get the real action:
    act = policy(batch).act  # doesn't map to the target action range
    act = policy.map_action(act, batch)

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations with torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
BCQPolicy
(actor: torch.nn.modules.module.Module, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, vae: tianshou.utils.net.continuous.VAE, vae_optim: torch.optim.optimizer.Optimizer, device: Union[str, torch.device] = 'cpu', gamma: float = 0.99, tau: float = 0.005, lmbda: float = 0.75, forward_sampled_times: int = 100, num_sampled_action: int = 10, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of BCQ algorithm. arXiv:1812.02900.
 Parameters
actor (Perturbation) – the actor perturbation. (s, a -> perturbed a)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
vae (VAE) – the VAE network, generating actions similar to those in batch. (s, a -> generated a)
vae_optim (torch.optim.Optimizer) – the optimizer for the VAE network.
device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
tau (float) – param for soft update of the target network. Default to 0.005.
lmbda (float) – param for Clipped Double Q-learning. Default to 0.75.
forward_sampled_times (int) – the number of sampled actions in forward function. The policy samples many actions and takes the action with the max value. Default to 100.
num_sampled_action (int) – the number of sampled actions in calculating target Q. The algorithm samples several actions using VAE, and perturbs each action to get the target Q. Default to 10.
See also
Please refer to
BasePolicy
for more detailed explanation.
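As a rough illustration of the lmbda and tau parameters, the sketch below assumes the weighted min/max combination of the two critic estimates described in the BCQ paper and a standard Polyak (soft) target update. The names clipped_double_q and soft_update are hypothetical and not part of Tianshou; plain floats stand in for tensors.

```python
from typing import List


def clipped_double_q(q1: float, q2: float, lmbda: float = 0.75) -> float:
    """Combine two critic estimates, weighting the smaller one by lmbda (assumed form)."""
    return lmbda * min(q1, q2) + (1.0 - lmbda) * max(q1, q2)


def soft_update(target: List[float], source: List[float], tau: float = 0.005) -> List[float]:
    """Move each target parameter a fraction tau towards the online parameter."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


combined = clipped_double_q(1.0, 3.0)            # 0.75 * 1.0 + 0.25 * 3.0
updated = soft_update([0.0], [1.0], tau=0.5)     # halfway between target and source
```

With lmbda = 1 the combination reduces to the plain minimum of the two critics.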
train
(mode: bool = True) → tianshou.policy.imitation.bcq.BCQPolicy[source]¶ Set the module in training mode, except for the target network.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.
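The shape pitfall in the warning can be reproduced in a few lines. This is a minimal sketch with made-up parameters (a batch of 4, three discrete actions, a one-dimensional continuous action); the variable names are illustrative only.

```python
import torch
from torch.distributions import Categorical, Normal

batch_size = 4

# Discrete case: logits of shape [batch_size, n_actions].
cat = Categorical(logits=torch.zeros(batch_size, 3))
cat_logp = cat.log_prob(cat.sample())            # shape [batch_size]

# Continuous case: a 1-dimensional action gives parameters of shape [batch_size, 1].
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_logp = normal.log_prob(normal.sample())   # shape [batch_size, 1]

advantage = torch.ones(batch_size)               # shape [batch_size]
wrong = advantage * normal_logp                  # broadcasts to [batch_size, batch_size]
right = advantage * normal_logp.squeeze(-1)      # shape [batch_size]
```

No error is raised for the mismatched product, which is why checking shapes explicitly before computing a loss is worthwhile.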

training
: bool¶

class
tianshou.policy.
CQLPolicy
(actor: tianshou.utils.net.continuous.ActorProb, actor_optim: torch.optim.optimizer.Optimizer, critic1: torch.nn.modules.module.Module, critic1_optim: torch.optim.optimizer.Optimizer, critic2: torch.nn.modules.module.Module, critic2_optim: torch.optim.optimizer.Optimizer, cql_alpha_lr: float = 0.0001, cql_weight: float = 1.0, tau: float = 0.005, gamma: float = 0.99, alpha: Union[float, Tuple[float, torch.Tensor, torch.optim.optimizer.Optimizer]] = 0.2, temperature: float = 1.0, with_lagrange: bool = True, lagrange_threshold: float = 10.0, min_action: float = -1.0, max_action: float = 1.0, num_repeat_actions: int = 10, alpha_min: float = 0.0, alpha_max: float = 1000000.0, clip_grad: float = 1.0, device: Union[str, torch.device] = 'cpu', **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.sac.SACPolicy
Implementation of CQL algorithm. arXiv:2006.04779.
 Parameters
actor (ActorProb) – the actor network following the rules in BasePolicy. (s -> a)
actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
cql_alpha_lr (float) – the learning rate of cql_log_alpha. Default to 1e-4.
cql_weight (float) – the value of alpha. Default to 1.0.
tau (float) – param for soft update of the target network. Default to 0.005.
gamma (float) – discount factor, in [0, 1]. Default to 0.99.
alpha (float or (float, torch.Tensor, torch.optim.Optimizer)) – entropy regularization coefficient. Default to 0.2. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.
temperature (float) – the value of temperature. Default to 1.0.
with_lagrange (bool) – whether to use Lagrange. Default to True.
lagrange_threshold (float) – the value of tau in CQL(Lagrange). Default to 10.0.
min_action (float) – The minimum value of each dimension of action. Default to -1.0.
max_action (float) – The maximum value of each dimension of action. Default to 1.0.
num_repeat_actions (int) – The number of times the action is repeated when calculating logsumexp. Default to 10.
alpha_min (float) – lower bound for clipping cql_alpha. Default to 0.0.
alpha_max (float) – upper bound for clipping cql_alpha. Default to 1e6.
clip_grad (float) – clip_grad for updating critic network. Default to 1.0.
device (Union[str, torch.device]) – which device to create this model on. Default to “cpu”.
See also
Please refer to BasePolicy for more detailed explanation.
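The tuple form of alpha, (target_entropy, log_alpha, alpha_optim), might be assembled as sketched below. The action shape, the learning rate, and the choice of the negative action dimensionality as target entropy are assumptions made for illustration (the latter is a common heuristic, not something this page prescribes).

```python
import numpy as np
import torch

action_shape = (2,)   # hypothetical action dimensionality
alpha_lr = 3e-4       # hypothetical learning rate for the temperature

# Common heuristic (assumption): target entropy = -dim(action space).
target_entropy = -float(np.prod(action_shape))
# Learnable log-temperature, initialised to log(1) = 0.
log_alpha = torch.zeros(1, requires_grad=True)
alpha_optim = torch.optim.Adam([log_alpha], lr=alpha_lr)

# Pass this tuple as `alpha` to enable automatic tuning.
alpha = (target_entropy, log_alpha, alpha_optim)
```

Passing a plain float instead keeps the entropy coefficient fixed.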
train
(mode: bool = True) → tianshou.policy.imitation.cql.CQLPolicy[source]¶ Set the module in training mode, except for the target network.

calc_pi_values
(obs_pi: torch.Tensor, obs_to_pred: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

calc_random_values
(obs: torch.Tensor, act: torch.Tensor) → Tuple[torch.Tensor, torch.Tensor][source]¶

process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Preprocess the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶

actor
: torch.nn.Module¶

actor_optim
: torch.optim.Optimizer¶

critic
: torch.nn.Module¶

critic_optim
: torch.optim.Optimizer¶

class
tianshou.policy.
DiscreteBCQPolicy
(model: torch.nn.modules.module.Module, imitator: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, estimation_step: int = 1, target_update_freq: int = 8000, eval_eps: float = 0.001, unlikely_action_threshold: float = 0.3, imitation_logits_penalty: float = 0.01, reward_normalization: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.dqn.DQNPolicy
Implementation of discrete BCQ algorithm. arXiv:1910.01708.
 Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> q_value)
imitator (torch.nn.Module) – a model following the rules in BasePolicy. (s -> imitation_logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency.
eval_eps (float) – the epsilon-greedy noise added in evaluation.
unlikely_action_threshold (float) – the threshold (tau) for unlikely actions, as shown in Equ. (17) in the paper. Default to 0.3.
imitation_logits_penalty (float) – regularization weight for imitation logits. Default to 1e-2.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
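The role of unlikely_action_threshold can be sketched with NumPy as follows. This is an illustration of the idea under the assumption that an action is "unlikely" when its imitation probability, relative to the most probable action, falls below tau; the function name select_action is hypothetical and this is not the library's code.

```python
import numpy as np


def select_action(q_value: np.ndarray, imitation_logits: np.ndarray, tau: float = 0.3) -> np.ndarray:
    """Greedy action over Q-values, restricted to actions the imitator finds likely."""
    # Log-softmax of the imitation logits.
    log_prob = imitation_logits - np.log(np.exp(imitation_logits).sum(axis=-1, keepdims=True))
    # Log of the probability ratio to the most likely action.
    ratio = log_prob - log_prob.max(axis=-1, keepdims=True)
    unlikely = ratio < np.log(tau)
    # Unlikely actions can never win the argmax.
    masked_q = np.where(unlikely, -np.inf, q_value)
    return masked_q.argmax(axis=-1)


q = np.array([[1.0, 5.0, 2.0]])
logits = np.log(np.array([[0.6, 0.1, 0.3]]))
act = select_action(q, logits)  # action 1 has the highest Q but is filtered out
```

Here action 1 has a probability ratio of about 0.17, below the 0.3 threshold, so the policy falls back to action 2.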
train
(mode: bool = True) → tianshou.policy.imitation.discrete_bcq.DiscreteBCQPolicy[source]¶ Set the module in training mode, except for the target network.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, input: str = 'obs', **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data.
If you need to mask the action, please add a “mask” into batch.obs, for example, if we have an environment that has “0/1/2” three actions:
    batch == Batch(
        obs=Batch(
            obs="original obs, with batch_size=1 for demonstration",
            mask=np.array([[False, True, False]]),
            # action 1 is available
            # action 0 and 2 are unavailable
        ),
        ...
    )
 Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
 Returns
A Batch which has 3 keys:
act – the action.
logits – the network’s raw output.
state – the hidden state.
See also
Please refer to forward() for more detailed explanation.
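The mask convention described for forward() (True marks an available action) can be illustrated with NumPy. The Q-values below are made up, and the snippet shows the idea of masking before the argmax rather than the library's internal implementation.

```python
import numpy as np

# Hypothetical Q-values for the three actions "0/1/2", batch_size=1.
q_value = np.array([[0.2, 0.1, 0.9]])
# Only action 1 is available.
mask = np.array([[False, True, False]])

# Unavailable actions get -inf so they are never selected.
masked_q = np.where(mask, q_value, -np.inf)
act = masked_q.argmax(axis=-1)
```

Although action 2 has the largest raw Q-value, the masked argmax returns action 1.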

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
DiscreteCQLPolicy
(model: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, num_quantiles: int = 200, estimation_step: int = 1, target_update_freq: int = 0, reward_normalization: bool = False, min_q_weight: float = 10.0, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.qrdqn.QRDQNPolicy
Implementation of discrete Conservative Q-Learning algorithm. arXiv:2006.04779.
 Parameters
model (torch.nn.Module) – a model following the rules in BasePolicy. (s -> logits)
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
num_quantiles (int) – the number of quantile midpoints in the inverse cumulative distribution function of the value. Default to 200.
estimation_step (int) – the number of steps to look ahead. Default to 1.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network).
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
min_q_weight (float) – the weight for the cql loss.
See also
Please refer to QRDQNPolicy for more detailed explanation.
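To give a sense of what min_q_weight scales, the sketch below computes a conservative regularizer of the form logsumexp over actions of Q minus the Q-value of the dataset action, following the CQL paper. The function name cql_regularizer is hypothetical, and how the term is combined with the quantile regression loss is an assumption (total = base loss + min_q_weight * regularizer).

```python
import numpy as np


def cql_regularizer(q: np.ndarray, act: np.ndarray) -> float:
    """Mean logsumexp_a Q(s, a) minus mean Q(s, a_data), for q of shape [batch, n_actions]."""
    # Numerically stable logsumexp over the action axis.
    q_max = q.max(axis=1, keepdims=True)
    logsumexp = (q_max + np.log(np.exp(q - q_max).sum(axis=1, keepdims=True))).squeeze(1)
    # Q-values of the actions actually stored in the dataset.
    dataset_q = q[np.arange(len(act)), act]
    return float(logsumexp.mean() - dataset_q.mean())


q = np.zeros((2, 4))
act = np.array([0, 1])
penalty = cql_regularizer(q, act)   # equals log(4) when all Q-values are equal
min_q_weight = 10.0
weighted_penalty = min_q_weight * penalty
```

The penalty shrinks as the dataset actions receive higher Q-values than the alternatives.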
learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
DiscreteCRRPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, discount_factor: float = 0.99, policy_improvement_mode: str = 'exp', ratio_upper_bound: float = 20.0, beta: float = 1.0, min_q_weight: float = 10.0, target_update_freq: int = 0, reward_normalization: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.pg.PGPolicy
Implementation of discrete Critic Regularized Regression. arXiv:2006.15134.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the action-value critic (i.e., Q function) network. (s -> Q(s, *))
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1]. Default to 0.99.
policy_improvement_mode (str) – type of the weight function f. Possible values: “binary”/“exp”/“all”. Default to “exp”.
ratio_upper_bound (float) – when policy_improvement_mode is “exp”, the value of the exp function is upper-bounded by this parameter. Default to 20.
beta (float) – when policy_improvement_mode is “exp”, this is the denominator of the exp function. Default to 1.
min_q_weight (float) – weight for CQL loss/regularizer. Default to 10.
target_update_freq (int) – the target network update frequency (0 if you do not use the target network). Default to 0.
reward_normalization (bool) – normalize the reward to Normal(0, 1). Default to False.
See also
Please refer to PGPolicy for more detailed explanation.
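The three policy_improvement_mode options can be read as different weight functions f applied to the advantage. The following is a minimal sketch of that reading for a scalar advantage; the function name crr_weight is hypothetical, and the exact tensor operations used by the library may differ.

```python
import math


def crr_weight(advantage: float, mode: str = "exp",
               beta: float = 1.0, ratio_upper_bound: float = 20.0) -> float:
    """Weight f(advantage) applied to the log-likelihood of the dataset action."""
    if mode == "binary":
        # Only imitate actions that look better than average.
        return 1.0 if advantage > 0 else 0.0
    if mode == "exp":
        # exp(advantage / beta), upper-bounded by ratio_upper_bound.
        return min(math.exp(advantage / beta), ratio_upper_bound)
    if mode == "all":
        # Plain behaviour cloning: every action gets the same weight.
        return 1.0
    raise ValueError(f"unknown policy_improvement_mode: {mode}")


w_neutral = crr_weight(0.0)             # exp(0) = 1
w_clipped = crr_weight(10.0)            # exp(10) exceeds the bound of 20
w_binary = crr_weight(-1.0, "binary")   # negative advantage is ignored
```

A larger beta flattens the exponential weighting towards behaviour cloning.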
training
: bool¶

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

class
tianshou.policy.
GAILPolicy
(actor: torch.nn.modules.module.Module, critic: torch.nn.modules.module.Module, optim: torch.optim.optimizer.Optimizer, dist_fn: Type[torch.distributions.distribution.Distribution], expert_buffer: tianshou.data.buffer.base.ReplayBuffer, disc_net: torch.nn.modules.module.Module, disc_optim: torch.optim.optimizer.Optimizer, disc_update_num: int = 4, eps_clip: float = 0.2, dual_clip: Optional[float] = None, value_clip: bool = False, advantage_normalization: bool = True, recompute_advantage: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.modelfree.ppo.PPOPolicy
Implementation of Generative Adversarial Imitation Learning. arXiv:1606.03476.
 Parameters
actor (torch.nn.Module) – the actor network following the rules in BasePolicy. (s -> logits)
critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (Type[torch.distributions.Distribution]) – distribution class for computing the action.
expert_buffer (ReplayBuffer) – the replay buffer containing expert experience.
disc_net (torch.nn.Module) – the discriminator network, whose input dim equals state dim plus action dim and whose output dim equals 1.
disc_optim (torch.optim.Optimizer) – the optimizer for the discriminator network.
disc_update_num (int) – the number of discriminator grad steps per model grad step. Default to 4.
discount_factor (float) – in [0, 1]. Default to 0.99.
eps_clip (float) – \(\epsilon\) in \(L_{CLIP}\) in the original paper. Default to 0.2.
dual_clip (float) – a parameter c mentioned in arXiv:1912.09729 Equ. 5, where c > 1 is a constant indicating the lower bound. Default to None (dual clipping is not used; set a value such as 5.0 to enable it).
value_clip (bool) – a parameter mentioned in arXiv:1811.02553 Sec. 4.1. Default to False.
advantage_normalization (bool) – whether to do per mini-batch advantage normalization. Default to True.
recompute_advantage (bool) – whether to recompute advantage every update repeat according to https://arxiv.org/pdf/2006.05990.pdf Sec. 3.5. Default to False.
vf_coef (float) – weight for value loss. Default to 0.5.
ent_coef (float) – weight for entropy loss. Default to 0.01.
max_grad_norm (float) – clipping gradients in back propagation. Default to None.
gae_lambda (float) – in [0, 1], param for Generalized Advantage Estimation. Default to 0.95.
reward_normalization (bool) – normalize estimated values to have std close to 1, also normalize the advantage to Normal(0, 1). Default to False.
max_batchsize (int) – the maximum size of the batch when computing GAE, depends on the size of available memory and the memory cost of the model; should be as large as possible within the memory constraint. Default to 256.
action_scaling (bool) – whether to map actions from range [-1, 1] to range [action_spaces.low, action_spaces.high]. Default to True.
action_bound_method (str) – method to bound action to range [-1, 1], can be either “clip” (for simply clipping the action), “tanh” (for applying tanh squashing) for now, or empty string for no bounding. Default to “clip”.
action_space (Optional[gym.Space]) – env’s action space, mandatory if you want to use option “action_scaling” or “action_bound_method”. Default to None.
lr_scheduler – a learning rate scheduler that adjusts the learning rate in optimizer in each policy.update(). Default to None (no lr_scheduler).
deterministic_eval (bool) – whether to use deterministic action instead of stochastic action sampled by the policy. Default to False.
See also
Please refer to PPOPolicy for more detailed explanation.
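The interplay of action_bound_method and action_scaling can be sketched as a two-step mapping: first bound the raw action to [-1, 1], then rescale it linearly to the environment's range. The function below is an illustrative NumPy version with a hypothetical name and scalar bounds; it is not the library's own routine.

```python
import numpy as np


def bound_and_scale(act: np.ndarray, low: float, high: float,
                    bound_method: str = "clip", scaling: bool = True) -> np.ndarray:
    """Bound a raw action to [-1, 1], then map it linearly to [low, high]."""
    if bound_method == "clip":
        act = np.clip(act, -1.0, 1.0)
    elif bound_method == "tanh":
        act = np.tanh(act)
    # An empty string means no bounding is applied.
    if scaling:
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


raw = np.array([2.0, -1.0, 0.0])
env_act = bound_and_scale(raw, low=0.0, high=10.0)
```

In this example the out-of-range value 2.0 is clipped to 1.0 and lands on the upper bound, while 0.0 maps to the midpoint of the range.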
process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Preprocess the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.

learn
(batch: tianshou.data.batch.Batch, batch_size: int, repeat: int, **kwargs: Any) → Dict[str, List[float]][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶
Model-based¶

class
tianshou.policy.
PSRLPolicy
(trans_count_prior: numpy.ndarray, rew_mean_prior: numpy.ndarray, rew_std_prior: numpy.ndarray, discount_factor: float = 0.99, epsilon: float = 0.01, add_done_loop: bool = False, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of Posterior Sampling Reinforcement Learning.
Reference: Strens M. A Bayesian framework for reinforcement learning [C] //ICML. 2000, 2000: 943-950.
 Parameters
trans_count_prior (np.ndarray) – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).
rew_mean_prior (np.ndarray) – means of the normal priors of rewards, with shape (n_state, n_action).
rew_std_prior (np.ndarray) – standard deviations of the normal priors of rewards, with shape (n_state, n_action).
discount_factor (float) – in [0, 1].
epsilon (float) – for precision control in value iteration.
add_done_loop (bool) – whether to add an extra self-loop for the terminal state in MDP. Default to False.
See also
Please refer to BasePolicy for more detailed explanation.
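The epsilon parameter controls when value iteration stops. The sketch below shows a generic tabular value iteration over a transition tensor and reward matrix with the shapes listed above; the function name, the max_iter guard, and the tiny two-state MDP are assumptions for illustration, and the posterior sampling step of PSRL is left out.

```python
import numpy as np


def value_iteration(trans_prob: np.ndarray, rew: np.ndarray,
                    discount_factor: float = 0.99, eps: float = 0.01,
                    max_iter: int = 10000):
    """trans_prob: (n_state, n_action, n_state); rew: (n_state, n_action)."""
    value = np.zeros(rew.shape[0])
    q = rew.copy()
    for _ in range(max_iter):
        # Expected return of every (state, action) pair under the current values.
        q = rew + discount_factor * trans_prob.dot(value)
        new_value = q.max(axis=1)
        converged = np.abs(new_value - value).max() < eps
        value = new_value
        if converged:
            break
    return q.argmax(axis=1), value


# Hypothetical MDP: every transition leads to state 0; action 1 in state 0 pays 1.
trans = np.zeros((2, 2, 2))
trans[:, :, 0] = 1.0
rew = np.array([[0.0, 1.0], [0.0, 0.0]])
policy, value = value_iteration(trans, rew, discount_factor=0.5, eps=0.01)
```

A smaller eps yields values closer to the fixed point at the cost of more sweeps.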
forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data with PSRL model.
 Returns
A Batch with “act” key containing the action.
See also
Please refer to forward() for more detailed explanation.

learn
(batch: tianshou.data.batch.Batch, *args: Any, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶

class
tianshou.policy.
ICMPolicy
(policy: tianshou.policy.base.BasePolicy, model: tianshou.utils.net.discrete.IntrinsicCuriosityModule, optim: torch.optim.optimizer.Optimizer, lr_scale: float, reward_scale: float, forward_loss_weight: float, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Implementation of Intrinsic Curiosity Module. arXiv:1705.05363.
 Parameters
policy (BasePolicy) – a base policy to add ICM to.
model (IntrinsicCuriosityModule) – the ICM model.
optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
lr_scale (float) – the scaling factor for ICM learning.
forward_loss_weight (float) – the weight for forward model loss.
See also
Please refer to BasePolicy for more detailed explanation.
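One plausible reading of lr_scale, forward_loss_weight, and reward_scale (the last appears in the signature but is not described in the parameter list) follows the formulation in the ICM paper: the forward and inverse losses are mixed by a weight, and the forward prediction error is added to the environment reward as an intrinsic bonus. The functions below are hypothetical scalar sketches of that reading, not the library's code.

```python
def icm_loss(inverse_loss: float, forward_loss: float,
             forward_loss_weight: float, lr_scale: float) -> float:
    """Weighted mix of inverse and forward model losses, scaled by lr_scale (assumed form)."""
    mixed = (1.0 - forward_loss_weight) * inverse_loss + forward_loss_weight * forward_loss
    return mixed * lr_scale


def shaped_reward(env_rew: float, forward_error: float, reward_scale: float) -> float:
    """Environment reward plus a scaled curiosity bonus (assumed form)."""
    return env_rew + reward_scale * forward_error


loss = icm_loss(inverse_loss=2.0, forward_loss=4.0, forward_loss_weight=0.25, lr_scale=2.0)
rew = shaped_reward(env_rew=1.0, forward_error=0.5, reward_scale=0.5)
```

Setting reward_scale to 0 in this sketch recovers the unmodified environment reward.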
train
(mode: bool = True) → tianshou.policy.modelbased.icm.ICMPolicy[source]¶ Set the module in training mode.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch, numpy.ndarray]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Compute action over the given batch data by inner policy.
See also
Please refer to forward() for more detailed explanation.

exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Modify the action from policy.forward with exploration noise.
 Parameters
act – a data batch or numpy.ndarray which is the action taken by policy.forward.
batch – the input batch for policy.forward, kept for advanced usage.
 Returns
action in the same form of input “act” but with added exploration noise.

process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Preprocess the data from the provided replay buffer.
Used in update(). Check out policy.process_fn for more information.

post_process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indices: numpy.ndarray) → None[source]¶ Postprocess the data from the provided replay buffer.
Typical usage is to update the sampling weight in prioritized experience replay. Used in update().

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, float][source]¶ Update policy with a given batch of data.
 Returns
A dict, including the data needed to be logged (e.g., loss).
Note
In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.
Warning
If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will amplify this error.

training
: bool¶
Multi-agent¶

class
tianshou.policy.
MultiAgentPolicyManager
(policies: List[tianshou.policy.base.BasePolicy], env: None, **kwargs: Any)[source]¶ Bases:
tianshou.policy.base.BasePolicy
Multi-agent policy manager for MARL.
This multi-agent policy manager accepts a list of BasePolicy. It dispatches the batch data to each of these policies when “forward” is called. The same applies to “process_fn” and “learn”: it splits the data and feeds each part to the corresponding policy. A figure in Multi-Agent Reinforcement Learning can help you better understand this procedure.
replace_policy
(policy: tianshou.policy.base.BasePolicy, agent_id: int) → None[source]¶ Replace the “agent_id”-th policy in this manager.

process_fn
(batch: tianshou.data.batch.Batch, buffer: tianshou.data.buffer.base.ReplayBuffer, indice: numpy.ndarray) → tianshou.data.batch.Batch[source]¶ Dispatch batch data from obs.agent_id to every policy’s process_fn.
Save original multi-dimensional rew in “save_rew”, set rew to the reward of each agent during their “process_fn”, and restore the original reward afterwards.

exploration_noise
(act: Union[numpy.ndarray, tianshou.data.batch.Batch], batch: tianshou.data.batch.Batch) → Union[numpy.ndarray, tianshou.data.batch.Batch][source]¶ Add exploration noise from sub-policy onto act.

forward
(batch: tianshou.data.batch.Batch, state: Optional[Union[dict, tianshou.data.batch.Batch]] = None, **kwargs: Any) → tianshou.data.batch.Batch[source]¶ Dispatch batch data from obs.agent_id to every policy’s forward.
 Parameters
state – if None, it means all agents have no state. If not None, it should contain keys of “agent_1”, “agent_2”, …
 Returns
a Batch with the following contents:
    {
        "act": actions corresponding to the input
        "state": {
            "agent_1": output state of agent_1's policy for the state
            "agent_2": xxx
            ...
            "agent_n": xxx}
        "out": {
            "agent_1": output of agent_1's policy for the input
            "agent_2": xxx
            ...
            "agent_n": xxx}
    }

learn
(batch: tianshou.data.batch.Batch, **kwargs: Any) → Dict[str, Union[float, List[float]]][source]¶ Dispatch the data to all policies for learning.
 Returns
a dict with the following contents:
    {
        "agent_1/item1": item 1 of agent_1's policy.learn output
        "agent_1/item2": item 2 of agent_1's policy.learn output
        "agent_2/xxx": xxx
        ...
        "agent_n/xxx": xxx
    }
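The key layout of the returned dict can be reproduced with a small helper that prefixes each sub-policy's learn output with its agent name. The helper name merge_learn_results and the sample numbers are hypothetical; the sketch only mirrors the "agent/item" naming shown above.

```python
from typing import Dict, List, Union

LearnOutput = Dict[str, Union[float, List[float]]]


def merge_learn_results(per_agent: Dict[str, LearnOutput]) -> LearnOutput:
    """Flatten {agent: {item: value}} into {"agent/item": value}."""
    results: LearnOutput = {}
    for agent_id, out in per_agent.items():
        for key, value in out.items():
            results[f"{agent_id}/{key}"] = value
    return results


merged = merge_learn_results({
    "agent_1": {"loss": 0.5},
    "agent_2": {"loss": 0.1, "loss/clip": [0.2, 0.3]},
})
```

Flat, prefixed keys like these are convenient for loggers that group metrics by path.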

training
: bool¶
