tianshou.policy¶
-
class
tianshou.policy.BasePolicy(**kwargs)[source]¶ Bases:
abc.ABC,torch.nn.modules.module.ModuleTianshou aims to modularizing RL algorithms. It comes into several classes of policies in Tianshou. All of the policy classes must inherit
BasePolicy.A policy class typically has four parts:
__init__(): initialize the policy, including coping the target network and so on;__call__(): compute action with given observation;process_fn(): pre-process data from the replay buffer (this function can interact with replay buffer);learn(): update policy with a given batch of data.
Most of the policy needs a neural network to predict the action and an optimizer to optimize the policy. The rules of self-defined networks are:
Input: observation
obs(may be anumpy.ndarrayortorch.Tensor), hidden statestate(for RNN usage), and other informationinfoprovided by the environment.Output: some
logitsand the next hidden statestate. The logits could be a tuple instead of atorch.Tensor. It depends on how the policy process the network output. For example, in PPO, the return of the network might be(mu, sigma), statefor Gaussian policy.
Since
BasePolicyinheritstorch.nn.Module, you can operateBasePolicyalmost the same astorch.nn.Module, for instance, load and save the model:torch.save(policy.state_dict(), 'policy.pth') policy.load_state_dict(torch.load('policy.pth'))
-
abstract
__call__(batch, state=None, **kwargs)[source]¶ Compute action over the given batch data.
- Returns
A
Batchwhich MUST have the following keys:actan numpy.ndarray or a torch.Tensor, the action over given batch data.statea dict, an numpy.ndarray or a torch.Tensor, the internal state of the policy,Noneas default.
Other keys are user-defined. It depends on the algorithm. For example,
# some code return Batch(logits=..., act=..., state=None, dist=...)
-
class
tianshou.policy.DQNPolicy(model, optim, discount_factor=0.99, estimation_step=1, target_update_freq=0, **kwargs)[source]¶ Bases:
tianshou.policy.base.BasePolicyImplementation of Deep Q Network. arXiv:1312.5602
- Parameters
model (torch.nn.Module) – a model following the rules in
BasePolicy. (s -> logits)optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
discount_factor (float) – in [0, 1].
estimation_step (int) – greater than 1, the number of steps to look ahead.
target_update_freq (int) – the target network update frequency (
0if you do not use the target network).
-
__call__(batch, state=None, model='model', input='obs', eps=None, **kwargs)[source]¶ Compute action over the given batch data.
- Parameters
eps (float) – in [0, 1], for epsilon-greedy exploration method.
- Returns
A
Batchwhich has 3 keys:actthe action.logitsthe network’s raw output.statethe hidden state.
More information can be found at
__call__().
-
learn(batch, **kwargs)[source]¶ Update policy with a given batch of data.
- Returns
A dict which includes loss and its corresponding label.
-
class
tianshou.policy.PGPolicy(model, optim, dist_fn=<class 'torch.distributions.categorical.Categorical'>, discount_factor=0.99, **kwargs)[source]¶ Bases:
tianshou.policy.base.BasePolicyImplementation of Vanilla Policy Gradient.
- Parameters
model (torch.nn.Module) – a model following the rules in
BasePolicy. (s -> logits)optim (torch.optim.Optimizer) – a torch.optim for optimizing the model.
dist_fn (torch.distributions.Distribution) – for computing the action.
discount_factor (float) – in [0, 1].
-
__call__(batch, state=None, **kwargs)[source]¶ Compute action over the given batch data.
- Returns
A
Batchwhich has 4 keys:actthe action.logitsthe network’s raw output.distthe action distribution.statethe hidden state.
More information can be found at
__call__().
-
class
tianshou.policy.A2CPolicy(actor, critic, optim, dist_fn=<class 'torch.distributions.categorical.Categorical'>, discount_factor=0.99, vf_coef=0.5, ent_coef=0.01, max_grad_norm=None, **kwargs)[source]¶ Bases:
tianshou.policy.pg.PGPolicyImplementation of Synchronous Advantage Actor-Critic. arXiv:1602.01783
- Parameters
actor (torch.nn.Module) – the actor network following the rules in
BasePolicy. (s -> logits)critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (torch.distributions.Distribution) – for computing the action, defaults to
torch.distributions.Categorical.discount_factor (float) – in [0, 1], defaults to 0.99.
vf_coef (float) – weight for value loss, defaults to 0.5.
ent_coef (float) – weight for entropy loss, defaults to 0.01.
max_grad_norm (float) – clipping gradients in back propagation, defaults to
None.
-
__call__(batch, state=None, **kwargs)[source]¶ Compute action over the given batch data.
- Returns
A
Batchwhich has 4 keys:actthe action.logitsthe network’s raw output.distthe action distribution.statethe hidden state.
More information can be found at
__call__().
-
class
tianshou.policy.DDPGPolicy(actor, actor_optim, critic, critic_optim, tau=0.005, gamma=0.99, exploration_noise=0.1, action_range=None, reward_normalization=False, ignore_done=False, **kwargs)[source]¶ Bases:
tianshou.policy.base.BasePolicyImplementation of Deep Deterministic Policy Gradient. arXiv:1509.02971
- Parameters
actor (torch.nn.Module) – the actor network following the rules in
BasePolicy. (s -> logits)actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic (torch.nn.Module) – the critic network. (s, a -> Q(s, a))
critic_optim (torch.optim.Optimizer) – the optimizer for critic network.
tau (float) – param for soft update of the target network, defaults to 0.005.
gamma (float) – discount factor, in [0, 1], defaults to 0.99.
exploration_noise (float) – the noise intensity, add to the action, defaults to 0.1.
action_range ([float, float]) – the action range (minimum, maximum).
reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to
False.ignore_done (bool) – ignore the done flag while training the policy, defaults to
False.
-
__call__(batch, state=None, model='actor', input='obs', eps=None, **kwargs)[source]¶ Compute action over the given batch data.
- Parameters
eps (float) – in [0, 1], for exploration use.
- Returns
A
Batchwhich has 2 keys:actthe action.statethe hidden state.
More information can be found at
__call__().
-
learn(batch, **kwargs)[source]¶ Update policy with a given batch of data.
- Returns
A dict which includes loss and its corresponding label.
-
class
tianshou.policy.PPOPolicy(actor, critic, optim, dist_fn, discount_factor=0.99, max_grad_norm=0.5, eps_clip=0.2, vf_coef=0.5, ent_coef=0.0, action_range=None, **kwargs)[source]¶ Bases:
tianshou.policy.pg.PGPolicyImplementation of Proximal Policy Optimization. arXiv:1707.06347
- Parameters
actor (torch.nn.Module) – the actor network following the rules in
BasePolicy. (s -> logits)critic (torch.nn.Module) – the critic network. (s -> V(s))
optim (torch.optim.Optimizer) – the optimizer for actor and critic network.
dist_fn (torch.distributions.Distribution) – for computing the action.
discount_factor (float) – in [0, 1], defaults to 0.99.
max_grad_norm (float) – clipping gradients in back propagation, defaults to
None.eps_clip (float) –
in
in the original
paper, defaults to 0.2.vf_coef (float) – weight for value loss, defaults to 0.5.
ent_coef (float) – weight for entropy loss, defaults to 0.01.
action_range ([float, float]) – the action range (minimum, maximum).
-
__call__(batch, state=None, model='actor', **kwargs)[source]¶ Compute action over the given batch data.
- Returns
A
Batchwhich has 4 keys:actthe action.logitsthe network’s raw output.distthe action distribution.statethe hidden state.
More information can be found at
__call__().
-
class
tianshou.policy.TD3Policy(actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim, tau=0.005, gamma=0.99, exploration_noise=0.1, policy_noise=0.2, update_actor_freq=2, noise_clip=0.5, action_range=None, reward_normalization=False, ignore_done=False, **kwargs)[source]¶ Bases:
tianshou.policy.ddpg.DDPGPolicyImplementation of Twin Delayed Deep Deterministic Policy Gradient, arXiv:1802.09477
- Parameters
actor (torch.nn.Module) – the actor network following the rules in
BasePolicy. (s -> logits)actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network, defaults to 0.005.
gamma (float) – discount factor, in [0, 1], defaults to 0.99.
exploration_noise (float) – the noise intensity, add to the action, defaults to 0.1.
policy_noise (float) – the noise used in updating policy network, default to 0.2.
update_actor_freq (int) – the update frequency of actor network, default to 2.
noise_clip (float) – the clipping range used in updating policy network, default to 0.5.
action_range ([float, float]) – the action range (minimum, maximum).
reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to
False.ignore_done (bool) – ignore the done flag while training the policy, defaults to
False.
-
class
tianshou.policy.SACPolicy(actor, actor_optim, critic1, critic1_optim, critic2, critic2_optim, tau=0.005, gamma=0.99, alpha=0.2, action_range=None, reward_normalization=False, ignore_done=False, **kwargs)[source]¶ Bases:
tianshou.policy.ddpg.DDPGPolicyImplementation of Soft Actor-Critic. arXiv:1812.05905
- Parameters
actor (torch.nn.Module) – the actor network following the rules in
BasePolicy. (s -> logits)actor_optim (torch.optim.Optimizer) – the optimizer for actor network.
critic1 (torch.nn.Module) – the first critic network. (s, a -> Q(s, a))
critic1_optim (torch.optim.Optimizer) – the optimizer for the first critic network.
critic2 (torch.nn.Module) – the second critic network. (s, a -> Q(s, a))
critic2_optim (torch.optim.Optimizer) – the optimizer for the second critic network.
tau (float) – param for soft update of the target network, defaults to 0.005.
gamma (float) – discount factor, in [0, 1], defaults to 0.99.
exploration_noise (float) – the noise intensity, add to the action, defaults to 0.1.
alpha (float) – entropy regularization coefficient, default to 0.2.
action_range ([float, float]) – the action range (minimum, maximum).
reward_normalization (bool) – normalize the reward to Normal(0, 1), defaults to
False.ignore_done (bool) – ignore the done flag while training the policy, defaults to
False.
-
__call__(batch, state=None, input='obs', **kwargs)[source]¶ Compute action over the given batch data.
- Parameters
eps (float) – in [0, 1], for exploration use.
- Returns
A
Batchwhich has 2 keys:actthe action.statethe hidden state.
More information can be found at
__call__().

is the discount factor,
,
is the done flag of step
. If there is no target network, the
is equal
to
.
is the terminal time step,