class SACPolicy(*, actor: Module | ActorProb, actor_optim: Optimizer, critic: Module, critic_optim: Optimizer, action_space: Space, critic2: Module | None = None, critic2_optim: Optimizer | None = None, tau: float = 0.005, gamma: float = 0.99, alpha: float | tuple[float, Tensor, Optimizer] = 0.2, estimation_step: int = 1, exploration_noise: BaseNoise | Literal['default'] | None = None, deterministic_eval: bool = True, action_scaling: bool = True, action_bound_method: Literal['clip'] | None = 'clip', observation_space: Space | None = None, lr_scheduler: LRScheduler | MultipleLRSchedulers | None = None)

Implementation of Soft Actor-Critic. arXiv:1812.05905.

  • actor – the actor network following the rules (s -> dist_input_BD)

  • actor_optim – the optimizer for actor network.

  • critic – the first critic network. (s, a -> Q(s, a))

  • critic_optim – the optimizer for the first critic network.

  • action_space – Env’s action space. Should be gym.spaces.Box.

  • critic2 – the second critic network. (s, a -> Q(s, a)). If None, use the same network as critic (via deepcopy).

  • critic2_optim – the optimizer for the second critic network. If None, clone critic_optim to use for critic2.parameters().

  • tau – param for soft update of the target network.

  • gamma – discount factor, in [0, 1].

  • alpha – entropy regularization coefficient. If a tuple (target_entropy, log_alpha, alpha_optim) is provided, then alpha is automatically tuned.

  • estimation_step – The number of steps to look ahead.

  • exploration_noise – add noise to action for exploration. This is useful when solving “hard exploration” problems. “default” is equivalent to GaussianNoise(sigma=0.1).

  • deterministic_eval – whether to use deterministic action (mode of Gaussian policy) in evaluation mode instead of stochastic action sampled by the policy. Does not affect training.

  • action_scaling – whether to map actions from range [-1, 1] to range [action_space.low, action_space.high].

  • action_bound_method – method to bound action to range [-1, 1]. Can be either “clip” (for simply clipping the action) or None for no bounding. Only used if the action_space is continuous.

  • observation_space – Env’s observation space.

  • lr_scheduler – a learning rate scheduler that adjusts the learning rate of the optimizer on each call to policy.update().
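As a sketch of the automatic tuning option for alpha described above, the tuple (target_entropy, log_alpha, alpha_optim) can be assembled from plain PyTorch objects. The action shape, the learning rate, and the choice of target entropy as the negative action dimension are assumptions made for illustration; the last one is a common heuristic, not something this page prescribes.

```python
import math

import torch

# Hypothetical action shape; in practice this would come from action_space.shape.
action_shape = (3,)

# Assumed heuristic: target entropy equal to minus the action dimensionality.
target_entropy = -float(math.prod(action_shape))

# log_alpha is the trainable tensor; alpha itself would be exp(log_alpha).
log_alpha = torch.zeros(1, requires_grad=True)

# Assumed learning rate, chosen only for this example.
alpha_optim = torch.optim.Adam([log_alpha], lr=3e-4)

# This tuple matches the documented form of the alpha parameter.
alpha = (target_entropy, log_alpha, alpha_optim)
```

The tuple would then be passed as alpha=alpha when constructing the policy; passing a plain float instead keeps the entropy coefficient fixed.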

See also

Please refer to BasePolicy for more detailed explanation.
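To make the action_bound_method and action_scaling parameters of the constructor more concrete, here is a minimal pure-Python sketch of clipping a raw action to [-1, 1] and then mapping it linearly to [low, high]. The function name bound_and_scale and its arguments are hypothetical, and the linear mapping is an assumption about how such scaling is usually done, not a copy of the library code.

```python
def bound_and_scale(action, low, high, clip=True, scale=True):
    """Clip each component to [-1, 1], then map it linearly to [low, high].

    Hypothetical helper for illustration only.
    """
    result = []
    for a, lo, hi in zip(action, low, high):
        if clip:
            # Corresponds to action_bound_method="clip".
            a = max(-1.0, min(1.0, a))
        if scale:
            # Corresponds to action_scaling=True: -1 maps to lo, +1 maps to hi.
            a = lo + (hi - lo) * (a + 1.0) / 2.0
        result.append(a)
    return result


scaled = bound_and_scale([0.0, 2.0, -1.0], low=[0.0, 0.0, -2.0], high=[10.0, 4.0, 2.0])
```

With both options disabled the action passes through unchanged, which is why scaling without bounding can produce values outside the action space.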

forward(batch: ObsBatchProtocol, state: dict | Batch | ndarray | None = None, **kwargs: Any) → DistLogProbBatchProtocol

Compute action over the given batch data.


Returns a Batch which has 2 keys:

  • act – the action.

  • state – the hidden state.

See also

Please refer to BasePolicy.forward() for more detailed explanation.

property is_auto_alpha: bool

learn(batch: RolloutBatchProtocol, *args: Any, **kwargs: Any) → TSACTrainingStats

Update policy with a given batch of data.


Returns a dataclass object containing the data to be logged (e.g., loss).


In order to distinguish the collecting state, updating state and testing state, you can check the policy state by self.training and self.updating. Please refer to States for policy for more detailed explanation.


If you use torch.distributions.Normal and torch.distributions.Categorical to calculate the log_prob, please be careful about the shape: Categorical distribution gives “[batch_size]” shape while Normal distribution gives “[batch_size, 1]” shape. The auto-broadcasting of numerical operations on torch tensors will silently amplify this error.
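The shape pitfall in this warning can be reproduced with a few lines of PyTorch. This is a minimal sketch: the batch size, the number of categories, and the variable names are arbitrary choices for the example.

```python
import torch
from torch.distributions import Categorical, Normal

batch_size = 4

# Normal with parameters of shape [batch_size, 1] yields log_prob of the same shape.
normal = Normal(torch.zeros(batch_size, 1), torch.ones(batch_size, 1))
normal_log_prob = normal.log_prob(normal.sample())

# Categorical over 3 classes yields log_prob of shape [batch_size].
categorical = Categorical(logits=torch.zeros(batch_size, 3))
categorical_log_prob = categorical.log_prob(categorical.sample())

# Mixing the two shapes broadcasts silently to [batch_size, batch_size].
mismatched = normal_log_prob + categorical_log_prob

# Dropping the trailing dimension first keeps the result at [batch_size].
aligned = normal_log_prob.squeeze(-1) + categorical_log_prob
```

No exception is raised for the mismatched sum, so checking tensor shapes before combining log-probabilities with returns or other per-sample quantities is a cheap safeguard.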

sync_weight() → None

Soft-update the weight for the target network.
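A soft update is commonly implemented as Polyak averaging controlled by tau. The sketch below shows that rule on plain Python lists; the helper name soft_update is hypothetical, and the exact formula is an assumption based on the usual definition rather than a quote of the library source.

```python
def soft_update(target, source, tau=0.005):
    """Return target weights moved a fraction tau toward the source weights.

    Hypothetical helper: new_target = (1 - tau) * target + tau * source.
    """
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


target_weights = [0.0, 1.0]
online_weights = [1.0, 1.0]

# With the default tau the target moves only slightly per call.
slow = soft_update(target_weights, online_weights)

# A large tau is used here only to make the effect easy to see.
fast = soft_update(target_weights, online_weights, tau=0.5)
```

A small tau makes the target network lag behind the online critic, which is the usual motivation for soft updates over hard copies.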

train(mode: bool = True) → Self

Set the module in training mode, except for the target network.
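One way to get this behavior in PyTorch is to override train() so that the mode is propagated only to the online networks. The class TinyPolicy and the attribute names critic and critic_old below are hypothetical stand-ins for illustration, not the library's implementation.

```python
import copy

import torch.nn as nn


class TinyPolicy(nn.Module):
    """Hypothetical module holding an online critic and a frozen target copy."""

    def __init__(self) -> None:
        super().__init__()
        self.critic = nn.Linear(4, 1)
        # The target network starts as a copy and is kept in eval mode.
        self.critic_old = copy.deepcopy(self.critic)
        self.critic_old.eval()

    def train(self, mode: bool = True) -> "TinyPolicy":
        # Switch only the online network; the target network is left untouched.
        self.training = mode
        self.critic.train(mode)
        return self


policy = TinyPolicy()
policy.train()
```

After policy.train(), the online critic is in training mode while the target copy stays in evaluation mode.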

class SACTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float, alpha: float | None = None, alpha_loss: float | None = None)
actor_loss: float
alpha: float | None = None
alpha_loss: float | None = None
critic1_loss: float
critic2_loss: float