sac
Source code: tianshou/algorithm/modelfree/sac.py
- correct_log_prob_gaussian_tanh(log_prob: Tensor, tanh_squashed_action: Tensor, eps: float = 1.1920928955078125e-07) -> Tensor
Applies the correction for tanh squashing when computing the log probability of an action sampled from a Gaussian.
See equation 21 in the original SAC paper.
- Parameters:
log_prob – log probability of the action
tanh_squashed_action – action squashed to values in (-1, 1) range by tanh
eps – epsilon for numerical stability; the default equals the float32 machine epsilon (2**-23)
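The correction can be illustrated with a small standard-library sketch. It assumes the usual change-of-variables form, in which the sum of log(1 - a_i^2 + eps) over action dimensions is subtracted from the Gaussian log probability. The helper names `gaussian_log_prob` and `tanh_corrected_log_prob` are hypothetical and operate on plain Python lists; this is not the library's tensor implementation.

```python
import math

FLOAT32_EPS = 2.0 ** -23  # same value as the default `eps` in the signature


def gaussian_log_prob(u, mean, std):
    """Log density of a diagonal Gaussian, summed over action dimensions."""
    return sum(
        -0.5 * ((ui - mi) / si) ** 2 - math.log(si) - 0.5 * math.log(2 * math.pi)
        for ui, mi, si in zip(u, mean, std)
    )


def tanh_corrected_log_prob(log_prob, squashed_action, eps=FLOAT32_EPS):
    """Subtract sum_i log(1 - a_i^2 + eps) from the Gaussian log probability."""
    correction = sum(math.log(1.0 - a * a + eps) for a in squashed_action)
    return log_prob - correction


u = [0.3, -1.2]                    # pre-squash Gaussian sample (illustrative values)
a = [math.tanh(ui) for ui in u]    # squashed action, each component in (-1, 1)
log_prob_u = gaussian_log_prob(u, mean=[0.0, 0.0], std=[1.0, 1.0])
log_prob_a = tanh_corrected_log_prob(log_prob_u, a)
```

Because tanh compresses the real line into (-1, 1), the density of the squashed action is higher than that of the unsquashed sample, so `log_prob_a` exceeds `log_prob_u` here. The `eps` term keeps the logarithm finite when a component of the action saturates near -1 or 1.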
- class SACTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float, alpha: float | None = None, alpha_loss: float | None = None)
Bases: TrainingStats
- actor_loss: float
- critic1_loss: float
- critic2_loss: float
- alpha: float | None = None
- alpha_loss: float | None = None
- class SACPolicy(*, actor: Module | ContinuousActorProbabilistic, exploration_noise: BaseNoise | Literal['default'] | None = None, deterministic_eval: bool = True, action_scaling: bool = True, action_space: Space, observation_space: Space | None = None)
Bases: ContinuousPolicyWithExplorationNoise
- Parameters:
actor – the actor network following the rules (s -> dist_input_BD)
exploration_noise – add noise to action for exploration. This is useful when solving “hard exploration” problems. “default” is equivalent to GaussianNoise(sigma=0.1).
deterministic_eval – flag indicating whether the policy should use deterministic actions (the mode of the action distribution) instead of stochastic ones (random samples) during evaluation. When enabled, the policy always selects the most probable action under the learned distribution during evaluation, while still sampling stochastically during training. This separates exploration (training) from exploitation (evaluation). Deterministic actions are generally preferred for final deployment and reproducible evaluation, as they give consistent behavior, reduce variance in performance metrics, and are easier for human observers to interpret. This parameter only affects behavior when the policy is not within a training step: when collecting rollouts for training, actions remain stochastic regardless of this setting to maintain proper exploration behavior.
action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.
action_space – the environment’s action space.
observation_space – the environment’s observation space.
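The linear transformation described under action_scaling can be sketched as follows. The function `scale_action` is a hypothetical helper written for illustration on plain Python lists; it assumes the standard affine map from [-1, 1] onto [low, high] and is not the library's own code.

```python
def scale_action(action, low, high):
    """Linearly map each component from [-1, 1] to [low_i, high_i]."""
    return [lo + (hi - lo) * (a + 1.0) / 2.0 for a, lo, hi in zip(action, low, high)]


# Illustrative bounds standing in for action_space.low and action_space.high.
low, high = [-2.0, 0.0], [2.0, 10.0]

scaled = scale_action([-1.0, 0.5], low, high)
midpoint = scale_action([0.0, 0.0], low, high)
upper = scale_action([1.0, 1.0], low, high)
```

With these bounds, -1 maps to the lower bound, 0 to the midpoint of each range, and 1 to the upper bound. If the actor already emits actions in the environment's range, applying this map would distort them, which is why the flag should be disabled in that case.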
- forward(batch: ObsBatchProtocol, state: dict | Batch | ndarray | None = None, **kwargs: Any) -> DistLogProbBatchProtocol
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for the forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this, since the former takes care of running the registered hooks while the latter silently ignores them.
- class Alpha
Bases: ABC
Defines the interface for the entropy regularization coefficient alpha.
- abstract property value: float
Retrieves the current value of alpha.
- class FixedAlpha(alpha: float)
Bases: Alpha
Represents a fixed entropy regularization coefficient alpha.
- property value: float
Retrieves the current value of alpha.
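The relationship between an abstract interface with a `value` property and a fixed implementation can be sketched in standalone Python. The classes `AlphaSketch` and `FixedAlphaSketch` are hypothetical stand-ins that mirror the documented shape; they are not the library's classes and make no claim about its internals.

```python
from abc import ABC, abstractmethod


class AlphaSketch(ABC):
    """Stand-in for an interface exposing the current value of alpha."""

    @property
    @abstractmethod
    def value(self) -> float:
        """Return the current value of alpha."""


class FixedAlphaSketch(AlphaSketch):
    """Stand-in for a coefficient that never changes during training."""

    def __init__(self, alpha: float) -> None:
        self._value = alpha

    @property
    def value(self) -> float:
        return self._value


fixed = FixedAlphaSketch(0.2)
current_alpha = fixed.value

# The abstract base cannot be instantiated directly.
try:
    AlphaSketch()
    abstract_instantiable = True
except TypeError:
    abstract_instantiable = False
```

Reading alpha through a property lets calling code treat fixed and automatically tuned coefficients uniformly.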
- class AutoAlpha(target_entropy: float, log_alpha: float, optim: OptimizerFactory)
Bases: Module, Alpha
Represents an entropy regularization coefficient alpha that is automatically tuned.
- Parameters:
target_entropy – the target entropy value. For discrete action spaces, it is usually -log(|A|) for a balance between stochasticity and determinism or -log(1/|A|)=log(|A|) for maximum stochasticity or, more generally, lambda*log(|A|), e.g. with lambda close to 1 (e.g. 0.98) for pronounced stochasticity. For continuous action spaces, it is usually -dim(A) for a balance between stochasticity and determinism, with similar generalizations as for discrete action spaces.
log_alpha – the (initial) value of the log of the entropy regularization coefficient alpha.
optim – the factory with which to create the optimizer for log_alpha.
- property value: float
Retrieves the current value of alpha.
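How automatic tuning moves alpha toward the target entropy can be sketched with plain gradient descent on log_alpha. This sketch assumes the commonly used loss -log_alpha * mean(target_entropy - entropy), with entropy estimated from sampled log probabilities. The function `alpha_gradient_step` and the learning rate are hypothetical; in the library, the optimizer created from `optim` performs the update, and its exact loss may differ.

```python
import math


def alpha_gradient_step(log_alpha, log_probs, target_entropy, lr):
    """One gradient-descent step on loss = -log_alpha * mean(target_entropy - entropy)."""
    entropies = [-lp for lp in log_probs]  # sample-based entropy estimates
    deficit = sum(target_entropy - h for h in entropies) / len(entropies)
    grad = -deficit  # derivative of the loss with respect to log_alpha
    return log_alpha - lr * grad


action_dim = 3
target_entropy = -float(action_dim)  # the -dim(A) heuristic for continuous spaces
log_alpha = math.log(0.2)            # initial value of log(alpha)

# Policy entropy (-4 on average) is below the target (-3): alpha should grow.
raised = alpha_gradient_step(log_alpha, [2.0, 4.0, 6.0], target_entropy, lr=0.1)

# Policy entropy (-2 on average) is above the target (-3): alpha should shrink.
lowered = alpha_gradient_step(log_alpha, [1.0, 2.0, 3.0], target_entropy, lr=0.1)

alpha_raised, alpha_lowered = math.exp(raised), math.exp(lowered)
```

Optimizing log_alpha rather than alpha itself keeps the coefficient positive without any explicit constraint, since alpha is recovered as exp(log_alpha).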
- class SAC(*, policy: SACPolicy, policy_optim: OptimizerFactory, critic: Module, critic_optim: OptimizerFactory, critic2: Module | None = None, critic2_optim: OptimizerFactory | None = None, tau: float = 0.005, gamma: float = 0.99, alpha: float | Alpha = 0.2, n_step_return_horizon: int = 1, deterministic_eval: bool = True)
Bases: ActorDualCriticsOffPolicyAlgorithm[SACPolicy, DistLogProbBatchProtocol], Generic[TSACTrainingStats]
Implementation of Soft Actor-Critic. arXiv:1812.05905.
- Parameters:
policy – the policy (an instance of SACPolicy).
policy_optim – the optimizer factory for the policy’s model.
critic – the first critic network. (s, a -> Q(s, a))
critic_optim – the optimizer factory for the first critic network.
critic2 – the second critic network. (s, a -> Q(s, a)). If None, copy the first critic (via deepcopy).
critic2_optim – the optimizer factory for the second critic network. If None, use the first critic’s factory.
tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
alpha – the entropy regularization coefficient, which balances exploration and exploitation. This coefficient controls how much the agent values randomness in its policy versus pursuing higher rewards. Higher values (e.g., 0.5-1.0) strongly encourage exploration by rewarding the agent for maintaining diverse action choices, even if this means selecting some lower-value actions. Lower values (e.g., 0.01-0.1) prioritize exploitation, allowing the policy to become more focused on the highest-value actions. A value of 0 would completely remove entropy regularization, potentially leading to premature convergence to suboptimal deterministic policies. Can be provided as a fixed float (0.2 is a reasonable default) or as an instance of, in particular, class AutoAlpha for automatic tuning during training.
n_step_return_horizon – the number of future steps (> 0) to consider when computing temporal difference (TD) targets. Controls the balance between TD learning and Monte Carlo methods: higher values reduce bias (by relying less on potentially inaccurate value estimates) but increase variance (by incorporating more environmental stochasticity and reducing the averaging effect). A value of 1 corresponds to standard TD learning with immediate bootstrapping, while very large values approach Monte Carlo-like estimation that uses complete episode returns.
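The roles of tau, gamma, alpha and n_step_return_horizon can be made concrete with a small standard-library sketch. The helpers `soft_update`, `n_step_target` and `soft_value` are hypothetical and work on plain floats and lists. `soft_update` follows the weighted average stated for tau, `n_step_target` assumes the usual discounted n-step return with bootstrapping, and `soft_value` assumes the standard SAC form that takes the minimum of the two critics and subtracts alpha times the log probability. Whether the library computes these quantities in exactly this way is an assumption.

```python
def soft_update(target, source, tau):
    """target = tau * source + (1 - tau) * target, applied per parameter."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


def n_step_target(rewards, bootstrap_value, gamma):
    """Sum of gamma^k * r_k over the horizon, plus gamma^n * bootstrap_value."""
    ret = bootstrap_value
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


def soft_value(q1, q2, alpha, log_prob):
    """Entropy-regularized value estimate: min(Q1, Q2) - alpha * log_prob."""
    return min(q1, q2) - alpha * log_prob


# tau = 0.005 moves the target parameters only slightly toward the source.
new_target = soft_update(target=[0.0, 1.0], source=[1.0, 0.0], tau=0.005)

# Horizon 1 is standard TD learning; horizon 3 relies more on observed rewards.
one_step = n_step_target([1.0], bootstrap_value=10.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)

# A larger alpha rewards low-probability (more exploratory) actions more strongly.
value = soft_value(q1=1.0, q2=2.0, alpha=0.2, log_prob=-1.5)
```

In this sketch the bootstrap value passed to `n_step_target` would be the output of `soft_value` evaluated with the target critics at the state reached after the last reward, so all four hyperparameters enter the same TD target.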