td3


class TD3TrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float)

Bases: TrainingStats

actor_loss: float
critic1_loss: float
critic2_loss: float
class ActorDualCriticsOffPolicyAlgorithm(*, policy: Any, policy_optim: OptimizerFactory, critic: Module, critic_optim: OptimizerFactory, critic2: Module | None = None, critic2_optim: OptimizerFactory | None = None, tau: float = 0.005, gamma: float = 0.99, n_step_return_horizon: int = 1)

Bases: ActorCriticOffPolicyAlgorithm[TPolicy, TActBatchProtocol], ABC

A base class for off-policy algorithms with two critics, where the target Q-value is computed as the minimum of the two lagged critics’ values.
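The minimum-of-two-critics target described above can be illustrated with a minimal sketch. It assumes a one-step return (n_step_return_horizon = 1), scalar Q-values, and the common convention of dropping the bootstrap term on terminal transitions. The function name clipped_double_q_target is hypothetical and is not part of this library's API.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """One-step TD target built from two lagged (target) critics.

    q1_next and q2_next stand for the two target critics' estimates of
    Q(s', a'). Taking their minimum counteracts overestimation.
    Hypothetical helper for illustration only.
    """
    min_q = min(q1_next, q2_next)
    # Assumption: no bootstrapping from terminal states.
    return reward + gamma * (1.0 - float(done)) * min_q


target = clipped_double_q_target(
    reward=1.0, done=False, q1_next=10.0, q2_next=8.0, gamma=0.5
)
print(target)  # 1.0 + 0.5 * min(10.0, 8.0) = 5.0
```

In this sketch the more pessimistic critic (8.0) determines the target, so a single critic that overestimates cannot inflate it on its own.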

Parameters:
  • policy – the policy.

  • policy_optim – the optimizer factory for the policy’s model.

  • critic – the first critic network. For continuous action spaces, it maps a state-action pair to a Q-value: (s, a -> Q(s, a)). NOTE: The default implementation of _target_q_compute_value assumes a continuous action space; override this method when using discrete actions.

  • critic_optim – the optimizer factory for the first critic network.

  • critic2 – the second critic network (analogous to the first). If None, a copy of the first critic (created via deepcopy) is used.

  • critic2_optim – the optimizer factory for the second critic network. If None, use the first critic’s factory.

  • tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
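As a minimal sketch of the soft update rule target = tau * source + (1 - tau) * target given for tau, the following treats network parameters as plain lists of floats. The function soft_update is a hypothetical helper for illustration, not this library's implementation, which operates on network modules.

```python
def soft_update(target_params, source_params, tau=0.005):
    """Return target parameters moved a fraction tau toward the source.

    Implements target = tau * source + (1 - tau) * target element-wise.
    Hypothetical helper; parameters are plain floats for simplicity.
    """
    return [
        tau * s + (1.0 - tau) * t
        for s, t in zip(source_params, target_params)
    ]


target = [0.0, 2.0]
source = [1.0, 4.0]
updated = soft_update(target, source, tau=0.5)
print(updated)  # [0.5, 3.0]
```

With tau = 0.5 the target moves halfway to the source in a single step. With the default of 0.005 each update closes only 0.5% of the remaining gap, which is why small values make the target networks change gradually.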

class TD3(*, policy: ContinuousDeterministicPolicy, policy_optim: OptimizerFactory, critic: Module, critic_optim: OptimizerFactory, critic2: Module | None = None, critic2_optim: OptimizerFactory | None = None, tau: float = 0.005, gamma: float = 0.99, policy_noise: float = 0.2, update_actor_freq: int = 2, noise_clip: float = 0.5, n_step_return_horizon: int = 1)

Bases: ActorDualCriticsOffPolicyAlgorithm[ContinuousDeterministicPolicy, ActStateBatchProtocol]

Implementation of TD3, arXiv:1802.09477.

Parameters:
  • policy – the policy (a ContinuousDeterministicPolicy).

  • policy_optim – the optimizer factory for the policy’s model.

  • critic – the first critic network. (s, a -> Q(s, a))

  • critic_optim – the optimizer factory for the first critic network.

  • critic2 – the second critic network. (s, a -> Q(s, a)). If None, a copy of the first critic (created via deepcopy) is used.

  • critic2_optim – the optimizer factory for the second critic network. If None, use the first critic’s factory.

  • tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • policy_noise – scaling factor for the Gaussian noise added to target policy actions. This parameter implements target policy smoothing, a regularization technique described in the TD3 paper. The noise is sampled from a normal distribution and multiplied by this value before being added to actions. Higher values increase exploration in the target policy, helping to address function approximation error. The added noise is optionally clipped to a range determined by the noise_clip parameter. Typically set between 0.1 and 0.5 relative to the action scale of the environment.

  • update_actor_freq – the frequency of actor network updates relative to critic network updates (the actor network is only updated once for every update_actor_freq critic updates). This implements the “delayed” policy updates from the TD3 algorithm, where the actor is updated less frequently than the critics. Higher values (e.g., 2-5) help stabilize training by allowing the critic to become more accurate before updating the policy. The default value of 2 follows the original TD3 paper’s recommendation of updating the policy at half the rate of the Q-functions.

  • noise_clip – defines the maximum absolute value of the noise added to target policy actions, i.e. noise values are clipped to the range [-noise_clip, noise_clip] (after generating and scaling the noise via policy_noise). This parameter implements bounded target policy smoothing as described in the TD3 paper. It prevents extreme noise values from causing unrealistic target values during training. Setting it to 0.0 (or a negative value) disables clipping entirely. It is typically set to about twice the policy_noise value (e.g. 0.5 when policy_noise is 0.2).
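The interplay of policy_noise, noise_clip and update_actor_freq described above can be sketched as follows. This is a scalar, standard-library illustration under stated assumptions, not the library's implementation: the helper names smoothing_noise and should_update_actor are hypothetical, and the choice to update the actor on every update_actor_freq-th critic update (counting from 1) is an assumed convention.

```python
import random


def smoothing_noise(policy_noise=0.2, noise_clip=0.5, rng=random):
    """Sample target-policy smoothing noise for one action dimension.

    Standard normal noise is scaled by policy_noise, then clipped to
    [-noise_clip, noise_clip]. A noise_clip of 0.0 or below disables
    clipping, as described for the noise_clip parameter.
    """
    noise = rng.gauss(0.0, 1.0) * policy_noise
    if noise_clip > 0.0:
        noise = max(-noise_clip, min(noise_clip, noise))
    return noise


def should_update_actor(critic_update_count, update_actor_freq=2):
    """Return True on every update_actor_freq-th critic update.

    Assumed convention: critic updates are counted starting at 1.
    """
    return critic_update_count % update_actor_freq == 0


rng = random.Random(0)  # seeded only to make the sketch reproducible
noise = smoothing_noise(policy_noise=0.2, noise_clip=0.5, rng=rng)

# With update_actor_freq=2 the actor is updated on every second critic update.
actor_steps = [k for k in range(1, 7) if should_update_actor(k, 2)]
print(actor_steps)  # [2, 4, 6]
```

The noise would be added to the target policy's action before the target critics are evaluated, so the default noise_clip of 0.5 bounds the perturbation at 2.5 standard deviations of the scaled noise when policy_noise is 0.2.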