ddpg

Source code: tianshou/algorithm/modelfree/ddpg.py

class DDPGTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic_loss: float)[source]#

Bases: TrainingStats

actor_loss: float#

critic_loss: float#
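To make the two reported quantities concrete, here is a minimal sketch of how actor and critic losses are commonly defined for DDPG (following arXiv:1509.02971). It is an assumption that the values reported here correspond to these definitions; the helper names `critic_loss_fn` and `actor_loss_fn` are hypothetical and are not part of Tianshou.

```python
from statistics import fmean


def critic_loss_fn(q_pred, q_target):
    """Mean squared error between predicted Q-values and TD targets."""
    return fmean([(p - t) ** 2 for p, t in zip(q_pred, q_target)])


def actor_loss_fn(q_of_actor_actions):
    """Negative mean Q-value of the actions proposed by the actor.

    Minimizing this quantity pushes the actor toward actions the critic rates highly.
    """
    return -fmean(q_of_actor_actions)


# Toy numbers only; in practice these come from the critic and actor networks.
stats = {
    "critic_loss": critic_loss_fn([1.0, 2.0], [1.5, 1.0]),
    "actor_loss": actor_loss_fn([0.5, 1.5]),
}
```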

class ContinuousPolicyWithExplorationNoise(*, exploration_noise: BaseNoise | Literal['default'] | None = None, action_space: Space, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip'] | None = 'clip')[source]#

Bases: Policy, ABC

Parameters:

exploration_noise – noise model for adding noise to continuous actions for exploration. This is useful when solving “hard exploration” problems. “default” is equivalent to GaussianNoise(sigma=0.1).
action_space – the environment’s action space.
observation_space – the environment’s observation space.
action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.
action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
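The interplay of action_bound_method and action_scaling described above can be sketched in plain Python. This is an illustrative approximation, not Tianshou's implementation: the function name `bound_and_scale` and its list-based interface are assumptions made for the example, and both "clip" and "tanh" are included because the parameter description mentions both.

```python
import math


def bound_and_scale(raw, low, high, bound_method="clip", scaling=True):
    """Bound raw actor outputs to [-1, 1], then map them linearly to [low, high]."""
    bounded = []
    for a in raw:
        if bound_method == "clip":
            a = max(-1.0, min(1.0, a))  # hard boundary
        elif bound_method == "tanh":
            a = math.tanh(a)  # smooth squashing
        # bound_method=None leaves the value untouched
        bounded.append(a)
    if not scaling:
        return bounded
    # Linear map: -1 -> low, +1 -> high
    return [lo + (hi - lo) * (a + 1.0) / 2.0 for a, lo, hi in zip(bounded, low, high)]


scaled = bound_and_scale([-3.0, 0.0, 0.5], low=[0.0, 0.0, -2.0], high=[10.0, 10.0, 2.0])
```

With clipping, the out-of-range value -3.0 is first clipped to -1 and therefore lands exactly on the lower bound of its dimension.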

set_exploration_noise(noise: BaseNoise | None) → None[source]#

Set the exploration noise.

add_exploration_noise(act: TArrOrActBatch, batch: ObsBatchProtocol) → TArrOrActBatch[source]#

(Optionally) adds noise to actions computed by the policy’s forward method for exploration purposes.

NOTE: The base implementation does not add any noise, but subclasses can override this method to add appropriate mechanisms for adding noise.

Parameters:

act – a data batch or numpy.ndarray containing actions computed by the policy’s forward method.
batch – the corresponding input batch that was passed to forward; provided for advanced usage.

Returns:

actions in the same format as the input act but with added exploration noise (if implemented - otherwise returns act unchanged).
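As a rough illustration of what an override might do, the sketch below adds zero-mean Gaussian noise with sigma=0.1 (the value the “default” setting is documented to use) to a list of action components. The function `add_gaussian_noise` is hypothetical and works on plain Python lists rather than on Tianshou batches or numpy arrays; whether and how the result is subsequently bounded is not covered here.

```python
import random


def add_gaussian_noise(act, sigma=0.1, rng=None):
    """Return a copy of act with independent N(0, sigma^2) noise added to each component."""
    rng = rng or random.Random()
    return [a + rng.gauss(0.0, sigma) for a in act]


# A seeded generator makes the perturbation reproducible.
noisy = add_gaussian_noise([0.0, 0.5, -0.5], sigma=0.1, rng=random.Random(0))
```

The output has the same length and layout as the input, mirroring the requirement that actions are returned in the same format as act.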

class ContinuousDeterministicPolicy(*, actor: AbstractContinuousActorDeterministic, exploration_noise: BaseNoise | Literal['default'] | None = None, action_space: Space, observation_space: Space | None = None, action_scaling: bool = True, action_bound_method: Literal['clip'] | None = 'clip')[source]#

Bases: ContinuousPolicyWithExplorationNoise

A policy for continuous action spaces that uses an actor which directly maps states to actions.

Parameters:

actor – the actor network, which maps states directly to actions (s -> actions).
exploration_noise – add noise to continuous actions for exploration; set to None for discrete action spaces. This is useful when solving “hard exploration” problems. “default” is equivalent to GaussianNoise(sigma=0.1).
action_space – the environment’s action space.
tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.
observation_space – the environment’s observation space.
action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.
action_bound_method – the method used to bound actions to the range [-1, 1].

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, model: Module | None = None, **kwargs: Any) → ActStateBatchProtocol[source]#

Compute actions for the given batch of data.

Returns:

A Batch with two keys:

act – the action.
state – the hidden state.

class ActorCriticOffPolicyAlgorithm(*, policy: TPolicy, policy_optim: OptimizerFactory, critic: Module, critic_optim: OptimizerFactory, tau: float = 0.005, gamma: float = 0.99, n_step_return_horizon: int = 1)[source]#

Bases: OffPolicyAlgorithm[TPolicy], LaggedNetworkPolyakUpdateAlgorithmMixin, Generic[TPolicy, TActBatchProtocol], ABC

Base class for actor-critic off-policy algorithms that use a lagged critic as a target network.

Its implementation of process_fn adds the n-step return to the batch, using the Q-values computed by the target network (lagged critic, critic_old) in order to compute the reward-to-go.

Specializations can override the action computation (_target_q_compute_action) or the Q-value computation based on these actions (_target_q_compute_value) to customize the target Q-value computation. The default implementation assumes a continuous action space where a critic receives a state/observation and an action; for discrete action spaces, where the critic receives only a state/observation, the method _target_q_compute_value must be overridden.
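The two override points can be pictured with a small stand-in class. Everything in this sketch is hypothetical: the class names, the method names `compute_action` and `compute_value`, and the use of plain callables for the actor and the lagged critic are illustrative assumptions and do not reproduce Tianshou's internal signatures.

```python
class TargetQSketch:
    """Hypothetical stand-in showing the two-step target Q computation."""

    def __init__(self, actor, lagged_critic):
        self.actor = actor
        self.lagged_critic = lagged_critic

    def compute_action(self, obs_next):
        # Step 1: choose the action to evaluate in the next state.
        return self.actor(obs_next)

    def compute_value(self, obs_next, act):
        # Step 2 (continuous default): the critic takes state and action.
        return self.lagged_critic(obs_next, act)

    def target_q(self, obs_next):
        return self.compute_value(obs_next, self.compute_action(obs_next))


class DiscreteTargetQSketch(TargetQSketch):
    def compute_value(self, obs_next, act):
        # Discrete case: the critic takes only the state and returns one
        # Q-value per action, so the chosen action indexes into that vector.
        return self.lagged_critic(obs_next)[act]


continuous = TargetQSketch(actor=lambda s: 2.0 * s, lagged_critic=lambda s, a: s + a)
discrete = DiscreteTargetQSketch(actor=lambda s: 1, lagged_critic=lambda s: [s, 10.0 * s])
```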

Parameters:

policy – the policy.
policy_optim – the optimizer factory for the policy’s model.
critic – the critic network. For continuous action spaces: (s, a -> Q(s, a)). For discrete action spaces: (s -> <Q(s, a_1), …, Q(s, a_N)>). NOTE: The default implementation of _target_q_compute_value assumes a continuous action space; override this method if using discrete actions.
critic_optim – the optimizer factory for the critic network.
tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
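The soft update rule given for tau, target = tau * source + (1 - tau) * target, can be illustrated on plain lists of numbers. The helper `polyak_update` is a hypothetical name used only for this sketch; a real implementation would apply the same formula to each network parameter tensor.

```python
def polyak_update(target, source, tau=0.005):
    """Return new target parameters: tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target, source)]


target_params = [0.0, 1.0]
source_params = [1.0, 1.0]
# A deliberately large tau makes the tracking visible after a few steps.
for _ in range(3):
    target_params = polyak_update(target_params, source_params, tau=0.5)
```

With tau=0.5 the first parameter moves 0.0, 0.5, 0.75, 0.875 toward the source value; with the default tau=0.005 the same movement would take far more updates, which is the stability trade-off described above.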

class DDPG(*, policy: ContinuousDeterministicPolicy, policy_optim: OptimizerFactory, critic: Module | ContinuousCritic, critic_optim: OptimizerFactory, tau: float = 0.005, gamma: float = 0.99, n_step_return_horizon: int = 1)[source]#

Bases: ActorCriticOffPolicyAlgorithm[ContinuousDeterministicPolicy, ActBatchProtocol]

Implementation of Deep Deterministic Policy Gradient. arXiv:1509.02971.

Parameters:

policy – the policy.
policy_optim – the optimizer factory for the policy’s model.
critic – the critic network. (s, a -> Q(s, a))
critic_optim – the optimizer factory for the critic network.
tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
n_step_return_horizon – the number of future steps (> 0) to consider when computing temporal difference (TD) targets. Controls the balance between TD learning and Monte Carlo methods: higher values reduce bias (by relying less on potentially inaccurate value estimates) but increase variance (by incorporating more environmental stochasticity and reducing the averaging effect). A value of 1 corresponds to standard TD learning with immediate bootstrapping, while very large values approach Monte Carlo-like estimation that uses complete episode returns.
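A minimal sketch of an n-step TD target shows how gamma and n_step_return_horizon interact. It assumes the standard formulation (discounted sum of n rewards plus a discounted bootstrap value from the target critic); the function `n_step_target` and its arguments are hypothetical and do not mirror Tianshou's internal API.

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99, terminated=False):
    """Compute sum_k gamma^k * r_{t+k} + gamma^n * Q_target(s_{t+n}, a').

    rewards: the n rewards r_t, ..., r_{t+n-1} (n = n_step_return_horizon).
    bootstrap_q: target-critic estimate at the state reached after n steps.
    terminated: if True, the episode ended within the horizon, so no bootstrap is added.
    """
    n = len(rewards)
    ret = sum(gamma**k * r for k, r in enumerate(rewards))
    if not terminated:
        ret += gamma**n * bootstrap_q
    return ret


one_step = n_step_target([1.0], bootstrap_q=8.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_q=8.0, gamma=0.5)
```

In the one-step case the bootstrap value contributes 4.0 of the 5.0 total, while in the three-step case it contributes only 1.0 of 2.75, which illustrates how longer horizons rely less on the value estimate.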
