dqn

Source code: tianshou/algorithm/modelfree/dqn.py

class DiscreteQLearningPolicy(*, model: TModel, action_space: Space, observation_space: Space | None = None, eps_training: float = 0.0, eps_inference: float = 0.0)

Bases: Policy, Generic[TModel]

Parameters:

model – a model mapping (obs, state, info) to action_values_BA.
action_space – the environment’s action space
observation_space – the environment’s observation space.
eps_training – the epsilon value for epsilon-greedy exploration during training. When collecting data for training, this is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).
eps_inference – the epsilon value for epsilon-greedy exploration during inference, i.e. non-training cases (such as evaluation during test steps). The epsilon value is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).
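The epsilon-greedy rule described by eps_training and eps_inference can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: the helper name epsilon_greedy and the array name q_values_BA are hypothetical.

```python
import numpy as np


def epsilon_greedy(q_values_BA, eps, rng):
    """Return greedy actions, each replaced by a uniform random action with probability eps.

    Hypothetical helper for illustration only; not part of the library.
    """
    batch_size, num_actions = q_values_BA.shape
    greedy = q_values_BA.argmax(axis=1)
    random_actions = rng.integers(0, num_actions, size=batch_size)
    # rng.random draws from [0, 1), so eps=0.0 never explores and eps=1.0 always does.
    explore = rng.random(batch_size) < eps
    return np.where(explore, random_actions, greedy)


rng = np.random.default_rng(0)
q = np.array([[0.1, 0.9, 0.3], [0.7, 0.2, 0.5]])
greedy_actions = epsilon_greedy(q, eps=0.0, rng=rng)    # fully greedy
explored_actions = epsilon_greedy(q, eps=1.0, rng=rng)  # fully random
```

With eps=0.0 the result is always the argmax of each row; with eps=1.0 every action is drawn uniformly from the action set.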

set_eps_training(eps: float) → None

Sets the epsilon value for epsilon-greedy exploration during training.

Parameters:

eps – the epsilon value for epsilon-greedy exploration during training. When collecting data for training, this is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).

set_eps_inference(eps: float) → None

Sets the epsilon value for epsilon-greedy exploration during inference.

Parameters:

eps – the epsilon value for epsilon-greedy exploration during inference, i.e. non-training cases (such as evaluation during test steps). The epsilon value is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).

forward(batch: ObsBatchProtocol, state: Any | None = None, model: Module | None = None) → ModelOutputBatchProtocol

Compute action over the given batch data.

To mask actions, add a “mask” entry to batch.obs. For example, for an environment with three actions (0, 1 and 2):

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)

Parameters:

batch – the batch of observation data for which to compute actions.
state – optional hidden state (for RNNs)
model – if not passed will use self.model. Typically used to pass the lagged target network instead of using the current model.

Returns:

A Batch with 3 keys:

act – the action.
logits – the network’s raw output.
state – the hidden state.

compute_q_value(logits: Tensor, mask: ndarray | None) → Tensor

Compute the Q-value based on the network’s raw output and the action mask.
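One common way to combine raw network output with an action mask is to push the values of unavailable actions below every available value, so that an argmax can never select them. The NumPy sketch below illustrates that idea; it is an assumption about the general technique, not a statement of what compute_q_value does internally, and the helper name mask_q_values is hypothetical.

```python
import numpy as np


def mask_q_values(logits_BA, mask_BA):
    """Lower the values of masked-out actions so that argmax ignores them.

    Hypothetical helper; mask_BA is True where an action is available.
    """
    if mask_BA is None:
        return logits_BA
    # min - max - 1 is at most -1 and moves any entry below the current minimum.
    min_value = logits_BA.min() - logits_BA.max() - 1.0
    unavailable = 1.0 - mask_BA.astype(logits_BA.dtype)
    return logits_BA + unavailable * min_value


logits = np.array([[2.0, 0.5, 3.0]])
mask = np.array([[False, True, False]])  # only action 1 is available
masked = mask_q_values(logits, mask)
chosen = masked.argmax(axis=1)
```

Here action 1 has the lowest raw value, yet it is the one selected because the other two actions are masked out.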

add_exploration_noise(act: TArrOrActBatch, batch: ObsBatchProtocol) → TArrOrActBatch

(Optionally) adds noise to the actions computed by the policy’s forward method for exploration purposes.

NOTE: The base implementation does not add any noise, but subclasses can override this method to add appropriate mechanisms for adding noise.

Parameters:

act – a data batch or numpy.ndarray containing actions computed by the policy’s forward method.
batch – the corresponding input batch that was passed to forward; provided for advanced usage.

Returns:

actions in the same format as the input act but with added exploration noise (if implemented - otherwise returns act unchanged).

class QLearningOffPolicyAlgorithm(*, policy: TDQNPolicy, optim: OptimizerFactory, gamma: float = 0.99, n_step_return_horizon: int = 1, target_update_freq: int = 0)

Bases: OffPolicyAlgorithm[TDQNPolicy], LaggedNetworkFullUpdateAlgorithmMixin, ABC

Base class for Q-learning off-policy algorithms that use a Q-function to compute the n-step return. It optionally uses a lagged model, which is used as a target network and which is fully updated periodically.
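The idea of a lagged model that is fully updated periodically can be sketched with plain Python dictionaries standing in for network parameters. All names below (run_with_lagged_target, update_step, the parameter dictionaries) are hypothetical and chosen for illustration; the sketch only assumes that a full update means copying the online parameters wholesale.

```python
import copy


def run_with_lagged_target(online_params, num_iterations, target_update_freq, update_step):
    """Run update_step repeatedly, fully syncing a lagged copy every target_update_freq iterations.

    Hypothetical helper; a frequency of 0 means no separate target copy is kept.
    """
    use_target = target_update_freq > 0
    target_params = copy.deepcopy(online_params) if use_target else online_params
    num_syncs = 0
    for iteration in range(1, num_iterations + 1):
        update_step(online_params, target_params)
        if use_target and iteration % target_update_freq == 0:
            # Full (hard) update: replace the lagged copy with the current parameters.
            target_params = copy.deepcopy(online_params)
            num_syncs += 1
    return target_params, num_syncs


def toy_update(online, target):
    # Stand-in for a gradient step: just nudge the single parameter.
    online["w"] += 1.0


online = {"w": 0.0}
target, syncs = run_with_lagged_target(online, 10, 4, toy_update)
```

After 10 iterations with a frequency of 4, the lagged copy was synced at iterations 4 and 8, so it trails the online parameters by two updates.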

Parameters:

policy – the policy
optim – the optimizer factory for the policy’s model.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
n_step_return_horizon – the number of future steps (> 0) to consider when computing temporal difference (TD) targets. Controls the balance between TD learning and Monte Carlo methods: higher values reduce bias (by relying less on potentially inaccurate value estimates) but increase variance (by incorporating more environmental stochasticity and reducing the averaging effect). A value of 1 corresponds to standard TD learning with immediate bootstrapping, while very large values approach Monte Carlo-like estimation that uses complete episode returns.
target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.
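The roles of gamma and n_step_return_horizon can be made concrete with a small worked computation of an n-step TD target. This is a sketch of the standard formula, not the library's code: the function n_step_target and its arguments are hypothetical, and bootstrap_value stands for the value estimate of the state reached after n steps.

```python
import numpy as np


def n_step_target(rewards, bootstrap_value, gamma, terminated=False):
    """Discounted sum of n rewards plus a discounted bootstrap value.

    Hypothetical helper; rewards holds r_t, ..., r_{t+n-1}.
    """
    rewards = np.asarray(rewards, dtype=float)
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    target = float(np.dot(discounts, rewards))
    if not terminated:
        # Bootstrap from the value estimate of the state reached after n steps.
        target += gamma**n * bootstrap_value
    return target


one_step = n_step_target([1.0], bootstrap_value=8.0, gamma=0.5)
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=8.0, gamma=0.5)
```

With a horizon of 1 the bootstrap value contributes 4.0 of the 5.0 target; with a horizon of 3 it contributes only 1.0 of 2.75, which illustrates how a longer horizon relies less on the value estimate and more on observed rewards.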

property use_target_network: bool

class DQN(*, policy: TDQNPolicy, optim: OptimizerFactory, gamma: float = 0.99, n_step_return_horizon: int = 1, target_update_freq: int = 0, is_double: bool = True, huber_loss_delta: float | None = None)

Bases: QLearningOffPolicyAlgorithm[TDQNPolicy], Generic[TDQNPolicy]

Implementation of Deep Q Network. arXiv:1312.5602.

Implementation of Double Q-Learning. arXiv:1509.06461.

Implementation of Dueling DQN. arXiv:1511.06581 (the dueling DQN is implemented in the network side, not here).

Parameters:

policy – the policy
optim – the optimizer factory for the policy’s model.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
n_step_return_horizon – the number of future steps (> 0) to consider when computing temporal difference (TD) targets. Controls the balance between TD learning and Monte Carlo methods: higher values reduce bias (by relying less on potentially inaccurate value estimates) but increase variance (by incorporating more environmental stochasticity and reducing the averaging effect). A value of 1 corresponds to standard TD learning with immediate bootstrapping, while very large values approach Monte Carlo-like estimation that uses complete episode returns.
target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.
is_double – flag indicating whether to use the Double DQN algorithm for target value computation. If True, the algorithm uses the online network to select actions and the target network to evaluate their Q-values. This approach helps reduce the overestimation bias in Q-learning by decoupling action selection from action evaluation. If False, the algorithm follows the vanilla DQN method that directly takes the maximum Q-value from the target network. Note: Double Q-learning will only be effective when a target network is used (target_update_freq > 0).
huber_loss_delta – controls whether to use the Huber loss instead of the MSE loss for the TD error and the threshold for the Huber loss. If None, the MSE loss is used. If not None, uses the Huber loss as described in the Nature DQN paper (nature14236) with the given delta, which limits the influence of outliers. Unlike the MSE loss where the gradients grow linearly with the error magnitude, the Huber loss causes the gradients to plateau at a constant value for large errors, providing more stable training. NOTE: The magnitude of delta should depend on the scale of the returns obtained in the environment.
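The difference between the two target computations selected by is_double, and the effect of huber_loss_delta, can be sketched in NumPy. The functions td_target and huber_loss below are hypothetical illustrations of the standard formulas, assuming a 1-step target; they are not the library's implementation.

```python
import numpy as np


def td_target(rew_B, done_B, q_online_next_BA, q_target_next_BA, gamma, is_double):
    """1-step TD target using either Double DQN or vanilla DQN (hypothetical helper)."""
    if is_double:
        # Select the action with the online network, evaluate it with the target network.
        best_action_B = q_online_next_BA.argmax(axis=1)
        next_q_B = q_target_next_BA[np.arange(len(best_action_B)), best_action_B]
    else:
        # Vanilla DQN: take the maximum target-network value directly.
        next_q_B = q_target_next_BA.max(axis=1)
    return rew_B + gamma * (1.0 - done_B) * next_q_B


def huber_loss(td_error, delta):
    """Element-wise Huber loss: quadratic within delta, linear beyond it."""
    abs_error = np.abs(td_error)
    quadratic = 0.5 * td_error**2
    linear = delta * (abs_error - 0.5 * delta)
    return np.where(abs_error <= delta, quadratic, linear)


rew = np.array([1.0])
done = np.array([0.0])
q_online_next = np.array([[1.0, 2.0]])  # online network prefers action 1
q_target_next = np.array([[5.0, 3.0]])  # target network overrates action 0
double_target = td_target(rew, done, q_online_next, q_target_next, 0.5, is_double=True)
vanilla_target = td_target(rew, done, q_online_next, q_target_next, 0.5, is_double=False)
losses = huber_loss(np.array([0.5, 3.0]), delta=1.0)
```

In this toy case the vanilla target picks up the target network's largest value, while the Double DQN target evaluates the action chosen by the online network and comes out lower. For the Huber loss, the error of 3.0 falls in the linear regime, so its gradient magnitude is capped at delta.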
