bdqn


class BDQNPolicy(*, model: BranchingNet, action_space: Discrete, observation_space: Space | None = None, eps_training: float = 0.0, eps_inference: float = 0.0)

Bases: DiscreteQLearningPolicy[BranchingNet]

Parameters:
  • model – BranchingNet mapping (obs, state, info) -> action_values_BA.

  • action_space – the environment’s action space.

  • observation_space – the environment’s observation space.

  • eps_training – the epsilon value for epsilon-greedy exploration during training. When collecting data for training, this is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).

  • eps_inference – the epsilon value for epsilon-greedy exploration during inference, i.e. non-training cases (such as evaluation during test steps). The epsilon value is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).
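As a rough illustration of how eps_training and eps_inference act, the following minimal sketch applies epsilon-greedy selection to a branching action (one action index per branch). It is not the library's implementation: the function name epsilon_greedy_branches, the assumption that every branch has the same number of actions, and the choice to randomize all branches of a sample together are hypothetical.

```python
import random


def epsilon_greedy_branches(greedy_actions, num_actions_per_branch, eps, rng):
    """Return the greedy action vector, or a fully random one with probability eps.

    greedy_actions: one greedy action index per branch.
    num_actions_per_branch: number of choices in each branch (assumed equal here).
    eps: probability of replacing the greedy action by a random one.
    rng: a random.Random instance, passed in so the behavior is reproducible.
    """
    if rng.random() < eps:
        # Explore: draw a uniformly random action index for every branch.
        return [rng.randrange(num_actions_per_branch) for _ in greedy_actions]
    # Exploit: keep the action chosen by the policy.
    return list(greedy_actions)


rng = random.Random(0)
greedy = [2, 0, 1]

# eps=0.0 is fully greedy, eps=1.0 is fully random.
always_greedy = epsilon_greedy_branches(greedy, 3, eps=0.0, rng=rng)
always_random = epsilon_greedy_branches(greedy, 3, eps=1.0, rng=rng)
```

With eps=0.0 the greedy vector is always returned unchanged; with eps=1.0 every call returns a freshly sampled vector.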

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, model: Module | None = None) -> ModelOutputBatchProtocol

Compute action over the given batch data.

To mask actions, add a “mask” entry to batch.obs. For example, for an environment with three actions (0, 1 and 2):

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters:
  • batch – the batch data over which to compute actions.

  • state – optional hidden state (for RNNs)

  • model – the model to use; if not passed, self.model is used. Typically used to pass the lagged target network instead of the current model.

Returns:

A Batch which has 3 keys:

  • act – the action.

  • logits – the network’s raw output.

  • state – the hidden state.
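The action mask described for forward can be pictured with the following minimal sketch, which picks the greedy action among the available ones for a single observation with three actions. This is a generic technique, not necessarily how the policy applies the mask internally; the helper name masked_greedy_action and the sample Q-values are hypothetical.

```python
import math


def masked_greedy_action(q_values, mask):
    """Return the index of the highest Q-value among available actions.

    q_values: one Q-value per action.
    mask: one boolean per action, True if the action is available.
    """
    # Unavailable actions get -inf so that they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, mask)]
    return max(range(len(masked)), key=masked.__getitem__)


q_values = [5.0, 1.0, 3.0]
mask = [False, True, False]  # only action 1 is available

action = masked_greedy_action(q_values, mask)
unmasked_action = masked_greedy_action(q_values, [True, True, True])
```

Here action is 1 even though action 0 has the highest Q-value, because actions 0 and 2 are masked out; without a mask the result is 0.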

add_exploration_noise(act: TArrOrActBatch, batch: ObsBatchProtocol) -> TArrOrActBatch
(Optionally) adds noise to actions computed by the policy’s forward method for exploration purposes.

NOTE: The base implementation does not add any noise, but subclasses can override this method to add appropriate mechanisms for adding noise.

Parameters:
  • act – a data batch or numpy.ndarray containing actions computed by the policy’s forward method.

  • batch – the corresponding input batch that was passed to forward; provided for advanced usage.

Returns:

actions in the same format as the input act but with added exploration noise (if implemented - otherwise returns act unchanged).

class BDQN(*, policy: BDQNPolicy, optim: OptimizerFactory, gamma: float = 0.99, target_update_freq: int = 0, is_double: bool = True)

Bases: QLearningOffPolicyAlgorithm[BDQNPolicy]

Implementation of the Branching Dueling Q-Network (BDQN) algorithm (arXiv:1711.08946).

Parameters:
  • policy – the policy, a BDQNPolicy instance.

  • optim – the optimizer factory for the policy’s model.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.

  • is_double – flag indicating whether to use Double Q-learning for target value calculation. If True, the algorithm uses the online network to select actions and the target network to evaluate their Q-values. This decoupling helps reduce the overestimation bias that standard Q-learning is prone to. If False, the algorithm selects actions by directly taking the maximum Q-value from the target network. Note: This parameter is most effective when used with a target network (target_update_freq > 0).
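The effect of gamma and is_double on the bootstrap target can be sketched for a single transition and a single action branch as follows. This is an illustrative simplification, not the library's code: the function name q_target and its arguments are hypothetical, and how targets are combined across branches is not covered here.

```python
def q_target(reward, done, gamma, online_next_q, target_next_q, is_double):
    """Compute a one-step Q-learning target for one transition.

    online_next_q: Q-values of the next state from the online network.
    target_next_q: Q-values of the next state from the target network.
    """
    if is_double:
        # Double Q-learning: the online network selects the action,
        # the target network evaluates it.
        best = max(range(len(online_next_q)), key=online_next_q.__getitem__)
        next_value = target_next_q[best]
    else:
        # Standard Q-learning: take the maximum of the target network directly.
        next_value = max(target_next_q)
    # No bootstrapping from terminal states.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * next_value


online = [1.0, 3.0, 2.0]  # online network prefers action 1
target = [4.0, 0.5, 1.5]  # target network prefers action 0

double_target = q_target(1.0, False, 0.5, online, target, is_double=True)
plain_target = q_target(1.0, False, 0.5, online, target, is_double=False)
terminal_target = q_target(1.0, True, 0.5, online, target, is_double=True)
```

In this example the double target is 1.0 + 0.5 * 0.5 = 1.25, while the plain target is 1.0 + 0.5 * 4.0 = 3.0, showing how decoupling action selection from evaluation can yield a lower, less optimistic estimate. For a terminal transition the target is just the reward.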