discrete_bcq


class DiscreteBCQTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: float, q_loss: float, i_loss: float, reg_loss: float)

Bases: SimpleLossTrainingStats

q_loss: float
i_loss: float
reg_loss: float
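
As a rough illustration of how the three component fields might relate to the aggregate loss, the sketch below assumes the total is a weighted sum, with the regularization term scaled by a weight that plays the role of imitation_logits_penalty. The helper name combine_bcq_losses is hypothetical and not part of the library; the exact composition should be checked against the source.

```python
def combine_bcq_losses(
    q_loss: float,
    i_loss: float,
    reg_loss: float,
    imitation_logits_penalty: float = 0.01,
) -> float:
    """Hypothetical helper: weighted sum of the three loss components.

    Assumed form (not confirmed by this page):
        loss = q_loss + i_loss + imitation_logits_penalty * reg_loss
    """
    return q_loss + i_loss + imitation_logits_penalty * reg_loss


# Example with made-up component values.
total = combine_bcq_losses(q_loss=0.5, i_loss=1.2, reg_loss=4.0)
print(total)
```

Under this assumption, a small penalty weight keeps the regularization term from dominating the Q-learning and imitation terms.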
class DiscreteBCQPolicy(*, model: Module, imitator: Module, target_update_freq: int = 8000, unlikely_action_threshold: float = 0.3, action_space: Discrete, observation_space: Space | None = None, eps_inference: float = 0.0)

Bases: DiscreteQLearningPolicy

Parameters:
  • model – the Q-network, a model mapping a batch of states to action values (s_B -> action_values_BA)

  • imitator – the imitation network, a model mapping states to imitation logits (s -> imitation_logits)

  • target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.

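  • unlikely_action_threshold – the threshold (tau) for unlikely actions, as shown in Eq. (17) of the paper (arXiv:1910.01708). Actions whose imitator probability, relative to the most likely action, falls below this threshold are excluded from action selection.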

  • action_space – the environment’s action space.

  • observation_space – the environment’s observation space.

  • eps_inference – the epsilon value for epsilon-greedy exploration during inference, i.e. non-training cases (such as evaluation during test steps). The epsilon value is the probability of choosing a random action instead of the action chosen by the policy. A value of 0.0 means no exploration (fully greedy) and a value of 1.0 means full exploration (fully random).
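
The role of unlikely_action_threshold can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the library's implementation: the function name select_actions is hypothetical, and it assumes the filter compares each action's imitator probability, divided by the largest probability in that row, against tau before taking the argmax over Q-values.

```python
import numpy as np


def select_actions(q_values, imitation_logits, unlikely_action_threshold=0.3):
    """Greedy action over Q-values, restricted to actions the imitator deems likely.

    Hypothetical sketch. Both inputs have shape (batch, n_actions).
    """
    # Numerically stable log-softmax of the imitator's logits.
    shifted = imitation_logits - imitation_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Probability of each action relative to the most likely one, in (0, 1].
    ratio = np.exp(log_probs - log_probs.max(axis=1, keepdims=True))
    # Actions whose relative probability is below tau are excluded.
    allowed = ratio >= unlikely_action_threshold
    masked_q = np.where(allowed, q_values, -np.inf)
    return masked_q.argmax(axis=1)


q = np.array([[1.0, 5.0, 2.0]])
logits = np.array([[2.0, -3.0, 1.5]])
# Action 1 has the highest Q-value but is very unlikely under the imitator,
# so with tau = 0.3 the choice falls back to the best remaining action.
actions = select_actions(q, logits)
```

The most likely action always has a ratio of 1, so at least one action survives the filter for any threshold in [0, 1].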

forward(batch: ObsBatchProtocol, state: Any | None = None, model: Module | None = None) → ImitationBatchProtocol

Compute action over the given batch data.

To mask actions, add a “mask” entry to batch.obs. For example, for an environment with three actions (0, 1 and 2):

batch == Batch(
    obs=Batch(
        obs="original obs, with batch_size=1 for demonstration",
        mask=np.array([[False, True, False]]),
        # action 1 is available
        # action 0 and 2 are unavailable
    ),
    ...
)
Parameters:
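  • batch – the input batch; its obs entry holds the observations and, optionally, the action mask described above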

  • state – optional hidden state (for RNNs)

  • model – the model to use; defaults to self.model if not passed. Typically used to pass the lagged target network instead of the current model.

Returns:

A Batch with 3 keys:

  • act – the action.

  • logits – the network’s raw output.

  • state – the hidden state.
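
To make the masking semantics concrete, here is a minimal NumPy sketch of how a boolean availability mask can restrict a greedy choice over action values. The function greedy_with_mask is a hypothetical helper for illustration only and does not reproduce the library's internals.

```python
import numpy as np


def greedy_with_mask(q_values, mask):
    """Return the best available action per row (hypothetical helper).

    q_values: float array of shape (batch, n_actions).
    mask: boolean array of the same shape; True marks an available action.
    """
    # Unavailable actions get -inf so they can never win the argmax.
    masked_q = np.where(mask, q_values, -np.inf)
    return masked_q.argmax(axis=1)


q_values = np.array([[0.9, 0.1, 0.5]])
mask = np.array([[False, True, False]])  # only action 1 is available
act = greedy_with_mask(q_values, mask)
```

Here action 0 has the highest value but is masked out, so the only available action, 1, is selected.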

class DiscreteBCQ(*, policy: DiscreteBCQPolicy, optim: OptimizerFactory, gamma: float = 0.99, n_step_return_horizon: int = 1, target_update_freq: int = 8000, imitation_logits_penalty: float = 0.01)

Bases: OfflineAlgorithm[DiscreteBCQPolicy], LaggedNetworkFullUpdateAlgorithmMixin

Implementation of the discrete batch-constrained deep Q-learning (BCQ) algorithm. arXiv:1910.01708.

Parameters:
  • policy – the policy, a DiscreteBCQPolicy.

  • optim – the optimizer factory for the policy’s model.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • n_step_return_horizon – the number of future steps (> 0) to consider when computing temporal difference (TD) targets. Controls the balance between TD learning and Monte Carlo methods: higher values reduce bias (by relying less on potentially inaccurate value estimates) but increase variance (by incorporating more environmental stochasticity and reducing the averaging effect). A value of 1 corresponds to standard TD learning with immediate bootstrapping, while very large values approach Monte Carlo-like estimation that uses complete episode returns.

  • target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.

  • imitation_logits_penalty – regularization weight for imitation logits.

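
The interplay of gamma and n_step_return_horizon can be illustrated with a generic n-step TD target. This is a plain-Python sketch, assuming the standard definition (discounted sum of n rewards plus a discounted bootstrap value); the function n_step_target is hypothetical and not part of the library.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Generic n-step TD target (hypothetical helper).

    rewards: the n rewards r_t, ..., r_{t+n-1} observed along the trajectory.
    bootstrap_value: value estimate for the state reached after n steps,
        e.g. the target network's best action value; 0.0 if the episode ended.
    """
    n = len(rewards)
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**n * bootstrap_value


# n = 1: standard TD target with immediate bootstrapping.
one_step = n_step_target([1.0], bootstrap_value=5.0, gamma=0.9)
# n = 3: more observed rewards, less weight on the bootstrap estimate.
three_step = n_step_target([1.0, 1.0, 1.0], bootstrap_value=5.0, gamma=0.9)
```

With a longer horizon, the bootstrap value is multiplied by a smaller factor (gamma to the power n), which is why larger horizons rely less on the value estimate and more on observed rewards.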