bcq


class BCQTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float, vae_loss: float)

Bases: TrainingStats

actor_loss: float
critic1_loss: float
critic2_loss: float
vae_loss: float
class BCQPolicy(*, actor_perturbation: Module, action_space: Space, critic: Module, vae: VAE, forward_sampled_times: int = 100, observation_space: Space | None = None, action_scaling: bool = False, action_bound_method: Literal['clip', 'tanh'] | None = 'clip')

Bases: Policy

Parameters:
  • actor_perturbation – the actor perturbation network, which maps a state-action pair to a perturbed action (s, a -> perturbed a).

  • critic – the first critic network.

  • vae – the VAE network, which generates actions similar to those in the batch.

  • forward_sampled_times – the number of candidate actions sampled in the forward function. The policy samples this many actions and selects the one with the highest Q-value.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
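The interplay of action_bound_method and action_scaling described above can be illustrated with a minimal, library-independent sketch. The helper bound_and_scale and its arguments are hypothetical names chosen for this example and are not part of the documented API; it only mirrors the described behavior (bound to [-1, 1], then map linearly to [low, high]).

```python
import math


def bound_and_scale(raw_action, low, high, bound_method="clip"):
    """Hypothetical helper: bound each raw output to [-1, 1], then scale it.

    raw_action, low and high are sequences of floats with one entry per
    action dimension. bound_method may be "clip", "tanh" or None.
    """
    scaled = []
    for a, lo, hi in zip(raw_action, low, high):
        if bound_method == "clip":
            a = max(-1.0, min(1.0, a))  # hard clipping to [-1, 1]
        elif bound_method == "tanh":
            a = math.tanh(a)  # smooth squashing into (-1, 1)
        # Linear map from the normalized range [-1, 1] to [lo, hi].
        scaled.append(lo + (a + 1.0) * 0.5 * (hi - lo))
    return scaled


# 0.0 lands in the middle of [0, 10]; 2.5 is clipped to 1.0, i.e. the upper bound 2.0.
print(bound_and_scale([0.0, 2.5], low=[0.0, -2.0], high=[10.0, 2.0]))  # [5.0, 2.0]
```

With bound_method=None the raw value is passed straight to the linear map, which is only sensible if the actor already produces outputs in [-1, 1].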

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: Any) -> ActBatchProtocol

Compute action over the given batch data.
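The sample-and-select behavior controlled by forward_sampled_times can be sketched as follows. This is a simplified illustration for a single observation, not the library's implementation: sample_action, perturb and q_value are hypothetical stand-ins for the VAE decoder, the actor perturbation network and the critic.

```python
import random


def select_action(obs, sample_action, perturb, q_value, forward_sampled_times=100):
    """Return the perturbed candidate action with the highest critic value."""
    candidates = []
    for _ in range(forward_sampled_times):
        a = sample_action(obs)  # stand-in for decoding an action from the VAE
        candidates.append(perturb(obs, a))  # stand-in for actor_perturbation
    # Keep the candidate that the critic scores highest.
    return max(candidates, key=lambda a: q_value(obs, a))


# Toy stand-ins: uniform sampling, identity perturbation, and a critic
# that prefers actions close to 0.5.
rng = random.Random(0)
best = select_action(
    obs=0.0,
    sample_action=lambda obs: rng.uniform(-1.0, 1.0),
    perturb=lambda obs, a: a,
    q_value=lambda obs, a: -(a - 0.5) ** 2,
    forward_sampled_times=100,
)
print(best)  # a value near 0.5
```

A larger forward_sampled_times gives a finer search over candidate actions at the cost of more critic evaluations per step.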

class BCQ(*, policy: BCQPolicy, actor_perturbation_optim: OptimizerFactory, critic_optim: OptimizerFactory, vae_optim: OptimizerFactory, critic2: Module | None = None, critic2_optim: OptimizerFactory | None = None, gamma: float = 0.99, tau: float = 0.005, lmbda: float = 0.75, num_sampled_action: int = 10)

Bases: OfflineAlgorithm[BCQPolicy], LaggedNetworkPolyakUpdateAlgorithmMixin

Implementation of the Batch-Constrained Deep Q-learning (BCQ) algorithm. arXiv:1812.02900.

Parameters:
  • policy – the policy

  • actor_perturbation_optim – the optimizer factory for the policy’s actor perturbation network.

  • critic_optim – the optimizer factory for the policy’s critic network.

  • critic2 – the second critic network; if None, a clone of the policy’s critic is used.

  • critic2_optim – the optimizer factory for the second critic network; if None, the optimizer factory of the first critic is used.

  • vae_optim – the optimizer factory for the VAE network.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.

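  • lmbda – the weight in [0, 1] used for Clipped Double Q-learning when combining the two critics’ estimates into the target Q-value: lmbda * min(Q1, Q2) + (1 - lmbda) * max(Q1, Q2). A value of 1 uses only the minimum of the two critics; smaller values give more weight to the maximum.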

  • num_sampled_action – the number of actions sampled per next state when computing the target Q-value. The algorithm samples this many actions from the VAE and perturbs each of them before evaluating the target Q-value.
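How gamma, lmbda, num_sampled_action and tau fit together can be shown with a small sketch for a single transition. It follows the target described in the BCQ paper and the soft-update formula given above; the function names bcq_target and soft_update are hypothetical, and the lists of floats stand in for the target critics' outputs and for network parameters.

```python
def bcq_target(reward, done, q1_values, q2_values, gamma=0.99, lmbda=0.75):
    """Compute the target Q-value for one transition (illustrative only).

    q1_values and q2_values hold the two target critics' estimates for each
    of the num_sampled_action perturbed actions sampled at the next state.
    """
    # Weighted Clipped Double Q-learning: mostly the minimum, partly the maximum.
    mixed = [
        lmbda * min(q1, q2) + (1.0 - lmbda) * max(q1, q2)
        for q1, q2 in zip(q1_values, q2_values)
    ]
    best = max(mixed)  # best value over the sampled actions
    # No bootstrapping from the next state if the episode has terminated.
    return reward + gamma * (1.0 - float(done)) * best


def soft_update(target_params, source_params, tau=0.005):
    """Polyak update: target = tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]


# Two sampled actions: the mixed estimates are 1.5 and 2.0, so the target is 1 + 0.5 * 2.
print(bcq_target(1.0, False, [1.0, 2.0], [3.0, 2.0], gamma=0.5, lmbda=0.75))  # 2.0
print(soft_update([0.0, 1.0], [1.0, 1.0], tau=0.5))  # [0.5, 1.0]
```

In practice these computations are performed on batched tensors; plain floats are used here only to keep the arithmetic easy to follow.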