cql

Source code: tianshou/algorithm/imitation/cql.py

class CQLTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, actor_loss: float, critic1_loss: float, critic2_loss: float, alpha: float | None = None, alpha_loss: float | None = None, cql_alpha: float | None = None, cql_alpha_loss: float | None = None)

Bases: SACTrainingStats

A data structure for storing loss statistics of the CQL learn step.

cql_alpha: float | None = None

cql_alpha_loss: float | None = None

class CQL(*, policy: SACPolicy, policy_optim: OptimizerFactory, critic: Module, critic_optim: OptimizerFactory, critic2: Module | None = None, critic2_optim: OptimizerFactory | None = None, cql_alpha_lr: float = 0.0001, cql_weight: float = 1.0, tau: float = 0.005, gamma: float = 0.99, alpha: float | Alpha = 0.2, temperature: float = 1.0, with_lagrange: bool = True, lagrange_threshold: float = 10.0, min_action: float = -1.0, max_action: float = 1.0, num_repeat_actions: int = 10, alpha_min: float = 0.0, alpha_max: float = 1000000.0, max_grad_norm: float = 1.0, calibrated: bool = True)

Bases: OfflineAlgorithm[SACPolicy], LaggedNetworkPolyakUpdateAlgorithmMixin

Implementation of the conservative Q-learning (CQL) algorithm. arXiv:2006.04779.
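As a rough illustration of the conservative penalty described in arXiv:2006.04779, the sketch below computes a simplified per-state term: a temperature-scaled LogSumExp over Q-values of sampled actions, minus the Q-value of the dataset action. It is not Tianshou's implementation. The function names are hypothetical, and the importance-weight corrections used in practical implementations are omitted.

```python
import math


def logsumexp(values: list[float]) -> float:
    """Numerically stable log(sum(exp(v))) over a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def conservative_penalty(
    q_sampled: list[float],  # Q(s, a_i) for sampled (random or policy) actions
    q_data: float,           # Q(s, a) for the action stored in the dataset
    temperature: float = 1.0,
) -> float:
    """Simplified CQL penalty (hypothetical helper, not the library's API).

    A soft maximum over the sampled actions' Q-values is pushed down,
    while the Q-value of the dataset action is pushed up.
    """
    scaled = [q / temperature for q in q_sampled]
    return temperature * logsumexp(scaled) - q_data
```

With a small temperature the LogSumExp term approaches the maximum sampled Q-value; with a large temperature it behaves more like an average plus a constant offset.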

Parameters:

policy – the SAC policy, whose actor network maps states to actions (s -> a).
policy_optim – the optimizer factory for the policy's actor network.
critic – the first critic network.
critic_optim – the optimizer factory for the first critic network.
action_space – the environment’s action space.
critic2 – the second critic network (s, a -> Q(s, a)). If None, a deep copy of critic is used.
critic2_optim – the optimizer factory for the second critic network. If None, the first critic’s optimizer factory is cloned.
cql_alpha_lr – the learning rate for the Lagrange multiplier optimization. Controls how quickly the CQL regularization coefficient (cql_alpha, distinct from the entropy coefficient alpha) adapts during training. Higher values allow faster adaptation but may destabilize training. Lower values give more stable but slower adaptation of the regularization strength. Only relevant when with_lagrange=True.
cql_weight – the coefficient that scales the conservative regularization term in the Q-function loss. Controls the strength of the conservative Q-learning component relative to standard TD learning. Higher values enforce more conservative value estimates by penalizing overestimation more strongly. Lower values allow the algorithm to behave more like standard Q-learning. Increasing this weight typically improves performance in purely offline settings where overestimation bias can lead to poor policy extraction.
tau – the soft update coefficient for target networks, controlling the rate at which target networks track the learned networks. When the parameters of the target network are updated with the current (source) network’s parameters, a weighted average is used: target = tau * source + (1 - tau) * target. Smaller values (closer to 0) create more stable but slower learning as target networks change more gradually. Higher values (closer to 1) allow faster learning but may reduce stability. Typically set to a small value (0.001 to 0.01) for most reinforcement learning tasks.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, which can improve performance in tasks where delayed rewards are important but increases training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
alpha – the entropy regularization coefficient alpha or an object which can be used to automatically tune it (e.g. an instance of AutoAlpha).
temperature – the temperature parameter used in the LogSumExp calculation of the CQL loss. Controls the sharpness of the softmax distribution when computing the expected Q-values. Lower values make the LogSumExp operation more selective, focusing on the highest Q-values. Higher values make the operation closer to an average, giving more weight to all Q-values. The temperature affects how conservatively the algorithm penalizes out-of-distribution actions.
with_lagrange – a flag indicating whether to automatically tune the CQL regularization strength. If True, uses Lagrangian dual gradient descent to dynamically adjust the CQL alpha parameter. This formulation maintains the CQL regularization loss near the lagrange_threshold value. Adaptive tuning helps balance conservative learning against excessive pessimism. If False, the conservative loss is scaled by a fixed cql_weight throughout training. The original CQL paper recommends setting this to True for most offline RL tasks.
lagrange_threshold – the target value for the CQL regularization loss when using Lagrangian optimization. When with_lagrange=True, the algorithm dynamically adjusts the CQL alpha parameter to maintain the regularization loss close to this threshold. Lower values result in more conservative behavior by enforcing stronger penalties on out-of-distribution actions. Higher values allow more optimistic Q-value estimates similar to standard Q-learning. This threshold effectively controls the level of conservatism in CQL’s value estimation.
min_action – the lower bound for each dimension of the action space. Used when sampling random actions for the CQL regularization term. Should match the environment’s action space minimum values. These random actions help penalize Q-values for out-of-distribution actions. Typically set to -1.0 for normalized continuous action spaces.
max_action – the upper bound for each dimension of the action space. Used when sampling random actions for the CQL regularization term. Should match the environment’s action space maximum values. These random actions help penalize Q-values for out-of-distribution actions. Typically set to 1.0 for normalized continuous action spaces.
num_repeat_actions – the number of action samples generated per state when computing the CQL regularization term. Controls how many random and policy actions are sampled for each state in the batch when estimating expected Q-values. Higher values provide more accurate approximation of the expected Q-values but increase computational cost. Lower values reduce computation but may provide less stable or less accurate regularization. The original CQL paper typically uses values around 10.
alpha_min – the minimum value allowed for the adaptive CQL regularization coefficient. When using Lagrangian optimization (with_lagrange=True), constrains the automatically tuned cql_alpha parameter to be at least this value. Prevents the regularization strength from becoming too small during training. Setting a positive value ensures the algorithm maintains at least some degree of conservatism. Only relevant when with_lagrange=True.
alpha_max – the maximum value allowed for the adaptive CQL regularization coefficient. When using Lagrangian optimization (with_lagrange=True), constrains the automatically tuned cql_alpha parameter to be at most this value. Prevents the regularization strength from becoming too large during training. Setting an appropriate upper limit helps avoid overly conservative behavior that might hinder learning useful value functions. Only relevant when with_lagrange=True.
max_grad_norm – the maximum L2 norm threshold for gradient clipping when updating critic networks. Gradients with norm exceeding this value will be rescaled to have norm equal to this value. Helps stabilize training by preventing excessively large parameter updates from outlier samples. Higher values allow larger updates but may lead to training instability. Lower values enforce more conservative updates but may slow down learning. Setting to a large value effectively disables gradient clipping.
calibrated – a flag indicating whether to use the calibrated version of CQL (CalQL). If True, calibrates Q-values by taking the maximum of computed Q-values and Monte Carlo returns. This modification helps address the excessive pessimism problem in standard CQL. Particularly useful for offline pre-training followed by online fine-tuning scenarios. Experimental results suggest this approach often achieves better performance than vanilla CQL. Based on techniques from the CalQL paper (arXiv:2303.05479).
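Two of the parameters above reduce to short formulas, sketched here in plain Python under stated assumptions. `soft_update` applies the Polyak rule quoted for tau. `lagrange_scaled_penalty` shows one plausible way the multiplier could be clamped to [alpha_min, alpha_max] and applied to the gap between the penalty and lagrange_threshold. Both function names are hypothetical, and the actual Tianshou code may differ in detail.

```python
def soft_update(
    target: list[float], source: list[float], tau: float = 0.005
) -> list[float]:
    """Polyak averaging: target = tau * source + (1 - tau) * target."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]


def lagrange_scaled_penalty(
    penalty: float,
    cql_alpha: float,
    lagrange_threshold: float = 10.0,
    alpha_min: float = 0.0,
    alpha_max: float = 1e6,
) -> float:
    """Clamp the multiplier, then scale the penalty's gap to the threshold.

    Assumed form of the Lagrangian term: the result is positive when the
    penalty exceeds the threshold and negative when it falls below it.
    """
    clamped = min(max(cql_alpha, alpha_min), alpha_max)
    return clamped * (penalty - lagrange_threshold)
```

In a dual gradient scheme the multiplier is then adjusted to increase this term, so it grows while the penalty stays above the threshold and shrinks once the penalty drops below it.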

process_buffer(buffer: TBuffer) → TBuffer

If self.calibrated is True, adds calibration_returns to buffer._meta.

Parameters: buffer –
Returns:
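The calibration returns can be thought of as discounted Monte Carlo returns computed per episode. The sketch below assumes a single complete episode given as a plain list of rewards and uses hypothetical function names. It does not show how Tianshou stores or computes calibration_returns internally.

```python
def discounted_returns(rewards: list[float], gamma: float = 0.99) -> list[float]:
    """Reward-to-go for each step of one complete episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of its successor.
    for i in reversed(range(len(rewards))):
        running = rewards[i] + gamma * running
        returns[i] = running
    return returns


def calibrate(q_value: float, mc_return: float) -> float:
    """Lower-bound a Q-value by the Monte Carlo return (CalQL-style)."""
    return max(q_value, mc_return)
```

For example, `discounted_returns([1.0, 1.0, 1.0], gamma=0.5)` gives `[1.75, 1.5, 1.0]`, and `calibrate` then keeps whichever of the Q-value and the return is larger.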
