discrete_crr
Source code: tianshou/algorithm/imitation/discrete_crr.py
- class DiscreteCRRTrainingStats(actor_loss: float, critic_loss: float, cql_loss: float, *, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: float)
Bases: SimpleLossTrainingStats
- actor_loss: float
- critic_loss: float
- cql_loss: float
- class DiscreteCRR(*, policy: DiscreteActorPolicy, critic: Module | DiscreteCritic, optim: OptimizerFactory, gamma: float = 0.99, policy_improvement_mode: Literal['exp', 'binary', 'all'] = 'exp', ratio_upper_bound: float = 20.0, beta: float = 1.0, min_q_weight: float = 10.0, target_update_freq: int = 0, return_standardization: bool = False)
Bases: OfflineAlgorithm[DiscreteActorPolicy], LaggedNetworkFullUpdateAlgorithmMixin
Implementation of discrete Critic Regularized Regression. arXiv:2006.15134.
- Parameters:
policy – the policy
critic – the action-value critic (i.e., Q function) network. (s -> Q(s, *))
optim – the optimizer factory for the policy’s actor network and the critic networks.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
policy_improvement_mode (str) – type of the weight function f. Possible values: “exp”, “binary”, “all”.
ratio_upper_bound – when policy_improvement_mode is “exp”, the value of the exp function is upper-bounded by this parameter.
beta – when policy_improvement_mode is “exp”, this is the denominator inside the exp function, i.e. the advantage is divided by beta before it is exponentiated.
min_q_weight – weight for the CQL loss/regularizer. Defaults to 10.
target_update_freq – the number of training iterations between each complete update of the target network. Controls how frequently the target Q-network parameters are updated with the current Q-network values. A value of 0 disables the target network entirely, using only a single network for both action selection and bootstrap targets. Higher values provide more stable learning targets but slow down the propagation of new value estimates. Lower positive values allow faster learning but may lead to instability due to rapidly changing targets. Typically set between 100 and 10000 for DQN variants, with exact values depending on environment complexity.
return_standardization – whether to standardize episode returns by subtracting the running mean and dividing by the running standard deviation. Note that this is known to be detrimental to performance in many cases!
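As a rough illustration of how policy_improvement_mode, beta and ratio_upper_bound interact, the weight function f could be sketched as below. This is not the library's code: crr_weight is a hypothetical helper, the scalar advantage argument stands in for the critic's advantage estimate, “binary” is taken to be the positive-advantage indicator from the CRR paper, and “all” is assumed to weight every dataset action equally (plain behavior cloning).

```python
import math
from typing import Literal


def crr_weight(
    advantage: float,
    mode: Literal["exp", "binary", "all"] = "exp",
    beta: float = 1.0,
    ratio_upper_bound: float = 20.0,
) -> float:
    """Hypothetical sketch of the weight f applied to a dataset action.

    Assumes beta > 0 and ratio_upper_bound > 0.
    """
    if mode == "binary":
        # Imitate only actions the critic rates above average.
        return 1.0 if advantage > 0 else 0.0
    if mode == "exp":
        scaled = advantage / beta
        # Return the cap directly once it is reached; this also avoids
        # overflow in math.exp for very large advantages.
        if scaled >= math.log(ratio_upper_bound):
            return ratio_upper_bound
        return math.exp(scaled)
    if mode == "all":
        # Assumption: every action gets the same weight.
        return 1.0
    raise ValueError(f"unknown policy_improvement_mode: {mode!r}")


# Usage: a larger beta flattens the weights, the cap limits outliers.
print(crr_weight(0.5, "exp", beta=1.0))
print(crr_weight(0.5, "exp", beta=5.0))
print(crr_weight(50.0, "exp"))
```

Under these assumptions, a smaller beta makes the “exp” mode concentrate on high-advantage actions, while ratio_upper_bound keeps a single large advantage from dominating the actor loss.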