a2c

Source code: tianshou/algorithm/modelfree/a2c.py

class A2CTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, loss: tianshou.data.stats.SequenceSummaryStats, actor_loss: tianshou.data.stats.SequenceSummaryStats, vf_loss: tianshou.data.stats.SequenceSummaryStats, ent_loss: tianshou.data.stats.SequenceSummaryStats, gradient_steps: int)

Bases: TrainingStats

loss: SequenceSummaryStats

actor_loss: SequenceSummaryStats

vf_loss: SequenceSummaryStats

ent_loss: SequenceSummaryStats

gradient_steps: int

class ActorCriticOnPolicyAlgorithm(*, policy: ProbabilisticActorPolicy, critic: Module | ContinuousCritic | DiscreteCritic, optim: OptimizerFactory, optim_include_actor: bool, max_grad_norm: float | None = None, gae_lambda: float = 0.95, max_batchsize: int = 256, gamma: float = 0.99, return_scaling: bool = False)

Bases: OnPolicyAlgorithm[ProbabilisticActorPolicy], ABC

Abstract base class for actor-critic algorithms that use generalized advantage estimation (GAE).

Parameters:

critic – the critic network. (s -> V(s))
optim – the optimizer factory.
optim_include_actor – whether the optimizer shall include the actor network’s parameters. Pass False for algorithms that shall update only the critic via the optimizer.
max_grad_norm – the maximum L2 norm threshold for gradient clipping. When not None, gradients are rescaled so that their L2 norm does not exceed this value. This prevents exploding gradients and stabilizes training by limiting the magnitude of parameter updates. Set to None to disable gradient clipping.
gae_lambda – the lambda parameter in [0, 1] for generalized advantage estimation (GAE). Controls the bias-variance tradeoff in advantage estimates, acting as a weighting factor for combining different n-step advantage estimators. Higher values (closer to 1) reduce bias but increase variance by giving more weight to longer trajectories, while lower values (closer to 0) reduce variance but increase bias by relying more on the immediate TD error and value function estimates. At λ=0, GAE becomes equivalent to the one-step TD error (high bias, low variance); at λ=1, it becomes equivalent to Monte Carlo advantage estimation (low bias, high variance). Intermediate values create a weighted average of n-step returns, with exponentially decaying weights for longer-horizon returns. Typically set between 0.9 and 0.99 for most policy gradient methods.
max_batchsize – the maximum number of samples to process at once when computing generalized advantage estimation (GAE) and value function predictions. Controls memory usage by breaking large batches into smaller chunks processed sequentially. Higher values may increase speed but require more GPU/CPU memory; lower values reduce memory requirements but may increase computation time. Should be adjusted based on available hardware resources and total batch size of your training data.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
return_scaling – flag indicating whether to enable scaling of estimated returns by dividing them by their running standard deviation without centering the mean. This reduces the magnitude variation of advantages across different episodes while preserving their signs and relative ordering. The use of running statistics (rather than batch-specific scaling) means that early training experiences may be scaled differently than later ones as the statistics evolve. When enabled, this improves training stability in environments with highly variable reward scales and makes the algorithm less sensitive to learning rate settings. However, it may reduce the algorithm’s ability to distinguish between episodes with different absolute return magnitudes. Best used in environments where the relative ordering of actions is more important than the absolute scale of returns.
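
The roles of gamma and gae_lambda described above can be illustrated with a minimal, framework-free sketch of generalized advantage estimation over a single trajectory. This is not the library's implementation: the function name gae_advantages and its argument layout (per-step lists of rewards, value estimates, next-state value estimates, and done flags) are assumptions made for illustration only.

```python
def gae_advantages(rewards, values, next_values, dones, gamma=0.99, gae_lambda=0.95):
    """Illustrative GAE for one trajectory (hypothetical helper, not Tianshou API).

    rewards[t], values[t] = V(s_t), next_values[t] = V(s_{t+1}), dones[t] = episode ended at t.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Iterate backwards so each step can reuse the accumulated future advantage.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        # Exponentially weighted sum of future TD errors, decayed by gamma * lambda.
        running = delta + gamma * gae_lambda * not_done * running
        advantages[t] = running
    return advantages


rewards = [1.0, 1.0, 1.0]
zeros = [0.0, 0.0, 0.0]
dones = [False, False, True]

# lambda = 0: each advantage is just the one-step TD error.
td_only = gae_advantages(rewards, zeros, zeros, dones, gamma=0.5, gae_lambda=0.0)

# lambda = 1: each advantage is the discounted Monte Carlo return minus V(s_t).
monte_carlo = gae_advantages(rewards, zeros, zeros, dones, gamma=0.5, gae_lambda=1.0)
```

With all value estimates set to zero, the λ=0 call returns the raw rewards, while the λ=1 call returns the discounted returns, which matches the two limiting cases named in the gae_lambda description.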

class A2C(*, policy: ProbabilisticActorPolicy, critic: Module | ContinuousCritic | DiscreteCritic, optim: OptimizerFactory, vf_coef: float = 0.5, ent_coef: float = 0.01, max_grad_norm: float | None = None, gae_lambda: float = 0.95, max_batchsize: int = 256, gamma: float = 0.99, return_scaling: bool = False)

Bases: ActorCriticOnPolicyAlgorithm

Implementation of (synchronous) Advantage Actor-Critic (A2C). arXiv:1602.01783.

Parameters:

policy – the policy containing the actor network.
critic – the critic network. (s -> V(s))
optim – the optimizer factory.
vf_coef – coefficient that weights the value loss relative to the actor loss in the overall loss function. Higher values prioritize accurate value function estimation over policy improvement. Controls the trade-off between policy optimization and value function fitting. Typically set between 0.5 and 1.0 for most actor-critic implementations.
ent_coef – coefficient that weights the entropy bonus relative to the actor loss. Controls the exploration-exploitation trade-off by encouraging policy entropy. Higher values promote more exploration by encouraging a more uniform action distribution. Lower values focus more on exploitation of the current policy’s knowledge. Typically set between 0.01 and 0.05 for most actor-critic implementations.
max_grad_norm – the maximum L2 norm threshold for gradient clipping. When not None, gradients are rescaled so that their L2 norm does not exceed this value. This prevents exploding gradients and stabilizes training by limiting the magnitude of parameter updates. Set to None to disable gradient clipping.
gae_lambda – the lambda parameter in [0, 1] for generalized advantage estimation (GAE). Controls the bias-variance tradeoff in advantage estimates, acting as a weighting factor for combining different n-step advantage estimators. Higher values (closer to 1) reduce bias but increase variance by giving more weight to longer trajectories, while lower values (closer to 0) reduce variance but increase bias by relying more on the immediate TD error and value function estimates. At λ=0, GAE becomes equivalent to the one-step TD error (high bias, low variance); at λ=1, it becomes equivalent to Monte Carlo advantage estimation (low bias, high variance). Intermediate values create a weighted average of n-step returns, with exponentially decaying weights for longer-horizon returns. Typically set between 0.9 and 0.99 for most policy gradient methods.
max_batchsize – the maximum size of the batch when computing GAE.
gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.
return_scaling – flag indicating whether to enable scaling of estimated returns by dividing them by their running standard deviation without centering the mean. This reduces the magnitude variation of advantages across different episodes while preserving their signs and relative ordering. The use of running statistics (rather than batch-specific scaling) means that early training experiences may be scaled differently than later ones as the statistics evolve. When enabled, this improves training stability in environments with highly variable reward scales and makes the algorithm less sensitive to learning rate settings. However, it may reduce the algorithm’s ability to distinguish between episodes with different absolute return magnitudes. Best used in environments where the relative ordering of actions is more important than the absolute scale of returns.
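
How vf_coef and ent_coef weight the loss terms can be sketched as follows. This is a simplified illustration under stated assumptions, not the code in a2c.py: the function a2c_loss and its plain-list inputs are hypothetical, the value loss is assumed to be a mean squared error, and the entropy term is assumed to enter with a negative sign so that higher entropy lowers the total loss. The returned names mirror the loss, actor_loss, vf_loss and ent_loss fields of A2CTrainingStats.

```python
import math


def a2c_loss(log_probs, advantages, values, returns, entropies, vf_coef=0.5, ent_coef=0.01):
    """Illustrative A2C objective on plain Python lists (hypothetical helper)."""
    n = len(log_probs)
    # Policy-gradient term: raise the log-probability of actions with positive advantage.
    actor_loss = -sum(lp * adv for lp, adv in zip(log_probs, advantages)) / n
    # Value-function term: mean squared error between estimated returns and V(s).
    vf_loss = sum((r - v) ** 2 for r, v in zip(returns, values)) / n
    # Mean policy entropy; subtracted below so that it acts as an exploration bonus.
    ent_loss = sum(entropies) / n
    loss = actor_loss + vf_coef * vf_loss - ent_coef * ent_loss
    return loss, actor_loss, vf_loss, ent_loss


# Two samples from a uniform two-action policy with opposite advantages.
loss, actor_loss, vf_loss, ent_loss = a2c_loss(
    log_probs=[math.log(0.5), math.log(0.5)],
    advantages=[1.0, -1.0],
    values=[0.0, 0.0],
    returns=[1.0, 1.0],
    entropies=[math.log(2.0), math.log(2.0)],
)
```

In this toy batch the actor term cancels out, so the total is determined by vf_coef times the value error minus ent_coef times the entropy. Raising vf_coef shifts the gradient budget toward fitting the critic, and raising ent_coef rewards more uniform action distributions.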
