icm
Source code: tianshou/algorithm/modelbased/icm.py
- class ICMTrainingStats(wrapped_stats: TrainingStats, *, icm_loss: float, icm_forward_loss: float, icm_inverse_loss: float)
Bases: TrainingStatsWrapper
In this particular case, super().__init__() should be called LAST in the subclass init.
- class ICMOffPolicyWrapper(*, wrapped_algorithm: OffPolicyAlgorithm[TPolicy], model: IntrinsicCuriosityModule, optim: OptimizerFactory, lr_scale: float, reward_scale: float, forward_loss_weight: float)
Bases: OffPolicyWrapperAlgorithm[TPolicy], _ICMMixin
Implementation of the Intrinsic Curiosity Module (ICM) algorithm for off-policy learning. arXiv:1705.05363.
- Parameters:
wrapped_algorithm – the base algorithm to which we want to add the ICM.
model – the ICM model.
optim – the optimizer factory for the ICM model.
lr_scale – a multiplier that effectively scales the learning rate for the ICM model updates. Higher values increase the step size during optimization of the intrinsic curiosity module. Lower values decrease the step size, leading to more gradual learning of the curiosity mechanism. This parameter offers an alternative to directly adjusting the base learning rate in the optimizer.
reward_scale – a multiplier that controls the magnitude of intrinsic rewards (curiosity-driven rewards generated by the agent itself) relative to extrinsic rewards (task-specific rewards provided by the environment). The prediction error (curiosity signal) is scaled by this value before it is added to the environment rewards. Higher values increase the agent’s motivation to explore novel states. Lower values decrease the influence of curiosity relative to task-specific rewards. Setting it to zero effectively disables intrinsic motivation, while the ICM model itself is still trained.
forward_loss_weight – relative importance in [0, 1] of the forward model loss in relation to the inverse model loss. Controls the trade-off between state prediction and action prediction in the ICM algorithm. Higher values (> 0.5) prioritize learning to predict next states given current states and actions. Lower values (< 0.5) prioritize learning to predict actions given current and next states. The total loss combines both components: (1-forward_loss_weight)*inverse_loss + forward_loss_weight*forward_loss.
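The loss combination described for forward_loss_weight can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the function name combined_icm_loss is hypothetical, scalar floats stand in for tensors, and applying lr_scale as a multiplier on the total loss is an assumption (the parameter description only says that it effectively scales the learning rate).

```python
def combined_icm_loss(
    inverse_loss: float,
    forward_loss: float,
    forward_loss_weight: float,
    lr_scale: float = 1.0,
) -> float:
    """Combine the two ICM losses using the formula from the parameter docs.

    Hypothetical helper for illustration only. Multiplying the total by
    lr_scale is an assumption about how the learning-rate multiplier
    could be applied; it is not confirmed by the documentation above.
    """
    if not 0.0 <= forward_loss_weight <= 1.0:
        raise ValueError("forward_loss_weight must lie in [0, 1]")
    # (1 - forward_loss_weight) * inverse_loss + forward_loss_weight * forward_loss
    total = (1 - forward_loss_weight) * inverse_loss + forward_loss_weight * forward_loss
    return lr_scale * total
```

With forward_loss_weight set to 0.0 only the inverse (action prediction) loss contributes, and with 1.0 only the forward (next-state prediction) loss does.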
- class ICMOnPolicyWrapper(*, wrapped_algorithm: OnPolicyAlgorithm[TPolicy], model: IntrinsicCuriosityModule, optim: OptimizerFactory, lr_scale: float, reward_scale: float, forward_loss_weight: float)
Bases: OnPolicyWrapperAlgorithm[TPolicy], _ICMMixin
Implementation of the Intrinsic Curiosity Module (ICM) algorithm for on-policy learning. arXiv:1705.05363.
- Parameters:
wrapped_algorithm – the base algorithm to which we want to add the ICM.
model – the ICM model.
optim – the optimizer factory for the ICM model.
lr_scale – a multiplier that effectively scales the learning rate for the ICM model updates. Higher values increase the step size during optimization of the intrinsic curiosity module. Lower values decrease the step size, leading to more gradual learning of the curiosity mechanism. This parameter offers an alternative to directly adjusting the base learning rate in the optimizer.
reward_scale – a multiplier that controls the magnitude of intrinsic rewards (curiosity-driven rewards generated by the agent itself) relative to extrinsic rewards (task-specific rewards provided by the environment). The prediction error (curiosity signal) is scaled by this value before it is added to the environment rewards. Higher values increase the agent’s motivation to explore novel states. Lower values decrease the influence of curiosity relative to task-specific rewards. Setting it to zero effectively disables intrinsic motivation, while the ICM model itself is still trained.
forward_loss_weight – relative importance in [0, 1] of the forward model loss in relation to the inverse model loss. Controls the trade-off between state prediction and action prediction in the ICM algorithm. Higher values (> 0.5) prioritize learning to predict next states given current states and actions. Lower values (< 0.5) prioritize learning to predict actions given current and next states. The total loss combines both components: (1-forward_loss_weight)*inverse_loss + forward_loss_weight*forward_loss.
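The effect of reward_scale can be sketched in the same spirit. The helper below is hypothetical and uses plain lists in place of batched arrays; it only mirrors the description above, in which the prediction error is scaled and then added to the environment rewards.

```python
def add_intrinsic_reward(
    extrinsic_rewards: list[float],
    prediction_errors: list[float],
    reward_scale: float,
) -> list[float]:
    """Return rewards augmented by the scaled curiosity signal.

    Hypothetical helper for illustration only: each per-transition
    prediction error is multiplied by reward_scale and added to the
    matching environment reward.
    """
    if len(extrinsic_rewards) != len(prediction_errors):
        raise ValueError("rewards and prediction errors must have the same length")
    return [
        reward + reward_scale * error
        for reward, error in zip(extrinsic_rewards, prediction_errors)
    ]
```

Passing reward_scale=0.0 returns the environment rewards unchanged, which matches the documented behavior that a value of zero disables intrinsic motivation.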