random


class MARLRandomTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>)

Bases: TrainingStats

class MARLRandomDiscreteMaskedOffPolicyAlgorithm(action_space: Space)

Bases: OffPolicyAlgorithm

A random agent used in multi-agent learning.

It randomly chooses an action from the legal actions (according to the given mask).

Parameters:

action_space – the environment’s action space.

class Policy(action_space: Space)

Bases: Policy

A random agent used in multi-agent learning.

It randomly chooses an action from the legal actions.

Parameters:
  • action_space – the environment’s action space.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
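The bounding and scaling steps described for action_bound_method and action_scaling can be sketched in plain NumPy. This is only an illustration of the described behavior, not the library's implementation; the function name bound_and_scale and its arguments are hypothetical.

```python
import numpy as np


def bound_and_scale(act, low, high, bound_method="clip", scaling=True):
    """Hypothetical helper: bound raw actions to [-1, 1], then map to [low, high]."""
    act = np.asarray(act, dtype=float)
    if bound_method == "clip":
        # Hard boundary: values outside [-1, 1] are cut off.
        act = np.clip(act, -1.0, 1.0)
    elif bound_method == "tanh":
        # Smooth boundary: tanh squashes any real value into (-1, 1).
        act = np.tanh(act)
    if scaling:
        # Linear map from [-1, 1] to [low, high].
        act = low + (high - low) * (act + 1.0) / 2.0
    return act


low = np.array([0.0, 0.0, 0.0])
high = np.array([10.0, 10.0, 10.0])
print(bound_and_scale([2.0, -3.0, 0.0], low, high))  # → [10.  0.  5.]
```

With bound_method=None and scaling=False the input is returned unchanged, which corresponds to the case where the actor already produces outputs in the correct range.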

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: dict) → ActBatchProtocol

Compute the random action over the given batch data.

The input must contain a boolean mask in batch.obs.mask, where True marks an action as available and False marks it as unavailable. For example, batch.obs.mask == np.array([[False, True, False]]) means that, with a batch size of 1, action 1 is available while actions 0 and 2 are unavailable.

Returns:

A Batch with an “act” key containing the random action.
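One way to draw a uniformly random legal action per batch row is to assign each action a random score, set the scores of masked-out actions to negative infinity, and take the argmax. The sketch below shows that idea with NumPy only; the function name sample_masked_random_actions is hypothetical, and the snippet is not claimed to be what forward does internally.

```python
import numpy as np


def sample_masked_random_actions(mask, rng=None):
    """Hypothetical helper: pick one random legal action index per row of `mask`."""
    rng = np.random.default_rng() if rng is None else rng
    mask = np.asarray(mask, dtype=bool)
    if not mask.any(axis=-1).all():
        raise ValueError("every row of the mask needs at least one legal action")
    # Independent uniform scores: the argmax over legal entries is uniform among them.
    scores = rng.random(mask.shape)
    # Illegal actions can never win the argmax.
    scores[~mask] = -np.inf
    return scores.argmax(axis=-1)


mask = np.array([[False, True, False]])
print(sample_masked_random_actions(mask))  # → [1]
```

In the example, only action 1 is legal, so it is always selected. With several legal actions in a row, each is chosen with equal probability.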