marl


class MapTrainingStats(agent_id_to_stats: dict[str | int, TrainingStats], train_time_aggregator: Literal['min', 'max', 'mean'] = 'max')

Bases: TrainingStats

get_loss_stats_dict() → dict[str, float]

Collects loss_stats_dicts from all agents, prepends agent_id to all keys, and joins results.
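The merging step described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: the helper names `merge_loss_stats` and `aggregate_train_time` are hypothetical, the `/` separator between agent id and key is an assumption, and applying `train_time_aggregator` to per-agent training times is inferred from the parameter name only.

```python
from statistics import mean


def merge_loss_stats(agent_id_to_loss_stats, sep="/"):
    """Prefix every key with its agent id and join all entries into one dict.

    The separator is an assumption made for this sketch.
    """
    merged = {}
    for agent_id, loss_stats in agent_id_to_loss_stats.items():
        for key, value in loss_stats.items():
            merged[f"{agent_id}{sep}{key}"] = value
    return merged


# Hypothetical counterpart of the 'min' / 'max' / 'mean' choices.
AGGREGATORS = {"min": min, "max": max, "mean": mean}


def aggregate_train_time(train_times, how="max"):
    """Reduce per-agent training times to a single number."""
    return AGGREGATORS[how](train_times)


per_agent = {
    "agent_1": {"loss": 0.5},
    "agent_2": {"loss": 0.25, "actor_loss": 1.0},
}
merged = merge_loss_stats(per_agent)
```

Prefixing keeps identically named losses from different agents apart once they share a single flat dictionary.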

class MAPRolloutBatchProtocol(*args, **kwargs)

Bases: RolloutBatchProtocol, Protocol

class MultiAgentPolicy(policies: dict[str | int, Policy])

Bases: Policy

Parameters:
  • action_space – the environment’s action_space.

  • observation_space – the environment’s observation space.

  • action_scaling – flag indicating whether, for continuous action spaces, actions should be scaled from the standard neural network output range [-1, 1] to the environment’s action space range [action_space.low, action_space.high]. This applies to continuous action spaces only (gym.spaces.Box) and has no effect for discrete spaces. When enabled, policy outputs are expected to be in the normalized range [-1, 1] (after bounding), and are then linearly transformed to the actual required range. This improves neural network training stability, allows the same algorithm to work across environments with different action ranges, and standardizes exploration strategies. Should be disabled if the actor model already produces outputs in the correct range.

  • action_bound_method – the method used for bounding actions in continuous action spaces to the range [-1, 1] before scaling them to the environment’s action space (provided that action_scaling is enabled). This applies to continuous action spaces only (gym.spaces.Box) and should be set to None for discrete spaces. When set to “clip”, actions exceeding the [-1, 1] range are simply clipped to this range. When set to “tanh”, a hyperbolic tangent function is applied, which smoothly constrains outputs to [-1, 1] while preserving gradients. The choice of bounding method affects both training dynamics and exploration behavior. Clipping provides hard boundaries but may create plateau regions in the gradient landscape, while tanh provides smoother transitions but can compress sensitivity near the boundaries. Should be set to None if the actor model inherently produces bounded outputs. Typically used together with action_scaling=True.
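The interplay of bounding and scaling described in these two parameters can be sketched for a single scalar action. The function names `bound_action` and `scale_action` are hypothetical and exist only for this illustration; the linear map is the standard affine transformation from [-1, 1] to [low, high], which is assumed here to match what the description calls "linearly transformed".

```python
import math


def bound_action(raw, method):
    """Bound a raw actor output to [-1, 1] using 'clip', 'tanh', or None."""
    if method == "clip":
        return max(-1.0, min(1.0, raw))
    if method == "tanh":
        return math.tanh(raw)
    if method is None:
        # The actor is assumed to produce bounded outputs already.
        return raw
    raise ValueError(f"unknown bounding method: {method!r}")


def scale_action(bounded, low, high):
    """Map a value in [-1, 1] linearly onto [low, high]."""
    return low + (high - low) * (bounded + 1.0) / 2.0


# A raw output of 2.5 is clipped to 1.0 and lands on the upper bound.
clipped = scale_action(bound_action(2.5, "clip"), low=0.0, high=10.0)
# A raw output of 0.0 stays at 0.0 under tanh and lands on the midpoint.
centered = scale_action(bound_action(0.0, "tanh"), low=-2.0, high=4.0)
```

With "clip", every raw output above 1 maps to the same environment action, which is the plateau effect mentioned above; with "tanh", large raw outputs still differ slightly, but the differences shrink near the bounds.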

add_exploration_noise(act: _TArrOrActBatch, batch: ObsBatchProtocol) → _TArrOrActBatch

Add exploration noise from the sub-policies onto act.

forward(batch: Batch, state: dict | Batch | None = None, **kwargs: Any) → Batch

Dispatch batch data from obs.agent_id to every policy’s forward.

Parameters:
  • batch – TODO: document what is expected at input and make a BatchProtocol for it

  • state – if None, no agent has a state. Otherwise, it should contain the keys “agent_1”, “agent_2”, …

Returns:

a Batch with the following contents (TODO: establish a BatchProtocol for this):

{
    "act": actions corresponding to the input
    "state": {
        "agent_1": output state of agent_1's policy for the state
        "agent_2": xxx
        ...
        "agent_n": xxx}
    "out": {
        "agent_1": output of agent_1's policy for the input
        "agent_2": xxx
        ...
        "agent_n": xxx}
}
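
The dispatch pattern behind this return structure can be sketched without the library. Everything in the block is a simplified assumption: observations and agent ids are plain lists rather than a Batch, and each hypothetical policy is a callable returning an `(actions, out, new_state)` tuple. The sketch only shows how per-agent results are routed back to the positions they came from.

```python
def dispatch_forward(obs_list, agent_ids, policies, state=None):
    """Route each observation to the policy of the agent that owns it."""
    act = [None] * len(obs_list)
    out, new_state = {}, {}
    for agent_id, policy in policies.items():
        # Positions in the input that belong to this agent.
        indices = [i for i, owner in enumerate(agent_ids) if owner == agent_id]
        if not indices:
            continue
        agent_obs = [obs_list[i] for i in indices]
        agent_state = None if state is None else state.get(agent_id)
        agent_act, agent_out, agent_new_state = policy(agent_obs, agent_state)
        # Write the actions back to their original positions.
        for i, action in zip(indices, agent_act):
            act[i] = action
        out[agent_id] = agent_out
        new_state[agent_id] = agent_new_state
    return {"act": act, "state": new_state, "out": out}


# Two hypothetical stateless policies.
def double(obs, state):
    return [2 * o for o in obs], obs, state


def negate(obs, state):
    return [-o for o in obs], obs, state


result = dispatch_forward(
    obs_list=[1, 2, 3, 4],
    agent_ids=["agent_1", "agent_2", "agent_1", "agent_2"],
    policies={"agent_1": double, "agent_2": negate},
)
```

Here "act" is aligned with the input order, while "state" and "out" are keyed by agent id, mirroring the structure shown above.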

class MARLDispatcher(algorithms: list[TAlgorithm], env: PettingZooEnv)

Bases: Generic[TAlgorithm]

Supports multi-agent learning by dispatching calls to the corresponding algorithm for each agent.

algorithms: dict[str | int, TAlgorithm]

maps agent_id to the corresponding algorithm.

agent_idx

maps agent_id to 0-based index.
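
The two mappings can be pictured with a small sketch. It assumes, without confirmation from this page, that the algorithms passed to the constructor are in the same order as the environment's agents; `build_agent_maps` is a hypothetical helper, and the agent ids and algorithm placeholders are made up.

```python
def build_agent_maps(agents, algorithm_list):
    """Build agent_id -> algorithm and agent_id -> 0-based index mappings."""
    if len(agents) != len(algorithm_list):
        raise ValueError("expected exactly one algorithm per agent")
    # Assumption: the i-th algorithm belongs to the i-th agent.
    agent_idx = {agent_id: i for i, agent_id in enumerate(agents)}
    algorithms = dict(zip(agents, algorithm_list))
    return algorithms, agent_idx


algorithms, agent_idx = build_agent_maps(
    agents=["player_0", "player_1"],
    algorithm_list=["algo_a", "algo_b"],  # placeholders for algorithm objects
)
```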

create_policy() → MultiAgentPolicy

dispatch_process_fn(batch: MAPRolloutBatchProtocol, buffer: ReplayBuffer, indices: ndarray) → MAPRolloutBatchProtocol

Dispatch batch data from obs.agent_id to every algorithm’s processing function.

Saves the original multi-dimensional rew in “save_rew”, sets rew to each agent's own reward while that agent's “process_fn” runs, and restores the original reward afterwards.
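
The save, swap, and restore sequence can be sketched as follows. This is a simplified stand-in, not the library code: the batch is a plain dict, rewards are a list of per-step rows with one column per agent, the per-agent filtering by obs.agent_id is left out, and `process_fns` is a hypothetical mapping from agent id to a processing callable.

```python
def dispatch_process(batch, agent_idx, process_fns):
    """Run each agent's processing function on that agent's reward column."""
    # Keep the original multi-dimensional reward under a temporary key.
    batch["save_rew"] = batch["rew"]
    results = {}
    try:
        for agent_id, idx in agent_idx.items():
            # Each agent only sees its own column of the reward.
            batch["rew"] = [row[idx] for row in batch["save_rew"]]
            results[agent_id] = process_fns[agent_id](batch)
    finally:
        # Restore the original reward even if a processing function fails.
        batch["rew"] = batch.pop("save_rew")
    return results


batch = {"rew": [[1.0, -1.0], [0.5, -0.5]]}
agent_idx = {"agent_1": 0, "agent_2": 1}
# Hypothetical processing functions that sum the agent's rewards.
process_fns = {agent_id: (lambda b: sum(b["rew"])) for agent_id in agent_idx}
results = dispatch_process(batch, agent_idx, process_fns)
```

The swap lets single-agent processing code, which expects a one-dimensional reward, run unchanged on multi-agent data.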

dispatch_update_with_batch(batch: MAPRolloutBatchProtocol, algorithm_update_with_batch_fn: Callable[[TAlgorithm, RolloutBatchProtocol], TrainingStats]) → MapTrainingStats

Dispatch the respective subset of the batch data to each algorithm.

Parameters:
  • batch – must map agent_ids to rollout batches

  • algorithm_update_with_batch_fn – a function that performs the algorithm-specific update with the given agent-specific batch data
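
The callback-based dispatch described by these parameters can be sketched as follows. The names `dispatch_update` and `MeanTargetAlgorithm` are hypothetical, and plain dicts and lists stand in for the batch and statistics types.

```python
def dispatch_update(batch, algorithms, update_fn):
    """Call update_fn once per agent with that agent's slice of the batch."""
    agent_id_to_stats = {}
    for agent_id, algorithm in algorithms.items():
        agent_id_to_stats[agent_id] = update_fn(algorithm, batch[agent_id])
    return agent_id_to_stats


class MeanTargetAlgorithm:
    """Hypothetical stand-in for an algorithm with an update method."""

    def __init__(self, scale):
        self.scale = scale

    def update(self, agent_batch):
        return {"loss": self.scale * sum(agent_batch) / len(agent_batch)}


algorithms = {
    "agent_1": MeanTargetAlgorithm(1.0),
    "agent_2": MeanTargetAlgorithm(2.0),
}
batch = {"agent_1": [1.0, 3.0], "agent_2": [2.0, 4.0]}
stats = dispatch_update(batch, algorithms, lambda algo, b: algo.update(b))
```

Passing the update as a function keeps the dispatcher independent of how a particular algorithm family performs its update.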

class MultiAgentOffPolicyAlgorithm(*, algorithms: list[OffPolicyAlgorithm], env: PettingZooEnv)

Bases: OffPolicyAlgorithm[MultiAgentPolicy]

Multi-agent reinforcement learning where each agent uses off-policy learning.

Parameters:
  • algorithms – a list of off-policy algorithms.

  • env – the multi-agent RL environment

get_algorithm(agent_id: str | int) → OffPolicyAlgorithm

class MultiAgentOnPolicyAlgorithm(*, algorithms: list[OnPolicyAlgorithm], env: PettingZooEnv)

Bases: OnPolicyAlgorithm[MultiAgentPolicy]

Multi-agent reinforcement learning where each agent uses on-policy learning.

Parameters:
  • algorithms – a list of on-policy algorithms.

  • env – the multi-agent RL environment

get_algorithm(agent_id: str | int) → OnPolicyAlgorithm