psrl


class PSRLTrainingStats(*, train_time: float = 0.0, smoothed_loss: dict = <factory>, psrl_rew_mean: float = 0.0, psrl_rew_std: float = 0.0)

Bases: TrainingStats

psrl_rew_mean: float = 0.0
psrl_rew_std: float = 0.0

class PSRLModel(trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, gamma: float, epsilon: float)

Bases: object

Implementation of Posterior Sampling Reinforcement Learning Model.

Parameters:
  • trans_count_prior – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).

  • rew_mean_prior – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • epsilon – the tolerance used for precision control in value iteration.
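The shapes listed above can be easy to mix up, so here is a minimal sketch of how the prior arrays might be built with NumPy. The sizes `n_state` and `n_action` and the particular prior values are assumptions chosen for illustration; they are not defaults of the library.

```python
import numpy as np

# Hypothetical sizes for a small tabular MDP.
n_state, n_action = 4, 2

# Dirichlet concentration parameters (alphas): one vector over next states
# for every (state, action) pair. All ones corresponds to a uniform prior.
trans_count_prior = np.ones((n_state, n_action, n_state))

# Normal prior over the reward of every (state, action) pair.
rew_mean_prior = np.zeros((n_state, n_action))
rew_std_prior = np.ones((n_state, n_action))

gamma = 0.99    # discount factor
epsilon = 0.01  # precision control for value iteration

print(trans_count_prior.shape, rew_mean_prior.shape, rew_std_prior.shape)
```

These five values line up, in order, with the positional arguments in the `PSRLModel` signature shown above.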

observe(trans_count: ndarray, rew_sum: ndarray, rew_square_sum: ndarray, rew_count: ndarray) -> None

Add data to the memory pool.

Rewards start with a normal prior. Once a reward has been observed for a given state-action pair, the mean of the observations replaces the prior mean as the posterior mean. The standard deviation is inversely proportional to the number of corresponding observations.

Parameters:
  • trans_count – the number of observed transitions, with shape (n_state, n_action, n_state).

  • rew_sum – total rewards, with shape (n_state, n_action).

  • rew_square_sum – total rewards’ squares, with shape (n_state, n_action).

  • rew_count – the number of rewards, with shape (n_state, n_action).
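The reward update described above can be sketched in a few lines of NumPy. This is a simplified illustration of the stated rule (observation mean replaces the prior mean, standard deviation shrinks with the observation count), not the library's actual code; the helper name `update_reward_posterior` is hypothetical, and the sketch ignores `rew_square_sum`.

```python
import numpy as np

def update_reward_posterior(rew_mean, rew_count, rew_std_prior, rew_sum, new_count):
    """Hypothetical helper: fold a batch of reward observations into running stats."""
    total = rew_count + new_count
    safe_total = np.maximum(total, 1)  # avoid dividing by zero for unseen pairs
    # Mean of all observations so far; pairs never observed keep their prior mean.
    observed_mean = (rew_mean * rew_count + rew_sum) / safe_total
    mean = np.where(total > 0, observed_mean, rew_mean)
    # Standard deviation inversely proportional to the number of observations.
    std = rew_std_prior / safe_total
    return mean, std, total

# Two states, one action: only the first pair receives observations.
rew_mean = np.full((2, 1), 0.5)       # prior means
rew_count = np.zeros((2, 1))          # nothing observed yet
rew_std_prior = np.ones((2, 1))
rew_sum = np.array([[6.0], [0.0]])    # three rewards summing to 6 for pair (0, 0)
new_count = np.array([[3.0], [0.0]])

mean, std, total = update_reward_posterior(
    rew_mean, rew_count, rew_std_prior, rew_sum, new_count
)
print(mean.ravel(), std.ravel())
```

In this sketch the observed pair ends up with mean 2.0 (6 / 3) and a standard deviation one third of its prior, while the unobserved pair keeps its prior mean and standard deviation.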

sample_trans_prob() -> ndarray
sample_reward() -> ndarray
solve_policy() -> None
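The sampling methods above carry no description here. Given the Dirichlet transition prior and normal reward prior documented for the constructor, one plausible reading is that they draw a single MDP from the posterior. The sketch below illustrates that idea only; the function `sample_mdp` and its arguments are hypothetical and are not part of the documented API.

```python
import numpy as np

def sample_mdp(trans_count, rew_mean, rew_std, rng):
    """Hypothetical helper: draw one MDP from Dirichlet / normal posteriors."""
    n_state, n_action, _ = trans_count.shape
    trans_prob = np.empty_like(trans_count, dtype=float)
    for s in range(n_state):
        for a in range(n_action):
            # One categorical distribution over next states per (state, action).
            trans_prob[s, a] = rng.dirichlet(trans_count[s, a])
    # One reward per (state, action), drawn elementwise.
    rew = rng.normal(rew_mean, rew_std)
    return trans_prob, rew

rng = np.random.default_rng(0)
trans_prob, rew = sample_mdp(
    trans_count=np.ones((3, 2, 3)),
    rew_mean=np.zeros((3, 2)),
    rew_std=np.ones((3, 2)),
    rng=rng,
)
print(trans_prob.shape, rew.shape)
```

Each row `trans_prob[s, a]` sums to 1, so the sampled array can be passed directly to a tabular solver as a transition model.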
static value_iteration(trans_prob: ndarray, rew: ndarray, gamma: float, eps: float, value: ndarray) -> tuple[ndarray, ndarray]

Value iteration solver for MDPs.

Parameters:
  • trans_prob – transition probabilities, with shape (n_state, n_action, n_state).

  • rew – rewards, with shape (n_state, n_action).

  • eps – the tolerance used for precision control.

  • gamma – the discount factor in [0, 1] for future rewards. This determines how much future rewards are valued compared to immediate ones. Lower values (closer to 0) make the agent focus on immediate rewards, creating “myopic” behavior. Higher values (closer to 1) make the agent value long-term rewards more, potentially improving performance in tasks where delayed rewards are important but increasing training variance by incorporating more environmental stochasticity. Typically set between 0.9 and 0.99 for most reinforcement learning tasks.

  • value – the initial values of the value array, with shape (n_state, ).

Returns:

a tuple containing the optimal policy and the corresponding state values, each with shape (n_state, ).
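For readers unfamiliar with the technique, here is a minimal sketch of tabular value iteration using the same argument shapes. It is a textbook version written for illustration, not the library's implementation: the name `value_iteration_sketch` is hypothetical, and the stopping rule (largest absolute change below `eps`) is an assumption about how `eps` might be used.

```python
import numpy as np

def value_iteration_sketch(trans_prob, rew, gamma, eps, value):
    """Textbook value iteration; returns (greedy policy, state values)."""
    while True:
        # Q[s, a] = r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')
        q = rew + gamma * trans_prob.dot(value)
        new_value = q.max(axis=1)
        # Assumed stopping rule: stop once no state value moves by eps or more.
        if np.max(np.abs(new_value - value)) < eps:
            return q.argmax(axis=1), new_value
        value = new_value

# Two-state, two-action MDP with deterministic transitions.
trans_prob = np.zeros((2, 2, 2))
trans_prob[0, 0, 0] = 1.0  # state 0, action 0: stay in 0
trans_prob[0, 1, 1] = 1.0  # state 0, action 1: move to 1
trans_prob[1, 0, 1] = 1.0  # state 1, action 0: stay in 1
trans_prob[1, 1, 0] = 1.0  # state 1, action 1: move to 0
rew = np.array([[0.0, 1.0],
                [2.0, 0.0]])

policy, values = value_iteration_sketch(
    trans_prob, rew, gamma=0.9, eps=1e-6, value=np.zeros(2)
)
print(policy, values)
```

In this example the greedy policy moves from state 0 to state 1 and then stays there, since staying in state 1 pays a reward of 2 at every step.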

class PSRLPolicy(*, trans_count_prior: ndarray, rew_mean_prior: ndarray, rew_std_prior: ndarray, action_space: Discrete, discount_factor: float = 0.99, epsilon: float = 0.01, observation_space: Space | None = None)

Bases: Policy

Parameters:
  • trans_count_prior – Dirichlet prior (alphas), with shape (n_state, n_action, n_state).

  • rew_mean_prior – means of the normal priors of rewards, with shape (n_state, n_action).

  • rew_std_prior – standard deviations of the normal priors of rewards, with shape (n_state, n_action).

  • action_space – the environment’s action_space.

  • epsilon – the tolerance used for precision control in value iteration.

  • observation_space – the environment’s observation space.

forward(batch: ObsBatchProtocol, state: dict | BatchProtocol | ndarray | None = None, **kwargs: Any) -> ActBatchProtocol

Compute action over the given batch data with PSRL model.

Returns:

A Batch with “act” key containing the action.

class PSRL(*, policy: PSRLPolicy, add_done_loop: bool = False)

Bases: OnPolicyAlgorithm[PSRLPolicy]

Implementation of Posterior Sampling Reinforcement Learning (PSRL).

Reference: Strens M., A Bayesian Framework for Reinforcement Learning, ICML, 2000.

Parameters:
  • policy – the policy

  • add_done_loop – a flag indicating whether to add a self-loop transition for terminal states in the MDP. If True, whenever an episode terminates, an artificial transition from the terminal state back to itself is added to the transition counts for all actions. This can help stabilize learning for terminal states that have few samples, and can be beneficial in environments where episodes terminate frequently, because it ensures that terminal states receive enough updates to their value estimates. The default is False, which preserves the standard MDP formulation without artificial self-loops.
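The effect of add_done_loop on the transition counts can be pictured with a small NumPy sketch. The helper `add_self_loop` and the array sizes are hypothetical and only illustrate the description above; the size of the increment (one count per action) is an assumption.

```python
import numpy as np

def add_self_loop(trans_count, terminal_state):
    """Hypothetical helper: record terminal_state -> terminal_state for all actions."""
    # Assumed increment of one artificial observation per action.
    trans_count[terminal_state, :, terminal_state] += 1
    return trans_count

n_state, n_action = 3, 2
trans_count = np.zeros((n_state, n_action, n_state))

# Suppose an episode has just terminated in state 2.
trans_count = add_self_loop(trans_count, terminal_state=2)
print(trans_count[2])
```

After the call, every action taken in state 2 has one recorded transition back to state 2, and no other entry of the count array changes.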