rliable_evaluation


The rliable_evaluation module provides a high-level interface for evaluating the results of an experiment with multiple runs on different seeds using the rliable library. The API is experimental and subject to change.

class EvaluationSequenceEntry(env_step: int, rew: float, rew_std: float, iqm: float, iqm_confidence_interval: tuple[float, float])

Bases: object

A single entry in an evaluation sequence, representing data collected at a fixed environment step.

env_step: int

The number of environment steps at which the evaluation was performed.

rew: float

The mean episode return at the given env_step. It is called rew (somewhat confusingly) for consistency with Tianshou’s internal naming conventions.

rew_std: float

The standard deviation of the episode returns at the given env_step, computed from multiple runs.

iqm: float

The interquartile mean (IQM) of the episode returns at the given env_step, computed from multiple runs.

iqm_confidence_interval: tuple[float, float]

The 95% confidence interval of the IQM of the episode returns at the given env_step.
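
To make the IQM and its confidence interval concrete, the following sketch computes the fields of one entry from the returns of several runs at a single evaluation point. It uses plain NumPy. The function names (interquartile_mean, bootstrap_ci), the sample returns, and the simple percentile bootstrap are illustrative assumptions and not the module's implementation, which delegates these computations to rliable.

```python
import numpy as np


def interquartile_mean(values) -> float:
    """Mean of the middle 50% of the values (25% trimmed from each end)."""
    ordered = np.sort(np.asarray(values, dtype=float).ravel())
    cut = int(0.25 * ordered.size)  # number of values dropped at each end
    return float(ordered[cut : ordered.size - cut].mean())


def bootstrap_ci(values, n_resamples=2000, level=0.95, seed=0):
    """Percentile-bootstrap interval of the IQM (an assumed, simplified scheme)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample the runs with replacement and recompute the IQM each time.
    stats = [
        interquartile_mean(rng.choice(values, size=values.size, replace=True))
        for _ in range(n_resamples)
    ]
    alpha = (1.0 - level) / 2.0
    low, high = np.quantile(stats, [alpha, 1.0 - alpha])
    return float(low), float(high)


# Hypothetical returns of 8 runs (seeds) at one evaluation point.
returns_at_step = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 100.0])

rew = float(returns_at_step.mean())
rew_std = float(returns_at_step.std())
iqm = interquartile_mean(returns_at_step)
iqm_confidence_interval = bootstrap_ci(returns_at_step)
```

In this made-up sample the single outlier of 100 pulls the mean up to 16.0, whereas the IQM stays at 4.5, which illustrates why the IQM is reported alongside the mean.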

class LoggedSummaryData(mean: numpy.ndarray, std: numpy.ndarray, max: numpy.ndarray, min: numpy.ndarray)

Bases: object

mean: ndarray

std: ndarray

max: ndarray

min: ndarray

class LoggedCollectStats(env_step: numpy.ndarray | None = None, n_collected_episodes: numpy.ndarray | None = None, n_collected_steps: numpy.ndarray | None = None, collect_time: numpy.ndarray | None = None, collect_speed: numpy.ndarray | None = None, returns_stat: LoggedSummaryData | None = None, lens_stat: LoggedSummaryData | None = None)

Bases: object

env_step: ndarray | None = None

n_collected_episodes: ndarray | None = None

n_collected_steps: ndarray | None = None

collect_time: ndarray | None = None

collect_speed: ndarray | None = None

returns_stat: LoggedSummaryData | None = None

lens_stat: LoggedSummaryData | None = None

classmethod from_data_dict(data: dict) → LoggedCollectStats

Create a LoggedCollectStats object from a dictionary.

Converts SequenceSummaryStats from dict format to dataclass format and ignores fields that are not present.
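
The conversion described above can be pictured with a reduced sketch. The classes SummarySketch and CollectStatsSketch are hypothetical stand-ins with only a subset of the documented fields, and the assumption that dictionary keys match field names is ours. The sketch tolerates both readings of "fields that are not present": keys unknown to the dataclass are skipped, and fields missing from the dictionary keep their None default.

```python
from dataclasses import dataclass, fields
from typing import Optional

import numpy as np


@dataclass
class SummarySketch:
    """Stand-in for LoggedSummaryData."""
    mean: np.ndarray
    std: np.ndarray
    max: np.ndarray
    min: np.ndarray


@dataclass
class CollectStatsSketch:
    """Stand-in for LoggedCollectStats with a subset of its fields."""
    env_step: Optional[np.ndarray] = None
    returns_stat: Optional[SummarySketch] = None
    lens_stat: Optional[SummarySketch] = None

    @classmethod
    def from_data_dict(cls, data: dict) -> "CollectStatsSketch":
        kwargs = {}
        for f in fields(cls):
            if f.name not in data:
                continue  # field absent from the dict: keep the default (None)
            value = data[f.name]
            if isinstance(value, dict):
                # Nested summary statistics arrive as a dict; convert them.
                value = SummarySketch(**value)
            kwargs[f.name] = value
        return cls(**kwargs)  # keys unknown to the dataclass were never copied


data = {
    "env_step": np.array([1000, 2000]),
    "returns_stat": {
        "mean": np.array([10.0, 12.0]),
        "std": np.array([1.0, 0.5]),
        "max": np.array([12.0, 13.0]),
        "min": np.array([8.0, 11.0]),
    },
    "unknown_key": 123,  # not a field of the dataclass, so it is ignored
}
stats = CollectStatsSketch.from_data_dict(data)
```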

class MultiRunExperimentResult(exp_dir: str, exp_name: str, test_episode_returns_RE: ndarray, training_episode_returns_RE: ndarray, test_env_steps_E: ndarray, training_env_steps_E: ndarray)

Bases: object

The result of multiple runs of an experiment (runs usually just differing by random seeds) that can be used with the rliable library.

Glossary:
  • R: number of runs (typically equal to the number of different seeds)

  • E: number of environment steps at which evaluation results were computed, i.e., the evaluation points n_1, n_2, …, n_E

exp_dir: str

The base directory where each sub-directory contains the results of one experiment run.

exp_name: str

The name of the experiment, typically the name of the algorithm or the experiment directory basename.

test_episode_returns_RE: ndarray

The test episode returns for each run of the experiment, where each row corresponds to one run.

training_episode_returns_RE: ndarray

The training episode returns for each run of the experiment, where each row corresponds to one run.

test_env_steps_E: ndarray

The environment steps at which the test episodes were evaluated.

training_env_steps_E: ndarray

The environment steps at which the training episodes were evaluated.
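
The suffixes _RE and _E encode array shapes using the glossary above: an _RE array has one row per run and one column per evaluation point, and an _E array has one entry per evaluation point. The following sketch with made-up numbers illustrates the convention; the values and the derived names mean_return_E and std_return_E are hypothetical.

```python
import numpy as np

R, E = 3, 4  # hypothetical: 3 runs (seeds), 4 evaluation points

test_env_steps_E = np.array([1000, 2000, 3000, 4000])  # shape (E,)
test_episode_returns_RE = np.array(
    [
        [10.0, 20.0, 30.0, 40.0],  # run 0
        [12.0, 18.0, 33.0, 41.0],  # run 1
        [8.0, 22.0, 27.0, 39.0],   # run 2
    ]
)  # shape (R, E)

# Aggregating over axis 0 (the run axis) yields one value per evaluation point.
mean_return_E = test_episode_returns_RE.mean(axis=0)
std_return_E = test_episode_returns_RE.std(axis=0)
```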

classmethod load_from_disk(exp_dir: str, exp_name: str | None = None, max_env_step: int | None = None) → MultiRunExperimentResult

Load the experiment result from disk.

Parameters:
  • exp_dir – The directory from which the experiment results are loaded.

  • exp_name – The name of the experiment. If not passed, will be inferred from the experiment directory name.

  • max_env_step – The maximum number of environment steps to consider. If None, all data is considered. Note: if the experiments have different numbers of steps, the minimum number is used.
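
The note on max_env_step implies that runs of unequal length are cut to a common length. A minimal sketch of that idea follows. The helper align_runs is hypothetical, it assumes all runs share the same evaluation steps, and treating max_env_step as an inclusive bound is also an assumption; the actual loading logic may differ.

```python
import numpy as np


def align_runs(returns_per_run, env_steps, max_env_step=None):
    """Stack runs of unequal length into an (R, E) array (hypothetical helper)."""
    # Truncate every run to the length of the shortest one.
    n_points = min(len(run) for run in returns_per_run)
    steps_E = np.asarray(env_steps[:n_points])
    returns_RE = np.array([run[:n_points] for run in returns_per_run], dtype=float)
    # Optionally drop evaluation points beyond max_env_step (inclusive bound assumed).
    if max_env_step is not None:
        keep = steps_E <= max_env_step
        steps_E, returns_RE = steps_E[keep], returns_RE[:, keep]
    return returns_RE, steps_E


# Three hypothetical runs that logged 4, 3 and 5 evaluation points.
runs = [[1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5], [0.5, 1.5, 2.5, 3.5, 4.5]]
env_steps = [1000, 2000, 3000, 4000, 5000]

returns_RE, steps_E = align_runs(runs, env_steps)
capped_RE, capped_steps_E = align_runs(runs, env_steps, max_env_step=2000)
```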

eval_results(algo_name: str | None = None, score_thresholds: ndarray | None = None, save_as_json: bool = True, save_plots: bool = True, show_plots: bool = True, scope: DataScope = DataScope.TEST, ax_iqm_sample_efficiency: Axes | None = None, ax_performance_profile: Axes | None = None, algo2color: dict[str, str] | None = None) → tuple[Figure, Axes, Figure, Axes]

Evaluate the results of an experiment and create a sample efficiency curve and a performance profile.

Parameters:
  • algo_name – The name of the algorithm to be shown in the figure legend. If None, the name of the algorithm is set to the experiment dir.

  • score_thresholds – The score thresholds for the performance profile. If None, they will be inferred from the minimum and maximum test episode returns.

  • save_as_json – Whether to save the evaluation results as a JSON file (in a format compatible with the Tianshou benchmarking visualization) in the experiment directory.

  • save_plots – Whether to save the plots to the experiment directory.

  • show_plots – Whether to display the plots.

  • scope – The scope of the evaluation, either ‘TEST’ or ‘TRAIN’.

  • ax_iqm_sample_efficiency – The axis to plot the IQM sample efficiency curve on. If None, a new figure is created.

  • ax_performance_profile – The axis to plot the performance profile on. If None, a new figure is created.

  • algo2color – A dictionary mapping algorithm names to colors. Useful for plotting the evaluations of multiple algorithms in the same figure, e.g., by first creating an ax_iqm and ax_profile with one evaluation and then passing them into the other evaluation. Same as the colors kwarg in the rliable plotting utils.

Returns:

The created figures and axes in the order: fig_iqm, ax_iqm, fig_profile, ax_profile.
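
As a rough illustration of what score_thresholds controls, the sketch below infers thresholds from the minimum and maximum return and computes a performance profile as the fraction of runs scoring above each threshold. The sample returns, the number of thresholds (4), and the use of one score per run are assumptions made for brevity; the actual plot is produced by rliable.

```python
import numpy as np

# Hypothetical score of each of 4 runs (one value per run).
final_returns_R = np.array([1.0, 2.0, 3.0, 4.0])

# Thresholds spanning the observed range, mirroring the documented default idea.
score_thresholds = np.linspace(final_returns_R.min(), final_returns_R.max(), 4)

# Performance profile: fraction of runs whose score exceeds each threshold.
profile = np.array([(final_returns_R > tau).mean() for tau in score_thresholds])
```

The resulting profile is non-increasing in the threshold and reaches zero at the maximum observed return.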

to_evaluation_sequence(scope: DataScope = DataScope.TEST) → Iterator[EvaluationSequenceEntry]

Convert the experiment result to a sequence of EvaluationSequenceEntry objects, one per evaluation point.

Parameters:

scope – The scope of the evaluation, either ‘TEST’ or ‘TRAIN’.

Returns:

An iterator over EvaluationSequenceEntry objects.
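
The iteration can be sketched as a generator that walks over the columns of an (R, E) return array and emits one entry per evaluation point. EntrySketch is a reduced, hypothetical stand-in for EvaluationSequenceEntry that omits the IQM fields, and iter_entries is not part of the module.

```python
from typing import Iterator, NamedTuple

import numpy as np


class EntrySketch(NamedTuple):
    """Reduced stand-in for EvaluationSequenceEntry (IQM fields omitted)."""
    env_step: int
    rew: float
    rew_std: float


def iter_entries(returns_RE: np.ndarray, env_steps_E: np.ndarray) -> Iterator[EntrySketch]:
    # Transposing gives one row per evaluation point, holding the returns of all runs.
    for step, returns_R in zip(env_steps_E, returns_RE.T):
        yield EntrySketch(int(step), float(returns_R.mean()), float(returns_R.std()))


# Two hypothetical runs evaluated at two environment steps.
returns_RE = np.array([[1.0, 3.0], [3.0, 5.0]])
entries = list(iter_entries(returns_RE, np.array([100, 200])))
```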

load_and_eval_experiment(log_dir: str, show_plots: bool = True, save_plots: bool = True, save_as_json: bool = True, scope: DataScope | Literal['both'] = DataScope.TEST, max_env_step: int | None = None) → MultiRunExperimentResult

Evaluate the experiments in the given log directory using the rliable API and return the loaded results object. By default, the evaluation results are persisted as plots and JSON files in the experiment directory.

Parameters:
  • log_dir – The directory containing the experiment results.

  • show_plots – Whether to display plots.

  • save_plots – Whether to save plots to the log_dir.

  • save_as_json – Whether to save the evaluation results as a JSON file (in a format compatible with the Tianshou benchmarking visualization) in the experiment directory.

  • scope – The scope of the evaluation (training or test) or ‘both’.

  • max_env_step – The maximum number of environment steps to consider. If None, all data is considered. Note: if the experiments have different numbers of steps, the minimum number is used.
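
Because scope accepts either a single DataScope or the literal 'both', a caller-side view of how such an argument might be expanded is sketched below. ScopeSketch is a hypothetical stand-in for DataScope (its member values are assumptions), and resolve_scopes is not part of the module.

```python
from enum import Enum
from typing import List, Union


class ScopeSketch(Enum):
    """Stand-in for DataScope; the member values are assumptions."""
    TEST = "test"
    TRAIN = "train"


def resolve_scopes(scope: Union[ScopeSketch, str]) -> List[ScopeSketch]:
    """Expand 'both' into the two scopes; pass a single scope through."""
    if scope == "both":
        return [ScopeSketch.TEST, ScopeSketch.TRAIN]
    if isinstance(scope, ScopeSketch):
        return [scope]
    raise ValueError(f"Unsupported scope: {scope!r}")


scopes_for_both = resolve_scopes("both")
scopes_for_test = resolve_scopes(ScopeSketch.TEST)
```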