Vectorized Environment#

In reinforcement learning, an agent improves by interacting with environments. This tutorial concentrates on the environment side. Although DRL research uses many kinds of environments and environment libraries, Tianshou keeps its API consistent with OpenAI Gym.

In Gym, an environment receives an action and returns the next observation and reward. This process is slow and can sometimes be the throughput bottleneck in a DRL experiment.

Tianshou provides vectorized environment wrappers for Gym environments. These wrappers let you use multiple CPU cores on your server to accelerate data sampling.

import time

import gymnasium as gym
import numpy as np

from tianshou.env import DummyVectorEnv, SubprocVectorEnv

num_cpus = [1, 2, 5]
for num_cpu in num_cpus:
    # One subprocess per environment
    env = SubprocVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(num_cpu)])
    env.reset()  # environments must be reset before the first step
    sampled_steps = 0
    time_start = time.time()
    while sampled_steps < 1000:
        act = np.random.choice(2, size=num_cpu)
        obs, rew, terminated, truncated, info = env.step(act)
        if np.sum(terminated):
            # Reset only the environments whose episode has terminated
            env.reset(np.where(terminated)[0])
        sampled_steps += num_cpu
    time_used = time.time() - time_start
    print(f"{time_used}s used to sample 1000 steps if using {num_cpu} cpus.")
0.3047788143157959s used to sample 1000 steps if using 1 cpus.
0.20440268516540527s used to sample 1000 steps if using 2 cpus.
0.16405653953552246s used to sample 1000 steps if using 5 cpus.

You may notice that the sampling speed does not increase linearly as we add subprocesses. There are several reasons for this. One is that synchronous stepping causes a straggler effect: every step waits for the slowest environment to finish. One way to address this is to use asynchronous mode, which is covered in the further reading at the end of this tutorial.
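The straggler effect can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a measurement of Tianshou: the per-environment step durations are hypothetical (uniform in [0.5, 1.5] time units), the helper names are made up for illustration, and communication overhead is ignored.

```python
import random


def sync_step_time(durations):
    """A synchronous vectorized step finishes only when the slowest env does."""
    return max(durations)


def avg_time_per_sample(num_envs, num_rounds=2000, seed=0):
    """Average wall-clock time per sampled step under synchronous stepping."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_rounds):
        # Hypothetical per-env step durations, uniform in [0.5, 1.5] time units.
        durations = [rng.uniform(0.5, 1.5) for _ in range(num_envs)]
        total += sync_step_time(durations)
    return total / (num_rounds * num_envs)


baseline = avg_time_per_sample(1)
for n in (1, 2, 5):
    measured = avg_time_per_sample(n)
    ideal = baseline / n  # what perfect linear speedup would give
    print(f"{n} envs: {measured:.3f} per sample (ideal {ideal:.3f})")
```

Because each round costs the maximum of the individual durations rather than their mean, the measured time per sample stays above the ideal value as environments are added.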

Note that SubprocVectorEnv should only be used when environment execution is slow. In practice, DummyVectorEnv (or a raw Gym environment) is actually more efficient for a simple environment like CartPole, because it avoids both the straggler effect and the overhead of communication between subprocesses.



To create a vectorized environment, pass in a list of functions, each of which returns an initialized environment when called.

# In Gym
gym_env = gym.make("CartPole-v1")

# In Tianshou
def create_cartpole_env() -> gym.Env:
    return gym.make("CartPole-v1")

# Create 5 environments, matching the 5 CPUs we assume are available.
# DummyVectorEnv steps them sequentially; SubprocVectorEnv would use one subprocess each.
vector_env = DummyVectorEnv([create_cartpole_env for _ in range(5)])
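When each environment needs its own argument (a seed, for example), it helps to be careful about how the list of functions is built. The sketch below uses a hypothetical `make_env` stand-in rather than a real Gym constructor. It shows a general Python pitfall: a lambda that refers to the loop variable reads that variable only when it is called, whereas `functools.partial` binds the value when the factory is created.

```python
from functools import partial


def make_env(seed):
    """Hypothetical stand-in for an environment constructor."""
    return {"seed": seed}


# Pitfall: every lambda looks up `i` when called, after the loop has finished.
late_bound = [lambda: make_env(i) for i in range(3)]

# partial binds the argument at creation time, so each factory keeps its own seed.
bound = [partial(make_env, seed=i) for i in range(3)]

print([fn()["seed"] for fn in late_bound])  # [2, 2, 2]
print([fn()["seed"] for fn in bound])       # [0, 1, 2]
```

The factories in this tutorial take no per-environment argument, so they are not affected.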

EnvPool support#

Besides its built-in environment wrappers, Tianshou also fully supports EnvPool. See its GitHub page for details.

Environment execution and resetting#

The only difference between vectorized environments and standard Gym environments is that the actions passed in and the rewards and observations returned are also vectorized.

# In Gymnasium, env.reset() returns an (observation, info) tuple.
print("In Gym, env.reset() returns a single observation.")
print(gym_env.reset())

# In Tianshou, a VectorEnv's reset() returns stacked observations.
print("In Tianshou, a VectorEnv's reset() returns stacked observations.")
print(vector_env.reset())

# step() returns (obs, rew, terminated, truncated, info); index 4 is info.
info = vector_env.step(np.random.choice(2, size=vector_env.env_num))[4]
print(info)
In Gym, env.reset() returns a single observation.
(array([ 0.02897538,  0.04216968, -0.03254366, -0.01787178], dtype=float32), {})
In Tianshou, a VectorEnv's reset() returns stacked observations.
(array([[-0.04430439,  0.00154157, -0.01929247,  0.0457021 ],
       [ 0.04496346,  0.04737651, -0.032956  ,  0.01457956],
       [ 0.04939883,  0.0221455 , -0.01858444, -0.00380491],
       [ 0.01733261, -0.00615817, -0.04921177,  0.02571244],
       [-0.04183235, -0.04162949, -0.02485959,  0.00987462]],
      dtype=float32), array([{}, {}, {}, {}, {}], dtype=object))
[{'env_id': 0} {'env_id': 1} {'env_id': 2} {'env_id': 3} {'env_id': 4}]

If we only want to step a subset of the environments, the id argument can be used.

info = vector_env.step(np.random.choice(2, size=3), id=[0, 3, 1])[4]
print(info)
[{'env_id': 0} {'env_id': 3} {'env_id': 1}]
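To make the batching and the id argument concrete, here is a minimal pure-Python sketch of a sequential vectorized wrapper. It is not Tianshou's implementation: `CounterEnv` and `ToyVectorEnv` are hypothetical names, plain lists stand in for stacked NumPy arrays, and the per-environment `env_id` entry in info is modeled on the output shown above.

```python
class CounterEnv:
    """Hypothetical toy environment: the observation is a running sum of actions."""

    def __init__(self):
        self.count = 0

    def reset(self):
        self.count = 0
        return self.count, {}

    def step(self, action):
        self.count += action
        # (obs, reward, terminated, truncated, info)
        return self.count, float(action), False, False, {}


class ToyVectorEnv:
    """Steps a list of environments one after another and batches the results."""

    def __init__(self, env_fns):
        self.envs = [fn() for fn in env_fns]
        self.env_num = len(self.envs)

    def reset(self):
        results = [env.reset() for env in self.envs]
        return [r[0] for r in results], [r[1] for r in results]

    def step(self, actions, id=None):
        # With no id, step every environment; otherwise only the listed ones.
        ids = list(range(self.env_num)) if id is None else list(id)
        assert len(actions) == len(ids), "one action per stepped environment"
        results = []
        for env_id, action in zip(ids, actions):
            obs, rew, terminated, truncated, info = self.envs[env_id].step(action)
            results.append((obs, rew, terminated, truncated, dict(info, env_id=env_id)))
        # Transpose the per-env tuples into five batched lists.
        return tuple(list(column) for column in zip(*results))


toy = ToyVectorEnv([CounterEnv for _ in range(5)])
toy.reset()
print(toy.step([1, 1, 1, 1, 1])[4])          # info for all five environments
print(toy.step([1, 0, 1], id=[0, 3, 1])[0])  # observations for envs 0, 3, 1 only
```

In this sketch the results come back in the same order as the ids passed in, which matches the env_id order in the output above.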

Further Reading#

Other environment wrappers in Tianshou#

  • ShmemVectorEnv: based on SubprocVectorEnv, but uses shared memory instead of pipes;

  • RayVectorEnv: uses Ray for concurrency and is currently the only choice for parallel simulation in a cluster with multiple machines.

Check the documentation for details.

Difference between synchronous and asynchronous mode (How to choose?)#

An explanation can be found in the Parallel Sampling tutorial.