[bug] EOF error with gym.vector.AsyncVectorEnv() when calling the step method. #3281

Open
1 task done
jng164 opened this issue Jul 8, 2024 · 3 comments

jng164 commented Jul 8, 2024

Describe the bug
The code suddenly raises an EOFError when calling the step method after about 12M steps of training.

Code example
I am using gym.vector.AsyncVectorEnv(). I use the function make_env to create my environments.

def make_env(gym_id, seed, idx, capture_video, run_name, qubits, depth):
    
    def thunk():
        env = gym.make(gym_id, qubits=qubits, depth=depth, env_id=idx)
        env = gym.wrappers.RecordEpisodeStatistics(env)
        if capture_video and idx == 0:
            env = gym.wrappers.RecordVideo(env, f"videos/{run_name}")
        return env

    return thunk

The main part of the code is as follows:

if __name__ == "__main__":
    mp.set_start_method('spawn')
    device = torch.device("cuda" if torch.cuda.is_available() and args.cuda else "cpu")
    envs = gym.vector.AsyncVectorEnv(
        [make_env(args.gym_id, args.seed + i, i, args.capture_video, run_name, qubits, depth) for i in range(args.num_envs)],
        shared_memory=False,
    )
    agent = AgentGNN(envs, device).to(device)  # Graph Neural Network
    for update in range(1, num_updates + 1):
        for step in range(args.num_steps):  
            global_step += 1 * args.num_envs
            dones[step] = next_done
            try:
                with torch.no_grad():
                    action, logprob, _, value, logits, action_ids = agent.get_action_and_value(next_obs_graph, device=device)
                    values[step] = value.flatten()
                actions[step] = action
                logprobs[step] = logprob
                
                next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy()) 
            except TypeError as e:
                print(f"Error: {e}")
            rewards[step] = torch.tensor(reward).to(device).view(-1)

            next_done = torch.Tensor(done).to(device)

As far as I understand the error, this code spawns as many worker processes as there are environments. In one particular worker, the call to env.step() fails. As you can see, I tried to handle this with a try-except, but it does not work. I think this may be because the worker just hangs until it breaks, but I am not sure.
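
One likely reason the try-except has no effect: the traceback in this report ends in EOFError, and EOFError is not a subclass of TypeError, so `except TypeError` never matches it. The sketch below reproduces that behaviour without gym. It assumes that closing one end of a `multiprocessing.Pipe` is a fair stand-in for a worker process that has exited; `caught_by` is a hypothetical helper name, not part of gym.

```python
import multiprocessing as mp


def caught_by(handler_exc):
    """Report whether `except handler_exc` catches the error raised by recv()."""
    parent_end, worker_end = mp.Pipe()
    # Assumption: closing the worker's end mimics a worker process that died.
    worker_end.close()
    try:
        try:
            parent_end.recv()  # raises EOFError because the peer is gone
        except handler_exc:
            return "handled"
    except EOFError:
        # The inner handler did not match, so the error propagated outward.
        return "escaped"
    finally:
        parent_end.close()


print(caught_by(TypeError))
print(caught_by(EOFError))
```

Catching EOFError would only stop the crash in the main loop. The worker is still gone at that point, so the vector environment would probably need to be rebuilt before training can continue.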

Traceback

Traceback (most recent call last):
  File "/home/jriu/Copt-cquere/rl-zx/ppo.py", line 204, in <module>
    next_obs, reward, done, deprecated, info = envs.step(action_ids.cpu().numpy())
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 137, in step
    return self.step_wait()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
    result, success = pipe.recv()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
    buf = self._recv_bytes()
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
    buf = self._recv(4)
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
    raise EOFError
EOFError
/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py:457: UserWarning: WARN: Calling `close` while waiting for a pending call to `step` to complete.
Exception ignored in: <function AsyncVectorEnv.__del__ at 0x7ea18eb856c0>
Traceback (most recent call last):
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 546, in __del__
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/vector_env.py", line 205, in close
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 461, in close_extras
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/site-packages/gym/vector/async_vector_env.py", line 320, in step_wait
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 250, in recv
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 414, in _recv_bytes
  File "/home/jriu/anaconda3/envs/cquere/lib/python3.10/multiprocessing/connection.py", line 383, in _recv
EOFError: 
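
The traceback above only shows the parent process blocked in `pipe.recv()`, not what happened inside the worker. One way to narrow this down might be to look at each worker's exit code after the failure: a negative `multiprocessing.Process.exitcode` means the process was terminated by a signal, which would point to something like an out-of-memory kill or a crash in native code rather than a Python exception. The helper below is a minimal sketch. `describe_exit` is a hypothetical name, and the commented usage assumes the vector environment exposes its workers as `envs.processes`.

```python
import signal


def describe_exit(exitcode):
    """Turn a multiprocessing.Process.exitcode into a readable hint."""
    if exitcode is None:
        return "still running"
    if exitcode == 0:
        return "exited cleanly"
    if exitcode < 0:
        # A negative exit code -N means the process was killed by signal N.
        try:
            name = signal.Signals(-exitcode).name
        except ValueError:
            name = f"signal {-exitcode}"
        return f"killed by {name}"
    return f"exited with status {exitcode}"


# Hypothetical usage after catching EOFError from envs.step(...),
# assuming the workers are available as `envs.processes`:
#
#     for idx, proc in enumerate(envs.processes):
#         print(idx, describe_exit(proc.exitcode))
```

If a worker turns out to have been killed by SIGKILL, the kernel log would be a reasonable next place to check for out-of-memory messages.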

System Info
I use gym 0.26.2, torch 2.0.1 and python 3.10.14. I am using Ubuntu 24.04 LTS. All of the packages were installed using pip.


Checklist

  • I have checked that there is no similar issue in the repo (required)
@Fengwenhao01

I have the same problem.

@antoniopioricciardi

Same, with gym 0.29.1, torch 2.0.1, python 3.9.18 and Pop!_OS (an Ubuntu distro) 22.04.

No issue running with SyncVectorEnv, or running Async on Mac. My python environment is installed via uv pip.

@w1463442883

w1463442883 commented Jan 13, 2025 via email
