
Debug Memory Leak in Autogen #4893

Open
Leon0402 opened this issue Jan 4, 2025 · 2 comments

Leon0402 (Contributor) commented Jan 4, 2025

> @Leon0402 Can you show where your runtime is created? This might be due to the runtime not removing references to created agents.
>
> To mitigate, you might want to create new instances of the runtime for each task.
>
> I think we should handle it in a separate PR.

_Originally posted by @ekzhu in https://github.com/microsoft/autogen/issues/4885#issuecomment-2571434115_

Thanks @ekzhu, you could be right about that. Possibly some interplay with gather(); I read something pointing in that direction. I am currently trying to reproduce it in a smaller setup.

What do you mean by runtime? My TaskRunner? That is basically just:

class TaskRunner:
    def __init__(self, cfg: Config):
        self._cfg = cfg

    async def run_agent(self, sample: TaskSample, output_dir: Path):
        # define agents here
        # run chat
        # save results to some file
        ...

I do not store anything on the object itself, so my assumption was that the agents should get cleaned up once run_agent returns.
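
If by runtime you mean the agent runtime from autogen-core: I never create one explicitly, but I guess the suggested mitigation would look roughly like this. The `SingleThreadedAgentRuntime` name and the `start()`/`stop()` calls are my assumption about the API, not something I currently have in my code:

```python
# Sketch of the "new runtime per task" mitigation.
# NOTE: SingleThreadedAgentRuntime and start()/stop() are assumed API names.
from autogen_core import SingleThreadedAgentRuntime

async def run_agent(self, sample: TaskSample, output_dir: Path):
    runtime = SingleThreadedAgentRuntime()  # fresh runtime, scoped to this task
    runtime.start()
    try:
        ...  # register agents against `runtime`, run the chat, save results
    finally:
        await runtime.stop()  # let the runtime and its agent references die here
```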

ekzhu (Collaborator) commented Jan 4, 2025

Thanks for creating the issue. To isolate the cause, try a simple setup without the Jupyter code executor first, then add the Jupyter executor running a simple piece of code and compare the difference.
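
Something along these lines (standard-library tracemalloc, nothing autogen-specific) could be wrapped around both setups to make the comparison concrete:

```python
import tracemalloc

tracemalloc.start()

# ... run the simple setup here, once without and once with the Jupyter executor ...

snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics("lineno")[:10]:
    print(stat)  # top allocation sites still holding memory after the run
```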

ekzhu added this to the 0.4.1 milestone on Jan 4, 2025
Leon0402 (Contributor, Author) commented Jan 6, 2025

Here is a memory chart of my long-running task:
[image: memory usage chart over the long-running task]

So yeah, not great :D Debugging and isolating the cause is not easy, but I think I was able to get something useful.

[image: memory snapshot after one full iteration of run_task]

This is after one full iteration of run_task, i.e. at the HERE marker in the for loop below, where everything should already have been cleaned up by the gather.

async def run_task(cfg: Config, task: TaskType):
    ...

    semaphore = asyncio.Semaphore(cfg.concurrency_limit)

    async def run_single_sample(task_runner: TaskRunner, task_sample: TaskSample):
        async with semaphore:
            await task_runner.run_agent(task_sample, cfg.output_dir / task.value)

    task_runner = TaskRunner(cfg)
    samples = [run_single_sample(task_runner, task_sample) for task_sample in sliced_samples]
    await tqdm.gather(*samples, desc=f"Task: {task.value}")


async def run_tasks(cfg: Config):
    for task in cfg.tasks:
        await run_task(cfg, task)
        # Added an asyncio.sleep(60) here to be sure
        # --> HERE (the snapshot above was taken at this point)
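
For anyone who wants to poke at the same spot: something like this (only gc and tracemalloc from the standard library; the helper name is mine) can be dropped in at the HERE marker to see which objects survive a full task:

```python
import gc
import tracemalloc

# call tracemalloc.start(25) once at program start

def dump_leftovers() -> None:
    """Show what is still alive after run_task has finished."""
    gc.collect()  # force a collection so only genuinely reachable objects remain

    # Top allocation sites that still hold memory.
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics("traceback")[:5]:
        print(stat)

    # Live object counts per class name, to spot lingering agents.
    counts: dict[str, int] = {}
    for obj in gc.get_objects():
        name = type(obj).__name__
        counts[name] = counts.get(name, 0) + 1
    for name, count in sorted(counts.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{count:8d}  {name}")
```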

Looking at my debug output, it seems:

I am not too familiar with the whole async stuff yet, but that line looks a little bit shady to me. I know `__del__` has some heavy caveats in Python.
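
For context on those caveats, a toy example (nothing autogen-specific): a finalizer cannot await anything, and for objects caught in a reference cycle it only runs once the cycle collector gets around to them, so cleanup tied to `__del__` can be delayed arbitrarily:

```python
import asyncio
import gc

class Leaky:
    def __del__(self):
        # Cannot await anything here; timing is up to the garbage collector.
        print("finalized")

async def main():
    a, b = Leaky(), Leaky()
    a.partner, b.partner = b, a  # reference cycle keeps both objects alive
    del a, b
    await asyncio.sleep(0)       # nothing printed yet: refcounting alone cannot free the cycle
    gc.collect()                 # only now do both __del__ methods fire

asyncio.run(main())
```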

Any thoughts on this?

Edit: I cannot reliably reproduce this, so maybe my theory is wrong here :(
