
Possible memory leak #1409

Open
sergii-mamedov opened this issue Nov 22, 2024 · 1 comment

@sergii-mamedov

Hi @JosepSampe,

I think I found two memory leaks in lithops:

  1. FaaSInvoker and self.executor
    You create a ThreadPoolExecutor

    self.executor = ThreadPoolExecutor(invoke_pool_threads)

    and use it here
    future = self.executor.submit(self._invoke_task, job, call_ids_range)

    but you never call self.executor.shutdown(), so the worker threads keep running.
    In our case, the worker that uses lithops is a daemon. Because we reuse an existing ServerlessExecutor, every new dataset processed on AWS Lambda creates 64 new threads, which adds up to hundreds of lingering threads after a couple of hours.

  2. FaaSInvoker and _start_async_invokers
    This is much less of a problem, but it also creates two new threads each time an executor is created

    lithops/lithops/invokers.py

    Lines 332 to 336 in c4ecc74

    for inv_id in range(self.ASYNC_INVOKERS):
        p = threading.Thread(target=invoker_process, args=(inv_id,))
        self.invokers.append(p)
        p.daemon = True
        p.start()

    Yes, there is a stop() method below, but it is only called automatically for the FaaSRemoteInvoker class. If I create executor = lithops.ServerlessExecutor(config=lithops_config, ...), then stop() is only called when the executor is used as a context manager:
    def __exit__(self, exc_type, exc_value, traceback):
        """ Context manager method """
        self.job_monitor.stop()
        self.invoker.stop()
        self.compute_handler.clear()

    I'm not insisting that this is a memory leak, but even the documentation has many examples where FunctionExecutor is used without with.
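The first pattern is easy to reproduce with a few lines of stdlib Python (a minimal sketch, not Lithops code; the names here are illustrative): each "invoker" keeps a reference to its pool, like self.executor on FaaSInvoker, so the idle worker threads never exit until shutdown() is called.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# References kept alive, as self.executor is on the invoker object
pools = []

def new_invoker_pool(threads=4):
    ex = ThreadPoolExecutor(max_workers=threads)
    for _ in range(threads):
        ex.submit(lambda: None)   # work items spawn worker threads
    pools.append(ex)              # reference kept -> threads stay alive
    return ex

baseline = threading.active_count()
new_invoker_pool()
new_invoker_pool()
leaked = threading.active_count() - baseline
print(f"extra threads while pools are referenced: {leaked}")  # at least 2

# The fix: shut each pool down once the invoker is done with it.
for ex in pools:
    ex.shutdown(wait=True)        # joins the worker threads
print(threading.active_count() - baseline)  # 0
```

Every call to new_invoker_pool() adds worker threads that idle on the pool's work queue; only shutdown(wait=True) joins them and returns the thread count to the baseline.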

P.S. Now the main developer of METASPACE is @lmacielvieira, so I'm tagging him as well @Bisho2122


JosepSampe commented Jan 11, 2025

Hi @sergii-mamedov @lmacielvieira

After extensive testing of Lithops, I’ve identified the following findings:

The self.executor = ThreadPoolExecutor(invoke_pool_threads) pool is initialized once, creating a fixed-size pool of reusable threads. These threads persist throughout the execution, but the pool does not grow at any time.

The self.invoker.stop() method is used to stop the async invokers. However, as you pointed out, it is only invoked when using the context manager. Despite this, the thread pool created in the async invokers remains fixed in size, as in the previous case.
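The async-invoker lifecycle can be modeled with a small stdlib sketch (illustrative names, not the actual Lithops internals): daemon threads that poll until stop() is called, which is what __exit__ does via self.invoker.stop().

```python
import threading

class AsyncInvokers:
    ASYNC_INVOKERS = 2

    def __init__(self):
        self._stop_event = threading.Event()
        self.invokers = []
        for inv_id in range(self.ASYNC_INVOKERS):
            p = threading.Thread(target=self._invoker_process, args=(inv_id,))
            p.daemon = True
            self.invokers.append(p)
            p.start()

    def _invoker_process(self, inv_id):
        # Wake up periodically; a real invoker would drain a job queue here.
        while not self._stop_event.wait(timeout=0.01):
            pass

    def stop(self):
        self._stop_event.set()
        for p in self.invokers:
            p.join()

baseline = threading.active_count()
inv = AsyncInvokers()
running = threading.active_count() - baseline
print(running)  # 2 invoker threads
inv.stop()      # never reached when no `with` block triggers __exit__
print(threading.active_count() - baseline)  # 0
```

If stop() is never called, the two daemon threads linger for the life of the process; they do not multiply on their own, but every new executor adds another pair.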

After several experiments running map() operations for hours, I verified that the number of threads does not grow without bound. However, I found that it can grow to more than 500 because of this hardcoded line:

with ThreadPoolExecutor(max_workers=250) as executor:
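A configuration-driven bound, instead of the hardcoded 250, could look like the sketch below (the invoke_pool_threads key and helper name are hypothetical, not the actual change in the PR): cap the pool by the configured size and by the number of tasks, so a small job never spawns hundreds of idle workers.

```python
from concurrent.futures import ThreadPoolExecutor

def bounded_pool_size(config, n_tasks, default=64, ceiling=250):
    # Never exceed the configured size, the task count, or the old ceiling
    configured = config.get('invoke_pool_threads', default)
    return max(1, min(configured, n_tasks, ceiling))

print(bounded_pool_size({}, 3))                             # 3
print(bounded_pool_size({'invoke_pool_threads': 8}, 1000))  # 8

with ThreadPoolExecutor(max_workers=bounded_pool_size({}, 3)) as executor:
    results = list(executor.map(lambda x: x * 2, range(3)))
print(results)  # [0, 2, 4]
```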

For the tests I used the py-spy tool, for example with: py-spy top --pid $(pgrep -f "python3 examples/map.py")

To improve this, I created PR #1414 with fixes that substantially reduce the number of threads, so that the thread count in the async invokers depends on your configuration.

I also verified that, in the other components of the code where we create threads, they are properly shut down.
