Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Celery + gevent error #5344

Closed
johnnymetz opened this issue Mar 19, 2023 · 23 comments
Closed

Celery + gevent error #5344

johnnymetz opened this issue Mar 19, 2023 · 23 comments

Comments

@johnnymetz
Copy link

johnnymetz commented Mar 19, 2023

Summary of problem

Getting an error when running ddtrace, celery and gevent.

Perhaps I need to add the following code to the celery app (per these instructions)?

from ddtrace import patch
patch(gevent=True)

Potentially related issues:

Which version of dd-trace-py are you using?

1.9.3

Looks like 1.9.3 includes some gevent changes (see here) so may be a side effect of that.

Which version of pip are you using?

23.0.1

Which libraries and their versions are you using?

Python 3.11.2

`pip freeze` amqp==5.1.1 arrow==1.2.3 asgiref==3.6.0 astroid==2.15.0 async-timeout==4.0.2 attrs==22.2.0 bandit==1.7.5 beautifulsoup4==4.11.2 billiard==3.6.4.0 blinker==1.5 boto3==1.26.92 botocore==1.29.92 Brotli==1.0.9 bytecode==0.14.0 cattrs==22.2.0 celery==5.2.7 certifi==2022.12.7 cffi==1.15.1 charset-normalizer==3.1.0 chevron==0.14.0 click==8.1.3 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.2.0 ConfigArgParse==1.5.3 coverage==7.2.1 cryptography==39.0.2 cytoolz==0.12.1 ddsketch==2.0.4 ddtrace==1.9.3 deepdiff==6.2.3 deepmerge==1.1.0 defusedxml==0.7.1 dill==0.3.6 dj-database-url==1.2.0 Django==4.1.7 django-colorfield==0.8.0 django-cors-headers==3.14.0 django-cprofile-middleware==1.0.5 django-csp==3.7 django-db-geventpool==4.0.1 django-environ==0.10.0 django-extensions==3.2.1 django-filter==22.1 django-redis==5.2.0 django-sendgrid-v5==1.2.2 django-sesame==3.1 django-storages==1.13.2 django-timezone-field==5.0 django-waffle==3.0.0 django-webpack-loader==1.8.1 django-zen-queries==2.1.0 django_debug_toolbar==3.8.1 djangorestframework==3.14.0 djangorestframework-csv==2.1.1 djangorestframework-dataclasses==1.2.0 djangorestframework-link-header-pagination==0.1.1 envier==0.4.0 et-xmlfile==1.1.0 execnet==1.9.0 factory-boy==3.2.1 Faker==17.6.0 Flask==2.2.3 Flask-BasicAuth==0.2.0 Flask-Cors==3.0.10 fqdn==1.5.1 genson==1.2.2 gevent==22.10.2 geventhttpclient==2.0.8 gitdb==4.0.10 GitPython==3.1.31 greenlet==2.0.2 gunicorn==20.1.0 idna==3.4 iniconfig==2.0.0 isoduration==20.11.0 isort==5.12.0 itsdangerous==2.1.2 Jinja2==3.1.2 jmespath==1.0.1 jsonpointer==2.3 jsonschema==4.17.3 kombu==5.2.4 lazy-object-proxy==1.9.0 locust==2.15.1 markdown-it-py==2.2.0 MarkupSafe==2.1.2 mccabe==0.7.0 mdurl==0.1.2 mixpanel==4.9.0 moto==4.1.4 msgpack==1.0.4 networkx==3.0 newrelic==8.7.0 nplusone==1.0.0 oauthlib==3.2.2 openpyxl==3.1.2 ordered-set==4.1.0 orjson==3.8.7 packaging==23.0 pbr==5.11.1 Pillow==9.4.0 platformdirs==3.1.0 pluggy==1.0.0 prompt-toolkit==3.0.38 protobuf==4.22.1 psutil==5.9.4 psycogreen==1.0.2 psycopg2==2.9.5 py-mini-racer==0.6.0 pycparser==2.21 Pygments==2.14.0 PyJWT==2.6.0 pylint==2.17.0 pylint-django==2.5.3 pylint-plugin-utils==0.7 pyrsistent==0.19.3 pytest==7.2.2 pytest-cov==4.0.0 pytest-django==4.5.2 pytest-socket==0.6.0 pytest-xdist==3.2.1 python-dateutil==2.8.2 python-http-client==3.3.7 python3-openid==3.2.0 pytz==2022.1 PyYAML==6.0 pyzmq==25.0.0 redis==4.5.1 requests==2.28.2 requests-oauthlib==1.3.1 responses==0.23.1 rfc3339-validator==0.1.4 rfc3986-validator==0.1.1 rich==13.3.2 rollbar==0.16.3 roundrobin==0.0.4 s3transfer==0.6.0 sendgrid==6.9.6 six==1.16.0 smmap==5.0.0 social-auth-app-django==5.0.0 social-auth-core==4.4.0 soupsieve==2.4 sq-native==1.1.2.0.0 sqlparse==0.4.3 sqreen==1.28.0b1 starkbank-ecdsa==2.2.0 stevedore==5.0.0 tenacity==8.2.2 time-machine==2.9.0 toml==0.10.2 tomlkit==0.11.6 toolz==0.12.0 types-PyYAML==6.0.12.8 types-toml==0.10.8.5 typing_extensions==4.5.0 tzdata==2022.7 unicodecsv==0.14.1 uri-template==1.2.0 uritemplate==4.1.1 urllib3==1.26.14 vine==5.0.0 watchdog==2.3.1 wcwidth==0.2.6 webcolors==1.12 Werkzeug==2.2.3 whitenoise==6.4.0 wrapt==1.15.0 xmltodict==0.13.0 zope.event==4.6 zope.interface==5.5.2

How can we reproduce your problem?

Run celery with the gevent pool:

ddtrace-run celery --app=remix.celery.app worker --pool=gevent --concurrency=25 --loglevel=info

Then run a celery task that involves writing a file to S3 (using boto3).

What is the result that you get?

[2023-03-16 14:46:14,840: ERROR/MainProcess] Signal handler <function trace_failure at 0x7f0e362ad6c0> raised: AttributeError("'_Code' object has no attribute 'co_positions'")
Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.11/site-packages/celery/app/trace.py", line 451, in trace_task
    R = retval = fun(*args, **kwargs)
                 ^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/celery/app/trace.py", line 734, in __protected_call__
    return self.run(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/store/tasks/store_tasks.py", line 55, in export_databases_task
    job.data.save(filename, ContentFile(file_or_bytes))
  File "/app/.heroku/python/lib/python3.11/site-packages/django/db/models/fields/files.py", line 92, in save
    self.name = self.storage.save(name, content, max_length=self.field.max_length)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/django/core/files/storage.py", line 56, in save
    name = self._save(name, content)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/storages/backends/s3boto3.py", line 444, in _save
    obj = self.bucket.Object(name)
          ^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/storages/backends/s3boto3.py", line 389, in bucket
    self._bucket = self.connection.Bucket(self.bucket_name)
                   ^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/storages/backends/s3boto3.py", line 356, in connection
    self._connections.connection = session.resource(
                                   ^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/boto3/session.py", line 446, in resource
    client = self.client(
             ^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/session.py", line 976, in create_client
    client = client_creator.create_client(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/client.py", line 155, in create_client
    client_args = self._get_client_args(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/client.py", line 485, in _get_client_args
    return args_creator.get_client_args(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/args.py", line 129, in get_client_args
    endpoint = endpoint_creator.create_endpoint(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/endpoint.py", line 409, in create_endpoint
    http_session = http_session_cls(
                   ^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/httpsession.py", line 313, in __init__
    self._manager = PoolManager(**self._get_pool_manager_kwargs())
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/httpsession.py", line 331, in _get_pool_manager_kwargs
    'ssl_context': self._get_ssl_context(),
                   ^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/httpsession.py", line 340, in _get_ssl_context
    return create_urllib3_context()
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/botocore/httpsession.py", line 129, in create_urllib3_context
    context.options |= options
    ^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/ssl.py", line 624, in options
    super(SSLContext, SSLContext).options.__set__(self, value)
  File "/app/.heroku/python/lib/python3.11/ssl.py", line 624, in options
    super(SSLContext, SSLContext).options.__set__(self, value)
  File "/app/.heroku/python/lib/python3.11/ssl.py", line 624, in options
    super(SSLContext, SSLContext).options.__set__(self, value)
  [Previous line repeated 469 more times]
RecursionError: maximum recursion depth exceeded

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/app/.heroku/python/lib/python3.11/site-packages/celery/utils/dispatch/signal.py", line 276, in send
    response = receiver(signal=self, sender=sender, **named)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/site-packages/ddtrace/contrib/celery/signals.py", line 177, in trace_failure
    span.set_exc_info(ex.type, ex.exception, ex.tb)
  File "/app/.heroku/python/lib/python3.11/site-packages/ddtrace/span.py", line 446, in set_exc_info
    traceback.print_exception(exc_type, exc_val, exc_tb, file=buff, limit=20)
  File "/app/.heroku/python/lib/python3.11/traceback.py", line 124, in print_exception
    te = TracebackException(type(value), value, tb, limit=limit, compact=True)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/traceback.py", line 690, in __init__
    self.stack = StackSummary._extract_from_extended_frame_gen(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/traceback.py", line 416, in _extract_from_extended_frame_gen
    for f, (lineno, end_lineno, colno, end_colno) in frame_gen:
  File "/app/.heroku/python/lib/python3.11/traceback.py", line 353, in _walk_tb_with_full_positions
    positions = _get_code_position(tb.tb_frame.f_code, tb.tb_lasti)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.11/traceback.py", line 366, in _get_code_position
    positions_gen = code.co_positions()
                    ^^^^^^^^^^^^^^^^^
AttributeError: '_Code' object has no attribute 'co_positions'

Note, I'm only seeing this error in Heroku. Been unable to reproduce it locally.

The AttributeError: '_Code' object has no attribute 'co_positions' part of the exception is likely related to ddtrace not being completely compatible with Celery + Python 3.11 yet - see #4149

What is the result that you expected?

Celery task is able to run successfully.

@P403n1x87
Copy link
Contributor

Does the issue occur without ddtrace? It looks like the original exception would happen regardless, probably because of gevent monkey patching either not happening, or happening too late.

@johnnymetz
Copy link
Author

johnnymetz commented Mar 20, 2023

Confirmed the issue only happens with ddtrace

@emmettbutler
Copy link
Collaborator

@johnnymetz a likely workaround for this crash is setting the DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=0 environment variable.

@johnnymetz
Copy link
Author

@emmettbutler that fixed it! Are there any downsides to DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=0. The docs suggest it's fine.

@emmettbutler
Copy link
Collaborator

Glad to hear it @johnnymetz! That flag being 0 basically turns off a bugfix, so if it makes your app work as expected, I'd say it's fine to leave it in.

@johnnymetz
Copy link
Author

Great, sounds good to me

@johnnymetz
Copy link
Author

DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=0 didn't actually fix the error. I tested it incorrectly.

Any other ideas? Looks like gevent isn't being monkey patched correctly: https://stackoverflow.com/questions/49820173/recursionerror-maximum-recursion-depth-exceeded-from-ssl-py-supersslcontex

@johnnymetz johnnymetz reopened this Mar 22, 2023
@emmettbutler
Copy link
Collaborator

@johnnymetz one thing that would be helpful to understand is what the latest version of ddtracepy that doesn't cause this error is.

@emmettbutler
Copy link
Collaborator

emmettbutler commented Mar 22, 2023

@johnnymetz another important point to note is that as of 1.9.3 the ddtrace library doesn't call gevent.monkey.patch_all(), so the user has to make sure they're calling it as needed by the application. I assume in this case celery is handling that for you, but figured I'd mention it because I'm not certain of that.

You might also try making a little script like this

from importlib.util import find_spec
print(getattr(find_spec("gevent"), "loader", None))

And then running it in the same environment you're seeing this error in

ddtrace-run python testscript.py

If that prints None, I'd expect DD_UNLOAD_MODULES_FROM_SITECUSTOMIZE=1 to fix your issue. If not, I'm still unsure.

@johnnymetz
Copy link
Author

one thing that would be helpful to understand is what the latest version of ddtracepy that doesn't cause this error is.

I've tried the following ddtrace versions and seeing the error in all of them:

1.9.4
1.9.3
1.9.2
1.8.0

I assume in this case celery is handling that for you

Looks like Celery is handling this for us: https://github.com/celery/celery/blob/a80da3965fefcf9c7638c0a264314cd194a71d1f/celery/__init__.py#L109

@emmettbutler
Copy link
Collaborator

Oh wow, the fact that you're seeing the same issue in 1.8.0 changes my thinking on this quite a bit. This appears to be a completely different error than anything we've fixed recently related to gevent. Good to know it's not a new regression in 1.9.3, at least. Thanks for being so responsive @johnnymetz.

@johnnymetz
Copy link
Author

Just tested this without gevent and it works fine:

ddtrace-run celery --app=remix.celery.app worker

So this appears to be a gevent bug.

Just submitted another gevent + boto bug which may be related: #5398

@P403n1x87
Copy link
Contributor

@johnnymetz thank you very much for all the testing that you've done so far! Having seen the linked issue #5398 I'm starting to suspect that the problem might stem from the ssl module and the way it interacts with ddtrace when gevent does its monkey-patching. I have already started investigating and I'm planning to give the demo linked in #5398 a try. In the meantime, if you would like to help us more, a couple of things to try that might be helpful to us are:

  • test an older version of ddtrace (e.g. ddtrace==1.8.0) and ensure that DD_GEVENT_PATCH_ALL is set to 1 or true.
  • test 1.9.3 but with DD_TRACE_ENABLED=0 (the goal here is to make sure that there are no ddtrace worker threads running, so we want to disable tracing, profiling etc...) to be sure only the application has access to the ssl stack.

For this particular issue, another thing that might be worth testing, if possible, is using an older version of Python too, e.g. 3.10, and ddtrace=1.9.3.

@P403n1x87
Copy link
Contributor

@johnnymetz thanks again for all the testing thus far! Do you reckon you would be able to put a small reproducer together? I'm currently having not much luck in reproducing this issue 🙁

@johnnymetz
Copy link
Author

johnnymetz commented Mar 31, 2023

@P403n1x87 in an effort to reproduce this bug in a barebones project, I've run into a separate bug that is blocking me: #5453

May be related to this issue.

@P403n1x87
Copy link
Contributor

@johnnymetz many thanks for helping us with this. There is a good chance that the two issues are related. An indication of this would be If the error goes away when disabling the tracer with DD_TRACE_ENABLED=0.

@johnnymetz
Copy link
Author

@P403n1x87 I just confirmed DD_TRACE_ENABLED=0 fixes the issue.

Let me know if this is enough to work off of or if you still need me to generate a reproducer project. If you do need the reproducer, I can work on that once #5453 is resolved.

@P403n1x87
Copy link
Contributor

@johnnymetz thanks for confirming that. Thanks to your other reproducer I think we've got a lead for now. It seems that Celery imports some Django modules (via kombu), which trigger our Django monkey-patching for the instrumentation. This in turn causes some modules to be loaded too early (and this probably includes ssl too). Once that is resolved I would expect this issue to also be fixed 🤞 .

@johnnymetz
Copy link
Author

@P403n1x87 looks like DD_TRACE_ENABLED=0 fixes the RecursionError bug but I'm not seeing any celery data in Datadog

@P403n1x87
Copy link
Contributor

That's to be expected. DD_TRACE_ENABLED=0 is disabling the tracer, so traces won't be collected. This is not a fix unfortunately, only evidence that the tracer integrations are likely responsible for loading the ssl module too early. This should hopefully be fixed by #5455.

@johnnymetz
Copy link
Author

Hi @P403n1x87 any update on this?

@P403n1x87
Copy link
Contributor

@johnnymetz we're still working on this. The draft PR with the potential fix still needs a bit of ironing out as we are seeing some issues with the traces generated during tests, so we're trying to make sure we still produce the same data as before. Thanks for your patience with this and apologies for the long wait.

@johnnymetz
Copy link
Author

Confirmed this is fixed in 1.12.6. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants