Multi-Node Cluster Hangs on Zentity 1.6.1 Requests #56
Here's the local docker cluster profiled by VisualVM after around 20 alternating setup then delete index requests. I'm not very experienced in profiling Java apps, but it looks like each request allocates quite a bit on the heap and leaves it behind, and then perhaps the "hanging" happens during garbage collection. Something about going from ~100MB to ~350MB for those requests feels funny, like something is hanging around.
@austince I haven't seen this error message in Elasticsearch before:
I found that the message originates from a method that Elasticsearch nodes use to publish cluster state. Here's the documented behavior for cluster state publication. The first thing I would try is to rule out the plugins: if you remove the zentity plugin and then repeatedly create/remove indices or perform other similar operations that zentity performs, do you still see the behavior?
I was not able to reproduce this when testing on a local docker cluster without the plugin installed, nor on a cluster with the plugin installed but not used; in those cases I tried to trigger the issue by sending repeated create/delete index requests. I created a small "stress test" that can reproduce the issue, which is in the attached zip.

Here's the profiling when running that: the first spike is during non-zentity requests, and the more drawn-out spikes are from repeated Zentity setup (and then delete zentity model index) requests. The hangs seem to roughly line up with the 30s timeout.

It also looks like the issue gums up the TransportService as well, which I think controls the threads that the plugin handlers run on? I'm not really sure, though; I'm just going by some logged thread names. If that is the service responsible for publishing cluster state changes, it would tie in to the timeouts we're seeing in that area.
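Conceptually, the loop is something like the sketch below (an approximation rather than the attached scripts; it assumes zentity's `POST _zentity/_setup` endpoint and the `.zentity-models` index, so adjust if those differ):

```java
// Approximate stress-test loop: alternate zentity setup and model-index delete requests.
// Endpoint and index names are assumptions; the attached scripts may differ.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ZentityStressTest {
    public static void main(String[] args) throws Exception {
        String es = args.length > 0 ? args[0] : "http://localhost:9200";
        HttpClient client = HttpClient.newHttpClient();
        for (int i = 0; i < 20; i++) {
            // Create the zentity model index via the setup endpoint.
            HttpRequest setup = HttpRequest.newBuilder(URI.create(es + "/_zentity/_setup"))
                    .POST(HttpRequest.BodyPublishers.noBody())
                    .build();
            System.out.println("setup: " + client.send(setup, HttpResponse.BodyHandlers.ofString()).statusCode());

            // Delete the model index, then repeat.
            HttpRequest delete = HttpRequest.newBuilder(URI.create(es + "/.zentity-models"))
                    .DELETE()
                    .build();
            System.out.println("delete: " + client.send(delete, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }
}
```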
Thanks @austince, this is good to know about. Looking at the contents of your linked zip, I'll have to dig in and see what the issue might be. Thanks for sharing the stress test package!
I narrowed down a couple things after using your stress test scripts:
I don't see anything in the Elasticsearch 7.9.0 release notes that explains the change in behavior. The notes list a known memory leak ("manifests in Elasticsearch when a single document is updated repeatedly with a forced refresh"), but the issue we observe occurs during any operation from a plugin. I suspect something changed either in the behavior of the Elasticsearch Java APIs that zentity uses (e.g. BaseRestHandler, RestChannelConsumer, or NodeClient), or in how the Elasticsearch server handles plugins.
Awesome, thanks for digging in, Dave. We'll test out ES 7.9.0 in Kubernetes on our end as well -- that's a good enough workaround for now. What's the best way to look into the potential changes on the ES plugin handling side? Should I open up a discussion on discuss.elastic.co?
I'm going to approach this from a few angles:
Creating the sample plugin would be a good thing to try before reaching out to discuss.elastic.co or the elasticsearch repo, because it could help us further isolate the issue and frame a more actionable question. It would also give the Elasticsearch team something they can use to reproduce the issue easily, if needed.
@austince I was able to reproduce the issue on a simpler plugin. I've submitted a bug report to the Elasticsearch team with a modified version of your stress test. I labeled it as a "bug" because our usage of the NodeClient has worked with plugins since at least Elasticsearch 5.x, and the expected behavior stopped working after 7.9.0 without an explanation in the release notes. Let's see what the team responds with.
Update from the bug report: the issue appears to be rooted in the use of blocking calls in zentity's handlers (i.e. wherever a handler blocks while waiting for the response to a client request). So the next question is: what's the right way for a plugin to await the responses of its requests to Elasticsearch before returning a single response to the user? I submitted this question on the discussion board to see if other plugin authors can offer guidance.
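For context, here's a minimal, hypothetical sketch of the kind of blocking handler the bug report describes; this is not zentity's actual code, and the class, route, and index names are made up:

```java
// Hypothetical example only: a REST handler that blocks on a NodeClient call.
// Names (_example/_setup, example-models) are illustrative, not zentity's.
import java.util.List;

import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.client.node.NodeClient;
import org.elasticsearch.rest.BaseRestHandler;
import org.elasticsearch.rest.BytesRestResponse;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.RestStatus;

public class BlockingSetupHandler extends BaseRestHandler {

    @Override
    public String getName() {
        return "blocking_setup_handler";
    }

    @Override
    public List<Route> routes() {
        return List.of(new Route(RestRequest.Method.POST, "/_example/_setup"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) {
        return channel -> {
            // Blocking call: .get() parks the calling thread until the index is created.
            // If that thread is a transport thread, work like cluster state publication
            // can back up behind it, which matches the hangs described in this issue.
            CreateIndexResponse response = client.admin().indices()
                    .prepareCreate("example-models")
                    .get();
            channel.sendResponse(new BytesRestResponse(RestStatus.OK,
                    "acknowledged: " + response.isAcknowledged()));
        };
    }
}
```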
The question on the discussion board now has an answer with some guidance. This issue will continue to be my priority until it's resolved and released in the next patch version 1.6.2 (release plan). I've updated the website to include a warning label for this known issue to help people avoid choosing a version that will lead to cluster instability. |
Ah, that makes a lot of sense. That seems like a very reasonable solution as well. Thanks so much for these findings.
I pushed my first commit to the new branch.
I haven't yet attempted to implement async actions in the Job class, which I believe has the only other blocking call in the plugin. The blocking method is defined here, and it's called many times here. This is going to take some careful thought. @austince maybe you could take a look, too?
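To illustrate why this needs care, here's a rough, hypothetical sketch of the shape of the problem (not the actual Job code; names and structure are made up): each round of searching depends on the hits returned by the previous round, so the blocking call can't simply be fired and forgotten.

```java
// Hypothetical sketch of a sequentially dependent, blocking resolution loop.
// Not zentity's actual Job implementation; names are illustrative.
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

public class BlockingResolutionLoop {

    private static final int MAX_ROUNDS = 10;

    private final NodeClient client;

    public BlockingResolutionLoop(NodeClient client) {
        this.client = client;
    }

    public void run(String index) {
        boolean newHitsFound = true;
        for (int round = 0; round < MAX_ROUNDS && newHitsFound; round++) {
            // Blocking search: .get() waits for the response, because the next
            // query can only be built from the hits of this one.
            SearchResponse response = client.prepareSearch(index).get();
            newHitsFound = response.getHits().getHits().length > 0;
            // ...build the next round's query from these hits (omitted)...
        }
    }
}
```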
Looks great! I'll take a read through tomorrow and see if I can think of any useful notes. Update: sorry, I haven't had time to look into this yet; busy end of the week, but I hope to have time tomorrow.
Pushed a second commit to the branch. Now that I'm more familiar with the async APIs in Elasticsearch, and I'm satisfied with the current state of async in the REST handlers, I'll move on to the more challenging async implementation in the Job class.
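For reference, the async REST handler approach looks roughly like this sketch: the handler returns immediately and sends its response from an ActionListener callback instead of blocking on `.get()`. This is my own illustration with made-up names, not the actual commit.

```java
// Hypothetical example only: the same handler as the blocking sketch above, reworked
// to respond from an ActionListener callback. Names are illustrative, not zentity's.
import java.util.List;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.client.node.NodeClient;
import org.elasticsearch.rest.BaseRestHandler;
import org.elasticsearch.rest.BytesRestResponse;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.RestStatus;

public class AsyncSetupHandler extends BaseRestHandler {

    @Override
    public String getName() {
        return "async_setup_handler";
    }

    @Override
    public List<Route> routes() {
        return List.of(new Route(RestRequest.Method.POST, "/_example/_setup"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest request, NodeClient client) {
        // Return immediately; the client invokes the listener when the work completes,
        // so no transport thread is parked waiting on the result.
        return channel -> client.admin().indices()
                .prepareCreate("example-models")
                .execute(ActionListener.wrap(
                        response -> channel.sendResponse(new BytesRestResponse(RestStatus.OK,
                                "acknowledged: " + response.isAcknowledged())),
                        e -> channel.sendResponse(new BytesRestResponse(
                                RestStatus.INTERNAL_SERVER_ERROR, String.valueOf(e)))));
    }
}
```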
Pushed a commit to the branch. Things are looking promising (async pun?):
The solution was much simpler than I anticipated. Thanks to this recent blog post for showing a pattern that uses CompletableFuture in Elasticsearch. That saved me from refactoring the entire Job class. I almost want to rework my prior commits (the REST handlers) to use this pattern. Not a big priority, though. They work and we need to ship out this fix soon.
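For reference, here's a minimal sketch of that kind of CompletableFuture bridge (my own illustration of the pattern with made-up names, not the exact code from the commit): an ActionListener completes the future, and the sequential logic can then be composed without blocking.

```java
// Illustrative sketch of bridging an ActionListener into a CompletableFuture.
// Not the actual zentity commit; names are made up.
import java.util.concurrent.CompletableFuture;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

public class FutureSearch {

    /**
     * Runs a search asynchronously and exposes the result as a CompletableFuture,
     * so dependent requests can be chained with thenCompose instead of blocking.
     */
    public static CompletableFuture<SearchResponse> searchAsync(NodeClient client, String index) {
        CompletableFuture<SearchResponse> future = new CompletableFuture<>();
        client.prepareSearch(index).execute(ActionListener.wrap(
                future::complete,              // success: complete the future with the response
                future::completeExceptionally  // failure: propagate the exception
        ));
        return future;
    }
}
```

Dependent searches can then be chained, e.g. `searchAsync(client, "a").thenCompose(first -> searchAsync(client, "b"))`, which keeps the existing sequential structure of the logic while leaving the waiting to the framework rather than the calling thread.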
I had some time to read through and give some notes on the commits, let me know what you think!
Maybe opening a draft PR for that branch would let us consolidate the conversation about the implementation?
++ Sounds good. It's ready at PR #67.
Merged to master |
Hey there, I'm currently experiencing an issue running Zentity 1.6.1 w/ Elasticsearch 7.10.1 inside a multi-node cluster, but not on a single-node cluster. When sending alternating setup/delete requests (as well as with other requests), it sometimes hangs and it looks like the Elasticsearch `CoordinatorPublication` gets gummed up. I can replicate this both in a local docker-compose setup (attached below) and in Kubernetes with an elastic-on-k8s cluster w/ 3 master and 3 data nodes.

Here are the logs from the docker-compose setup, where I've deleted then created the index, and the coordination hangs for 30+ seconds:
Do you think this originates in the plugin or in a misconfiguration of the clusters?
Docker Compose file
Elastic K8s manifest