
Multi-Node Cluster Hangs on Zentity 1.6.1 Requests #56

Closed
austince opened this issue Jan 22, 2021 · 19 comments
Labels: bug (Something isn't working), release:1.6.2


@austince
Contributor

austince commented Jan 22, 2021

Hey there, I'm currently experiencing an issue running Zentity 1.6.1 w/ Elasticsearch 7.10.1 in a multi-node cluster, but not in a single-node cluster. When sending alternating setup/delete requests (as well as other requests), the cluster sometimes hangs, and it looks like the Elasticsearch CoordinatorPublication gets gummed up. I can replicate this both in a local docker-compose setup (attached below) and in Kubernetes with an elastic-on-k8s cluster w/ 3 master and 3 data nodes.

Here are the logs from the docker-compose setup, where I've deleted then created the index, and the coordination hangs for 30+ seconds:

elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:16,893Z", "level": "INFO", "component": "o.e.c.m.MetadataDeleteIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "[.zentity-models/kCCUX_6bS3CZeDQzImGi2A] deleting index", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "deprecation", "timestamp": "2021-01-22T15:56:31,234Z", "level": "DEPRECATION", "component": "o.e.d.c.m.MetadataCreateIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "index name [.zentity-models] starts with a dot '.', in the next major version, index names starting with a dot are reserved for hidden indices and system indices", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:31,309Z", "level": "INFO", "component": "o.e.c.m.MetadataCreateIndexService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "[.zentity-models] creating index, cause [api], templates [], shards [1]/[1]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:56:41,313Z", "level": "INFO", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "docker-cluster", "node.name": "primary", "message": "after [10s] publication of cluster state version [928] is still waiting for {es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{btWNi8GkTJaAjVjbcQxe2g}{172.19.0.2}{172.19.0.2:9300}{dir} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }
elasticsearch    | {"type": "server", "timestamp": "2021-01-22T15:57:01,314Z", "level": "WARN", "component": "o.e.c.c.C.CoordinatorPublication", "cluster.name": "docker-cluster", "node.name": "primary", "message": "after [30s] publication of cluster state version [928] is still waiting for {es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{btWNi8GkTJaAjVjbcQxe2g}{172.19.0.2}{172.19.0.2:9300}{dir} [SENT_PUBLISH_REQUEST]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }

Do you think this originates in the plugin or in a misconfiguration of the clusters?

Docker Compose file
version: '3.7'

x-plugin-volume: &plugin-volume "./target/releases/:/plugins"

x-base-es: &base-es
  image: docker.elastic.co/elasticsearch/elasticsearch-oss:${ES_VERSION:-7.10.2}
  user: "elasticsearch"
  # install all plugins in mounted /plugin directory and start the elasticsearch server
  command:
    - /bin/bash
    - -c
    - elasticsearch-plugin install --batch https://zentity.io/releases/zentity-1.6.1-elasticsearch-7.10.2.zip && elasticsearch
  ulimits:
    nofile:
      soft: 65536
      hard: 65536
    memlock:
      soft: -1
      hard: -1
  environment: &base-env
    cluster.name: docker-cluster
    network.host: 0.0.0.0
    # minimum_master_nodes need to be explicitly set when bound on a public IP
    # set to 1 to allow single node clusters
    # Details: elastic/elasticsearch#17288
    discovery.zen.minimum_master_nodes: "1"
    # Reduce virtual memory requirements, see docker/for-win#5202 (comment)
    bootstrap.memory_lock: "false"
    ES_JAVA_OPTS: "-Xms512m -Xmx512m -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5050"
    http.cors.enabled: "true"
    http.cors.allow-origin: "*"
    cluster.initial_master_nodes: primary
  networks:
    - elastic

x-base-primary-node: &base-primary-node
  <<: *base-es
  environment:
    <<: *base-env
    node.name: primary
    node.master: "true"
    node.data: "false"
    node.ingest: "false"

x-base-data-node: &base-data-node
  <<: *base-es
  environment:
    <<: *base-env
    discovery.zen.ping.unicast.hosts: elasticsearch
    node.master: "false"
    node.data: "true"
    node.ingest: "true"

services:
  elasticsearch:
    <<: *base-primary-node
    hostname: elasticsearch
    container_name: elasticsearch
    volumes:
      - *plugin-volume
      - es-primary:/usr/share/elasticsearch/data
    ports:
      - "${ES_PORT:-9200}:9200" # http
      - "${DEBUGGER_PORT:-5050}:5050" # debugger

  es-data-1:
    <<: *base-data-node
    hostname: es-data-1
    container_name: es-data-1
    volumes:
      - *plugin-volume
      - es-data-1:/usr/share/elasticsearch/data
    ports:
      - "${DEBUGGER_PORT_DATA_1:-5051}:5050" # debugger

  es-data-2:
    <<: *base-data-node
    hostname: es-data-2
    container_name: es-data-2
    volumes:
      - *plugin-volume
      - es-data-2:/usr/share/elasticsearch/data
    ports:
      - "${DEBUGGER_PORT_DATA_2:-5052}:5050" # debugger

  kibana:
    image: docker.elastic.co/kibana/kibana-oss:${KIBANA_VERSION:-7.10.1}
    hostname: kibana
    container_name: kibana
    logging:
      driver: none
    environment:
      - server.host=0.0.0.0
      - server.name=kibana.local
      - elasticsearch.url=http://elasticsearch:9200
    ports:
      - '${KIBANA_PORT:-5601}:5601'
    networks:
      - elastic

volumes:
  es-primary:
    driver: local
  es-data-1:
    driver: local
  es-data-2:
    driver: local

networks:
  elastic:

Elastic K8s manifest
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  annotations:
    common.k8s.elastic.co/controller-version: 1.3.1
    elasticsearch.k8s.elastic.co/cluster-uuid: 8xDpRuE4T8ufu_KSJV4hFw
  creationTimestamp: "2021-01-20T17:28:29Z"
  generation: 4
  labels:
    app.kubernetes.io/instance: eck-entity-resolution
    app.kubernetes.io/managed-by: Tiller
    app.kubernetes.io/name: eck-entity-resolution
    app.kubernetes.io/part-of: eck
    app.kubernetes.io/version: 1.1.2
    helm.sh/chart: eck-entity-resolution-0.3.0
  name: eck-entity-resolution
  namespace: entity-resolution
  resourceVersion: "273469952"
  selfLink: /apis/elasticsearch.k8s.elastic.co/v1/namespaces/entity-resolution/elasticsearches/eck-entity-resolution
  uid: cff37de2-c6c3-4ebd-a230-e45f00bdc7e7
spec:
  auth:
    fileRealm:
    - secretName: eck-entity-resolution-users
    roles:
    - secretName: eck-entity-resolution-roles
  http:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
    tls:
      certificate: {}
      selfSignedCertificate:
        disabled: true
  nodeSets:
  - config:
      node.data: false
      node.ingest: false
      node.master: true
    count: 3
    name: primary-node
    podTemplate:
      spec:
        containers:
        - env:
          - name: ES_JAVA_OPTS
            value: -Xms500m -Xmx500m
          name: elasticsearch
          resources:
            limits:
              cpu: 1
              memory: 1Gi
            requests:
              cpu: 0.5
              memory: 1Gi
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch https://github.com/zentity-io/zentity/releases/download/zentity-1.6.1/zentity-1.6.1-elasticsearch-7.10.1.zip
          name: install-plugins
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 2Gi
        storageClassName: standard-expandable
  - config:
      node.data: true
      node.ingest: true
      node.master: false
    count: 3
    name: data-node
    podTemplate:
      containers:
      - env:
        - name: ES_JAVA_OPTS
          value: -Xms4g -Xmx4g -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=0.0.0.0:5005
        name: elasticsearch
        resources:
          limits:
            cpu: 2
            memory: 8Gi
          requests:
            cpu: 0.5
            memory: 8Gi
      spec:
        initContainers:
        - command:
          - sh
          - -c
          - |
            bin/elasticsearch-plugin install --batch https://github.com/zentity-io/zentity/releases/download/zentity-1.6.1/zentity-1.6.1-elasticsearch-7.10.1.zip
          name: install-plugins
        - command:
          - sh
          - -c
          - sysctl -w vm.max_map_count=262144
          name: sysctl
          securityContext:
            privileged: true
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 25Gi
        storageClassName: sdd-fast-expandable
  transport:
    service:
      metadata:
        creationTimestamp: null
      spec: {}
  updateStrategy:
    changeBudget: {}
  version: 7.10.1
status:
  availableNodes: 6
  health: green
  phase: Ready
  version: 7.10.1
@austince changed the title from "Cluster Hangs" to "Multi-Node Cluster Hangs on Zentity Requests" on Jan 22, 2021
@austince
Contributor Author

[Screenshot: visual-vm-heap-growth (VisualVM heap profile)]

Here's the local docker cluster profiled by VisualVM after around 20 alternating setup and delete-index requests. I'm not very experienced in profiling Java apps, but it looks like each request allocates quite a bit on the heap and leaves it behind, so perhaps the "hanging" happens during garbage collection. Going from ~100MB to ~350MB over those requests feels funny, like something is being retained.

@davemoore-
Member

@austince I haven't seen this error message in Elasticsearch before:

after [...] publication of cluster state version [...] is still waiting for ...

I found that the message originates from a method that Elasticsearch nodes use to publish cluster state. Here's the documented behavior for cluster state publication. The 10s and 30s durations reported by your log messages are really long, especially for a smaller cluster, and so my initial speculation is that this is caused by either a cluster misconfiguration or a bottleneck on the network or disk. I'd need to reproduce the issue and dig a bit to be sure.

The first thing I would try is to rule out the plugins. If you remove the zentity plugin and then repeatedly create/remove indices, or perform other operations similar to those zentity performs, do you still see the behavior?

@austince changed the title from "Multi-Node Cluster Hangs on Zentity Requests" to "Multi-Node Cluster Hangs on Zentity 1.6.1 Requests" on Jan 22, 2021
@austince
Contributor Author

austince commented Jan 22, 2021

I was not able to reproduce this when testing on a local docker cluster without the plugin installed, nor on a cluster with the plugin installed but not used. I tried to reproduce the issue by sending repeated create/delete index requests.

I created a small "stress test" that can reproduce the issue; it's in the attached zip (stress-test-cluster.sh), along with the docker-compose.yml file used for testing, in case the cluster misconfiguration lies there. I don't think the issue is in the local network or disk, but I could be wrong.

zentity-stress-test.zip

Here's the profiling when running that:

[Screenshot: zentity-stress-test (VisualVM profile during the stress test)]

The first spike is from the non-zentity requests, and the more drawn-out spikes are from repeated zentity setup (and then delete-model-index) requests. The hangs seem to line up roughly with the 30s publication timeout.

It looks like the issue gums up the TransportService as well, which I think controls the threads that the plugin handlers run on? I'm not really sure, though; I'm just going by some logged thread names. If that's the service responsible for publishing cluster state changes, it would tie in to the timeouts we're seeing in that area.

elasticsearch    | {"type": "server", "timestamp": "2021-01-22T21:43:36,620Z", "level": "WARN", "component": "o.e.t.TransportService", "cluster.name": "docker-cluster", "node.name": "primary", "message": "Received response for a request that has timed out, sent [19614ms] ago, timed out [4604ms] ago, action [cluster:monitor/nodes/stats[n]], node [{es-data-2}{Xjwq8qUrReyh5VUi21l3aQ}{6HLd44y9Qriva3OcIXukKg}{172.22.0.2}{172.22.0.2:9300}{dir}], id [4037]", "cluster.uuid": "Zi3JrTDvRkmyjizI6z-6QQ", "node.id": "eZpuNPEsRqKPl6bhvojRJQ"  }

@davemoore-
Member

Thanks @austince, this is good to know about. Looking at the contents of your linked stress-test-cluster.sh, the actions that produce this behavior are actions that change cluster state (e.g. adding and removing indices), which lines up with the log messages about cluster state publication timeouts. A search (e.g. resolution) shouldn't affect cluster state, so it would be good to confirm whether the issue appears for resolution jobs and model GETs.

I'll have to dig in and see what the issue might be. Thanks for sharing the stress test package!

@davemoore-
Member

I narrowed down a couple of things after running your stress test scripts:

  • The issue only occurs on Elasticsearch >= 7.9.0.
  • The issue occurs for any _zentity endpoint, including resolution requests which are read-only operations.
  • The issue occurs after only a half-dozen or so requests to any endpoint.

I don't see anything in the Elasticsearch 7.9.0 release notes that explains the change in behavior. It lists a known memory leak ("manifests in Elasticsearch when a single document is updated repeatedly with a forced refresh"), but the issue we observe occurs during any operation from a plugin. I suspect something changed in the behavior of the Elasticsearch Java APIs that zentity uses (e.g. BaseRestHandler, RestChannelConsumer, or NodeClient), or in the behavior of the Elasticsearch server and its handling of plugins.

@davemoore- added the bug (Something isn't working) label on Jan 24, 2021
@austince
Contributor Author

Awesome, thanks for digging in, Dave. We'll test out ES 7.9.0 in Kubernetes on our end as well -- that's a good enough workaround for now. What's the best way to look into the potential changes on the ES plugin handling side? Should I open up a discussion on discuss.elastic.co?

@davemoore-
Member

@austince

What's the best way to look into the potential changes on the ES plugin handling side?

I'm going to approach this from a few angles:

  • Creating a sample plugin whose only function is to create an index using the given NodeClient, to rule out other variables that zentity might be introducing, and seeing if the behavior still appears in the stress test.
  • Reviewing how official plugins use the NodeClient, to see if and how zentity uses the client differently.
  • Possibly introducing debug logging to help isolate the point of failure in zentity. This may be something I merge into the main plugin either way, as an outcome of the troubleshooting process.
  • Profiling the behavior of the plugin (e.g. using VisualVM), as you have been.

Creating the sample plugin would be a good thing to try before reaching out to discuss.elastic.co or the elasticsearch repo, because it could help us further isolate the issue and frame a more actionable question. It would also give us something they can use to reproduce the issue easily, if needed. A rough sketch of what such a plugin might look like is below.
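
To be concrete, here is a hypothetical sketch of what such a plugin's handler could look like. The class, route, and index names are invented for illustration; it assumes the standard 7.x BaseRestHandler and NodeClient APIs and mirrors the style of waiting on the client's response before replying.

import java.util.List;

import org.elasticsearch.action.admin.indices.create.CreateIndexRequest;
import org.elasticsearch.action.admin.indices.create.CreateIndexResponse;
import org.elasticsearch.client.node.NodeClient;
import org.elasticsearch.common.xcontent.XContentBuilder;
import org.elasticsearch.rest.BaseRestHandler;
import org.elasticsearch.rest.BytesRestResponse;
import org.elasticsearch.rest.RestRequest;
import org.elasticsearch.rest.RestStatus;

public class SampleSetupHandler extends BaseRestHandler {

    @Override
    public String getName() {
        return "sample_setup_handler";
    }

    @Override
    public List<Route> routes() {
        return List.of(new Route(RestRequest.Method.POST, "/_sample/_setup"));
    }

    @Override
    protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient client) {
        return channel -> {
            // Wait for the create-index call to complete before answering the REST request.
            CreateIndexResponse response = client.admin().indices()
                    .create(new CreateIndexRequest(".sample-index"))
                    .actionGet();
            XContentBuilder builder = channel.newBuilder()
                    .startObject()
                    .field("acknowledged", response.isAcknowledged())
                    .endObject();
            channel.sendResponse(new BytesRestResponse(RestStatus.OK, builder));
        };
    }
}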

@davemoore-
Member

davemoore- commented Jan 26, 2021

@austince I was able to reproduce the issue with a simpler plugin.

I've submitted a bug report to the elasticsearch team with a modified version of your stress test. I labeled it as a "bug" because our usage of the NodeClient has worked with plugins since at least Elasticsearch 5.x, and the expected behavior stopped working after 7.9.0 without an explanation in the release notes. Let's see what the team responds with.

@davemoore-
Member

Update from the bug report:

The issue appears to be rooted in the use of blocking calls in zentity's handlers (e.g. whenever the plugin calls get() on a request made through the NodeClient). Apparently this has always been an anti-pattern for plugins, and it would cause an integration test to fail (even in versions prior to 7.9) when assertions are enabled on the Elasticsearch cluster (i.e. setting -ea in the JVM options). I didn't know this assertions capability existed, so I never saw the issue. See more details in the bug report response.

So the next question is: what's the right way for a plugin to await the responses of its requests to Elasticsearch before returning a single response to the user? I submitted this question on the discussion board to see if other plugin authors can offer guidance. One possible non-blocking shape is sketched below.
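
The sketch below is my own assumption about one possible shape, not necessarily what the guidance will recommend: instead of waiting on the future, hand the client a listener bound to the REST channel. It is written as a drop-in replacement for the prepareRequest(...) method of the hypothetical handler sketched earlier.

// Non-blocking variant of the hypothetical handler's prepareRequest(...) (an assumption,
// not zentity's actual code): pass a channel-bound listener to the client and return
// immediately; Elasticsearch invokes the callback when the index has been created.
// Additional import assumed: org.elasticsearch.rest.action.RestToXContentListener
@Override
protected RestChannelConsumer prepareRequest(RestRequest restRequest, NodeClient client) {
    CreateIndexRequest createIndex = new CreateIndexRequest(".sample-index");
    return channel -> client.admin().indices()
            .create(createIndex, new RestToXContentListener<>(channel));
    // RestToXContentListener serializes the response to the channel and also routes
    // failures back to the caller as error responses, so nothing blocks a transport thread.
}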

@davemoore-
Member

The question on the discussion board now has an answer with some guidance.

This issue will continue to be my priority until it's resolved and released in the next patch version 1.6.2 (release plan).

I've updated the website to include a warning label for this known issue to help people avoid choosing a version that will lead to cluster instability.

@austince
Contributor Author

Ah, that makes a lot of sense. That seems like a very reasonable solution as well. Thanks so much for these findings.

@davemoore-
Member

davemoore- commented Jan 26, 2021

I pushed my first commit to the new feature-async-actions branch (compare to master).

  • Manual tests of SetupAction and ModelsAction show promise. I can set up zentity and manage entity models on Elasticsearch >= 7.9.0, which is progress.
  • There's some duplicate code in ResolutionAction. There's probably a better way to implement async there.
  • Not sure how I feel about the prolific use of anonymous functions, but it's working so far.

I haven't yet attempted to implement async actions in Job, which I believe has the only other blocking call in the plugin. The blocking method is defined here, and it's called many times here. This is going to take some careful thought; a rough illustration of the general blocking shape is sketched below. @austince maybe you could take a look, too?
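
To illustrate why this needs careful thought, here's a purely hypothetical sketch of the blocking shape (invented names, not the actual Job code): each search must complete before the next step can even be built, so the calls can't simply all be fired off at once.

// Hypothetical sketch of the blocking shape described above; names are invented.
// Assumes org.elasticsearch.action.search.SearchRequest/SearchResponse and
// org.elasticsearch.client.node.NodeClient. Each iteration blocks the calling thread
// until Elasticsearch responds.
static void resolveAllHops(NodeClient client, List<SearchRequest> hops) {
    for (SearchRequest hop : hops) {
        SearchResponse response = client.search(hop).actionGet(); // blocking wait
        // ... derive the next hop's queries from `response` ...
    }
}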

@austince
Contributor Author

austince commented Jan 26, 2021

Looks great! I'll take a read through tomorrow and see if I can think of any useful notes.

Update: sorry I haven't had time to look into this yet; it's been a busy end of the week, but I hope to have time tomorrow.

@davemoore-
Member

davemoore- commented Jan 28, 2021

Pushed a second commit to the feature-async-actions branch (diff). Refactored the async code to achieve proper exception handling, which I verified through manual tests. Also started to implement debug logging.
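
For anyone following along, the kind of exception handling I mean looks roughly like this: route failures from the async callback back to the REST channel instead of letting them escape. This is a sketch against the 7.x ActionListener.wrap API, continuing the hypothetical handler from earlier, and not the exact code in the commit.

// Inside the hypothetical handler's prepareRequest(...); additional imports assumed:
// java.io.IOException, org.elasticsearch.action.ActionListener.
return channel -> client.admin().indices().create(createIndex, ActionListener.wrap(
        response -> {
            XContentBuilder builder = channel.newBuilder()
                    .startObject()
                    .field("acknowledged", response.isAcknowledged())
                    .endObject();
            channel.sendResponse(new BytesRestResponse(RestStatus.OK, builder));
        },
        e -> {
            try {
                // Serialize the exception back to the caller as an error response.
                channel.sendResponse(new BytesRestResponse(channel, e));
            } catch (IOException inner) {
                e.addSuppressed(inner); // nothing more to do if even the error response fails
            }
        }
));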

Now that I'm more familiar with the async APIs in Elasticsearch, and satisfied with the current state of async in the REST handlers, I'll move on to the more challenging async implementation in the Job class.

@davemoore-
Member

Pushed a commit to the feature-async-actions branch (diff), which replaces blocking search calls with async search calls in the Job class.

Things are looking promising (async pun?):

  • Integration tests are passing
  • Manual tests are passing with assertions enabled in the Elasticsearch JVM
  • Performance of resolution jobs is at least as good as before

The solution was much simpler than I anticipated. Thanks to this recent blog post for showing a pattern that uses CompletableFuture in Elasticsearch (the general shape is sketched below). That saved me from refactoring the entire Job class. I almost want to rework my prior commits (the REST handlers) to use this pattern. It's not a big priority, though; they work, and we need to ship this fix out soon.
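
For reference, the general shape of that pattern is to bridge the client's callback API into a CompletableFuture so dependent searches can be chained without blocking a thread. Here is a minimal sketch assuming the 7.x NodeClient and ActionListener APIs; the class and method names are invented, and this isn't necessarily the exact code in the commit.

import java.util.concurrent.CompletableFuture;

import org.elasticsearch.action.ActionListener;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.node.NodeClient;

final class AsyncSearchSketch {

    // Bridge the callback-style client API to a CompletableFuture.
    static CompletableFuture<SearchResponse> searchAsync(NodeClient client, SearchRequest request) {
        CompletableFuture<SearchResponse> future = new CompletableFuture<>();
        client.search(request, ActionListener.wrap(future::complete, future::completeExceptionally));
        return future;
    }

    // Dependent searches can then be chained with thenCompose instead of blocking;
    // in a real job, `second` would be built from `firstResponse`.
    static CompletableFuture<SearchResponse> twoHops(NodeClient client,
                                                     SearchRequest first,
                                                     SearchRequest second) {
        return searchAsync(client, first)
                .thenCompose(firstResponse -> searchAsync(client, second));
    }
}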

@austince
Contributor Author

I had some time to read through and leave some notes on the commits; let me know what you think!

@austince
Contributor Author

Maybe opening a draft PR for that branch would let us consolidate the conversation about the implementation?

@davemoore-
Member

++ Sounds good. It's ready at PR #67.

@davemoore-
Member

Merged to master
