Consul streaming leaking rpc connections #17369
Comments
When we stop sending blocking queries to a Consul client, it takes 20 minutes for the streaming materialized view (submatview) caches to be gradually evicted, and only then do we see the gRPC connections being released.
@weichuliu, thanks for taking the time to investigate this issue and for the deep analysis. We will look into it ASAP.
Thanks @huikang, I'm looking forward to the fix. This one is pretty annoying for us, and I believe it's annoying for anyone who uses the streaming feature.
Hi @weichuliu, I cannot reproduce this on the latest Consul or on Consul 1.15.2. I created watches after each rebalancing and also waited for the rebalancing to return to the original server. I used the following configuration on Kubernetes:

Could you provide the simplest configuration that you used for the reproduction steps?
I didn't use k8s for testing so I can't debug your deployment. Here is how I ran it on my local machine (M1 MacBook, but that shouldn't matter here).

The consul I use:

To start the cluster, I ran these (each line in a separate terminal):

```
consul agent -server -config-file=./consul1.hcl
consul agent -server -config-file=./consul2.hcl
consul agent -server -config-file=./consul3.hcl
consul agent -config-file=./client.hcl
```

To catch the client metrics on UDP 8125, I personally use datadog-mock:

```
$ cd src/datadog-mock
$ go build
$ ./datadog-mock | grep grpc.client.connections
```

I then registered 100 services with the client. Note that the Consul client listens on the default localhost:8500, so:

```
for i in $(seq 100); do consul services register -name test$i; done
```

As the last step, I ran this Python script (it spawns a new `consul watch` against a random test service every 20 seconds; Ctrl-C stops the watches and exits):

```python
import time
import signal
import subprocess
import random

consul_subprocesses = []

def signal_handler(sig, frame):
    # On Ctrl-C, forward SIGINT to every spawned `consul watch` before exiting.
    for p in consul_subprocesses:
        p.send_signal(signal.SIGINT)
    time.sleep(1)
    exit()

signal.signal(signal.SIGINT, signal_handler)

while True:
    time.sleep(20)
    print(f'{time.ctime()} new watch')
    # Start another long-lived watch against a random registered service.
    p = subprocess.Popen(['consul', 'watch', '-type=service', f'-service=test{random.randint(1,100)}', 'cat'])
    consul_subprocesses.append(p)
```

After 20 minutes, here are the metrics:
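Since the `.hcl` files referenced above aren't included in the thread, here is a rough sketch of what a minimal client configuration for this kind of local test might look like. The option names are standard Consul agent settings; the file name and values are hypothetical and are not the actual `client.hcl` from this report.

```sh
# Hypothetical minimal client config (not the reporter's actual client.hcl).
# use_streaming_backend defaults to true on 1.10+, but is spelled out for clarity;
# dogstatsd_addr points at the datadog-mock listener on UDP 8125.
cat > client-example.hcl <<'EOF'
node_name  = "client-1"
data_dir   = "/tmp/consul-client"
retry_join = ["127.0.0.1"]

use_streaming_backend = true

telemetry {
  dogstatsd_addr = "127.0.0.1:8125"
}
EOF

consul agent -config-file=./client-example.hcl
```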
I admit that when I tried to follow the steps I wrote at the beginning, I couldn't reproduce it either (the connections hit 3 at most); I forget which step I missed there. However, with the brute-force approach of spawning a new consul watch every 20 seconds, plus many services, and letting it run for a while, the issue looks reproducible on 1.15.2.
@jmurret It's been a while since you asked about reproducing this issue. I wonder if you have been able to reproduce it with the steps above?
Just like this one: #22045
Overview of the Issue
The Consul client is leaking gRPC connections (one every several minutes) when serving streaming requests.
On 1.10+, with the streaming backend (`use_streaming_backend`) enabled, the Consul client agent handles blocking queries with streaming: it maintains a cache and establishes a "subscription" to changes over gRPC to the Consul servers.
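For context, a typical blocking query that the streaming backend serves might look like the sketch below; `test1` and the `index` value are illustrative placeholders (the index would come from the `X-Consul-Index` header of a previous response), not values from the report.

```sh
# Blocking health query against the local client agent. With the streaming backend
# enabled, the agent answers this from its materialized view, which is fed by a
# gRPC subscription to the servers, instead of forwarding a blocking query upstream.
curl -s 'http://localhost:8500/v1/health/service/test1?index=42&wait=5m'
```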
In our deployment, our Consul client agents are constantly receiving a lot of blocking queries. All subscriptions are supposed to share a single gRPC connection. However, we see `consul.grpc.client.connections` growing over time.

In the TRACE log, we found that the gRPC router manager constantly (every several minutes) refreshes the gRPC connection, i.e. creates a new gRPC connection to replace the old one. During such an event, `consul.grpc.client.connections` usually goes from `x` to `x+1` and back to `x`, which is the new connection replacing the old one.

However, when there are active blocking queries, the old subscriptions keep holding the old gRPC connection and prevent it from closing, so `consul.grpc.client.connections` goes from `x` to `x+1` and keeps growing. As long as the old subscription is still valid, in other words as long as the corresponding cache has not been evicted, the old gRPC connection is held open.

In our deployment, a single client agent can easily have more than 300 gRPC connections.
Reproduction Steps
1. Run a Consul cluster and client. I tested `1.14.6` and `1.15.2`, and both reproduce the issue.
2. Enable `telemetry` for the Consul client, and monitor the `consul.grpc.client.connections` metric (see the sketch after this list for reading the gauge without a statsd sink).
3. Run `consul monitor -log-level trace | grep -i channel` on the client side to catch the gRPC rebalance events.
4. Register some test services: `for i in $(seq 100); do consul services register -name test$i; done`.
5. When you see a `Subchannel Connectivity change to READY` TRACE log, start a new watch: `consul watch -type=service -service=test1 cat`.
6. Start another `consul watch` after a gRPC rebalance has happened.
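If a statsd sink like datadog-mock isn't handy, the same gauge can also be read from the agent's HTTP API. This is a rough sketch, assuming the client's HTTP API is on the default `localhost:8500` and `jq` is installed; it is not part of the original reproduction steps.

```sh
# Poll the agent metrics endpoint and print the gRPC client connection gauge(s).
while true; do
  curl -s http://localhost:8500/v1/agent/metrics |
    jq -r '.Gauges[] | select(.Name | contains("grpc.client.connections")) | "\(.Name) \(.Value)"'
  sleep 10
done
```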
Expected behavior

The old gRPC connection should be released and closed, and the overall `grpc.client.connections` should always be low.

Consul info for both Client and Server
As long as Consul server/client support streaming.
Operating system and Environment details
This bug is OS agnostic.