Slow shape transaction processing causes ReplicationClient to lose connection #2372

Open
msfstef opened this issue Feb 25, 2025 · 1 comment

msfstef commented Feb 25, 2025

Transactions are processed from the ReplicationClient via a GenServer.call to the ShapeLogCollector, which fans the transaction out to the shape consumers, waits for them to process it, and only then returns:

OpenTelemetry.with_span(
  "pg_txn.replication_client.transaction_received",
  [
    num_changes: txn.num_changes,
    num_relations: MapSet.size(txn.affected_relations),
    xid: txn.xid
  ],
  stack_id,
  fn -> apply(m, f, [txn | args]) end
)

If any of the shape consumers is slow to process the transaction, e.g. because of IO, then during that time the ReplicationClient is blocked on this call and cannot reply to PG's keep-alive messages.

This can lead to cases where the slow processing still returns successfully, but by the time we try to reply on the connection an ssl send: closed (or similar) connection error occurs. That crashes the ReplicationClient, which restarts and re-processes the same transaction once the connection is re-established.

This seems to happen with SSL connections, and my hypothesis is that because keep-alives are not being replied to, the connection has closed by the time we actually try to acknowledge the transaction that we have processed.

You can reproduce this with an SSL-enabled DB and an artificial slowdown of storage processing, like a Process.sleep(10_000) in the consumer process. The SSL connection seems to die within 5-10 seconds, and setting timeouts didn't seem to change this.
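
For reference, the artificial slowdown I mean looks roughly like this (an illustrative sketch, not the actual consumer code; handle_call/3 and write_to_storage/2 stand in for whatever callback the consumer uses to apply a transaction):

# Inside the shape consumer process: simulate slow storage IO so that the
# ReplicationClient stays blocked on its GenServer.call to the collector.
def handle_call({:handle_txn, txn}, _from, state) do
  Process.sleep(10_000)                  # artificial 10s slowdown standing in for slow IO
  state = write_to_storage(txn, state)   # hypothetical storage write
  {:reply, :ok, state}
end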

Thankfully the processing of transactions is (or should be) idempotent, since transactions are indexed by their LSN/offset, although I don't know how that plays with compaction.

My suggestion is to make transaction processing asynchronous, so that we can reply to keep-alives while a transaction is being processed, and defer/ignore subsequent transactions until the previous one is done processing. I'm not sure if the latter part is easily done.
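
To make that a bit more concrete, here is a rough sketch of the hand-off I have in mind (module, function and message names are illustrative, not the real API): the ReplicationClient casts the transaction and keeps its receive loop free for keep-alives, and only advances the acknowledged LSN once the collector reports that processing has finished.

defmodule ShapeLogCollectorSketch do
  use GenServer

  @impl true
  def init(state), do: {:ok, state}

  # Non-blocking hand-off: the ReplicationClient returns to its loop right after
  # the cast, so it can keep answering PG's keep-alive messages.
  def process_async(server, txn, reply_to) do
    GenServer.cast(server, {:process, txn, reply_to})
  end

  @impl true
  def handle_cast({:process, txn, reply_to}, state) do
    # The potentially slow fan-out to the shape consumers (storage IO etc.) now
    # runs without blocking the ReplicationClient.
    fan_out_to_consumers(txn)
    # Only once processing is done do we notify the client, so it can advance
    # the LSN it acknowledges back to Postgres.
    send(reply_to, {:txn_processed, txn.lsn})
    {:noreply, state}
  end

  defp fan_out_to_consumers(_txn), do: :ok   # placeholder for the real fan-out
end

Since a GenServer processes its mailbox sequentially, subsequent casts are naturally deferred until the previous transaction is handled; what this sketch doesn't give us is backpressure towards Postgres while that queue grows.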


msfstef commented Feb 25, 2025

After testing a little, it looks like Postgres will keep sending transactions regardless of the acknowledged LSN, so I think the current behaviour is probably fine: the connection restarts and resumes from the last acknowledged message.

My only concern is whether we can always guarantee idempotency of writes, even with compaction. Alternatively, we could filter out already-written transactions via some sort of stored global LSN (I think we already keep one).
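
For the alternative, something along these lines is what I have in mind (a rough sketch; Storage.last_processed_lsn/0, Storage.store_last_processed_lsn/1 and Lsn.compare/2 are hypothetical names, not an existing API):

# Skip transactions we already know are fully written, so replaying the same
# transaction after a reconnect becomes a no-op even where storage writes turn
# out not to be idempotent.
def maybe_process(txn) do
  last_lsn = Storage.last_processed_lsn()

  if Lsn.compare(txn.lsn, last_lsn) == :gt do
    :ok = process_transaction(txn)
    Storage.store_last_processed_lsn(txn.lsn)
  else
    :already_processed
  end
end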
