Transactions are processed from the `ReplicationClient` via a `GenServer.call` to the `ShapeLogCollector`, which fans the transaction out to the shape consumers for processing and then returns.
If any shape consumer is slow to process the transaction, e.g. because of IO, the `ReplicationClient` is blocked on this call for that whole time and cannot reply to PG's keep-alive messages.
As a result, even when the slow processing eventually returns successfully, the subsequent attempt to reply on the connection can fail with `ssl send: closed` or a similar connection error. This crashes the `ReplicationClient`, which restarts and re-processes the same transaction once the connection is re-established.
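To make the blocking concrete, here is a minimal sketch, not the actual Electric code. `SlowCollector`, `BlockingClient`, `BlockingDemo` and the `{:txn, lsn}` / `{:keepalive, from, sent_at}` messages are hypothetical stand-ins; only `GenServer` and `Process.sleep` are real APIs. It shows that a keep-alive queued behind a synchronous call is only answered after the call returns.

```elixir
defmodule SlowCollector do
  # Hypothetical stand-in for the ShapeLogCollector:
  # handling a transaction takes `delay` milliseconds.
  use GenServer

  def start_link(delay), do: GenServer.start_link(__MODULE__, delay)

  @impl true
  def init(delay), do: {:ok, delay}

  @impl true
  def handle_call({:txn, _lsn}, _from, delay) do
    Process.sleep(delay)
    {:reply, :ok, delay}
  end
end

defmodule BlockingClient do
  # Hypothetical stand-in for the ReplicationClient.
  use GenServer

  def start_link(collector), do: GenServer.start_link(__MODULE__, collector)

  @impl true
  def init(collector), do: {:ok, collector}

  @impl true
  def handle_info({:txn, lsn}, collector) do
    # This process handles nothing else until the collector replies.
    :ok = GenServer.call(collector, {:txn, lsn}, :infinity)
    {:noreply, collector}
  end

  def handle_info({:keepalive, from, sent_at}, collector) do
    # Report how long the keep-alive waited in the mailbox.
    waited = System.monotonic_time(:millisecond) - sent_at
    send(from, {:keepalive_reply, waited})
    {:noreply, collector}
  end
end

defmodule BlockingDemo do
  # Sends a transaction followed by a keep-alive and returns the
  # keep-alive latency in milliseconds (or :timeout).
  def run(delay \\ 300) do
    {:ok, collector} = SlowCollector.start_link(delay)
    {:ok, client} = BlockingClient.start_link(collector)

    send(client, {:txn, 1})
    send(client, {:keepalive, self(), System.monotonic_time(:millisecond)})

    receive do
      {:keepalive_reply, waited} -> waited
    after
      delay * 10 -> :timeout
    end
  end
end
```

Calling `BlockingDemo.run(300)` should return a latency close to the collector's delay, since the keep-alive cannot be handled until the `GenServer.call` has returned.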
This seems to happen with SSL connections, and my hypothesis is that because keep-alives are not being replied to, the connection has closed by the time we actually try to acknowledge the transaction that we have processed.
You can reproduce this with an SSL-enabled DB and an artificial slowdown of storage processing, like a `Process.sleep(10000)` in the consumer process. The SSL connection seems to die within 5-10 seconds, and setting timeouts didn't seem to change this.
Thankfully the processing of transactions is/should be idempotent as they are indexed on their LSN/offset, although I don't know how that plays with compaction.
My suggestion is to make the transaction processing asynchronous, such that we can reply to keep-alives while the transaction is being processed, while deferring/ignoring subsequent transactions until the previous one is done processing. Not sure if the latter part is easily done.
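A rough sketch of what that could look like, under stated assumptions: `AsyncClient`, its message shapes and the `process_fun` callback are hypothetical and not taken from the Electric codebase. The transaction is handed to a `Task`, so the client process stays free to answer keep-alives, and later transactions are buffered until the in-flight one finishes. `process_fun` is assumed to return `:ok`.

```elixir
defmodule AsyncClient do
  # Hypothetical sketch of a replication client that does not block
  # while a transaction is being processed.
  use GenServer

  def start_link(process_fun), do: GenServer.start_link(__MODULE__, process_fun)

  @impl true
  def init(process_fun) do
    {:ok, %{process_fun: process_fun, in_flight: nil, pending: :queue.new(), acked: []}}
  end

  @impl true
  def handle_info({:txn, lsn}, %{in_flight: nil} = state) do
    # Nothing in flight: start processing right away.
    {:noreply, start_processing(lsn, state)}
  end

  def handle_info({:txn, lsn}, state) do
    # A transaction is already being processed: defer this one.
    {:noreply, %{state | pending: :queue.in(lsn, state.pending)}}
  end

  def handle_info({:keepalive, from}, state) do
    # Answered immediately, even while a transaction is in flight.
    send(from, :keepalive_reply)
    {:noreply, state}
  end

  def handle_info({ref, :ok}, %{in_flight: {ref, lsn}} = state) do
    # The task finished; drop its monitor and record the acknowledgement.
    Process.demonitor(ref, [:flush])
    state = %{state | in_flight: nil, acked: [lsn | state.acked]}

    case :queue.out(state.pending) do
      {{:value, next}, rest} ->
        {:noreply, start_processing(next, %{state | pending: rest})}

      {:empty, _} ->
        {:noreply, state}
    end
  end

  @impl true
  def handle_call(:acked, _from, state) do
    {:reply, Enum.reverse(state.acked), state}
  end

  defp start_processing(lsn, state) do
    # Task.async links the task to this process, so a crash in
    # process_fun also takes the client down.
    %Task{ref: ref} = Task.async(fn -> state.process_fun.(lsn) end)
    %{state | in_flight: {ref, lsn}}
  end
end
```

In this sketch the deferred transactions are simply held in an in-memory queue. Whether buffering like that is acceptable for a real replication stream, or whether back-pressure is needed instead, is exactly the open question above.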
After testing a little bit, it looks like Postgres will keep sending transactions regardless of the acknowledged LSN, so I think the current behaviour is probably fine: the connection restarts and resumes from the last acknowledged message.
My only concern is whether we can always guarantee idempotency of writes, even with compaction. Alternatively, we could filter out already-written transactions via some sort of stored global LSN (I think we have one already).
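The filtering idea could be sketched roughly as follows. This is an illustration only: `LsnFilter`, the `%{lsn: ...}` transaction shape and `write_fun` are hypothetical, and LSNs are assumed to be plain, monotonically increasing integers, which may not match how the sync service actually represents them.

```elixir
defmodule LsnFilter do
  # Hypothetical helper. `last_written` is the highest LSN known to be
  # durably stored; anything at or below it is treated as a replay.
  def apply_txn(last_written, %{lsn: lsn} = txn, write_fun)
      when is_integer(last_written) and is_integer(lsn) do
    if lsn <= last_written do
      # Already written before the restart: skip instead of re-writing.
      {:skipped, last_written}
    else
      :ok = write_fun.(txn)
      {:written, lsn}
    end
  end
end
```

With a filter like this, a transaction replayed after a reconnect would be dropped before it reaches storage, so correctness would no longer depend on the writes themselves being idempotent.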
Code referenced in the issue description: `electric/packages/sync-service/lib/electric/postgres/replication_client.ex`, lines 284 to 293 in 4ab04c4.