Slow shape transaction processing causes ReplicationClient to lose connection #2372

Open
msfstef opened this issue Feb 25, 2025 · 1 comment

msfstef commented Feb 25, 2025

Transactions are processed from the ReplicationClient via a GenServer.call to the ShapeLogCollector, which fans the transaction out to the shape consumers, waits for them to process it, and only then returns:

OpenTelemetry.with_span(
  "pg_txn.replication_client.transaction_received",
  [
    num_changes: txn.num_changes,
    num_relations: MapSet.size(txn.affected_relations),
    xid: txn.xid
  ],
  stack_id,
  fn -> apply(m, f, [txn | args]) end
)

If any of the shape consumers is slow to process the transaction, e.g. because of IO, then during that time the ReplicationClient is blocked on this call and cannot reply to PG's keep-alive messages.

This can lead to cases where the slow processing still returns successfully, but by the time we try to reply on the connection an ssl send: closed (or similar) connection error occurs. That crashes the ReplicationClient, which restarts and re-processes the same transaction once the connection is re-established.

This seems to happen with SSL connections, and my hypothesis is that because keep-alives are not being replied to, the connection has closed by the time we actually try to acknowledge the transaction that we have processed.

You can reproduce this with an SSL-enabled DB and an artificial slowdown of storage processing, like a Process.sleep(10_000) in the consumer process. The SSL connection seems to die within 5-10 seconds, and setting timeouts didn't seem to change this.
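
For reference, the artificial slowdown I mean looks roughly like this (an illustrative sketch, not the actual consumer code; handle_call/3 and write_to_storage/2 stand in for whatever callback the consumer uses to apply a transaction):

# Inside the shape consumer process: simulate slow storage IO so that the
# ReplicationClient stays blocked on its GenServer.call to the collector.
def handle_call({:handle_txn, txn}, _from, state) do
  Process.sleep(10_000)                  # artificial 10s slowdown standing in for slow IO
  state = write_to_storage(txn, state)   # hypothetical storage write
  {:reply, :ok, state}
end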

Thankfully the processing of transactions is (or should be) idempotent, since transactions are indexed by their LSN/offset, although I don't know how that plays with compaction.

My suggestion is to make transaction processing asynchronous, so that we can reply to keep-alives while a transaction is being processed, and defer/ignore subsequent transactions until the previous one is done processing. I'm not sure if the latter part is easily done.
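
To make that a bit more concrete, here is a rough sketch of the hand-off I have in mind (module, function and message names are illustrative, not the real API): the ReplicationClient casts the transaction and keeps its receive loop free for keep-alives, and only advances the acknowledged LSN once the collector reports that processing has finished.

defmodule ShapeLogCollectorSketch do
  use GenServer

  @impl true
  def init(state), do: {:ok, state}

  # Non-blocking hand-off: the ReplicationClient returns to its loop right after
  # the cast, so it can keep answering PG's keep-alive messages.
  def process_async(server, txn, reply_to) do
    GenServer.cast(server, {:process, txn, reply_to})
  end

  @impl true
  def handle_cast({:process, txn, reply_to}, state) do
    # The potentially slow fan-out to the shape consumers (storage IO etc.) now
    # runs without blocking the ReplicationClient.
    fan_out_to_consumers(txn)
    # Only once processing is done do we notify the client, so it can advance
    # the LSN it acknowledges back to Postgres.
    send(reply_to, {:txn_processed, txn.lsn})
    {:noreply, state}
  end

  defp fan_out_to_consumers(_txn), do: :ok   # placeholder for the real fan-out
end

Since a GenServer processes its mailbox sequentially, subsequent casts are naturally deferred until the previous transaction is handled; what this sketch doesn't give us is backpressure towards Postgres while that queue grows.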


msfstef commented Feb 25, 2025

After testing a little, it looks like Postgres will keep sending transactions regardless of the acknowledged LSN, so I think the current behaviour is probably fine: the connection restarts and resumes from the last acknowledged message.

My only concern is whether we can always guarantee idempotency of writes, even with compaction. Alternatively, we could filter out already-written transactions via some sort of stored global LSN (I think we already keep one).
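
For the alternative, something along these lines is what I have in mind (a rough sketch; Storage.last_processed_lsn/0, Storage.store_last_processed_lsn/1 and Lsn.compare/2 are hypothetical names, not an existing API):

# Skip transactions we already know are fully written, so replaying the same
# transaction after a reconnect becomes a no-op even where storage writes turn
# out not to be idempotent.
def maybe_process(txn) do
  last_lsn = Storage.last_processed_lsn()

  if Lsn.compare(txn.lsn, last_lsn) == :gt do
    :ok = process_transaction(txn)
    Storage.store_last_processed_lsn(txn.lsn)
  else
    :already_processed
  end
end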
