[Bug] "Not enough non-faulty bookies available error code: -6" persisting after outage while all bookies are reported as healthy #23807
Labels
type/bug
Search before asking
Read release policy
Version
Apache pulsar-3.8.0 chart with Pulsar 4.0.1, deployed on EKS.
Minimal reproduce step
Pulsar consisting of:
4 bookies
3 brokers
3 zookeepers
two persistent topics configured with E=3, Qw=3, Qa=2
Two bookies and ZooKeeper restart and remain unavailable for several minutes.
Both topics are owned by the same broker.
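For readers less familiar with the quorum settings, here is a minimal sketch of the arithmetic behind the error during the outage. It assumes only that BookKeeper needs at least E writable bookies to form an ensemble; the function name `can_form_ensemble` is hypothetical and is not part of any Pulsar or BookKeeper API.

```python
# Illustrative arithmetic only; not BookKeeper's actual placement logic.
# `can_form_ensemble` is a hypothetical helper name.

def can_form_ensemble(available_bookies: int, ensemble_size: int) -> bool:
    """Return True if enough bookies are available to form an ensemble of size E."""
    return available_bookies >= ensemble_size


TOTAL_BOOKIES = 4        # cluster size from this report
E, QW, QA = 3, 3, 2      # ensemble size, write quorum, ack quorum

# Sanity check on the settings themselves: E >= Qw >= Qa.
assert E >= QW >= QA

# During the outage two bookies are down, so only two remain.
during_outage = can_form_ensemble(TOTAL_BOOKIES - 2, E)

# After both bookies return, all four are available again.
after_recovery = can_form_ensemble(TOTAL_BOOKIES, E)

print(f"ensemble possible during outage: {during_outage}")
print(f"ensemble possible after recovery: {after_recovery}")
```

By this count the error is expected while two bookies are down, and a new ensemble should be possible for both topics once all four bookies are back.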
What did you expect to see?
What did you see instead?
During the outage, both topics rightly reported "org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available". However, once the bookies were back, only the first topic, a7u_transactions, recovered. The second topic, a7u_transactions_sanitized, was stuck in an endless loop of un-fencing and failed writes until the broker was restarted and the topic opened normally on its new owner.
Before the restart the following morning, we double-checked the bookie status:
all bookies were up, and no other topics were reporting any issues with bookies.
We don't understand why a7u_transactions recovered while a7u_transactions_sanitized remained impacted for hours even though all bookies were up. This looks like an edge case in which no attempt was made to form a new ensemble for the affected topic.
Detailed logs attached.
Anything else?
bookie-0 returns at:
bookie-1 returns at:
broker-1 (owner) logs when topics seemed to start behaving differently:
Are you willing to submit a PR?