Replies: 3 comments
-
After a lot of debugging, I have narrowed it down to the following:

First, the step that prevents Garnet from recovering at all is FastCommit mode being enabled, which is calling

Second, unfortunately, I'm still having an issue: somewhere in my commit history there appears to be some sort of data corruption that breaks the replaying of records. Specifically, at a certain record (Record A) the replay jumps way ahead in the commit data and lands in the wrong part of another record (Record B). Record A ends with a length byte of 0, so it triggers the logic that skips to the next page when it encounters trailing zeroed-out bytes. Record A does indeed have trailing zeroed-out bytes, but it is not close to the end of the page. Other skips to the next page result in jumps of between 4 and ~350 bytes, whereas this one skips ahead 3496116 bytes to get to Record B. Record B doesn't start with the typical

I've tried deleting the trailing zeroed-out bytes from Record A and I've tried removing the
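To make the skip-to-next-page behavior described above concrete, here is a minimal sketch of a paged log scanner. It is not Garnet's actual reader or on-disk format: the 4-byte little-endian length header, the 64-byte page size, and the names `scan_records` and `build_demo_log` are all assumptions made for illustration.

```python
import struct

HEADER_SIZE = 4  # hypothetical header: 4-byte little-endian payload length


def scan_records(log: bytes, page_size: int):
    """Yield ("record", offset, length) or ("skip", offset, jump) events."""
    offset = 0
    while offset + HEADER_SIZE <= len(log):
        (length,) = struct.unpack_from("<I", log, offset)
        if length == 0:
            # A zero length is treated as page padding, so the scanner
            # jumps to the start of the next page.
            next_page = (offset // page_size + 1) * page_size
            yield ("skip", offset, next_page - offset)
            offset = next_page
            continue
        yield ("record", offset, length)
        offset += HEADER_SIZE + length


def build_demo_log(page_size: int) -> bytes:
    """Build two pages: one short record plus zero padding, then one record."""
    page0 = (struct.pack("<I", 3) + b"abc").ljust(page_size, b"\x00")
    page1 = struct.pack("<I", 2) + b"hi"
    return page0 + page1


if __name__ == "__main__":
    for event in scan_records(build_demo_log(64), 64):
        print(event)
```

Under these assumptions a skip can never be larger than the page size, so a very large jump would only be consistent with this logic if the page size is at least that large and the zero length was read early in the page, before the real end of the data.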
-
There may have been a breaking change to the AOF format recently. Try reverting to the version of Garnet that was used when the AOF was originally created.
-
The production server was on version
So it seems reverting to
-
My Garnet server recently became unable to recover from its last checkpoint. It used to recover the HybridLog Stats fine and then initiate the AOF replay but, despite no changes to the config, it did this on the last server reset:
It's worth pointing out that I'm using a configuration with a primary Garnet server and a secondary that functions as a low compute/memory backup whose only purpose is to write to the AOF while the primary is down. My guess is that this strategy failed and somehow corrupted the checkpoint. I did test this in development, and the primary was able to pick up the data that the secondary wrote while the primary was down, so I hadn't expected this kind of failure. Given what I'm doing, I'm not sure whether this is a bug or a bad config based on a misunderstanding of how AOF works. Here are my config files:
docker-compose.yml
garnet.conf
garnet-aof.conf