Properly reflect rollbacks/restores in target tables #40
Comments
What are the source and target in this scenario? I am trying to understand the code change and am following the Jira issues as well, so I might ask a silly question.
@gzagarwal The idea here is that the source can be any of the supported sources and the target can be any of the supported targets. The vision was that a rollback/restore to a previous point in time or commit would trigger the same in the target format if possible (falling back to the current behavior of computing files to add/remove to the target format's view).
Hi @the-other-tim-brown @gzagarwal I’m interested in working on this feature :)
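To make the "rollback if possible, otherwise fall back to a diff" idea concrete, here is a minimal Python sketch. It is not XTable code: `TargetTable`, `rollback_to`, `apply_diff`, and `sync_rollback` are hypothetical names, and the sketch assumes the target keeps a record of which files were visible at each previously synced commit.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class TargetTable:
    """Hypothetical stand-in for a target format's table (not an XTable class)."""
    supports_rollback: bool
    # Assumption: maps a previously synced commit id to the files visible at it.
    history: Dict[str, Set[str]] = field(default_factory=dict)
    current_files: Set[str] = field(default_factory=set)

    def rollback_to(self, commit_id: str) -> bool:
        """Restore a previously synced commit; return False if that is not possible."""
        if not self.supports_rollback or commit_id not in self.history:
            return False
        self.current_files = set(self.history[commit_id])
        return True

    def apply_diff(self, to_add: Set[str], to_remove: Set[str]) -> None:
        """Current-style behavior: adjust the file listing with adds and removes."""
        self.current_files = (self.current_files - to_remove) | to_add


def sync_rollback(target: TargetTable, commit_id: str, source_files: Set[str]) -> str:
    """Prefer a native rollback in the target; otherwise compute and apply a file diff."""
    if target.rollback_to(commit_id):
        return "rollback"
    to_add = source_files - target.current_files
    to_remove = target.current_files - source_files
    target.apply_diff(to_add, to_remove)
    return "diff"
```

Either path leaves the target with the same file listing; the difference is whether the target's own history records a rollback or a set of removals.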
I have a question after some initial investigation into how to handle the source table. As I understand it, our sync() process is externally controlled (time-based or event-driven), so each sync might capture multiple operations on the table. For formats like Iceberg (via snapshot ID), detecting a rollback is straightforward. With Delta and Hudi, however, it is more complex. Delta relies on a log-based system and Hudi on a time-based model, and both may involve several operations (commit, add, delete, rollback/restore) between syncs. This makes rollback detection more difficult, especially if multiple operations have occurred since the rollback. In such cases, where mixed operation types are involved, maybe we still want to treat the changes as simple add/delete operations, as we do now? These are just my initial thoughts based on my investigation, and I may be missing something. I would appreciate any suggestions or input you might have!
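The decision described above could be sketched roughly as follows. This is only an illustration in Python: `plan_sync` and the `(kind, commit_id)` tuple shape are assumptions made for the example, not anything in the XTable codebase, and for a rollback/restore the `commit_id` is taken to be the commit being restored to.

```python
from typing import List, Optional, Tuple

# Operation kinds that revert the table to an earlier state.
ROLLBACK_KINDS = {"rollback", "restore"}


def plan_sync(ops: List[Tuple[str, str]]) -> Tuple[str, Optional[str]]:
    """Decide how to sync the operations seen since the last sync.

    Returns ("rollback", commit_id) only when every pending operation is a
    rollback/restore; any mix of operation types falls back to ("diff", None),
    which is the add/delete handling used today.
    """
    if ops and all(kind in ROLLBACK_KINDS for kind, _ in ops):
        # Use the most recent rollback target as the point to restore to.
        return ("rollback", ops[-1][1])
    return ("diff", None)
```

For example, `plan_sync([("restore", "c3")])` selects the rollback path, while `plan_sync([("commit", "c5"), ("rollback", "c4")])` falls back to the diff.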
@danielhumanmod What you've described is how we're currently handling rollbacks/restores, but I am thinking it may be less computationally expensive if we can just restore the table to a particular point in time instead of computing a large diff against the current state of the table.
Thanks for the clarification @the-other-tim-brown ! Based on the discussion, my current idea is:
Does this approach align with your thoughts?
Yes it does
Hi @the-other-tim-brown, based on the idea we discussed above, my plan is to divide this feature into two PRs:
I’ve completed a proof of concept for the first part and would like to discuss a few points with you before proceeding with further implementation. My main concern is that the fallback might happen frequently in cases where the source and target are not synced often. I’ve explained the root cause and included an example in the PR. Could you review the high-level idea in #569 and let me know if this approach is acceptable to you?
@danielhumanmod I will take a look today or tomorrow. Apologies for the delay on my end.
Right now when we see a rollback or restore in the source table, we just treat it as files being removed from the table. We should update this to instead issue a rollback command in the target tables so that the histories are more consistent between the source and target.