-
Notifications
You must be signed in to change notification settings - Fork 42
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
many Reconfigurator execution steps are fatal that shouldn't be #6999
Comments
I just hit this issue. I'm on I expected to see the reconfiguration request start and continuously fail because the new node isn't available to start. When this didn't happen after a few minutes I realized that I probably needed to make the clickhouse parts of the executor skip failures just like I did in #7059 for @plotnick's single-node clickhouse PR. I took a look at the blueprint executor via
I think we decided we wanted to skip this as well and have any dependencies fail if their datasets aren't there. I plan to take care of the multi-node clickhouse bits, but this is slightly outside my wheelhouse. I can probably take a stab at it if we agree this is the way to go though. |
I expected after restarting sled
|
Ha, I just didn't wait long enough! Everything resumed and the reconfiguration started.
|
@davepacheco @sunshowers @jgallagher @smklein I've been thinking about the value of returning errors and it's really handy to be able to see them in We have a way to add to the execution report with |
This PR is intended to fix #6999 and #7098. The executor steps are all independent already except those that will be fixed in this PR. A disk disposition was added to the blueprint and gets changed from `InService` to `Expunged` when the `PlanningInput` changes. A disk state was also added that goes from `Active` -> `Decommissioned`. This disk state changes when the parent sled is expunged or the sled-agent has been guaranteed to have seen the disk expungement as reported via inventory. This code fully implements the planner bits. The exucutor bits are going to be pulled in from the [fewer-fatal-errors-in-executor-branch](main...fewer-fatal-errors-in-executor). The executor bits are going to be expanded to implement the expunge and decommission as dictated by the blueprint. Because the planner changes are tightly coupled with the executor, they will all come in this PR. Further code will also be added to represent the disk disposition and state in the blueprint and diff tables for omdb output. This PR is primarily blocked on [revamping the diff system](#7240). This is particularly necessary because we need to change those tables and want to do it rigorously after we have a visitor based mechanism for doing so. There is some more cleanup to do here as well, in particular around the `physical_disk_filter` and things that may not be needed when the executor bits get fully implemented.
Right now, failure at many blueprint execution steps causes all of execution to stop. That feels like the conservative thing to do, but it's not necessarily. When fixing a broken system requires Reconfigurator to proceed despite some failures, this choice is problematic. This came up in a colo incident a few weeks ago where we wanted to use Reconfigurator to repair the system by deploying a second Oximeter zone. In trying to do so, execution failed because it couldn't deploy datasets to a sled agent that was known to be broken (and unrelated to the problem we were trying to repair). There's no reason that failure to deploy datasets to one sled should prevent Reconfigurator from trying to deploy zones to a different sled. Arguably, it's wrong to say that failing to deploy datasets to a sled should prevent us from trying to deploy zones to the same sled. After all, it's totally possible that we successfully deployed datasets during a previous execution, or that there are no new datasets to be created there. We could instead leave it to the sled agent to check this dependency, failing the zones request if it's missing some dataset that needs to be there.
In today's update call I proposed a pattern where:
PUT /omicron-zones
should fail if the body refers to datasets that haven't been configured).The conclusion was that this is directionally a good pattern, but it's not clear how unsafe it would be to just ignore these dependencies today. So the proposal here is:
The text was updated successfully, but these errors were encountered: