-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Execution of failpoint should not block deactivation #64
Comments
@henrybear327 are you interested and have bandwidth to take care of this issue? Thanks |
I noticed this issue long time ago, but just did not get time to have a deep dive on it. |
@ahrtr please assign it to me as I would like to give it a try! Thank you! |
I will be putting up a PoC in the next couple of days :) |
There are 2 main flows of the gofail library: namely enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time. This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as `failpoints` map. Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed) Reference: - etcd-io#64
There are 2 main flows of the gofail library: namely enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time. This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as `failpoints` map. Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed) Reference: - etcd-io#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
There are 2 main flows of the gofail library: namely enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time. This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as `failpoints` map. Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed) Reference: - etcd-io#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
There are 2 main flows of the gofail library: namely enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time. This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as `failpoints` map. Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed) Reference: - etcd-io#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
There are 2 main flows of the gofail library: namely enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time. This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as `failpoints` map. Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed) Reference: - etcd-io#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
Reference: - https://github.com/etcd-io/etcd/pull/17719/files - etcd-io/gofail#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
Reference: - etcd-io#17719 - etcd-io/gofail#64 Signed-off-by: Chun-Hung Tseng <[email protected]>
Because of etcd-io/gofail#64 being merged, we can rollback the change. Reference: - etcd-io#17719 Signed-off-by: Chun-Hung Tseng <[email protected]>
@ahrtr I think we can close this issue! |
Thanks. |
Because of etcd-io/gofail#64 being merged, we can rollback the change. Reference: - etcd-io#17719 Signed-off-by: Chun-Hung Tseng <[email protected]>
Because of etcd-io/gofail#64 being merged, we can rollback the change. Reference: - etcd-io#17719 Signed-off-by: Chun-Hung Tseng <[email protected]>
Useful for etcd-io/etcd#17680
If I use a http and setup a failpoint with a long sleep (e.g. 1s), deactivating it runs into a problem.
Execution and failpoint change run under the same lock. Deactivation will be really delayed as it's racing against execution for the same lock. For example in etcd-io/etcd#17680, I want to inject a sleep into watch write loop, results in continuous execution of the endpoint to sleep. As result it can take a long time to deactivate it.
I would expect that failpoint execution only uses lock to copy the failpoint term, and then execution is done outside of critical section, allowing deactivation to be immediate.
The text was updated successfully, but these errors were encountered: