Execution of failpoint should not block deactivation #64

serathius · 2024-04-01T16:20:05Z

If I use HTTP to set up a failpoint with a long sleep (e.g. 1s), deactivating it runs into a problem.
Failpoint execution and failpoint changes run under the same lock, so deactivation can be badly delayed because it races against execution for that lock. For example, in etcd-io/etcd#17680 I want to inject a sleep into the watch write loop, which results in the failpoint being executed continuously, sleeping each time. As a result, it can take a long time to deactivate it.

I would expect failpoint execution to use the lock only to copy the failpoint term, and then to run the term outside of the critical section, allowing deactivation to be immediate.
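A minimal sketch of that expectation, not gofail's actual code: the `Failpoint` type, its methods, and `disableWhileRunning` are hypothetical names, and the term is modelled as a plain `func()`. The point is only that `Eval` holds the lock long enough to copy the term and then runs it unlocked, so `Disable` never waits behind a sleeping term.

```go
package main

import (
	"fmt"
	"sync"
)

// Failpoint is a hypothetical stand-in for a failpoint; it is not the
// library's real type.
type Failpoint struct {
	mu   sync.Mutex
	term func() // nil means the failpoint is deactivated
}

// Enable installs a term; the lock is held only for the assignment.
func (fp *Failpoint) Enable(term func()) {
	fp.mu.Lock()
	fp.term = term
	fp.mu.Unlock()
}

// Disable clears the term; it never waits for a running term to finish.
func (fp *Failpoint) Disable() {
	fp.mu.Lock()
	fp.term = nil
	fp.mu.Unlock()
}

// Eval copies the term under the lock, then runs it outside the
// critical section. It reports whether a term was executed.
func (fp *Failpoint) Eval() bool {
	fp.mu.Lock()
	term := fp.term
	fp.mu.Unlock()
	if term == nil {
		return false
	}
	term()
	return true
}

// disableWhileRunning deactivates the failpoint while its term is still
// executing. If Eval held the lock during execution, Disable would block
// forever here, because the term is only released after Disable returns.
func disableWhileRunning() bool {
	fp := &Failpoint{}
	started := make(chan struct{})
	release := make(chan struct{})
	fp.Enable(func() {
		close(started)
		<-release // stands in for a long sleep
	})

	done := make(chan bool)
	go func() { done <- fp.Eval() }()

	<-started    // the term is now executing
	fp.Disable() // returns immediately; the lock is free
	close(release)

	ran := <-done
	return ran && !fp.Eval() // later evaluations see the deactivation
}

func main() {
	fmt.Println("deactivated while term was running:", disableWhileRunning())
}
```

An execution that already copied the term still runs to completion after deactivation; only subsequent evaluations observe the change.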

ahrtr · 2024-04-08T18:13:12Z

@henrybear327 are you interested and have bandwidth to take care of this issue? Thanks

ahrtr · 2024-04-08T18:15:10Z

I noticed this issue a long time ago, but just did not get time to take a deep dive into it.

henrybear327 · 2024-04-08T18:22:10Z

> @henrybear327 are you interested and have bandwidth to take care of this issue? Thanks

@ahrtr please assign it to me as I would like to give it a try! Thank you!

henrybear327 · 2024-04-15T11:26:27Z

I will be putting up a PoC in the next couple of days :)

There are 2 main flows of the gofail library: enable/disable and execution (`Acquire`) of the failpoints. Currently, a mutex is protecting both flows, thus only one action can make progress at a time.

This PR proposes a fine-grained mutex, as each failpoint is protected under a dedicated `RWMutex`. The existing `failpointsMu` will only be protecting the main shared data structures, such as the `failpoints` map.

Notice that in our current implementation, the execution of the same failpoint is still sequential (there is a lock within `eval` on the term being executed).

Reference:
- etcd-io#64

Signed-off-by: Chun-Hung Tseng <[email protected]>
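The locking layout described in this commit message can be sketched roughly as follows. This is an illustration under assumptions, not the code from the PR: only `failpointsMu`, `failpoints`, and the use of an `RWMutex` per failpoint come from the text above, while `registry`, `lookup`, `enable`, `disable`, `acquire`, and `slowDoesNotBlockOthers` are hypothetical names. The additional lock inside `eval` that keeps executions of the same failpoint sequential is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// failpoint carries its own lock, so changing one failpoint does not
// contend with any other failpoint. (Hypothetical type.)
type failpoint struct {
	mu   sync.RWMutex
	term func() string // nil means disabled
}

// registry is a hypothetical holder for the shared map.
type registry struct {
	failpointsMu sync.Mutex // guards only the failpoints map
	failpoints   map[string]*failpoint
}

func newRegistry() *registry {
	return &registry{failpoints: map[string]*failpoint{}}
}

// lookup returns the failpoint for name, creating it when create is true.
func (r *registry) lookup(name string, create bool) *failpoint {
	r.failpointsMu.Lock()
	defer r.failpointsMu.Unlock()
	fp := r.failpoints[name]
	if fp == nil && create {
		fp = &failpoint{}
		r.failpoints[name] = fp
	}
	return fp
}

func (r *registry) enable(name string, term func() string) {
	fp := r.lookup(name, true)
	fp.mu.Lock()
	fp.term = term
	fp.mu.Unlock()
}

func (r *registry) disable(name string) {
	if fp := r.lookup(name, false); fp != nil {
		fp.mu.Lock()
		fp.term = nil
		fp.mu.Unlock()
	}
}

// acquire reads the term under the per-failpoint read lock and runs it
// after the lock has been released.
func (r *registry) acquire(name string) (string, bool) {
	fp := r.lookup(name, false)
	if fp == nil {
		return "", false
	}
	fp.mu.RLock()
	term := fp.term
	fp.mu.RUnlock()
	if term == nil {
		return "", false
	}
	return term(), true
}

// slowDoesNotBlockOthers runs a blocking term on one failpoint and checks
// that another failpoint can still be executed and that both can be disabled.
func slowDoesNotBlockOthers() bool {
	r := newRegistry()
	started := make(chan struct{})
	release := make(chan struct{})
	r.enable("slow", func() string {
		close(started)
		<-release // stands in for a long sleep
		return "slow done"
	})
	r.enable("fast", func() string { return "fast done" })

	slowResult := make(chan string)
	go func() {
		v, _ := r.acquire("slow")
		slowResult <- v
	}()
	<-started // "slow" is now executing

	v, ok := r.acquire("fast")
	r.disable("fast")
	r.disable("slow")
	_, fastStillOn := r.acquire("fast")
	_, slowStillOn := r.acquire("slow")

	close(release)
	slow := <-slowResult
	return ok && v == "fast done" && !fastStillOn && !slowStillOn && slow == "slow done"
}

func main() {
	fmt.Println("other failpoints unaffected:", slowDoesNotBlockOthers())
}
```

Keeping the map lock separate from the per-failpoint locks means the map lock is held only for a lookup or insertion, never for the duration of a term.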

Because etcd-io/gofail#64 has been merged, we can roll back the change.

Reference:
- etcd-io#17719

Signed-off-by: Chun-Hung Tseng <[email protected]>

henrybear327 · 2024-05-17T12:25:36Z

@ahrtr I think we can close this issue!

ahrtr · 2024-05-17T12:28:28Z

> @ahrtr I think we can close this issue!

Thanks.

ahrtr assigned henrybear327 Apr 8, 2024

serathius mentioned this issue Apr 12, 2024

Robustness flake due to gofailpoint deactivation timeout etcd-io/etcd#17592

Closed

henrybear327 mentioned this issue Apr 18, 2024

Fix execution of failpoint should not block deactivation #65

Merged

henrybear327 added a commit to henrybear327/etcd that referenced this issue May 15, 2024

Rollback changes on PR17719

2349258

Reference:
- https://github.com/etcd-io/etcd/pull/17719/files
- etcd-io/gofail#64

Signed-off-by: Chun-Hung Tseng <[email protected]>

henrybear327 added a commit to henrybear327/etcd that referenced this issue May 15, 2024

Rollback changes on PR17719

acdaba1

Reference:
- etcd-io#17719
- etcd-io/gofail#64

Signed-off-by: Chun-Hung Tseng <[email protected]>

henrybear327 mentioned this issue May 15, 2024

[DO-NOT-MERGE] Rollback changes on #17719 etcd-io/etcd#18018

Closed

henrybear327 mentioned this issue May 17, 2024

Rollback increase timeout for deactivating failpoint etcd-io/etcd#18022

Closed

ahrtr closed this as completed May 17, 2024
