DAOS-16982 csum: recalculate checksum on retrying #15786
base: master
Ticket title is 'We should not report checksum errors against the nmve device for key verification'
I have already tested it by manually injecting failure, and I'm working on turning that into a unit test.
7f74db4 to bb23b17
@@ -5140,6 +5141,11 @@ obj_csum_update(struct dc_object *obj, daos_obj_update_t *args, struct obj_auxi_
	if (!obj_csum_dedup_candidate(&obj->cob_co->dc_props, args->iods, args->nr))
		return 0;

	if (obj_auxi->csum_retry) {
		/* Release old checksum result and prepare for new calculation */
		daos_csummer_free_ic(obj->cob_co->dc_csummer, &obj_auxi->rw_args.iod_csums);
I think we probably want to do this after a couple of retries
It's easy to add, but I wonder whether it is really necessary, since a checksum error is itself a rare event.
How about revising it to:
if (obj_auxi->csum_retry && obj_auxi->csum_retry_cnt > 2) { ... }
Would that work for you?
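To make the trade-off in this thread concrete, here is a minimal sketch of the two guards being compared. The `retry_state` struct is a hypothetical stand-in and not the real `struct obj_auxi_args`; the meaning of `csum_retry_cnt` (number of checksum retries so far, including the current one) is an assumption, since the counter is only proposed in this comment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the retry fields discussed in this thread. */
struct retry_state {
	bool     csum_retry;     /* previous attempt failed with a checksum error */
	unsigned csum_retry_cnt; /* assumed: checksum retries so far, including this one */
};

/* Guard as written in the diff: recalculate on every checksum retry. */
static bool recalc_always(const struct retry_state *s)
{
	assert(s != NULL);
	return s->csum_retry;
}

/* Guard proposed in review: recalculate only after a couple of retries. */
static bool recalc_after_threshold(const struct retry_state *s)
{
	assert(s != NULL);
	return s->csum_retry && s->csum_retry_cnt > 2;
}
```

Under these assumptions, the threshold variant resends the old checksum on the first two retries and only recalculates from the third retry onward.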
		/* Release old checksum result and prepare for new calculation */
		daos_csummer_free_ic(obj->cob_co->dc_csummer, &obj_auxi->rw_args.iod_csums);
	}

	return dc_obj_csum_update(obj->cob_co->dc_csummer, obj->cob_co->dc_props,
				  obj->cob_md.omd_id, args->dkey, args->iods, args->sgls, args->nr,
				  obj_auxi->reasb_req.orr_singv_los, &obj_auxi->rw_args.dkey_csum,
In the case of the actual issue we saw, it was the dkey_csum that needed to be recalculated. Is that happening here?
Yes, if I read the code correctly, because we release the previous calculation above.
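The release-then-recalculate pattern behind this answer can be sketched as follows. This is a toy model, not DAOS code: `csum_cache`, `csum_update`, `csum_release`, and the byte-sum checksum are all hypothetical, and the sketch assumes the update path reuses an existing result when one is present, which this thread does not confirm.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cache of a computed checksum; NULL means "not calculated yet". */
struct csum_cache {
	uint32_t *csum;
};

/* Toy checksum (byte sum), not a DAOS algorithm. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Calculate only when no cached result exists (assumed behaviour). Returns 0 on success. */
static int csum_update(struct csum_cache *c, const uint8_t *buf, size_t len)
{
	assert(c != NULL);
	if (c->csum != NULL)
		return 0; /* an existing, possibly stale, result is reused */
	c->csum = malloc(sizeof(*c->csum));
	if (c->csum == NULL)
		return -1;
	*c->csum = toy_csum(buf, len);
	return 0;
}

/* Release the cached result so the next csum_update() recalculates it. */
static void csum_release(struct csum_cache *c)
{
	assert(c != NULL);
	free(c->csum);
	c->csum = NULL;
}
```

In this model, a corrupted cached value survives any number of `csum_update()` calls until `csum_release()` is called first, which is the role the release call plays on the retry path in the sketch.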
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/344/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/334/log
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/387/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/345/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/480/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/3/execution/node/339/log
Test stage Build on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/375/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/373/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/345/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/338/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/335/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/4/execution/node/342/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/5/execution/node/373/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/5/execution/node/319/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/5/execution/node/345/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/5/execution/node/342/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/6/execution/node/374/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/6/execution/node/371/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/6/execution/node/356/log
Test stage Build RPM on Leap 15.5 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/6/execution/node/355/log
Test stage Build DEB on Ubuntu 20.04 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/6/execution/node/359/log
Test stage Build on Leap 15.5 with Intel-C and TARGET_PREFIX completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/7/execution/node/373/log
Test stage Build RPM on EL 9 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/7/execution/node/348/log
Test stage Build RPM on EL 8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15786/7/execution/node/347/log
c63ecc9 to e119758
e119758 to f0a07e6
This PR fixes retry logic by actually recalculating the checksum; also it removes the code that incorrectly records nvme error.

Run-GHA: true
Change-Id: Ib0287851fea4d125eecda48c5ccb3c73ed85b8f8
Signed-off-by: Jinshan Xiong <[email protected]>
f0a07e6 to eb6a7d1
Functional on EL 8.8 Test Results: 131 tests, 127 ✅, 1h 30m 53s ⏱️. Results for commit eb6a7d1.
@wangdi1 @liuxuezhao can you please take a look?
This PR fixes the retry logic so that the checksum is actually recalculated on retry. It also removes the code that incorrectly records an NVMe error.
This is a quick fix ahead of the more complete fix discussed here: https://daos-stack.slack.com/archives/C4SM0RZ54/p1738030213108609
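As a rough illustration of why recalculating matters on retry, the following sketch models a client whose stored checksum has gone bad. Everything here is hypothetical (the `request` struct, `server_verify`, `send_with_retry`, `MAX_RETRIES`, and the byte-sum checksum); it is a toy model of the idea described above, not the DAOS client or server code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_RETRIES 3 /* hypothetical retry limit */

/* Toy checksum (byte sum), not a DAOS algorithm. */
static uint32_t toy_csum(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;

	for (size_t i = 0; i < len; i++)
		sum += buf[i];
	return sum;
}

/* Hypothetical request carrying data and its client-side checksum. */
struct request {
	const uint8_t *data;
	size_t         len;
	uint32_t       csum;
	bool           csum_valid; /* false: must be (re)calculated before sending */
};

/* Stand-in for server-side verification of the received checksum. */
static bool server_verify(const struct request *r)
{
	return toy_csum(r->data, r->len) == r->csum;
}

/* Returns the number of attempts used, or -1 if every attempt failed. */
static int send_with_retry(struct request *r, bool recalc_on_retry)
{
	assert(r != NULL);
	for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
		if (!r->csum_valid) {
			r->csum = toy_csum(r->data, r->len);
			r->csum_valid = true;
		}
		if (server_verify(r))
			return attempt;
		/* Mismatch: unless invalidated, the same bad value is resent. */
		if (recalc_on_retry)
			r->csum_valid = false;
	}
	return -1;
}
```

In this model, a request that starts with a wrong checksum fails every attempt when `recalc_on_retry` is false, and succeeds on the second attempt when it is true.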
Change-Id: Ib0287851fea4d125eecda48c5ccb3c73ed85b8f8
Signed-off-by: Jinshan Xiong [email protected]
Before requesting gatekeeper:
- Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: