DAOS-16251 pool: DEBUG patch, IV pool map buf investigation #14929

Draft
wants to merge 13 commits into
base: release/2.6

Conversation

@kccain (Contributor) commented Aug 14, 2024

Adds debug logging in the IV code to examine pool map buffer corruption scenarios:

  • Guards against a possibly uninitialized d_sg_list_t in crt_hdlr_iv_sync_aux() and call_pre_sync_cb(), which could theoretically affect pool map buffer contents from IV communication. Also adds associated logging.
  • crt_ivsync_issue_rpc() explicitly logs whether a bulk or inline corpc will be used, to correspond with the crt_hdlr_iv_sync_aux() and call_pre_sync_cb() logging.
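
To illustrate the first bullet, here is a minimal sketch of why zero-initializing a scatter-gather list matters. The `demo_iov` and `demo_sg_list` types and both functions are hypothetical stand-ins written for this example; they only loosely resemble d_sg_list_t and are not the code changed by this PR.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for illustration only; not the real DAOS types. */
struct demo_iov {
	void   *iov_buf;     /* start of the buffer */
	size_t  iov_buf_len; /* capacity of the buffer */
	size_t  iov_len;     /* bytes of valid data */
};

struct demo_sg_list {
	uint32_t         sg_nr;     /* number of entries in sg_iovs */
	uint32_t         sg_nr_out; /* number of entries filled on output */
	struct demo_iov *sg_iovs;   /* array of buffers, may be NULL */
};

/* Hypothetical helper: total valid bytes described by the list. */
size_t demo_sgl_data_len(const struct demo_sg_list *sgl)
{
	size_t total = 0;

	if (sgl->sg_iovs == NULL)
		return 0;
	for (uint32_t i = 0; i < sgl->sg_nr; i++)
		total += sgl->sg_iovs[i].iov_len;
	return total;
}

/*
 * Hypothetical handler fragment. Because sgl is zero-initialized, the
 * path that never fills it in still sees an empty list (sg_nr == 0,
 * sg_iovs == NULL) instead of whatever happened to be on the stack.
 */
size_t demo_handler(int have_payload, struct demo_iov *iov)
{
	struct demo_sg_list sgl = {0};

	if (have_payload) {
		sgl.sg_nr   = 1;
		sgl.sg_iovs = iov;
	}
	return demo_sgl_data_len(&sgl);
}
```

With a single 8-byte payload, `demo_handler(1, &iov)` reports 8 bytes, and `demo_handler(0, &iov)` reports 0 rather than reading indeterminate fields.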

In case it is needed during the investigation, this change also contains a cherry-pick of PR 14702:
DAOS-16164 pool: Update target status to UPIN for no_data_sync mode

Finally, this change includes a manual cherry-pick of PR 14971 (aaoganez/rpc-bulk-deadlines):

  • Switch RPC headers to carry a deadline instead of a timeout.
  • Add checks at the start and end of each bulk transfer to ensure the deadline has not expired.
  • Add deadline expiration checks in all places where the rpc_priv timeout is initialized.
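
The deadline approach in the bullets above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from PR 14971: the `demo_rpc_hdr` struct, the function names, and the use of whole seconds are all hypothetical, and the sketch assumes the sender's and receiver's clocks are comparable.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical header that carries an absolute deadline, not a timeout. */
struct demo_rpc_hdr {
	uint64_t deadline_sec;
};

/* Sender side: convert the relative timeout into a deadline once. */
void demo_hdr_set_deadline(struct demo_rpc_hdr *hdr, uint64_t now_sec,
			   uint32_t timeout_sec)
{
	hdr->deadline_sec = now_sec + timeout_sec;
}

/* True once the current time has reached or passed the deadline. */
bool demo_deadline_expired(const struct demo_rpc_hdr *hdr, uint64_t now_sec)
{
	return now_sec >= hdr->deadline_sec;
}

/*
 * Receiver side: check the deadline before and after a simulated bulk
 * transfer that takes transfer_sec seconds. Returns 0 on success and
 * -1 if the deadline expired at either check.
 */
int demo_bulk_transfer(const struct demo_rpc_hdr *hdr, uint64_t start_sec,
		       uint32_t transfer_sec)
{
	uint64_t end_sec;

	if (demo_deadline_expired(hdr, start_sec))
		return -1; /* too late to start */

	end_sec = start_sec + transfer_sec; /* stand-in for the real transfer */

	if (demo_deadline_expired(hdr, end_sec))
		return -1; /* finished, but the requester has given up */
	return 0;
}
```

A relative timeout restarts from zero each time the request is handed to a new stage, whereas a deadline keeps shrinking the remaining budget, so a transfer that starts late is rejected instead of completing after the requester has stopped waiting.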

Allow-unstable-test: true
faults-enabled: false

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Adds debug logging in the IV code to examine pool map buffer corruption
scenarios:
- Guards against a possibly uninitialized d_sg_list_t in
  crt_hdlr_iv_sync_aux() and call_pre_sync_cb(), which could
  theoretically affect pool map buffer contents from IV
  communication. Also adds associated logging.
- crt_ivsync_issue_rpc() explicitly logs whether a bulk or inline
  corpc will be used, to correspond with the crt_hdlr_iv_sync_aux()
  and call_pre_sync_cb() logging.

In case it is needed during the investigation, this change also
contains a cherry-pick of PR 14702:
DAOS-16164 pool: Update target status to UPIN for no_data_sync mode

Allow-unstable-test: true
faults-enabled: false

Co-authored-by: Alexander A Oganezov <[email protected]>
Signed-off-by: Kenneth Cain <[email protected]>

github-actions bot commented Aug 14, 2024

Ticket title is 'DAOS 2.4.2-4: Errored DAOS engine 0 exited unexpectedly on daos_user'
Status is 'In Progress'
Labels: 'ALCF,pre_acceptance_issues,scrubbed_2.8'
https://daosio.atlassian.net/browse/DAOS-16251

for more detailed information.

Add an RPC_INFO() macro and use it, rather than D_INFO(), for
logging in the IV sync code path.

Signed-off-by: Kenneth Cain <[email protected]>

@daosbuild1 (Collaborator) commented

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14929/2/testReport/

kccain added 5 commits August 16, 2024 10:19
Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Kenneth Cain <[email protected]>

In crt_hg_unpack_header(), log when the RPC header is known
to have been transferred via bulk.

Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Kenneth Cain <[email protected]>

  - ivc_on_get stores a random entry_priv_val into priv_entry for many
    ivc_ent_get implementations. Although the value is not used, this
    should be avoided.

  - ds_iv_done stores a pointer to the stack variable rc in
    cb_info->future, which outlives the stack frame of ds_iv_done.
    Although the pointer is not used, it is confusing.

  - ds_pool_iv_map_update associates the input map buffer with the map
    version from ds_pool rather than with the input map version.
    Although this may be fine, we should not invite unnecessary
    trouble or concern.
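
As a sketch of the second item, the fragment below shows one way to avoid leaving a pointer to a stack variable in a longer-lived structure. The `demo_cb_info` struct and `demo_done()` are hypothetical names for this example; the actual change to ds_iv_done may simply drop the assignment.

```c
#include <stddef.h>

/* Hypothetical callback info that outlives the function filling it in. */
struct demo_cb_info {
	int *result;     /* must point at storage that outlives the callback */
	int  result_val; /* storage owned by the structure itself */
};

/*
 * Risky pattern, shown only as a comment:
 *
 *     int rc = ...;
 *     info->result = &rc;   (dangling as soon as the function returns)
 *
 * Safer pattern: copy the value into storage owned by info and point
 * at that storage instead.
 */
void demo_done(struct demo_cb_info *info, int rc)
{
	info->result_val = rc;
	info->result     = &info->result_val;
}
```

After `demo_done(&info, -5)` returns, `*info.result` is still a valid read and yields -5, because the pointed-to storage lives as long as `info` does.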

Signed-off-by: Li Wei <[email protected]>

Signed-off-by: Kenneth Cain <[email protected]>

- Switch RPC headers to carry a deadline instead of a timeout.
- Add checks at the start and end of each bulk transfer to ensure the deadline has not expired.
- Add deadline expiration checks in all places where the rpc_priv timeout is initialized.

Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Alexander A Oganezov <[email protected]>

Signed-off-by: Kenneth Cain <[email protected]>

Allow-unstable-test: true
faults-enabled: false
Skip-nlt: true
Skip-fault-injection-test: true

Signed-off-by: Kenneth Cain <[email protected]>

@daosbuild1 (Collaborator) commented

Test stage Functional Hardware Medium completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-14929/6/execution/node/1270/log

…bug_and_reint_no_data_sync

Signed-off-by: Kenneth Cain <[email protected]>

Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Kenneth Cain <[email protected]>

…bug_and_reint_no_data_sync

Signed-off-by: Kenneth Cain <[email protected]>

Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Kenneth Cain <[email protected]>

Allow-unstable-test: true
faults-enabled: false

Signed-off-by: Kenneth Cain <[email protected]>
2 participants