DAOS-16464 test: improve online_rebuild_mdtest.py (#15108) #15807

Draft
wants to merge 1 commit into
base: release/2.6

Contributor

@rpadma2 rpadma2 commented Jan 28, 2025

Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true

  • Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds.
  • Catch exceptions raised in the mdtest thread.
  • Reduce logging.
  • Misc refactoring improvements
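
The first two bullets could be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `run_with_midpoint_stop`, `run_workload`, `stop_ranks`, and `result_timeout` are hypothetical names, and the workload is a stand-in for mdtest, not a real invocation.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_with_midpoint_stop(run_workload, stop_ranks, stonewall_secs, result_timeout):
    """Run a workload in a thread and stop ranks halfway through the stonewall.

    All arguments are hypothetical stand-ins, not DAOS test APIs:
      run_workload(stonewall_secs) -- callable that runs the I/O job
      stop_ranks()                 -- callable that stops the server ranks
    """
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        future = executor.submit(run_workload, stonewall_secs)
        # Wait up to half the stonewall time instead of a fixed sleep.
        done, _ = wait([future], timeout=stonewall_secs / 2)
        if not done:
            # The workload is still running, so stop ranks while I/O is active.
            stop_ranks()
        # result() re-raises any exception raised in the workload thread and
        # raises a timeout error if the workload does not finish in time.
        return future.result(timeout=result_timeout)
    finally:
        # Do not block here on a workload thread that may be hung.
        executor.shutdown(wait=False)
```

Using a future rather than a bare `threading.Thread` means a failure inside the workload surfaces in the caller instead of being lost in the thread.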

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Ticket title is 'erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest - time out waiting for mdtest after server stop'
Status is 'Reopened'
Labels: 'pr_test,request_for_2.6.3,testp2'
https://daosio.atlassian.net/browse/DAOS-16464

Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true

- Run with a stonewall and stop ranks after half of the stonewall time so
the timing is more reliable than arbitrarily sleeping for 30 seconds.
- Catch exceptions raised in the mdtest thread.
- Reduce logging.
- Misc refactoring improvements

Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Padmanabhan <[email protected]>
@rpadma2 rpadma2 force-pushed the rpadma2/daos_16464_2_6 branch from c8a67ab to 843e252 Compare January 28, 2025 23:23
@rpadma2 rpadma2 added the clean-cherry-pick Cherry-pick from another branch that did not require additional edits label Jan 28, 2025
@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15807/2/execution/node/926/log

@rpadma2
Contributor Author

rpadma2 commented Jan 30, 2025

The single failure is a known issue: DAOS-16737.

2025-01-29 23:07:35,981 process          L0604 INFO | Running '/usr/bin/daos container destroy TestPool_1 TestContainer_1 --force'
2025-01-29 23:08:36,330 process          L0416 DEBUG| [stderr] 01/29-23:08:36.32 wolf-51 DAOS[95432/95436/0] rpc  ERR  src/cart/crt_context.c:1252 crt_context_timeout_check(0x7fd098088070) [opc=0x2060001 (DAOS_POOL_MODULE:POOL_CONNECT) rpcid=0x3f0abd2500000006 rank:tag=6:0] ctx_id 0, (status: 0x38) timed out (60 seconds), target (6:0)
2025-01-29 23:08:36,331 process          L0416 DEBUG| [stderr] 01/29-23:08:36.32 wolf-51 DAOS[95432/95436/0] hg   ERR  src/cart/crt_hg.c:1361 crt_hg_req_send_cb(0x7fd098088070) [opc=0x2060001 (DAOS_POOL_MODULE:POOL_CONNECT) rpcid=0x3f0abd2500000006 rank:tag=6:0] RPC failed; rc: DER_TIMEDOUT(-1011): 'Time out'
2025-01-29 23:09:00,177 runner           L0218 DEBUG| Original status: {}
2025-01-29 23:09:00,178 runner           L0233 ERROR| ERROR Test died without reporting the status. -> TestAbortError: 4-./erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest;run-container-hosts-servers-12_server-mdtest-client_processes-dfs_oclass_mux-12_server_ec8p2gx-pool-server_config-engines-0-storage-0-1-setup-bbe1.
2025-01-29 23:09:00,178 runner           L0235 WARNI| Killing hanged test process 92323
2025-01-29 23:09:00,179 stacktrace       L0039 ERROR| 

@rpadma2 rpadma2 marked this pull request as ready for review January 30, 2025 16:10
@rpadma2 rpadma2 requested review from a team as code owners January 30, 2025 16:10
@rpadma2 rpadma2 requested a review from a team January 31, 2025 17:02
Contributor

@phender phender left a comment

The change doesn't seem to fix the issue: the job log at https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15807/2/artifact/Functional%20Hardware%20Large/erasurecode/online_rebuild_mdtest.py/repeat001/job.log shows the same failure reported in the ticket:

2025-01-29 22:41:59,187 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d system stop --force --ranks=4'
2025-01-29 22:41:59,274 process          L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.274166 main.go:228: debug output enabled
2025-01-29 22:41:59,274 process          L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.274703 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2025-01-29 22:41:59,280 process          L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.280020 system.go:465: DAOS system stop request: *mgmt.SystemStopReq (sys:"daos_server-2.6.3" force:true ranks:"4")
2025-01-29 22:41:59,281 process          L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.281445 rpc.go:278: request hosts: [wolf-110:10001 wolf-111:10001 wolf-112:10001 wolf-114:10001 wolf-115:10001]
2025-01-29 22:41:59,939 process          L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.939903 response.go:179: wolf-110:10001: *mgmt.SystemStopResp stopped:4
2025-01-29 22:41:59,940 process          L0416 DEBUG| [stdout] Rank Operation Result 
2025-01-29 22:41:59,940 process          L0416 DEBUG| [stdout] ---- --------- ------ 
2025-01-29 22:41:59,940 process          L0416 DEBUG| [stdout] 4    stop      OK     
2025-01-29 22:41:59,941 process          L0416 DEBUG| [stdout] 
2025-01-29 22:42:00,942 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d system stop --force --ranks=4' finished with 0 after 1.6973729133605957s
2025-01-29 22:42:00,972 command_utils    L1348 INFO | Updating the expected state for rank 4 on wolf-115: joined -> ['stopped', 'excluded']
2025-01-29 22:43:59,479 process          L0416 DEBUG| [stdout] Continue stonewall hit min: 16623 max: 17205 avg: 16939.5 
2025-01-29 22:43:59,479 process          L0416 DEBUG| [stdout] 
2025-01-29 22:44:03,540 process          L0416 DEBUG| [stdout] 
2025-01-29 22:44:03,540 process          L0416 DEBUG| [stdout] SUMMARY rate: (of 1 iterations)
2025-01-29 22:44:03,540 process          L0416 DEBUG| [stdout]    Operation                     Max            Min           Mean        Std Dev
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    ---------      
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]                ---            ---           ----        -------
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    File creation                 505.818        505.818        505.818          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    File stat                       0.000          0.000          0.000          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    File read                       0.000          0.000          0.000          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    File removal                    0.000          0.000          0.000          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    Tree creation                  25.716         25.716         25.716          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout]    Tree removal                    0.000          0.000          0.000          0.000
2025-01-29 22:44:03,541 process          L0416 DEBUG| [stdout] -- finished at 01/29/2025 22:44:03 --
2025-01-29 22:44:03,542 process          L0416 DEBUG| [stdout] 
2025-01-29 23:06:59,336 stacktrace       L0039 ERROR| 
2025-01-29 23:06:59,336 stacktrace       L0042 ERROR| Reproduced traceback from: /localhome/jenkins/venv/lib64/python3.6/site-packages/avocado/core/test.py:767
2025-01-29 23:06:59,353 stacktrace       L0045 ERROR| Traceback (most recent call last):
2025-01-29 23:06:59,353 stacktrace       L0045 ERROR|   File "/usr/lib/daos/TESTING/ftest/erasurecode/online_rebuild_mdtest.py", line 36, in test_ec_online_rebuild_mdtest
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|     self.start_online_mdtest(ranks_to_stop)
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|   File "/usr/lib/daos/TESTING/ftest/util/ec_utils.py", line 460, in start_online_mdtest
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|     job.join()
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|   File "/usr/lib64/python3.6/threading.py", line 1077, in join
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|     self._wait_for_tstate_lock()
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|   File "/usr/lib64/python3.6/threading.py", line 1093, in _wait_for_tstate_lock
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|     elif lock.acquire(block, timeout):
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|   File "/localhome/jenkins/venv/lib64/python3.6/site-packages/avocado/plugins/runner.py", line 77, in sigterm_handler
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR|     raise RuntimeError("Test interrupted by SIGTERM")
2025-01-29 23:06:59,354 stacktrace       L0045 ERROR| RuntimeError: Test interrupted by SIGTERM
2025-01-29 23:06:59,354 stacktrace       L0046 ERROR| 
2025-01-29 23:06:59,354 test             L0772 DEBUG| Local variables:
2025-01-29 23:06:59,355 test             L0775 DEBUG|  -> ranks_to_stop <class 'list'>: [4]
2025-01-29 23:06:59,355 test             L0775 DEBUG|  -> self <class 'online_rebuild_mdtest.EcodOnlineRebuildMdtest'>: 4-./erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest;run-container-hosts-servers-12_server-mdtest-client_processes-dfs_oclass_mux-12_server_ec8p2gx-pool-server_config-engines-0-storage-0-1-setup-bbe1
2025-01-29 23:06:59,355 test             L0357 INFO | ====================================================================================================
2025-01-29 23:06:59,356 test             L0471 INFO | ==> Step 4: tearDown(): Called due to exceeding the 1590s test timeout [elapsed since last step: 1527.74s]
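
In the log above, mdtest reports finishing at 22:44:03, yet the traceback shows the test still blocked in `job.join()` when the 1590s test timeout fired. One common way to surface a hang like this earlier is to bound the join and check whether the thread is still alive. The sketch below is an assumption for illustration only and is not taken from `ec_utils.py`; `join_or_fail` and `timeout_secs` are hypothetical names.

```python
import threading


def join_or_fail(thread, timeout_secs):
    """Join a thread, raising if it is still running after timeout_secs.

    Hypothetical helper, not part of the DAOS test utilities.
    """
    # join() with a timeout returns None whether or not the thread finished,
    # so is_alive() is the only way to tell the two cases apart.
    thread.join(timeout=timeout_secs)
    if thread.is_alive():
        raise TimeoutError(
            f"thread {thread.name} still running after {timeout_secs}s")
```

With a bounded join, the test could fail with a specific message well before the overall test timeout instead of being interrupted by SIGTERM.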

@rpadma2
Copy link
Contributor Author

rpadma2 commented Feb 1, 2025

Thanks @phender. Also, DAOS-16737 is related to pool destroy (not container destroy).
Not sure why the test timeout is not sufficient for 2.6. Let me increase the timeout and restart the testing.

@rpadma2 rpadma2 marked this pull request as draft February 1, 2025 03:10
@rpadma2 rpadma2 removed the clean-cherry-pick Cherry-pick from another branch that did not require additional edits label Feb 1, 2025