DAOS-16464 test: improve online_rebuild_mdtest.py (#15108) #15807
base: release/2.6
Ticket title is 'erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest - time out waiting for mdtest after server stop'
Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true

- Run with a stonewall and stop ranks after half of the stonewall time so the timing is more reliable than arbitrarily sleeping for 30 seconds.
- Catch exceptions raised in the mdtest thread.
- Reduce logging.
- Misc refactoring improvements

Signed-off-by: Dalton Bohning <[email protected]>
Signed-off-by: Padmanabhan <[email protected]>
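For readers unfamiliar with the pattern, the first two bullets of the commit message can be sketched roughly as follows. This is not the code from this PR: `CapturingThread`, `run_with_midpoint_stop`, `run_job`, `stop_ranks`, and the `join_timeout` parameter are hypothetical names chosen for illustration. `run_job` stands in for whatever launches mdtest with a stonewall, and `stop_ranks` for whatever issues the `dmg system stop` call.

```python
import threading
import time


class CapturingThread(threading.Thread):
    """Thread that stores the return value or exception of its target.

    Hypothetical helper for illustration; not part of the DAOS test framework.
    """

    def __init__(self, func, *args, **kwargs):
        super().__init__(daemon=True)
        self._job_func = func
        self._job_args = args
        self._job_kwargs = kwargs
        self.result = None
        self.error = None

    def run(self):
        try:
            self.result = self._job_func(*self._job_args, **self._job_kwargs)
        except Exception as exc:  # keep the failure so the caller can re-raise it
            self.error = exc


def run_with_midpoint_stop(run_job, stop_ranks, stonewall_secs, join_timeout):
    """Run a stonewalled job and stop ranks halfway through the stonewall.

    run_job and stop_ranks are caller-supplied callables (assumptions).
    join_timeout is an extra safeguard added in this sketch only.
    """
    job = CapturingThread(run_job, stonewall_secs)
    job.start()

    # Tie the stop to the stonewall instead of a fixed sleep.
    time.sleep(stonewall_secs / 2)
    stop_ranks()

    job.join(join_timeout)
    if job.is_alive():
        raise TimeoutError("job still running after {}s".format(join_timeout))
    if job.error is not None:
        # Surface the worker's exception in the calling thread.
        raise RuntimeError("job thread failed") from job.error
    return job.result
```

Storing the exception on the thread object and re-raising it after `join()` is one common way to make a failure in a worker thread visible to the test; without it, the exception would only appear in the thread's own stderr output.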
c8a67ab to 843e252
Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15807/2/execution/node/926/log
The single failure is a known issue: DAOS-16737.
The change doesn't seem to fix the issue, since https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15807/2/artifact/Functional%20Hardware%20Large/erasurecode/online_rebuild_mdtest.py/repeat001/job.log encountered the same failure reported in the ticket:
2025-01-29 22:41:59,187 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d system stop --force --ranks=4'
2025-01-29 22:41:59,274 process L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.274166 main.go:228: debug output enabled
2025-01-29 22:41:59,274 process L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.274703 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2025-01-29 22:41:59,280 process L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.280020 system.go:465: DAOS system stop request: *mgmt.SystemStopReq (sys:"daos_server-2.6.3" force:true ranks:"4")
2025-01-29 22:41:59,281 process L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.281445 rpc.go:278: request hosts: [wolf-110:10001 wolf-111:10001 wolf-112:10001 wolf-114:10001 wolf-115:10001]
2025-01-29 22:41:59,939 process L0416 DEBUG| [stderr] DEBUG 2025/01/29 22:41:59.939903 response.go:179: wolf-110:10001: *mgmt.SystemStopResp stopped:4
2025-01-29 22:41:59,940 process L0416 DEBUG| [stdout] Rank Operation Result
2025-01-29 22:41:59,940 process L0416 DEBUG| [stdout] ---- --------- ------
2025-01-29 22:41:59,940 process L0416 DEBUG| [stdout] 4 stop OK
2025-01-29 22:41:59,941 process L0416 DEBUG| [stdout]
2025-01-29 22:42:00,942 process L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d system stop --force --ranks=4' finished with 0 after 1.6973729133605957s
2025-01-29 22:42:00,972 command_utils L1348 INFO | Updating the expected state for rank 4 on wolf-115: joined -> ['stopped', 'excluded']
2025-01-29 22:43:59,479 process L0416 DEBUG| [stdout] Continue stonewall hit min: 16623 max: 17205 avg: 16939.5
2025-01-29 22:43:59,479 process L0416 DEBUG| [stdout]
2025-01-29 22:44:03,540 process L0416 DEBUG| [stdout]
2025-01-29 22:44:03,540 process L0416 DEBUG| [stdout] SUMMARY rate: (of 1 iterations)
2025-01-29 22:44:03,540 process L0416 DEBUG| [stdout] Operation Max Min Mean Std Dev
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] ---------
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] --- --- ---- -------
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] File creation 505.818 505.818 505.818 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] File stat 0.000 0.000 0.000 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] File read 0.000 0.000 0.000 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] File removal 0.000 0.000 0.000 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] Tree creation 25.716 25.716 25.716 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] Tree removal 0.000 0.000 0.000 0.000
2025-01-29 22:44:03,541 process L0416 DEBUG| [stdout] -- finished at 01/29/2025 22:44:03 --
2025-01-29 22:44:03,542 process L0416 DEBUG| [stdout]
2025-01-29 23:06:59,336 stacktrace L0039 ERROR|
2025-01-29 23:06:59,336 stacktrace L0042 ERROR| Reproduced traceback from: /localhome/jenkins/venv/lib64/python3.6/site-packages/avocado/core/test.py:767
2025-01-29 23:06:59,353 stacktrace L0045 ERROR| Traceback (most recent call last):
2025-01-29 23:06:59,353 stacktrace L0045 ERROR| File "/usr/lib/daos/TESTING/ftest/erasurecode/online_rebuild_mdtest.py", line 36, in test_ec_online_rebuild_mdtest
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| self.start_online_mdtest(ranks_to_stop)
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| File "/usr/lib/daos/TESTING/ftest/util/ec_utils.py", line 460, in start_online_mdtest
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| job.join()
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| File "/usr/lib64/python3.6/threading.py", line 1077, in join
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| self._wait_for_tstate_lock()
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| File "/usr/lib64/python3.6/threading.py", line 1093, in _wait_for_tstate_lock
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| elif lock.acquire(block, timeout):
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| File "/localhome/jenkins/venv/lib64/python3.6/site-packages/avocado/plugins/runner.py", line 77, in sigterm_handler
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| raise RuntimeError("Test interrupted by SIGTERM")
2025-01-29 23:06:59,354 stacktrace L0045 ERROR| RuntimeError: Test interrupted by SIGTERM
2025-01-29 23:06:59,354 stacktrace L0046 ERROR|
2025-01-29 23:06:59,354 test L0772 DEBUG| Local variables:
2025-01-29 23:06:59,355 test L0775 DEBUG| -> ranks_to_stop <class 'list'>: [4]
2025-01-29 23:06:59,355 test L0775 DEBUG| -> self <class 'online_rebuild_mdtest.EcodOnlineRebuildMdtest'>: 4-./erasurecode/online_rebuild_mdtest.py:EcodOnlineRebuildMdtest.test_ec_online_rebuild_mdtest;run-container-hosts-servers-12_server-mdtest-client_processes-dfs_oclass_mux-12_server_ec8p2gx-pool-server_config-engines-0-storage-0-1-setup-bbe1
2025-01-29 23:06:59,355 test L0357 INFO | ====================================================================================================
2025-01-29 23:06:59,356 test L0471 INFO | ==> Step 4: tearDown(): Called due to exceeding the 1590s test timeout [elapsed since last step: 1527.74s]
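As a quick check on the timing in the log above: mdtest prints its "finished at" summary at 22:44:03, yet the next entry is the SIGTERM traceback at 23:06:59, raised while the test was still blocked in `job.join()`. A small sketch that computes that gap from the two avocado timestamps (the helper name `log_gap_seconds` is made up for illustration):

```python
from datetime import datetime

# Timestamp layout used at the start of each avocado job.log line,
# e.g. "2025-01-29 22:44:03,542" (comma before the milliseconds).
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def log_gap_seconds(first, second):
    """Return the number of seconds between two job.log timestamps."""
    start = datetime.strptime(first, LOG_TIME_FORMAT)
    end = datetime.strptime(second, LOG_TIME_FORMAT)
    return (end - start).total_seconds()


# Last mdtest stdout line vs. the first stacktrace line in the log above.
gap = log_gap_seconds("2025-01-29 22:44:03,542", "2025-01-29 23:06:59,336")
print("{:.3f}s (~{:.0f} min)".format(gap, gap / 60))
```

The gap comes to roughly 23 minutes during which the log shows no further output between the mdtest summary and the test timeout.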
Thanks @phender. Also, DAOS-16737 is related to pool destroy (not container destroy).
Test-tag: EcodOnlineRebuildMdtest
Test-repeat: 3
Skip-unit-tests: true
Skip-fault-injection-test: true
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: