DAOS-16834 test: Support testing MD on SSD Phase 2 #15767
base: master
Conversation
Add support for the dmg pool create --mem-ratio argument and enable testing MD on SSD Phase 2 in a few functional tests. Tests run with the launch.py --nvme=auto_md_on_ssd argument will be executed with an extra yaml file enabling the muxing of two /run/launch/nvme branches, which tests can then use to repeat testing with pools created with and without the --mem-ratio dmg pool create argument.
Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: MdtestSmall IorSmall
Skip-func-hw-test-medium-md-on-ssd: false
Signed-off-by: Phil Henderson <[email protected]>
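As an illustration of the two pool variants this enables, below is a minimal sketch (not code from this PR) of the dmg pool create commands the two /run/launch/nvme branches would drive; the pool label, size, and mem-ratio values are placeholders, not values taken from this change.

# Minimal sketch: the command each launch branch would drive.
# 'default' creates the pool without --mem-ratio; 'md_on_ssd_p2' adds it.
# The label, size, and mem-ratio values below are illustrative placeholders.
def pool_create_command(label, size, mem_ratio=None):
    """Build a dmg pool create command line for one launch variant."""
    command = ["dmg", "pool", "create", label, f"--size={size}"]
    if mem_ratio is not None:
        command.append(f"--mem-ratio={mem_ratio}")
    return " ".join(command)

print(pool_create_command("TestPool_1", "90%"))                   # default branch
print(pool_create_command("TestPool_1", "90%", mem_ratio="50%"))  # md_on_ssd_p2 branch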
Ticket title is 'Support running existing tests in MD on SSD stages with two pool variants'
Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15767/2/execution/node/920/log
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/3/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/4/testReport/
Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/6/testReport/
From what I understand, and there are some limitations to my understanding with regard to the test framework specifics, LGTM
src/tests/ftest/ior/small.yaml
size: 90%
!filter-only : /run/launch/nvme/default  # yamllint disable-line rule:colons
md_on_ssd_p2:
  size: 90%
are you using 90% because 100% fails?
I think historically 100% used to fail so tests never used it. But I think that issue is fixed now?
100% works:
2025-02-28 08:42:09,159 process L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j pool create TestPool_1 --group=jenkins --size=100% --user=jenkins'
2025-02-28 08:42:09,286 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.286357 main.go:228: debug output enabled
2025-02-28 08:42:09,286 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.286824 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2025-02-28 08:42:09,289 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.289599 system.go:333: DAOS system query request: *mgmt.SystemQueryReq (sys:"daos_server-2.7.101" state_mask:65535)
2025-02-28 08:42:09,290 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.290362 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:09,297 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.296991 response.go:179: wolf-227:10001: *mgmt.SystemQueryResp@12 joined:0-3
2025-02-28 08:42:09,298 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.298610 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.562957 pool.go:1122: Added SMD device a485bfc0-d042-46f5-aca7-177e6721f039 (rank 0, ctrlr 0000:65:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563035 pool.go:1122: Added SMD device a60f8568-60ce-4d94-b1a4-4ebcb87ec1f2 (rank 1, ctrlr 0000:a5:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563072 pool.go:1122: Added SMD device e778618e-9991-447b-bc2f-3834d90135eb (rank 2, ctrlr 0000:65:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563106 pool.go:1122: Added SMD device 541da8fc-4777-4fea-a160-04c44a037992 (rank 3, ctrlr 0000:a5:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563160 pool.go:1198: Maximal size of a pool: scmBytes=109 GB (109363331072 B) nvmeBytes=3.8 TB (3831110828032 B)
2025-02-28 08:42:12,563 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563294 pool.go:291: auto-percentage-size pool create mode: &{poolRequest:{msRequest:{} unaryRequest:{request:{timeout:0 deadline:{wall:0 ext:0 loc:<nil>} Sys: HostList:[]} rpc:<nil>} retryableRequest:{retryTimeout:0 retryInterval:0 retryMaxTries:0 retryTestFn:<nil> retryFn:<nil>}} UUID:00000000-0000-0000-0000-000000000000 User:jenkins@ UserGroup:jenkins@ ACL:nil NumSvcReps:0 Properties:[label:TestPool_1] TotalBytes:0 TierRatio:[] NumRanks:0 Ranks:[] TierBytes:[109363331072 3831110828032] MemRatio:0}
2025-02-28 08:42:12,564 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.564735 pool.go:343: Create DAOS pool request: *mgmt.PoolCreateReq uuid:2ebefbb5-82ed-4e86-8d03-ae62b3e7bf25 u:jenkins@ g:jenkins@ p:[number:1 strval:"TestPool_1"] ranks: tiers:0: 109 GB (109363331072)1: 3.8 TB (3831110828032)mem-ratio: 0.00
2025-02-28 08:42:12,565 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.565240 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:27,543 process L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:27.543332 response.go:179: wolf-227:10001: *mgmt.PoolCreateResp svc_ldr:3 svc_ranks:0-3 tgt_ranks:0-3 tiers:0: 109 GB (109363331072)1: 3.8 TB (3831110828032)meta-file-size: 109 GB (109363331072)
2025-02-28 08:42:27,543 process L0416 DEBUG| [stdout] {
2025-02-28 08:42:27,543 process L0416 DEBUG| [stdout] "response": {
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] "uuid": "2ebefbb5-82ed-4e86-8d03-ae62b3e7bf25",
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] "svc_ldr": 3,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] "svc_reps": [
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 0,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 1,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 2,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 3
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] ],
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] "tgt_ranks": [
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 0,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 1,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 2,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 3
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] ],
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] "tier_bytes": [
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 109363331072,
2025-02-28 08:42:27,544 process L0416 DEBUG| [stdout] 3831110828032
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] ],
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] "mem_file_bytes": 109363331072,
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] "md_on_ssd_active": false
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] },
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] "error": null,
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] "status": 0
2025-02-28 08:42:27,545 process L0416 DEBUG| [stdout] }
2025-02-28 08:42:28,546 process L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j pool create TestPool_1 --group=jenkins --size=100% --user=jenkins' finished with 0 after 19.384146213531494s
I did not encounter https://daosio.atlassian.net/browse/DAOS-17145 in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15767/8 where all tests passed.
yaml_file = os.path.join(yaml_dir, "extra_yaml_launch_params.yaml")
lines = ['launch:']
lines.append('  nvme: !mux')
labels = ['default']
if self._nvme.startswith("auto_md_on_ssd"):
    labels.append('md_on_ssd_p2')
for label in labels:
    lines.append(f'    {label}:')
    lines.append('      on: true')
write_yaml_file(logger, yaml_file, lines)
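For reference, a standalone sketch of what this generator would write when --nvme=auto_md_on_ssd is used; the indentation is assumed from the reconstructed snippet above, not copied from the PR.

# Standalone sketch of the extra yaml produced for --nvme=auto_md_on_ssd.
# The exact indentation is assumed from the snippet above.
def build_extra_yaml(nvme):
    """Return the extra yaml content as a single string."""
    lines = ['launch:', '  nvme: !mux']
    labels = ['default']
    if nvme.startswith("auto_md_on_ssd"):
        labels.append('md_on_ssd_p2')
    for label in labels:
        lines.append(f'    {label}:')
        lines.append('      on: true')
    return "\n".join(lines)

print(build_extra_yaml("auto_md_on_ssd"))
# launch:
#   nvme: !mux
#     default:
#       on: true
#     md_on_ssd_p2:
#       on: true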
Doesn't this mean that when we use "auto_md_on_ssd", EVERY test has its variants doubled? And then specifically ior and mdtest small run a further additional variant?
For this to work as intended I think we should not use nvme: !mux
Re "then specifically ior and mdtest small run a further additional variant": correction, they still have their variants doubled, but the mux in their configs filters out variants, so in total it is still doubled.
But this means other tests will have their variants doubled by the extra yaml, and they will not filter them out.
Running a test other than ior/mdtest small in the md on ssd stage would highlight this
Correct. We should only add the extra_yaml_launch_params.yaml (with the mux) if we detect it being used as a filter, e.g. search the test yaml file for !filter-only : /run/launch/nvme. I'll also add running an additional test on the next push.
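A minimal sketch of the detection being proposed here; this is not the PR's implementation, and the helper name is hypothetical.

import os

# Hypothetical sketch of the proposed check: only generate the extra mux yaml
# when the test yaml actually filters on the /run/launch/nvme branches.
FILTER_TOKEN = "!filter-only : /run/launch/nvme"

def test_yaml_uses_nvme_filter(test_yaml_path):
    """Return True if the test yaml references the /run/launch/nvme filter."""
    if not os.path.isfile(test_yaml_path):
        return False
    with open(test_yaml_path, "r", encoding="utf-8") as handle:
        return FILTER_TOKEN in handle.read()

# Usage sketch: only tests that opt in via the filter get the doubled variants.
# if test_yaml_uses_nvme_filter("src/tests/ftest/ior/small.yaml"):
#     write_extra_yaml_launch_params(...)  # hypothetical helper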
Before requesting gatekeeper:
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Gatekeeper: