DAOS-16834 test: Support testing MD on SSD Phase 2 #15767

Open

phender wants to merge 8 commits into master

Conversation

@phender (Contributor) commented Jan 22, 2025

Add support for the dmg pool create --mem-ratio argument and enable testing MD on SSD Phase 2 in a few functional tests. Tests run with the launch.py --nvme=auto_md_on_ssd argument are executed with an extra yaml file that muxes two /run/launch/nvme branches, which tests can then use to repeat testing with pools created with and without the --mem-ratio dmg pool create argument.
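
For context, the extra yaml file adds two branches under /run/launch/nvme, and a test yaml opts into one of them with Avocado's !filter-only tag. A minimal sketch of such a filter entry, following the pattern visible in the review snippets later in this conversation (the branch names default and md_on_ssd_p2 come from the extra yaml; everything else about a test's yaml is test-specific):

    # Restrict this variant to the default nvme branch; a parallel variant
    # would filter on /run/launch/nvme/md_on_ssd_p2 and create its pool
    # with --mem-ratio.
    !filter-only : /run/launch/nvme/default   # yamllint disable-line rule:colons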

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: MdtestSmall IorSmall
Skip-func-hw-test-medium-md-on-ssd: false

Before requesting gatekeeper:

  • Two review approvals have been obtained and any prior change requests have been resolved.
  • Testing is complete and all tests passed, or there is a documented reason in the PR why it should be force landed and the forced-landing tag is set.
  • The Features: (or Test-tag*) commit pragma was used, or there is a documented reason that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that the user install them, and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is the master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested:
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Ticket title is 'Support running existing tests in MD on SSD stages with two pool variants'
Status is 'In Progress'
Labels: 'md_on_ssd2'
https://daosio.atlassian.net/browse/DAOS-16834

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15767/2/execution/node/920/log

@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/3/testReport/

phender marked this pull request as ready for review January 23, 2025 22:07
phender requested review from a team as code owners January 23, 2025 22:07
phender requested a review from daltonbohning January 23, 2025 22:07
@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/4/testReport/

phender requested a review from tanabarr January 29, 2025 23:20
@daosbuild1 (Collaborator)

Test stage Functional Hardware Medium MD on SSD completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15767/6/testReport/

@tanabarr (Contributor) previously approved these changes Jan 30, 2025

From what I understand (with the caveat that my knowledge of the test framework specifics is limited), LGTM.

size: 90%
!filter-only : /run/launch/nvme/default # yamllint disable-line rule:colons
md_on_ssd_p2:
  size: 90%
Contributor

Are you using 90% because 100% fails?

Contributor

I think 100% historically used to fail, so tests never used it. But I think that issue is fixed now?

@phender (Contributor Author)

100% works:

2025-02-28 08:42:09,159 process          L0604 INFO | Running '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j pool create TestPool_1 --group=jenkins --size=100% --user=jenkins'
2025-02-28 08:42:09,286 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.286357 main.go:228: debug output enabled
2025-02-28 08:42:09,286 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.286824 main.go:260: control config loaded from /var/tmp/daos_testing/configs/daos_control.yml
2025-02-28 08:42:09,289 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.289599 system.go:333: DAOS system query request: *mgmt.SystemQueryReq (sys:"daos_server-2.7.101"  state_mask:65535)
2025-02-28 08:42:09,290 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.290362 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:09,297 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.296991 response.go:179: wolf-227:10001: *mgmt.SystemQueryResp@12 joined:0-3 
2025-02-28 08:42:09,298 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:09.298610 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.562957 pool.go:1122: Added SMD device a485bfc0-d042-46f5-aca7-177e6721f039 (rank 0, ctrlr 0000:65:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563035 pool.go:1122: Added SMD device a60f8568-60ce-4d94-b1a4-4ebcb87ec1f2 (rank 1, ctrlr 0000:a5:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563072 pool.go:1122: Added SMD device e778618e-9991-447b-bc2f-3834d90135eb (rank 2, ctrlr 0000:65:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563106 pool.go:1122: Added SMD device 541da8fc-4777-4fea-a160-04c44a037992 (rank 3, ctrlr 0000:a5:00.0) as usable: device state="NORMAL", smd-size=3831110828032 ctrlr-total-free=3831110828032
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563160 pool.go:1198: Maximal size of a pool: scmBytes=109 GB (109363331072 B) nvmeBytes=3.8 TB (3831110828032 B)
2025-02-28 08:42:12,563 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.563294 pool.go:291: auto-percentage-size pool create mode: &{poolRequest:{msRequest:{} unaryRequest:{request:{timeout:0 deadline:{wall:0 ext:0 loc:<nil>} Sys: HostList:[]} rpc:<nil>} retryableRequest:{retryTimeout:0 retryInterval:0 retryMaxTries:0 retryTestFn:<nil> retryFn:<nil>}} UUID:00000000-0000-0000-0000-000000000000 User:jenkins@ UserGroup:jenkins@ ACL:nil NumSvcReps:0 Properties:[label:TestPool_1] TotalBytes:0 TierRatio:[] NumRanks:0 Ranks:[] TierBytes:[109363331072 3831110828032] MemRatio:0}
2025-02-28 08:42:12,564 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.564735 pool.go:343: Create DAOS pool request: *mgmt.PoolCreateReq uuid:2ebefbb5-82ed-4e86-8d03-ae62b3e7bf25 u:jenkins@ g:jenkins@ p:[number:1  strval:"TestPool_1"] ranks: tiers:0: 109 GB (109363331072)1: 3.8 TB (3831110828032)mem-ratio: 0.00 
2025-02-28 08:42:12,565 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:12.565240 rpc.go:278: request hosts: [wolf-227:10001 wolf-228:10001]
2025-02-28 08:42:27,543 process          L0416 DEBUG| [stderr] DEBUG 2025/02/28 08:42:27.543332 response.go:179: wolf-227:10001: *mgmt.PoolCreateResp svc_ldr:3 svc_ranks:0-3 tgt_ranks:0-3 tiers:0: 109 GB (109363331072)1: 3.8 TB (3831110828032)meta-file-size: 109 GB (109363331072)
2025-02-28 08:42:27,543 process          L0416 DEBUG| [stdout] {
2025-02-28 08:42:27,543 process          L0416 DEBUG| [stdout]   "response": {
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     "uuid": "2ebefbb5-82ed-4e86-8d03-ae62b3e7bf25",
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     "svc_ldr": 3,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     "svc_reps": [
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       0,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       1,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       2,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       3
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     ],
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     "tgt_ranks": [
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       0,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       1,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       2,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       3
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     ],
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]     "tier_bytes": [
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       109363331072,
2025-02-28 08:42:27,544 process          L0416 DEBUG| [stdout]       3831110828032
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]     ],
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]     "mem_file_bytes": 109363331072,
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]     "md_on_ssd_active": false
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]   },
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]   "error": null,
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout]   "status": 0
2025-02-28 08:42:27,545 process          L0416 DEBUG| [stdout] }
2025-02-28 08:42:28,546 process          L0686 INFO | Command '/usr/bin/dmg -o /var/tmp/daos_testing/configs/daos_control.yml -d -j pool create TestPool_1 --group=jenkins --size=100% --user=jenkins' finished with 0 after 19.384146213531494s
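
For the md_on_ssd_p2 variant, the same command would additionally carry the new flag, along these lines (this command is illustrative, and the 50% mem-ratio value is an arbitrary example, not taken from the test run above):

    dmg pool create TestPool_1 --group=jenkins --size=100% --mem-ratio=50% --user=jenkins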

@phender (Contributor Author) commented Mar 3, 2025

Comment on lines +1155 to +1164
yaml_file = os.path.join(yaml_dir, "extra_yaml_launch_params.yaml")
lines = ['launch:']
lines.append('  nvme: !mux')
labels = ['default']
if self._nvme.startswith("auto_md_on_ssd"):
    labels.append('md_on_ssd_p2')
for label in labels:
    lines.append(f'    {label}:')
    lines.append('      on: true')
write_yaml_file(logger, yaml_file, lines)
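
For reference, the excerpt above (indentation reconstructed) writes an extra yaml file along these lines when --nvme=auto_md_on_ssd is used; without that argument, only the default branch is emitted:

    launch:
      nvme: !mux
        default:
          on: true
        md_on_ssd_p2:
          on: true

This is what creates the two /run/launch/nvme branches referenced throughout this PR.
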
@daltonbohning (Contributor) commented Mar 3, 2025

Doesn't this mean that when we use "auto_md_on_ssd", EVERY test has its variants doubled? And then specifically ior and mdtest small run a further additional variant?
For this to work as intended, I think we should not use nvme: !mux.

Contributor

> then specifically ior and mdtest small run a further additional variant

Correction, they still have their variants doubled. But the mux in their configs filters out variants, so in total it's still doubled.

But this means other tests will have their variants doubled by the extra yaml, and they will not filter them out.

Contributor

Running a test other than ior/mdtest small in the md on ssd stage would highlight this.

@phender (Contributor Author) commented Mar 5, 2025

Correct. We should only add the extra_yaml_launch_params.yaml (with the mux) if we detect it being used as a filter, e.g. by searching the test yaml file for !filter-only : /run/launch/nvme; see the sketch below. I'll also add running an additional test on the next push.
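
A minimal sketch of that detection as a standalone helper; the function name and regular expression here are illustrative, not the actual patch:

    import os
    import re


    def test_yaml_filters_launch_nvme(test_yaml_path):
        """Hypothetical check: does this test yaml filter on /run/launch/nvme?

        Only tests whose yaml contains a '!filter-only : /run/launch/nvme...'
        entry would receive the extra_yaml_launch_params.yaml mux, so other
        tests do not have their variants doubled.
        """
        if not os.path.isfile(test_yaml_path):
            return False
        with open(test_yaml_path, "r", encoding="utf-8") as yaml_file:
            contents = yaml_file.read()
        return re.search(r"!filter-only\s*:\s*/run/launch/nvme", contents) is not None

launch.py would then write the extra yaml file only for tests where such a check returns True.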
