Commit

Merge remote-tracking branch 'origin/master' into dbohning/daos-16845
Test-tag: test_enospace_time_with_fg DfuseSpaceCheck MultipleContainerDelete test_enospace_lazy_with_fg
Test-repeat: 1
Skip-unit-tests: true
Skip-fault-injection-test: true

Signed-off-by: Dalton Bohning <[email protected]>
daltonbohning committed Mar 4, 2025
2 parents 46591f9 + 92ccc7f commit 860b76e
Showing 164 changed files with 6,562 additions and 3,114 deletions.
2 changes: 2 additions & 0 deletions .github/CODEOWNERS
@@ -40,7 +40,9 @@ site_scons/ @daos-stack/dev-build-owners @daos-stack/dev-build-watchers
utils/sl @daos-stack/dev-build-owners @daos-stack/dev-build-watchers

# ftest-watchers: files affecting functional tests
# pydaos/raw is client code only used by ftest
src/tests/ftest @daos-stack/ftest-owners @daos-stack/ftest-watchers
src/client/pydaos/raw @daos-stack/ftest-owners @daos-stack/ftest-watchers

# telem-watchers: Changes related to the telemetry library
src/utils/daos_metrics @daos-stack/telem-watchers
28 changes: 7 additions & 21 deletions .github/pull_request_template.md
@@ -1,23 +1,9 @@
### Before requesting gatekeeper:
### Steps for the author:

* [ ] Two review approvals and any prior change requests have been resolved.
* [ ] Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
* [ ] `Features:` (or `Test-tag*`) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
* [ ] Commit messages follows the guidelines outlined [here](https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments).
* [ ] Any tests skipped by the ticket being addressed have been run and passed in the PR.
* [ ] Commit message follows the [guidelines](https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments).
* [ ] Appropriate [Features or Test-tag](https://daosio.atlassian.net/wiki/spaces/DC/pages/10984259629/Test+Tags) pragmas were used.
* [ ] At least two positive code reviews including at least one code owner from each category referenced in the PR.
* [ ] Testing is complete. If necessary, forced-landing label added and a reason added in a comment.

### Gatekeeper:

* [ ] You are the appropriate gatekeeper to be landing the patch.
* [ ] The PR has 2 reviews by people familiar with the code, including appropriate owners.
* [ ] Githooks were used. If not, request that user install them and check copyright dates.
* [ ] Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
* [ ] All builds have passed. Check non-required builds for any new compiler warnings.
* [ ] Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
* [ ] If applicable, the PR has addressed any potential version compatibility issues.
* [ ] Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket.
* [ ] Extra checks if forced landing is requested
* [ ] Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
* [ ] No new NLT or valgrind warnings. Check the classic view.
* [ ] Quick-build or Quick-functional is not used.
* [ ] Fix the commit message upon landing. Check the standard [here](https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments). Edit it to create a single commit. If necessary, ask submitter for a new summary.
#### After all prior steps are complete:
* [ ] Gatekeeper requested (daos-gatekeeper added as a reviewer).
4 changes: 2 additions & 2 deletions .github/workflows/ossf-scorecard.yml
@@ -38,7 +38,7 @@ jobs:
persist-credentials: false

- name: "Run analysis"
uses: ossf/scorecard-action@62b2cac7ed8198b15735ed49ab1e5cf35480ba46 # v2.4.0
uses: ossf/scorecard-action@f49aabe0b5af0936a0987cfb85d86b75731b0186 # v2.4.1
with:
results_file: results.sarif
results_format: sarif
@@ -71,6 +71,6 @@ jobs:
# Upload the results to GitHub's code scanning dashboard (optional).
# Commenting out will disable upload of results to your repo's Code Scanning dashboard
- name: "Upload to code-scanning"
uses: github/codeql-action/upload-sarif@9e8d0789d4a0fa9ceb6b1738f7e269594bdd67f0 # v3.28.9
uses: github/codeql-action/upload-sarif@b56ba49b26e50535fa1e7f7db0f4f7b4bf65d80d # v3.28.10
with:
sarif_file: results.sarif
2 changes: 1 addition & 1 deletion .github/workflows/trivy.yml
@@ -58,7 +58,7 @@ jobs:
trivy-config: 'utils/trivy/trivy.yaml'

- name: Upload Trivy scan results to GitHub Security tab
uses: github/codeql-action/upload-sarif@9e8d0789d4a0fa9ceb6b1738f7e269594bdd67f0 # v3.28.9
uses: github/codeql-action/upload-sarif@b56ba49b26e50535fa1e7f7db0f4f7b4bf65d80d # v3.28.10
with:
sarif_file: 'trivy-results.sarif'

1 change: 1 addition & 0 deletions site_scons/components/__init__.py
@@ -180,6 +180,7 @@ def define_mercury(reqs):
'-DBUILD_TESTING_UNIT:BOOL=OFF',
'-DMERCURY_USE_BOOST_PP:BOOL=ON',
'-DMERCURY_USE_CHECKSUMS:BOOL=OFF',
'-DMERCURY_ENABLE_COUNTERS:BOOL=ON',
'-DNA_USE_SM:BOOL=ON',
'-DNA_USE_OFI:BOOL=ON',
'-DNA_USE_UCX:BOOL=ON',
39 changes: 27 additions & 12 deletions src/cart/crt_context.c
@@ -1,6 +1,7 @@
/*
* (C) Copyright 2016-2024 Intel Corporation.
* (C) Copyright 2025 Hewlett Packard Enterprise Development LP
* (C) Copyright 2025 Google LLC
*
* SPDX-License-Identifier: BSD-2-Clause-Patent
*/
@@ -274,8 +275,8 @@ crt_context_provider_create(crt_context_t *crt_ctx, crt_provider_t provider, boo

D_RWLOCK_UNLOCK(&crt_gdata.cg_rwlock);

/** initialize sensors */
if (crt_gdata.cg_use_sensors) {
/** initialize sensors for servers */
if (crt_gdata.cg_use_sensors && crt_is_service()) {
int ret;
char *prov;

@@ -285,50 +286,47 @@ crt_context_provider_create(crt_context_t *crt_ctx, crt_provider_t provider, boo
"reqs", "net/%s/req_timeout/ctx_%u",
prov, ctx->cc_idx);
if (ret)
D_WARN("Failed to create timed out req counter: "DF_RC
"\n", DP_RC(ret));
DL_WARN(ret, "Failed to create timed out req counter");

ret = d_tm_add_metric(&ctx->cc_timedout_uri, D_TM_COUNTER,
"Total number of timed out URI lookup "
"requests", "reqs",
"net/%s/uri_lookup_timeout/ctx_%u",
prov, ctx->cc_idx);
if (ret)
D_WARN("Failed to create timed out uri req counter: "
DF_RC"\n", DP_RC(ret));
DL_WARN(ret, "Failed to create timed out uri req counter");

ret = d_tm_add_metric(&ctx->cc_failed_addr, D_TM_COUNTER,
"Total number of failed address "
"resolution attempts", "reqs",
"net/%s/failed_addr/ctx_%u",
prov, ctx->cc_idx);
if (ret)
D_WARN("Failed to create failed addr counter: "DF_RC
"\n", DP_RC(ret));
DL_WARN(ret, "Failed to create failed addr counter");

ret = d_tm_add_metric(&ctx->cc_net_glitches, D_TM_COUNTER,
"Total number of network glitch errors", "errors",
"net/%s/glitch/ctx_%u", prov, ctx->cc_idx);
if (ret)
DL_WARN(rc, "Failed to create network glitch counter");
DL_WARN(ret, "Failed to create network glitch counter");

ret = d_tm_add_metric(&ctx->cc_swim_delay, D_TM_STATS_GAUGE,
"SWIM delay measurements", "delay",
"net/%s/swim_delay/ctx_%u", prov, ctx->cc_idx);
if (ret)
DL_WARN(rc, "Failed to create SWIM delay gauge");
DL_WARN(ret, "Failed to create SWIM delay gauge");

ret = d_tm_add_metric(&ctx->cc_quotas.rpc_waitq_depth, D_TM_GAUGE,
"Current count of enqueued RPCs", "rpcs",
"net/%s/waitq_depth/ctx_%u", prov, ctx->cc_idx);
if (ret)
DL_WARN(rc, "Failed to create rpc waitq gauge");
DL_WARN(ret, "Failed to create rpc waitq gauge");

ret = d_tm_add_metric(&ctx->cc_quotas.rpc_quota_exceeded, D_TM_COUNTER,
"Total number of exceeded RPC quota errors", "errors",
"net/%s/quota_exceeded/ctx_%u", prov, ctx->cc_idx);
if (ret)
DL_WARN(rc, "Failed to create quota exceeded counter");
DL_WARN(ret, "Failed to create quota exceeded counter");
}

if (crt_is_service() && crt_gdata.cg_auto_swim_disable == 0 &&
@@ -1224,6 +1222,9 @@ crt_context_timeout_check(struct crt_context *crt_ctx)
d_list_t timeout_list;
uint64_t ts_now;
bool print_once = false;
#ifdef HG_HAS_DIAG
bool should_republish = false;
#endif

D_ASSERT(crt_ctx != NULL);

@@ -1249,6 +1250,14 @@ crt_context_timeout_check(struct crt_context *crt_ctx)
"already on timeout list\n");
d_list_add_tail(&rpc_priv->crp_tmp_link_timeout, &timeout_list);
}

#ifdef HG_HAS_DIAG
/* piggy-back on the timeout processing so that we don't need to do another gettime() */
if (ts_now - crt_ctx->cc_hg_ctx.chc_diag_pub_ts > CRT_HG_TM_PUB_INTERVAL_US) {
should_republish = true;
crt_ctx->cc_hg_ctx.chc_diag_pub_ts = ts_now;
}
#endif
D_MUTEX_UNLOCK(&crt_ctx->cc_mutex);

/* handle the timeout RPCs */
@@ -1276,6 +1285,12 @@ crt_context_timeout_check(struct crt_context *crt_ctx)
crt_req_timeout_hdlr(rpc_priv);
RPC_DECREF(rpc_priv);
}

#ifdef HG_HAS_DIAG
/* periodically republish Mercury-level counters as DAOS metrics */
if (should_republish)
crt_hg_republish_diags(&crt_ctx->cc_hg_ctx);
#endif
}

/*
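The interval gate added to crt_context_timeout_check() above can be sketched in isolation. This is a minimal illustration, not the DAOS code: `struct pub_state`, `should_republish()` and the `PUB_INTERVAL_US` value are hypothetical stand-ins (the actual value of `CRT_HG_TM_PUB_INTERVAL_US` is not shown in this diff), and the caller is assumed to hold whatever lock protects the timestamp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interval; the real CRT_HG_TM_PUB_INTERVAL_US value is not shown in the diff. */
#define PUB_INTERVAL_US 1000000ULL

/* Hypothetical stand-in for the chc_diag_pub_ts field of the HG context. */
struct pub_state {
	uint64_t last_pub_ts; /* microseconds; 0 means "never published" */
};

/*
 * Reuse a timestamp the caller already has (ts_now) to decide whether the
 * publish interval has elapsed. On success the timestamp is recorded so the
 * next publish happens no sooner than PUB_INTERVAL_US later.
 */
static bool
should_republish(struct pub_state *st, uint64_t ts_now)
{
	assert(st != NULL);

	if (ts_now - st->last_pub_ts > PUB_INTERVAL_US) {
		st->last_pub_ts = ts_now;
		return true;
	}
	return false;
}
```

In the diff, only the decision is taken under `cc_mutex`; the republish call itself runs after the mutex is released, which keeps the counter query out of the critical section.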
111 changes: 111 additions & 0 deletions src/cart/crt_hg.c
@@ -1,5 +1,6 @@
/*
* (C) Copyright 2016-2024 Intel Corporation.
* (C) Copyright 2025 Google LLC
* (C) Copyright 2025 Hewlett Packard Enterprise Development LP
*
* SPDX-License-Identifier: BSD-2-Clause-Patent
@@ -915,6 +916,79 @@ crt_hg_class_init(crt_provider_t provider, int ctx_idx, bool primary, int iface_
return rc;
}

static void
crt_hg_ctx_init_tm(struct crt_hg_context *hg_ctx, int idx)
{
struct crt_hg_metrics *metrics;
char *prov;
int rc = 0;

if (hg_ctx == NULL) {
D_ERROR("hg_ctx is NULL.\n");
return;
}

if (!crt_gdata.cg_use_sensors)
return;

prov = crt_provider_name_get(hg_ctx->chc_provider);
metrics = &hg_ctx->chc_metrics;

rc = d_tm_add_metric(&metrics->chm_bulks, D_TM_COUNTER,
"Mercury-layer count of bulk transfers", "bulks",
"net/%s/hg/bulks/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg bulk counter");

rc = d_tm_add_metric(&metrics->chm_mr_copies, D_TM_COUNTER,
"Mercury-layer count of multi-recv RPC requests requiring a copy",
"rpc", "net/%s/hg/mr_copies/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg multi recv copy counter");

rc = d_tm_add_metric(&metrics->chm_active_rpcs, D_TM_GAUGE,
"Mercury-layer count of active RPCs", "rpcs",
"net/%s/hg/active_rpcs/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg active RPC gauge");

rc = d_tm_add_metric(&metrics->chm_extra_bulk_req, D_TM_COUNTER,
"Mercury-layer count of RPCs with extra bulk request", "rpcs",
"net/%s/hg/extra_bulk_req/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg extra bulk req counter");

rc = d_tm_add_metric(&metrics->chm_extra_bulk_resp, D_TM_COUNTER,
"Mercury-layer count of RPCs with extra bulk response", "rpcs",
"net/%s/hg/extra_bulk_resp/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg extra bulk resp counter");

rc = d_tm_add_metric(&metrics->chm_req_sent, D_TM_COUNTER,
"Mercury-layer count of RPC requests sent", "requests",
"net/%s/hg/req_sent/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg req sent counter");

rc = d_tm_add_metric(&metrics->chm_resp_recv, D_TM_COUNTER,
"Mercury-layer count of RPC responses received", "responses",
"net/%s/hg/resp_recv/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg resp recv counter");

rc = d_tm_add_metric(&metrics->chm_req_recv, D_TM_COUNTER,
"Mercury-layer count of RPC requests received", "requests",
"net/%s/hg/req_recv/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg req recv counter");

rc = d_tm_add_metric(&metrics->chm_resp_sent, D_TM_COUNTER,
"Mercury-layer count of RPC responses sent", "responses",
"net/%s/hg/resp_sent/ctx_%u", prov, idx);
if (rc)
DL_WARN(rc, "Failed to create hg resp sent counter");
}
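The metric paths registered in crt_hg_ctx_init_tm() all follow the pattern net/<provider>/hg/<name>/ctx_<idx>. A small sketch of how such a path expands, assuming a hypothetical helper `format_hg_metric_path()` and a made-up provider string (the strings actually returned by crt_provider_name_get() are not shown in this diff):

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: build a metric path in the same shape as the format
 * strings passed to d_tm_add_metric() above. Returns 0 on success and -1 if
 * the path was truncated or formatting failed.
 */
static int
format_hg_metric_path(char *buf, size_t len, const char *prov, const char *name,
		      unsigned int idx)
{
	int n;

	assert(buf != NULL && prov != NULL && name != NULL);

	n = snprintf(buf, len, "net/%s/hg/%s/ctx_%u", prov, name, idx);
	if (n < 0 || (size_t)n >= len)
		return -1;
	return 0;
}
```

With a provider string of "ofi+tcp" (an assumed example), the bulk counter for context 0 would be named net/ofi+tcp/hg/bulks/ctx_0.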

int
crt_hg_ctx_init(struct crt_hg_context *hg_ctx, crt_provider_t provider, int idx,
bool primary, int iface_idx)
@@ -960,6 +1034,7 @@ crt_hg_ctx_init(struct crt_hg_context *hg_ctx, crt_provider_t provider, int idx,
D_GOTO(out, rc = -DER_HG);
}

hg_ctx->chc_diag_pub_ts = 0;
hg_ctx->chc_hgcla = hg_class;
hg_ctx->chc_shared_hg_class = sep_mode;

@@ -991,6 +1066,8 @@ crt_hg_ctx_init(struct crt_hg_context *hg_ctx, crt_provider_t provider, int idx,
if (rc != 0)
D_ERROR("crt_hg_pool_init() failed, context idx %d hg_ctx %p, "
"rc: " DF_RC "\n", idx, hg_ctx, DP_RC(rc));

crt_hg_ctx_init_tm(hg_ctx, idx);
out:
return rc;
}
@@ -1033,6 +1110,40 @@ crt_hg_ctx_fini(struct crt_hg_context *hg_ctx)
return rc;
}

void
crt_hg_republish_diags(struct crt_hg_context *hg_ctx)
{
struct hg_diag_counters diags = {0};
struct crt_hg_metrics *metrics;
int rc = 0;

#ifndef HG_HAS_DIAG
return;
#endif

if (hg_ctx == NULL) {
D_ERROR("hg_ctx is NULL.\n");
return;
}

rc = HG_Class_get_counters(hg_ctx->chc_hgcla, &diags);
if (rc != HG_SUCCESS) {
D_ERROR("HG_Class_get_counters failed, rc: %d.\n", rc);
return;
}

metrics = &hg_ctx->chc_metrics;
d_tm_set_counter(metrics->chm_bulks, diags.bulk_count);
d_tm_set_counter(metrics->chm_mr_copies, diags.rpc_multi_recv_copy_count);
d_tm_set_gauge(metrics->chm_active_rpcs, diags.rpc_req_recv_active_count);
d_tm_set_counter(metrics->chm_extra_bulk_resp, diags.rpc_resp_extra_count);
d_tm_set_counter(metrics->chm_extra_bulk_req, diags.rpc_req_extra_count);
d_tm_set_counter(metrics->chm_resp_recv, diags.rpc_resp_recv_count);
d_tm_set_counter(metrics->chm_resp_sent, diags.rpc_resp_sent_count);
d_tm_set_counter(metrics->chm_req_recv, diags.rpc_req_recv_count);
d_tm_set_counter(metrics->chm_req_sent, diags.rpc_req_sent_count);
}

int
crt_rpc_handler_common(hg_handle_t hg_hdl)
{
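crt_hg_republish_diags() above copies each Mercury counter into a DAOS metric with a set operation rather than an increment. A minimal sketch of why that matters, assuming the lower layer reports running totals (which the use of d_tm_set_counter() suggests but this diff does not state). All type and function names below are hypothetical and do not correspond to the Mercury or DAOS telemetry APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of cumulative counters read from a lower layer. */
struct diag_snapshot {
	uint64_t req_sent;    /* running total */
	uint64_t resp_recv;   /* running total */
	uint64_t active_rpcs; /* instantaneous value */
};

/* Hypothetical published metrics mirroring the snapshot. */
struct published_metrics {
	uint64_t req_sent;    /* counter */
	uint64_t resp_recv;   /* counter */
	uint64_t active_rpcs; /* gauge */
};

/*
 * Overwrite the published values with the snapshot. Because the snapshot
 * already holds totals, adding it to the previous value would double count;
 * overwriting makes repeated republishing idempotent.
 */
static void
republish(struct published_metrics *m, const struct diag_snapshot *snap)
{
	assert(m != NULL && snap != NULL);

	m->req_sent    = snap->req_sent;
	m->resp_recv   = snap->resp_recv;
	m->active_rpcs = snap->active_rpcs;
}
```

Publishing the same snapshot twice leaves the metrics unchanged, so the publish interval only affects how fresh the values are, not their correctness.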