-
Notifications
You must be signed in to change notification settings - Fork 306
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16971 cart: add RPC origin address #15820
base: google/2.6
Are you sure you want to change the base?
Conversation
Ticket title is 'bulk cancel causing libfabric progress failure.' |
db7cb67
to
24f7c0c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Few comments inline. Perhaps for the error cases we can include 'origin info' as part of RPC_ERROR and RPC_WARN as well - while there is some overhead it might be ok for error cases and help with debug.
src/object/srv_obj.c
Outdated
uint32_t addr_size = 48; | ||
|
||
crt_rpc_get_origin_addr(rpc, addr, &addr_size); | ||
D_ERROR("rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d from %s\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a case like this it would be better to add a public variant of RPC_ERROR() macro and use it to print additional info, as it will print decoded opcode for example along with rpcid and other fields that are helpful in debug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/object/srv_obj.c
Outdated
D_ERROR("rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d from %s\n", | ||
rpc, opc_get(rpc->cr_opc), ioc->ioc_map_ver, orwo->orw_epoch, status, addr); | ||
} else { | ||
D_DEBUG(DB_IO, "rpc %p opc %d send reply, pmv %d, epoch " DF_X64 ", status %d\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here but a variant of RPC_DEBUG
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/include/cart/api.h
Outdated
@@ -2375,6 +2376,18 @@ int crt_context_quota_limit_get(crt_context_t crt_ctx, crt_quota_type_t quota, i | |||
int | |||
crt_req_get_proto_ver(crt_rpc_t *req); | |||
|
|||
/** | |||
* Get the rpc original address. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/original/origin/ ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/include/cart/api.h
Outdated
* | ||
* \param[in] rpc pointer to RPC request | ||
* \param[out] addr pointer to the converted buffer. | ||
* \param[in/out] addr_size the size of the converted buffer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: would be more clear for me for this to be 'addr_len' instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/object/srv_obj.c
Outdated
uint32_t addr_size = 48; | ||
|
||
crt_rpc_get_origin_addr(cb_info->bci_bulk_desc->bd_rpc, addr, &addr_size); | ||
D_ERROR("bulk transfer failed: %d from %s\n", cb_info->bci_rc, addr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we should then include origin_addr info for any transfer fail inside of crt_bulk_transfer/cb so that it covers other places too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, good suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
src/object/srv_obj.c
Outdated
uint32_t addr_size = 48; | ||
|
||
crt_rpc_get_origin_addr(rpc, addr, &addr_size); | ||
D_ERROR( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, a public variant of RPC_ERROR (e.g. D_RPC_ERROR) would be handy here to add and use
Functional on EL 8.8 Test Results131 tests 127 ✅ 1h 30m 18s ⏱️ Results for commit ee2ba3f. ♻️ This comment has been updated with latest results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am personally not a big fan of spreading calls to crt_rpc_get_origin_addr() everywhere in the source code. We cannot use the TLS or even put it in a field of the cart RPC structure?
yeah, we've been further discussing this and jeff also suggested storing resolved addr on first lookup in rpc_priv |
Functional Hardware Large Test Results64 tests 64 ✅ 29m 32s ⏱️ Results for commit 24f7c0c. |
Functional Hardware Medium Test Results154 tests 127 ✅ 2h 4m 42s ⏱️ Results for commit 24f7c0c. |
Functional Hardware Medium Verbs Provider Test Results55 tests 54 ✅ 4h 36m 37s ⏱️ Results for commit 24f7c0c. |
Add cart RPC origin address to daos engine log. Run-GHA: true Allow-unstable-test: true Signed-off-by: Di Wang <[email protected]>
24f7c0c
to
6d0153b
Compare
yeah, I put it to rpc_private structure, though still get it by this API in D_ERROR or D_DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm; one nitpick inline and you also might want to merge with the latest master as i've modified RPC_* macros few days ago there as well
src/include/cart/api.h
Outdated
@@ -2375,6 +2376,15 @@ int crt_context_quota_limit_get(crt_context_t crt_ctx, crt_quota_type_t quota, i | |||
int | |||
crt_req_get_proto_ver(crt_rpc_t *req); | |||
|
|||
/** | |||
* Get the rpc original address. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still a typo:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, ok
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15820/3/testReport/ |
fix style issue Signed-off-by: Di Wang <[email protected]>
fix wording. Signed-off-by: Di Wang <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15820/4/testReport/ |
fix mem leak Signed-off-by: Di Wang <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15820/5/testReport/ |
Revert some log changes to please CI Run-GHA: true Allow-unstable-test: true Signed-off-by: Di Wang <[email protected]>
Test stage NLT on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15820/6/testReport/ |
@@ -203,7 +205,8 @@ crt_bulk_bind_transfer(struct crt_bulk_desc *bulk_desc, | |||
|
|||
rc = crt_hg_bulk_transfer(bulk_desc, complete_cb, arg, opid, true); | |||
if (rc != 0) | |||
D_ERROR("crt_hg_bulk_transfer() failed, rc: %d.\n", rc); | |||
DL_ERROR(rc, "%p:%s crt_hg_bulk_transfer() failed.\n", bulk_desc->bd_rpc, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to remove "\n" from the message as DL_ERROR will append one on its own
@@ -26,6 +27,34 @@ | |||
#include "crt_self_test.h" | |||
#include "crt_swim.h" | |||
|
|||
static inline char * |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should go into crt_internal_fns.h for declaration and crt_rpc.c for the implementation. I don't think there is much benefit from inlining this function anyway
|
||
rc = HG_Addr_to_string(hg_info->hg_class, addr, (hg_size_t *)&addr_size, hg_info->addr); | ||
if (rc != 0) | ||
return "None"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should either print an error for each case or somehow differentiate None's.
Imagine you see "None" in log, that would be unexpected, but wouldn't tell us which one of those 3 calls failed
* \return the origin address of the RPC | ||
*/ | ||
char * | ||
crt_rpc_get_origin_addr(crt_rpc_t *rpc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The api functions follow format of crt_[module][verb] (see crt_context_quota_limit_get, crt_context_create and others)
Also in public APIs we refer to 'rpc' as 'req' (e.g. crt_req_create(), crt_req_get_proto_ver() etc) in order to differentiate with crt_rpc functions that are internal.
So based on those 2 things, this should be renamed to something like crt_req_origin_addr_get()
Add cart RPC origin address to error handling.
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: