-
Notifications
You must be signed in to change notification settings - Fork 850
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bugfix:Avoid deadlock issues caused by logging in a loop #1265
base: master
Are you sure you want to change the base?
Conversation
friendly ping @AddyLaddy @sjeaugey . |
16c2080
to
7ed5d25
Compare
Signed-off-by: wangfakang <[email protected]>
7ed5d25
to
bfdb006
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for your submission!
I left one technical comment for you, but overall, I'm skeptical whether this is the best approach to solving your problem, although it's difficult to tell for sure since you didn't include the actual changes that motivate this PR (which I'm guessing is adding timestamps to the log lines?).
The problem is that adding a dependency in the debug code on the auto-generated ncclParamEnableLogTime
function has created a circular lock dependency, leading to a deadlock. Your PR avoids the deadlock by moving the updating of the cache ahead of the debug output. It works, but it's fragile.
Instead, I recommend that you add the check for the presence of NCCL_ENABLE_LOG_TIME
to the ncclDebugInit
function, alongside the existing code over there that checks for NCCL_DEBUG
and NCCL_DEBUG_SUBSYS
. That should be more robust and thus more maintainable.
Also, I'm wondering whether NCCL_DEBUG_TIMESTAMPS
wouldn't be a better name for this new config, but that's just bikeshedding...
src/misc/param.cc
Outdated
if (__builtin_expect(__atomic_load_n(cache, __ATOMIC_RELAXED) != uninitialized, true)) { | ||
return; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary. You are basically duplicating what's already done by the caller of this function, which is:
Lines 24 to 26 in ab2b89c
if (__builtin_expect(__atomic_load_n(&cache, __ATOMIC_RELAXED) == uninitialized, false)) { \ | |
ncclLoadParam("NCCL_" env, deftVal, uninitialized, &cache); \ | |
} \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shouldn't be necessary. You are basically duplicating what's already done by the caller of this function, which is:
Lines 24 to 26 in ab2b89c
if (__builtin_expect(__atomic_load_n(&cache, __ATOMIC_RELAXED) == uninitialized, false)) { \ ncclLoadParam("NCCL_" env, deftVal, uninitialized, &cache); \ } \
It seems so. However, as an basically API, wouldn't it be better to have this logic inside the API?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability-wise, I agree with you that it would make sense to hide it inside the function. However, NCCL has these calls sprinkled all over the codebase, including in performance-critical places. Looking at the code history, I can see that the code was deliberately structured this way to minimize the runtime impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maintainability-wise, I agree with you that it would make sense to hide it inside the function. However, NCCL has these calls sprinkled all over the codebase, including in performance-critical places. Looking at the code history, I can see that the code was deliberately structured this way to minimize the runtime impact.
Yes, I see what you mean, and addressed your comment.
Thank you very much for your prompt reply. I added a switch through
Yes, I am in Lines 178 to 183 in ab2b89c
BTW, if we moving the location of
that's a good tips. |
That's what I guessed. Are you planning to submit that functionality in a future PR?
Yes, I see what you mean. It currently works, but it's fragile. I will propose internally a change to make it more robust, i.e., implementing custom querying code in |
Signed-off-by: wangfakang <[email protected]>
Ok, I can submit this feature separately to another PR.
Yeah. IMO, this lock is to prevent only one thread from reading the environment variable when missing cache, which aims to improve performance rather than concurrent safety. In the scenario where there is already a cache, the overhead of the lock itself is questionable in terms of performance improvement. Therefore, I think the lock at this point can actually be removed. |
To prevent deadlock issues caused by logging in a loop, so cache the value before the log operation.
Here are the related function call paths:
nccl/src/misc/param.cc
Lines 62 to 76 in ab2b89c
The following stack is happend local development.