-
Notifications
You must be signed in to change notification settings - Fork 595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
overlord/state, tests: measure state lock #14874
overlord/state, tests: measure state lock #14874
Conversation
0d1c64f
to
f46e537
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
question
overlord/state/state.go
Outdated
@@ -139,6 +142,7 @@ func (s *State) Modified() bool { | |||
|
|||
// Lock acquires the state lock. | |||
func (s *State) Lock() { | |||
s.lockStart = osutil.GetLockStart() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be done with the lock taken? otherwise just the tentative themselves to lock will race/dirty lockStart
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, I'll update this
These are the times after the fix proposed by @pedronis I see it locks about 820ms when the snapd daemon is started (this happens in all the tests) |
overlord/state/state.go
Outdated
@@ -159,6 +163,7 @@ func (s *State) writing() { | |||
func (s *State) unlock() { | |||
atomic.AddInt32(&s.muC, -1) | |||
s.mu.Unlock() | |||
osutil.MaybeSaveLockTime(s.lockStart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this only needs to happen with the lock taken I think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.
FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:
lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
udpated
@pedronis This is the result of the main suite 145 tests (using a machine with 2 cores and threshold of 800ms), I see some worst times doing: should we specially test those cases? |
osutil/statelock.go
Outdated
// MaybeSaveLockTime allows to save lock times when this overpass the threshold | ||
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings. | ||
func MaybeSaveLockTime(lockStart int64) { | ||
lockEnd := time.Now().UnixNano() / int64(time.Millisecond) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can prob be done after both getenvs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update
overlord/state/state.go
Outdated
@@ -159,6 +163,7 @@ func (s *State) writing() { | |||
func (s *State) unlock() { | |||
atomic.AddInt32(&s.muC, -1) | |||
s.mu.Unlock() | |||
osutil.MaybeSaveLockTime(s.lockStart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.
FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:
lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)
osutil/statelock.go
Outdated
f := runtime.FuncForPC(pc[i]) | ||
file, line := f.FileLine(pc[i]) | ||
formattedLine = fmt.Sprintf("%s:%d %s\n", file, line, f.Name()) | ||
if _, err = lockfile.WriteString(formattedLine); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could maybe use runtime.Profile
to achieve the same thing but do it in more standard way.
osutil/statelock.go
Outdated
|
||
// MaybeSaveLockTime allows to save lock times when this overpass the threshold | ||
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings. | ||
func MaybeSaveLockTime(lockStart int64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func MaybeSaveLockTime(lockStart int64) { | |
func MaybeSaveLockTime(lockWait, lockStart, lockEnd int64) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
overlord/state/state.go
Outdated
@@ -159,6 +163,7 @@ func (s *State) writing() { | |||
func (s *State) unlock() { | |||
atomic.AddInt32(&s.muC, -1) | |||
s.mu.Unlock() | |||
osutil.MaybeSaveLockTime(s.lockStart) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #14874 +/- ##
=========================================
Coverage ? 78.26%
=========================================
Files ? 1154
Lines ? 153335
Branches ? 0
=========================================
Hits ? 120015
Misses ? 25920
Partials ? 7400
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
@bboozzoo @pedronis these are the results for the smoke suite after the latest changes Those were executed in a vm with 2 cpus |
osutil/statelock.go
Outdated
"time" | ||
) | ||
|
||
func traceCallers(description string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func traceCallers(description string) { | |
func traceCallers(description string) error { |
and then you can update MaybeSave to do:
if heldMs > threshold || waitMs > threshold {
formattedLine := fmt.Sprintf("lock: held %d ms wait %d ms", heldMs, waitMs)
if err := traceCallers(formattedLine); err != {
logger.Infof("cannot trace callers: %v", err)
}
}
and change fmt.Fprintf()
to use return fmt.Errorf("cannot this or that: %w", err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried that initially bet realized that we cannot use logger in utils, is there any other way to use it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented using fmt.Fprintf() because I saw faultinject does that too.
f4fa741
to
431e59d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, some questions, some comments
osutil/statelock.go
Outdated
}() | ||
|
||
if err != nil { | ||
fmt.Fprintf(os.Stderr, "cannot enable state lock tracing: %v\n", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it make sense to have all the testing snapds print this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dropped
osutil/statelock.go
Outdated
return | ||
} | ||
|
||
now := LockTimestamp() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wasn't there a discussion that should be passed in too?
osutil/statelock.go
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the reason of having this in osutil? at face value is a bit of a odd place for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
per mm discussion, moved the code to overlord/state
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the changes, one question still
overlord/state/state.go
Outdated
s.mu.Unlock() | ||
osutil.MaybeSaveLockTime(lockWait, lockStart) | ||
osutil.MaybeSaveLockTime(lockWaitStart, lockHoldStart, osutil.LockTimestamp()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't the LockTimestamp be captured before the Unlock, assuming the Unlock anyway is quick?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, I made a similar comment on the original code and then forgot to update it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks everybody, can we produce the result with the latest changes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, just a few nitpicks
//go:build statelocktrace | ||
|
||
/* | ||
* Copyright (C) 2021 Canonical Ltd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (C) 2021 Canonical Ltd | |
* Copyright (C) 2025 Canonical Ltd |
} | ||
|
||
pc := make([]uintptr, 10) | ||
// avoid 3 first callers on the stack: runtime.Callers(), this function and the parent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick
// avoid 3 first callers on the stack: runtime.Callers(), this function and the parent | |
// ignore the first 3 callers on the stack: runtime.Callers(), this function and the parent |
//go:build !statelocktrace | ||
|
||
/* | ||
* Copyright (C) 2021 Canonical Ltd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (C) 2021 Canonical Ltd | |
* Copyright (C) 2024 Canonical Ltd |
func lockTimestamp() int64 { | ||
return int64(0) | ||
} | ||
|
||
func maybeSaveLockTime(lockWaitStart, lockHoldStart, now int64) { | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tiny nitpick, we can simplify the no-ops
func lockTimestamp() int64 { | |
return int64(0) | |
} | |
func maybeSaveLockTime(lockWaitStart, lockHoldStart, now int64) { | |
} | |
func lockTimestamp() int64 { return 0 } | |
func maybeSaveLockTime(int64, int64, int64) {} |
//go:build !statelocktrace | ||
|
||
/* | ||
* Copyright (C) 2021 Canonical Ltd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* Copyright (C) 2021 Canonical Ltd | |
* Copyright (C) 2024 Canonical Ltd |
// maybeSaveLockTime allows to save lock times when this overpass the threshold | ||
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor suggestion for clarity's sake
// maybeSaveLockTime allows to save lock times when this overpass the threshold | |
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings. | |
// maybeSaveLockTime logs the lock hold and wait times when either exceeds the | |
// threshold defined in the SNAPD_STATE_LOCK_THRESHOLD_MS environment variable. |
|
this is an example of the smoke suite with threshold=150ms: https://pastebin.canonical.com/p/T3bDG25f5q/ |
Thu Jan 23 14:49:31 UTC 2025 Failures: |
@sergiocazzolato thanks, the result look interesting/great, it seems maybe we should try the main suite run with threshold 200ms, also it would be good to have a tool that goes over the result and find repeated call stacks and present them only once with range of times instead of having them repeated many times and thenn also sort the call stacks by the one taking the longest time, probably having a first the long held and then the long waits |
I'll push the changes suggested by @miguelpires |
3e28324
to
c5a0532
Compare
Updated and rebased. @sergiocazzolato please fetch before pushing any changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good
@sergiocazzolato @maykathm to be clear the tool I mentioned can come later, is not for this PR |
This will prevent raises.
Signed-off-by: Maciej Borzecki <[email protected]>
Signed-off-by: Maciej Borzecki <[email protected]>
Signed-off-by: Maciej Borzecki <[email protected]>
Signed-off-by: Maciej Borzecki <[email protected]>
…uild tag Move state lock tracing from osutil into overlord/state. Tweak the code to not export functions since there aren't any external consumers. Signed-off-by: Maciej Borzecki <[email protected]>
Signed-off-by: Maciej Borzecki <[email protected]>
Signed-off-by: Maciej Borzecki <[email protected]>
The code is already executed in spread tests. Signed-off-by: Maciej Borzecki <[email protected]>
c5a0532
to
d13f58d
Compare
This change allows to collect the time the snapd state file is locked. To collect the results it is required to include the test snapd-state-lock and make sure just 1 worker is used.
SPREAD_SNAPD_STATE_LOCK_TRACE_THRESHOLD_MS=150 SPREAD_USE_SNAPD_SNAP_URL= SPREAD_USE_PREBUILT_SNAPD_SNAP=false ./run-spread -artifacts .artifacts google:ubuntu-22.04-64:tests/main/snapd-state-lock google:ubuntu-22.04-64:tests/smoke/
this is an example of the smoke suite with threshold=150ms: https://pastebin.canonical.com/p/T3bDG25f5q/
this is an example of the main suite with threshold=150ms: https://pastebin.canonical.com/p/sZjvcGdZFD/