Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

overlord/state, tests: measure state lock #14874

Merged

Conversation

sergiocazzolato
Copy link
Collaborator

@sergiocazzolato sergiocazzolato commented Dec 18, 2024

This change allows to collect the time the snapd state file is locked. To collect the results it is required to include the test snapd-state-lock and make sure just 1 worker is used.

SPREAD_SNAPD_STATE_LOCK_TRACE_THRESHOLD_MS=150 SPREAD_USE_SNAPD_SNAP_URL= SPREAD_USE_PREBUILT_SNAPD_SNAP=false ./run-spread -artifacts .artifacts google:ubuntu-22.04-64:tests/main/snapd-state-lock google:ubuntu-22.04-64:tests/smoke/

this is an example of the smoke suite with threshold=150ms: https://pastebin.canonical.com/p/T3bDG25f5q/
this is an example of the main suite with threshold=150ms: https://pastebin.canonical.com/p/sZjvcGdZFD/

@sergiocazzolato sergiocazzolato marked this pull request as draft December 18, 2024 02:34
@sergiocazzolato sergiocazzolato force-pushed the tests-measure-state-lock branch from 0d1c64f to f46e537 Compare December 18, 2024 17:40
@sergiocazzolato sergiocazzolato added the Needs Samuele review Needs a review from Samuele before it can land label Dec 19, 2024
@sergiocazzolato sergiocazzolato marked this pull request as ready for review December 19, 2024 16:51
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question

@@ -139,6 +142,7 @@ func (s *State) Modified() bool {

// Lock acquires the state lock.
func (s *State) Lock() {
s.lockStart = osutil.GetLockStart()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be done with the lock taken? otherwise just the tentative themselves to lock will race/dirty lockStart

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right, I'll update this

@pedronis pedronis requested a review from bboozzoo December 20, 2024 14:03
@sergiocazzolato
Copy link
Collaborator Author

sergiocazzolato commented Jan 6, 2025

These are the times after the fix proposed by @pedronis

lock-times-smoke.txt

I see it locks about 820ms when the snapd daemon is started (this happens in all the tests)

@pedronis pedronis self-requested a review January 6, 2025 15:54
@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this only needs to happen with the lock taken I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.

FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:

lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

udpated

@sergiocazzolato
Copy link
Collaborator Author

sergiocazzolato commented Jan 6, 2025

@pedronis This is the result of the main suite 145 tests (using a machine with 2 cores and threshold of 800ms),
https://pastebin.canonical.com/p/Km4DrSskYD/

I see some worst times doing:
github.com/snapcore/snapd/daemon.(*Daemon).Start
github.com/snapcore/snapd/overlord/assertstate.doValidateSnap

should we specially test those cases?

osutil/statelock.go Outdated Show resolved Hide resolved
osutil/statelock.go Outdated Show resolved Hide resolved
// MaybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
func MaybeSaveLockTime(lockStart int64) {
lockEnd := time.Now().UnixNano() / int64(time.Millisecond)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this can prob be done after both getenvs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update

@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This may be ok to do with the state lock released if we flock the log file. With the state lock held, there would likely be no contention for flock on the log file. OTOH with the lock released, we'd rely on flock to serialize writes, but this would not keep the state lock unnecessarily held.

FWIW, unlocking may reschedule, so assuming we use flock for serializing writes, the timing may end up being incorrect given that the time is captured inside MaybeSaveLockTime(). The end time will likely need to be captured like so:

lockEnd := time.Now()
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart, lockEnd)

osutil/statelock.go Outdated Show resolved Hide resolved
f := runtime.FuncForPC(pc[i])
file, line := f.FileLine(pc[i])
formattedLine = fmt.Sprintf("%s:%d %s\n", file, line, f.Name())
if _, err = lockfile.WriteString(formattedLine); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we could maybe use runtime.Profile to achieve the same thing but do it in more standard way.

osutil/statelock.go Outdated Show resolved Hide resolved
osutil/statelock.go Outdated Show resolved Hide resolved
overlord/state/state.go Outdated Show resolved Hide resolved
overlord/state/state.go Outdated Show resolved Hide resolved

// MaybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
func MaybeSaveLockTime(lockStart int64) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func MaybeSaveLockTime(lockStart int64) {
func MaybeSaveLockTime(lockWait, lockStart, lockEnd int64) {

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

overlord/state/state.go Outdated Show resolved Hide resolved
@@ -159,6 +163,7 @@ func (s *State) writing() {
func (s *State) unlock() {
atomic.AddInt32(&s.muC, -1)
s.mu.Unlock()
osutil.MaybeSaveLockTime(s.lockStart)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, I missed that s.lockStart is used after releasing the lock, it should ofc be read into a variable before the lock is released

osutil/statelock.go Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Please upload report for BASE (master@79632c6). Learn more about missing BASE report.
Report is 17 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##             master   #14874   +/-   ##
=========================================
  Coverage          ?   78.26%           
=========================================
  Files             ?     1154           
  Lines             ?   153335           
  Branches          ?        0           
=========================================
  Hits              ?   120015           
  Misses            ?    25920           
  Partials          ?     7400           
Flag Coverage Δ
unittests 78.26% <100.00%> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@sergiocazzolato
Copy link
Collaborator Author

sergiocazzolato commented Jan 8, 2025

@bboozzoo @pedronis these are the results for the smoke suite after the latest changes
https://pastebin.canonical.com/p/wFdxyZYQ48/

Those were executed in a vm with 2 cpus

osutil/flock.go Show resolved Hide resolved
osutil/flock.go Show resolved Hide resolved
osutil/statelock.go Outdated Show resolved Hide resolved
osutil/statelock.go Outdated Show resolved Hide resolved
"time"
)

func traceCallers(description string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func traceCallers(description string) {
func traceCallers(description string) error {

and then you can update MaybeSave to do:

	if heldMs > threshold || waitMs > threshold {
		formattedLine := fmt.Sprintf("lock: held %d ms wait %d ms", heldMs, waitMs)
		if err := traceCallers(formattedLine); err != {
			logger.Infof("cannot trace callers: %v", err)
		}
	}

and change fmt.Fprintf() to use return fmt.Errorf("cannot this or that: %w", err)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried that initially bet realized that we cannot use logger in utils, is there any other way to use it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I implemented using fmt.Fprintf() because I saw faultinject does that too.

@pedronis pedronis self-requested a review January 8, 2025 14:04
@bboozzoo bboozzoo force-pushed the tests-measure-state-lock branch from f4fa741 to 431e59d Compare January 10, 2025 15:59
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, some questions, some comments

}()

if err != nil {
fmt.Fprintf(os.Stderr, "cannot enable state lock tracing: %v\n", err)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it make sense to have all the testing snapds print this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dropped

return
}

now := LockTimestamp()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't there a discussion that should be passed in too?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the reason of having this in osutil? at face value is a bit of a odd place for this

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

per mm discussion, moved the code to overlord/state

@bboozzoo bboozzoo changed the title tests: measure state lock overlord/state, tests: measure state lock Jan 13, 2025
Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the changes, one question still

s.mu.Unlock()
osutil.MaybeSaveLockTime(lockWait, lockStart)
osutil.MaybeSaveLockTime(lockWaitStart, lockHoldStart, osutil.LockTimestamp())
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't the LockTimestamp be captured before the Unlock, assuming the Unlock anyway is quick?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, I made a similar comment on the original code and then forgot to update it.

Copy link
Collaborator

@pedronis pedronis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks everybody, can we produce the result with the latest changes?

Copy link
Contributor

@miguelpires miguelpires left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a few nitpicks

//go:build statelocktrace

/*
* Copyright (C) 2021 Canonical Ltd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (C) 2021 Canonical Ltd
* Copyright (C) 2025 Canonical Ltd

}

pc := make([]uintptr, 10)
// avoid 3 first callers on the stack: runtime.Callers(), this function and the parent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick

Suggested change
// avoid 3 first callers on the stack: runtime.Callers(), this function and the parent
// ignore the first 3 callers on the stack: runtime.Callers(), this function and the parent

//go:build !statelocktrace

/*
* Copyright (C) 2021 Canonical Ltd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (C) 2021 Canonical Ltd
* Copyright (C) 2024 Canonical Ltd

Comment on lines +23 to +28
func lockTimestamp() int64 {
return int64(0)
}

func maybeSaveLockTime(lockWaitStart, lockHoldStart, now int64) {
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tiny nitpick, we can simplify the no-ops

Suggested change
func lockTimestamp() int64 {
return int64(0)
}
func maybeSaveLockTime(lockWaitStart, lockHoldStart, now int64) {
}
func lockTimestamp() int64 { return 0 }
func maybeSaveLockTime(int64, int64, int64) {}

//go:build !statelocktrace

/*
* Copyright (C) 2021 Canonical Ltd
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Copyright (C) 2021 Canonical Ltd
* Copyright (C) 2024 Canonical Ltd

Comment on lines +107 to +108
// maybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor suggestion for clarity's sake

Suggested change
// maybeSaveLockTime allows to save lock times when this overpass the threshold
// defined by through the SNAPD_STATE_LOCK_THRESHOLD_MS environment settings.
// maybeSaveLockTime logs the lock hold and wait times when either exceeds the
// threshold defined in the SNAPD_STATE_LOCK_THRESHOLD_MS environment variable.

@sergiocazzolato
Copy link
Collaborator Author

this is an example of the smoke suite with threshold=150ms: https://pastebin.canonical.com/p/T3bDG25f5q/
this is an example of the main suite with threshold=150ms: https://pastebin.canonical.com/p/sZjvcGdZFD/

@sergiocazzolato
Copy link
Collaborator Author

@pedronis

this is an example of the smoke suite with threshold=150ms: https://pastebin.canonical.com/p/T3bDG25f5q/
this is an example of the main suite (~300 tests) with threshold=150ms: https://pastebin.canonical.com/p/sZjvcGdZFD/

Copy link

github-actions bot commented Jan 16, 2025

Thu Jan 23 14:49:31 UTC 2025
The following results are from: https://github.com/canonical/snapd/actions/runs/12886246871

Failures:

@pedronis
Copy link
Collaborator

pedronis commented Jan 17, 2025

@sergiocazzolato thanks, the result look interesting/great, it seems maybe we should try the main suite run with threshold 200ms, also it would be good to have a tool that goes over the result and find repeated call stacks and present them only once with range of times instead of having them repeated many times and thenn also sort the call stacks by the one taking the longest time, probably having a first the long held and then the long waits

@bboozzoo
Copy link
Contributor

I'll push the changes suggested by @miguelpires

@bboozzoo bboozzoo force-pushed the tests-measure-state-lock branch from 3e28324 to c5a0532 Compare January 17, 2025 09:25
@bboozzoo
Copy link
Contributor

Updated and rebased. @sergiocazzolato please fetch before pushing any changes.

Copy link
Contributor

@maykathm maykathm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good

@pedronis
Copy link
Collaborator

@sergiocazzolato @maykathm to be clear the tool I mentioned can come later, is not for this PR

sergiocazzolato and others added 16 commits January 21, 2025 08:44
This will prevent raises.
…uild tag

Move state lock tracing from osutil into overlord/state. Tweak the code
to not export functions since there aren't any external consumers.

Signed-off-by: Maciej Borzecki <[email protected]>
The code is already executed in spread tests.

Signed-off-by: Maciej Borzecki <[email protected]>
@sergiocazzolato sergiocazzolato force-pushed the tests-measure-state-lock branch from c5a0532 to d13f58d Compare January 21, 2025 11:45
@sergiocazzolato sergiocazzolato merged commit 953a5d8 into canonical:master Jan 24, 2025
65 of 66 checks passed
@sergiocazzolato sergiocazzolato deleted the tests-measure-state-lock branch January 24, 2025 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Samuele review Needs a review from Samuele before it can land
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants