-
Notifications
You must be signed in to change notification settings - Fork 282
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
reporter: don't expire actively used executables #247
reporter: don't expire actively used executables #247
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@@ -147,7 +151,7 @@ func NewOTLP(cfg *Config) (*OTLPReporter, error) { | |||
if err != nil { | |||
return nil, err | |||
} | |||
executables.SetLifetime(1 * time.Hour) // Allow GC to clean stale items. | |||
executables.SetLifetime(executableCacheLifetime) // Allow GC to clean stale items. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we also increase the size of this cache as 4096 seems small given we still haven't solved the core issue here, which is that items may be dropped from this cache (whether through time-based expiration or due to the LRU being full) and there's no control or guarantee as to when they'll be re-inserted.
I'd set the cache size to 16384 until we really solve the problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion on the cache size (4096 seems quite large to me).
Ideally, we should have metrics for expiration and eviction, then get numbers from production systems.
We can also make the cache sizes and lifetimes configurable.
And we can create an LRU wrapper that automatically resizes the LRUs, as suggested at #248 (comment)
Maybe better continue at #244 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should also not forget that (currently), the reporter implementation in this repository is just for demo/example purposes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No strong opinion on my side either - went ahead and bumped it to 16384 in bf83645
To make it less likely that LRU-full eviction evicts executable metadata for a valuable executable, see open-telemetry#247 (comment)
execInfo, exists := r.executables.Get(traceInfo.files[i]) | ||
// Ensure that actively used executables do not expire. | ||
execInfo, exists := r.executables.GetAndRefresh(traceInfo.files[i], | ||
executableCacheLifetime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the difference here? Does not Get
automatically reset the TTL? If not, the other usage of the lru
should be reviewed and this same change probably applies to (potential multiple) other locations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Get
updates the recentness of an item, which covers eviction when the LRU is full.
The lifetime is only updated/set with Add
and is not touched by Get
.
The lifetime is a concept that allows purging unused items from an LRU (call to PurgeExpired
) to avoid long-term resource leaks.
GetAndRefresh
, updates both, recentness and lifetime of an item.
FreeLRU does not actively purge expired items. For the reporter caches, we recently added a ticker to call to PurgeExpired
regularly. For other caches we don't have an active purging.
* Fix unwinding at syscall on aarch64 (open-telemetry#218) * Add PID as an attribute in each sample (open-telemetry#212) * ebpf: increase number of stack delta buckets (open-telemetry#231) Signed-off-by: Florian Lehner <[email protected]> * reporter: use htlhash attribute for profiling specific hash (open-telemetry#236) Signed-off-by: Florian Lehner <[email protected]> * reporter: drop fifo (open-telemetry#239) Signed-off-by: Florian Lehner <[email protected]> * Drop more unused code (open-telemetry#240) * reporter: do not add empty attributes (open-telemetry#233) Signed-off-by: Florian Lehner <[email protected]> * controller: fix reporter interval mix up with monitor interval (open-telemetry#242) * Extract reporter runloop (open-telemetry#228) Co-authored-by: Christos Kalkanis <[email protected]> * Extract lookupcgroupv2 out of the otlp reporter (open-telemetry#227) * Add CPU id to trace and trace metadata (open-telemetry#249) * reporter: don't expire actively used executables (open-telemetry#247) * Remove legacy code from libpf.UnixTime64 (open-telemetry#252) * Turn kernel module file parsing errors into warnings (open-telemetry#255) Co-authored-by: Florian Lehner <[email protected]> * readatbuf: add missing check when reading from tail chunk (open-telemetry#259) * metrics: Don't send counters with 0 values (open-telemetry#246) --------- Signed-off-by: Florian Lehner <[email protected]> Co-authored-by: umanwizard <[email protected]> Co-authored-by: Bhavna Jindal <[email protected]> Co-authored-by: Florian Lehner <[email protected]> Co-authored-by: Tim Rühsen <[email protected]> Co-authored-by: Nayef Ghattas <[email protected]> Co-authored-by: Damien Mathieu <[email protected]> Co-authored-by: Christos Kalkanis <[email protected]> Co-authored-by: Tommy Reilly <[email protected]> Co-authored-by: Joel Höner <[email protected]>
Ensure that actively used executable info is not expired from the cache, by extending its lifetime using
GetAndRefresh()
Fixes #244