-
Notifications
You must be signed in to change notification settings - Fork 4.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for Container CPU Utilization Resource Monitor #37792
base: main
Are you sure you want to change the base?
Conversation
Hi @nix1n, welcome and thank you for your contribution. We will try to review your Pull Request as quickly as possible. In the meantime, please take a look at the contribution guidelines if you have not done so already. |
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to |
…edding in K8s environment Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this! I think we can try to reuse some of the existing components from the other cpu monitor to make this even better.
namespace EnvoyContainerCpuUtilizationMonitor { | ||
|
||
|
||
struct EnvoyContainerStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we fold this into the existing cpu utilization monitor?
For example if we were to modify the proto to have an enum or WKT for the reader to instantiate?
we could then modify:
// In the future, the below can be configurable based on the operating system. |
to create a cpu monitor that uses either one of the stat readers ( the container one or the host one)
This would lead to us having less code to own in general :).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KBaichoo can you check the PR now ? I tried to add container support in the existing cpu_utilization resource monitor. Have NOT rewritten tests and docs yet , will write once we finalise the actual implementation first.
previous_envoy_container_stats_ = envoy_container_stats_reader_->getEnvoyContainerStats(); | ||
} | ||
|
||
void EnvoyContainerCpuUtilizationMonitor::updateResourceUsage(Server::ResourceUpdateCallbacks& callbacks) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we did the folding this file would effectively merge with the existing https://github.com/envoyproxy/envoy/blob/main/source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.h which it looks to me to be very similar.
/wait |
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
@@ -12,8 +13,12 @@ option (udpa.annotations.file_status).package_version_status = ACTIVE; | |||
|
|||
// [#protodoc-title: CPU utilization] | |||
// [#extension: envoy.resource_monitors.cpu_utilization] | |||
|
|||
enum UtilizationComputeStrategy { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to put this enum inside of the CpuUtilizationConfig
message?
Also, please document.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated docs, and added enum inside the CpuUtilizationConfig message.
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
@@ -13,7 +14,16 @@ option (udpa.annotations.file_status).package_version_status = ACTIVE; | |||
// [#protodoc-title: CPU utilization] | |||
// [#extension: envoy.resource_monitors.cpu_utilization] | |||
|
|||
// The CPU utilization resource monitor reports the Envoy process the CPU Utilization of the entire host. | |||
// The CPU utilization resource monitor reports the Envoy process the, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having trouble parsing this sentence. Can you please clarify?
@@ -13,7 +14,16 @@ option (udpa.annotations.file_status).package_version_status = ACTIVE; | |||
// [#protodoc-title: CPU utilization] | |||
// [#extension: envoy.resource_monitors.cpu_utilization] | |||
|
|||
// The CPU utilization resource monitor reports the Envoy process the CPU Utilization of the entire host. | |||
// The CPU utilization resource monitor reports the Envoy process the, | |||
// CPU Utilization of the entire host if HOST mode is selected, reports |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document the enum values right above where each one is defined, rather than in a comment above the entire message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for working on simplifying this, this looks much better.
/wait
source/extensions/resource_monitors/cpu_utilization/cpu_stats_reader.h
Outdated
Show resolved
Hide resolved
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.cc
Outdated
Show resolved
Hide resolved
@@ -22,13 +22,26 @@ struct CpuTimes { | |||
uint64_t total_time; | |||
}; | |||
|
|||
struct CgroupStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the reason we can't use the existing cpu times structure given it has effectively the same fields?
This would simplify the utilization monitor as it could then just use the CpuStatsReader interface vs the concrete implementation class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KBaichoo actually, for Host CPU Utilization, we just need /proc/stat file to read cpu_work and cpu_times both of the field in CpuTimes.
While to calculate container cpu utilization we need to read two different files to read allocated quota at some time, usage seconds total in nanoseconds at that time, and the time_difference from the timer only I could find to calculate. This method we are using in our ambassador edge stack service also, but to update injected resource pressure from a parallely running python script using same calculation strategy.
ref: google/cadvisor#2026 (comment)
So I created another class to read cgroup stats, which read allocated quota, total cpu time in nanoseconds.
This can be merged with CpuTimes class itself but not sure what meaningful naming would suffice for these two though both are uint64_t data type only. But yeah then the reader class also would need to access config mode and based on config mode it should calculate and return stats accordingly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Time difference can be calculated by timeSource only in case of cgroup metrics . Which we don't need while calculating usage of host from /proc/stat file since all the data in this is time dependent already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, if we were to push down the time source into the cgroup stats reader (along with some other logic) we would be able to produce the same format for CpuTimes
-- e.g. https://github.com/envoyproxy/envoy/pull/37792/files#diff-1183c2c3937672e9d4c85d700d1e54ef13d360b4b517a0c643a8e46c13c3eb79R120
we could then de-dup a lot of the implementation details from being brought up at the monitor layer to avoid the monitoring layer having to have implementations for all of the different readers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KBaichoo in that case, in cgroupstatsreader, I have to derive some calculation to incorporate timing also to allocated millicores and usage seconds total metrics to make it equivalent to , cputimes fields total time and work time ? Something like that you are meaning ? And then the resource monitor would still using the original strategy without checking config mode and switching the calculation strategy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't want to try unnecessary derived calculations and lose accuracy. Cputimes would need to have then double data type instead of integer. But let me see this once in daytime. These two fields I couldn't merge logically and definition wise initially that's why I started with another unique resource monitor initially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, if we were to push down the time source into the cgroup stats reader (along with some other logic) we would be able to produce the same format for CpuTimes -- e.g.
have checked the calculation strategy, It should work. But trying now to pushdown timesource from context in stats reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@KBaichoo /proc/uptime has precision till 100th of a second. We can use this without using timesource if refresh_interval is not selected below 0.01 seconds. But where to set it's limit ? How much less refresh_interval we can set ? Is it given ? We use 5 second of refresh_interval for loadshedding in our ambassador edgestack. Will there be any usecase that devs would use this for even less than 0.01 second refresh interval ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using proc/uptime metrics to calculate will get us rid of timesource in reader class and much simpler implementaion. Just with 1 limitation, refresh_interval less than 0.01 second won't work properly, but might update resource pressure as soon as the monitor's loop interval crosses 0.01 second. For refresh_interval greater than 0.01 seconds, it would be perfect. Please let me know if you will allow this. For us it should be sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have implemented all tests and functionality using proc/uptime. @KBaichoo . This should be sufficient for us.
source/extensions/resource_monitors/cpu_utilization/linux_cpu_stats_reader.cc
Show resolved
Hide resolved
@@ -20,6 +22,16 @@ class LinuxCpuStatsReader : public CpuStatsReader { | |||
const std::string cpu_stats_filename_; | |||
}; | |||
|
|||
class LinuxContainerCpuStatsReader: public CgroupStatsReader { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned above in https://github.com/envoyproxy/envoy/pull/37792/files#diff-9281e66aafccb8196311602044a32a4ac53a877d7bae55cc591df8f30ae15810R25 if we go that route this can then just inherit directly from CpuStatsReader interface.
source/extensions/resource_monitors/cpu_utilization/cpu_utilization_monitor.h
Outdated
Show resolved
Hide resolved
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/wait
@@ -35,7 +35,7 @@ class CpuUtilizationMonitor : public Server::ResourceMonitor { | |||
std::unique_ptr <CgroupStatsReader> cgroup_stats_reader_; | |||
TimeSource& time_source_; | |||
MonotonicTime last_update_time_; | |||
envoy::extensions::resource_monitors::cpu_utilization::v3::CpuUtilizationConfig_UtilizationComputeStrategy mode_; | |||
int16_t mode_ = -1; // Will be updated in Resource Monitor Class Constructor |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be more specific I meant to use the non-mangled named like:
envoy::extensions::resource_monitors::cpu_utilization::v3::CpuUtilizationConfig::UtilizationComputeStrategy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -22,13 +22,26 @@ struct CpuTimes { | |||
uint64_t total_time; | |||
}; | |||
|
|||
struct CgroupStats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good point, if we were to push down the time source into the cgroup stats reader (along with some other logic) we would be able to produce the same format for CpuTimes
-- e.g. https://github.com/envoyproxy/envoy/pull/37792/files#diff-1183c2c3937672e9d4c85d700d1e54ef13d360b4b517a0c643a8e46c13c3eb79R120
we could then de-dup a lot of the implementation details from being brought up at the monitor layer to avoid the monitoring layer having to have implementations for all of the different readers.
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/ready
Signed-off-by: nix1n <[email protected]>
Signed-off-by: nix1n <[email protected]>
Risk Level: Low
Testing: Unit tests
Docs Changes:
Release Notes:
Platform Specific Features: Today this is only implemented for Linux
In my org we use envoy in containerised K8s environment, this PR will allow us to trigger overload actions based on the container cpu utilization of the pod where envoy is running.