-
Notifications
You must be signed in to change notification settings - Fork 480
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Helm chart for pcm-sensor-server #727
base: master
Are you sure you want to change the base?
Conversation
It would be helpful if you could put custom labels on the |
the FreeBSD check failure is unrelated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great start! Your TODO list makes sense to me. Thank you!
@@ -551,7 +551,7 @@ bool PCM::L3CacheOccupancyMetricAvailable() const | |||
|
|||
bool PCM::CoreLocalMemoryBWMetricAvailable() const | |||
{ | |||
if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata | |||
//if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you remove these checks for testing purposes?
@@ -561,7 +561,7 @@ bool PCM::CoreLocalMemoryBWMetricAvailable() const | |||
|
|||
bool PCM::CoreRemoteMemoryBWMetricAvailable() const | |||
{ | |||
if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata | |||
//if (cpu_model == SKX && cpu_stepping < 5) return false; // SKZ4 errata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only checked the commit that adds the Helm chart and it looks good to me. The only thing I wondered is do you really expect users to modify all the available values in the values.yaml. To me some looked not very necessary but I can't judge since I'm not familiar with PCM.
LGTM
@jcpunk Thanks for this comment, that is very helpfull indeed - having that there is no longer need to hack prometheus-operator chart to disable podMonitorSelector - I added comments in README/values file about this (check this changes in commit for details (7f2c707#diff-618d3b78482c88190c469bb01731f774bb931bcdc14db7b8980691f5745ba08aR151-R152) or this documentation added in values pcm/deployment/pcm/values.yaml Lines 87 to 91 in 7f2c707
Anyway help for pointing this. |
That is the trade of between flexibility and complexity that I'm finding hard to balance. I see two options here:
and try to handle all possible (both proper and inproper combination) - but now rewrite this using go/helm template language - not very maintanable nor readable - I don't really want chart to be validator of possible PCM envs and I'm terified of hardcoding all proper possible combinations - in other words - if you don't like defaults (or other example value files) - then you're responsible for validating that pcm-sensor-server binary will run - but finding right combination is your job in your case. (there is also another option - e.g. allow to pass *any enviornment/options to pcm - but that is discouraged security practice) I the end, I decided to just allow use of this pcm chart to pass those "all PCM suppored" environment variables directly - and tried to cover possible value of combinations as different values files. One more comment, I agree that value file is alread quite big, but not yet as scary as the I see in prometheus node exporter official chart values.yaml - it is 500 lines vs my so far about 100 (but I miss some feature though RBACs, vertical scaling) which I'm using as example of good practicies :) |
a30b342
to
8edb9dc
Compare
old comments: sys/pci/mcfg mounts are unnessesary for indirect method fix old wrong defaults in README fix formatting possible fix for issue with resctrl remove hacks to handle /pcm/resctrl and unessesary out-of-date files update License to use the same as pcm itself update README, remove out-of-date info links do values formatting + links do values update README an values comments update README address jcfunk comments: interval and extra labels for PodMonitor + refactor readme fix typos readme: reminder about removing msr kernel module after rebasing: point to correct default pcm image from intel organization Refactoring: - explicit values file for privileged direct method, - hide (into docs directory) "unprivileged" direct method (and fixes), - remove unnessesary mounts (mcfg, /dev/cpu/dev/mem for privileged access), - add instructions to collection methods, - fixes (extra builder) for build local development image, - silent mode - move collection methods to the top fix values files for direct privileged method New: support for PERFMON capability, silent mode and some extra env debug variables VPA: v1 - first version of vertical pod autoscaler Grafana dashboard: instructions rename resctrlHostMount to resctrlMount fix dashboard rate interval pcm-sensor-server: add new metrics DRAM Local percantage Fix dockerbuild by using separate Dockerfile + build in dockerignore improve dockerfile.debug extra env PCM_NO_MAIN_EXCEPTION_HANDLER
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks!
|
||
if (pcm->localMemoryRequestRatioMetricAvailable()) | ||
printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) ); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the addition of a new metric should be split out into a separate merge request or at least a separate commit.
|
||
if (pcm->localMemoryRequestRatioMetricAvailable()) | ||
printCounter( "DRAM Local Percentage", getLocalMemoryRequestRatio( before, after ) ); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here.
Hello, I think that I have successfully made all steps before 6) Deploy PCM helm chart in paragraph Validation on local kind cluster |
|
||
- kubectl/kind/helm/jq binaries available in PATH, | ||
- docker service up and running. | ||
- full set of metrics available only bare-metal instance or Cloud .metal instance. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could add information on how to check it or enabled it.
helm install ... --set nfd=true --set podMonitor=true --set verticalPodAutoscaler.enabled=true | ||
``` | ||
|
||
### Requirements |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you could add information on how to check it or enable all requirements.
I was trying to run e2e test
Please, add information about how to add this env Also, the file |
Initial version of Helm chart for deploying PCM
Features include:
TODO:
Must have:
/pcm
prefix ends up with non-critical warningperf list | grep -i energy
returnspower/energy-pkg/
andpower/energy-ram/
possible development to PCM required - are available through perf subsystem perf_even_open syscall and enumerating/discovering /sys/devices/power/events PMU type=9 or through power cap framework similary to node_exporter/rapl collector, reading data directly from sysfs like thiscat /sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj
Testing:
Follow up tasks (in another PRs ideas):