High CPU usage #131
Comments
Hello and thank you for the report. Indeed, on machines with fewer cores the CPU impact might be higher. Is the reported high CPU usage only present in short "bursts", or is the entire CPU usage constantly higher? We are currently investigating different solutions to decrease the overall impact of the Framework. |
Thank you for all the detailed reports. We are already taking a look at this to figure out how we can reduce the overall impact during plugin execution. The biggest "issue" is that plugins do not remember their last state, which means all Performance Counters and internal objects have to be re-initialised on every run. |
Hi, we also have this problem, and on some critical systems we have had to disable the agent and service. If there is anything we can do to help, we are available. |
To get a better understanding of the current impact and to provide possible solutions in the future (and for internal testing), it would be helpful to get some additional data:
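To illustrate the point about plugins not remembering their last state, here is a minimal Python sketch. It is not the Framework's implementation; the `Counters` class and both `run_*` functions are hypothetical stand-ins that only count how often costly state is rebuilt when every check starts from scratch versus when one long-lived object is reused.

```python
class Counters:
    """Hypothetical stand-in for costly state such as performance counter handles."""

    init_calls = 0  # how many times the costly setup has run

    def __init__(self):
        # Each construction models one full re-initialisation.
        Counters.init_calls += 1

    def read(self):
        # Placeholder measurement; a real plugin would query the system here.
        return 0.0


def run_stateless(checks):
    """Every check builds its own state, as a freshly started process would."""
    return [Counters().read() for _ in range(checks)]


def run_with_shared_state(checks):
    """One long-lived object serves all checks, as a background daemon could."""
    shared = Counters()
    return [shared.read() for _ in range(checks)]
```

With five checks, the stateless variant performs the setup five times and the shared variant once, which is the kind of saving a cache or daemon is aiming for.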
This will help a lot to build proper test environments to see where we can reduce overall impact. |
Here is the info of the applied service set: These checks are all using the latest PowerShell framework. This is the current base service set applied to all Windows machines. We have many VMs with 1 to 2 vCPUs which are often under high load, so these checks really kill the system. The graph below is quite explicit: CPU and memory with the checks initially at 1-minute intervals, then at 5-minute intervals, and in the last step of the graph we removed the entire service set above from the specific machine. This machine has 2 vCPUs running at 2.6 GHz. Note: our systems currently run the MS SCOM agent, which consumes 1/3 of the resources while collecting much more data, including the information shown in the graph above. Other details of the specific machine without any of the above Icinga services: I will add that we have had to suspend all monitoring on Windows systems, so this is quite critical. Our users will not accept this type of impact on their systems. |
Thanks a lot for the input. We will dig into this and see if we can find a long-term solution to reduce the impact. |
I created the linked PR #142, which addresses this issue a little by adding an experimental feature for caching the entire Framework code. If possible, it would be great if you could test this and see if it mitigates the issue you are having. |
I deployed the linked framework and checks by downloading the zip of the feature branches and installing them by hand, and didn't see any difference on my test server with my set of 9 checks. Sadly I don't have more time this week to do more tests, and we found a solution with the Linuxfabrik Python checks, which reduces CPU load by 3/4 compared to PowerShell. |
Thank you for the feedback. Did you enable the caching handling with If it was and there is no performance uplift at all, we need to keep tweaking it. |
We are testing the cache on a few servers and it is definitely much improved. Will check back next week to see how it goes on low-resource VMs. |
You were right, I forgot to enable the cache in my last try. The latest test with the cache enabled showed a change from 18% down to I don't know PowerShell well enough, but I have a feeling that only a miracle can save this approach. My next test will be on what Nuitka can do to optimise the Python checks. BTW, on Linux 25 Python checks increased the CPU usage by 2%, so a big part could be Windows and/or antivirus. Edit: Testing error with the Python checks: I missed the 2 Scheduled Tasks checks that were still PowerShell. |
Thank you all for the feedback. Yes, the "first" initialisation is more resource intensive than in other solutions. Right now we are working on a way to mitigate the current impact on systems with fewer resources available, so every test and piece of feedback is very helpful. On the other hand, we are already working on a long-term solution which will decrease the impact of the plugins by a much bigger margin. |
Just to update from last week and confirm what we are seeing. The cache has definitely helped: at least the servers can perform their primary functions, and we have not received any other negative feedback. Without the cache we needed to suspend the checks completely, so I think the cache should definitely be the default setting. Resource consumption is still a little on the high side compared to other solutions, so it would be great if it could be tweaked a little more. In any case I think it is almost there, great work. |
Thank you very much for the positive feedback and the tests! It's very much appreciated! Just out of curiosity: What happens if you only run the Framework with caching enabled and still use the current stable versions of the plugins? |
Yes, this is what we have done for the moment on select servers: kept the current stable plugins and only activated the cache. Do we need to update the plugins as well? We will wait for the 1.3 version of the framework before activating it on all servers. |
Thanks, yes, that's what I wanted to test, because I made some experimental changes to the plugins as well. I will keep this issue open for now, in case something occurs. |
Hi, just wanted to add to this thread without opening a new issue. We have also seen some abnormal behaviour with the memory consumption of PowerShell processes. After a few days, maybe a week, the memory consumption slowly increases, causing various problems on the systems. This was prior to activating the caching feature, so I don't know yet if this is mitigated. We only see this after a few days, so it is not as immediate as the CPU problem. Hopefully this can be addressed in the 1.4 release. Also to note, this is happening even when no services are active on the system. With just the icinga and icinga_powershell services running and no plugins active, the PowerShell processes slowly use up all system memory. We resolved it by disabling the services. |
Any solution on this? |
Just leaving this open for evaluation. Please always use the Api-Check feature and the background daemon for performance improvements. |
I like the idea of using Windows' own PowerShell, but I think the way it is installed needs improvement. A step-by-step guide would be nice. It is not clear to me as a user how to address the REST API from Icinga 2. The PowerShell module of Icinga 2 is really nice, but unfortunately not usable for us, because the CPU usage is way too high. |
I'm just curious regarding the installation: did you have a look at the installation section? Icinga for Windows - Getting Started. There is a step-by-step guide on how everything is installed, including a section for the API Check Forwarder feature. |
We also have a significantly higher CPU load after installing Icinga for Windows and running 6 checks: Invoke-IcingaCheckCPU, Invoke-IcingaCheckPartitionSpace, Invoke-IcingaCheckDiskHealth and Invoke-IcingaCheckMemory once per minute, Invoke-IcingaCheckTimeSync and Invoke-IcingaCheckUpdates once per hour. We already use v1.8.0 with built-in apichecks and with the Api-Check Forwarder enabled. The load comes mainly from the processes wsmprovhost.exe, powershell.exe and the AV tool. As the test system is just a fileserver, the load has increased from 1-3% to ~23%. |
Could anyone on this issue please provide feedback regarding Icinga for Windows v1.9.1 and compare the performance of the latest version with previous versions on your systems? We recommend using Icinga for Windows with the REST-Api feature enabled. With v1.9.0 we made further improvements to how internal code is executed and also improved Icinga for Windows components to reduce the possible load from AV scanners. |
So far we have not had any performance issues with the new updates. We are currently rolling out the framework to all servers and will keep you posted if any issues arise. Much lower consumption overall, great work. |
Please see https://icinga.com/blog/2023/08/01/icinga-for-windows / Let’s talk API!. If that doesn't fix your problem – I don't know. |
So just to confirm: this is not normal behavior and we should expect way less CPU load? Thank you for the link, I will have a look! |
I don't know, it depends. |
Can you please check the EventLog for the Icinga for Windows logs and validate whether the API calls are actually going through or whether there are any errors?
|
Hi quidditchriddikulus, |
Hi cawoueb5, In the Icinga Management Interface (type icinga in PowerShell), you can navigate to [2] Settings, [4] JEA and then [2] Update JEA profile. When both services are started and the API is enabled, the machine should be listening on port 5668. If there is no output, the service did not start properly or is not listening properly, and the API won't work. This was the case with an incomplete JEA profile. Note, however, that while the CPU is now fixed, we are facing other issues with the API: Also make sure to follow this manual closely to get the API working: |
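For anyone who wants to script the port check mentioned above, here is a minimal Python sketch. It only tests whether something accepts TCP connections on the port; it does not verify that the Icinga for Windows API itself answers correctly. The function name `is_port_open` and the default timeout are my own choices, and 127.0.0.1 assumes the check runs on the monitored machine itself.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError on refusal, timeout or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5668 is the port mentioned in this thread; adjust if yours differs.
    state = "listening" if is_port_open("127.0.0.1", 5668) else "not reachable"
    print(f"Port 5668 is {state}")
```

If this reports the port as not reachable while both services are running, the JEA profile and the EventLog would be the first places I would look, as described above.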
This should be fixed by now with the latest improvements to the Icinga for Windows API and the Icinga Agent API integration. |
I tested a set of Service Checks and got an 18% increase in CPU usage on a 2-core virtual server.
An 18%+ increase in CPU usage for monitoring over the whole cluster is not acceptable for us.
The configured services
Icinga 2 + Icinga PowerShell Service enabled
Icinga 2 + Icinga PowerShell Service disabled
Any ideas on reducing the impact of the checks?