High CPU usage #131

slalomsk8er · 2020-10-06T06:56:25Z

I tested a set of Service Checks and got an 18% increase in CPU usage on a 2-core virtual server.

An 18%+ increase in CPU usage for monitoring across the whole cluster is not acceptable for us.

The configured services

Icinga 2 + Icinga PowerShell Service enabled

Icinga 2 + Icinga PowerShell Service disabled

Any ideas on reducing the impact of the checks?

LordHepipud · 2020-10-06T12:10:20Z

Hello and thank you for the report. Indeed, on machines with fewer cores the CPU impact might be higher.
In general, the performance impact should only be a short peak during the call and should not persist for a longer period.

Is the reported high CPU usage only present in short "bursts", or is the overall CPU usage constantly higher? We are currently investigating different solutions to decrease the overall impact of the Framework.
Right now it is important for me to understand the impact on the system itself: whether the load is higher in general or only increases during the execution of plugins.

slalomsk8er · 2020-10-06T16:05:31Z

The high CPU usage happens in bursts but I think they aren't so short.

Icinga2 & Icinga PowerShell Service disabled

I measured on a different server that had no service checks activated yet, and the usage is flat, the same as when I disable the services.

I also see the peaks corresponding to the PowerShell process.

vr255 · 2020-10-13T07:57:12Z

Hello,

we can also see this behaviour in our environment.
Mostly when the CPU check is executed.

This has the unpleasant side effect that we get a lot of false alarms.

LordHepipud · 2020-10-13T08:40:11Z

Thank you for all the detailed reports. We are already looking into this to figure out how we can reduce the overall impact during plugin execution.

The biggest "issue" is that plugins do not remember their last state, which means all Performance Counters and internal objects have to be re-initialised.
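To illustrate the point about plugins not remembering their state, here is a minimal Python sketch. It is not the Framework's code: `PluginState`, `simulate_cold` and `simulate_warm` are hypothetical names, and the "expensive" setup is only counted, not measured. It compares how often the setup runs when every check starts from scratch with how often it runs when one long-lived process keeps the state.

```python
class PluginState:
    """Hypothetical stand-in for counters and objects a check needs."""

    def __init__(self, init_log: list) -> None:
        init_log.append("init")  # record that the expensive setup ran
        self.ready = True


def run_check(state: PluginState) -> str:
    """The check itself is cheap once its state exists."""
    return "OK" if state.ready else "UNKNOWN"


def simulate_cold(runs: int) -> int:
    """Each run builds its state from scratch, like a new process per check."""
    init_log: list = []
    for _ in range(runs):
        run_check(PluginState(init_log))
    return len(init_log)


def simulate_warm(runs: int) -> int:
    """One long-lived process sets up once and reuses the state."""
    init_log: list = []
    state = PluginState(init_log)
    for _ in range(runs):
        run_check(state)
    return len(init_log)


if __name__ == "__main__":
    print(simulate_cold(5), simulate_warm(5))  # 5 1
```

With five runs, the cold path performs the setup five times and the warm path once, which is the kind of saving a persistent process or cache aims for.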

drapiti · 2020-11-02T10:37:57Z

Hi, we also have this problem, and on some critical systems we have had to disable the agent and service. If there is anything we can do to help, we are available.

LordHepipud · 2020-11-02T20:40:20Z

To get a better understanding of the current impact and to provide possible solutions in the future (and for internal testing), it would be helpful to get some additional data:

- List of checks assigned to the system
- Check interval for each check
- Total number of CPU cores available
- Frequency of the CPU cores (or of the hypervisor below, in case the assigned max frequency is limited)

This will help a lot to build proper test environments to see where we can reduce overall impact.

drapiti · 2020-11-03T09:38:14Z

Here is the info of the applied service set:

These checks all use the latest PowerShell framework. This is the current base service set applied to all Windows machines. We have many VMs with 1 to 2 vCPUs which are often under high load, so these checks really kill the system.
CPU speed is typically between 2.1 and 2.6 GHz, but the CPU usage is apparent on all systems.

The graph below is quite explicit: the CPU and memory checks initially ran at 1-minute intervals, then at 5-minute intervals, and at the last step in the graph we removed the entire service set above from this machine. The machine has 2 vCPUs running at 2.6 GHz.

Note: our systems currently run the MS SCOM agent, which consumes 1/3 of the resources while collecting much more data, including the information provided in the above graph.

Other details of the specific machine without any of the above icinga services:

I will add that we have had to suspend all monitoring on Windows systems, so this is quite critical. Our users will not accept this type of impact on their systems.

slalomsk8er · 2020-11-03T10:33:04Z

The requested list, with &addColumns=service_check_interval:

2 vCPUs @ 2.1 GHz

LordHepipud · 2020-11-03T16:51:49Z

Thanks a lot for the input. We will dig into this and see if we can find a long-term solution to reduce the impact.

LordHepipud · 2020-11-06T15:46:34Z

I created the linked PR #142, which addresses this issue to some extent by adding an experimental feature for caching the entire Framework code. If possible, it would be great if you could test this and see whether it mitigates the issue you are having.

slalomsk8er · 2020-11-12T08:42:20Z

I deployed the linked framework and checks by downloading the zip of the feature branches and installing them by hand, and didn't see any difference on my test server with my set of 9 checks.
Maybe my understanding of PowerShell is lacking and the version from the PowerShell Gallery was still used even after I moved it to the recycle bin.

Sadly, I don't have time for more tests this week, and we have found a solution with the Linuxfabrik Python checks, which reduce CPU load by 3/4 compared to PowerShell.

LordHepipud · 2020-11-12T18:44:15Z

Thank you for the feedback. Did you enable the cache handling with Enable-IcingaFrameworkCodeCache after testing it?

If it was enabled and there is no performance uplift at all, we need to keep tweaking it.

drapiti · 2020-11-13T11:51:36Z

We are testing the cache on a few servers and it is definitely much improved. We will check back next week to see how it goes on low-resource VMs.

slalomsk8er · 2020-11-13T21:55:41Z

You were right, I forgot to enable the cache in my last try. The latest test with the cache enabled showed a change from 18% down to 14% CPU - still not the ~~3/4~~ ~95% reduction that the switch to Python provided.
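For comparison, the figures above can be put on the same scale with a little arithmetic. The following Python sketch assumes that 18% and 14% both describe the monitoring overhead as a share of total CPU; the function names are illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the reduction from `before` to `after` in percent of `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


def usage_after_reduction(before: float, reduction_percent: float) -> float:
    """Return the usage left after reducing `before` by `reduction_percent`."""
    return before * (1.0 - reduction_percent / 100.0)


if __name__ == "__main__":
    # Cache enabled: 18% -> 14% CPU
    print(round(relative_reduction(18, 14), 1))     # 22.2
    # A ~95% reduction of 18% CPU would leave about:
    print(round(usage_after_reduction(18, 95), 1))  # 0.9
```

Under that assumption, the cache gives a relative reduction of roughly 22%, whereas a ~95% reduction would bring the same 18% overhead down to about 0.9% CPU.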

I don't know PowerShell well enough, but I have a feeling that only a miracle can save this approach. My next test will be on what Nuitka can do to optimise the Python checks.

BTW on Linux 25 Python checks increased the CPU usage by 2% - so a big part could be Windows and/or antivirus.

Edit: Testing error with Python checks - I missed the 2 Scheduled Tasks Checks that were still PowerShell

LordHepipud · 2020-11-14T18:09:38Z

Thank you all for the feedback. Yes, the "first" initialisation is more resource-intensive than with other solutions. Right now we are working on a way to mitigate the current impact on systems with fewer resources available, so every test and piece of feedback is very helpful.

On the other hand, we are already working on a long-term solution which will decrease the impact of the plugins by a much bigger margin.

drapiti · 2020-11-18T12:43:16Z

Just to update from last week and confirm what we are seeing: the cache has definitely helped, as the servers can at least perform their primary functions and we have not received any other negative feedback. Without the cache we needed to suspend the checks completely, so I think the cache should definitely be the default setting. Resource consumption is still a little on the high side compared to other solutions, so it would be great if it could be tweaked a little more. In any case, I think it is almost there. Great work.

LordHepipud · 2020-11-18T15:24:54Z

Thank you very much for the positive feedback and the tests! It's very much appreciated!

Just out of curiosity: what happens if you only run the Framework with caching enabled and still use the current stable versions of the plugins?
As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

drapiti · 2020-11-18T16:48:25Z

> Thank you very much for the positive feedback and the tests! It's very much appreciated!
>
> Just out of curiosity: what happens if you only run the Framework with caching enabled and still use the current stable versions of the plugins?
> As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

Yes, this is what we have done for the moment on select servers: we kept the current stable plugins and only activated the cache. Did we need to update the plugins as well? We will wait for version 1.3 of the framework before activating it on all servers.

LordHepipud · 2020-11-18T18:01:02Z

Thanks, yes, that's what I wanted to test, because I made some experimental changes to the plugins as well.
I have now merged the PR into master, as I didn't run into any issues during testing either.

I will keep this issue open for now, in case something occurs.

drapiti · 2020-12-09T10:56:22Z

Hi, just wanted to add to this thread without opening a new issue. We have also seen some abnormal memory consumption by PowerShell processes. Over a few days, maybe a week, memory consumption slowly increases, causing various problems on the systems. This was observed before we activated the caching feature, so I don't know yet whether it is mitigated. We only see this after a few days, so it is not as immediate as the CPU problem. Hopefully this can be addressed in the 1.4 release. Also note that this happens even when no services are active on the system: with just the icinga and icinga_powershell services running and no plugins active, the PowerShell processes slowly use up all system memory. We resolved it by disabling the services.

granatelbart · 2021-12-22T15:50:39Z

Any solution for this?
I can see this has been open since 2020. I installed the newest Framework on some virtual hosts with 5-8 CPU threads.
We still have high usage.

LordHepipud · 2022-01-25T09:05:05Z

Just leaving this open for evaluation. Please always use the Api-Check feature and the background daemon for performance improvements.
Additional improvements will follow once the Icinga Agent can natively talk to the Icinga for Windows REST-Api.

Marco-Total · 2022-03-16T13:44:51Z

I like the idea of using Windows' own PowerShell, but I think the way it is installed needs improvement.

A step-by-step guide would be nice. It is not clear to me as a user how to address the REST API from Icinga 2.

The PowerShell module of Icinga 2 is really nice, but unfortunately not usable for us, because the CPU usage is way too high.

LordHepipud · 2022-03-17T08:09:46Z

I'm just curious regarding the installation - did you have a look at the installation section? Icinga for Windows - Getting Started

There is a step-by-step guide on how everything is installed, including a section on the API Check Forwarder feature.

t3easy · 2022-04-29T07:15:11Z

We also have a significantly higher CPU load after installing Icinga for Windows and running 6 checks: Invoke-IcingaCheckCPU, Invoke-IcingaCheckPartitionSpace, Invoke-IcingaCheckDiskHealth and Invoke-IcingaCheckMemory once per minute, and Invoke-IcingaCheckTimeSync and Invoke-IcingaCheckUpdates once per hour.

We already use v1.8.0 with built-in API checks and with the API Check Forwarder enabled.

The load comes mainly from the processes wsmprovhost.exe, powershell.exe and the AV tool.
I'll try to configure the AV to exclude the calls. Any hints for this?

As the test system is just a file server, the load has increased from 1-3% to ~23%.
The system has 2 vCPUs (Intel Xeon Gold 6254 @ 3.1 GHz) and 4 GB RAM.

LordHepipud · 2022-05-16T13:29:55Z

Could anyone on this issue please provide feedback regarding Icinga for Windows v1.9.1 and compare the performance of the latest version with previous versions on your systems?

We recommend using Icinga for Windows with the REST-Api feature enabled. With v1.9.0 we made further improvements to how internal code is executed and also improved Icinga for Windows components to reduce the possible load from AV scanners.

drapiti · 2022-05-17T06:39:16Z

So far we have not had any performance issues with the new updates. We are currently rolling out the framework to all servers and will keep you posted if any issues arise. Much lower consumption overall, great work.

t3easy · 2022-05-18T09:17:07Z

With 1.9.0 the CPU load on a machine with 2 vCPUs (Intel Xeon Gold 6254 at 3.1 GHz) has approximately halved:

quidditchriddikulus · 2023-08-17T08:38:44Z

We are currently running/testing v1.11.0 and still have very high CPU loads during checks:

This is on a 2vCPU machine. It is otherwise nearly idle with ~1-3% CPU load. We have 8 checks running (CPU, Disk Health, Certificates, Network, Partition Space, Process Count, Services, Uptime). The load spikes every 5 minutes, when they run, to 50-100% for over 1 minute which we think is not acceptable at all.
We have already installed the windows service and enabled the API Forwarder, with nearly no change (the screenshot above is with API Forwarder enabled).

What have other people done to improve the performance this much? We simply do not see a way to reduce the load, of course except making checks extremely infrequently which defeats the purpose for a few of the checks.

Al2Klimov · 2023-08-17T08:53:10Z

Please see https://icinga.com/blog/2023/08/01/icinga-for-windows / Let’s talk API!. If that doesn't fix your problem – I don't know.

quidditchriddikulus · 2023-08-17T08:59:06Z

So just to confirm: this is not normal behavior and we should expect way less CPU load?

Thank you for the link, I will have a look!

Al2Klimov · 2023-08-17T09:00:45Z

I don't know, it depends.

LordHepipud · 2023-08-17T10:59:09Z

Can you please check the EventLog for the Icinga for Windows logs and validate whether the API calls are actually going through or whether there are any errors?

EventViewer -> Application and Service Logs -> Icinga for Windows

quidditchriddikulus · 2023-08-17T16:32:00Z

Thanks both - we were able to fix it, and indeed the API was not yet set up correctly. We did not rebuild the JEA profile after installing the service, which caused the host not to listen on the correct port (among other things, probably). There were no errors when starting the service though, which was a bit surprising. Rebuilding the JEA profile worked and the difference is massive! With 2 vCPUs, a check now looks like this at most:

cawoueb5 · 2023-08-18T09:26:48Z

Hi quidditchriddikulus,
sorry for this question, but how did you "rebuild the JEA"? Can you briefly explain?
I'm completely new to Icinga, but we suffer from the same issue you described.
Thanks

quidditchriddikulus · 2023-08-18T15:50:36Z

> Hi quidditchriddikulus, sorry for this question, but how did you "rebuild the JEA"? Can you briefly explain? I'm completely new to Icinga, but we suffer from the same issue you described. Thanks

Hi cawoueb5,

In the Icinga Management Interface (type icinga in PowerShell), you can navigate to [2] Settings, [4] JEA and then [2] Update JEA profile.
The problem is that if the service is not yet installed at the point when you enable JEA or install Icinga, it will not be included when the profile is built. When you install components later on, you need to make sure to update the JEA profile.

When both services are started and the API is enabled, the machine should be listening on port 5668.
You can check this with the following command:
netstat -aon | ForEach-Object { if ($_ -like '*5668*') { $_ } }

If there is no output, the service did not start properly or is not listening properly, and the API won't work. This was the case with an incomplete JEA profile.
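The same listening check can also be scripted. Below is a minimal Python sketch that tries to open a TCP connection to the port; it assumes the API accepts connections on the loopback address, which may not hold for every setup, and `is_port_listening` is an illustrative helper name, not part of Icinga.

```python
import socket


def is_port_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection raises OSError if nothing accepts the connection
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # 5668 is the port mentioned above; 127.0.0.1 is an assumption
    print(is_port_listening("127.0.0.1", 5668))
```

A result of `False` corresponds to the "no output" case of the netstat command: nothing is accepting connections on that port.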

Note however that while the CPU is now fixed, we are facing other issues with the API:
Icinga/icinga-powershell-plugins#364

Also make sure to follow this manual closely to get the API working:
https://icinga.com/docs/icinga-for-windows/latest/doc/110-Installation/30-API-Check-Forwarder/#icinga-communication-to-api

LordHepipud · 2024-03-25T19:29:19Z

This should be fixed by now with the latest improvements to the Icinga for Windows API and the Icinga Agent API integration.

LordHepipud self-assigned this Oct 6, 2020

LordHepipud added the Enhancement label Oct 6, 2020

LordHepipud added this to the v1.4.0 milestone Oct 6, 2020

LordHepipud mentioned this issue Nov 4, 2020

Experimental: Adds code caching for faster framework loading #142

Merged

LordHepipud linked a pull request Nov 4, 2020 that will close this issue

Experimental: Adds code caching for faster framework loading #142

Merged

LordHepipud added the Mitigated and Performance labels and removed the Enhancement label Nov 18, 2020

LordHepipud closed this as completed in #142 Nov 18, 2020

LordHepipud reopened this Nov 18, 2020

This was linked to pull requests Oct 27, 2021

Fix: Module not loaded exception #384

Merged

Feature: Move REST-Api and Api-Checks into Framework #383

Merged

LordHepipud modified the milestones: v1.7.0, v1.8.0 Nov 5, 2021

LordHepipud linked a pull request Jan 12, 2022 that will close this issue

Fix: Rewrite Icinga for Windows service check daemon #414

Merged

LordHepipud closed this as completed in #414 Jan 25, 2022

LordHepipud reopened this Jan 25, 2022

LordHepipud removed this from the v1.8.0 milestone Jan 26, 2022

haxtibal mentioned this issue Mar 17, 2022

Optimize-IcingaForWindowsMemory causes high load #491

Closed

This was linked to pull requests May 16, 2022

Features: Adds module isolation support #514

Merged

Fix: GC collection on every REST call #494

Merged

LordHepipud mentioned this issue Jun 9, 2022

Feature: Rewrite check kernel #533

Open

LordHepipud closed this as completed Mar 25, 2024
