High CPU usage #131

Closed
slalomsk8er opened this issue Oct 6, 2020 · 67 comments · Fixed by #142, #204, #206, #322 or #384
Labels: Mitigated (an issue is not fully resolved but mitigated), Performance (performance issue while using the solution)

Comments

@slalomsk8er

slalomsk8er commented Oct 6, 2020

I tested a set of Service Checks and got an 18% increase in CPU usage on a 2-core virtual server.

An 18%+ increase in CPU usage for monitoring over the whole cluster is not acceptable for us.

The configured services

image

Icinga 2 + Icinga PowerShell Service enabled

image

Icinga 2 + Icinga PowerShell Service disabled

image

Any ideas on reducing the impact of the checks?

LordHepipud self-assigned this on Oct 6, 2020
LordHepipud added the Enhancement (new feature or request) label on Oct 6, 2020
LordHepipud added this to the v1.4.0 milestone on Oct 6, 2020
@LordHepipud
Collaborator

Hello and thank you for the report. Indeed, on machines with fewer cores the CPU impact might be higher.
In general, the performance impact should only be a short peak during the call and should not last for a longer period.

Is the reported high CPU usage only present in short "bursts", or is the entire CPU usage constantly higher? We are currently investigating different solutions to decrease the overall impact of the Framework.
Right now it would be important for me to understand the impact on the system itself: whether the load is higher in general or whether it only increases during the execution of plugins.

@slalomsk8er
Author

The high CPU usage happens in bursts, but I don't think they are that short.

image
image
image

Icinga2 & Icinga PowerShell Service disabled

image
image
image
image

I also measured on a different server that had no service checks activated yet, and the CPU usage there is flat, the same as when I disabled the services.

I also see that the peaks correspond to the PowerShell process.

image
image

@vr255

vr255 commented Oct 13, 2020

Hello,

we can also see this behaviour in our environment.
Mostly when the CPU check is executed.

image

This has the unpleasant side effect that we get a lot of false alarms.

@LordHepipud
Collaborator

Thank you for all the detailed reports. We are already taking a look at this to figure out how we can reduce the overall impact during plugin execution.

The biggest "issue" is that plugins do not remember their last state, which means all Performance Counters and internal objects have to be re-initialised on every call.
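
The cost described above can be illustrated with a small sketch. This is not Icinga for Windows code: `ExpensiveCounterSet`, `run_check_stateless` and `run_check_cached` are hypothetical names, and Python is used only to keep the example short. It contrasts re-creating state on every plugin call with keeping it alive in a long-running process.

```python
class ExpensiveCounterSet:
    """Hypothetical stand-in for performance counters and internal objects."""

    instances_created = 0  # counts how often the costly setup ran

    def __init__(self):
        ExpensiveCounterSet.instances_created += 1
        # Placeholder for slow setup work (module loading, counter discovery).
        self.counters = {"cpu": 0.0}

    def read(self, name):
        return self.counters[name]


def run_check_stateless(name):
    """Every call pays the full initialisation cost again."""
    return ExpensiveCounterSet().read(name)


_shared = None


def run_check_cached(name):
    """A long-running process initialises once and reuses the objects."""
    global _shared
    if _shared is None:
        _shared = ExpensiveCounterSet()
    return _shared.read(name)


# Five stateless calls trigger five initialisations.
for _ in range(5):
    run_check_stateless("cpu")
stateless_inits = ExpensiveCounterSet.instances_created

# Five cached calls trigger only one initialisation.
ExpensiveCounterSet.instances_created = 0
for _ in range(5):
    run_check_cached("cpu")
cached_inits = ExpensiveCounterSet.instances_created
```

In this toy model the saving grows with the number of check executions, which is the general idea behind keeping state between calls rather than starting from scratch each time.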

@drapiti

drapiti commented Nov 2, 2020

Hi, we also have this problem, and on some critical systems we have had to disable the agent and service. If there is anything we can do to help, we are available.

@LordHepipud
Collaborator

To get a better understanding of the current impact and to provide possible solutions in the future (and for internal testing), it would be helpful to get some additional data:

  • List of checks assigned to the system
  • Check interval for each check
  • Total number of CPU cores available
  • Frequency of the CPU cores (or of the hypervisor below; if it's limited, the assigned max frequency)

This will help a lot to build proper test environments to see where we can reduce overall impact.

@drapiti

drapiti commented Nov 3, 2020

Here is the info of the applied service set:

image

These checks are all using the latest PowerShell framework. This is the current base service set applied to all Windows machines. We have many VMs with 1 to 2 vCPUs which are often under high load, so these checks really kill the system.
CPU speed is typically between 2.1 and 2.6 GHz, but the CPU usage is apparent on all systems.

The graph below is quite explicit: it shows CPU and memory, initially with checks at 1-minute intervals, then at 5-minute intervals, and in the last step of the graph we removed the entire service set above from the specific machine. This machine has 2 vCPUs running at 2.6 GHz.

image

Note: Currently our systems are running the MS SCOM agent, which consumes 1/3 of the resources while collecting much more data, including the information provided in the graph above.

Other details of the specific machine without any of the above icinga services:

image

I will add that we have had to suspend all monitoring on Windows systems, so this is quite critical. Our users will not accept this type of impact on their systems.

@slalomsk8er
Author

the requested list with &addColumns=service_check_interval
addColumns=_service_check_interval_1

2 vCPUs @ 2.1 GHz

@LordHepipud
Collaborator

Thanks a lot for the input. We will dig into this and see if we can find a long-term solution to reduce the impact.

@LordHepipud
Collaborator

I created the linked PR #142 which addresses this issue a little by adding the experimental feature for caching the entire Framework code. In case possible, it would be great if you could test this and see if it mitigates the current issue you are having.

@slalomsk8er
Author

I deployed the linked framework and checks by downloading the zip files of the feature branches and installing them by hand, but didn't see any difference on my test server with my set of 9 checks.
Maybe my understanding of PowerShell is lacking and the version from the PowerShell Gallery was still used even after I moved it to the recycle bin.

Sadly, I don't have time for more tests this week, and we found a solution with the Linuxfabrik Python checks, which reduces the CPU load by 3/4 compared to PowerShell.

@LordHepipud
Collaborator

Thank you for the feedback. Did you enable the code cache with Enable-IcingaFrameworkCodeCache before testing it?

If it was enabled and there is no performance uplift at all, we need to keep tweaking it.

@drapiti

drapiti commented Nov 13, 2020

We are testing the cache on a few servers and it is definitely much improved. We will check back next week to see how it goes on low-resource VMs.

@slalomsk8er
Author

slalomsk8er commented Nov 13, 2020

You were right, I forgot to enable the cache in my last try. The latest test with the cache enabled showed a change from 18% down to 14% CPU - still not the 3/4 ~95% reduction that the switch to Python provided.
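
For comparison, the two figures can be put on the same scale. The sketch below is only illustrative arithmetic and assumes that the 18% and 14% values and the "3/4" figure all refer to the same CPU usage baseline, which the thread does not state explicitly. The function name `relative_reduction` is hypothetical.

```python
def relative_reduction(before, after):
    """Return the relative reduction from `before` to `after` as a fraction."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before


# Effect of enabling the code cache: 18% CPU down to 14% CPU.
cache_gain = relative_reduction(18.0, 14.0)
print(f"cache: {cache_gain:.1%} relative reduction")

# CPU usage that a 3/4 reduction from the same 18% would correspond to.
three_quarters_target = 18.0 * (1 - 0.75)
print(f"3/4 reduction target: {three_quarters_target}% CPU")
```

Under that assumption, the cache accounts for a reduction of roughly a fifth, while a 3/4 reduction would mean about 4.5% CPU.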

PS_cache_1

I don't know PowerShell well enough, but I have a feeling that only a miracle can save this approach. My next test will be on what Nuitka can do to optimise the Python checks.

BTW on Linux 25 Python checks increased the CPU usage by 2% - so a big part could be Windows and/or antivirus.

Edit: Testing error with Python checks - I missed the 2 Scheduled Tasks Checks that were still PowerShell

@LordHepipud
Collaborator

Thank you all for the feedback. Yes, the "first" initialisation is more resource-intensive than in other solutions. Right now we are working on a way to mitigate the current impact on systems with fewer resources available, so every test and piece of feedback is very helpful.

On the other hand we are already working on a long-term solution which will decrease the impact of the plugins by a way bigger margin.

@drapiti

drapiti commented Nov 18, 2020

Just to update from last week and confirm what we are seeing: the cache has definitely helped, as the servers can at least perform their primary functions and we have not received any other negative feedback. Without the cache we needed to suspend the checks completely, so I think the cache should definitely be the default setting. Resource consumption is still a little on the high side compared to other solutions, so it would be great if it could be tweaked a little more. In any case I think it is almost there, great work.

@LordHepipud
Collaborator

Thank you very much for the positive feedback and the tests! It's very much appreciated!

Just out of curiosity: What happens if you only run the Framework with the caching enabled and still use the current stable versions of the plugins?
As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

LordHepipud added the Mitigated and Performance labels and removed the Enhancement label on Nov 18, 2020
@drapiti

drapiti commented Nov 18, 2020

> Thank you very much for the positive feedback and the tests! It's very much appreciated!
>
> Just out of curiosity: What happens if you only run the Framework with the caching enabled and still use the current stable versions of the plugins?
> As far as I can tell, the plugins themselves are not causing too much of an issue - can you confirm this?

Yes, this is what we have done for the moment on select servers: we kept the current stable plugins and only activated the cache. Did we need to update the plugins as well? We will wait for the 1.3 version of the framework before activating it on all servers.

@LordHepipud
Collaborator

Thanks, yes, that's what I wanted to test, because I made some experimental changes to the plugins as well.
I have now merged the PR into master, as I didn't run into any issues during testing either.

I will keep this issue open for now, in case something occurs.

@drapiti

drapiti commented Dec 9, 2020

Hi, just wanted to add to this thread without opening a new issue. We have also seen some abnormal behaviour with the memory consumption of the PowerShell processes. After a few days, maybe a week, the memory consumption slowly increases, causing various problems on the systems. This was observed before we activated the caching feature, so I don't know yet whether it is mitigated. We only see this after a few days, so it is not as immediate as the CPU problem. Hopefully this can be addressed in the 1.4 release. Also to note, this is happening even when no services are active on the system: with just the icinga and icinga_powershell services running and no plugins active, the PowerShell processes slowly use up all system memory. We resolved it by disabling the services.

LordHepipud modified the milestones: v1.7.0, v1.8.0 on Nov 5, 2021
@granatelbart

Any solution for this?
I can see this has been open since 2020. I installed the newest Framework on some virtual hosts with 5-8 CPU threads.
We still have high usage.

@LordHepipud
Collaborator

Just leaving this open for evaluation. Please always use the Api-Check feature and the background daemon for performance improvements.
Additional improvements will follow once the Icinga Agent can natively talk to the Icinga for Windows REST-Api.

LordHepipud removed this from the v1.8.0 milestone on Jan 26, 2022
@Marco-Total

I like the idea of using Windows' own PowerShell, but I think the way it is installed needs improvement.

A step-by-step guide would be nice. It is not clear to me as a user how to address the REST API from Icinga 2.

The PowerShell module of Icinga 2 is really nice, but unfortunately not usable for us, because the CPU usage is way too high.

@LordHepipud
Collaborator

I'm just curious regarding the installation - did you have a look at the installation section? Icinga for Windows - Getting Started

There is a step-by-step guide on how everything is installed, including a section for the API Check Forwarder feature.

@t3easy
Contributor

t3easy commented Apr 29, 2022

We also have a significantly higher CPU load after installing Icinga for Windows and running 6 checks: Invoke-IcingaCheckCPU, Invoke-IcingaCheckPartitionSpace, Invoke-IcingaCheckDiskHealth and Invoke-IcingaCheckMemory once per minute, and Invoke-IcingaCheckTimeSync and Invoke-IcingaCheckUpdates once per hour.

We already use v1.8.0 with built-in API checks and with the API Check Forwarder enabled.

The load comes mainly from the processes wsmprovhost.exe, powershell.exe and the AV tool.
I'll try to configure the AV to exclude the calls. Any hints for this?

The test system is just a file server; its load has increased from 1-3% to ~23%.
The system has 2 vCPUs (Intel Xeon Gold 6254 @ 3.1 GHz) and 4 GB RAM.

@LordHepipud
Collaborator

Could anyone on this issue please provide feedback regarding Icinga for Windows v1.9.1 and compare the performance of the latest version with previous versions on your systems?

We recommend using Icinga for Windows with the REST-Api feature enabled. With v1.9.0 we made further improvements to how internal code is executed and also improved Icinga for Windows components to reduce the possible load caused by AV scanners.

@drapiti

drapiti commented May 17, 2022

So far we have not had any issues regarding performance with the new updates. We are currently rolling out the framework to all servers and will keep you posted if any issues arise. Much lower consumption overall, great work.

@t3easy
Contributor

t3easy commented May 18, 2022

With 1.9.0 the CPU load on a system with 2 vCPUs (Intel Xeon Gold 6254 at 3.1 GHz) has approximately halved:
fs_cpu

@quidditchriddikulus

We are currently running/testing v1.11.0 and still have very high CPU loads during checks:
image

This is on a 2vCPU machine. It is otherwise nearly idle with ~1-3% CPU load. We have 8 checks running (CPU, Disk Health, Certificates, Network, Partition Space, Process Count, Services, Uptime). The load spikes every 5 minutes, when they run, to 50-100% for over 1 minute which we think is not acceptable at all.
We have already installed the windows service and enabled the API Forwarder, with nearly no change (the screenshot above is with API Forwarder enabled).

What have other people done to improve the performance this much? We simply do not see a way to reduce the load, of course except making checks extremely infrequently which defeats the purpose for a few of the checks.
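
To put the reported spikes in perspective, the average load they add over a full check interval can be estimated. This is a rough sketch under simplifying assumptions: the spike is treated as constant load and as lasting exactly 60 seconds (the report says "over 1 minute", so the results are lower bounds). The function name `average_added_load` is hypothetical.

```python
def average_added_load(spike_load_pct, spike_seconds, interval_seconds):
    """Average CPU load added by a spike that recurs once per interval."""
    if not 0 < spike_seconds <= interval_seconds:
        raise ValueError("spike must fit inside the interval")
    return spike_load_pct * spike_seconds / interval_seconds


# Spikes of 50-100% for 60 seconds, repeating every 5 minutes (300 seconds).
low = average_added_load(50, 60, 300)
high = average_added_load(100, 60, 300)
print(f"average added load: {low}% to {high}%")
```

Under these assumptions the checks add an average of 10% to 20% CPU on a machine that otherwise idles at about 1-3%.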

@Al2Klimov
Member

Please see https://icinga.com/blog/2023/08/01/icinga-for-windows / Let’s talk API!. If that doesn't fix your problem – I don't know.

@quidditchriddikulus

quidditchriddikulus commented Aug 17, 2023

So just to confirm: this is not normal behavior and we should expect way less CPU load?

Thank you for the link, I will have a look!

@Al2Klimov
Member

I don't know, it depends.

@LordHepipud
Collaborator

Can you please check the EventLog for the Icinga for Windows logs and validate, if the API calls are actually going through or if there are any errors?

EventViewer -> Application and Service Logs -> Icinga for Windows

@quidditchriddikulus

quidditchriddikulus commented Aug 17, 2023

Thanks both - we were able to fix it, and indeed the API was not yet set up correctly. We did not rebuild the JEA Profile after installing the service, which caused the host not to listen on the correct port (probably among other things). There were no errors when starting the service though, which was a bit surprising. Rebuilding the JEA Profile worked and the difference is massive! With 2 vCPUs, a check now looks like this at most:
image

@cawoueb5

Hi quidditchriddikulus,
sorry for this question, but how did you "rebuild the JEA"? Can you briefly explain?
I'm completely new to Icinga, but we suffer from the same issue you described.
Thanks

@quidditchriddikulus

quidditchriddikulus commented Aug 18, 2023

> Hi quidditchriddikulus, sorry for this question, but how did you "rebuild the JEA"? Can you briefly explain? I'm completely new to Icinga, but we suffer from the same issue you described. Thanks

Hi cawoueb5,

In the Icinga Management Interface (type icinga in PowerShell), you can navigate to [2] Settings, [4] JEA and then [2] Update JEA profile.
The problem is that if the service is not yet installed when you enable JEA or install Icinga, it is not included when the profile is built. When you install components later on, you need to make sure to update the JEA profile.

When both services are started and the API is enabled, the machine should be listening on port 5668.
You can check this with the following command:
netstat -aon | ForEach-Object { if ($_ -like '*5668*') { $_ } }

image

If there is no output, the service did not start properly or is not listening properly, and the API won't work. This was the case with an incomplete JEA profile.
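
Note that the wildcard match in the command above also matches lines where 5668 appears as a process ID or as a remote port. A stricter check can parse the columns instead. The following is a minimal sketch, not part of Icinga for Windows: `listening_on` is a hypothetical helper, the sample text is made up for illustration, and the parser assumes the English column layout of `netstat -aon` (the state name may be localised on non-English Windows installations).

```python
def listening_on(netstat_output, port):
    """Return True if a TCP socket is in LISTENING state on `port`."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected TCP row: Proto, Local Address, Foreign Address, State, PID
        if len(fields) < 5 or fields[0] != "TCP":
            continue
        # The local port is the part after the last colon (works for IPv6 too).
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port) and fields[3] == "LISTENING":
            return True
    return False


# Illustrative sample: 5668 only appears as a PID, not as a listening port.
pid_only = """\
  Proto  Local Address          Foreign Address        State           PID
  TCP    0.0.0.0:135            0.0.0.0:0              LISTENING       5668
"""

# Illustrative sample: a socket is actually listening on port 5668.
api_up = pid_only + "  TCP    0.0.0.0:5668           0.0.0.0:0              LISTENING       4321\n"

print(listening_on(pid_only, 5668))
print(listening_on(api_up, 5668))
```

On a real system the input could come from `subprocess.run(["netstat", "-aon"], capture_output=True, text=True).stdout`.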

Note however that while the CPU is now fixed, we are facing other issues with the API:
Icinga/icinga-powershell-plugins#364

Also make sure to follow this manual closely to get the API working:
https://icinga.com/docs/icinga-for-windows/latest/doc/110-Installation/30-API-Check-Forwarder/#icinga-communication-to-api

@LordHepipud
Collaborator

This should be fixed by now with the latest improvements to the Icinga for Windows API and the Icinga Agent API integration.
