Inordinate CPU usage. What is zsys doing? What is it even trying to do? #207
It doesn't leave any clues about problems in logs, either.
One of the issues (and no, it doesn't trigger on every ZED event or the like if it isn't started) is that the Go libzfs bindings work well but don't scale to a huge number of datasets. For instance, when you open a pool, the binding itself scans all datasets from that pool in advance, and this is where the high CPU usage comes from. One idea would be to write our own binding, tailored to ZSys's needs, for use cases like yours where there is a huge number of datasets and the current one doesn't scale.
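For illustration, here is a minimal, self-contained Go sketch of the call pattern this describes (including the per-property detail that comes up later in the thread); every type and function here is a hypothetical stand-in, not the real go-libzfs API:

```go
package main

import "fmt"

// dataset is a hypothetical stand-in for a per-dataset handle held by a binding.
type dataset struct{ name string }

// readProperty stands in for one Go->C crossing (and, underneath, roughly one
// ioctl) per property read.
func readProperty(d dataset, prop string) string {
	return d.name + ":" + prop // placeholder value
}

func main() {
	// A pool with many datasets/snapshots, as in the reporter's setup.
	var datasets []dataset
	for i := 0; i < 10000; i++ {
		datasets = append(datasets, dataset{name: fmt.Sprintf("pool/ds%d", i)})
	}
	props := []string{"mountpoint", "canmount", "origin", "com.ubuntu.zsys:bootfs"}

	// Eager enumeration at pool open: crossings scale as
	// len(datasets) * len(props), which is where the CPU time goes.
	crossings := 0
	for _, d := range datasets {
		for _, p := range props {
			_ = readProperty(d, p)
			crossings++
		}
	}
	fmt.Println("per-property crossings:", crossings) // 40000 round trips

	// The alternative discussed in this thread is to gather the whole state
	// kernel-side (e.g. with a channel program) and return it in one call.
}
```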
The zfs command suffers the same scalability issues with large numbers of datasets, for the same reason. The "solution" is to use a Lua script (a ZFS channel program) to do the heavy lifting in-kernel.
Yeah, I'm familiar with "zfs list" being somewhat expensive (znapzend gets hit by this too), but there is something more going on here, like the expensive thing being called many more times than necessary. Very possibly as a kind of side-effect of the bindings layer, where it's less "deliberate" by zsys and harder to avoid. But how often does zsys "open a pool"? As a daemon, can it not keep the pool open across multiple.. whatever-it's-doing.. to avoid this repeated startup cost? Or maybe the maintainers of the go lib are amenable to alternate interfaces that don't have the same startup behaviour?

As for writing libzfs bindings, I looked into that at one stage (for Rust) but got discouraged because libzfs is apparently not considered a stable interface and seems like a potential maintenance trap, so I wound up just targeting the zfs commands and 'parser-friendly' text output like many others. Disappointing.

Writing a channel program to optimise the operations zsys needs sounds good, but again: what are these operations?
As explained in the blog post series (IIRC I wrote about it), the daemon wakes up via socket activation, handles the various requests, then idles. Indeed, that was the idea: rewrite the ZFS operations as a channel program (ZCP/Lua) so the work happens in kernel memory and only the properties we care about are returned, in a single shot, avoiding the many Go<->C round trips the current Go binding does. (Note that we already patch it to cut some of those and only handle the operations we need, but that's still not enough when you have many, many datasets.)

All the ZFS operations we use are split out into their own package: https://github.com/ubuntu/zsys/tree/master/internal/zfs. You can see there that there is even a mock for tests, so they can run against either the real libzfs implementation or a mock one (and simulate failures that way).
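The real-vs-mock split described here can be pictured roughly like this; it is only a hedged sketch with made-up names, and the actual interfaces in internal/zfs are different and richer:

```go
package zfsbackend

// Dataset is a minimal view of the properties this kind of tooling cares
// about. These names are illustrative only, not the real internal/zfs types.
type Dataset struct {
	Name       string
	Mountpoint string
	BootFS     bool
}

// Lister is the seam that lets tests swap a real libzfs-backed implementation
// for an in-memory mock (including simulated failures).
type Lister interface {
	Datasets() ([]Dataset, error)
}

// MockLister returns canned datasets, or a canned error to simulate failures.
type MockLister struct {
	Data []Dataset
	Err  error
}

func (m MockLister) Datasets() ([]Dataset, error) {
	if m.Err != nil {
		return nil, m.Err
	}
	return m.Data, nil
}
```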
I've now read through the blog posts. I highly appreciate the effort put into explaining the thought process of the design, and also many of the actual thoughts. This is good, well-intentioned work, and I now have a better understanding of what zsys is trying to do -- that wasn't so far off my original assumptions. I also see some options to enable more logging, so I can investigate its actual activity further.

But already, I can see some issues along the lines of what I outlined above: zsys is doing more work on more data than seems to be intended. As I now understand from the blog posts, zsys wants to manage, effectively, [br]pool/(BOOT|ROOT|USERDATA). I'm not using USERDATA (other than for root), so there's little under there. But even with the logging I pasted above, I see some discrepancies. Why is it:
I also don't understand what seems to be waking it up so much; as far as I can tell from the blog descriptions, it should at most be the hourly user snapshot timer. Maybe it's just spending most of an hour (or more) at 100% CPU for each wakeup, and that's why it seems to be running all the time like that? I haven't really looked into it yet, but more logging will be useful to do so.
Even if you think zsysd should examine all pools by default, for generality or to avoid encoding / enforcing naming conventions, can it at least be configurable (now that I also know there is a config file) so that admins can manually limit the pools zsys will open, and thus spend CPU and IOPS on?
I disabled the hourly user timer, in case it was taking longer than an hour to process that event. Killed and stopped zsysd and left everything overnight. Current status:
log level doesn't change, logs contain an endless loop of:
So at least one of the failure modes is that it's taking too long to even start; systemd gives up and kills it, likely before it can process any events off the socket, and then this repeats over and over without making any forward progress. Perhaps events are even building up in the socket queue? Even ignoring why opening and enumerating pools is so ludicrously expensive, there are clearly requests that can be serviced without ever touching any pools. So it seems like some reordering of operations could help:
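For illustration, here is a minimal Go sketch of that kind of reordering, deferring the expensive pool scan until a request actually needs it; the names are hypothetical and this is not zsys's real request handling:

```go
package main

import (
	"fmt"
	"sync"
)

// poolState stands in for the expensive result of opening and enumerating pools.
type poolState struct{ datasets []string }

type daemon struct {
	once  sync.Once
	state *poolState
}

// scanPools simulates the expensive open-and-enumerate step, run at most once
// and only when a request actually requires pool data.
func (d *daemon) scanPools() *poolState {
	d.once.Do(func() {
		// ... open pools, walk datasets, load properties ...
		d.state = &poolState{datasets: []string{"rpool/ROOT/ubuntu_x"}}
	})
	return d.state
}

// Version-style requests can be answered without touching any pool.
func (d *daemon) Version() string { return "zsysd sketch 0.0" }

// SaveUserState is an example of a request that genuinely needs pool data,
// so it triggers the scan on first use instead of at daemon startup.
func (d *daemon) SaveUserState(user string) {
	st := d.scanPools()
	fmt.Printf("saving state for %s across %d datasets\n", user, len(st.datasets))
}

func main() {
	d := &daemon{}
	fmt.Println(d.Version()) // served instantly, no pool opened
	d.SaveUserState("me")    // first pool-touching request pays the cost
}
```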
I exported the pools on external USB disks. Now:
The system at this point has rpool and bpool imported, from its internal NVMe SSD. They're not large, or full (and were created recently as part of a fresh install for 21.04):
There are quite a few snapshots, but most of them are outside the
With respect to optimisation opportunities in the previous comment, basically all of them are outside the scope of a user state save, which was the requested operation for which

This is all on an idle system, with plenty of free memory, and no

By way of comparison, and establishing the baseline expectations behind my use of "Inordinate" in the issue title, on this same system, when the other pools are mounted,
And do all of this:
Looking at

All of the above is intended as constructive and instructive comparison. I'm going to keep digging for specifics, as time and interest allow, but it is abundantly clear to me that
To answer some of your questions:
As already said, most of the issue comes from opening the pool, where the Go bindings iterate over all datasets, load their properties, and so on. This does a lot of Go<->C (libzfs) exchanges, each of which makes one syscall per request. This is where the slowness comes from when you have thousands of datasets. We can't limit which datasets are loaded through the API. Consequently, the only escape path here is to do the work in kernel space with a channel program (ZCP/Lua), load everything kernel-side, and return it in a single shot through the API, which eliminates this issue. So, in the non-Go world, the bindings probably behave differently, and this is why you don't see the same issue (the Go binding is really useful though, and big thanks to the developers who created it; we wouldn't have wanted to reimplement everything at first). Then, ZSys itself iterating over all loaded datasets (to be able to work out which are persistent and which are loaded as part of the system) is not an issue by itself (we have already timed it). Those are the reasons why ZFS/ZSys is labelled as experimental in the installer. On limiting the pools we iterate over, we had requests to not have
Yeah. In my normal configuration, the first of these is failing. In the minimal configuration, the second fails (but the daemon starts before the first expires).
Well.. yes, kind of. This is the scenario that makes it happen so much. But in practice, any invocation (including by apt, or the nightly gc) will get into this state. And at least as far as I can see, once it gets there the only way out is to stop the service, because it doesn't make progress past the first startup timeout.
Yes. For example, znapzend implements bindings on top of zfs commands, pulling data (or additional detail) more or less on demand (with some caching). It pays a different tradeoff: sometimes it has to fetch the same data multiple times (but the data is also more current), while mostly fetching only the specific data needed for a given operation (snapshots or children of particular datasets, etc).
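A rough Go sketch of that CLI-wrapping, fetch-on-demand style (this is not how znapzend or zsys is implemented; it only illustrates the tradeoff of re-running commands versus fetching only what one operation needs):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// snapshotsOf asks the zfs command for just the snapshots under one dataset,
// rather than loading every dataset and property in the pool up front.
func snapshotsOf(dataset string) ([]string, error) {
	out, err := exec.Command("zfs", "list", "-H", "-r", "-t", "snapshot",
		"-o", "name", dataset).Output()
	if err != nil {
		return nil, fmt.Errorf("zfs list %s: %w", dataset, err)
	}
	var snaps []string
	for _, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		if line != "" {
			snaps = append(snaps, line)
		}
	}
	return snaps, nil
}

func main() {
	// Fetches may repeat across operations (the tradeoff mentioned above),
	// but each one only covers the data the operation actually needs.
	snaps, err := snapshotsOf("rpool/ROOT")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(len(snaps), "snapshots under rpool/ROOT")
}
```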
Yep. Once it's loaded, zsys reports a whole lot of detail (even about unrelated datasets it should have no interest in), and does so quickly. There's no harm in that if the data is already collected, but the effort of collecting it is the problem. It now seems clear that this effort originates in the dependency chain. Now, I totally empathise with and endorse using a good library; there is absolutely no criticism there! But the limits of that library and the assumptions built into its interface are being hit: zsys isn't asking for this data, won't have any use for it (so the cost of fetching it is wasted), and has no interface to prevent it being fetched (so the cost is unavoidable).

Aside: one possible work-around for me is to set the idle timeout so large that, basically, once
Ultimately the result impacts zsys, and some changes are going to be needed to progress the experiment. Maybe the library authors can help adapt to this use case, maybe zsys needs to implement something different; that implementation and timing discussion is for you.
That's rather unlikely to come from me; alas, I'm no Go programmer. At this point, I'm mostly using this issue to journal my own investigations, and to seek confirmation of findings (which you've been very helpful with, thanks). Hopefully others with similar questions can find them addressed here too. Along the way, if there are specific recommendations that seem worthwhile and agreeable, I'm happy to raise them as separate, smaller feature requests so they're not buried in here, and perhaps other contributors can pick them up. So far, I see two:
We can indeed mark this bug as a duplicate of the perf one and open smaller tasks.
I'm unsure about that one. I really think we first need to get the bindings addressed and have ZSys hold the whole state. There isn't only one client but many of them, for different users, potentially requesting different things; the daemon isn't there just to answer one request most of the time. So limiting what it sees would be a source of new bugs, and we would need to refresh/reconcile/ensure clones are still detected (which needs the whole dataset map anyway). I think performance of the bindings is the main blocking point here, and we should be able to gather all the data in one call.
I know it's only one example, but this one can be addressed within the original namespace, by looking at the
Warning and disclaimer: I'm a total noob when it comes to ZFS; it just happens to be installed on my system. The workaround below will destroy snapshots, and possibly even more stuff that I'm not aware of. Read and use at your own risk. It solved the symptoms for me; for the cause and a permanent fix you need a more experienced doctor. I had the same issue as described above,
Running

So I decided to go all nuclear and run the following command, which removes all snapshots unconditionally
Immediately after, zsysd seemed to start up OK and CPU usage was no longer pegged.
After this, running
#224 seems to be another example.

At this point the issue seems to be well understood by the devs, and needs a rewrite (within the library they're using, or within
Removing the snapshots:
Describe the bug
zsysd spends a lot of time at 100% cpu (seemingly spinning a single core/thread)

To Reproduce
Expected behavior
A small daemon with not much work to do, sitting quietly and updating boot states when packages are upgraded.
Screenshots
Installed versions:
21.04 hirsute (and previous)
🙄 #193
Additional context
Part of the issue is that zsysd is rather .. opaque. I have no idea what it is actually trying to do, so it's difficult to understand what would make it need to try so hard. I have some suspicions, however, about possible contributing factors and unnecessary work:
update-grub has this ridiculous script that tries to evaluate every snapshot of every pool (even importing pools?!?!) and can take hours, and I wonder if there's something similar going on here.

However, none of the above are necessary to produce this condition, even if they maybe make it worse or more frequent. I have seen it on a freshly-installed test machine, with no local data of consequence and no znapzend yet set up. Sometimes zsys just needs to be killed and will otherwise never recover. Sometimes it recovers. Sometimes I nice it down, and it seemingly eventually exits and restarts, because soon enough it's back at 100% cpu and not nice anymore (this is where I assume it's getting triggered by zed). I'm going to look at making systemd start it niced, but TBH I tend to just stop the service sometime after boot when this happens, or just disable it entirely.
Is there a good description somewhere of what zsys is attempting to do (e.g. on each zed event) and why? Perhaps a large amount of this work is unnecessary, or can be configured for narrower scope?