-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zfs-2.3.1 patchset #17097
Open
ixhamza
wants to merge
44
commits into
openzfs:zfs-2.3.1-staging
Choose a base branch
from
truenas:zfs-2.3.1-patchset
base: zfs-2.3.1-staging
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Open
zfs-2.3.1 patchset #17097
+3,233
−2,065
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Added in b1e46f8, but empty, so no point keeping it around. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16931
We should not hardcode 512-byte read size when checking for loader in the boot area before RAIDZ expansion. Disk might be unable to handle that I/O as is, and the code zio_vdev_io_start() handling the padding asserts doing it only for top-level vdev. Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#16942
When building tests with zinject, it can be quite difficult to work out if you're producing the right kind of IO to match the rules you've set up. So, here we extend injection records to count the number of times a handler matched the operation, and how often an error was actually injected (ie after frequency and other exclusions are applied). Then, display those counts in the `zinject` output. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes openzfs#16938
To do a cross-build using only kbuild rather than a full source tree, ARCH= needs to be passed for the kbuild Makefile to find the archspecific Makefile. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16944
Removed three unnecessary spaces in the definition of the sa_attr_reg_t structure to improve code style consistency and adhere to OpenZFS coding standards. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Peng Liu <[email protected]> Closes openzfs#16955
2.3.0 is out now, so make 2.2.x the LTS release. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Melikov <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#16945 Closes openzfs#16948
The warning at the end of the second example in the description section was actually inside the options table. Move the El macro to match what is done in the first section for improved readability. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Rob Norris <[email protected]> Signed-off-by: Alexander Ziaee <[email protected]> Closes openzfs#16962
Reviewed-by: George Amanakis <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Tim Smith <[email protected]> Closes openzfs#16965
The old kstat helper function was barely used, I suspect in part because it was very limited in the kinds of kstats it could gather. This adds new functions to replace it, for each kind of thing that can have stats: global, pool and dataset. There's options in there to get a single stat value, or all values within a group. Most importantly, the interface is the same for both platforms. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
Removes other custom helpers and direct accesses to /proc. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
It's now a simple wrapper, so lets just call kstat direct. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
I'm about to add a new "type", and I need somewhere to put it! Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16947
Injecting a device probe failure is not possible by matching IO types, because probe IO goes to the label regions, which is explicitly excluded from injection. Even if it were possible, it would be awkward to do, because a probe is sequence of reads and writes. This commit adds a new IO "type" to match for injection, which looks for the ZIO_FLAG_PROBE flag instead. Any probe IO will be match the injection record and recieve the wanted error. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#16947
The flag VFCF_FILEREV was recently defined in FreeBSD so that a file system could indicate that it increments va_filerev by one for each change. Since ZFS does do this, set the flag if defined for the kernel being built. This allows the NFSv4.2 server to reply with the correct change_attr_type attribute value. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Rick Macklem <[email protected]> Closed openzfs#16976
Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Alan Somers <[email protected]> Sponsored by: ConnectWise Closes openzfs#16980
Originally openzfs#16856 updated Linux Direct I/O requests to use the new pin_user_pages API. However, it was an oversight that this PR only handled iov_iter's of type ITER_IOVEC and ITER_UBUF. Other iov_iter types may try and use the pin_user_pages API if it is available. This can lead to panics as the iov_iter is not being iterated over correctly in zfs_uio_pin_user_pages(). Unfortunately, generic iov_iter API's that call pin_user_page_fast() are protected as GPL only. Rather than update zfs_uio_pin_user_pages() to account for all iov_iter types, we can simply just call zfs_uio_get_dio_page_iov_iter() if the iov_iter type is not ITER_IOVEC or ITER_UBUF. zfs_uio_get_dio_page_iov_iter() calls the iov_iter_get_pages() calls that can handle any iov_iter type. In the future it might be worth using the exposed iov_iter iterator functions that are included in the header iov_iter.h since v6.7. These functions allow for any iov_iter type to be iterated over and advanced while applying a step function during iteration. This could possibly be leveraged in zfs_uio_pin_user_pages(). A new ZFS test case was added to test that a ITER_BVEC is handled correctly using this new code path. This test case was provided though issue openzfs#16956. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Signed-off-by: Brian Atkinson <[email protected]> Closes openzfs#16956 Closes openzfs#17006
amotin
approved these changes
Feb 26, 2025
robn
approved these changes
Feb 27, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving because I think these are all probably safe, but from my own I note:
- 316988c is a very low form of API break (programs will need to change includes). It's kind of the point, as I want to get the
libzfs_core.h
includes stabilised in 2.3.x before FreeBSD 15, but it's still technically a break in a stable series. - the zinject-related changes in the IO pipeline are probably benign, but it is a change in an important code path to support a dev & test tool that doesn't have much use in a stable series.
Like I say, it's all probably fine, but in a release meeting they're the kind of things I'd bring up :)
…penzfs#16857) Introduced functionality to recursively mount datasets with a new config option `mount_recursively`. Adjusted existing functions to handle the recursive behavior and added tests to validate the feature. This enhances support for managing hierarchical ZFS datasets within a PAM context. Signed-off-by: Jerzy Kołosowski <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
This change will prevent prefetch to perform unnecessary ARC buffer fill when reading from disk. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Jaydeep Kshirsagar <[email protected]> Co-authored-by: Alexander Motin <[email protected]> Closes openzfs#17013
As zios are reexecuted after resume from suspension, their ready and wait states need to be propagated to wait counts on all their parents. It's possible for those parents to have active children passing through READY or DONE, which then end up in zio_notify_parent(), take their parent's lock, and decrement the wait count. Without also taking a lock here, it's possible for an increment race to occur, which leads to either there being no references left (tripping the assert in zio_notify_parent()), or a parent waiting forever for a nonexistent child to complete. To protect against this, we simply take the appropriate zio locks in zio_reexecute() before updating the wait counts. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17016
This is a convenience for filesystems that need the inode of their parent or their own name, as its often complicated to get that information. We don't need those things, so this is just detecting which prototype is expected and adjusting our callback to match. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
According to the upstream change, all callers set it, and all block devices either honoured it or ignored it, so removing it entirely allows a bunch of handling for the "unset" case to be removed, and it becomes effectively implied. We follow suit, and keep setting it for older kernels. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
The current documentation of `zfs destroy` in application to snapshots is particularly difficult to understand. The following changes are made: - Remove circular reference to `zfs destroy` in the documentation of that command. - Remove use of "for example", which implies there are more, undocumented reasons that ZFS may fail to destroy a snapshot immediately. - Mention properties `defer_destroy` and `userrefs`. - Add `zfsprops(8)` to "SEE ALSO" list. - Clarify meaning of `-d` option. Requires-builders: none Signed-off-by: mnrx <[email protected]> Co-authored-by: Alexander Motin <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Amanakis <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
Sponsored-by: Wasabi Technology, Inc. Sponsored-by: Klara, Inc. Signed-off-by: Mateusz Piotrowski <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: George Amanakis <[email protected]> Reviewed-by: George Melikov <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
When you are using large recordsizes in conjunction with raidz, with incompressible data, you can pretty reliably be making 21 MB allocations. Unfortunately, the fragmentation metric in ZFS considers any metaslabs with 16 MB free chunks completely unfragmented, so you can have a metaslab report 0% fragmented and be unable to satisfy an allocation. When using the segment-based metaslab weight, this is inconvenient; when using the space-based one, it can seriously degrade performance. We expand the fragmentation table to extend up to 512MB, and redefine the table size based on the actual table, rather than having a static define. We also tweak the one variable that depends on fragmentation directly. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#16986
Gang blocks have a significant impact on the long and short term performance of a zpool, but there is not a lot of observability into whether they're being used. This change adds gang-specific kstats to ZFS, to better allow users to see whether ganging is happening. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17003
recv_fix_encryption_hierarchy() in its present state goes through all stream filesystems, and for each one traverses the snapshots in order to find one that exists locally. This happens by calling guid_to_name() for each snapshot, which iterates through all children of the filesystem. This results in CPU utilization of 100% for several minutes (for ~1000 filesystems on a Ryzen 4350G) for 1 thread at the end of a raw receive (-w, regardless whether encrypted or not, dryrun or not). Fix this by following a different logic: using the top_fs name, call gather_nvlist() to gather the nvlists for all local filesystems. For each one filesystem, go through the snapshots to find the corresponding stream's filesystem (since we know the snapshots guid and can search with it in stream_avl for the stream's fs). Then go on to fix the encryption roots and locations as in its present state. Avoiding guid_to_name() iteratively makes recv_fix_encryption_hierarchy() significantly faster (from several minutes to seconds for ~1000 filesystems on a Ryzen 4350G). Another problem is the following: in case we have promoted a clone of the filesystem outside the top filesystem specified in zfs send, zfs receive does not fail but returns an error: recv_incremental_replication() fails to find its origin and errors out with needagain=1. This results in recv_fix_hierarchy() not being called which may render some children of the top fs not mountable since their encryption root was not updated. To circumvent this make recv_incremental_replication() silently ignore this error. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: George Amanakis <[email protected]> Closes openzfs#16929
"DESTDIR=/path/to/target/root/ make install" may fail when installing to a root that contains an existing lib/modules structure. When run as root we may even affect the wrong kernel (the build system's one, or, if running a different version, some other directory in /lib/modules, but not the desired one installed in DESTDIR). Add a missing reference to the INSTALL_MOD_PATH root when calling "depmod" during "make install" Also add a switch "DONT_DELETE_MODULES_FILES=1" that skips the removal of files named "modules.*" prior to running depmod. Signed-off-by: Christian Kohlschütter <[email protected]> Closes openzfs#16994 Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
The purpose of no-op is to simulate a failure between a device cache and its permanent store. We still want it to go through the queue and respond in the same way to everything else. So, inject "success" as the very last thing, and then move on to VDEV_IO_DONE to be dequeued and so any followup work can occur. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17029
For zfs_rename, after the dataset name is successfully updated, the dataset handle that was passed to zfs_rename, still contains the old name, due to which, the dataset handle becomes invalid. The following operations performed using this handle result in error since the dataset with old name cannot be found anymore. changelist_rename does update the names in dataset handles, but those are temporary handles that were created during changelist_gather. The original handle that was used to call zfs_rename is not updated. We should update the name in original ZFS handle after the IOCTL for rename returns success for the operation. Signed-off-by: Umer Saleem <[email protected]> Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
skc->skc_name also needs to be freed in an error path. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Signed-off-by: Vandana Rungta <[email protected]> Closes openzfs#17041
Linux 6.12 has conflicting range_tree_{find,destroy,clear} symbols. Signed-off-by: Ivan Volosyuk <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Rob Norris <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]>
Since we are calculating a free space fragmentation, we should weight metaslabs by the amount of their free space, not a full size. Fragmentation of full metaslabs may not matter in presence empty ones. The old algorithm did not differentiate metaslabs having only one free 4KB block from metaslabs having 50% of space free in 4KB blocks, reporting higher fragmentation. While there, move metaslab_group_alloc_update() call after setting mg_fragmentation, otherwise the effect may be delayed by one TXG. Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Reviewed-by: Paul Dagnelie <[email protected]> Reviewed-by: Tony Nguyen <[email protected]> Reviewed-by: Tony Hutter <[email protected]>
Kernel & userspace specifics are in zfs_file_os.c, so there's no particular reason these have to be separate. The one platform-specific part is in the Linux kernel part, to offload flushes to a taskq if we're already inside a filesystem transaction. This would be normally be an unsatisfying wart, but I'm intending to remove this shortly, so I'm content to leave it gated for the moment. Reviewed-by: Allan Jude <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Signed-off-by: Rob Norris <[email protected]>
Need to use arc_free_data_abd to free abd type buffer. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Chunwei Chen <[email protected]> Closes openzfs#17079
zfs_file_fsync() and zfs_file_deallocate() are both blocking ops, so the zio_taskq thread is active and blocked both while waiting for the IO call and then while calling zio_execute() for the next stage. This is a particular issue for FLUSH, as the z_flush_iss queue typically only has one thread; multiple flushes arriving at once can cause long delays if the underlying fsync() response is particularly slow. To fix this, we dispatch both FLUSH and TRIM to the z_vdev_file taskq, just as we do for reads and writes. Further, we return all results through zio_interrupt(), so neither the issue nor the file taskqs are blocked. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Tony Hutter <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17064
It's included so it's effectively already part of it, but it's not always installed as a userspace header, making zfs.h effectively useless. Might as well just combine it. Sponsored-by: https://despairlabs.com/sponsor/ Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Tony Hutter <[email protected]> Signed-off-by: Rob Norris <[email protected]> Close openzfs#17066
If the timing is unfortunate, the pool can suspend just as we're failing because it didn't suspend. If we don't resume the pool, we hang trying to destroy it. Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#17054
Before this change zfs_metaslab_switch_threshold tunable switched metaslabs each time ones index reduced by two (which means biggest contiguous chunk reduced to 1/4). It is a good idea to balance metaslabs fragmentation. But for empty metaslabs (having power- of-2 sizes) this means switching when they get just below the half of their capacity. Inspection with zdb after filling new pool to half capacity shown most of its metaslabs filled to half capacity. I consider this sub-optimal for pool fragmentation in a long run. This change blocks the metaslabs switching if most of the metaslab free space (15/16) is represented by a single contiguous range. Such metaslab should not be considered fragmented until it actually fail some big allocation. More contiguous filling should improve data locality and increase time before previously filled and partially freed metaslab is touched again, giving it more time to free more contiguous chunks for lower fragmentation. It should also slightly reduce spacemap traffic. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Paul Dagnelie <[email protected]> Signed-off-by: Alexander Motin <[email protected]> Sponsored by: iXsystems, Inc. Closes openzfs#17081
Reviewed-by: Ameer Hamza <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: SHENGYI HONG <[email protected]> Closes openzfs#17088
Don't try to get mg of hole vdev in removal Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Paul Dagnelie <[email protected]> Closes openzfs#17080
In l2arc_evict(), the config lock may be acquired in reverse order (e.g., first the config lock (writer), then a hash lock) unlike in arc_read() during scenarios like L2ARC device removal. To avoid deadlocks, if the attempt to acquire the config lock (reader) fails in arc_read(), release the hash lock, wait for the config lock, and retry from the beginning. Reviewed-by: Alexander Motin <[email protected]> Signed-off-by: Ameer Hamza <[email protected]> Closes openzfs#17071
`zpool create` won't let you use relative paths to disks. This is annoying when you want to do: zpool create tank ./diskfile But have to do.. zpool create tank `pwd`/diskfile This fixes it. Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Alexander Motin <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]> Signed-off-by: Tony Hutter <[email protected]> Closes openzfs#17042
Update the META file to reflect compatibility with the 6.13 kernel. Signed-off-by: Tony Hutter <[email protected]> Reviewed-by: Rob Norris <[email protected]> Reviewed-by: Tino Reichardt <[email protected]> Reviewed-by: Brian Behlendorf <[email protected]>
6672594
to
acfd651
Compare
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Motivation and Context
Initial proposed patchset for zfs-2.3.1.
Description
Bug fixes, optimizations, ZTS updates, zinject updates, code cleanup.
Nearly all changes pulled from master on top of the 2.3.0 release tag, excluding #16857 unless @jkolo confirms it is safe to be included in 2.3.1.
How Has This Been Tested?
Clean backports from master. Will be retested by the CI.
Types of changes