
Fast dedup interacts negatively with variable copies + ganging #17070

Open
pcd1193182 opened this issue Feb 19, 2025 · 2 comments
Labels
Type: Defect Incorrect behavior (e.g. crash, hang)

Comments

@pcd1193182
Contributor

pcd1193182 commented Feb 19, 2025

Advisory: If your pool has SPA_FEATURE_FAST_DEDUP active, then until this issue is resolved I would recommend against using dedup in contexts where you might have different references to the same data with varying values of the copies property. To avoid the issue, either disable dedup on datasets with higher copies values, or keep copies at the same value on all deduped datasets. This issue only presents if ganging occurs, and unless zfs_metaslab_try_hard_before_gang is set it is hard to predict exactly when that will happen; if you are confident that your pool will not fill up to the point where ganging occurs, this is unlikely to be an issue.

This issue was introduced in (I believe) f4aeb23f5 (#15893), which first appeared in a release in 2.3.

System information

Identified via zloop on ubuntu 24.04, but verified by code inspection to be applicable to all platforms.

Describe the problem you're observing

With the advent of fast dedup, there are no longer separate dedup tables for different copies values. Instead, there is logic that adds DVAs to an existing dedup table entry when a new write needs more copies. However, this interacts poorly with ganging. There are several problematic cases; these are a few of them:

  1. A block is stored in the DDT with copies = 1, unganged. A new entry is added that wants copies = 2. The new allocation gangs, which results in a mixed gang/contiguous BP. This is illegal in ZFS, and will cause an assertion failure on debug bits.
  2. A block is stored in the DDT with copies = 1, but it ganged, so there are two DVAs in the entry. A new write comes along that wants copies = 2. The new allocation sees two DVAs in the entry and is satisfied, even though those entries don't have 2 copies of the gang leaves, just the gang headers.
  3. A block is stored in the DDT with copies = 1, but it ganged, so there are two DVAs in the entry. A new write comes along that wants copies = 3. The new write allocates a single non-ganged entry, and adds it to the dedup entry. This also results in a mixed gang/contiguous BP, which is illegal in ZFS.
  4. A block is stored in the DDT with copies = 1, but it ganged, so there are two DVAs in the entry. A new write comes along that wants copies = 3. The new write gangs, and does one allocation to fill the last slot, but it only has copies = 1, resulting in effectively 2 levels of redundancy instead of the desired 3.

For case 1, we could just cause the new dedup write to fail and make it try again without ganging. This also works for case 3. Ideally, we could detect this in advance and force them to do the correct kind of allocation, though that just turns case 3 into case 4.

Cases 2 and 4 are tricky; the fundamental problem is that ganging consumes more DVAs without providing more redundancy, and you can't tell whether the redundancy is 1 or 2 without actually reading the gang tree. And even if we know the redundancy we already have, sometimes we can't increase it because some of the DVA slots are already taken. We might need to replace the existing DDT entry with a new one, but then any existing duplicate references would have problems. Or we could cause any potentially-problematic case to just abandon dedup, but that's not very satisfying. The only other option would be to reintroduce the copies tables in some limited fashion to try to handle this case.

Describe how to reproduce the problem

sudo zpool create testpool $DISKS
sudo zfs create -o recordsize=64k -o dedup=on testpool/fs1
sudo zfs create -o recordsize=64k -o dedup=on -o copies=3 testpool/fs2
echo 1 | sudo tee /sys/module/zfs/parameters/zfs_dedup_log_txg_max
sudo dd if=/dev/urandom of=/testpool/fs1/f1 bs=64k count=1
sudo zpool sync testpool
sudo zpool sync testpool
sudo zpool sync testpool
echo 20000 | sudo tee /sys/module/zfs/parameters/metaslab_force_ganging
echo 100 | sudo tee /sys/module/zfs/parameters/metaslab_force_ganging_pct
sudo dd if=/testpool/fs1/f1 of=/testpool/fs2/f1 bs=64k count=1
sudo zpool sync testpool # This will panic, case 1
sudo rm /testpool/fs*/f1
sudo zpool sync testpool
sudo dd if=/dev/urandom of=/testpool/fs1/f1 bs=64k count=1
sudo zpool sync testpool
sudo zpool sync testpool
sudo zpool sync testpool
sudo zdb -D testpool
echo 2000000 | sudo tee /sys/module/zfs/parameters/metaslab_force_ganging
echo 0 | sudo tee /sys/module/zfs/parameters/metaslab_force_ganging_pct
sudo dd if=/testpool/fs1/f1 of=/testpool/fs2/f1 bs=64k count=1
sudo zpool sync testpool # This will panic, case 3

Include any warning/errors/backtraces from the system logs

This bug can trip the following assertion: BP_COUNT_GANG(zio->io_bp) == 0 || (BP_COUNT_GANG(zio->io_bp) == BP_GET_NDVAS(zio->io_bp))

@pcd1193182 pcd1193182 added the Type: Defect Incorrect behavior (e.g. crash, hang) label Feb 19, 2025
@amotin
Member

amotin commented Feb 20, 2025

It is quite a can of worms. Meanwhile, if there are no other quick ideas, we could at least block adding copies if the pointer already present in the DDT is ganged, or if we can't allocate the additional copy without ganging. That way we would still have a correct pool, even if the number of copies could be lower than requested in some cases. Or we could block dedup in the conflict cases we can identify.

@pcd1193182
Contributor Author

Yeah, I think implementing the fix for the cases that cause invalid BPs should be pretty quick; I mostly have it ready. Once that's in, the advisory is at least downgraded to "hey, deduped data may only fully respect the first copies value it gets, if ganging is involved", which is a much less scary situation. I can work on that and get a review up relatively soon.
