
2.3.4 staging prep #17595


Open · wants to merge 28 commits into base: zfs-2.3.4-staging

Conversation

@amotin (Member) commented Aug 5, 2025

Most of these commits we've already merged into TrueNAS; a few others we'd like to, if there are no objections.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Quality assurance (non-breaking change which makes the code more robust against bugs)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
  • Documentation (a change to man pages or other documentation)

Checklist:

amotin and others added 21 commits August 5, 2025 12:14
ZIL introduced dependencies between its write ZIOs to permit flush
deferral, where we flush vdev caches only once all the write ZIOs have
completed.  But it was recently spotted that this serializes not only
ZIO completion handling, but also their ready stage.  It means the ZIO
pipeline can't calculate checksums for the following ZIOs until all
the previous ones are checksummed, even though it is not required.  On
systems where the memory throughput of a single CPU core is limited,
this creates a single-core CPU bottleneck, which is difficult to see
due to the ZIO pipeline design with many taskqueue threads.

While it would be great to bypass the ready stage waits, it would
require changes to ZIO code, and I haven't found a clean way to do
it.  But I've noticed that we don't need any dependency between
the write ZIOs if the previous one has some waiters, which means
it won't defer any flushes and will itself act as a barrier for the
earlier ones.

Bypassing it won't help large single-threaded writes, since in that
case all the write ZIOs except the last won't have waiters, and so
will remain dependent.  But in that case the ZIO processing might
not be a bottleneck, since there will be only one thread populating
the write buffers, and that will likely be the bottleneck.

But bypassing the ZIO dependency on multi-threaded write workloads
really allows them to scale beyond the checksumming throughput of
one CPU core.

My tests writing 12 files on the same dataset from 12 threads with
1MB blocks, on a pool with 4 striped NVMes as SLOGs and a Xeon
Silver 4114 CPU, show total throughput increasing from 4.3GB/s to
8.5GB/s and SLOG busy rising from ~30% to ~70%.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17458
This PR condenses the FDT dedup log syncing into a single sync
pass. This reduces the overhead of modifying indirect blocks for the
dedup table multiple times per txg. In addition, changes were made to
the formula for how much to sync per txg. We now also consider the
backlog we have to clear, to prevent it from growing too large, or
remaining large on an idle system.

Sponsored-by: Klara, Inc.
Sponsored-by: iXsystems, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
Authored-by: Don Brady <[email protected]>
Authored-by: Paul Dagnelie <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17038
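
For illustration only, a minimal sketch of the kind of backlog-aware pacing described in the commit above; the function name, the 1% fraction, and the MAX-based formula are assumptions for this example, not the actual formula from the change.

```c
/*
 * Hypothetical pacing sketch (not the code from this commit): flush at
 * least enough dedup-log entries per txg to keep up with new entries,
 * plus a fixed fraction of the accumulated backlog, so the backlog
 * shrinks even on an idle system.  Names and the 1% fraction are
 * assumptions for illustration.
 */
static uint64_t
ddt_log_flush_target_sketch(uint64_t ingest_per_txg, uint64_t backlog)
{
	uint64_t target = ingest_per_txg;	/* keep up with new entries */

	target = MAX(target, backlog / 100);	/* and drain ~1% of backlog */
	return (target);
}
```
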
- Don't drop the L2ARC header if there are more buffers in this header.
Since we leave them the ARC header, leave them the L2ARC header also.
Honestly, we are not required to drop it even if there are no other
buffers, but then we'd need to allocate a separate header for it,
which we might drop soon anyway if the old block is really deleted.
Multiple buffers in a header likely mean active snapshots or dedup,
so we know that the block in L2ARC will remain valid.  It might be
rare, but why not?
- Remove some impossible assertions and conditions.

Reviewed-by: Tony Hutter <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17126
The `scn_min_txg` field can now be used not only for resilver.  Instead
of checking `scn_min_txg` to determine whether it's a resilver or
a scrub, simply check which function is defined.  Thanks to this
change, a scrub_finish event is generated when performing a scrub
from the saved txg.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Mariusz Zaborski <[email protected]>
Closes openzfs#17432
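
For illustration, a trivial sketch of the check described in the commit above, assuming the scan function enum from the ZFS headers; the helper name is hypothetical.

```c
static boolean_t
scan_is_resilver_sketch(pool_scan_func_t func)
{
	/* Decide by the scan function rather than by scn_min_txg. */
	return (func == POOL_SCAN_RESILVER);
}
```
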
dbuf_verify(): Doesn't need the lock, since we only compare pointers.

dbuf_findbp(): Doesn't need the lock, since aside from an unneeded
assert we only produce the pointer, but don't dereference it.

dnode_next_offset_level(): When working on the top level of indirection
we should lock the dnode buffer's db_rwlock, since it is our parent.  If
the dnode has no buffer, then it is the meta-dnode or one of the quota
dnodes, and we should lock the dataset's ds_bp_rwlock instead.

Reviewed-by: Alan Somers <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17441
There are still a variety of bugs involving the vdev_nonrot property
that will cause problems if you try to run the test suite with
segment-based weighting disabled, along with other issues in the
weighting code.  The parents' nonrot property needs to be updated when
children are added.  When vdevs are expanded and more metaslabs are
added, the weights have to be recalculated (since the number of
metaslabs is an input to the LBA bias function).  When opening, faulted
or unopenable children should not be considered when determining
whether a vdev is nonrot (since the nonrot property is determined
during a successful open, this can cause false negatives).  And draid
spares need to have the nonrot property set correctly.

Sponsored-by: Eshtek, creators of HexOS
Sponsored-by: Klara, Inc.
Reviewed-by: Allan Jude <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17469
Use statx to verify that path-based unmounts proceed only if the
mountpoint reported by statx matches the MNTTAB entry reported by
libzfs, aborting the operation if they differ.  This aligns
`zfs umount /path` behavior with `zfs umount dataset`.

Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17481
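
For illustration, a minimal user-space sketch of the kind of mount check described above (assuming Linux 5.8+ and a matching glibc for STATX_MNT_ID); the helper name and error handling are simplified assumptions, not the libzfs implementation.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Sketch: compare the mount ID of the user-supplied path with the mount
 * ID of the mountpoint recorded in the mount table, and refuse the
 * unmount if they differ.
 */
static int
same_mount_sketch(const char *user_path, const char *mnttab_path)
{
	struct statx a, b;

	if (statx(AT_FDCWD, user_path, AT_NO_AUTOMOUNT, STATX_MNT_ID, &a) != 0 ||
	    statx(AT_FDCWD, mnttab_path, AT_NO_AUTOMOUNT, STATX_MNT_ID, &b) != 0)
		return (-1);

	/* Same mount ID means both paths refer to the same mount. */
	return (a.stx_mnt_id == b.stx_mnt_id ? 0 : 1);
}
```
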
Currently, after a failed allocation, the metaslab code recalculates the
weight for a metaslab. However, for space-based metaslabs, it uses the
maximum free segment size instead of the normal weighting
algorithm. This is presumably because the normal metaslab weight is
(roughly) intended to estimate the size of the largest free segment, but
it doesn't do that reliably at most fragmentation levels. This means
that recalculated metaslabs are forced to a weight that isn't really
using the same units as the rest of them, resulting in undesirable
behaviors. We switch this to use the normal space-weighting function.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Sponsored-by: Wasabi Technology, Inc.
Sponsored-by: Klara, Inc.
Closes openzfs#17531
Under parallel workloads the ZIL may delay writes of open LWBs that
are not full enough.  On suspend we do not expect anything new to
appear, since zil_get_commit_list() will not let it pass, only
returning the TXG number to wait for.  But I suspect that waiting for
the TXG commit without having the last LWB issued may not wait for
its completion, resulting in the panic described in openzfs#17509.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17521
On Linux, when doing path lookup with LOOKUP_RCU, dentry and inode can
be dereferenced without refcounts and locks. For this reason, dentry and
inode must only be freed after RCU grace period.

However, ZFS currently frees the inode synchronously in
zfs_inode_destroy(), and we can't use the GPL-only call_rcu() in ZFS
directly.  Fortunately, on Linux 5.2 and later, if we define
sops->free_inode(), the kernel will do the call_rcu() for us.

This issue may be triggered more easily with init_on_free=1 boot
parameter:

BUG: kernel NULL pointer dereference, address: 0000000000000020
RIP: 0010:selinux_inode_permission+0x10e/0x1c0
Call Trace:
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? show_trace_log_lvl+0x1be/0x2d9
 ? security_inode_permission+0x37/0x60
 ? __die_body.cold+0x8/0xd
 ? no_context+0x113/0x220
 ? exc_page_fault+0x6d/0x130
 ? asm_exc_page_fault+0x1e/0x30
 ? selinux_inode_permission+0x10e/0x1c0
 security_inode_permission+0x37/0x60
 link_path_walk.part.0.constprop.0+0xb5/0x360
 ? path_init+0x27d/0x3c0
 path_lookupat+0x3e/0x1a0
 filename_lookup+0xc0/0x1d0
 ? __check_object_size.part.0+0x123/0x150
 ? strncpy_from_user+0x4e/0x130
 ? getname_flags.part.0+0x4b/0x1c0
 vfs_statx+0x72/0x120
 ? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
 __do_sys_newlstat+0x39/0x70
 ? __x64_sys_ioctl+0x8d/0xd0
 do_syscall_64+0x30/0x40
 entry_SYSCALL_64_after_hwframe+0x62/0xc7

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by: Chunwei Chen <[email protected]>
Co-authored-by: Chunwei Chen <[email protected]>
Closes openzfs#17546
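
For context, an illustrative kernel-side sketch of the sops->free_inode() mechanism described above (not the actual OpenZFS patch); the structure and cache names are hypothetical.

```c
#include <linux/fs.h>
#include <linux/slab.h>

/* Hypothetical per-filesystem node embedding the VFS inode. */
struct example_node {
	struct inode	en_inode;
};

static struct kmem_cache *example_node_cache;

static void
example_free_inode(struct inode *ip)
{
	/*
	 * With ->free_inode() defined, the kernel (5.2+) invokes this via
	 * call_rcu(), so RCU-walk path lookups never touch freed memory.
	 */
	kmem_cache_free(example_node_cache,
	    container_of(ip, struct example_node, en_inode));
}

static const struct super_operations example_super_ops = {
	.free_inode	= example_free_inode,	/* Linux >= 5.2 */
};
```
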
During hotplug REMOVED events, devid matching fails for partition-based
spares because devid information is not stored in the pool config for
partitioned devices.  However, when the devid is populated by the
hotplug event, the original code skipped the search logic entirely,
bypassing vdev_guid matching and resulting in wrong device type
detection that caused spares to be incorrectly identified as l2arc
devices.  Additionally, fix zfs_agent_iter_pool() to use the return
value from zfs_agent_iter_vdev(), which was previously ignored, instead
of relying on search parameters.  Also add a pool_guid optimization to
enable targeted pool searching when pool_guid is available.

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Ameer Hamza <[email protected]>
Closes openzfs#17545
Currently, when reading compressed blocks with -R and decompressing
them with the :d option while specifying lsize, which is normally
bigger than psize for compressed blocks, the checksum is calculated on
the decompressed data.  But that makes no sense, since ZFS always
calculates the checksum on the physical, i.e. compressed, data.  So
reading the same block produces different checksum results depending
on whether we decompress it or not, which, again, makes no sense.

Fix: use psize instead of lsize when calculating the checksum so that
it is always calculated on the physical block, no matter whether it
was compressed or not.

Signed-off-by: Andriy Tkachuk <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Closes openzfs#17547
Update the default FICLONE and FICLONERANGE ioctl behavior to wait
on dirty blocks.  While this does remove some control from the
application, in practice ZFS is better positioned to do the optimal
thing rather than immediately forcing a TXG sync.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: George Melikov <[email protected]>
Signed-off-by: Brian Behlendorf <[email protected]>
Closes openzfs#17455
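
For reference, a minimal user-space example of the ioctl affected by the change above; with the new default, ZFS waits on dirty blocks instead of leaving that to the application. Paths are placeholders.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>

int
main(void)
{
	int src = open("/tank/src.dat", O_RDONLY);
	int dst = open("/tank/dst.dat", O_WRONLY | O_CREAT, 0644);

	if (src < 0 || dst < 0)
		return (1);

	/* Ask the kernel to reflink the whole of src into dst. */
	if (ioctl(dst, FICLONE, src) < 0) {
		perror("FICLONE");
		return (1);
	}
	return (0);
}
```
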
When we're passivating a metaslab group we start by passivating the 
metaslabs that have been activated for each of the allocators.  To do 
that, we need to provide a weight. However, currently this erroneously 
always uses a segment-based weight, even if segment-based weighting is 
disabled.

Use the normal weight function, which will decide which type of weight 
to use.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17566
While booting, only the needed 256KiB benchmarks are run now.

The full benchmark of all checksums is only run when requested via:
- Linux: cat /proc/spl/kstat/zfs/chksum_bench
- FreeBSD: sysctl kstat.zfs.misc.chksum_bench

Reported by: Lahiru Gunathilake <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Tino Reichardt <[email protected]>
Co-authored-by: Colin Percival <[email protected]>
Closes openzfs#17563
Closes openzfs#17560
Sponsored-by: Klara, Inc.
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Igor Ostapenko <[email protected]>
Closes openzfs#17581
…17073)

The redundant_metadata setting in ZFS allows users to trade resilience
for performance and space savings.  This applies to all data and
metadata blocks in ZFS, with one exception: gang blocks.  Gang blocks
currently just take the copies property of the IO being ganged and, if
it's 1, set it to 2.  This means that we always make at least two
copies of a gang header, which is good for resilience.  However, if
users care more about performance than resilience, their gang blocks
will be even more of a penalty than usual.

We add logic to calculate the number of gang header copies directly,
and store it as a separate IO property.  This is stored in the IO
properties and not calculated when we decide to gang, because by that
point we may not have easy access to the relevant information about
what kind of block is being stored.  We also check the
redundant_metadata property when doing so, and use that to decide
whether to store an extra copy of the gang headers compared to the
underlying blocks.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Paul Dagnelie <[email protected]>
Co-authored-by: Paul Dagnelie <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
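
An illustrative sketch only (not the code from the commit above) of how a gang-header copy count could be derived from the child block's copies and the dataset's redundant_metadata setting; the function name is hypothetical, while the constants come from the ZFS headers.

```c
static int
gang_header_copies_sketch(int copies, int redundant_metadata)
{
	/*
	 * Historically a gang header always got at least two DVAs.  If the
	 * user chose to give up metadata redundancy, keep the header at the
	 * same redundancy as the data it points to instead of adding one.
	 */
	if (redundant_metadata == ZFS_REDUNDANT_METADATA_NONE)
		return (copies);
	return (MIN(copies + 1, SPA_DVAS_PER_BP));
}
```
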
Missed in openzfs#17073, probably because that PR was branched before openzfs#17001
was landed and never rebased.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.

Signed-off-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Tony Hutter <[email protected]>
As discussed in the comments of PR openzfs#17004, you can theoretically run
into a case where a gang child has more copies than the gang header,
which can lead to some odd accounting behavior (and even trip a
VERIFY). While the accounting code could be changed to handle this, it
fundamentally doesn't seem to make a lot of sense to allow this to
happen. If the data is supposed to have a certain level of reliability,
that isn't actually achieved unless the gang_copies property is set to
match it.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Paul Dagnelie <[email protected]>
Closes openzfs#17484
Loss of one indirect block of the meta-dnode likely means loss of
the whole dataset.  That is worse than the loss of one file that the
man page promises, and in my opinion is not much better than "none"
mode.

This change restores redundancy of the meta-dnode indirect blocks,
while at the same time still correcting expectations in the man page.

Reviewed-by: Akash B <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17339
@amotin mentioned this pull request Aug 5, 2025
@shodanshok (Contributor) commented:

Maybe #17542 also?

@robn (Member) commented Aug 7, 2025

I'm fine with these also going in. I don't think it's the whole list of course, but we usually do a few rounds of these PRs. I'll start daily driving this set and will put together my own list of wants soon (but for sure #17542 is critical imo).

@satmandu (Contributor) commented Aug 7, 2025

Is #17561 too new? Getting as many of the PRs that handle memory pressure issues as are deemed safe into a point version update would be nice.

@satmandu (Contributor) commented Aug 7, 2025

Additionally, having #17578 to indicate official kernel 6.16 compatibility might make sense...

@satmandu (Contributor) commented Aug 7, 2025

Also, dare I suggest that it would be nice to get the rewrite functionality available at some point soon in an official tagged release? Is it too early to suggest a 2.4.0 staging series?

@amotin (Member, Author) commented Aug 7, 2025

@satmandu The plan is to start 2.4 branching from master in the near future: #17310.

shodanshok and others added 4 commits August 7, 2025 12:11
The Linux kernel shrinker, when running in the context of the
null/root memcg, does not scan dentry and inode caches added by a task
running in a non-root memcg.  For ZFS this means that the dnode cache
routinely overflows, evicting valuable meta/data and putting additional
memory pressure on the system.

This patch restores zfs_prune_aliases as a fallback when the kernel
shrinker does nothing, enabling ZFS to actually free dnodes.  Moreover,
it (indirectly) calls arc_evict when dnode_size > dnode_limit.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Gionatan Danti <[email protected]>
Closes openzfs#17487
Closes openzfs#17542
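
A hypothetical sketch of the fallback logic described above, not the patch itself; example_generic_shrink() is a placeholder for the kernel's superblock shrinker, and the zfs_prune_aliases() signature is approximated.

```c
static void
example_prune(struct super_block *sb, unsigned long nr_to_scan)
{
	unsigned long freed;

	/* Let the kernel's generic dentry/inode shrinker try first. */
	freed = example_generic_shrink(sb, nr_to_scan);

	/*
	 * Fallback: reclaim running in the root memcg skips objects charged
	 * to other memcgs, so prune ZFS aliases directly so dnodes can
	 * actually be freed.
	 */
	if (freed == 0)
		zfs_prune_aliases(sb->s_fs_info, nr_to_scan);
}
```
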
Avoid calling dbuf_evict_one() from memory reclaim contexts (e.g. Linux
kswapd, FreeBSD pagedaemon). This prevents deadlock caused by reclaim
threads waiting for the dbuf hash lock in the call sequence:
dbuf_evict_one -> dbuf_destroy -> arc_buf_destroy

Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Alexander Motin <[email protected]>
Signed-off-by: Kaitlin Hoang <[email protected]>
Closes openzfs#17561
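
A hypothetical guard illustrating the idea in the commit above, not the actual patch; current_is_kswapd() covers only the Linux kswapd case and is used here purely as an example.

```c
	/*
	 * Hypothetical guard: skip direct dbuf eviction when running in a
	 * memory-reclaim thread, so reclaim never blocks on the dbuf hash
	 * lock via dbuf_evict_one() -> dbuf_destroy() -> arc_buf_destroy().
	 */
	if (!current_is_kswapd())
		dbuf_evict_one();
```
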
Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Rob Norris <[email protected]>
Closes openzfs#17443
This allows rewriting the content of the specified file(s) as-is,
without modifications, but at a different location, and with different
compression, checksum, dedup, copies and other parameter values.  It is
faster than read plus write, since it does not require copying data to
user space.  It is also faster for sync=always datasets, since without
data modification it does not require ZIL writes.  Also, since it is
protected by normal range locks, it can be done under any other load.
It also does not affect the file's modification time or other
properties.

Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Reviewed-by: Tony Hutter <[email protected]>
Reviewed-by: Rob Norris <[email protected]>
@amotin (Member, Author) commented Aug 7, 2025

#17542 -- Included.
#17561 -- Sure it is new, but should be safe.
#17578 -- It is not even in master. I've included #17443 as its prerequisite, but according to Brian more work or at least attention might be needed there.
#17246 -- Included the logical rewrite with one follow-up.

It is not right, but there are a few examples where a TX is aborted
after being assigned in case of error.  To handle it better on
production systems, add extra cleanup steps.

While here, replace a couple of dmu_tx_abort() calls in simple cases.

Reviewed-by: Rob Norris <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Reviewed-by: Igor Kozhukhov <[email protected]>
Signed-off-by:	Alexander Motin <[email protected]>
Sponsored by:	iXsystems, Inc.
Closes openzfs#17438
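
For reference, the standard DMU transaction pattern the commit above refers to: dmu_tx_abort() is only correct before a successful dmu_tx_assign(); once assigned, the tx must be committed. The os/object/offset/length values below are placeholders.

```c
	dmu_tx_t *tx = dmu_tx_create(os);
	dmu_tx_hold_write(tx, object, offset, length);

	error = dmu_tx_assign(tx, TXG_WAIT);	/* flag name may differ by version */
	if (error != 0) {
		dmu_tx_abort(tx);	/* not yet assigned: abort is correct */
		return (error);
	}

	/* ... apply the change ... */

	dmu_tx_commit(tx);	/* assigned: must commit, never abort */
```
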
@amotin force-pushed the zfs-2.3.4-staging branch from 62502e7 to 22eb2bd on August 7, 2025 16:41
@AttilaFueloep (Contributor) commented:

I'd suggest #17456 and #17541. It's just a build change and a no-op on Linux kernels not defining CONFIG_OBJTOOL_WERROR. CentOS Stream 9/10 already need them to not fail the build, and I'd expect RHEL and derivatives to need them within the support window of 2.3 as well.

Linux 5.16 by default fails the build on objtool warnings. We have
known and understood objtool warnings we can't fix without
involving Linux maintainers.

To work around this we introduce an objtool wrapper script which
removes the `--Werror` flag.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Attila Fülöp <[email protected]>
Closes openzfs#17456
Older kernel versions run make outside of the build directory. This
works since all paths are absolute. Relative paths will fail in such
a scenario.

Use an absolute path to the objtool wrapper as well, since the
relative path breaks the build on older kernels.

Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Attila Fülöp <[email protected]>
Closes openzfs#17541
@amotin (Member, Author) commented Aug 7, 2025

> I'd suggest #17456 and #17541.

Included.

@stuartthebruce mentioned this pull request Aug 7, 2025
@amotin added the "Status: Code Review Needed (Ready for review and testing)" label Aug 8, 2025