Commit Graph

2486 Commits

Author SHA1 Message Date
406fbe785e gbmap: Reverted attempt at limiting in-use zeroing to unknown window
See previous commit for motivation
2025-10-23 23:58:49 -05:00
d8f3346f13 gbmap: Attempted to limit in-use zeroing to unknown window
Unfortunately this doesn't work and will need to be ripped-out/reverted.

---

The goal was to limit in-use -> free zeroing to the unknown window, which
would allow the gbmap to be updated in-place, saving the extra RAM we
need to maintain the extra gbmap snapshot during traversals and
lfs3_alloc_zerogbmap.

Unfortunately this doesn't seem to work. If we limit zeroing to the
unknown window, blocks can get stuck in the in-use state as long as they
stay in the known window. Since the gbmap's known window encompasses
most of the disk, this can cause the allocators to lock up and be unable
to make progress.

So will revert, but committing the current implementation in case we
revisit the idea.

As a plus, reverting avoids needing to maintain this unknown window
logic, which is tricky and error-prone.
2025-10-23 23:57:53 -05:00
12874bff76 gbmap: Added gc_repoplookahead_thresh and gc_repopgbmap_thresh
To allow relaxing when LFS3_I_REPOPLOOKAHEAD and LFS3_I_REPOPGBMAP will
be set, potentially reducing gc workload after allocating only a couple
blocks.

The relevant cfg comments have quite a bit more info.

Note -1 (not the default, 0; maybe we should explicitly flip this?)
restores the previous behavior of setting these flags on the first
block allocation.
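
Roughly, I expect usage to look like this (a sketch; the struct name,
units, and surrounding config are my best guess, not authoritative):

  const struct lfs3_cfg cfg = {
      // ... other config ...

      // don't set LFS3_I_REPOPLOOKAHEAD/LFS3_I_REPOPGBMAP until this
      // many blocks have been allocated, -1 restores the old
      // first-allocation behavior
      .gc_repoplookahead_thresh = 128,
      .gc_repopgbmap_thresh = 128,
  };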

---

Also tweaked gbmap repops during gc/traversals to _not_ try to repop
unless LFS3_I_REPOPGBMAP is set. We probably should have done this from
the beginning since repopulating the gbmap writes to disk and is
potentially destructive.

Adds code, though hopefully we can claw this back with future config
rework:

                 code          stack          ctx
  before:       37176           2352          684
  after:        37208 (+0.1%)   2352 (+0.0%)  688 (+0.6%)

                 code          stack          ctx
  gbmap before: 40024           2368          848
  gbmap after:  40120 (+0.2%)   2368 (+0.0%)  856 (+0.9%)
2025-10-23 23:56:50 -05:00
1dc1a26f11 gc: Added LFS3_GC_ALL to make running all gc work easier
This is an alias for all possible gc work, which is a bit more
complicated than you might think due to compile-time features (example:
LFS3_GC_REPOPGBMAP).

The intention is to make loops like the following easy to write:

  struct lfs3_fsinfo fsinfo;
  lfs3_fs_stat(&lfs3, &fsinfo) => 0;

  lfs3_trv_t trv;
  lfs3_trv_open(&lfs3, &trv, fsinfo.flags & LFS3_GC_ALL) => 0;
  ...

It's possible to do this by explicitly setting all gc flags, but that
requires quite a bit of knowledge from the user.
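
For reference, the explicit version looks roughly like this (a sketch;
I'm probably forgetting a flag or two, and the exact set depends on
compile-time features):

  uint32_t gc_flags = LFS3_GC_COMPACTMETA | LFS3_GC_REPOPLOOKAHEAD;
  #ifdef LFS3_GBMAP
  gc_flags |= LFS3_GC_REPOPGBMAP;
  #endif
  lfs3_trv_open(&lfs3, &trv, fsinfo.flags & gc_flags) => 0;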

Another option is allowing -1 for gc/traversal flags, but that loses
assert protection against unknown/misplaced flags.

---

This raises more questions about the prefix naming: it feels a bit weird
to take LFS3_I_* flags, mask with LFS3_GC_* flags, and pass them as
LFS3_T_* flags, but it gets the job done.

Limiting LFS3_GC_ALL to the LFS3_GC_* namespace avoids issues with
opt-out/mode flags such as LFS3_T_RDONLY, LFS3_T_MTREEONLY, etc. For
this reason it probably doesn't make sense to add something similar to
the other namespaces.
2025-10-23 23:55:54 -05:00
1f824a029b Renamed LFS3_T_COMPACT -> LFS3_T_COMPACTMETA (and gc_compactmeta_thresh)
- LFS3_T_COMPACT -> LFS3_T_COMPACTMETA
- gc_compact_thresh -> gc_compactmeta_thresh

And friends:

  LFS3_M_COMPACTMETA   0x00000800  Compact metadata logs
  LFS3_GC_COMPACTMETA  0x00000800  Compact metadata logs
  LFS3_I_COMPACTMETA   0x00000800  Filesystem may have uncompacted metadata
  LFS3_T_COMPACTMETA   0x00000800  Compact metadata logs

---

This does two things:

1. Highlights that LFS3_T_COMPACTMETA only interacts with metadata logs,
   and has no effect on data blocks.

2. Better matches the verb+noun names used for other gc/traversal flags
   (REPOPGBMAP, CKMETA, etc).

It is a bit more of a mouthful, but I'm not sure that's entirely a bad
thing. These are pretty low-level flags.
2025-10-23 23:54:57 -05:00
9bdfb25a09 Renamed LFS3_T_LOOKAHEAD -> LFS3_T_REPOPLOOKAHEAD
And friends:

  LFS3_M_REPOPLOOKAHEAD   0x00000200  Repopulate lookahead buffer
  LFS3_GC_REPOPLOOKAHEAD  0x00000200  Repopulate lookahead buffer
  LFS3_I_REPOPLOOKAHEAD   0x00000200  Lookahead buffer is not full
  LFS3_T_REPOPLOOKAHEAD   0x00000200  Repopulate lookahead buffer

To match LFS3_T_REPOPGBMAP, which is more-or-less the same operation.
Though this does turn into quite the mouthful...
2025-10-23 23:54:02 -05:00
ced63a4c73 Renamed inline_size -> shrub_size
There's a strong argument for naming this inline_size as that's more
likely what users expect, but shrub_size is just the more correct name
and avoids confusion around having multiple names for the same thing.

It also highlights that shrubs in littlefs3 are a bit different than
inline files in littlefs2, and that this config also affects large files
with a shrubbed root.

May rerevert this in the future, but probably only if there is
significant user confusion.
2025-10-23 23:53:02 -05:00
d58205d621 Renamed lfs3_fs_flushgdelta -> lfs3_fs_zerogdelta
This really didn't match the use of "flush" elsewhere in the system.
2025-10-23 23:52:09 -05:00
3b4e1e9e0b gbmap: Renamed gbmap_rebuild_thresh -> gbmap_repop_thresh
And tweaked a few related comments.

I'm still on the fence with this name; I don't think it's great, but it
at least better describes the "repopulation" operation than
"rebuilding" does. The important distinction is that we don't throw away
information. Bad/erased block info (future) is still carried over into
the new gbmap snapshot, and persists unless you explicitly call
rmgbmap + mkgbmap.

So, adopting gbmap_repop_thresh for now to see if it's just a habit
thing, but may adopt a different name in the future.

As a plus, gbmap_repop_thresh is two characters shorter.
2025-10-23 23:51:18 -05:00
fb90bf976c trv: Split lfs3_trv_t -> lfs3_trv_t, lfs3_mgc_t, and lfs3_mtrv_t
A big downside of LFS3_T_REBUILDGBMAP is the addition of an lfs3_btree_t
struct to _every_ traversal object.

Unfortunately, I don't see a way around this. We need to track the new
gbmap snapshot _somewhere_, and other options (such as a global gbmap.b_
snapshot) just move the RAM around without actually saving anything.

To at least mitigate this internally, this splits lfs3_trv_t into
distinct lfs3_trv_t, lfs3_mgc_t, and lfs3_mtrv_t structs that capture
only the relevant state for internal traversal layers:

- lfs3_mtree_traverse <- lfs3_mtrv_t
- lfs3_mtree_gc       <- lfs3_mgc_t (contains lfs3_mtrv_t)
- lfs3_trv_read       <- lfs3_trv_t (contains lfs3_mgc_t)
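
Roughly, the nesting looks like this (just a sketch of the containment;
real members are elided and the member names are guesses):

  typedef struct lfs3_mtrv {
      uint32_t flags;             // placeholder, real state elided
  } lfs3_mtrv_t;                  // lfs3_mtree_traverse state

  typedef struct lfs3_mgc {
      lfs3_mtrv_t t;              // contains the mtree traversal state
  } lfs3_mgc_t;                   // lfs3_mtree_gc state

  typedef struct lfs3_trv {
      lfs3_mgc_t gc;              // contains the gc state
  } lfs3_trv_t;                   // lfs3_trv_read state, + 2-block queue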

This minimizes the impact of the gbmap rebuild snapshots, and saves a
big chunk of RAM. As a plus it also saves RAM in the default build by
limiting the 2-block block queue to the high-level lfs3_trv_read API:

                 code          stack          ctx
  before:       37176           2360          684
  after:        37176 (+0.0%)   2352 (-0.3%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 40060           2432          848
  gbmap after:  40024 (-0.1%)   2368 (-2.6%)  848 (+0.0%)

The main downside? Our field names are continuing in their
ridiculousness:

  lfs3.gc.gc.t.b.h.flags // where else would the global gc flags be?
2025-10-23 23:49:58 -05:00
06bc4dff04 trv: Simplified MUTATED/DIRTY flags, no more swapping
This is a bit less simplified than I hoped. We don't _strictly_ need
both LFS3_t_DIRTY + LFS3_t_MUTATED if we're ok with either (1) making
multiple passes to confirm fixorphans succeeded or (2) clearing the
COMPACT flag after one pass (which may introduce new uncompacted
metadata). But both of these have downsides, and we're not _that_
stressed for flag space yet...

So keeping all three of:

  LFS3_t_DIRTY      0x04000000  Filesystem modified outside traversal
  LFS3_t_MUTATED    0x02000000  Filesystem modified during traversal
  LFS3_t_CKPOINTED  0x01000000  Filesystem ckpointed during traversal

But I did manage to get rid of the bit swapping by tweaking LFS3_t_DIRTY
to imply LFS3_t_MUTATED instead of being exclusive. This removes the
"failed" gotos in lfs3_mtree_gc and makes things a bit more readable.

---

I also split lfs3_fs/handle_clobber into separate lfs3_fs/handle_clobber
and lfs3_fs/handle_mutate functions. This added a bit of code, but I
think it's worth it for a simpler internal API. A confusing internal API
is no good.

In total these simplifications saved a bit of code:

                 code          stack          ctx
  before:       37208           2360          684
  after:        37176 (-0.1%)   2360 (+0.0%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 40100           2432          848
  gbmap after:  40060 (-0.1%)   2432 (+0.0%)  848 (+0.0%)
2025-10-23 23:41:43 -05:00
5a7e0c2b58 gbmap: Renamed a couple gbmap/lookahead things to be more consistent
- lfs3_gbmap_set* -> lfs3_gbmap_mark*
- lfs3_alloc_markfree -> lfs3_alloc_adopt
- lfs3_alloc_mark* -> lfs3_alloc_markinuse*

Mainly for consistency, since the gbmap and lookahead buffer are more or
less the same algorithm, ignoring nuances (lookahead only ors inuse
bits, gbmap rebuilding can result in multiple snapshots, etc).

The rename lfs3_gbmap_set* -> lfs3_gbmap_mark* also makes space for
lfs3_gbmap_set* to be used for range assignments with a payload, which
may be useful for erased ranges (gbmap-tracked ecksums?).
2025-10-23 23:39:59 -05:00
f5508a1b6c gbmap: Added LFS3_T_REBUILDGBMAP and friends
This adds LFS3_T_REBUILDGBMAP and friends, and enables incremental gbmap
rebuilds as a part of gc/traversal work:

  LFS3_M_REBUILDGBMAP   0x00000400  Rebuild the gbmap
  LFS3_GC_REBUILDGBMAP  0x00000400  Rebuild the gbmap
  LFS3_I_REBUILDGBMAP   0x00000400  The gbmap is not full
  LFS3_T_REBUILDGBMAP   0x00000400  Rebuild the gbmap

On paper, this is more or less identical to repopulating the lookahead
buffer -- traverse the filesystem, mark blocks as in-use, adopt the new
gbmap/lookahead buffer on success -- but a couple nuances make
rebuilding the gbmap a bit trickier:

- Unlike the lookahead buffer, which eagerly zeros in allocation, we
  need an explicit zeroing pass before we start marking blocks as
  in-use. This means multiple traversals can potentially conflict with
  each other, risking the adoption of a clobbered gbmap.

- The gbmap, which stores information on disk, relies on block
  allocation and the temporary "in-flight window" defined by allocator
  ckpoints to avoid circular block states during gbmap rebuilds. This
  makes gbmap rebuilds sensitive to allocator ckpoints, which we
  consider more-or-less a noop in other parts of the system.

  Though now that I'm writing this, it might have been possible to
  instead include gbmap rebuild snapshots in fs traversals... but that
  would probably have been much more complicated.

- Rebuilding the gbmap requires writing to disk and is generally much
  more expensive/destructive. We want to avoid trying to rebuild the
  gbmap when it's not possible to actually make progress.

On top of this, the current trv-clobber system is a delicate,
error-prone mess.

---

To simplify everything related to gbmap rebuilds, I added a new
internal traversal flag: LFS3_t_CKPOINTED:

  LFS3_t_CKPOINTED  0x04000000  Filesystem ckpointed during traversal

LFS3_t_CKPOINTED is set, unconditionally, on all open traversals in
lfs3_alloc_ckpoint, and provides a simple, robust mechanism for checking
if _any_ allocator checkpoints have occurred since a traversal was
started. Since lfs3_alloc_ckpoint is required before any block
allocation, this provides a strong guarantee that nothing funny happened
to any allocator state during a traversal.
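
In pseudo-C, the new part of lfs3_alloc_ckpoint is just the following
(the handle-list iteration and names here are stand-ins, not the actual
code):

  // mark every open traversal, so in-progress lookahead/gbmap passes
  // know allocator state may have shifted underneath them
  for (lfs3_handle_t *h = lfs3->handles; h; h = h->next) {
      if (lfs3_handle_istrv(h)) {
          h->flags |= LFS3_t_CKPOINTED;
      }
  }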

This makes lfs3_alloc_ckpoint a bit less cheap, but the strong
guarantees that allocator state is unmodified during traversal are well
worth it.

This makes both lookahead and gbmap passes simpler, safer, and easier to
reason about.

I'd like to adopt something similar+stronger for LFS3_t_MUTATED, and
reduce this back to two flags, but that can be a future commit.

---

Unfortunately due to the potential for recursion, this ended up reusing
less logic between lfs3_alloc_rebuildgbmap and lfs3_mtree_gc than I had
hoped, but at least the main chunks (lfs3_alloc_remap,
lfs3_gbmap_setbptr, lfs3_alloc_adoptgbmap) could be split out into
common functions.

The result is a decent chunk of code and stack, but the value is high as
incremental gbmap rebuilds are the only option to reduce the latency
spikes introduced by the gbmap allocator (it's not significantly worse
than the lookahead buffer, but both do require traversing the entire
filesystem):

                 code          stack          ctx
  before:       37164           2352          684
  after:        37208 (+0.1%)   2360 (+0.3%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 39708           2376          848
  gbmap after:  40100 (+1.0%)   2432 (+2.4%)  848 (+0.0%)

Note the gbmap build is now measured with LFS3_GBMAP=1 (maybe-gbmap),
instead of LFS3_YES_GBMAP=1 as before. This includes the cost of
mkgbmap, lfs3_f_isgbmap, etc.
2025-10-23 23:39:55 -05:00
5bfa2a1071 gbmap: Added an lfs3_alloc_ckpoint to lfs3_fs_mkconsistent
lfs3_fs_mkconsistent is already limited to call sites where
lfs3_alloc_ckpoint is valid (lfs3_fs_mkconsistent internally relies on
lfs3_mdir_commit), so might as well include an unconditional
lfs3_alloc_ckpoint to populate allocators and save some code:

                       code          stack          ctx
  no-gbmap before:    37168           2352          684
  no-gbmap after:     37164 (-0.0%)   2352 (+0.0%)  684 (+0.0%)

                       code          stack          ctx
  maybe-gbmap before: 39720           2376          848
  maybe-gbmap after:  39708 (-0.0%)   2376 (+0.0%)  848 (+0.0%)

                       code          stack          ctx
  yes-gbmap before:   39208           2376          848
  yes-gbmap after:    39204 (-0.0%)   2376 (+0.0%)  848 (+0.0%)
2025-10-17 14:03:14 -05:00
61dc21ccb7 gbmap: Renamed/moved lookahead.bmapped -> gbmap.known
And:

- Tweaked the behavior of gbmap.window/known to _not_ match disk.
  gbmap.known matching disk is what required a separate
  lookahead.bmapped in the first place, but we never use both fields.

- _Don't_ revert gbmap on failed mdir commits!

  This was broken! If we reverted we risked inheriting outdated
  in-flight block information.

  This could be fixed by also zeroing lookahead.bmapped, but would force
  a gbmap rebuild. And why? The only interaction between mdir commit and
  the gbmap is block allocation, which is intentionally allowed to go
  out-of-sync to relax issues like this.

  Note we still revert in lfs3_fs_grow, since the new gbmap we create
  there is incompatible with the previous disk size.

As a part of these changes, gbmap.window now behaves roughly the same as
gbmap.known and updates eagerly on block allocation.

This makes lookahead.window and gbmap.window somewhat redundant, but
simplifies the relevant logic (especially due to how lookahead.window
lags behind lookahead.off).

---

A bunch of bugs fell out of this, the interactions with lfs3_fs_mkgbmap
and lfs3_fs_grow being especially tricky, but fortunately our testing is
doing a good job.

At least the code changes were minimal, saves a bit of RAM:

                       code          stack          ctx
  no-gbmap before:    37168           2352          684
  no-gbmap after:     37168 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                       code          stack          ctx
  maybe-gbmap before: 39688           2392          852
  maybe-gbmap after:  39720 (+0.1%)   2376 (-0.7%)  848 (-0.5%)

                       code          stack          ctx
  yes-gbmap before:   39156           2392          852
  yes-gbmap after:    39208 (+0.1%)   2376 (-0.7%)  848 (-0.5%)
2025-10-17 14:02:47 -05:00
67d3c6ea69 scripts: Ignore errors with compat-disabled gstate
The gbmap introduces quite a bit of complexity with how it interacts
with config: block_count => gbmap weight, and wcompat => gbmap enabled.
On one hand this means fewer sources of truth; on the other hand it
makes the gbmap logic cross subsystems and a bit messy.

To avoid trying to parse a bunch of disabled/garbage gstate, this adds
wcompat/rcompat checks to our Gstate class, exposed via __bool__.

This also means we actually need to parse wcompat/rcompat/ocompat flags,
but that wasn't too difficult (though it currently only supports 32 bits).

---

I added conditional repr logic for the grm and gbmap, but didn't bother
with the gcksum. The gcksum is used too many other places in these
scripts to expect a nice rendering when disabled.
2025-10-17 14:02:46 -05:00
b5a94f3397 gbmap: Added mkgbmap and rmgbmap for enabling/disabling the gbmap
These two functions allow changing whether or not the gbmap is in use
after format:

  // Enable the global on-disk block-map
  //
  // Returns a negative error code on failure. Does nothing if a gbmap
  // already exists.
  int lfs3_fs_mkgbmap(lfs3_t *lfs3);

  // Disable the global on-disk block-map
  //
  // Returns a negative error code on failure. Does nothing if no gbmap
  // is found.
  int lfs3_fs_rmgbmap(lfs3_t *lfs3);
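
So converting an image back and forth, in test terms, is just:

  lfs3_fs_mkgbmap(&lfs3) => 0;   // enable the on-disk block-map
  lfs3_fs_rmgbmap(&lfs3) => 0;   // and disable it again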

rmgbmap was easy enough, but implementing mkgbmap turned out to be
surprisingly tricky due to how gstate permeates the system:

- Even if we zero gstate when removing the gbmap, mounting the
  image on a driver that doesn't understand the gbmap results in garbage
  gstate over time as mdir compacts drop unknown gdeltas.

  I think this sort of implicit gdelta cleanup is a good thing, but the
  possibility of garbage gstate is a bit annoying.

  Example A: the dbg scripts are currently printing a bunch of warnings
  for corrupt gstate that can be safely ignored.

  To support recovering from garbage gstate in mkgbmap, I changed
  lfs3_fs_commitgdelta to _always_ track p state even when disabled. We
  already needed to do this in lfs3_fs_flush/consumegdelta anyways,
  since we don't know if the gbmap is used until parsing wcompat flags.

- The commit that enables the gbmap is tricky. We need the gbmap enabled
  to calculate the new gdelta, but we also need it disabled so we don't
  traverse the existing gbmap_p (which may be garbage).

  As a workaround I added gbmap.b_p, which is in theory redundant with
  gbmap_p, but (1) avoids needing to decode gbmap_p during traversals,
  and (2) allows the two to temporarily fall out-of-sync in mkgbmap.

  This means we potentially have 5 (!) snapshots flying around when
  rebuilding the gbmap, which is starting to get a bit silly. But this
  was also motivated by gbmap_p decoding adding roughly the same amount
  of RAM to lfs3_mtree_traverse_, so the total RAM usage should in
  theory be roughly the same.

  There might be a better solution, but this at least gets mkgbmap
  working. The gbmap builds are not our most RAM-sensitive configurations
  anyways.

---

Also added a couple more tests in test_gbmap to test these:

- test_gbmap_files
- test_gbmap_rmgbmap
- test_gbmap_mkgbmap
- test_gbmap_rmmkgbmap
- test_gbmap_mkrmgbmap

And an explicit wraparound test to test_alloc. This was loosely implied
by the nospc tests, but it's probably better to have an explicit test.
The only downside is that this implementation is limited to files:

- test_alloc_wraparound_files

---

Note we are currently dealing with three different configurations:
no-gbmap (the default), yes-gbmap (LFS3_YES_GBMAP), and maybe-gbmap
(LFS3_GBMAP + LFS3_F_GBMAP at runtime).

It only makes sense to include these in maybe-gbmap mode, so this is the
only mode with a notable code increase. However these functions are
relatively cheap. The stack/ctx changes also affect yes-gbmap, but
should mostly cancel out, see above:

                       code          stack          ctx
  no-gbmap before:    37168           2352          684
  no-gbmap after:     37168 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                       code          stack          ctx
  maybe-gbmap before: 39292           2456          800
  maybe-gbmap after:  39688 (+1.0%)   2392 (-2.6%)  852 (+6.5%)

                       code          stack          ctx
  yes-gbmap before:   39116           2456          800
  yes-gbmap after:    39156 (+0.1%)   2392 (-2.6%)  852 (+6.5%)
2025-10-17 14:02:05 -05:00
9e45249b29 gbmap: Added support for gbmap in lfs3_fs_grow
In lfs3_fs_grow, we need to update any gbmaps to match the new disk
size. The actual patch to the gbmap is easy, but it does get a bit
delicate since we need to feed the gbmap with an allocator in the new
disk size.

Fortunately, the opportunism of the gbmap allocator avoids any
catch-22 issues, as long as we make sure to not trigger any gbmap
rebuilds.

Adds a bit of code, but not much:

                 code          stack          ctx
  before:       37168           2352          684
  after:        37168 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 39000           2456          800
  gbmap after:  39116 (+0.3%)   2456 (+0.0%)  800 (+0.0%)
2025-10-12 14:24:32 -05:00
24d75a24c5 btree: Moved most btree claims into lfs3_btree_commit_
Highlighted by the gbmap work, the need for every btree commit to claim
(mark as unfetched, forcing erased-state to be rechecked) every possible
btree snapshot is tedious and error-prone.

Unfortunately we can't avoid this for in-flight/stack allocated btrees,
but we can at least automatically claim the global/tracked btrees
(mtree, gbmap, and file btrees) in lfs3_btree_commit_. This makes most
btree commits just do the right thing, and hopefully minimizes the
risk of forgetting a necessary btree claim.

It also cleans up the various btree-specific claims we were doing, and
makes the codebase a bit less of a mess.

---

Also fixed bshrubs never claiming cached leaves. We now also claim
bshrubs (not just btrees), but avoid clobbering erased-state with
is-shrub checks in lfs3_btree_claim.

Code changes minor, btree claims are at least a cheap operation:

                 code          stack          ctx
  before:       37172           2352          684
  after:        37168 (-0.0%)   2352 (+0.0%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 38996           2456          800
  gbmap after:  39000 (+0.0%)   2456 (+0.0%)  800 (+0.0%)
2025-10-09 14:33:27 -05:00
7bb7d93c9f gbmap: Minimized commits in lfs3_gbmap_set_
This rearranges lfs3_gbmap_set_ a bit to try to minimize the number of
commits necessary for gbmap updates.

By combining the split and range creation, we can reduce the common
no-merge case to a single commit.

This matters quite a bit because rebuilding the gbmap requires a ton of
lfs3_gbmap_set_ calls (~2d).

---

The original idea was to see if adopting a builder pattern (see
lfs3_file_graft_) here would reduce the commits necessary, but I don't
think it can. Worst case we need to delete 3 ranges, and since they can
reside in different btree leaves, this requires 3 separate commits.

And the current implementation uses no worse than 3 commits.

---

Code changes minimal:

                 code          stack          ctx
  before:       37172           2352          684
  after:        37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 38992           2456          800
  gbmap after:  38996 (+0.0%)   2456 (+0.0%)  800 (+0.0%)
2025-10-09 14:33:27 -05:00
633cbe8fd6 gbmap: Reuse old gbmap during rebuilds
This changes the gbmap rebuild strategy to clear in-use ranges from a
snapshot of the old gbmap instead of building a new gbmap from scratch.

The theory of building a new gbmap from scratch is it skips the cost of
clearing in-use ranges, but:

1. This potentially misses out on erased-state still in the gbmap.

2. We would need to copy over any erased/bad state (not yet implemented)
   before traversing, and reusing the old gbmap makes this a bit
   simpler.

To make this a little bit more efficient, I extended lfs3_gbmap_set_ to
accept a weight, however this is limited to modifying only a single
range. Cross-range sets would be quite a bit more complicated (see file
grafting).

We're probably dominated by the per-block set operation during traversal
anyways.

---

Costs a bit of code, but in theory makes erased/bad block tracking
cheaper:

                 code          stack          ctx
  before:       37172           2352          684
  after:        37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                 code          stack          ctx
  gbmap before: 38852           2456          800
  gbmap after:  38992 (+0.4%)   2456 (+0.0%)  800 (+0.0%)
2025-10-09 14:33:27 -05:00
cb9bda5a94 gbmap: Renamed gbmap_scan_thresh -> gbmap_rebuild_thresh
I think a good rule of thumb is: if you refer to some variable/config/
field by a different name in comments/writing/etc more often than not,
you should just rename it to match.

So yeah, gbmap_rebuild_thresh controls when the gbmap is rebuilt.

Also touched up the doc comment a bit.
2025-10-09 14:33:27 -05:00
ea05ad04b9 gbmap: Cleanup of gbmap comments, TODOs, code formatting, etc
Just cleaning up a bunch of outdated TODOs and commented out code, as
well as a little bit of code formatting, and scrubbing airspace/gbatc
names as these are no longer used and will just confuse new users.
2025-10-09 14:33:27 -05:00
9b4ee982bc gbmap: Tried to adopt the gbmap name more consistently
Having gbmap/bmap used in different places for the same thing was
confusing. Preferring gbmap as it is consistent with other gstate (grm
queue, gcksums), even if it is a bit noisy.

It's interesting to note what didn't change:

- The BM* range tags: LFS3_TAG_BMFREE, etc. These already differ from
  the GBMAP* prefix enough, and adopting GBM* would risk confusion for
  actual gstate.

- The gbmap revdbg string: "bb~r". We don't have enough characters for
  anything else!

- dbgbmap.py/dbgbmapsvg.py. These aren't actually related to the gbmap,
  so the name difference is a good thing.
2025-10-09 14:33:27 -05:00
9d322741ca bmap: Simplified bmap configs, reduced to one LFS3_F_GBMAP flag
TLDR: This drops the idea of different bmap strategies/modes, and sorts
out most of the compile-time/runtime conditional bmap interactions.

---

Motivation: Benchmarking (at least up to the 32-bit word limit) has
shown the bmap is unlikely to be a significant bottleneck, even on large
disks. The largest disks tend to be NAND, and NAND's ridiculous block
size limits pressure on block allocation.

There are still concerns for areas I haven't measured yet:

- SD/eMMC/FTL - Small blocks, so more pressure on block allocation. In
  theory the logical block size can be artificially increased, but this
  comes with a granularity tradeoff.

- I've only measured throughput, latency is a whole other story.

  However, users have reported lfs3_fs_gc is useful for mitigating this,
  so maybe latency is less of a concern now?

But while there may still be room for improvement via alternative bmap
strategies, they risk a concerning amount of complexity. Yes,
configuration gets more complicated, but the real issue is that any bmap
strategies that try to track _deallocations_ (the original idea being
treediffing) risk leaking blocks if all cases aren't covered.

The current "bmap cache" strategy strikes a really nice balance where it
reduces _amortized_ block allocation -> ~O(log n) without RAM, while
retaining the safe, bug-resistant, single-source-of-truth properties
that come with lookahead-based allocation.

---

So, long story short, dropping other strategies, and now the presence of
the bmap is a boolean flag.

This is also the first format-specific flag:

- Define LFS3_BMAP to enable the bmap logic, but note by default the
  bmap will still not be used.

- Define LFS3_YES_BMAP to force the bmap to be used.

- With LFS3_BMAP, passing LFS3_F_GBMAP to lfs3_format will include the
  on-disk block-map.

- No flag is needed during mount, the presence of the bmap is determined
  by the on-disk wcompat flags (LFS3_WCOMPAT_GBMAP). This also prevents
  rw mounting if the bmap is not supported, but rdonly mounting is
  allowed.

- Users can check if the bmap is in use via lfs3_fs_stat, which reports
  LFS3_I_GBMAP in the flags field.
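
Put together, the maybe-gbmap flow looks roughly like this (the
lfs3_format argument order is a guess; the flag and field names are from
the list above):

  // format with the on-disk block-map, requires LFS3_BMAP at compile time
  lfs3_format(&lfs3, LFS3_F_GBMAP, &cfg) => 0;

  // later, check if the mounted filesystem is using the bmap
  struct lfs3_fsinfo fsinfo;
  lfs3_fs_stat(&lfs3, &fsinfo) => 0;
  assert(fsinfo.flags & LFS3_I_GBMAP);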

There are still some missing pieces, but these will be a bit more
involved:

- lfs3_fs_grow needs to be made bmap aware!

- We probably want something like lfs3_fs_mkgbmap and lfs3_fs_rmgbmap to
  allow converting between bmap backed/not-backed filesystem images.

Code changes minimal:

                code          stack          ctx
  before:      37172           2352          684
  after:       37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                code          stack          ctx
  bmap before: 38844           2456          800
  bmap after:  38852 (+0.0%)   2456 (+0.0%)  800 (+0.0%)
2025-10-09 14:33:27 -05:00
38cfa5cc5e scripts: dbgtag.py: Fixed overlooked LFSR -> LFS3 prefix
Not sure how this was missed, but tags should start with LFS3_ now.
2025-10-09 14:33:27 -05:00
4b2bd11393 bmap: Finally fixed embedded directive macro warning with bmap format
It makes sense: nesting directives (#ifdef) in macro arguments invites
all sorts of weird parse errors. Unfortunately, this doesn't leave us
with many options for conditionally including rattrs in LFS3_RATTRS
lists... This is especially important for lfs3_format, where we can
expect many rattrs to depend on compile-time configurations.

To fix the warning, I went ahead and adopted a conditionally predefined
LFS3_RATTR_IFDEF_BMAP before the rattr list. I'm not super happy with
this fix (ugh, missing comma), but it at least avoids a warning and
non-portable behavior.
2025-10-09 14:33:27 -05:00
982394305e emubd/kiwibd: Fixed unused path param, dropped disk_path
For some reason emubd had both a path argument to lfs3_emubd_create, and
a disk_path config option, with only the disk_path actually being used.

But the real curiosity is why GCC only started warning about it
when copied to kiwibd? path is clearly unused in lfs3_emubd_createcfg,
but no warning...

---

Anyways, not sure which one is a better API, but we definitely don't
need two APIs, so eeny meeny miny moe...

Went ahead and chose the lfs3_emubd_create path param for some
consistency with filebd.
2025-10-09 14:33:27 -05:00
e622656538 bmap: Tweaked bmap ranges, dropped in-flight tag for now
New bmap range tags:

  LFS3_TAG_BMRANGE      0x033u  v--- --11 --11 uuuu
  LFS3_TAG_BMFREE       0x0330  v--- --11 --11 ----
  LFS3_TAG_BMINUSE      0x0331  v--- --11 --11 ---1
  LFS3_TAG_BMERASED     0x0332  v--- --11 --11 --1-
  LFS3_TAG_BMBAD        0x0333  v--- --11 --11 --11

Note 0x334-0x33f are still reserved for future bmap tags, but the new
encoding fits in the surprisingly common 2-bit subfield that may
deduplicate some decoding code.

Fitting in 2 bits is the main reason for this, now that in-flight ranges
look like they won't be worth exploring further. Worst case we can
always add more bm tags in the future. And it may even make sense to use
an entire bit for in-flight tags, since in theory the concept can apply
to more than just in-use blocks.

---

Another benefit of this encoding: In-use vs free is a bit check, and I
like the implication that an in-use + erased block can only be a bad
block.
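
E.g., with hypothetical helper macros:

  // assuming tag is one of the BMRANGE tags above:
  // bit 0 => in-use, bit 1 => erased, both => bad
  #define LFS3_TAG_ISBMINUSE(tag)  ((tag) & 0x1)   // BMINUSE or BMBAD
  #define LFS3_TAG_ISBMERASED(tag) ((tag) & 0x2)   // BMERASED or BMBAD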

No code changes:

                code          stack          ctx
  before:      37172           2352          684
  after:       37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                code          stack          ctx
  bmap before: 38844           2456          800
  bmap after:  38844 (+0.0%)   2456 (+0.0%)  800 (+0.0%)
2025-10-09 14:33:24 -05:00
43a6053d5e alloc: Tried to simplify alloc info statements
So now:

  lfs3.c:11482:info: Rebuilding bmap (bmap 37/256)
  lfs3.c:11246:error: No more free space (lookahead 0/256)

Instead of the previously somewhat confusing:

  lfs3.c:11484:info: Rebuilding bmap (bmap 62/256/256)
  lfs3.c:11247:error: No more free space (lookahead 0/0/256)

While the previous info statements did have more info (window +
ckpoint + block count), usually one of these ended up redundant
(window == ckpoint == 0 during ENOSPC, for example).
2025-10-04 13:33:08 -05:00
92620d386f bmap: Recheckpoint the allocator after rebuilding the bmap
If the state before rebuilding the bmap is a valid checkpoint, the state
after is too.

This lets us realloc any blocks that may have been temporarily allocated
when rebuilding the bmap. This probably doesn't matter much except for
low-storage states when blocks are extremely scarce, but allocator
checkpoints are cheap so better safe than sorry.

Code changes minimal (negative?):

                code          stack          ctx
  before:      37172           2352          684
  after:       37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                code          stack          ctx
  bmap before: 38852           2456          800
  bmap after:  38844 (-0.0%)   2456 (+0.0%)  800 (+0.0%)
2025-10-04 13:30:52 -05:00
052fc200c8 util: More parens in LFS3_MIN/MAX
Previously this had a very naive number of parens, which led to a very
confusing night trying to debug some code that looked roughly like this:

  LFS3_MAX(1, (false) ? 64 : 256) => 1 ???

Fixed by adding more parens.
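
For posterity, the failure mode looks something like this (both
definitions here are reconstructions, not the exact before/after):

  // naive: arguments not parenthesized inside the condition
  #define NAIVE_MAX(a, b) ((a > b) ? (a) : (b))

  // NAIVE_MAX(1, (false) ? 64 : 256) expands to
  //   ((1 > (false) ? 64 : 256) ? (1) : ((false) ? 64 : 256))
  // the condition parses as (1 > false) ? 64 : 256 == 64, truthy, so
  // the whole expression "helpfully" evaluates to 1

  // fixed: parens around the arguments everywhere
  #define LFS3_MAX(a, b) (((a) > (b)) ? (a) : (b))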
2025-10-01 17:58:09 -05:00
2a2d3173ce btree: Implemented quick-fetches to try to speed up btree commits
I've had this trick in my back pocket for a while, but didn't think it
would be worth the code cost. Benchmarks suggested this was a
bottleneck, so gave it an impl...

But it turned out to be a red herring...

At least the code cost is ridiculously cheap?

           code          stack          ctx
  before: 37156           2352          684
  after:  37172 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

Oh, sidenote, this also removes shrub trunk fetching, repurposing that
bit as an internal flag for quick-fetches. I don't think fetching shrubs
makes sense anymore? This code was probably leftover from a less-correct
traversal implementation.

---

The basic idea: the most recent trunk contains all the info we need to
fetch a btree node for committing:

- We can infer the rbyd weight from one trunk: The total weight is just
  the sum of alt pointer weights + the leaf weight.

- The checksum tags provide the perturb bit, ecksum, etc.

The only thing we can't find from the most recent trunk is the checksum,
but this is already implicit in our CoW branch pointers! (Technically the
weight is as well, but we have to scan the alts anyways.)

So we don't need to scan the entire rbyd if we know the checksum, just
the most recent trunk + checksum tags.

In theory, quick-fetches drop our btree commit runtime from
O(b log_b n + (log b)(log_b^2 n)) -> O((log b)(log_b^2 n)).

---

In practice, this doesn't seem to matter, even on NAND with 128KiB
blocks. We're still dominated by compaction costs, perhaps due to the
poor granularity of NAND's read size?

I'm going to keep this for now just for the peace-of-mind while
benchmarking, but it may be worth removing in the future (or maybe not?
the code size is much less than I was expecting).

At least it simplifies the runtime complexity...
2025-10-01 17:58:06 -05:00
b94f9fe071 runners: Fixed 64-bit overflow when size_t < bench_io_t
Long story short: %zd != %jd!

This was a simple oversight when writing the bench printing code, and
easy to miss on x86_64 and other modern PCs, but the mistake becomes
very apparent when trying to bench under qemu in thumb mode!
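
For the record, the difference, with bench_io_t being 64 bits while
size_t is only 32 bits under qemu/thumb:

  #include <stdint.h>
  #include <stdio.h>

  int main(void) {
      int64_t amount = 5000000000;        // doesn't fit in 32 bits

      // the bench code was effectively doing this, truncating the value
      printf("%zu\n", (size_t)amount);    // 705032704 with 32-bit size_t

      // casting to intmax_t and using %jd is portable
      printf("%jd\n", (intmax_t)amount);  // 5000000000 everywhere
      return 0;
  }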
2025-10-01 17:58:05 -05:00
0698c49e1b Allow crystal_thresh to go below prog_size
After sleeping on it, allowing crystal_thresh < prog_size makes more
sense than I initially thought, if only to better support the case where
prog_size = block_size (SD/eMMC).

It's true fragments + crystal_thresh were intended to avoid needing to
write padding to raw data blocks, but this only makes sense up until
block padding is cheaper than rbyd overheads. At ~block_size/4, rbyds vs
padded data blocks have roughly the same cost, and at ~block_size/2
rbyds use ~2x the storage due to logging/splitting. At ~block_size/2 we
definitely want to crystallize even if this is still below prog_size.

And it turns out allowing crystal_thresh < prog_size fixes the 512B
block size issues on SD/eMMC we were running into earlier!
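
The motivating geometry, roughly (a sketch; field names other than
crystal_thresh are from memory and may be off):

  const struct lfs3_cfg cfg = {
      // SD/eMMC-ish geometry, prog_size == block_size
      .read_size      = 512,
      .prog_size      = 512,
      .block_size     = 512,

      // now allowed to be < prog_size, crystallize at ~block_size/4
      .crystal_thresh = 128,

      // ... other config ...
  };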

---

Implementing this required some tweaks to lfs3_file_crystallize_:

1. We intentionally do not align down partial crystallizations if we
   can't satisfy prog alignment, as we risk making no progress
   in this case.

2. If we can't satisfy prog alignment, don't mark the bptr as erased.
   Resuming crystallization to an unaligned block is an error.

Unaligned progs should already be implicitly padded by the lower bd
caching logic, so not aligning should be all we need to do to pad data
blocks.

Oh, and also relax the crystal_thresh >= prog_size constraints.

Adds a bit of code, but the improved block usage on SD/eMMC will
hopefully be valuable:

           code          stack          ctx
  before: 37112           2352          684
  after:  37156 (+0.1%)   2352 (+0.0%)  684 (+0.0%)
2025-10-01 17:58:03 -05:00
2f6f7705f1 Limit crystal_thresh to >=prog_size
I confused myself a bit while benchmarking because crystal_thresh <
prog_size was showing some very confusing results. But it turns out the
relevant code was just not written well enough to support this
configuration.

And, to be fair, this configuration really doesn't make sense. The whole
point of the fragment + crystallization system is so we never have to
write unaligned data to blocks. I mean, we could explicitly write
padding in this case, but why?

---

This should probably eventually be either an assert or mutable limit,
but in the meantime I'm just adjusting crystal_thresh at runtime, which
adds a bit of code:

           code          stack          ctx
  before: 37076           2352          684
  after:  37112 (+0.1%)   2352 (+0.0%)  684 (+0.0%)

On the plus side, this prevents crystal_thresh=0 issues much more
elegantly.
2025-10-01 17:58:01 -05:00
8cc91ffa9e Prevent oscillation when crystal_thresh < fragment_size
When crystal_thresh < fragment_size, there was a risk that repeated
write operations would oscillate between crystallizing and fragmenting
every operation. Not only would this wreck performance, it would also
violently wear down blocks as each crystallization would trigger an
erase.

Fortunately all we need to do to prevent this is check both
fragment_size and crystal_thresh before fragmenting. Note this also
affects the fragment checks in truncate/fruncate.
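
The guard ends up being roughly the following (names illustrative, this
is not the literal code):

  // only fragment if we're below _both_ thresholds, otherwise
  // crystallize; this is what prevents the fragment<->crystal
  // oscillation when crystal_thresh < fragment_size
  if (size <= lfs3->cfg->fragment_size
          && size < lfs3->cfg->crystal_thresh) {
      // write as a fragment
  } else {
      // crystallize
  }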

---

crystal_thresh < fragment_size is kind of a weird configuration; to be
honest we should probably just assert if configured this way (we never
write fragments > crystal_thresh, because at that point we would just
crystallize).

But at the moment the extra leniency is useful for benchmarking.

Adds a bit of code, but will probably either assert or mutably limit in
the future:

           code          stack          ctx
  before: 37028           2352          684
  after:  37076 (+0.1%)   2352 (+0.0%)  684 (+0.0%)
2025-10-01 17:57:58 -05:00
eab526ad9f Fixed crystal_thresh=0 bugs
There was a mismatch between the lfs3_cfg comment and the actual
crystal_thresh math where crystal_thresh=0 would break things:

- In lfs3_file_flush_, crystal_thresh=0 meant we would never resume
  crystallization, leading to terrible, _terrible_, linear write
  performance.

- In lfs3_file_sync and lfs3_set, it's unclear if small file commit
  optimizations were working properly. I went ahead and added a
  lfs3_max(lfs3->cfg->crystal_thresh, 1) just to be safe.

The other references to crystal_thresh all check for >= crystal_thresh
conditions, so shouldn't be broken (except for an unrelated bug in
lfs3_file_flushset_).

The reason for this is that crystal_thresh=1 is technically the lower
bound for this math. Allowing crystal_thresh=0 is just a convenience,
and honestly allowing it may have been a bad idea. Maybe we should
require crystal_thresh=1 at minimum? I added a TODO.

All the new v3 config needs revisiting anyways, for defaults, etc.

---

Curiously, this actually saved code? My best guess is maybe some weird
code path in lfs3_file_flush_ was eliminated:

           code          stack          ctx
  before: 37036           2352          684
  after:  37028 (-0.0%)   2352 (+0.0%)  684 (+0.0%)
2025-10-01 17:57:54 -05:00
2c67fb1ea2 scripts: Dropped -e/--exec shortform flag, now just --exec
Too much room for confusion, and potential flag conflicts in the future.
Note it already conflicted with -e/--error-* flags.

--exec is a rather technical flag anyways, and will probably be wrapped
in other ci/script scaffolding most of the time.
2025-10-01 17:57:52 -05:00
be118ab93d scripts: Fixed -s/-S sorting of .csv/.json outputs
I'm not sure if this was ever implemented, or broken during a refactor,
but we were ignoring -s/-S flags when writing .csv/.json output with
-o/-O.

Curious, because the functionality _was_ implemented in fold, just
unused. All this required was passing -s/-S to fold correctly.

Note we _don't_ sort diff_results, because these are never written to
.csv/.json output.

At some point this behavior may have been a bit more questionable, since
we used to allow mixing -o/-O and table rendering. But now that -o/-O is
considered an exclusive operation, ignoring -s/-S doesn't really make
sense.

---

Why did this come up? Well imagine my frustration when:

1. In tikz/pgfplots, \addplot table only really works with sorted data

2. csv.py has a -s/-S flag for sorting!

3. -s/-S doesn't work!
2025-10-01 17:57:49 -05:00
c33182b49b Relax recrystallization when fruncating/logging
This was a nasty performance hole found while benchmarking.

Basically, any time crystallization is triggered, the crystallization
algorithm tries to pack as much data into as few blocks as possible.
When fruncating (the common, and performance sensitive, use case being
logging), this can lead to the algorithm rewriting fruncated blocks.

What the crystallization algorithm doesn't realize, however, is that
when fruncating/logging, we're probably going to fruncate again on the
next call, so rewriting the block is a waste of effort.

Worst case -- a 1 block file -- this can cause littlefs to rewrite the
entire file on every append.

---

The solution implemented here, which is a bit of a hack, is to use the
actual block start for block alignment instead of the logical
start-of-block referenced by our btree/bshrub.

This solves the fruncating/logging performance hole, with the tradeoff
of using more storage than is strictly necessary. This tradeoff is
probably expected with logging however.

Code changes minimal:

           code          stack          ctx
  before: 37024           2352          684
  after:  37036 (+0.0%)   2352 (+0.0%)  684 (+0.0%)
2025-10-01 17:57:47 -05:00
14d0c4121c bmap: Dropped treediff buffers for now
We're not currently using these (at the moment it's unclear if the
original intention behind the treediff algorithms is worth pursuing),
and they are showing up in our heap benchmarks.

The good news is that means our heap benchmarks are working.

Also saves a bit of code/ctx in bmap mode:

                code          stack          ctx
  before:      37024           2352          684
  after:       37024 (+0.0%)   2352 (+0.0%)  684 (+0.0%)

                code          stack          ctx
  bmap before: 38752           2456          812
  bmap after:  38704 (-0.1%)   2456 (+0.0%)  800 (-1.5%)
2025-10-01 17:57:42 -05:00
232f039ccc kiwibd: Added kiwibd, a lighter-weight variant of emubd
Useful for emulating much larger disks in a file (or in RAM). kiwibd
doesn't have all the features of emubd, but this allows it to prioritize
disk size and speed for benchmarking.

kiwibd still keeps some features useful for benchmarking/emulation:

- Optional erase value emulation, including nor-masking

- Read/prog/erase trackers for measuring bd operations

- Read/prog/erase sleeps for slowing down the simulation to a human
  viewable speed
2025-10-01 17:57:39 -05:00
6ba3204816 scripts: Some csv script tweaks to better interact with other scripts
- Added --small-total. Like --small-header, this omits the first column
  which usually just has the informative text TOTAL.

- Tweaked -Q/--small-table so it renders with --small-total if
  -Y/--summary is provided.

- Added --total as an alias for --summary + --no-header + --small-total,
  i.e. printing only the totals (which may be multiple columns) and no
  other decoration.

  This is useful for scripting; now it's possible to extract just, say,
  the sum of some csv and embed with $():

    echo $(./scripts/code.py lfs3.o --total)

- Tweaked total to always output a number (0) instead of a dash (-),
  even if we have no results.

  This relies on Result() with no args, which risks breaking scripts
  where the Result type expects an argument. To hopefully catch this
  early, the table renderer currently creates a Result() before trying
  to fold the total result.

- If first column is empty (--small-total + --small-header, --no-header,
  etc) collapse width to zero. This avoids a bunch of extra whitespace,
  but still includes the two spaces normally used to separate names from
  fields.

  But I think those spaces are a good thing. It makes it hard to miss
  the implicit padding in the table renderer that risks breaking
  dependent scripts.
2025-10-01 17:57:37 -05:00
3e8f304138 scripts: ctx.py/structs.py: Worked around incomplete structs/unions
Found when trying to measure ctx of yaffs2, which relies on incomplete
structs to hide some internal state (yaffs_summary_tags, yaffs_DIR).
This is less common in microcontroller filesystems since almost all
structs end up statically/stack allocated, and you can't statically
allocate incomplete structs.

It's not too surprising, but incomplete structs have no associated
DW_AT_byte_size in the relevant dwarf info, which broke ctx.py and
structs.py...

As a workaround, I'm now defaulting to size=0 if DW_AT_byte_size is
missing.

---

With this fix, at least structs.py is able to pick up the later internal
definition of yaffs_summary_tags. ctx.py doesn't because it only looks
at the unique dwarf offset referenced by the function definition, but
I'm hesitant to try anything more clever here.

yaffs_DIR is noteworthy in that there is simply no complete definition.
Internally, yaffs_DIR pointers alias yaffsfs_DirSearchContext structs.
In this case I think returning size=0 is the only reasonable option.
2025-10-01 17:57:35 -05:00
c9691503bc scripts: plot[mpl].py: Added --x/ylim-ratio for simpler limits
I've been struggling to keep plots readable with --x/ylim-stddev; it may
have been the wrong tool for the job.

This adds --x/ylim-ratio as an alternative, which just sets the limit to
include x-percent of the data (I avoided "percen"t in the name because
it should be --x/ylim-ratio=0.98, not 98, though I'm not sure "ratio" is
great either...).

Like --x/ylim-stddev, this can be used in both one and two argument
forms:

  $ ./scripts/plot.py --ylim-ratio=0.98
  $ ./scripts/plot.py --ylim-ratio=-0.98,+0.98

So far, --x/ylim-ratio has proven much easier to use, maybe because our
amortized results don't follow a normal distribution? --x/ylim-ratio
seems to do a good job of clipping runaway amortized results without too
much information loss.
2025-10-01 17:57:32 -05:00
92af5de3ca emubd: Added optional nor-masking emulation
This adds NOR-style masking emulation to emubd when erase_value is set
to -2:

  erase     => 0xff
  prog 0xf0 => 0xf0
  prog 0xcc => 0xc0

We do _not_ rely on this property in littlefs, and so this feature will
probably go unused in our tests, but it's useful for running other
filesystems (SPIFFS) on top of emubd.

It may be a bit of a scope violation to merge this into littlefs's core
repo, but it's useful to centralize emubd's features somewhere...
2025-10-01 17:57:28 -05:00
6a57258558 make: Adopted lowercase for foreach variables
This seems to be the common style in other Makefiles, and avoids
confusion with global/env variables.
2025-10-01 17:57:23 -05:00
a1b75497d6 bmap: rdonly: Got LFS3_RDONLY + LFS3_BMAP compiling
Counterintuitively, LFS3_RDONLY + LFS3_BMAP _does_ make sense for cases
where you want to include the bmap in things like ckmeta/ckdata scans.

Though this is another argument for a LFS3_RDONLY + LFS3_NO_TRV build.
Traversals add quite a bit of code to the rdonly build that is probably
not always needed.

---

This just required another bunch of ifdefs.

Current bmap rdonly code size:

                code          stack          ctx
  rdonly:      10616            896          532
  rdonly+bmap: 10892 (+2.6%)    896 (+0.0%)  636 (+19.5%)
2025-10-01 17:57:15 -05:00
60ef118dcd rdonly: Got LFS3_RDONLY compiling again
Just a few alloc/eoff references slipped through in the bmap work.

Current rdonly code size:

            code           stack           ctx
  default: 37024            2352           684
  rdonly:  10616 (-71.3%)    896 (-61.9%)  532 (-22.2%)

The biggest change was tweaking our mtortoise again to use the unused
trunk field for the power-of-two bound. The original intention of using
eoff was an extra precaution to avoid the mtortoise looking like a valid
shrub at any point, but eoff is not available in LFS3_RDONLY.

And we definitely want our mtortoise in LFS3_RDONLY!

---

Note I haven't actually tested LFS3_RDONLY + LFS3_BMAP. Does this config
even make sense? I guess ckmeta/ckdata will need to traverse the bmap,
so, counterintuitively, yes?
2025-10-01 17:57:14 -05:00