And tweaked a few related comments.
I'm still on the fence with this name; I don't think it's great, but it
at least better describes the "repopulation" operation than
"rebuilding". The important distinction is that we don't throw away
information. Bad/erased block info (future) is still carried over into
the new gbmap snapshot, and persists unless you explicitly call
rmgbmap + mkgbmap.
So, adopting gbmap_repop_thresh for now to see if it's just a habit
thing, but may adopt a different name in the future.
As a plus, gbmap_repop_thresh is two characters shorter.
I think a good rule of thumb is if you refer to some variable/config/
field with a different name in comments/writing/etc more often than not,
you should just rename the variable/config/field to match.
So yeah, gbmap_rebuild_thresh controls when the gbmap is rebuilt.
Also touched up the doc comment a bit.
Having gbmap/bmap used in different places for the same thing was
confusing. Preferring gbmap as it is consistent with other gstate (grm
queue, gcksums), even if it is a bit noisy.
It's interesting to note what didn't change:
- The BM* range tags: LFS3_TAG_BMFREE, etc. These already differ from
the GBMAP* prefix enough, and adopting GBM* would risk confusion with
actual gstate.
- The gbmap revdbg string: "bb~r". We don't have enough characters for
anything else!
- dbgbmap.py/dbgbmapsvg.py. These aren't actually related to the gbmap,
so the name difference is a good thing.
We're not currently using these (at the moment it's unclear if the
original intention behind the treediff algorithms is worth pursuing),
and they are showing up in our heap benchmarks.
The good news is that means our heap benchmarks are working.
Also saves a bit of code/ctx in bmap mode:
           code             stack            ctx
before:    37024            2352             684
after:     37024 (+0.0%)    2352 (+0.0%)     684 (+0.0%)

             code             stack            ctx
bmap before: 38752            2456             812
bmap after:  38704 (-0.1%)    2456 (+0.0%)     800 (-1.5%)
At least at a proof-of-concept level, there's still a lot of cleanup
needed.
To make things work, lfs3_alloc_ckpoint now takes an mdir, which
provides the target for gbmap gstate updates.
When the bmap is close to empty (configurable via bmap_scan_thresh), we
opportunistically rebuild it during lfs3_alloc_ckpoints. The nice thing
about lfs3_alloc_ckpoint is we know the state of all in-flight blocks,
so rebuilding the bmap just requires traversing the filesystem + in-RAM
state.
We might still fall back to the lookahead buffer, but in theory a well
tuned bmap_scan_thresh can prevent this from becoming a bottleneck (at
the cost of more frequent bmap rebuilds).
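As a rough sketch of where this lands (function shape only; the bmap
fields and helper below are hypothetical, not the actual code):

  static int lfs3_alloc_ckpoint(lfs3_t *lfs, lfs3_mdir_t *mdir) {
      // at a checkpoint we know the state of all in-flight blocks, so
      // it's safe to rebuild the bmap from a traversal of the
      // filesystem + in-RAM state
      //
      // note lfs->bmap.free_count and lfs3_bmap_rebuild are hypothetical
      if (lfs->bmap.free_count <= lfs->cfg->bmap_scan_thresh) {
          int err = lfs3_bmap_rebuild(lfs, mdir);
          if (err) {
              return err;
          }
      }

      // ... the rest of the checkpoint logic ...
      return 0;
  }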
---
This is also probably a good time to resume measuring code/ram costs,
though it's worth repeating the above note about the bmap work still
needing cleanup:
           code             stack            ctx
before:    36840            2368             684
after:     36920 (+0.2%)    2368 (+0.0%)     684 (+0.0%)
Haha, no, the bmap isn't basically free, it's just an opt-in feature.
With -DLFS3_YES_BMAP=1:
           code             stack            ctx
no bmap:   36920            2368             684
yes bmap:  38552 (+4.4%)    2472 (+4.4%)     812 (+18.7%)
Note this includes both the lfs3_config -> lfs3_cfg structs as well as
the LFS3_CONFIG -> LFS3_CFG include define:
- LFS3_CONFIG -> LFS3_CFG
- struct lfs3_config -> struct lfs3_cfg
- struct lfs3_file_config -> struct lfs3_file_cfg
- struct lfs3_*bd_config -> struct lfs3_*bd_cfg
- cfg -> cfg
We were already using cfg as the variable name everywhere. The fact that
these names were different was an inconsistency that should be fixed
since we're committing to an API break.
LFS3_CFG is already out-of-date from upstream, and there's plans for a
config rework, but I figured I'd go ahead and change it as well to lower
the chances it gets overlooked.
---
Note this does _not_ affect LFS3_TAG_CONFIG. Having the on-disk vs
driver-level config take slightly different names is not a bad thing.
So we now keep blocks around until they can be replaced with a single
fragment. This is simpler, cheaper, and reduces the number of commits
needed to graft (though note arbitrary range removals still keep this
unbounded).
---
So, this is a delicate tradeoff.
On one hand, not fully fragmenting blocks risks keeping around bptrs
containing very little data, depending on fragment_size.
On the other hand:
- It's expensive, and disk utilization during random _deletes_ is not
the biggest of concerns.
Note our crystallization algorithm should still clean up partial
blocks _eventually_, so this doesn't really impact random writes.
The main concerns are lfs3_file_truncate/fruncate, and in the future
collapserange/punchhole.
- Fragmenting bptrs introduces more commits, which have their own
prog/erase cost, and it's unclear how this impacts logging operations.
There's no point in fragmenting blocks at the head of a log if we're
going to fruncate them eventually.
I figure let's err on minimizing complexity/code size for now, and if
this turns out to be a mistake, we can always revert or introduce
fragmenting blocks into >1 fragments as an optional feature in the future.
---
Saves a big chunk of code, stack, and even some ctx (no more
fragment_thresh):
           code             stack            ctx
before:    37504            2448             656
after:     37024 (-1.3%)    2416 (-1.3%)     652 (-0.6%)
This prevents runaway O(n^2) behavior on devices with extremely large
block sizes (NAND, bs=~128KiB - ~1MiB).
The whole point of shrubs is to avoid this O(n^2) runaway when inline
files become necessarily large. Setting FRAGMENT_SIZE to a factor of the
BLOCK_SIZE humorously defeats this.
The 512-byte cutoff is somewhat arbitrary; it's the natural BLOCK_SIZE/8
FRAGMENT_SIZE on most NOR flash (bs=4096), but it's probably worth
tuning based on actual device performance.
So now crystal_thresh only controls when fragments are compacted into
blocks, while fragment_thresh controls when blocks are broken into
fragments. Setting fragment_thresh=-1 will follow crystal_thresh and
keep the previous behavior.
These were already two separate pieces of logic, so it makes sense to
provide two separate knobs for tuning.
Setting fragment_thresh lower than crystal_thresh has some potential to
reduce hysteresis in cases where random writes push blocks close to
crystal_thresh. It will be interesting to explore this more when
benchmarking.
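A quick sketch of the intended semantics (the type and sentinel handling
here are assumptions, not the actual code):

  // fragment_thresh=-1 falls back to following crystal_thresh, keeping
  // the previous behavior
  lfs_size_t fragment_thresh = (cfg->fragment_thresh == (lfs_size_t)-1)
          ? cfg->crystal_thresh
          : cfg->fragment_thresh;

  // fragments >= crystal_thresh get compacted into blocks, while blocks
  // that shrink below fragment_thresh get broken back into fragments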
---
The additional config option adds a bit of code/ctx, but hopefully that
will go away in the future config rework:
           code             stack            ctx
before:    35584            2480             636
after:     35600 (+0.0%)    2480 (+0.0%)     640 (+0.6%)
And the related config options:
- cfg->file_buffer_size -> cfg->file_cache_size
- file->cfg->buffer_size -> file->cfg->cache_size
- file->cfg->buffer -> file->cfg->cache_buffer
The original motivation to rename this to file->buffer was to better
align with what other filesystems call this, but I think this is a case
where internal consistency is more important than external consistency.
file->cache better matches lfs->pcache and lfs->rcache, and makes it
easier to read code involving both file->cache and other user-provided
buffers.
Keeping the upstream name also helps with continuity.
While I think shrub_size is probably the more correct name at a
technical level, inline_size is probably more what users expect and
doesn't require a deeper understanding of filesystem details.
The only risk is that users may think inline_size has no effect on large
files, when in fact it still controls how much of the btree root can be
inlined.
There's also the point that sticking with inline_size maintains
compatibility with both the upstream version and any future version that
has other file representations.
May revisit this, but renaming to lfs->cfg->inline_size for now.
Now that we no longer have bmoss files, inline_size and shrub_size are
effectively the same thing.
We weren't using this, so no code change, but it does save a word of
ctx:
           code             stack            ctx
before:    36280            2576             640
after:     36280 (+0.0%)    2576 (+0.0%)     636 (-0.6%)
Incremental gc, being stateful and not gc-able (ironic), was always
going to need to be conditionally compilable.
This moves incremental gc behind the LFS_GC define, so that we can focus
on the "default" costs. This cuts lfs_t in nearly half!
lfs_t with LFS_GC:    308
lfs_t without LFS_GC: 168 (-45.5%)
This does save less code than one might expect though. We still need
most of the internal traversal/gc logic for things like block allocation
and orphan cleanup, so most of the savings is limited to the RAM storing
the incremental state:
                       code             stack            ctx
before:                37916            2608             768
after with LFS_GC:     37944 (+0.1%)    2608 (+0.0%)     768 (+0.0%)
after without LFS_GC:  37796 (-0.3%)    2608 (+0.0%)     620 (-19.3%)
On the flip side, this does mean most of the incremental gc
functionality is still available in the lfsr_traversal_t APIs.
Applications with more advanced gc use-cases may actually benefit from
_not_ enabling the incremental gc APIs, and instead use the
lfsr_traversal_t APIs directly.
Before:

  int lfsr_fs_gc(lfs_t *lfs, lfs_soff_t steps, uint32_t flags);

After:

  int lfsr_gc(lfs_t *lfs);
  int lfsr_gc_setflags(lfs_t *lfs, uint32_t flags);
  int lfsr_gc_setsteps(lfs_t *lfs, lfs_soff_t steps);
---
The interesting thing about the lfsr_gc API is that the caller will
often be very different from whoever configures the system. One example
being an OS calling lfsr_gc in a background loop, while leaving
configuration up to the user.
The idea here is that instead of forcing the OS to come up with its own
stateful system to pass flags to lfsr_gc, we just embed this state in
littlefs directly. The whole point of lfsr_gc is that it's a stateful
system anyways.
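As a sketch of what that looks like from the OS side (the sleep
primitive and error handling here are placeholders):

  void fs_gc_task(lfs_t *lfs) {
      // configured once, e.g. at init time, possibly from user settings:
      //   lfsr_gc_setflags(lfs, flags);
      //   lfsr_gc_setsteps(lfs, 1);
      for (;;) {
          // gc is best-effort background work, so errors are non-fatal
          int err = lfsr_gc(lfs);
          (void)err;
          os_sleep_ms(100);   // hypothetical OS primitive
      }
  }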
Unfortunately this state does require a bit more logic to maintain,
which adds code/ctx cost:
           code             stack            ctx
before:    37812            2608             752
after:     37916 (+0.3%)    2608 (+0.0%)     768 (+2.1%)
This was the one piece needed to be able to replace amor.py with csv.py.
The missing feature in csv.py is the ability to keep track of a
running-sum, but this is a bit of a hack in amor.py considering we
otherwise view csv entries as unordered.
We could add a running-sum to csv.py, or instead, just include a running
sum as a part of our bench output. We have all the information there
anyways, and if it simplifies the mess that is our csv scripts, that's a
win.
---
This also replaces the bench "meas", "iter", and "size" fields with the
slightly simpler "m" (measurement? metric?) and "n" fields. It's up to
the specific benchmark exactly how to interpret "n", but one field is
sufficient for existing scripts.
So instead of configuring gc_steps at mount time (or eventually compile
time), lfsr_fs_gc now takes a steps parameter that controls how much gc
work to attempt:
  int lfsr_fs_gc(lfs_t *lfs, lfs_soff_t steps, uint32_t flags);
This API was needed internally to better deduplicate on-mount gc, and I
figured it might also be useful for users to be able to easily change
gc_steps per lfsr_fs_gc call.
I realize this could also be accomplished with the theoretical
lfsr_fs_gccfg, but it's a bit easier to not need a struct every call.
Most likely, depending on project/system, users will always call
lfsr_fs_gc with either 1 (minimal work) or -1 (maximal work), or, worst
case, can define a system-wide GC_STEPS somewhere.
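So in practice calls will probably look like one of these (passing 0 for
flags here is just a placeholder):

  // minimal gc work, e.g. opportunistically in a main loop
  int err = lfsr_fs_gc(&lfs, 1, 0);

  // maximal gc work, e.g. before entering a low-power state
  err = lfsr_fs_gc(&lfs, -1, 0);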
---
Better deduplicating the on-mount gc work saved some code, though it's
worth noting this could have been done internally and not exposed to users:
           code             stack
before:    36476            2680
after:     36316 (-0.4%)    2680 (+0.0%)
This has been a long-time coming, mount flags are just too useful for
configuring a filesystem at runtime.
Currently this is limited to LFS_M_RDONLY and LFS_M_CKPROGS, but there
are a few more planned in the future:
  LFS_M_RDWR      = 0x0000,  // Mount the filesystem as read and write
  LFS_M_RDONLY    = 0x0001,  // Mount the filesystem as readonly
  LFS_M_STRICT*   = 0x0002,  // Error if on-disk config does not match
  LFS_M_FORCE*    = 0x0004,  // Ignore compat flags, mount readonly
  LFS_M_FORCEWITHRECKLESSABANDON*
                  = 0x0008,  // Ignore compat flags, mount read write
  LFS_M_CKPROGS   = 0x0010,  // Check progs by reading back progged data
  LFS_M_CKREADS*  = 0x0020,  // Check reads via checksums

  * Hypothetical
As a convenience, we also return mount flags in the struct lfs_fsinfo's
flags field as their relevant LFS_I_* variants. Though this is only to
match statvfs, and only because it's cheap; littlefs's API is low-level
and we should expect users to know what flags they passed to lfsr_mount.
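A usage sketch (the exact lfsr_mount/lfsr_fs_stat signatures here are
assumptions):

  // mount with prog checking enabled
  int err = lfsr_mount(&lfs, &cfg, LFS_M_RDWR | LFS_M_CKPROGS);
  if (err) {
      return err;
  }

  // mount flags are reflected back as LFS_I_* flags in lfs_fsinfo, e.g.
  // to fall back to readonly behavior at a higher level
  struct lfs_fsinfo fsinfo;
  err = lfsr_fs_stat(&lfs, &fsinfo);
  if (!err && (fsinfo.flags & LFS_I_RDONLY)) {
      // treat the filesystem as readonly
  }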
As for the new mount flags:
- LFS_M_RDONLY - For consistency with existing APIs, this just asserts
on write operations, which makes it a bit useless... But the info flag
LFS_I_RDONLY may be useful for falling back to a readonly mode if
we encounter on-disk compat issues.
At least if we implement the theoretical LFS_UNTRUSTED_USER mode,
LFS_M_RDONLY could become a runtime error.
- LFS_M_RDWR - This really just exists to complement LFS_M_RDONLY and to
match LFS_O_RDONLY/LFS_O_RDWR. It's just an alias for 0, and I don't
think there will ever be a reason to make it non-0 (but I can always
be wrong!).
- LFS_M_CKPROGS - This replaces the check_progs config option and avoids
using a full byte to store a bool.
We should probably also have a compile-time option to compile this out
(LFS_NO_CKPROGS?), but that's a future thing to do.
This ended up adding a surprising bit of code, considering we're just
moving flags around, and noise in lfs_alloc added a bit of stack again:
           code             stack
before:    35880            2672
after:     35932 (+0.1%)    2680 (+0.3%)
Thinking about use cases a bit, most lfsr_fs_gc calls will be to perform
background work, and can benefit from being incremental.
We already support incremental gc and all the mess associated with
traversal invalidation via the traversal API, so we might as well expose
this through lfsr_fs_gc.
The main downside is that we need to store an lfsr_traversal_t object
somewhere, which is not exactly a cheap struct. I was originally
considering limiting incremental gc to the traversal API for this
reason, but I think the value add of an incremental lfsr_fs_gc is too
compelling... Though we really should add a compile-time option
(LFS_NO_GC? LFS_NO_INCRGC?) to allow users to opt-out of this RAM cost
if they're never going to call this function.
Oh, and lfs_t also becomes self-referential, which might become a
problem for higher-level language users...
---
The incremental behavior of lfsr_fs_gc can be controlled by the new
gc_steps config option. This allows more than one step to be performed
at a time, which may allow for more progress when intermixed with
write-heavy filesystem operations. Setting gc_steps=-1 performs a full
traversal every call, which guarantees always making some amount of
progress.
This adds a bit of code, since we now need to check for/resume existing
traversals. But the real cost is the added RAM to lfs_t, which is
unfortunately wasted if you never call lfsr_fs_gc:
           code             stack            lfs_t
before:    35708            2672             164
after:     35756 (+0.1%)    2672 (+0.0%)     296 (+80.5%)
lfs_fs_gc is still not reimplemented, but this is accessible through the
traversal API with LFS_T_COMPACT.
This is also the first traversal operation that can mutate the
filesystem, which brings its own set of problems:
- We need to set LFS_F_DIRTY in lfsr_mtree_gc now, which really
highlights how much of a mess having two flag fields is...
We do _not_ clobber in this case, since we assume lfsr_mtree_gc knows
what it's doing.
- We can now commit to an mroot in the mroot chain outside of the normal
mroot chain update logic.
This is a bit scary, but should just work.
The only issue so far is that we need to allow mdirs to follow the
mroot during mroot splits if mid=-1, even if they aren't lfs_t's mroot
mdir.
This should now be decently tested with the new
test_traversal_compact_* tests.
- It's easy for mtraversal's mdir and mtinfo's mdir to fall out of sync
when mutating... Why do we have two of these?
The actual compaction itself is pretty straightforward: just mark as
unerased, eoff=-1, and call lfsr_mdir_commit with an empty commit. This
is now wrapped up in lfsr_mdir_compact.
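Roughly (field/function names and signatures here are approximations of
the description above, not the actual code):

  static int lfsr_mdir_compact(lfs_t *lfs, lfsr_mdir_t *mdir) {
      // mark the mdir as unerased and reset eoff so the next commit
      // can't just be appended in-place
      mdir->flags &= ~LFSR_MDIR_ERASED;   // hypothetical flag
      mdir->eoff = -1;
      // an empty commit then forces a compaction
      return lfsr_mdir_commit(lfs, mdir, NULL, 0);
  }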
Code changes:
           code             stack
before:    34528            2640
after:     34652 (+0.4%)    2640 (+0.0%)
Though the real hard part will be implementing gc_compact_thresh over
btree nodes...
These emulate powerloss behavior where only some of the bits being
progged are actually progged if there is a powerloss. This behavior was
the original motivation for our ecksums/fcrcs, so it's good to have this
tested.
As a simplification, these only test the extremes:
- LFS_EMUBD_POWERLOSS_SOMEBITS => one bit progged
- LFS_EMUBD_POWERLOSS_MOSTBITS => all-but-one bit progged
Also, they flip bits instead of preserving exact partial prog behavior,
but this is allowed (progs can have any intermediate value), has the
same effect as partial progs, and should encourage failed progs.
This required a number of tweaks in emubd: moved powerloss before prog,
moved mutate after powerloss, etc, but these shouldn't affect other
powerloss behaviors. Handling powerloss after prog was only to avoid
power_cycles=1 being useless, it's not strictly required.
Good news is testing so far suggests our ecksum design is sound.
More information upstream (f2a6f45, fc2aa33, 7873d81), but this adds
LFS_EMUBD_POWERLOSS_OOO for testing out-of-order block devices that
require sync to be called for things to serialize. It's a simple
implementation, just reverts the first write since last sync on
powerloss, but gets the job done.
Cherry-picking these changes required reverting emubd's scratch buffer,
but carrying around an extra ~block_size of memory isn't a big deal
here.
This configuration option enables the previous behavior of reading back
every prog to check that the data was written correctly.
Unfortunately, this brings a bit of baggage, thanks to our cache
interactions being more complicated now:
- We really want to reuse the rcache for prog validation, despite the
cache performance implications. Unfortunately, we simply can't, thanks
to the new bd utility functions tying up the rcache. lfsr_bd_cpy, for
example, does not expect rcache to be invalidated between a read and
prog, and if it is, things break (I may or may not have found this by
experience).
These bd utilities are valuable, so we really need some other way to
validate our progs.
- Since we can't rely on the rcache, this leaves checksumming as the
only option for validating progs. Checksumming isn't perfect, as there
is a decent chance of false negatives, but to be honest it's probably
good enough for anything that's not malicious.
- This also adds the new constraint that we need to be able to read back
any prog into the pcache, which implies read_size <= prog_size. This
constraint didn't exist when we could clobber our rcache, but this is
not worth throwing away the new bd utilities. Not to mention
clobbering our rcache could hurt cache performance.
Why not make read_size <= prog_size conditional on check_progs?
The main reason is convenience. One very compelling use case for
check_progs is to help debug unknown filesystem/integration failures,
but if you can't enable check_progs without changing the filesystem
configuration, you can't really rely on check_progs for debugging.
This helps future proof what we expect from block devices, in case
future error detection/correction mechanisms can benefit from our
prog_size always being readable.
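Conceptually the check ends up looking something like this sketch (names
are assumed, and read_size chunking/alignment is elided), where the
prog's checksum is computed as the data goes out and the readback only
needs a small scratch buffer:

  static int bd_ckprog(lfs_t *lfs, lfs_block_t block, lfs_off_t off,
          lfs_size_t size, uint32_t cksum,
          uint8_t *scratch, lfs_size_t scratch_size) {
      // read the progged region back and recompute its checksum; we may
      // no longer have the original data in RAM, which is why this is a
      // checksum comparison and not a memcmp
      uint32_t cksum_ = 0;
      for (lfs_off_t i = 0; i < size; i += scratch_size) {
          lfs_size_t d = lfs_min(scratch_size, size-i);
          int err = lfs->cfg->read(lfs->cfg, block, off+i, scratch, d);
          if (err) {
              return err;
          }
          cksum_ = lfs_crc32c(cksum_, scratch, d);   // hypothetical crc32c helper
      }

      // a mismatch means the prog didn't stick
      return (cksum_ == cksum) ? 0 : LFS_ERR_CORRUPT;
  }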
Code changes were not that significant, however there was a surprising
stack cost. This seems to be because lfsr_bd_read__ can now be called
from multiple places, causing it to no longer be inlined in
lfsr_bd_read_, costing a bit of stack for the additional function call:
           code             stack
before:    33566            2624
after:     33682 (+0.3%)    2640 (+0.6%)
The original goal here was to restore all of the revision count/
wear-leveling features that were intentionally ignored during
refactoring, but over time a few other ideas to better leverage our
revision count bits crept in, so this is sort of the amalgamation of
that...
Note! None of these changes affect reading. mdir fetch strictly needs
only to look at the revision count as a big 32-bit counter to determine
which block is the most recent.
The interesting thing about the original definition of the revision
count, a simple 32-bit counter, is that it actually only needs 2-bits to
work. Well, three states really: 1. most recent, 2. less recent, 3.
future most recent. This means the remaining bits are sort of up for
grabs to other things.
Previously, we've used the extra revision count bits as a heuristic for
wear-leveling. Here we reintroduce that, a bit more rigorously, while
also carving out space for a nonce to help with commit collisions.
Here's the new revision count breakdown:
vvvvrrrr rrrrrrnn nnnnnnnn nnnnnnnn
'-.''----.----''---------.--------'
  '------|---------------|---------- 4-bit relocation revision
         '---------------|---------- recycle-bits recycle counter
                         '---------- pseudorandom nonce
- 4-bit relocation revision
We technically only need 2-bits to tell which block is the most
recent, but I've bumped it up to 4-bits just to be safe and to make
it a bit more readable in hex form.
- recycle-bits recycle counter
A user-configurable counter that tracks how many times a metadata
block has been erased. When it overflows, we return the block to the
allocator to participate in block-level wear-leveling again.
This implements our copy-on-bounded-write strategy.
- pseudorandom nonce
The remaining bits we fill with a pseudorandom nonce derived from the
filesystem's prng. Note this prng isn't the greatest (it's just the
xor of all mdir cksums), but it gets the job done. It should also be
reproducible, which can be a good thing.
Suggested by ithinuel, the addition of a nonce should help with the
commit collision issue caused by noop erases. It doesn't completely
solve things, since we're only using crc32c cksums not collision
resistant cryptographic hashes, but we still have the existing
valid/perturb bit system to fall back on.
When we allocate a new mdir, we want to zero the recycle counter. This
is where our relocation revision is useful for indicating which block is
the most recent:
initial state: 10101010 10101010 10101010 10101010
               '-.'
                +1    zero            random
                 v .----'----..---------'--------.
lfsr_rev_init: 10110000 00000011 01110010 11101111
When we increment, we increment recycle counter and xor in a new nonce:
initial state: 10110000 00000011 01110010 11101111
               '--------.----''---------.--------'
                        +1              xor <-- random
                        v               v
lfsr_rev_init: 10110000 00000111 01010100 01000000
And when the recycle counter overflows, we relocate the mdir.
If we aren't wear-leveling, we just increment the relocation revision to
maximize the nonce.
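To make the scheme concrete, here's a sketch of the init/increment logic
described above using the 4/10/18-bit split from the diagram (names and
exact handling are illustrative, not the actual implementation):

  #include <stdint.h>

  // field widths from the diagram; recycle-bits is configurable, here it
  // is shown as the full 10-bit middle field
  #define REV_RELOC_BITS 4
  #define REV_NONCE_BITS 18
  #define REV_NONCE_MASK ((UINT32_C(1) << REV_NONCE_BITS) - 1)

  // allocating a new mdir block: bump the relocation revision, zero the
  // recycle counter, and fill the low bits with a fresh nonce
  static uint32_t rev_init(uint32_t rev, uint32_t nonce) {
      uint32_t reloc = (rev >> (32-REV_RELOC_BITS)) + 1;
      return (reloc << (32-REV_RELOC_BITS)) | (nonce & REV_NONCE_MASK);
  }

  // committing to an existing mdir: increment the recycle counter and
  // xor in a new nonce; recycle counter overflow is what eventually
  // triggers relocation (or, without wear-leveling, just bumps the
  // relocation revision)
  static uint32_t rev_inc(uint32_t rev, uint32_t nonce) {
      rev += UINT32_C(1) << REV_NONCE_BITS;
      rev ^= nonce & REV_NONCE_MASK;
      return rev;
  }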
---
Some other notes:
- Renamed block_cycles -> block_recycles.
This is intended to help avoid confusing block_cycles with the actual
physical number of erase cycles supported by the device.
I've noticed this happening a few times, and it's unfortunately
equivalent to disabling wear-leveling completely. This can be improved
with better documentation, but also changing the name doesn't hurt.
- We now relocate both blocks in the mdir at the same time.
Previously we only relocated one block in the mdir per recycle. This
was necessary to keep our threaded linked-list in sync, but the
threaded linked-list is now no more!
Relocating both blocks is simpler, updates the mtree less often, is
compatible with metadata redundancy, and avoids aliasing issues that
were a problem when relocating one block.
Note that block_recycles is internally multiplied by 2 so each block
sees the correct number of erase cycles.
- block_recycles is now rounded down to a power-of-2.
This makes the counter logic easier to work with and takes up less RAM
in lfs_t. This is a rough heuristic anyways.
- Moved the lfs->seed updates into lfsr_mountinited + lfsr_mdir_commit.
This avoids readonly operations affecting the seed and should help
reproducibility.
- Changed rev count in dbg scripts to render as hex, similar to cksums.
Now that we're using most of the bits in the revision count, the decimal
version is, uh, not helpful...
Code changes:
           code             stack
before:    33342            2640
after:     33434 (+0.3%)    2640 (+0.0%)
These seem fitting here, even if the test defines aren't "real defines".
The duplicate expressions should still be side-effect free and easy to
optimize out.
This should also avoid future lfs_min32 vs intmax_t issues.
A much requested feature, this allows much finer control of how RAM is
allocated for the system.
It was difficult to introduce this in previous versions of littlefs due
to how we steal caches during certain file operations, but now we don't
do that and treat the caches much more transparently.
Managing separate cache sizes does add a bit of code, but this is well
worth the potential for RAM savings due to increased flexibility:
           code             stack
before:    33656            2632
after:     33714 (+0.2%)    2640 (+0.3%)
Also interesting to note this reduces alignment requirements for the
rcache/pcache, since they don't need to share alignment, and completely
removes any alignment requirement from the file buffers.
Also added related asserts to lfs_init.
Note the fragment_size <= block_size/8 limit is to avoid wasteful corner
cases where only one fragment can fit in a block. The shrub_size <=
block_size/4 limit is looser because of how shrubs temporarily
overcommit.
As for the other limits, inline_size is bounded by shrub_size, and
crystal_thresh technically doesn't have a limit, though values >
block_size stop having an effect.
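For reference, a sketch of the kind of checks this adds to lfs_init
(macro and field names approximate):

  // avoid wasteful corner cases where only one fragment fits in a block
  LFS_ASSERT(lfs->cfg->fragment_size <= lfs->cfg->block_size/8);
  // shrubs temporarily overcommit, so their limit is looser
  LFS_ASSERT(lfs->cfg->shrub_size <= lfs->cfg->block_size/4);
  // and inline_size is bounded by shrub_size
  LFS_ASSERT(lfs->cfg->inline_size <= lfs->cfg->shrub_size);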
Motivation:
- Debuggability. Accessing the current test/bench defines from inside
gdb was basically impossible for some dumb macro-debug-info reason I
can't figure out.
In theory, GCC provides a .debug_macro section when compiled with -g3.
I can see this section with objdump --dwarf=macro, but somehow gdb
can't seem to find any definitions? I'm guessing the #line source
remapping is causing things to break somehow...
Though even if macro-debugging gets fixed, which would be valuable,
accessing defines in the current test/bench runner can trigger quite
a bit of hidden machinery. This risks side-effects, which is never
great when debugging.
All of this is quite annoying because the test/bench defines are
usually the most important piece of information when debugging!
This replaces the previous hidden define machinery with simple global
variables, which gdb can access no problem.
- Also when debugging we no longer awkwardly step into the test_define
function all the time!
- In theory, global variables, being a simple memory access, should be
quite a bit faster than the hidden define machinery. This does matter
because running tests _is_ a dev bottleneck.
In practice though, any performance benefit is below the noise floor,
which isn't too surprising (~630s +-~20s).
- Using global variables for defines simplifies the test/bench runner
quite a bit.
Though some of the previous complexity was due to a whole internal
define caching system, which was supposed to lazily evaluate test
defines to avoid evaluating defines we don't use. This all proved to
be useless because the first thing we do when running each test is
evaluate all defines to generate the test id (lol).
So now, instead of lazily evaluating and caching defines, we just
generate global variables during compilation and evaluate all defines
for each test permutation immediately before running.
This relies heavily on __attribute__((weak)) symbols, and lets the
linker really shine.
As a funny perk this also effectively interns all test/bench defines by
the address of the resulting global variable. So we don't even need to
do string comparisons when mapping suite-level defines to the
runner-level defines.
---
Perhaps the more interesting thing to note, is the change in strategy in
how we actually evaluate the test defines.
This ends up being a surprisingly tricky problem, due to the potential
of mutual recursion between our defines.
Previously, because our define machinery was lazy, we could just
evaluate each define on demand. If a define required another define, it
would lazily trigger another evaluation, implicitly recursing through
C's stack. If cyclic, this would eventually lead to a stack overflow,
but that's ok because it's a user error to let this happen.
The "correct" way, at least in terms of being computationally optimal,
would be to topologically sort the defines and evaluate the resulting
tree from the leaves up.
But I ain't got time for that, so the solution here is equal parts
hacky, simple, and effective.
Basically, we just evaluate the defines repeatedly until they stop
changing:
- Initially, mutually recursive defines may read the uninitialized
values of their dependencies, and end up with some arbitrarily wrong
result. But as the defines are repeatedly evaluated, assuming no
cycles, the correct results should eventually bubble up the tree until
all defines converge to the correct value.
- This is O(n*e) vs O(n+e), but our define graph is usually quite
shallow.
- To prevent non-halting, we error after an arbitrary 1000 iterations.
If you hit this, it's likely because there is a cycle in the define
graph.
This is runtime configurable via the new --define-depth flag.
- To keep things consistent and reproducible, we zero initialize all
defines before the first evaluation.
I don't think this is strictly necessary, but it's important for the
test runner to have the exact same results on every run. No one wants
a "works on my machine" situation when the tests are involved.
Experimentation shows we only need an evaluation depth of 2 to
successfully evaluate the current set of defines:
$ ./runners/test_runner --list-defines --define-depth=2
And any performance impact is negligible (~630s +-~20s).
Now, when files are synced, they broadcast their disk changes to any other
opened file handles. In effect, all open files match disk after a sync
call to any opened file handle pointing to that file.
This was a much requested feature, as the previous behavior (multiple
opened file handles maintain independent snapshots) is pretty different
from other filesystems. It's also quite difficult to implement outside
of the filesystem, since you need to track all opened files, requiring
either unbounded RAM or a known upper limit.
---
A bit unrelated, but this commit also changes bshrub estimate
calculation to include all opened file handles. This adds some annoying
complexity, but is necessary to prevent sporadic ERANGE errors when
the same file is opened multiple times.
The current implementation just refetches on-disk metadata. This adds
some maybe unnecessary metadata lookups, but simplifies things by
avoiding the tracking of on-disk sprout/shrub size, which risks falling
out of date. Keep in mind we only recalculate the estimate every
~inline_size/2 bytes written.
Just like lfsr_mdir_estimate, this scales O(n^2) with the number of
opened files (these are basically the same function... hmmm... can they
be deduplicated?). This is unlikely to be a problem for littlefs's use
case, but just something to be aware of.
Code changes:
           code             stack
before:    32920            3032
after:     33192 (+0.8%)    3048 (+0.5%)
Our crystallization threshold doesn't really describe the bounds of an
object, and I think it's a bit easier to think of it as a threshold for
block compaction.
Heck I've already been calling this the crystallization threshold all
over the code base.
An important change is that this bumps the value by 1 byte, so
crystal_thresh now describes the smallest size of a block our write
strategy will attempt to write.
Heuristically:
- data >= crystal_thresh => compacted into blocks
- data < crystal_thresh => stored as fragments
This turned out to not be all that useful.
Tests already take quite a while to run, which is a good thing! We have a
lot of tests! 942.68s or ~15 minutes of tests at the time of writing to
be exact. But simply multiplying the number of tests by some number of
geometries is heavy handed and not a great use of testing time.
Instead, tests where different geometries are relevant can parameterize
READ_SIZE/PROG_SIZE/BLOCK_SIZE at the suite level where needed. The
geometry system was just another define parameterization layer anyways.
Testing different geometries can still be done in CI by overriding the
relevant defines anyways, and it _might_ be interesting there.
As a part of the general redesign of files, all files, not just small
files, can inline some data directly in the metadata log. Originally,
this was a single piece of inlined data or an inlined tree (shrub) that
effectively acted as an overlay over the block/btree data.
This is now changed so that when we have a block/btree, the root of the
btree is inlined. In effect making a full btree a sort of extended
shrub.
I'm currently calling this a "geoxylic btree", since that seems to be a
somewhat related botanical term. Geoxylic btrees have, at least on
paper, a number of benefits:
- There is a single lookup path instead of two, this simplifies code a
bit and decreases lookup costs.
- One data structure instead of two also means lfsr_file_t requires
less RAM, since all of the on-disk variants can go into one big union.
Though I'm not sure this is very significant vs stack/buffer costs.
- The write path is much simpler and has less duplication (it was
difficult to deduplicate the shrub/btree code because of how the
shrub goes through the mdir).
In this redesign, lfsr_btree_commit_ leaves root attrs uncommitted,
allowing lfsr_bshrub_commit to finish the job via lfsr_mdir_commit.
- We don't need to maintain a shrub estimate, we just lazily evict trees
during mdir compaction. This has a side-effect of allowing shrubs to
temporarily grow larger than shrub_size before eviction.
NOTE THIS (fundamentally?) DOESN'T WORK
- There is no awkwardly high overhead for small btrees. The btree root
for two-block files should be able to comfortably fit in the shrub
portion of the btree, for example.
- It may be possible to also make the mtree geoxylic, which should
reduce storage overhead of small mtrees and make better use of the
mroot.
All of this being said, things aren't working yet. Shrub eviction during
compaction runs into a problem with a single pcache -- how do we write
the new btrees without dropping the compaction pcache? We can't evict
btrees in a separate pass because their number is unbounded...
This is based on how bench.py/bench_runners have actually been used in
practice. The main changes have been to make the output of bench.py more
readibly consumable by plot.py/plotmpl.py without needing a bunch of
hacky intermediary scripts.
Now instead of a single per-bench BENCH_START/BENCH_STOP, benches can
have multiple named BENCH_START/BENCH_STOP invocations to measure
multiple things in one run:
BENCH_START("fetch", i, STEP);
lfsr_rbyd_fetch(&lfs, &rbyd_, rbyd.block, CFG->block_size) => 0;
BENCH_STOP("fetch");
Benches can also now report explicit results, for non-io measurements:
BENCH_RESULT("usage", i, STEP, rbyd.eoff);
The extra iter/size parameters to BENCH_START/BENCH_RESULT also allow
some extra information to be calculated post-bench. This information gets
tagged with an extra bench_agg field to help organize results in
plot.py/plotmpl.py:
- bench_meas=<meas>+amor, bench_agg=raw - amortized results
- bench_meas=<meas>+div, bench_agg=raw - per-byte results
- bench_meas=<meas>+avg, bench_agg=avg - average over BENCH_SEED
- bench_meas=<meas>+min, bench_agg=min - minimum over BENCH_SEED
- bench_meas=<meas>+max, bench_agg=max - maximum over BENCH_SEED
---
Also removed all bench.tomls for now. This may seem counterproductive in
a commit to improve benchmarking, but I'm not sure there's actual value
to keeping bench cases committed in tree.
These were always quick to fall out of date (at the time of this commit
most of the low-level bench.tomls, rbyd, btree, etc, no longer
compiled), and most benchmarks were one-off collections of scripts/data
with results too large/cumbersome to commit and keep updated in tree.
I think the better way to approach benchmarking is a separate repo
(multiple repos?) with all related scripts/state/code and results
committed into a hopefully reproducible snapshot. Keeping the
bench.tomls in that repo makes more sense in this model.
There may be some value to having benchmarks in CI in the future, but
for that to make sense they would need to actually fail on performance
regression. How to do that isn't so clear. Anyways we can always address
this in the future rather than now.
The original name was a bit of a mouthful.
Also dropped the default crystal_size in the test/bench runners
block_size/4 -> block_size/8. I'm already noticing large amounts of
inflation when blocks are fragmented, though I am experimenting with a
rather small fragment_size right now.
Future benchmarks/experimentation is required to figure out good values
for these.
The attempt to implement in-rbyd data slicing, being lazily coalesced
during rbyd compaction, failed pretty much completely.
Slicing is a very enticing write strategy, getting both minimal overhead
post-compaction and fast random write speeds, but the idea has some
fundamental conflicts with how we play out attrs post-compaction.
This idea might work in a more powerful filesystem, but brings back the
need to simulate rbyds in RAM, which is something I really don't want to
do (complex, bug-prone, likely adds code cost, may not even be tractable).
So, third time's the charm?
---
This new write strategy writes only datas and bptrs, and avoids dagging
by completely rewriting any regions of data larger than a configurable
crystallization threshold.
This loses most of the benefits of data crystallization, random writes
will now usually need to rewrite a full block, but as a tradeoff our
data at rest is always stored with optimal overhead.
And at least data crystallization still saves space when our data isn't
block aligned, or in sparse files. From reading up on some other
filesystem designs it seems this is a desirable optimization sometimes
referred to as "tail-packing" or "block suballocation"
Some other changes from just having more time to think about the
problem:
1. Instead of scanning to figure out our current crystal size, we can
use a simple heuristic of 1. look up left block, 2. look up right
block, 3. assume any data between these blocks contributes to our
current crystal.
This is just a heuristic, so worst case you write the first and last
byte of a block, which is enough to trigger compaction into a block.
But on the plus side this avoids issues with small holes preventing
blocks from being formed.
This approach brings the number of btree lookups down from
O(crystallize_size) to 2.
2. I've gone ahead and dropped the previous scheme of coalesce_size
+ fragment_size and instead adopted a single fragment_size that
controls the size of, well, fragments, i.e. data elements stored
directly in trees.
This affects both the inlined shrub as well as fragments stored in
the inner nodes of the btree. I believe it's very similar to what is
often called "pages" in logging filesystems, though I'm going to
avoid that term for now because it's a bit overloaded.
Previously, neighboring writes that, when combined, would exceed our
coalesce_size just weren't combined. Now they are combined up
to our fragment size, potentially splitting the right fragment.
Before (fragment_size=8):
.---+---+---+---+---+---+---+---.
|            8 bytes            |
'---+---+---+---+---+---+---+---'
+
.---+---+---+---+---.
|      5 bytes      |
'---+---+---+---+---'
=
.---+---+---+---+---+---+---+---+---+---.
|      5 bytes      |      5 bytes      |
'---+---+---+---+---+---+---+---+---+---'
After:
.---+---+---+---+---+---+---+---.
|            8 bytes            |
'---+---+---+---+---+---+---+---'
+
.---+---+---+---+---.
|      5 bytes      |
'---+---+---+---+---'
=
.---+---+---+---+---+---+---+---+---+---.
|            8 bytes            |2 bytes|
'---+---+---+---+---+---+---+---+---+---'
This leads to better fragment alignment (much like our block
strategy), and minimizes tree overhead.
Any neighboring data to the right is only coalesced if it fits in the
current fragment, or would be rewritten (carved) anyways, to avoid
unnecessary data rewriting.
For example (fragment_size=8):
.---+---+---+---+---+---+---+---+---+---+---+---+---+---.
|        6 bytes        |        6 bytes        |2 bytes|
'---+---+---+---+---+---+---+---+---+---+---+---+---+---'
+
.---+---+---+---+---.
|      5 bytes      |
'---+---+---+---+---'
=
.---+---+---+---+---+---+---+---+---+---+---+---+---+---.
|            8 bytes            |    4 bytes    |2 bytes|
'---+---+---+---+---+---+---+---+---+---+---+---+---+---'
Other than these changes this commit is mostly a bunch of carveshrub
rewriting again, which continues to be nuanced and annoying to get
bug free.
- coalesce_size - The amount of data allowed to coalesce into single
data entries.
- crystallize_size - How much data is allowed to be written to btree
inner nodes before needing to be compacted into a block.
Also deduplicated the test config, which is something I've been wanting
to do for a while. It doesn't make sense to need to modify several different
instantiations of lfs_config every time a config option is added or
removed...
This will stop being a problem when we actually have btrees, but for now
the fragmentation caused by byte-level syncs was easily enough to
overflow an mdir when cache size is big.
A smaller cache size is also nicer for debugging, since smaller cache
sizes result in data getting flushed to disk earlier, which is easier
to inspect than in-device buffers. And a 16-byte cache still provides
decent test coverage over cache interactions.
---
Also dropped inline_size to block_size/8. I realized while debugging
that opened shrubs take up additional space until we sync, so we need to
expect up to 2 temporary copies of shrubs when writing files.
Currently limited to inlined files and only simpler truncate-writes.
But still this lets us test file creation/deletion.
This is also enough logic to make it clear that, even though we have
some powerful high-level primitives, mapping file operations onto these
is still going to be non-trivial.
The previous system of relying on test name prefixes for ordering was
simple, but organizing tests by dependencies and topologically sorting
during compilation is 1. more flexible and 2. simplifies test names,
which get typed a lot.
Note these are not "hard" dependencies, each test suite should work fine
in isolation. These "after" dependencies just hint an ordering when all
tests are ran.
As such, it's worth noting the tests should NOT error if a dependency is
missing. This unfortunately makes it a bit hard to catch typos, but
allows faster compilation of a subset of tests.
---
To make this work, the way tests are linked has changed from using a custom
linker section (fun linker magic!) to a weakly linked array appended to
every source file (also fun linker magic!).
At least with this method test.py has strict control over the test
ordering, and doesn't depend on 1. the order in which the linker merges
sections, and 2. the order tests are passed to test.py. I didn't realize
the previous system was so fragile.
This marks internal tests/benches (case.in="lfs.c") with an otherwise-unused
flag that is printed during --summary/--list-*. This just helps identify which
tests/benches are internal.
TEST_PERMUTATION/BENCH_PERMUTATION make it possible to map an integer to
a specific permutation efficiently. This is helpful since our testing
framework really only parameterizes single integers.
The exact implementation took a bit of trial and error. It's based on
https://stackoverflow.com/a/7919887 and
https://stackoverflow.com/a/24257996, but modified to run in O(n) with
no extra memory. In the discussion it seemed like this may not actually
be possible for lexicographic ordering of permutations, but fortunately
we don't care about the specific ordering, only the reproducibility.
Here's how it works:
1. First populate an array with all numbers 0-n.
2. Iterate through each index, selecting only from the remaining
numbers based on our current permutation.
       .- i%rem --.
       v   .----+----.
[p0 p1 |-> r0 r1 r2 r3]
Normally to maintain lexicographic ordering you would have to do an O(n)
shift at this step as you remove each number. But instead we can just swap
the removed number and the number under the index. This effectively
shrinks the remaining part of the array, but permutes the numbers
a bit. Fortunately, since each successive permutation swaps
at the same location, the resulting permutations will be both
exhaustive and reproducible, if unintuitive.
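In C, the resulting decode looks something like this (a sketch of the
algorithm above, not the literal TEST_PERMUTATION implementation):

  #include <stddef.h>
  #include <stdint.h>

  // map an integer i to a permutation of 0..n-1 in O(n) with no extra
  // memory; the ordering is not lexicographic, but it is exhaustive and
  // reproducible
  static void permutation(uint32_t i, uint32_t *buffer, size_t n) {
      // 1. populate with the numbers 0..n-1
      for (size_t j = 0; j < n; j++) {
          buffer[j] = j;
      }

      // 2. select from the remaining numbers based on i, shrinking the
      // remaining region by swapping instead of shifting
      for (size_t j = 0; j < n; j++) {
          size_t rem = n - j;
          size_t k = j + (i % rem);
          i /= rem;

          uint32_t t = buffer[j];
          buffer[j] = buffer[k];
          buffer[k] = t;
      }
  }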
Now permutation/fuzz tests can reproduce specific failures by defining
either -DPERMUTATION=x or -DSEED=x.
I wondered if walking in Python 2's footsteps was going to run into the
same issues, and sure enough, memory-backed iterators became unwieldy.
The motivation for this change is that large ranges in tests, such as
iterators over seeds or permutations, became prohibitively expensive to
compile. This meant more iteration moving into tests with more steps to
reproduce failures. This sort of defeats the purpose of the test
framework.
The solution here is to move test permutation generation out of test.py
and into the test runner itself. The allows defines to generate their
values programmatically.
This does conflict with the test frameworks support of sets of explicit
permutations, but this is fixed by also moving these "permutation sets"
down into the test runner.
I guess it turns out the closer your representation matches your
implementation, the better everything works.
Additionally the define caching layer got a bit of tweaking. We can't
precalculate the defines because of mutual recursion, but we can
precalculate which define/permutation each define id maps to. This is
necessary as otherwise figuring out each define's define-specific
permutation would be prohibitively expensive.
This turned out to be a bit tricky, and the scheme in bench_rbyd is
broken.
The core issue is that we don't have a distinction between physical and
logical block sizes, so we can't use a block device configured for one
geometry with a littlefs instance operating on a different geometry. For
this and other reasons we should probably have two configuration
variables in the future, but at the moment that is out of scope.
The problem with the approach in bench_rbyd, which changes the
lfs_config at runtime, is that this breaks emubd which also depends on
lfs_config due to a leaky abstraction. This causes unnoticed memory
corruption.
---
To get something working, the tests now change the underlying BLOCK_SIZE
test define before the tests are run. This starts the test with a block
device configured with a large block_size. To keep this from breaking
things the geometry definitions in the test and bench runners no longer
use default dependent definitions, instead defining everything
explicitly.
With block_size being so large, this makes some of the emubd operations
less performant, notably the --disk option for exposing block device
state during testing.
It would also be nice to use the copy-on-write backend of emubd for some
of the permutation testing, but since it operates on a block-by-block
basis, it doesn't really work when the block device is just one big
block.
When you add a function to every benchmark suite, you know it should
probably be provided by the benchmark runner itself. That being said,
randomness in tests/benchmarks is a bit tricky because it needs to be
strictly controlled and reproducible.
No global state is used, allowing tests/benches to maintain multiple
randomness streams, which can be useful for checking results during a run.
There's an argument for having global prng state in that the prng could
be preserved across power-loss, but I have yet to see a use for this,
and it would add a significant requirement to any future test/bench runner.
- Fixed prettyasserts.py parsing when '->' is in expr
- Made prettyasserts.py failures not crash (yay dynamic typing)
- Fixed the initial state of the emubd disk file to match the internal
state in RAM
- Fixed true/false getting changed to True/False in test.py/bench.py
defines
- Fixed accidental substring matching in plot.py's --by comparison
- Fixed a missed LFS_BLOCk_CYCLES in test_superblocks.toml
- Changed test.py/bench.py -v to only show commands being run
Including the test output is still possible with test.py -v -O-, making
the implicit inclusion redundant and noisy.
- Added license comments to bench_runner/test_runner
These are really just different flavors of test.py and test_runner.c
without support for power-loss testing, but with support for measuring
the cumulative number of bytes read, programmed, and erased.
Note that the existing define parameterization should work perfectly
fine for running benchmarks across various dimensions:
  ./scripts/bench.py \
      runners/bench_runner \
      bench_file_read \
      -gnor \
      -DSIZE='range(0,131072,1024)'
Also added a couple basic benchmarks as a starting point.