This change is necessary to handle out-of-order writes found by pjsg's
fuzzing work.
The problem is that it is possible for (non-NOR) block devices to write
pages in any order, or to even write random data in the case of a
power-loss. This breaks littlefs's use of the first bit in a page to
indicate the erase-state.
pjsg notes this behavior is documented in the W25Q here:
https://community.cypress.com/docs/DOC-10507
---
The basic idea here is to CRC the next page, and use this "erase-state CRC" to
check if the next page is erased and ready to accept programs.
.------------------. \          commit
|     metadata     | |
|                  | +---.
|                  | |   |
|------------------| |   |
| erase-state CRC -----. |
|------------------| | | |
|   commit CRC ------|-|-'
|------------------| / |
|     padding      |   |        padding (doesn't need CRC)
|                  |   |
|------------------| \ |        next prog
|      erased?     | +-'
|        |         | |
|        v         | /
|                  |
|                  |
'------------------'
This is made a bit annoying since littlefs doesn't actually store the
page size (prog_size) in the superblock, since it doesn't need to know the
size for any other operation. We can work around this by storing both
the CRC and size of the next page when necessary.
Another interesting note is that we don't need any bit tweaking
information, since we read the next page every time we would need to
know how to clobber the erase-state CRC. And since we only read
prog_size, this works really well with our caching, since the caches
must be a multiple of prog_size.
This also brings back the internal lfs_bd_crc function, in which we can
use some optimizations added to lfs_bd_cmp.
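As a rough illustration of the idea (not the actual littlefs code), checking
whether the next prog-sized page still matches the stored erase-state CRC
could look something like the sketch below; the read callback, chunk size,
and CRC details are assumptions made for the example:
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    // hypothetical read hook: fills buffer with size bytes at (block, off)
    typedef int (*read_fn)(void *ctx, uint32_t block, uint32_t off,
                           void *buffer, size_t size);

    // generic reflected CRC-32; littlefs's lfs_crc may differ in details
    static uint32_t crc32_update(uint32_t crc, const uint8_t *data,
                                 size_t size) {
        for (size_t i = 0; i < size; i++) {
            crc ^= data[i];
            for (int b = 0; b < 8; b++) {
                crc = (crc >> 1) ^ ((crc & 1) ? 0xedb88320 : 0);
            }
        }
        return crc;
    }

    // true if the prog_size bytes at (block, off) still CRC to ecrc, i.e.
    // the next page is in the same (erased) state it was when we committed
    static bool page_matches_ecrc(read_fn read, void *ctx,
                                  uint32_t block, uint32_t off,
                                  size_t prog_size, uint32_t ecrc) {
        uint8_t chunk[16];
        uint32_t crc = 0xffffffff;
        for (size_t i = 0; i < prog_size; i += sizeof(chunk)) {
            size_t d = prog_size - i;
            if (d > sizeof(chunk)) {
                d = sizeof(chunk);
            }
            if (read(ctx, block, off + i, chunk, d) != 0) {
                return false;
            }
            crc = crc32_update(crc, chunk, d);
        }
        return crc == ecrc;
    }
If the CRC no longer matches, the page can't be trusted to be erased and
the commit must move on to a new block instead of programming in place.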
Needs some cleanup but the idea is passing most relevant tests.
The previous cycle detection algorithm (a naive check against the largest
possible tail list) is simple and gets the job done, but has the potential to
take a very long time on disks with many blocks. Brent's algorithm, on
the other hand, visits at most 2x the number of blocks in the tail list.
Originally naive cycle detection was chosen over Floyd's algorithm to
avoid the extra complexity of managing two desynced traversals for every
traversal of the tail list, but Brent's algorithm is very well suited for our
use case, requiring only that we keep track of an additional mdir pointer
on the stack as we traverse.
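For reference, a minimal sketch of Brent's cycle detection over a singly
linked tail list; the struct here is a placeholder rather than littlefs's
real mdir type, but it shows that only one extra pointer and two counters
live on the stack during the traversal:
    #include <stdbool.h>
    #include <stddef.h>

    struct mdir {
        struct mdir *tail;  // next mdir in the tail list, NULL at the end
    };

    static bool tail_list_has_cycle(struct mdir *head) {
        struct mdir *tortoise = head;
        struct mdir *hare = head;
        size_t power = 1;   // current teleport distance
        size_t steps = 0;   // steps taken since the last teleport

        while (hare) {
            hare = hare->tail;
            steps += 1;
            if (hare == tortoise) {
                return true;  // hare caught up to the tortoise: cycle
            }
            if (steps == power) {
                // teleport the tortoise to the hare, double the distance
                tortoise = hare;
                power *= 2;
                steps = 0;
            }
        }
        return false;  // reached the end of the list: no cycle
    }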
Despite the comment being correct, the calculation is somehow off by a word,
meaning something must have been missed. Maybe the space for the move-delete
was missed since that was added later to avoid losing move-deletes during
relocations.
This was found with the new exhaustive power-loss searching added to the
test framework with -P2. The exact failure was
test_dirs_many_reentrant:2gg2cb:k4o6. This must be the first test that
ends up with all possible extra state in a single mdir block.
This happens in rare situations where there is a failed mdir relocation,
interrupted by a power-loss, containing the destination of a directory
rename operation, where the directory being renamed preceded the
relocating mdir in the mdir tail-list. This requires that, at some point, a
previous directory rename created a cycle.
If this happens, it's possible for the half-orphan to contain the only
reference to the renamed directory. Since half-orphans contain outdated
state when viewed through the mdir tail-list, the renamed directory
appears to be a full-orphan until we fix the relocating half-orphan.
This causes littlefs to incorrectly remove the renamed directory from
the mdir tail-list, causing catastrophic problems down the line.
The source of the problem is that the two different types of orphans
really operate on two different levels of abstraction: half-orphans fix
failed mdir commits, while full-orphans fix directory removes/renames.
Conflating the two leads to situations where we attempt to fix assumed
problems about the directory tree before we have fixed problems with the
mdir state.
The fix here is to separate the deorphan scan into two passes: a first pass
to fix half-orphans and correct any failed mdir commits, restoring the mdirs
and gstate to a known good state, and a second pass to fix failed
removes/renames.
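A structural sketch of that two-pass shape, using hypothetical helper names
(the real lfs_fs_deorphan is considerably more involved):
    // hypothetical helpers, shown as declarations only
    static int fix_half_orphans(void *lfs);  // repair failed mdir commits
    static int fix_full_orphans(void *lfs);  // repair failed removes/renames

    static int deorphan_two_pass(void *lfs) {
        // pass 1: fix half-orphans first, restoring the mdirs and gstate
        // to a known good state before judging the directory tree
        int err = fix_half_orphans(lfs);
        if (err) {
            return err;
        }
        // pass 2: only now look for full-orphans left behind by
        // interrupted removes/renames, using the repaired tail-list
        return fix_full_orphans(lfs);
    }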
---
This was found with the -Plinear heuristic powerloss testing, which now
runs on more geometries. The failing case was:
test_relocations_reentrant_renames:112gg261dk1e3f3:123456789abcdefg1h1i1j1k1
l1m1n1o1p1q1r1s1t1u1v1g2h2i2j2k2l2m2n2o2p2q2r2s2t2
Also fixed/tweaked some parts of the test framework as a part of finding
this bug:
- Fixed off-by-one in exhaustive powerloss state encoding.
- Added --gdb-powerloss-before and --gdb-powerloss-after to help debug
state changes through a failing powerloss; maybe this should be
expanded to any arbitrary powerloss number in the future.
- Added lfs_emubd_crc and lfs_emubd_bdcrc to get block/bd crcs for quick
state comparisons while debugging.
- Fixed bd read/prog/erase counts not being copied during exhaustive
powerloss testing.
- Fixed small typo in lfs_emubd trace.
This is possible thanks to invoxiaamo's optimization of compacting
renames to avoid the O(n^3) nested filters. Not only does this
significantly reduce the runtime cost of that operation, but it
reduces the maximum possible depth of recursion to 3 frames.
Deepest lfs_dir_traverse before:
traverse with commit
'-> traverse with filter
'-> traverse with move
'-> traverse with filter
Deepest lfs_dir_traverse after:
traverse with commit
'-> traverse with move
'-> traverse with filter
Rename can be VERY time consuming. One of the reasons is the 4-level
recursion depth of lfs_dir_traverse() seen if a compaction happens during
the rename.
lfs_dir_compact()
size computation
[1] lfs_dir_traverse(cb=lfs_dir_commit_size)
- do 'duplicates and tag update'
[2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
- Reaching a LFS_FROM_MOVE tag (here)
[3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
- do 'duplicates and tag update'
[4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])
followed by the compaction itself:
[1] lfs_dir_traverse(cb=lfs_dir_commit_commit)
- do 'duplicates and tag update'
[2] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1])
- Reaching a LFS_FROM_MOVE tag (here)
[3] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[1]) <= on 'from' dir
- do 'duplicates and tag update'
[4] lfs_dir_traverse(cb=lfs_dir_traverse_filter, data=tag[3])
Yet, analysis shows that levels [3] and [4] don't accomplish anything
if the callback is lfs_dir_traverse_filter...
A practical example:
- format and mount a 4KB block FS
- create 100 files of 256 Bytes named "/dummy_%d"
- create a 1024 Byte file "/test"
- rename "/test" "/test_rename"
- create a 1024 Byte file "/test"
- rename "/test" "/test_rename"
This triggers a compaction where lfs_dir_traverse was called 148393 times,
generating 25e6+ lfs_bd_read calls (~100 MB+ of data).
With the optimization, lfs_dir_traverse is now called 3248 times
(589e3 lfs_bd_read calls, ~2.3 MB of data)
=> ~43x improvement...
- nit: Moving brace to end of if statement line for consistency
- mount: add more debug info per CR
- Fix compiler error from extra parentheses
- Fix superblock typo
The basic idea is simple: if we seek to a position in the currently
loaded cache, don't flush the cache. Notably this ensures that seek is
always as fast or faster than just reading the data.
This is a bit tricky since we need to check that our new block and
offset match the cache; fortunately, we can skip the block check by
reevaluating the block index for both the current and new positions.
Note this only works when reading; for writing we need to always flush
the cache, or else we will lose the pending write data.
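A simplified sketch of that check with placeholder struct and field names
(not lfs_file_seek itself): keep the cache only when reading and only when
the old and new positions fall in the same block-sized window:
    #include <stdbool.h>
    #include <stdint.h>

    struct file {
        uint32_t pos;         // current logical position in the file
        uint32_t block_size;  // size of one block
        bool     writing;     // pending write data sitting in the cache?
    };

    static bool seek_can_keep_cache(const struct file *f, uint32_t newpos) {
        if (f->writing) {
            return false;  // must flush, or the pending write data is lost
        }
        // same block index for the old and new position -> cache still valid
        return f->pos / f->block_size == newpos / f->block_size;
    }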
Interestingly this was introduced by two different PRs which were not tested
together until pre-release testing:
- Fix lfs_file_seek doesn't update cache properties correctly
- Fix compiler warnings when LFS_READONLY defined
BUG CASE: Assume there are 6 blocks in littlefs, and blocks 0,1,2,3 are
already allocated. Block 0 has a tail pair of {2, 3}. Now we try to write
more into 0.
When writing to block 0, we will split (FIRST SPLIT), thus allocating
blocks 4 and 5. Up to now, everything is as expected.
Then we will try to commit in block 4, during which a split (SECOND SPLIT)
is triggered again (in our case, some files are large, some are small, so
one split may not be enough). Still as expected.
The BUG happens when we try to alloc a new block pair for the second split:
as the lookahead buffer reaches the end, a new lookahead buffer is generated
from flash content, and blocks 4 and 5 show up as unused blocks in the new
lookahead buffer because they have not been programmed yet. HOWEVER, blocks
4 and 5 should already be occupied by the first split! The result is that
blocks 4 and 5 are allocated again (this is where things go wrong).
Commit ce2c01f introduced this bug. In that commit, an lfs_alloc_ack was
inserted in lfs_dir_split, which causes split to reset lfs->free.ack to the
block count.
In summary, this problem exists after 2.1.3.
Solution: don't call lfs_alloc_ack in lfs_dir_split.
lfs_dir_traverse is a bit unpleasant in that it is inherently a
recursive function, but with a strict bound of 4 calls (commit -> filter ->
move -> filter), and efforts to unroll the recursion come at a
significant code cost.
It turns out the best solution I've found so far is to simply create an
explicit stack with an explicit bound of 4 calls (or more accurately,
3 pushed frames).
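As a generic illustration of the approach (not lfs_dir_traverse itself), a
recursion with a known maximum depth can be flattened into a loop over a
small fixed-size array of frames; the per-level "work" and the condition
that spawns a nested level below are placeholders:
    #define MAX_DEPTH 4  // commit -> filter -> move -> filter

    struct frame {
        int i;      // loop index to resume from in the caller
        int depth;  // caller's depth
    };

    // placeholder: does "tag" i at this depth spawn a nested traverse?
    static int spawns_nested(int depth, int i) {
        return i == 1 && depth < MAX_DEPTH - 1;
    }

    static int traverse_iterative(void) {
        struct frame stack[MAX_DEPTH - 1];  // at most 3 pushed frames
        int sp = 0;
        int depth = 0;
        int i = 0;
        int visited = 0;

        while (1) {
            if (i < 3) {  // placeholder: each level walks 3 "tags"
                visited++;  // placeholder for the per-tag work
                if (spawns_nested(depth, i)) {
                    // save the caller's loop position, then "recurse"
                    stack[sp].i = i + 1;
                    stack[sp].depth = depth;
                    sp++;
                    depth++;
                    i = 0;
                    continue;
                }
                i++;
                continue;
            }
            // this level's loop is finished; "return" to the caller
            if (sp == 0) {
                return visited;  // the outermost level is done
            }
            sp--;
            i = stack[sp].i;
            depth = stack[sp].depth;
        }
    }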
---
This actually highlights one of the bigger flaws in littlefs right now,
which is that this function, lfs_dir_traverse, takes O(n^2) disk reads
to traverse.
Note that LFS_FROM_MOVE can only occur once per commit, which is why
this code is O(n^2) and not O(n^4).
This mostly just required separate functions for "lfs_file_rawwrite" and
"lfs_file_flushedwrite", since lfs_file_flush recursively invokes
lfs_file_rawread and lfs_file_rawwrite.
This comes at a code cost, but gives us bounded and measurable RAM usage
on this code path.
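A generic sketch of that restructuring with placeholder names and no
arguments (the real functions take file and buffer parameters): flush only
ever calls the "flushed" variants, so the write -> flush -> write recursion
can no longer occur:
    // assumes caches have already been flushed when called
    static int flushedwrite(void) {
        // ... program data through the write cache ...
        return 0;
    }

    static int flush(void) {
        // flush may need to write back cached data, but it calls the
        // flushed variant directly instead of recursing into rawwrite
        return flushedwrite();
    }

    static int rawwrite(void) {
        int err = flush();  // drop any read state before writing
        if (err) {
            return err;
        }
        return flushedwrite();
    }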
lfs_dir_commit originally relied heavily on tail-recursion, though at
least one path (through relocations) was not tail-recursive, and could
cause unbounded stack usage in extreme cases of bad blocks. (Keep in
mind even extreme cases of bad blocks should be in scope for littlefs).
In order to remove recursion from this code path, several changes were
required:
- The lfs_dir_compact logic had to be somewhat inverted. Instead of
first compacting and then resolving issues such as relocations and
orphans, the overarching lfs_dir_commit now contains a state-machine
which after committing or compacting handles the extra changes to the
filesystem in a single, non-recursive loop
- Instead of fixing all relocations recursively, >1 relocation requires
deferring to a full deorphan step. This step is unfortunately an
additional n^2 process. It also required some changes to lfs_deorphan
in order to ignore intentional orphans created as an intermediary in
lfs_mkdir. Maybe in the future we should remove these.
- Tail recursion normally found in lfs_fs_deorphan had to be rewritten
as a loop which restarts any time a new commit causes a relocation.
This does show that the algorithm may not terminate, but only if every
block is bad, which will eventually cause littlefs to run out of
blocks to write to.
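The last point, as a rough self-contained sketch with a placeholder helper
(not lfs_fs_deorphan itself): the tail recursion becomes a loop that simply
starts over whenever a commit relocates a block:
    #include <stdbool.h>

    // placeholder: commit fixes for mdir i, returns true if it relocated
    static bool commit_mdir(int i) {
        (void)i;
        return false;
    }

    static void deorphan_loop(int mdir_count) {
        bool again = true;
        while (again) {
            again = false;
            for (int i = 0; i < mdir_count; i++) {
                if (commit_mdir(i)) {
                    // a relocation invalidates what we've already walked,
                    // so restart the scan instead of recursing
                    again = true;
                    break;
                }
            }
        }
    }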
When the on-disk block count doesn't match the configured block count, it
is possible to get file corruption. For instance, if the number of blocks
was 0x200 and we re-mount with 0x100, files could be corrupted.
If we re-mount with a larger number of blocks things should be safer, but
this could be handled with a resize option or perhaps a mount flag to
ignore this parameter.
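A hedged sketch of the kind of check this implies, with placeholder types
rather than littlefs's actual mount code: compare the superblock's stored
geometry against the configuration and refuse to mount on a mismatch:
    #include <stdint.h>

    struct superblock { uint32_t block_count; };
    struct config     { uint32_t block_count; };

    // returns 0 if the geometries agree, a negative error otherwise
    static int check_block_count(const struct superblock *sb,
                                 const struct config *cfg) {
        if (sb->block_count != cfg->block_count) {
            // mounting with fewer blocks risks corrupting files allocated
            // past the new end; fail unless the user explicitly resizes
            return -1;
        }
        return 0;
    }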
lfs_mdir_isopen goes unused if asserts are disabled, and this caused an
"unused function" warning on Clang (curiously not on GCC since the
function was static inline, commonly used for header-only functions).
Also removed "inline" from the lfs_mdir_* functions as these involve
linked-list operations and really shouldn't be inlined. And since they
are static, inlining should occur automatically if there is a benefit.
Found by dpgeorge