This was caused by the new lfs_file_rawseek optimization that can skip
flushing when calculated file->pos is unchanged combined with an
implicit expectation in lfs_file_truncate that lfs_file_rawseek
unconditionally sets file->pos.
Because of this assumption, lfs_file_truncate could leave file->pos in
an outdated state while changing the internal file metadata. Humorously,
this was always gauranteed to trigger the skip in lfs_file_rawseek when
we try to restore the file->pos, leaving the file->cache used to do the
CTZ skip-list lookup in a potentially bad state.
The easiest fix is to just update file->pos correctly. Note we don't
want to explicitly flush since we can leverage the same noop
optimization if we truncate to the file position. Which I've added a
test for.
Mostly taken from .travis.yml, biggest changes were around how to get
the status updates to work.
We can't use a token on PRs the same way we could in Travis, so instead
we use a second workflow that checks every pull request for "status"
artifacts, and create the actual statuses in the "workflow_run" event,
where we have full access to repo secrets.
After a bit of tweaking in 9dde5c7 to write out all superblocks
during lfs_format, additional writes were added after the sanity
checking normally done at the end.
This turned out to be a problem when porting littlefs, as it makes it
easy for addressing issues to not get caught during lfs_format.
Found by marekr, tristanclare94, and mjs513
We have seen poor read performance on NAND flashes with 128kB blocks.
The root cause is inline files having to traverse many sets of metadata
pairs inside the current block before being fully reconstructed. Simply
disabling inline files is not enough, as the metadata will still fill up
the block and eventually need to be compacted.
By allowing configuration of how much size metadata takes up, along with
limiting (or disabling) inline file size, we achieve read performance
improvements on an order of magnitude.
This indirectly solves an issue with lfs_file_rawclose asserting
when lfs_file_opencfg errors since appending to the mlist occurs
after open. It also may speed up some of the internal operations such as
the lfs_file_write used to resolve unflushed data.
The idea behind adopting mlist over flags is that realistically it's
unlikely for the user to open a significant number of files (enough for
big O to kick in). That being said, moving the mlist asserts into the
API wrappers does protect some of the internal operations from scaling
based on the number of open files.
This removes quite a bit of extra code needed to entertwine the
LFS_TRACE calls into the original funcions.
Also changed temporary return type to match API declaration where
necessary.
- Stayed on non-system include for lfs_util.h for now
- Named internal functions "lfs_functionraw"
- Merged lfs_fs_traverseraw
- Added LFS_LOCK/UNLOCK macros
- Changed LFS_THREADSAFE from 1/0 to defined/undefined to
match LFS_READONLY
- expand functions
- add comment
- rename functions
- fix locking issue in format and mount
- use global include
- fix ac6 linker issue
- use the global config file
- address review comments
- minor cleanup
- minor cleanup
- review comments
- undef unavailable function declarations altogether
- even less code, assert on write attempts
- remove LFS_O_WRONLY and other flags when compiling with LFS_READONLY
- do not annotate #endif, as requested
- move ifdef before comments blocks, rework dangling opening bracket
- ifdef file flags that are not needed in read-only mode
- slight refactor
- ifdef LFS_F_ERRED out as well
Previously we only bumped the revision count if an eviction would occur
immediately (and possibly corrupt littlefs). This works, but does risk
an unoptimal superblock size if an almost-exhausted superblock was
allocated during lfs_format.
As pointed out by tim-nordell-nimbelink, we can align the revision count
to maximize the number of block cycles without breaking the existing
requirements of increasing revision counts.
As an added benefit, littlefs's wear-leveling should behave more
consistently after this change.
This bug was exposed by the bad-block tests due to changes to block
allocation, but could have been hit before these changes.
In flash, when blocks fail, they don't fail in a predictable manner. To
account for this, the bad-block tests check a number of failure
behaviors. The interesting one here is "LFS_TESTBD_BADBLOCK_ERASENOOP",
in which bad blocks can not be erased or programmed, and are stuck with
the data written at the time the blocks go bad.
This is actually a pretty realistic failure behavior, since flash needs a
large voltage to force the electrons of the floating gates. Though
realistically, such a failure would like corrupt the data a bit, not leave the
underlying data perfectly intact.
LFS_TESTBD_BADBLOCK_ERASENOOP is rather interesting to test for because it
means bad blocks can end up with perfectly valid CRCs after a failed write,
confusing littlefs.
---
In this case, we had the perfect series of operations such that a test
was repeatedly writing the same sequence of metadata commits to the same
block, which eventually goes bad, leaving the block stuck with metadata
that occurs later in the sequence.
What this means is that after the first commit, the metadata block
contained both the first and second commits, even though the loop in the
test hadn't reached that point yet.
expected actual
.----------. .----------.
| commit 1 | | commit 1 |
| crc 1 | | crc 1 |
| | | commit 2 <-- (from previous iteration)
| | | crc 2 |
'----------' '----------'
To protect against this, littlefs normally compares the written CRC
against the expected CRC, but because this was the exact same data that
it was going to write, this CRCs end up the same.
Ah! But doesn't littlefs also encode the state of the next page to keep
track of if the next page has been erased or not? Wouldn't that change
between iterations?
It does! In a single bit in the CRC-tag. But thanks to some incorrect
logic attempting to avoid an extra condition in the loop for writing out
padding commits, the CRC that littlefs checked against was the CRC
immediately before we include the "is-next-page-erased" bit.
Changing the verification check to use the same CRC as what is used to
verify commits on fetch solves this problem.
This rather interesting corner-case arises in lfs_dir_alloc anytime the
uninitialized revision count happens to be a multiple of block_cycles+1.
For example, the source of the bug found by tim-nordell-nimbelink:
rev = 2742492087
block_cycles = 100
2742492087 % (100+1) = 0
The reason for this weird block_cycles+1 case is due to a fix for a
previous bug in fe957de. To avoid aliasing, which would cause metadata
pairs to wear unevenly, block_cycles incremented to the next odd number.
Normally, littlefs tweaks the revision count of blocks during
lfs_dir_alloc in order to make sure evictions can't happen on the first
compact. Otherwise, higher-level logic such as lfs_format would break.
However, this wasn't updated with the aliasing fix in fe957de, so
lfs_dir_alloc was only rounding the revision count to the nearest even
number.
The current fix is to change the logic in lfs_dir_alloc to explicitly
check for the eviction condition and increment if eviction would occur.
Found by tim-nordell-nimbelink
As noted by gtaska, we are sitting on a better hash-combining function
than xor: CRC. Previous issues with xor were solvable, but relying on
xor for this isn't really worth the risk when we already have a CRC
function readily available.
To quote a study found by gtaska:
https://michiel.buddingh.eu/distribution-of-hash-values
> CRC32 seems to score really well, but its graph is skewed by the results
> of Dataset 5 (binary numbers), which may or may not be too synthetic to
> be considered a fair benchmark. But even if you substract the results
> from that test, it does not fare significantly worse than other,
> cryptographic hash functions.
On first read, randomizing the allocators offset may seem appropriate
for lfs_alloc_reset. However, it ends up using the filesystem-fed
pseudorandom seed in situations it wasn't designed for.
As noted by gtaska, the combination of using xors for feeding the seed
and multiple traverses of the same CRCs can cause the seed to flip to
zeros with concerning frequency.
Removed the randomization from lfs_alloc_reset, leaving it in only
lfs_mount.
Found by gtaska
Modulus of the offset by block_size was clearly a typo, and should be
block_count. Interesting to note that later moduluses during alloc
calculations prevents this from breaking anything, but as gtaska notes it
could skew the wear-leveling distribution.
Found by guiserle and gtaska
- Standardized littlefs debug statements to use hex prefixes and
brackets for printing pairs.
- Removed the entry behavior for readtree and made -t the default.
This is because 1. the CTZ skip-list parsing was broken, which is not
surprising, and 2. the entry parsing was more complicated than useful.
This functionality may be better implemented as a proper filesystem
read script, complete with directory tree dumping.
- Changed test.py's --gdb argument to take [init, main, assert],
this matches the names of the stages in C's startup.
- Added printing of tail to all mdir dumps in readtree/readmdir.
- Added a print for if any mdirs are corrupted in readtree.
- Added debug script side-effects to .gitignore.
These two features have been much requested by users, and have even had
several PRs proposed to fix these in several cases. Before this, these
error conditions usually were caught by internal asserts, however
asserts prevented users from implementing their own workarounds.
It's taken me a while to provide/accept a useful recovery mechanism
(returning LFS_ERR_CORRUPT instead of asserting) because my original thinking
was that these error conditions only occur due to bugs in the filesystem, and
these bugs should be fixed properly.
While I still think this is mostly true, the point has been made clear
that being able to recover from these conditions is definitely worth the
code cost. Hopefully this new behaviour helps the longevity of devices
even if the storage code fails.
Another, less important, reason I didn't want to accept fixes for these
situations was the lack of tests that prove the code's value. This has
been fixed with the new testing framework thanks to the additional of
"internal tests" which can call C static functions and really take
advantage of the internal information of the filesystem.
It's very slowly to compare one byte at one time. Here are the
performance I get from 128M spinand with NFTL by sequential writing.
| file size | buffer size | write speed |
| 10 MB | 0 B | 3206.01 KB/s |
| 10 MB | 1 B | 2434.04 KB/s |
| 10 MB | 2 B | 2685.78 KB/s |
| 10 MB | 4 B | 2857.94 KB/s |
| 10 MB | 8 B | 3060.68 KB/s |
| 10 MB | 16 B | 3155.30 KB/s |
| 10 MB | 64 B | 3193.68 KB/s |
| 10 MB | 128 B | 3230.62 KB/s |
| 10 MB | 256 B | 3153.03 KB/s |
| 70 MB | 0 B | 2258.87 KB/s |
| 70 MB | 1 B | 1827.83 KB/s |
| 70 MB | 2 B | 1962.29 KB/s |
| 70 MB | 4 B | 2074.01 KB/s |
| 70 MB | 8 B | 2147.03 KB/s |
| 70 MB | 64 B | 2179.92 KB/s |
| 70 MB | 256 B | 2179.96 KB/s |
The 0 Byte size means no validation and the 1 Byte size is how
littlefs do before. Based on the above table and to save memory,
comparing 8 bytes at one time is more wonderful.
Signed-off-by: WeiXiong Liao <liaoweixiong@allwinnertech.com>
- Added caching to Travis install dirs, because otherwise
pip3 install fails randomly
- Increased size of littlefs-fuse disk because test script has
a larger footprint now
- Skip a couple of reentrant tests under byte-level writes because
the tests just take too long and cause Travis to bail due to no
output for 10m
- Fixed various Valgrind errors
- Suppressed uninit checks for tests where LFS_BLOCK_ERASE_VALUE == -1.
In this case rambd goes uninitialized, which is fine for rambd's
purposes. Note I couldn't figure out how to limit this suppression
to only the malloc in rambd, this doesn't seem possible with Valgrind.
- Fixed memory leaks in exhaustion tests
- Fixed off-by-1 string null-terminator issue in paths tests
- Fixed lfs_file_sync issue caused by revealed by fixing memory leaks
in exhaustion tests. Getting ENOSPC during a file write puts the file
in a bad state where littlefs doesn't know how to write it out safely.
In this case, lfs_file_sync and lfs_file_close return 0 without
writing out state so that device-side resources can still be cleaned
up. To recover from ENOSPC, the file needs to be reopened and the
writes recreated. Not sure if there is a better way to handle this.
- Added some quality-of-life improvements to Valgrind testing
- Fit Valgrind messages into truncated output when not in verbose mode
- Turned on origin tracking
This is a bit of a strange case that can be caused by storage with
very large prog sizes, such as NAND flash. We only have 10 bits to store
the size of our padding, so when the prog_size gets larger than 1024
bytes, we have to use multiple padding tags to commit to the next
prog_size boundary.
This causes some complication for the new logic that checks CRCs in case
our block becomes "readonly" and contains existing commits that just happen
to match our new commit size.
Here we just check the CRC of the first commit. This isn't perfect but
does protect against pure "readonly" blocks.
These should probably have been cleaned up in each commit to allow
cherry-picking, but due to time I haven't been able to.
- Went with creating an mdir copy in lfs_dir_commit. This handles a
number of related cleanup issues in lfs_dir_compact and it does so
more robustly. As a plus we can use the copy to update dependencies
in the mlist.
- Eliminated code left by the ENOSPC file outlining
- Cleaned up TODOs and lingering comments
- Changed the reentrant many directory create/rename/remove test to use
a smaller set of directories because of space issues when
READ/PROG_SIZE=512
This was caused by the previous fix for allocations during
lfs_fs_deorphan in this branch. To catch half-orphans during block
allocations we needed to duplicate all metadata-pairs reported to
lfs_fs_traverse. Unfortunately this causes lfs_fs_size to report 2x the
number of metadata-pairs, which would undoubtably confuse users.
The fix here is inelegantly simple, just do a different traversale for
allocations and size measurements. It reuses the same code but touches
slightly different sets of blocks.
Unfortunately, this causes the public lfs_fs_traverse and lfs_fs_size
functions to split in how they report blocks. This is technically
allowed, since lfs_fs_traverse may report blocks multiple times due to
CoW behavior, however it's undesirable and I'm sure there will be some
confusion.
But I don't have a better solution, so from this point lfs_fs_traverse
will be reporting 2x metadata-blocks and shouldn't be used for finding
the number of available blocks on the filesystem.
This was an interesting issue found during a GitHub discussion with
rmollway and thrasher8390.
Blocks in the metadata-pair are relocated every "block_cycles", or, more
mathy, when rev % block_cycles == 0 as long as rev += 1 every block write.
But there's a problem, rev isn't += 1 every block write. There are two
blocks in a metadata-pair, so looking at it from each blocks
perspective, rev += 2 every block write.
This leads to a sort of aliasing issue, where, if block_cycles is
divisible by 2, one block in the metadata-pair is always relocated, and
the other block is _never_ relocated. Causing a complete failure of
block-level wear-leveling.
Fortunately, because of a previous workaround to avoid block_cycles = 1
(since this will cause the relocation algorithm to never terminate), the
actual math is rev % (block_cycles+1) == 0. This means the bug only
shows its head in the much less likely case where block_cycles is a
multiple of 2 plus 1, or, in more mathy terms, block_cycles = 2n+1 for
some n.
To workaround this we can bitwise or our block_cycles with 1 to force it
to never be a multiple of 2n.
(Maybe we should do this during initialization? But then block_cycles
would need to be mutable.)
---
There's a few unrelated changes mixed into this commit that shouldn't be
there since I added this as part of a branch of bug fixes I'm putting
together rather hastily, so unfortunately this is not easily cherry-pickable.
This was initially added as protection against the case where a file
grew to no longer fit in a metadata-pair. While in most cases this
should be caught by the math in lfs_file_write, it doesn't handle a
problem that can happen if the files metadata is large enough that even
small inline files can't fit. This can happen if you combine a small
block size with large file names and many custom attributes.
But trying to outline on ENOSPC creates creates a lot of problems.
If we are actually low on space, this is one of the worst things we can
do. Inline files take up less space than CTZ skip-lists, but inline
files are rendered useless if we outline inline files as soon as we run
low on space.
On top of this, the outlining logic tries multiple mdir commits if it
gets ENOSPC, which can hide errors if ENOSPC is returned for other
reasons.
In a perfect world, we would be using a different error code for
no-room-in-metadata-pair, and no-blocks-on-disk.
For now I've removed the outlining logic and we will need to figure out
how to handle this situation more robustly.
It's interesting how many ways block devices can show failed writes:
1. prog can error
2. erase can error
3. read can error after writing (ECC failure)
4. prog doesn't error but doesn't write the data correctly
5. erase doesn't error but doesn't erase correctly
Can read fail without an error? Yes, though this appears the same as
prog and erase failing.
These weren't all simulated by testbd since I unintentionally assumed
the block device could always error. Fixed by added additional bad-black
behaviors to testbd.
Note: This also includes a small fix where we can miss bad writes if the
underlying block device contains a valid commit with the exact same
size in the exact same offset.