mirror of
https://github.com/littlefs-project/littlefs.git
synced 2025-12-01 12:20:02 +00:00
Updated SPEC.md and DESIGN.md based on recent changes
- Added math behind CTZ limits
- Added documentation over atomic moves
250
DESIGN.md
@ -200,7 +200,7 @@ Now we could just leave files here, copying the entire file on write
|
||||
provides the synchronization without the duplicated memory requirements
|
||||
of the metadata blocks. However, we can do a bit better.
|
||||
|
||||
## CTZ linked-lists
|
||||
## CTZ skip-lists
|
||||
|
||||
There are many different data structures for representing the actual
|
||||
files in filesystems. Of these, the littlefs uses a rather unique [COW](https://upload.wikimedia.org/wikipedia/commons/0/0c/Cow_female_black_white.jpg)
|
||||
@ -246,19 +246,19 @@ runtime to just _read_ a file? That's awful. Keep in mind reading files are
|
||||
usually the most common filesystem operation.
|
||||
|
||||
To avoid this problem, the littlefs uses a multilayered linked-list. For
|
||||
every block that is divisible by a power of two, the block contains an
|
||||
additional pointer that points back by that power of two. Another way of
|
||||
thinking about this design is that there are actually many linked-lists
|
||||
threaded together, with each linked-lists skipping an increasing number
|
||||
of blocks. If you're familiar with data-structures, you may have also
|
||||
recognized that this is a deterministic skip-list.
|
||||
every nth block where n is divisible by 2^x, the block contains a pointer
|
||||
to block n-2^x. So each block contains anywhere from 1 to log2(n) pointers
|
||||
that skip to various sections of the preceding list. If you're familiar with
|
||||
data-structures, you may have recognized that this is a type of deterministic
|
||||
skip-list.
|
||||
|
||||
To find the power of two factors efficiently, we can use the instruction
|
||||
[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros),
|
||||
which is where this linked-list's name comes from.
|
||||
The name comes from the use of the
|
||||
[count trailing zeros (CTZ)](https://en.wikipedia.org/wiki/Count_trailing_zeros)
|
||||
instruction, which allows us to calculate the power-of-two factors efficiently.
|
||||
For a given block n, the block contains ctz(n)+1 pointers.
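As a rough illustration of the rule above, here is a minimal Python sketch that lists the blocks a given block points back to. The function names are hypothetical, and treating block 0 as having no pointers is an assumption taken from Exhibit C rather than something the text states outright.

```python
def ctz(n: int) -> int:
    """Count the trailing zero bits of a positive integer."""
    if n <= 0:
        raise ValueError("ctz is only defined here for n > 0")
    # n & -n isolates the lowest set bit; its position is the answer
    return (n & -n).bit_length() - 1


def skip_pointers(n: int) -> list[int]:
    """Indices of the blocks that block n points back to.

    Assumption: block 0 is the start of the file and stores no pointers.
    """
    if n == 0:
        return []
    # one pointer for every power of two 2^x that divides n
    return [n - (1 << x) for x in range(ctz(n) + 1)]
```

For example, `skip_pointers(4)` gives `[3, 2, 0]`, while an odd block such as 5 only points to its immediate predecessor.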
|
||||
|
||||
```
|
||||
Exhibit C: A backwards CTZ linked-list
|
||||
Exhibit C: A backwards CTZ skip-list
|
||||
.--------. .--------. .--------. .--------. .--------. .--------.
|
||||
| data 0 |<-| data 1 |<-| data 2 |<-| data 3 |<-| data 4 |<-| data 5 |
|
||||
| |<-| |--| |<-| |--| | | |
|
||||
@ -266,6 +266,9 @@ Exhibit C: A backwards CTZ linked-list
|
||||
'--------' '--------' '--------' '--------' '--------' '--------'
|
||||
```
|
||||
|
||||
The additional pointers allow us to navigate the data-structure on disk
|
||||
much more efficiently than in a single linked-list.
|
||||
|
||||
Taking exhibit C for example, here is the path from data block 5 to data
|
||||
block 1. You can see how data block 3 was completely skipped:
|
||||
```
|
||||
@ -285,15 +288,57 @@ The path to data block 0 is even more quick, requiring only two jumps:
|
||||
'--------' '--------' '--------' '--------' '--------' '--------'
|
||||
```
|
||||
|
||||
The CTZ linked-list has quite a few interesting properties. All of the pointers
|
||||
in the block can be found by just knowing the index in the list of the current
|
||||
block, and, with a bit of math, the amortized overhead for the linked-list is
|
||||
only two pointers per block. Most importantly, the CTZ linked-list has a
|
||||
worst case lookup runtime of O(logn), which brings the runtime of reading a
|
||||
file down to O(n logn). Given that the constant runtime is divided by the
|
||||
amount of data we can store in a block, this is pretty reasonable.
|
||||
We can find the runtime complexity by looking at the path to any block from
|
||||
the block containing the most pointers. Every step along the path divides
|
||||
the search space for the block in half. This gives us a runtime of O(log n).
|
||||
To get to the block with the most pointers, we can perform the same steps
|
||||
backwards, which keeps the asymptotic runtime at O(log n). The interesting
|
||||
part about this data structure is that this optimal path occurs naturally
|
||||
if we greedily choose the pointer that covers the most distance without passing
|
||||
our target block.
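The greedy choice described above can be sketched in a few lines of Python. This is only a model of the traversal over block indices, not littlefs's actual lookup code, and `path_to` is a hypothetical name.

```python
def ctz(n: int) -> int:
    """Count the trailing zero bits of a positive integer."""
    return (n & -n).bit_length() - 1


def path_to(current: int, target: int) -> list[int]:
    """Block indices visited when walking back from current to target."""
    if not 0 <= target <= current:
        raise ValueError("target must lie at or before the current block")
    path = [current]
    while current > target:
        # Largest skip stored in this block (bounded by ctz) that does not
        # jump past the target (bounded by the remaining distance).
        x = min(ctz(current), (current - target).bit_length() - 1)
        current -= 1 << x
        path.append(current)
    return path
```

On the list in exhibit C, `path_to(5, 1)` visits blocks 5, 4, 2, 1 and `path_to(5, 0)` visits 5, 4, 0.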
|
||||
|
||||
Here is what it might look like to update a file stored with a CTZ linked-list:
|
||||
So now we have a representation of files that can be appended trivially with
a runtime of O(1), and can be read with a worst case runtime of O(n logn).
Given that the runtime is also divided by the amount of data we can store
in a block, this is pretty reasonable.
|
||||
|
||||
Unfortunately, the CTZ skip-list comes with a few questions that aren't
|
||||
straightforward to answer. What is the overhead? How do we handle more
|
||||
pointers than we can store in a block?
|
||||
|
||||
One way to find the overhead per block is to look at the data structure as
|
||||
multiple layers of linked-lists. Each linked-list skips twice as many blocks
|
||||
as the previous linked-list. Or another way of looking at it is that each
|
||||
linked-list uses half as much storage per block as the previous linked-list.
|
||||
As we approach infinity, the number of pointers per block forms a geometric
|
||||
series. Solving this geometric series gives us an average of only 2 pointers
|
||||
per block.
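One way to write this series out, using the fact that block i holds ctz(i)+1 pointers and that a fraction 1/2^k of all blocks have an index divisible by 2^k (a sketch of the argument, not necessarily the exact form of the original equation):

```math
\lim_{n\to\infty}\frac{1}{n}\sum_{i=1}^{n}\left(\operatorname{ctz}(i)+1\right)
  = \sum_{k=0}^{\infty}\frac{1}{2^k}
  = 2
```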
|
||||
|
||||

|
||||
|
||||
Finding the maximum number of pointers in a block is a bit more complicated,
|
||||
but since our file size is limited by the integer width we use to store the
|
||||
size, we can solve for it. Setting the overhead of the maximum pointers equal
|
||||
to the block size we get the following equation. Note that a smaller block size
|
||||
results in more pointers, and a larger word width results in larger pointers.
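A form of the equation consistent with this description is sketched below. It is a reconstruction under stated assumptions, not a quotation: each pointer takes w/8 bytes, a file holds at most 2^w bytes, and each block is assumed to carry about B - 2(w/8) bytes of data after the average two-pointer overhead. With these assumptions the equation reproduces the 104 and 448 byte minimums listed below.

```math
B = \frac{w}{8}\left\lceil \log_2\left(\frac{2^w}{B - 2\frac{w}{8}}\right)\right\rceil
```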
|
||||
|
||||

|
||||
|
||||
where:
|
||||
B = block size in bytes
|
||||
w = word width in bits
|
||||
|
||||
Solving the equation for B gives us the minimum block size for various word
|
||||
widths:
|
||||
32 bit CTZ skip-list = minimum block size of 104 bytes
|
||||
64 bit CTZ skip-list = minimum block size of 448 bytes
|
||||
|
||||
Since littlefs uses a 32 bit word size, we are limited to a minimum block
size of 104 bytes. This is a perfectly reasonable minimum block size, with most
block sizes starting around 512 bytes. So we can skip the additional logic
needed to avoid overflowing our block's capacity in the CTZ skip-list.
|
||||
|
||||
Here is what it might look like to update a file stored with a CTZ skip-list:
|
||||
```
|
||||
block 1 block 2
|
||||
.---------.---------.
|
||||
@ -367,7 +412,7 @@ v
|
||||
## Block allocation
|
||||
|
||||
So those two ideas provide the grounds for the filesystem. The metadata pairs
|
||||
give us directories, and the CTZ linked-lists give us files. But this leaves
|
||||
give us directories, and the CTZ skip-lists give us files. But this leaves
|
||||
one big [elephant](https://upload.wikimedia.org/wikipedia/commons/3/37/African_Bush_Elephant.jpg)
|
||||
of a question. How do we get those blocks in the first place?
|
||||
|
||||
@ -653,9 +698,17 @@ deorphan step that simply iterates through every directory in the linked-list
|
||||
and checks it against every directory entry in the filesystem to see if it
|
||||
has a parent. The deorphan step occurs on the first block allocation after
|
||||
boot, so orphans should never cause the littlefs to run out of storage
|
||||
prematurely.
|
||||
prematurely. Note that the deorphan step never needs to run in a readonly
|
||||
filesystem.
|
||||
|
||||
And for my final trick, moving a directory:
|
||||
## The move problem
|
||||
|
||||
Now we have a real problem. How do we move things between directories while
remaining power resilient? Even looking at the problem from a high level,
it seems impossible. We can update a single directory block atomically, but
updating two independent directory blocks is not an atomic operation.
|
||||
|
||||
Here are the steps the filesystem may go through to move a directory:
|
||||
```
|
||||
.--------.
|
||||
|root dir|-.
|
||||
@ -716,18 +769,135 @@ v
|
||||
'--------'
|
||||
```
|
||||
|
||||
Note that once again we don't care about the ordering of directories in the
|
||||
linked-list, so we can simply leave directories in their old positions. This
|
||||
does make the diagrams a bit hard to draw, but the littlefs doesn't really
|
||||
care.
|
||||
We can leave any orphans up to the deorphan step to collect, but that doesn't
|
||||
help the case where dir A has both dir B and the root dir as parents if we
|
||||
lose power inconveniently.
|
||||
|
||||
It's also worth noting that once again we have an operation that isn't actually
|
||||
atomic. After we add directory A to directory B, we could lose power, leaving
|
||||
directory A as a part of both the root directory and directory B. However,
|
||||
there isn't anything inherent to the littlefs that prevents a directory from
|
||||
having multiple parents, so in this case, we just allow that to happen. Extra
|
||||
care is taken to only remove a directory from the linked-list if there are
|
||||
no parents left in the filesystem.
|
||||
Initially, you might think this is fine. Dir A _might_ end up with two parents,
but the filesystem will still work as intended. But then this raises the
question of what we do when dir A wears out. For other directory blocks
we can update the parent pointer, but for a dir with two parents we would need
to work out how to update both parents. And the check for multiple parents
would need to be carried out for every directory, even if the directory has
never been moved.
|
||||
|
||||
It also presents a bad user experience. Since the condition of ending up with
two parents is rare, it's unlikely user-level code will be prepared. Just think
about how a user would recover from a multi-parented directory. They can't just
remove one directory, since remove would report the directory as "not empty".
|
||||
|
||||
Other atomic filesystems simply COW the entire directory tree. But this
introduces a significant bit of complexity, which leads to increased code size,
along with a surprisingly expensive runtime cost during what most users assume
is a single pointer update.
|
||||
|
||||
Another option is to update the directory block we're moving from to point
|
||||
to the destination with a sort of predicate that we have moved if the
|
||||
destination exists. Unfortunately, the omnipresent concern of wear could
|
||||
cause any of these directory entries to change blocks, and changing the
|
||||
entry size before a move introduces complications if it spills out of
|
||||
the current directory block.
|
||||
|
||||
So how do we go about moving a directory atomically?
|
||||
|
||||
We rely on the improbability of power loss.
|
||||
|
||||
Power loss during a move is certainly possible, but it's actually relatively
|
||||
rare. Unless a device is writing to a filesystem constantly, it's unlikely that
|
||||
a power loss will occur during filesystem activity. We still need to handle
|
||||
the condition, but runtime during a power loss takes a back seat to the runtime
|
||||
during normal operations.
|
||||
|
||||
So what littlefs does is inelegantly simple. When littlefs moves a directory,
it first marks the directory entry as "moving". This is stored as a single bit
in the directory entry and doesn't take up much space. Then littlefs moves the
directory, finishing with the complete removal of the "moving" directory entry.
|
||||
|
||||
```
|
||||
.--------.
|
||||
|root dir|-.
|
||||
| pair 0 | |
|
||||
.--------| |-'
|
||||
| '--------'
|
||||
| .-' '-.
|
||||
| v v
|
||||
| .--------. .--------.
|
||||
'->| dir A |->| dir B |
|
||||
| pair 0 | | pair 0 |
|
||||
| | | |
|
||||
'--------' '--------'
|
||||
|
||||
| update root directory to mark directory A as moving
|
||||
v
|
||||
|
||||
.----------.
|
||||
|root dir |-.
|
||||
| pair 0 | |
|
||||
.-------| moving A!|-'
|
||||
| '----------'
|
||||
| .-' '-.
|
||||
| v v
|
||||
| .--------. .--------.
|
||||
'->| dir A |->| dir B |
|
||||
| pair 0 | | pair 0 |
|
||||
| | | |
|
||||
'--------' '--------'
|
||||
|
||||
| update directory B to point to directory A
|
||||
v
|
||||
|
||||
.----------.
|
||||
|root dir |-.
|
||||
| pair 0 | |
|
||||
.-------| moving A!|-'
|
||||
| '----------'
|
||||
| .-----' '-.
|
||||
| | v
|
||||
| | .--------.
|
||||
| | .->| dir B |
|
||||
| | | | pair 0 |
|
||||
| | | | |
|
||||
| | | '--------'
|
||||
| | .-------'
|
||||
| v v |
|
||||
| .--------. |
|
||||
'->| dir A |-'
|
||||
| pair 0 |
|
||||
| |
|
||||
'--------'
|
||||
|
||||
| update root to no longer contain directory A
|
||||
v
|
||||
.--------.
|
||||
|root dir|-.
|
||||
| pair 0 | |
|
||||
.----| |-'
|
||||
| '--------'
|
||||
| |
|
||||
| v
|
||||
| .--------.
|
||||
| .->| dir B |
|
||||
| | | pair 0 |
|
||||
| '--| |-.
|
||||
| '--------' |
|
||||
| | |
|
||||
| v |
|
||||
| .--------. |
|
||||
'--->| dir A |-'
|
||||
| pair 0 |
|
||||
| |
|
||||
'--------'
|
||||
```
|
||||
|
||||
Now, if we run into a directory entry that has been marked as "moved", one
of two things is possible. Either the directory entry exists elsewhere in the
filesystem, or it doesn't. Finding out which is an O(n) operation, but it only
occurs in the unlikely case we lost power during a move.
|
||||
|
||||
And we can easily fix the "moved" directory entry. Since we're already scanning
|
||||
the filesystem during the deorphan step, we can also check for moved entries.
|
||||
If we find one, we either remove the "moved" marking or remove the whole entry
|
||||
if it exists elsewhere in the filesystem.
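To make the recovery rule concrete, here is a toy in-memory model in Python. Everything in it (the `Entry` and `Directory` classes, `move`, `fix_moves`, and the `fail_after` parameter that simulates a power loss) is hypothetical and only mirrors the three steps in the diagram above; it says nothing about how littlefs lays this out on disk.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    target: str           # name of the directory this entry refers to
    moving: bool = False  # the single "moving"/"moved" bit


@dataclass
class Directory:
    entries: dict[str, Entry] = field(default_factory=dict)


def move(fs, src, name, dst, fail_after=None):
    """Move an entry in three individually atomic steps.

    fail_after=N simulates losing power right after step N.
    """
    entry = fs[src].entries[name]
    entry.moving = True                          # step 1: mark as moving
    if fail_after == 1:
        return
    fs[dst].entries[name] = Entry(entry.target)  # step 2: add to destination
    if fail_after == 2:
        return
    del fs[src].entries[name]                    # step 3: remove the source


def fix_moves(fs):
    """Resolve marked entries during a full scan of the filesystem."""
    for directory in fs.values():
        for name, entry in list(directory.entries.items()):
            if not entry.moving:
                continue
            exists_elsewhere = any(
                other is not entry and other.target == entry.target
                for other_dir in fs.values()
                for other in other_dir.entries.values()
            )
            if exists_elsewhere:
                del directory.entries[name]  # the move completed: drop it
            else:
                entry.moving = False         # the move never happened
```

If power is lost after step 1, `fix_moves` clears the mark and the directory stays where it was. If power is lost after step 2, `fix_moves` removes the stale source entry, so the directory ends up with exactly one parent either way.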
|
||||
|
||||
## Wear awareness
|
||||
|
||||
@ -955,18 +1125,18 @@ So, to summarize:
|
||||
|
||||
1. The littlefs is composed of directory blocks
|
||||
2. Each directory is a linked-list of metadata pairs
|
||||
3. These metadata pairs can be updated atomically by alternative which
|
||||
3. These metadata pairs can be updated atomically by alternating which
|
||||
metadata block is active
|
||||
4. Directory blocks contain either references to other directories or files
|
||||
5. Files are represented by copy-on-write CTZ linked-lists
|
||||
6. The CTZ linked-lists support appending in O(1) and reading in O(n logn)
|
||||
7. Blocks are allocated by scanning the filesystem for used blocks in a
|
||||
5. Files are represented by copy-on-write CTZ skip-lists which support O(1)
|
||||
append and O(n logn) reading
|
||||
6. Blocks are allocated by scanning the filesystem for used blocks in a
|
||||
fixed-size lookahead region that is stored in a bit-vector
|
||||
8. To facilitate scanning the filesystem, all directories are part of a
|
||||
7. To facilitate scanning the filesystem, all directories are part of a
|
||||
linked-list that is threaded through the entire filesystem
|
||||
9. If a block develops an error, the littlefs allocates a new block, and
|
||||
8. If a block develops an error, the littlefs allocates a new block, and
|
||||
moves the data and references of the old block to the new.
|
||||
10. Any case where an atomic operation is not possible, it is taken care of
|
||||
9. In any case where an atomic operation is not possible, mistakes are resolved
|
||||
by a deorphan step that occurs on the first allocation after boot
|
||||
|
||||
That's the little filesystem. Thanks for reading!
|
||||
|
||||
33
SPEC.md
@ -121,13 +121,18 @@ Here's the layout of entries on disk:
|
||||
**Entry type** - Type of the entry, currently this is limited to the following:
|
||||
- 0x11 - file entry
|
||||
- 0x22 - directory entry
|
||||
- 0xe2 - superblock entry
|
||||
- 0x2e - superblock entry
|
||||
|
||||
Additionally, the type is broken into two 4 bit nibbles, with the lower nibble
|
||||
Additionally, the type is broken into two 4 bit nibbles, with the upper nibble
|
||||
specifying the type's data structure used when scanning the filesystem. The
|
||||
upper nibble clarifies the type further when multiple entries share the same
|
||||
lower nibble clarifies the type further when multiple entries share the same
|
||||
data structure.
|
||||
|
||||
The highest bit is reserved for marking the entry as "moved". If an entry
|
||||
is marked as "moved", the entry may also exist somewhere else in the
|
||||
filesystem. If the entry exists elsewhere, this entry must be treated as
|
||||
though it does not exist.
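A small Python sketch of how the type byte might be split up follows. The mask values are assumptions inferred from the description above (highest bit for "moved", the rest of the upper nibble for the data structure, the lower nibble for the specific type); they are not quoted from the littlefs source, and `decode_entry_type` is a hypothetical helper.

```python
MOVED_BIT = 0x80    # highest bit: entry is marked as "moved"
STRUCT_MASK = 0x70  # assumed: remaining bits of the upper nibble
KIND_MASK = 0x0F    # lower nibble: distinguishes types sharing a structure


def decode_entry_type(raw: int) -> dict:
    """Split an 8-bit entry type into its assumed fields."""
    return {
        "moved": bool(raw & MOVED_BIT),
        "struct": (raw & STRUCT_MASK) >> 4,
        "kind": raw & KIND_MASK,
    }
```

Under these assumptions the directory entry (0x22) and the superblock entry (0x2e) decode to the same `struct` value and differ only in `kind`, and a moved directory entry would read as 0xa2.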
|
||||
|
||||
**Entry length** - Length in bytes of the entry-specific data. This does
|
||||
not include the entry type size, attributes, or name. The full size in bytes
|
||||
of the entry is 4 + entry length + attribute length + name length.
|
||||
@ -175,7 +180,7 @@ Here's the layout of the superblock entry:
|
||||
|
||||
| offset | size | description |
|
||||
|--------|------------------------|----------------------------------------|
|
||||
| 0x00 | 8 bits | entry type (0xe2 for superblock entry) |
|
||||
| 0x00 | 8 bits | entry type (0x2e for superblock entry) |
|
||||
| 0x01 | 8 bits | entry length (20 bytes) |
|
||||
| 0x02 | 8 bits | attribute length |
|
||||
| 0x03 | 8 bits | name length (8 bytes) |
|
||||
@ -208,7 +213,7 @@ Here's an example of a complete superblock:
|
||||
(32 bits) revision count = 3 (0x00000003)
|
||||
(32 bits) dir size = 52 bytes, end of dir (0x00000034)
|
||||
(64 bits) tail pointer = 3, 2 (0x00000003, 0x00000002)
|
||||
(8 bits) entry type = superblock (0xe2)
|
||||
(8 bits) entry type = superblock (0x2e)
|
||||
(8 bits) entry length = 20 bytes (0x14)
|
||||
(8 bits) attribute length = 0 bytes (0x00)
|
||||
(8 bits) name length = 8 bytes (0x08)
|
||||
@ -220,7 +225,7 @@ Here's an example of a complete superblock:
|
||||
(32 bits) crc = 0xc50b74fa
|
||||
|
||||
00000000: 03 00 00 00 34 00 00 00 03 00 00 00 02 00 00 00 ....4...........
|
||||
00000010: e2 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00 ................
|
||||
00000010: 2e 14 00 08 03 00 00 00 02 00 00 00 00 02 00 00 ................
|
||||
00000020: 00 04 00 00 01 00 01 00 6c 69 74 74 6c 65 66 73 ........littlefs
|
||||
00000030: fa 74 0b c5 .t..
|
||||
```
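As a cross-check of the example above, the following Python sketch unpacks the fields listed in the annotated dump (using the 0x2e superblock type) and applies the entry size rule of 4 + entry length + attribute length + name length. Little-endian byte order is read off the dump itself; `parse_superblock` and the dictionary keys are hypothetical names, the entry-specific data is left undecoded, and the CRC is read but not verified.

```python
import struct

# Bytes of the example superblock, copied from the hexdump
RAW = bytes.fromhex(
    "03000000 34000000 03000000 02000000"
    "2e140008 03000000 02000000 00020000"
    "00040000 01000100 6c697474 6c656673"
    "fa740bc5"
)


def parse_superblock(raw: bytes) -> dict:
    # Directory header: revision count, dir size, two-block tail pointer
    revision, dir_size, tail0, tail1 = struct.unpack_from("<IIII", raw, 0)
    # Entry header: type, entry length, attribute length, name length
    etype, elen, alen, nlen = struct.unpack_from("<BBBB", raw, 16)
    entry_size = 4 + elen + alen + nlen
    # The name follows the entry-specific data and the attributes
    name_off = 16 + 4 + elen + alen
    name = raw[name_off:name_off + nlen].decode("ascii")
    # The CRC sits immediately after the entry
    (crc,) = struct.unpack_from("<I", raw, 16 + entry_size)
    return {
        "revision": revision,
        "dir_size": dir_size,
        "tail": (tail0, tail1),
        "type": etype,
        "entry_size": entry_size,
        "name": name,
        "crc": crc,
    }
```

The 32-byte entry plus the 16-byte directory header and the 4-byte CRC adds up to the 52 bytes reported as the dir size.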
|
||||
@ -262,15 +267,19 @@ Here's an example of a directory entry:
|
||||
|
||||
Files are stored in entries with a pointer to the head of the file and the
|
||||
size of the file. This is enough information to determine the state of the
|
||||
CTZ linked-list that is being referenced.
|
||||
CTZ skip-list that is being referenced.
|
||||
|
||||
How files are actually stored on disk is a bit complicated. The full
|
||||
explanation of CTZ linked-lists can be found in [DESIGN.md](DESIGN.md#ctz-linked-lists).
|
||||
explanation of CTZ skip-lists can be found in [DESIGN.md](DESIGN.md#ctz-skip-lists).
|
||||
|
||||
A terribly quick summary: For every nth block where n is divisible by 2^x,
|
||||
the block contains a pointer that points x blocks towards the beginning of the
|
||||
file. These pointers are stored in order of x in each block of the file
|
||||
immediately before the data in the block.
|
||||
the block contains a pointer to block n-2^x. These pointers are stored in
|
||||
increasing order of x in each block of the file preceding the data in the
|
||||
block.
|
||||
|
||||
The maximum number of pointers in a block is bounded by the base-2 logarithm
of the maximum file size divided by the block size. With 32 bits for file size,
this results in a minimum block size of 104 bytes.
|
||||
|
||||
Here's the layout of a file entry:
|
||||
|
||||
@ -286,7 +295,7 @@ Here's the layout of a file entry:
|
||||
| 0xc+a | name length bytes | directory name |
|
||||
|
||||
**File head** - Pointer to the block that is the head of the file's CTZ
|
||||
linked-list.
|
||||
skip-list.
|
||||
|
||||
**File size** - Size of file in bytes.
|
||||
|
||||
|
||||