Hacker Bliss

A blog on programming languages. Mostly - Lisp, Erlang & Ruby.

Tuesday, June 23, 2009

Ext2 File System Internals and a look at storage

[this article is being constantly updated and is far from complete]
I've been thinking for a while now of writing a series of posts as I learn more about the internals of these beautiful things we call file systems. I will be dissecting a real file system in the process, for our purposes, the popular ext2 file system. This article assumes that you do understand the basics of ext2fs including how data is organized on disk and what the kernel data structures of both ext2fs and the VFS look like (you can read about this at depth in 'Understanding the Linux Kernel').

Not having found very detailed docs on how the code works, I decided to write about it and save other fellow mortals some trouble. No I'm selfish, I also did this to revise everything I've learnt, accounting for my volatile memory.

Inode functions (fs/ext2/inode.c)
1. ext2_readpage
For each open file, the kernel maintains a page cache (pointer of which is stored in the 'file' datastructure). Inorder to speed disk access to open files, the kernel divides files into uniform pages and caches the in the file's page cache. All file read/writes WILL go through the page cache first. The fsync system call flushes these to disk . When a read() request is made on a file, the kernel invokes the file system's readpage function to read actual data from disk and store it in the file's page. In doing so, it passes the file's structure and the page descriptor to readpage. The filesystem's readpage function then initiates disk i/o and copies the data into the page based on file's file->f_pos. Ofcourse readpage is file system specific, and the kernel can't do its job here, the file system has to.
The ext2fs readpage function uses mpage_readpage to help it read data from disk like so:
static int ext2_readpage(struct file *file, struct page *page)
{
return mpage_readpage(page, ext2_get_block);
}
What this does is, ext2_readpage tells mpage_readpage to fill 'page' with help from 'ext2_get_block' (explained below):
Basically this is how the whole process of reading and filling a page from disk blocks works:
1. read() is called on an open file.
2. kernel checks the page cache to see if data has already been read, if so the kernel simply copies this data in to the user-space buffer passed to read.
3. If its not there in the page cache, oops! the kernel needs the data and only the file system knows how to get it from disk. so the kernel calls the ext2fs specific readpage function.
4. read page calls takes the help of kernel's mpage_readpage which works like so: Say our block size is 1k, and page size is 4k. Which means when a 4k page is read for a file, 4 blocks have to be read and stored into the page. So, mpage_readpage calls ext2_get_block 4 times, each time passing it a new block number.

A little on block numbers.
The smallest unit of storage at the file system layer is a logical block. The size of the block is decided at the time of creation of the file system. This size is calculated based on the size of the entire block device. Typically 1KB for floppies and 4KB for large disks.
Based on the block size, each block has a block number, which is more precisely called the 'logical block address'. Its called logical because the actual physical block address is usually something different and is used by block device drives at the lower layers, but we don't care.

So mpage_readpage needs the help of ext2_get_block. It gives the function the logical block number relative to the beginning of the file. For example the first block in a file is 1, second 2, etc.. and based on this, ext2_get_block returns the logical block address of the actual block on disk. Once mpage_readpage gets this, it does the actual block read from the block device, and stores the data into one chunk of the page. if the page is not full yet (which could happen if block size is say 1k and page size is 4k), it makes another call to ext2_get_block and the whole cycle repeats. It does all this on behalf of the ext2_readpage function mentioned earlier which was supposed to do all this reading and copying in the first place. Beautiful. file system specific stuff here is only address translation. The rest of it is taken care of by the kernel! I love this.

2. ext2_block_to_path
As mentioned above, the kernel calls readpage which calls mpage_readpage which calls our get_block function who's job is to do the address translation. get_blocks is fairly large and is based on several different smaller functions. I will be discussing each of these as we go, some of them only briefly, starting with ext2_block_to_path
Its job is to translate the file's relative block.no into an array of offsets each of which helps ext2_get_blocks walk through the indirect blocks that ultimately lead to the real block on disk. Unfortunately its very hard to explain this structure of walking, but its its very easy to understand. In a nutshell, the inode structure on disk stores an array of block addresses each pointing to a real block on disk. The first 12 addresses are called direct addresses, they directly point to the first 12 blocks of the file. The 13th block contains the address of an indirect block, the contents of this indirect block can inturn be considered to be a huge array of logical block addresses. For example, if the block size if 4K, and logical block address size if 4bytes, then a single indirect block can store upto 1024 block addresses. If you're still with me, what this means is the 13th block of the inode array points to an indirect block which points to direct blocks. Lets start again, The first 12 elements directly address file Blocks 1-12. The next blocks 13 -> 12 + 1024 are addressed by the 13th block (called the first level of indirection). The 14th block in the inode array points to an indirect block, which points to another 1024 indirect blocks, each of which point to a 1024 real blocks. So the 14th block reference's file-block numbers 12 + 1024 to 12 + 1024*1024 blocks (called the second level of indirection), and the 15th block (the last one in our inode block array) address block numbers 12+1024*1024 -> 12 + 1024*1024*1024 (3rd and final level of indirection). Large files in the order of terabytes will easily fit into this structure.

The prototype of ext2_block_to_path:
staticint ext2_block_to_path(struct inode *inode, long i_block, int offsets[4], int *boundary)
inode is the vfs inode structure, i_block is the block number relative to the beginning of the file, offsets is an array into which have to store the offset into each array of indirection which we'll look at in a bit. boundary - I don't know what this is for, if you do, please drop me a line. :)

Instead of looking at the body of the function, lets look at what it returns, with simple examples:
If i_block = 10. then the function returns:
offset[0] = 10 and depth = 1

If i_block = 100, then function returns:
offset[0] = 12 (remember this means that it is the 13th element in the array - first level of indirection)
offset[1] = 88
and depth = 2

If i_block = 2000,
offset[0] = 13 (second level of indirection)
offset[1] = (2000 - 13) / 1024 = 1
offset[2] = (2000 - 13) % 1024 = 963
depth = 3

And it does all this using only 2 numbers- the block size and the block number. No disk reads or anything, simple math. The function then returns the depth as its value. Mind I haven't covered the code that does the shifting and masking to make the above math possible, that should be fairly easy to understand.

3. ext2_get_branch
This function builds on the offset and depth values returned by the above function, and is the next major function that ext2_get_blocks calls.
Its job is to describe a sequence of chains that lead to our direct and indirect blocks.
A chain is an array of Indirect structures, each element identifying one part of the chain. You can think of them in terms of arrows.
The Indirect structure contains 2 elements, 'key' - the logical block address of the direct or indirect block being addressed. and 'bh' - the buffer head descriptor identifying the indirect block that was read in to figure out the logical block address stored in key.
Actually I lied, it contains a third called 'p' which I wont describe until a bit, for sake of clarity.

This article is constantly being updated, stay tuned.

Saturday, February 28, 2009

Functions with Pattern Matching arguments in Common Lisp.

I recently stumbled upon a cool CL library that makes it possible to define functions with pattern-matching arguments (like how its done in Haskell/Erlang). This is particularly useful and elegant, there are numerous benefits to your Language supporting pattern matching function arguments. The most important according to me are conciseness, elegance and clarity. Slava articulates this very well here.

Why I love pattern matching:
- Pattern matching makes it very easy to write recursive programs. For example, for terminating the recursion, traditionally you'd have to wrap your terminating condition (for example, (equal list1 nil)) in a cond or if, and manually test it during every recursive call. That's something you'd hate doing if you've ever written programs in Erlang or Haskell.
- I frequently write programs that progressively destructure lists, and having to pull out the car of a car of a car, can be a pain sometimes. With pattern matching, I can have these pulled out and bound to variables - automatically.

Unfortunately, Common Lisp doesn't have such a notion built into it (some people talk about Destructuring bind. Please, it works only for macro parameters, not functions. and is very limited in what it can do (in this context)).

To see just how nice pattern matching is and why it would suck if your language didn't or couldn't have it, consider the classical factorial example in Common Lisp and its erlang equivalent:
Common Lisp:

(defun fact(n)
(cond ((equal n 0) 1)
((> n 0) (* n
(fact (- n 1))))))

Erlang:

fact(0) -> 1;
fact(N) -> N * fact(N - 1).

Erlang clearly wins, the code is much more readable and elegant than the common lisp version. Lisp with an evil grin, says, hold it right there! and pulls out a cool library from its arsenal called bpm. Its really cool in that its written entirely in Portable Common Lisp and is yet another example of how powerful lisp can be in extending itself. With bpm, you can now define your lisp functions with arguments that pattern match and bind to variables.. just like how you would in Erlang/Haskell.

Some examples in Common Lisp to show you how clean your code will look:
Factorial:

(def! fact (0) 1)
(def fact (_n) (* _n
(fact (- _n 1))))

Fibonacci of a number:

(def! fib (0) 0)
(def fib (1) 1)
(def fib (_n) (+ (fib (- _n 1))
(fib (- _n 2))))

It gets even better when you have to destructure lists recursively:
A function that adds all the elements of a list:

(def! myf1 () 0)
(def myf1 (_head . _tail)
(+ _head (myf1 _tail)))

Sweet! :)
 


Unique Visitors: web counter(c) Hacker Bliss