Lab 9 - File Descriptors

Task: My cat

Navigate to chapters/io/file-descriptors/drills/tasks/my-cat/support/src and check out my_cat.c. The goal is to implement the Linux command cat, which reads one or more files, concatenates them (hence the name cat), and prints them to standard output.

  1. Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.

Test 1: Comparing single file output..........................PASSED (+30 points)
Test 2: Comparing multiple files output.......................PASSED (+30 points)
Test 3: Testing empty file....................................PASSED (+30 points)
----------------------------------------
Final Score: 100/100 points
Good job!
----------------------------------------
  1. Implement rread() wrapper over read().

    The read() system call does not guarantee that it will read the requested number of bytes in a single call. This happens when the file does not have enough bytes left, or when read() is interrupted by a signal. rread() will handle these situations, ensuring that it reads either num_bytes or all available bytes.

  2. Implement wwrite() as a wrapper for write().

    The write() system call may not write the requested number of bytes in a single call. This happens if write() is interrupted by a signal. wwrite() will guarantee that it wrote the full num_bytes, retrying as necessary until all data is successfully written or an error occurs.

  3. Implement cat().

    Use rread() to read an entire file and wwrite() to write the contents to standard output. Keep in mind that the buffer size may not fit the entire file at once.

If you're having difficulties solving this exercise, go through this reading material.

Task: Copy a File with mmap()

Navigate to file-descriptors/drills/tasks/mmap_cp and run make to generate support. As you know, mmap() can map files in memory, letting you perform operations on them and then write them back to the disk. Let's check how well it performs by comparing it to the cp command. The benchmarking is automated by benchmark_cp.sh, so focus on completing mmap_cp.c for now.

Quiz: Check out what syscalls cp uses

  1. Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.

Test PASSED (File copies are identical)
  1. Open mmap_cp.c and complete the TODOs to map the files in memory and copy the contents. Do not forget to clean up by unmapping and closing the files.

    To test, run make test-file to generate a 1MB file with random data, and then run ./mmap_cp test-file.txt output.txt. Ensure they have the same content with a simple diff: diff test-file.txt output.txt.

  2. Compare your implementation to the cp command. Run make large-file to generate a 1GB file with random data, and then run ./benchmark_cp.sh.

    Quiz: Debunk why cp is winning

    If you want a more generic answer, check out this guide on mmap vs read()-write().

  3. This demo would not be complete without some live analysis. Uncomment the calls to wait_for_input() and rerun the program. In another terminal, run cat /proc/$(pidof mmap_cp)/maps to see mapped files, and ps -o pid,vsz,rss <PID> to see how demand paging happens.

Task: Anonymous Pipes Communication

Navigate to chapters/io/ipc/drills/tasks/anon-pipes and run make to generate the support/ folder. In this exercise, you'll implement client-server communication between a parent and a child process using an anonymous pipe. The parent will act as the sender, while the child acts as the receiver, with both processes sharing messages through the pipe. Since pipes are unidirectional, each process should close the end of the pipe it does not use.

  1. Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.

Test for short string ........... PASSED
Test for long string ........... PASSED
  1. Use the pipe() syscall to create the pipe. Remember, the first file descriptor (fds[0]) is the read end, and the second (fds[1]) is the write end, similar to how stdin and stdout are represented by file descriptors 0 and 1.

    Hint: Use exit to end the program.

    Quiz: Discover why you cannot use either end of the pipe for reading or writing

  2. Solve the TODOs in parent_loop and child_loop so that the application stops on exit. Ensure each process closes its pipe end before exiting to prevent indefinite blocking.

    Why is closing the pipe ends important?

    The child process checks for the end of communication by reading from the pipe and checking for EOF, which occurs when the write end is closed. Without closing the write end, the child will block indefinitely in read(). As for the parent, it will block indefinitely in wait().

File Descriptors

You’ve likely worked with files before; now it’s time to see what happens behind the scenes. The most common way to read a file in Linux is by using the cat command. For a quick refresher, let’s do a demo by writing some text to a file and then reading it back.

student@os:~/$ echo "OS Rullz!"  # Print 'OS Rullz!'
OS Rullz!
student@os:~/$ echo "OS Rullz!" > newfile.txt # redirect the output to newfile.txt
# Let's check the contents of newfile.txt
student@os:~/$ cat newfile.txt
OS Rullz!

If we were to implement this in C, we would use the FILE structure and write something like this:

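FILE *f = fopen("newfile.txt", "r");
if (!f) {...} // handle error

char buf[1024];
size_t n = fread(buf, 1, sizeof(buf) - 1, f); // leave room for '\0'
if (ferror(f)) {...} // handle error; fread() never returns a negative value

buf[n] = '\0'; // fread() does not null-terminate the buffer
printf("%s\n", buf);
fclose(f);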

For a complete example, check out this guide on file operations in C, Python, and Java.

FILE Operations Explained

The FILE structure is not the most straightforward method for performing file operations. It is part of libc and functions as a handler for working with files. This is not particular to C, as most programming languages offer similar handlers.

Running strace cat newfile.txt reveals that fopen() wraps open() (or openat()), fread() wraps read(), and fclose() wraps close(). As you can see, most FILE-related functions are named after the syscall they wrap, prefixed with f.

| FILE Operation | Syscall | Description |
|---|---|---|
| fopen() | open() | Opens a file and returns a file pointer. |
| fclose() | close() | Closes the file associated with the pointer. |
| fread() | read() | Reads data from the file into a buffer. |
| fwrite() | write() | Writes data from a buffer to the file. |
| fseek() | lseek() | Moves the file position indicator. |
| truncate() | ftruncate() | Truncates the file to a specified length. Both are syscalls: truncate() takes a path, while ftruncate() takes a file descriptor. |

The main distinction between FILE operations and their corresponding system calls is that the latter use a file descriptor to reference a file. File descriptors are simply indexes into the process's File Descriptor Table, which is the list of all currently open files for that process.

This concept is not entirely new, as each process has three default channels: stdin, stdout, and stderr. These are, in fact, the first three entries in every process’s File Descriptor Table.

Quiz: Test your intuition by finding the file descriptor of stderr

Let's translate our previous example to illustrate how this change affects the implementation:

int fd = open("newfile.txt", O_RDONLY);
if (fd < 0) {...} // handle error

char buf[1024];
ssize_t rc = read(fd, buf, sizeof(buf) - 1); // Not complete, should've used a while loop
if (rc < 0) {...} // handle error

buf[rc] = '\0'; // Null-terminate the buffer
printf("%s\n", buf);

To better understand the file descriptor API, you can either keep reading about file descriptor operations or check out this guide on reading Linux directories.

If you're interested in understanding how libc utilizes file descriptors to simplify common operations, check out this guide.

File Descriptor Operations

File descriptors are the primary means of referencing files in our system. They are created, deleted, and manipulated through file interface operations, namely open(), close(), read(), write(), and lseek(). From a programmer's perspective, file descriptors are simply indexes into the process's File Descriptor Table, which maintains a list of all currently open files for that process.

In this section, we will focus on how to utilize file descriptors to perform the same operations that FILE allows, and more. If you want to delve deeper into file descriptors, we recommend exploring this guide on the File Descriptor Table.

open()

All processes start with three default file descriptors, inherited from the process's parent:

  • stdin (standard input): 0
  • stdout (standard output): 1
  • stderr (standard error): 2

To create new file descriptors (i.e. open new files), a process can use the open() system call. It receives the path to the file and a set of flags, which are akin to the mode string passed to fopen(). An optional third mode parameter specifies the file's permissions in case open() has to create the file. If you use O_CREAT, remember to also pass a mode such as 0644 (rw-r--r--; the leading 0 marks the number as octal), or more restrictive permissions.

Some other useful flags for open() are:

  • O_APPEND: place the file cursor at the end of the file before each write().
  • O_CLOEXEC: close the file descriptor when exec() is called. This is useful because child processes inherit the file descriptors, and this can lead to security problems.
  • O_TRUNC: truncate the file to length 0.
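As a minimal sketch of how these flags combine, consider the helper below. The name append_line() is hypothetical and not part of the lab's support code; it creates the file with 0644 if needed and appends one line to it. It uses a single write() for brevity and treats a short write as an error.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper: append `line` to the file at `path`, creating the
 * file with rw-r--r-- (0644) permissions if it does not exist yet.
 * Returns 0 on success and -1 on error.
 */
int append_line(const char *path, const char *line)
{
    /* O_CREAT requires the third (mode) argument; O_APPEND makes every write() land at the end. */
    int fd = open(path, O_WRONLY | O_CREAT | O_APPEND | O_CLOEXEC, 0644);
    if (fd < 0)
        return -1;

    size_t len = strlen(line);
    ssize_t rc = write(fd, line, len); /* single write for brevity; a short write counts as failure */

    if (close(fd) < 0 || rc != (ssize_t)len)
        return -1;
    return 0;
}
```

Calling append_line("log.txt", "first\n") twice should leave two lines in log.txt, because O_APPEND never overwrites existing data. Swapping O_APPEND for O_TRUNC would instead discard the old contents on every call.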

close()

Once you are done with a file descriptor you should call close() to free its open file structure. This is similar to how you free memory once you are done with it.

read() and write()

read_bytes = read(fd, buf, num_bytes);
written_bytes = write(fd, buf, num_bytes);

As you know, verifying the return code of system calls is the way to go in general. This is even more apparent when dealing with I/O syscalls, namely read() and write(), which return the number of bytes read or written.

Returning the number of bytes might seem redundant, but it becomes essential once partial I/O operations come into play. read() and write() may transfer fewer bytes than requested, for example when a signal interrupts them, and it is up to you to continue from where they left off.

Remember: It is mandatory that we always use read() and write() inside while loops. Higher-level functions like fread() and fwrite() also use while loops when calling read() and write() respectively. You can practice this by implementing your own cat command.
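The loop pattern described above can be sketched as follows. The names read_full() and write_full() are hypothetical, and retrying on EINTR is one reasonable policy rather than the only one.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Read up to `count` bytes, stopping early only at EOF. Returns bytes read or -1. */
ssize_t read_full(int fd, void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t rc = read(fd, (char *)buf + total, count - total);

        if (rc < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data arrived: retry */
            return -1;
        }
        if (rc == 0)
            break;          /* EOF: no more data is coming */
        total += (size_t)rc;
    }
    return (ssize_t)total;
}

/* Write exactly `count` bytes unless an error occurs. Returns bytes written or -1. */
ssize_t write_full(int fd, const void *buf, size_t count)
{
    size_t total = 0;

    while (total < count) {
        ssize_t rc = write(fd, (const char *)buf + total, count - total);

        if (rc < 0) {
            if (errno == EINTR)
                continue;   /* interrupted before any data was written: retry */
            return -1;
        }
        total += (size_t)rc; /* partial write: continue from where it left off */
    }
    return (ssize_t)total;
}
```

Both helpers advance a running total and pass the remaining byte count to the next call, which is what "continue from where it left off" means in practice.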

In the following sections, we'll use file descriptors and read() and write() to interact with some inter-process-communication mechanisms, such as pipes.

lseek()

As you know, reading or writing from a file always continues from where it left off. Most of the time you would read from a file monotonically so it makes sense to keep the interface clean and handle bookkeeping in the back.

For cases when you need to jump around a file, selectively fetching or updating data, we have lseek().

off_t lseek(int fd, off_t offset, int whence);

Its parameters are pretty intuitive: fd stands for the file descriptor and offset stands for the offset. The whence directive explains what offset is relative to, and has the following values:

  • SEEK_SET: the file offset is set to offset bytes.
  • SEEK_CUR: the file offset is set to its current location plus offset bytes.
  • SEEK_END: the file offset is set to the size of the file plus offset bytes.
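To make the three whence values concrete, here is a small sketch. The helper patch_at() is a hypothetical name: it overwrites a few bytes at an absolute position and then uses SEEK_END to report the file size.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: overwrite `len` bytes at absolute position `pos`
 * in an existing file. Returns the file size afterwards, or -1 on error.
 */
off_t patch_at(const char *path, off_t pos, const void *data, size_t len)
{
    off_t size = -1;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;

    /* SEEK_SET: absolute position, counted from the start of the file. */
    if (lseek(fd, pos, SEEK_SET) < 0)
        goto out;
    /* A single write() keeps the sketch short; real code should loop on partial writes. */
    if (write(fd, data, len) != (ssize_t)len)
        goto out;
    /* SEEK_CUR with offset 0 only reports where the cursor is now. */
    if (lseek(fd, 0, SEEK_CUR) != pos + (off_t)len)
        goto out;
    /* SEEK_END with offset 0 jumps to the end, so the result is the file size. */
    size = lseek(fd, 0, SEEK_END);
out:
    close(fd);
    return size;
}
```

For a 10-byte file, patch_at(path, 4, "XY", 2) replaces bytes 4 and 5 and returns 10, since overwriting inside the file does not change its length.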

Pipes

Anonymous Pipes

In this session, we'll explore a new means of Inter-Process Communication (IPC), namely pipes. Pipes are by no means something new, and you most probably played with them already in bash:

cat log_*.csv | tr -s ' ' | cut -d ',' -f 2 | sort -u | head -n 10

Using pipes (denoted as | in the above example) enables linking the stdout and stdin of multiple processes. The stdout of cat is the stdin of tr, whose stdout is the stdin of cut and so on. This "chain" of commands looks like this:

[Figure: Piped Commands]

So here we have a unidirectional stream of data that starts from cat, is modified by each new command, and then is passed to the next one. We can tell from the image above that the communication channel between any 2 adjacent commands allows one process to write to it while the other reads from it. For example, there is no need for cat to read any of tr's output, only vice versa.

In UNIX, the need for such a channel is fulfilled by the pipe() syscall. Imagine there's a literal pipe between any 2 adjacent commands in the image above, where data is what flows through this pipe in only a single way.

Such pipes are known as anonymous pipes because they don’t have identifiers. They are created by a parent process, which shares them with its children. Data written to an anonymous pipe is stored in a kernel-managed circular buffer, where it’s available for related processes to read.

The following example showcases a typical workflow with anonymous pipes in Unix:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define EXIT_ON_COND(cond) do { if (cond) exit(EXIT_FAILURE); } while (0)

int main(void)
{
    // pipe_fd[0] -> for reading
    // pipe_fd[1] -> for writing
    int pipe_fd[2];

    EXIT_ON_COND(pipe(pipe_fd) < 0); // Create the pipe

    pid_t pid = fork(); // Fork to create a child process
    EXIT_ON_COND(pid < 0); // Check for fork() failure

    if (pid == 0) { // Child process
        EXIT_ON_COND(close(pipe_fd[0]) != 0); // Close the read end
        EXIT_ON_COND(write(pipe_fd[1], "hi", 2) < 0); // Write "hi" to the pipe
        EXIT_ON_COND(close(pipe_fd[1]) != 0); // Close the write end
    } else { // Parent process
        char buf[BUFSIZ];

        EXIT_ON_COND(close(pipe_fd[1]) != 0); // Close the write end
        ssize_t n = read(pipe_fd[0], buf, sizeof(buf) - 1); // Read data, leaving room for '\0'
        EXIT_ON_COND(n < 0); // Check for read() failure

        buf[n] = '\0'; // Null-terminate the string
        printf("Received: %s\n", buf); // Output the received message
    }

    return 0;
}

In summary, the process creates the pipe and then calls fork() to create a child process. The file descriptors created by pipe() are shared with the child by default, because the child receives a copy of the parent's file descriptor table when it is created. To better understand how this works, please refer to this guide on the File Descriptor Table (FDT).

You can test your understanding of anonymous pipes by completing the Anonymous Pipes Communication task.

Check your understanding by identifying the limitations of anonymous pipes

Named Pipes (FIFOs)

As we discussed, anonymous pipes are named so because they lack identifiers. Named pipes address this limitation by creating a special file on disk that serves as an identifier for the pipe.

You might think that interacting with a file would result in a performance loss compared to anonymous pipes, but this is not the case. The FIFO file acts merely as a handler within the filesystem, which is used to write data to a buffer inside the kernel. This buffer is responsible for holding the data that is passed between processes, not the filesystem itself.

Keep in mind that reading from and writing to a FIFO is not the same as interacting with a regular file: read() blocks while the pipe is empty and returns 0 (EOF) once the peer closes its end of the pipe.
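The blocking and EOF behaviour can be seen in a short sketch. fifo_roundtrip() is a hypothetical helper, not part of the lab's support code: it creates a FIFO with mkfifo(), lets a child write a message into it, and reads the message back in the parent.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: returns the number of bytes received through the FIFO, or -1. */
ssize_t fifo_roundtrip(const char *path, const char *msg, char *out, size_t out_size)
{
    if (mkfifo(path, 0600) < 0)     /* create the special file that names the pipe */
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        unlink(path);
        return -1;
    }

    if (pid == 0) {                 /* child: the writer */
        int wfd = open(path, O_WRONLY);     /* blocks until a reader opens the FIFO */
        if (wfd < 0)
            _exit(1);
        ssize_t rc = write(wfd, msg, strlen(msg));
        close(wfd);                 /* lets the reader see EOF */
        _exit(rc < 0 ? 1 : 0);
    }

    size_t total = 0;
    ssize_t rc = -1;
    int rfd = open(path, O_RDONLY); /* blocks until a writer opens the FIFO */

    if (rfd < 0) {
        kill(pid, SIGKILL);         /* do not leave the writer blocked in open() */
    } else {
        do {                        /* read() returns 0 once the writer closes its end */
            rc = read(rfd, out + total, out_size - total);
            if (rc > 0)
                total += (size_t)rc;
        } while (rc > 0 && total < out_size);
        close(rfd);
    }

    waitpid(pid, NULL, 0);
    unlink(path);                   /* remove the FIFO's name from the filesystem */
    return rc < 0 ? -1 : (ssize_t)total;
}
```

The two processes here are related only for convenience. Because the FIFO has a path, any process that can open that path could take either role.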

You can practice working with named pipes by completing the Named Pipes Communication task.

Redirections

Although not directly related, redirections (e.g., ls > file.txt) operate similarly to pipes. A process creates a new file descriptor, updates its stdout, and then creates the child process. You can explore the similarities with pipes further in this guide on redirections.

Guide: Simple File Operations

To manipulate the file (read its contents, modify them, change its size etc.), each process must first get a handler to this file. Think of this handler as an object by which the process can identify and refer to the file.

Now take a look at the code examples in file-descriptors/guides/simple-file-operations/support. Each of them reads the contents of file.txt, modifies them, and then reads the previously modified file again. Use make to compile the C code, and make java-file-operations to compile the Java code.

Now run the programs repeatedly in whatever order you wish:

student@os:~/.../simple-file-operations/support$ python3 file_operations.py
File contents are: OS Rullz!
Wrote new data to file
File contents are: Python was here!

student@os:~/.../simple-file-operations/support$ ./file_operations # from the C code
File contents are: Python was here!
Wrote new data to file
File contents are: C was here!

student@os:~/.../simple-file-operations/support$ java FileOperations
File contents are: Python was here!
Wrote new data to file
File contents are: Java was here!

Note that each piece of code creates a variable, which is then used as a handler.

Quiz

Guide: Redirections

In the File Descriptors section, we mentioned redirections such as echo "OS Rullz!" > newfile.txt. For this to work, the target file has to be opened at some point. Let’s explore the relevant system calls (open(), openat()) to see this in action, this time using ls > file.txt:

student@os:~/.../guides/redirections$ strace -e trace=open,openat,execve,dup2 -f sh -c "ls > file.txt"
execve("/usr/bin/sh", ["sh", "-c", "ls > file.txt"], 0x7fffe1383e78 /* 36 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1) = 1
strace: Process 77547 attached
[pid 77547] execve("/usr/bin/ls", ["ls"], 0x55ebb9b2dbf8 /* 36 vars */) = 0
[...]

Notice that we used sh -c to run ls > file.txt. Running strace -e trace=open,openat,execve,dup2 -f ls > file.txt would instead redirect the strace output to file.txt, hiding any system calls related to file.txt. This happens because, as we discussed earlier, redirection is transparent for the process being redirected. The process still writes to its stdout, but stdout itself is now directed to the specified file.

Remember how processes are created using fork() and exec(), as shown in this diagram:

[Figure: Launching a new command in Bash]

In our case, the main process is sh -c "ls > file.txt". In the strace output, we see it opens file.txt on file descriptor 3, then uses dup2(3, 1) to redirect file descriptor 1 to the same open file structure. It then forks a child process and calls execve().

execve() replaces the virtual address space (VAS) of the current process but retains the file descriptor table. This preserves the stdout of the parent process, so the redirection to file.txt remains effective in the new process as well.
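A minimal sketch of the same open(), dup2(), execve() sequence in C is shown below. run_redirected() is a hypothetical helper. Unlike the trace above, it performs the redirection in the child, so the parent's own stdout is left untouched.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with its stdout redirected to `path`,
 * roughly what a shell does for `cmd > path`.
 * Returns the child's exit status, or -1 on error.
 */
int run_redirected(char *const argv[], const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
        if (fd < 0)
            _exit(127);
        if (fd != STDOUT_FILENO) {
            if (dup2(fd, STDOUT_FILENO) < 0)    /* fd 1 now refers to the file */
                _exit(127);
            close(fd);                          /* fd 1 keeps the open file structure alive */
        }
        execvp(argv[0], argv);                  /* the file descriptor table survives exec */
        _exit(127);                             /* only reached if exec failed */
    }

    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

The program started by execvp() never learns about the file: it simply writes to file descriptor 1, which is why redirection is transparent to it.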

dup()/dup2() - Atomic IO

If you're not familiar with the dup() syscall, it essentially creates a new file descriptor pointing to an existing open file structure. Unlike open(), as discussed in the file descriptor table guide, dup() doesn’t create a fresh open file structure.

The dup2(old_fd, new_fd) variant closes new_fd (if it is open) before making it point to the same open file structure as old_fd. While this might seem equivalent to calling close(new_fd) followed by dup(old_fd), dup2() performs both steps atomically, which prevents race conditions.

To see why atomicity matters, review the code in support/redirect_parallel.c, compile it, and run it.

You’ll find that redirect_stderr_file.txt contains Message for STDOUT, and redirect_stdout_file.txt contains Message for STDERR. Investigate the code to understand where the race condition occurred.

While a mutex around the close() and open() sequence could fix this, it can make the code cumbersome. Instead, follow the FIXME comments for a more elegant solution using dup2().

Guide: File Descriptor Table

Just as each process has its own Virtual Address Space for memory access, it also maintains its own File Descriptor Table (FDT) for managing open files. In this section we will explore how the process structures change when executing syscalls like open(), read(), write(), and close().

Upon startup, every process has three file descriptors that correspond to standard input (stdin), standard output (stdout), and standard error (stderr). These descriptors are inherited from the parent process and occupy the first three entries in the process's FDT.

| fd | Open File Struct |
|---|---|
| 0 | stdin |
| 1 | stdout |
| 2 | stderr |

Each entry points to an open file structure which stores data about the current session:

  • Permissions: define how the file can be accessed (read, write, or execute); these are the options passed to open().
  • Offset: the current position inside the file from which the next read or write operation will occur.
  • Reference Count: The number of file descriptors referencing this open file structure.
  • Inode Pointer: A pointer to the inode structure that contains both the data and metadata associated with the file.

These Open File Structures are held in the Open File Table (OFT), which is global across the system. Whenever a new file is opened, a new entry is created in this table.

To illustrate this, let's consider a code snippet and examine how the File Descriptor Table and Open File Table would appear. We will focus on fd, permissions, offset, and reference count, as the inode pointer is not relevant at this moment. For simplicity, we'll also omit the standard file descriptors, as they remain unchanged.

int fd = open("file.txt", O_RDONLY);
int fd2 = open("file2.txt", O_WRONLY | O_APPEND);
int fd3 = open("file.txt", O_RDWR);
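
| OFT index | Path | Perm | Offset | RefCount |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
| 123 | file.txt | r-- | 0 | 1 |
| 140 | file2.txt | -w- | 150 | 1 |
| 142 | file.txt | rw- | 0 | 1 |

| fd | Open File Struct (OFT index) |
|---|---|
| 3 | 123 |
| 4 | 140 |
| 5 | 142 |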

Let's discuss the changes from the OFT and FDT to understand what happened:

  • open("file.txt", O_RDONLY) created a new open file structure in the Open File Table for file.txt. The entry has read-only (O_RDONLY) permissions and offset 0, representing the start of the file. Subsequently, file descriptor 3 was assigned to point to this OFT entry, and the reference counter was set to 1.
  • open("file2.txt", O_WRONLY | O_APPEND) created a similar structure in the OFT for file2.txt, but with write-only (O_WRONLY) permissions and an offset of 150, representing the end of the file (O_APPEND), which in this example is 150 bytes long. It then assigned this entry to file descriptor 4.
  • open("file.txt", O_RDWR) created a new open file structure for file.txt and assigned it to file descriptor 5.

At this point, one might wonder why the last open() call didn't reuse the entry at file descriptor 3 and increase its reference counter instead. It might seem logical, but doing so would lead to conflicts with the permissions and offset of the two open file structures. Remember: each open() call creates a new open file structure in the Open File Table.

This raises the question about the necessity for a reference counter. The short answer is dup() (or dup2()) syscall, which duplicates a file descriptor. Let's continue our previous example with the following snippet:

// fd = 3 (from the previous snippet)
int fd4 = dup(fd);
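
| OFT index | Path | Perm | Offset | RefCount |
|---|---|---|---|---|
| ... | ... | ... | ... | ... |
| 123 | file.txt | r-- | 0 | 2 |
| 140 | file2.txt | -w- | 150 | 1 |
| 142 | file.txt | rw- | 0 | 1 |

| fd | Open File Struct (OFT index) |
|---|---|
| 3 | 123 |
| 4 | 140 |
| 5 | 142 |
| 6 | 123 |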

The call to dup(fd) created a new file descriptor (6) that points to the same open file structure as its argument fd (which equals 3 in our example). This operation also incremented the reference counter for the entry 123 in the Open File Table.

As a result, operations performed on file descriptor 3 and file descriptor 6 are equivalent. For instance, read(3) and read(6) will both increment the shared file offset, while the offset of file descriptor 5 will remain unchanged. If you want to see a concrete example of when duplicating file descriptors is useful, check out file-descriptors/guides/fd-table/support/redirect-stdout.c.

Now that you know how to create entries in the File Descriptor Table and the Open File Table, it’s important to understand how to remove them. Calling close() will always free a file descriptor, but it will only decrement the reference counter of the open file structure. The actual closing of the file occurs when the reference counter reaches 0.
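The difference between dup() and a second open() can be observed directly. In the sketch below, offsets_after_read() is a hypothetical helper that reads one byte through the original descriptor and then reports the offsets seen through the other two.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical demo: returns 0 on success and -1 on error.
 * `path` must name a readable file with at least one byte.
 */
int offsets_after_read(const char *path, off_t *dup_off, off_t *reopen_off)
{
    int rc = -1;
    char c;
    int fd = open(path, O_RDONLY);              /* new open file structure, offset 0 */
    int fd_dup = (fd < 0) ? -1 : dup(fd);       /* same structure, reference count 2 */
    int fd_new = open(path, O_RDONLY);          /* separate structure with its own offset */

    if (fd >= 0 && fd_dup >= 0 && fd_new >= 0 && read(fd, &c, 1) == 1) {
        *dup_off = lseek(fd_dup, 0, SEEK_CUR);      /* moved along with fd */
        *reopen_off = lseek(fd_new, 0, SEEK_CUR);   /* still at the start */
        rc = 0;
    }

    /* Each close() frees one descriptor; a structure is released when its count reaches 0. */
    if (fd >= 0)
        close(fd);
    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_new >= 0)
        close(fd_new);
    return rc;
}
```

After the one-byte read, the duplicated descriptor reports offset 1 while the independently opened one still reports 0, matching the tables above.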

Guide: libc FILE struct

Now, we will take a short look at how the file descriptors are handled in libc. The Software Stack chapter has taught us that applications generally interact with libraries which expose wrappers on top of syscalls. The most important library in a POSIX system (such as Linux) is libc. Among many others, it provides higher-level abstractions over I/O-related syscalls.

Musl (read just like "muscle") is a lightweight implementation of libc, which exposes the same API that you have used so far, while also being fit for embedded and OS development. For example, Unikraft unikernels may use musl.

First, it provides a struct that groups together the data needed when handling files. We know from the example in support/simple-file-operations/file_operations.c that the file handler employed by libc is FILE *. FILE is just a typedef for struct _IO_FILE. Here are the most important fields in struct _IO_FILE:

struct _IO_FILE {
    int fd; /* File descriptor */

    unsigned flags; /* Flags with which `open()` was called */

    int mode; /* File permissions; passed to `open()` */

    off_t off; /* File offset from where to read / write */

    /**
     * Internal buffer used to make fewer costly `read()`/`write()`
     * syscalls.
     */
    unsigned char *buf;
    size_t buf_size;

    /* Pointers for reading and writing from/to the buffer defined above. */
    unsigned char *rpos, *rend;
    unsigned char *wend, *wpos;

    /* Function pointers to syscall wrappers. */
    size_t (*read)(FILE *, unsigned char *, size_t);
    size_t (*write)(FILE *, const unsigned char *, size_t);
    off_t (*seek)(FILE *, off_t, int);
    int (*close)(FILE *);

    /* Lock for concurrent file access. */
    volatile int lock;
};

As you might have imagined, this structure contains the underlying file descriptor, the mode (read, write, truncate etc.) with which the file was opened, as well as the offset within the file from which the next read / write will start.

Libc also defines its own wrappers over commonly-used syscalls, such as read(), write(), close() and lseek(). These syscalls themselves need to be implemented by the driver for each file system. This is done by writing the required functions for each syscall and then populating a kernel structure (struct file_operations in Linux) with pointers to them. In it you will recognise quite a few syscalls: open(), close(), read(), write(), mmap() etc.

printf() Buffering

  1. Navigate to buffering/support/printf_buffering.c. Those printf() calls obviously end up calling write() at some point. Run the code under strace.

    Quiz: What syscall does printf use?

    Since there is only one write() syscall despite multiple calls to printf(), it means that the strings given to printf() as arguments are kept somewhere until the syscall is made. That somewhere is precisely that buffer inside struct _IO_FILE that we highlighted above. Remember that syscalls cause the system to change from user mode to kernel mode, which is time-consuming. Instead of performing one write() syscall per call to printf(), it is more efficient to copy the string passed to printf() to an internal buffer inside libc (the unsigned char *buf from above) and then at a given time (like when the buffer is full for example) write() the whole buffer. This results in far fewer write() syscalls.

  2. Now, it is interesting to see how we can force libc to dump that internal buffer. The most direct way is by using the fflush() library call, which is made for this exact purpose. But we can be more subtle. Add a \n in some of the strings printed in buffering/support/printf_buffering.c. Place them wherever you want (at the beginning, at the end, in the middle). Recompile the code and observe its change in behaviour under strace.

    Quiz: How to get data out of printf's buffer?

    Now we know that I/O buffering does happen within libc. If you need further convincing, check out the Musl implementation of fread(), for example. It first copies the data previously saved in the internal buffer:

    if (f->rpos != f->rend) {
        /* First exhaust the buffer. */
        k = MIN(f->rend - f->rpos, l);
        memcpy(dest, f->rpos, k);
        f->rpos += k;
        dest += k;
        l -= k;
    }

    Then, if more data is requested and the internal buffer isn't full, it refills it using the internal read() wrapper. This wrapper also places the data inside the destination buffer.
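The effect of libc's write buffer can also be checked without strace. The sketch below is not taken from the lab's support code and buffering_demo() is a hypothetical name. It wraps the write end of a pipe in a FILE *, prints to it, and uses poll() to check whether any bytes reached the kernel before and after fflush().

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Return 1 if `fd` has data ready to read right now, 0 if not, -1 on error. */
static int has_data(int fd)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int rc = poll(&p, 1, 0);    /* timeout 0: do not block */

    if (rc < 0)
        return -1;
    return rc > 0 && (p.revents & POLLIN);
}

/* Hypothetical demo: returns 0 on success and -1 on error. */
int buffering_demo(int *before_flush, int *after_flush)
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;

    FILE *out = fdopen(fds[1], "w");
    if (!out) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    setvbuf(out, NULL, _IOFBF, BUFSIZ);     /* make full buffering explicit */

    fprintf(out, "Hello, ");
    fprintf(out, "buffer!\n");              /* still sitting in the libc buffer */
    *before_flush = has_data(fds[0]);

    fflush(out);                            /* forces the write() syscall */
    *after_flush = has_data(fds[0]);

    fclose(out);                            /* also closes fds[1] */
    close(fds[0]);
    return 0;
}
```

With full buffering, the newline alone does not trigger a write(), so the pipe stays empty until fflush() runs. A stream attached to a terminal is typically line-buffered instead, which is why adding \n changes the behaviour you see under strace.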

Guide: File Mappings

Mapping a file to the VAS of a process is similar to how shared libraries are loaded into the same VAS. It's a fancier way of saying that the contents of a file are copied from a given offset within that file to a given address. What's nice about this is that the OS handles all offsets, addresses and memory allocations on its own, with a single highly versatile syscall: mmap().

Let's run a sleep process and inspect its memory zones:

student@os:~$ sleep 1000 &  # start a `sleep` process in the background
[1] 17579

student@os:~$ cat /proc/$(pidof sleep)/maps
55b7b646f000-55b7b6471000 r--p 00000000 103:07 6423964 /usr/bin/sleep
55b7b6471000-55b7b6475000 r-xp 00002000 103:07 6423964 /usr/bin/sleep
55b7b6475000-55b7b6477000 r--p 00006000 103:07 6423964 /usr/bin/sleep
55b7b6478000-55b7b6479000 r--p 00008000 103:07 6423964 /usr/bin/sleep
55b7b6479000-55b7b647a000 rw-p 00009000 103:07 6423964 /usr/bin/sleep
55b7b677c000-55b7b679d000 rw-p 00000000 00:00 0 [heap]
7fe442f61000-7fe44379d000 r--p 00000000 103:07 6423902 /usr/lib/locale/locale-archive
7fe44379d000-7fe4437bf000 r--p 00000000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe4437bf000-7fe443937000 r-xp 00022000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443937000-7fe443985000 r--p 0019a000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443985000-7fe443989000 r--p 001e7000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443989000-7fe44398b000 rw-p 001eb000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe44398b000-7fe443991000 rw-p 00000000 00:00 0
7fe4439ad000-7fe4439ae000 r--p 00000000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439ae000-7fe4439d1000 r-xp 00001000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439d1000-7fe4439d9000 r--p 00024000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439da000-7fe4439db000 r--p 0002c000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439db000-7fe4439dc000 rw-p 0002d000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439dc000-7fe4439dd000 rw-p 00000000 00:00 0
7ffd07aeb000-7ffd07b0c000 rw-p 00000000 00:00 0 [stack]
7ffd07b8b000-7ffd07b8e000 r--p 00000000 00:00 0 [vvar]
7ffd07b8e000-7ffd07b8f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]

In the output above, you can see that the .text, .rodata, and .data sections for each dynamic library are mapped into the process’s VAS, along with the sections of the main executable.

To understand how these mappings are created, let’s explore a simpler example. Below is an illustration of how libc is loaded (or mapped) into the VAS of an ls process.

student@os:~$ strace ls
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
[...]
mmap(NULL, 2037344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb313c9c000
mmap(0x7fb313cbe000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7fb313cbe000
mmap(0x7fb313e36000, 319488, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19a000) = 0x7fb313e36000
mmap(0x7fb313e84000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fb313e84000

For a quick recap on mmap(addr, length, prot, flags, fd, offset), the fifth argument specifies the file descriptor to copy data from, while the sixth is the offset within the file from where to start copying.
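The same call works for regular files. As a minimal sketch, the hypothetical helper count_byte_mmap() below maps a whole file read-only and scans it through a plain pointer, with no read() calls.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: count the occurrences of byte `c` in the file at
 * `path`. Returns -1 on error.
 */
long count_byte_mmap(const char *path, char c)
{
    long count = 0;
    struct stat st;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0) {       /* the file size gives the mapping length */
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap() rejects zero-length mappings */
        close(fd);
        return 0;
    }

    /* addr = NULL lets the kernel pick the address; offset 0 maps from the start. */
    char *data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                      /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    for (off_t i = 0; i < st.st_size; i++)  /* pages are faulted in on first access */
        if (data[i] == c)
            count++;

    munmap(data, (size_t)st.st_size);
    return count;
}
```

Note that the descriptor can be closed right after mmap() succeeds: the mapping holds its own reference to the file until munmap() is called.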

In summary, when an executable runs, the loader uses mmap() to reserve memory zones for its shared libraries. Performance is not affected by this since pages are populated on-demand, when they’re accessed for the first time.

File I/O vs mmap()

When it comes to dynamic libraries, mmap is unmatched. With a single call, it handles address mapping, permission setting, and leverages demand paging to populate pages only when accessed. Additionally, mmap() fully supports copy-on-write (COW), allowing libraries to share the same physical frames across multiple processes, which conserves memory and reduces load time.

In contrast, using read and write would require loading the entire library into physical memory for each process individually, missing out on both the copy-on-write and demand paging benefits.

For regular files, however, the choice isn’t always straightforward. The main sources of overhead for mmap() include managing virtual memory mappings - which can lead to TLB flushes - and the cost of page faults due to demand paging.

On the plus side, mmap() excels with random access patterns, efficiently reusing mapped pages. It is also great for operating on large amounts of data, as it lets the kernel automatically evict and reload pages as needed under memory pressure.

A concrete scenario where these downsides outweigh the benefits of mmap() is one-time, sequential I/O. If you’re simply planning to read or write a file in one go, read() and write() are the way to go.

Guide: Reading Linux Directories

Everything in Linux is a file. This statement says that the Linux OS treats every entry in a file system (regular file, directory, block device, char device, link, UNIX socket) as a file. This unified approach simplifies file handling, allowing a single interface to interact with various types of entries. Let's see how this works in practice:

  1. Navigate to guides/reading-linux-dirs/support/ and check out dir_ops.c. This code creates a directory dir, if it does not exist, and attempts to open it the same way we would open a regular file. Compile and run the code.

    student@os:~/.../reading-linux-dirs/support$ ./dir_ops
    12:45:34 FATAL dir_ops.c:17: fopen: Is a directory

    The error message is crystal clear: we cannot use fopen() on directories. So the FILE structure is unsuited for directories. Therefore, this handler is not generic enough for a regular Linux filesystem, and we have to use a lower-level function.

    Quiz - What syscall does fopen() use?

  2. Now that we know that fopen() relies on openat(), let's try using open(), which wraps openat() but offers a simpler interface.

    Inspect, compile and run the code dir_ops_syscalls.c.

    student@os:~/...reading-linux-dirs/support$ ./dir_ops_syscalls
    Directory file descriptor is: 3

    This output proves that the open() syscall is capable of also handling directories, so it's closer to what we want.

    Note that it is rather uncommon to use open() for directories. Most of the time, opendir() is used instead.
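For reference, a minimal sketch of the opendir() approach is given below. count_dir_entries() is a hypothetical helper and is not part of dir_ops.c or dir_ops_syscalls.c.

```c
#define _POSIX_C_SOURCE 200809L /* expose POSIX declarations under strict -std=c11 */
#include <dirent.h>
#include <string.h>

/*
 * Hypothetical helper: count the entries in the directory at `path`,
 * skipping "." and "..". Returns -1 if the directory cannot be opened.
 */
int count_dir_entries(const char *path)
{
    DIR *dir = opendir(path);   /* DIR * is the directory counterpart of FILE * */
    if (!dir)
        return -1;

    int count = 0;
    struct dirent *entry;

    while ((entry = readdir(dir)) != NULL) {    /* NULL: end of directory (or error) */
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;
        count++;
    }

    closedir(dir);              /* also releases the underlying file descriptor */
    return count;
}
```

Each struct dirent exposes the entry's name through d_name, which is usually all that a simple ls-like listing needs.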

In conclusion, the key difference between fopen() and open() is in the type of handler they return. The FILE structure from fopen() is suited only for regular files, while the file descriptor returned by open() is more flexible. The differences between the two handlers are explored in the file descriptors section.