Lab 9 - File Descriptors
Task: My cat
Navigate to chapters/io/file-descriptors/drills/tasks/my-cat/support/src and check out my_cat.c.
We propose to implement the Linux command cat, which reads one or more files, concatenates them (hence the name cat), and prints them to standard output.
- Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.
Test 1: Comparing single file output..........................PASSED (+30 points)
Test 2: Comparing multiple files output.......................PASSED (+30 points)
Test 3: Testing empty file....................................PASSED (+30 points)
----------------------------------------
Final Score: 100/100 points
Good job!
----------------------------------------
- Implement the rread() wrapper over read().
  The read() system call does not guarantee that it will read the requested number of bytes in a single call. This happens when the file does not have enough bytes, or when read() is interrupted by a signal. rread() will handle these situations, ensuring that it reads either num_bytes or all available bytes.
- Implement wwrite() as a wrapper for write().
  The write() system call may not write the requested number of bytes in a single call. This happens if write() is interrupted by a signal. wwrite() will guarantee that it wrote the full num_bytes, retrying as necessary until all data is successfully written or an error occurs.
- Implement cat().
  Use rread() to read an entire file and wwrite() to write the contents to standard output. Keep in mind that the buffer size may not fit the entire file at once.
If you're having difficulties solving this exercise, go through this reading material.
Task: Copy a File with mmap()
Navigate to file-descriptors/drills/tasks/mmap_cp and run make to generate support.
As you know, mmap() can map files in memory, perform operations on them, and then write them back to the disk.
Let's check how well it performs by comparing it to the cp command.
The benchmarking is automated by benchmark_cp.sh, so focus on completing mmap_cp.c for now.
- Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.
Test PASSED (File copies are identical)
- Open mmap_cp.c and complete the TODOs to map the files in memory and copy the contents. Do not forget to clean up by unmapping and closing the files.
  To test, run make test-file to generate a 1MB file with random data, and then run mmap_cp test-file output.txt. Ensure they have the same content with a simple diff: diff test-file.txt output.txt.
- Compare your implementation to the cp command. Run make large-file to generate a 1GB file with random data, and then run ./benchmark_cp.sh.
  If you want a more generic answer, check out this guide on mmap vs read()-write().
- This demo would not be complete without some live analysis. Uncomment the calls to wait_for_input() and rerun the program. In another terminal, run cat /proc/$(pidof mmap_cp)/maps to see mapped files, and ps -o pid,vsz,rss <PID> to see how demand paging happens.
Task: Anonymous Pipes Communication
Navigate to chapters/io/ipc/drills/tasks/anon-pipes and run make to generate the support/ folder.
In this exercise, you'll implement client-server communication between a parent and a child process using an anonymous pipe.
The parent will act as the sender, while the child acts as the receiver, with both processes sharing messages through the pipe.
Since pipes are unidirectional, each process should close the end of the pipe it does not use.
- Inside the tests/ directory, you will need to run checker.sh. The output for a successful implementation should look like this:
./checker.sh
make: Nothing to be done for 'all'.
Test for short string ........... PASSED
Test for long string ........... PASSED
- Use the pipe() syscall to create the pipe. Remember, the first file descriptor (fds[0]) is the read end, and the second (fds[1]) is the write end, similar to how stdin and stdout are represented by file descriptors 0 and 1.
  Hint: Use exit to end the program.
- Solve the TODOs in parent_loop and child_loop so that the application stops on exit. Ensure each process closes its pipe end before exiting to prevent indefinite blocking.
  Why is closing the pipe ends important?
  The child process checks for the end of communication by reading from the pipe and checking for EOF, which occurs when the write end is closed. Without closing the write end, the child will block indefinitely in read(). As for the parent, it will block indefinitely in wait().
File Descriptors
You’ve likely worked with files before;
now it’s time to see what happens behind the scenes.
The most common way to read a file in Linux is by using the cat command.
For a quick refresher, let’s do a demo by writing some text to a file and then reading it back.
student@os:~/$ echo "OS Rullz!" # Print 'OS Rullz!'
OS Rullz!
student@os:~/$ echo "OS Rullz!" > newfile.txt # redirect the output to newfile.txt
# Let's check the contents of newfile.txt
student@os:~/$ cat newfile.txt
OS Rullz!
If we were to implement this in C, we would use the FILE structure and write something like this:
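FILE *f = fopen("newfile.txt", "r");
if (!f) {...} // handle error
char buf[1024];
size_t n = fread(buf, 1, sizeof(buf) - 1, f); // leave room for the terminator
if (ferror(f)) {...} // fread() signals errors through a short count, never a negative value
buf[n] = '\0'; // fread() does not null-terminate the buffer
printf("%s\n", buf);
fclose(f);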
For a complete example, check out this guide on file operations in C, Python, and Java.
FILE Operations Explained
The FILE structure is not the most straightforward method for performing file operations.
It is part of libc and functions as a handler for working with files.
This is not particular to C, as most programming languages offer similar handlers.
Running strace cat newfile.txt reveals that fopen() wraps open() (or openat()), fread() wraps read(), and fclose() wraps close().
As you can see, the FILE-related functions are named after the syscalls they wrap, with an f prefix.
FILE Operation | Syscall | Description |
---|---|---|
fopen() | open() | Opens a file and returns a file pointer. |
fclose() | close() | Closes the file associated with the pointer. |
fread() | read() | Reads data from the file into a buffer. |
fwrite() | write() | Writes data from a buffer to the file. |
fseek() | lseek() | Moves the file position indicator. |
truncate() | ftruncate() | Truncates the file to a specified length (both are syscalls: truncate() takes a path, ftruncate() a file descriptor). |
The main distinction between FILE operations and their corresponding system calls is that the latter use a file descriptor to reference a file.
File descriptors are simply indexes into the process's File Descriptor Table, which is the list of all currently open files for that process.
This concept is not entirely new, as each process has three default channels: stdin, stdout, and stderr.
These are, in fact, the first three entries in every process’s File Descriptor Table.
Let's translate our previous example to illustrate how this change affects the implementation:
int fd = open("newfile.txt", O_RDONLY);
if (fd < 0) {...} // handle error
char buf[1024];
ssize_t rc = read(fd, buf, sizeof(buf) - 1); // Not complete, should've used a while loop
if (rc < 0) {...} // handle error
buf[rc] = '\0'; // Null-terminate the buffer (one byte was kept free for this)
printf("%s\n", buf);
To better understand the file descriptor API, you can either keep reading about file descriptor operations or check out this guide on reading Linux directories.
If you're interested in understanding how libc utilizes file descriptors to simplify common operations, check out this guide.
File Descriptor Operations
File descriptors are the primary means of referencing files in our system.
They are created, deleted, and manipulated through file interface operations, namely open(), close(), read(), write(), and lseek().
From a programmer's perspective, file descriptors are simply indexes into the process's File Descriptor Table, which maintains a list of all currently open files for that process.
In this section, we will focus on how to utilize file descriptors to perform the same operations that FILE allows, and more.
If you want to delve deeper into file descriptors, we recommend exploring this guide on the File Descriptor Table.
open()
All processes start with three default file descriptors, inherited from the process's parent:
- stdin (standard input): 0
- stdout (standard output): 1
- stderr (standard error): 2
To create new file descriptors (i.e. open new files), a process can use the open() system call.
It receives the path to the file and some flags, which are akin to the mode string passed to fopen().
An optional mode parameter, which denotes the file's permissions if open() must create it, can also be provided.
If you use O_CREAT, just remember to also pass 0644 (rw-r--r--; the leading 0 marks the number as octal), or more restrictive permissions.
Some other useful flags for open() are:
- O_APPEND: place file cursor at the end
- O_CLOEXEC: close the file descriptor when exec() is called. This is useful because child processes inherit the file descriptors, and this can lead to security problems.
- O_TRUNC: truncate the file to length 0.
close()
Once you are done with a file descriptor, you should call close() to free its open file structure.
This is similar to how you free memory once you are done with it.
read() and write()
read_bytes = read(fd, buf, num_bytes);
written_bytes = write(fd, buf, num_bytes);
As you know, verifying the return code of system calls is the way to go in general.
This is even more apparent when dealing with I/O syscalls, namely read() and write(), which return the number of bytes read or written.
Syscalls returning the number of bytes might seem redundant, but once you hear about partial I/O operations, it is of utmost importance. If your process was interrupted by a signal while reading or writing, it is up to you to continue from where it left off.
Remember: It is mandatory that we always use read() and write() inside while loops.
Higher-level functions like fread() and fwrite() also use while loops when calling read() and write() respectively.
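The loop pattern can be sketched as follows. The names read_all() and write_all() are hypothetical (they are not libc functions), and this is one possible shape under the assumption that an EINTR failure should simply be retried.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: read up to `count` bytes, stopping early only at EOF.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_all(int fd, void *buf, size_t count)
{
	size_t total = 0;

	while (total < count) {
		ssize_t rc = read(fd, (char *)buf + total, count - total);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data: retry */
			return -1;
		}
		if (rc == 0)
			break;			/* EOF: nothing more to read */
		total += (size_t)rc;
	}

	return (ssize_t)total;
}

/* Hypothetical helper: write exactly `count` bytes unless an error occurs.
 * Returns `count` on success, or -1 on error. */
ssize_t write_all(int fd, const void *buf, size_t count)
{
	size_t total = 0;

	while (total < count) {
		ssize_t rc = write(fd, (const char *)buf + total, count - total);

		if (rc < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data: retry */
			return -1;
		}
		total += (size_t)rc;
	}

	return (ssize_t)total;
}
```

Both helpers advance the buffer pointer by the number of bytes already transferred, which is exactly the bookkeeping that the return value of read() and write() makes possible.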
You can practice this by implementing your own cat command.
In the following sections, we'll use file descriptors and read() and write() to interact with some inter-process-communication mechanisms, such as pipes.
lseek()
As you know, reading or writing from a file always continues from where it left off. Most of the time you would read from a file monotonically so it makes sense to keep the interface clean and handle bookkeeping in the back.
For cases when you selectively update the file or jump around fetching data, we have lseek().
off_t lseek(int fd, off_t offset, int whence);
Its parameters are pretty intuitive: fd stands for the file descriptor and offset stands for the offset.
The whence directive explains what offset is relative to, and has the following values:
- SEEK_SET: the file offset is set to offset bytes.
- SEEK_CUR: the file offset is set to its current location plus offset bytes.
- SEEK_END: the file offset is set to the size of the file plus offset bytes.
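To make the three whence values concrete, here is a small sketch with two hypothetical helpers (file_size() and read_tail() are illustrative names, not part of the lab code). read_tail() assumes the file holds at least n bytes and uses a single read() for brevity.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: return the size of the file behind `fd`, or -1 on
 * error. The current offset is restored before returning. */
off_t file_size(int fd)
{
	off_t cur = lseek(fd, 0, SEEK_CUR);	/* current offset + 0 */
	off_t end;

	if (cur < 0)
		return -1;
	end = lseek(fd, 0, SEEK_END);		/* file size + 0 */
	if (end < 0)
		return -1;
	if (lseek(fd, cur, SEEK_SET) < 0)	/* jump back to where we were */
		return -1;

	return end;
}

/* Hypothetical helper: read the last `n` bytes of `path` into `buf`.
 * Returns the number of bytes read, or -1 on error. */
ssize_t read_tail(const char *path, char *buf, size_t n)
{
	int fd = open(path, O_RDONLY);
	ssize_t rc;

	if (fd < 0)
		return -1;
	/* A negative offset relative to SEEK_END moves backwards from the end. */
	if (lseek(fd, -(off_t)n, SEEK_END) < 0) {
		close(fd);
		return -1;
	}
	rc = read(fd, buf, n);
	close(fd);

	return rc;
}
```

Note that lseek() only changes the offset stored in the open file structure; no data is transferred until the next read() or write().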
Pipes
Anonymous Pipes
In this session, we'll explore a new means of Inter-Process Communication (IPC), namely pipes. Pipes are by no means something new, and you have most probably played with them already in bash:
cat log_*.csv | tr -s ' ' | cut -d ',' -f 2 | sort -u | head -n 10
Using pipes (denoted as | in the above example) enables linking the stdout and stdin of multiple processes.
The stdout of cat is the stdin of tr, whose stdout is the stdin of cut and so on.
This "chain" of commands looks like this:
So here we have a unidirectional stream of data that starts from cat, is modified by each new command, and then is passed to the next one.
We can tell from the image above that the communication channel between any 2 adjacent commands allows one process to write to it while the other reads from it.
For example, there is no need for cat to read any of tr's output, only vice versa.
In UNIX, the need for such a channel is fulfilled by the pipe() syscall.
Imagine there's a literal pipe between any 2 adjacent commands in the image above, through which data flows in a single direction.
Such pipes are known as anonymous pipes because they don’t have identifiers. They are created by a parent process, which shares them with its children. Data written to an anonymous pipe is stored in a kernel-managed circular buffer, where it’s available for related processes to read.
The following example showcases a typical workflow with anonymous pipes in Unix:
#define EXIT_ON_COND(cond) do { if (cond) exit(EXIT_FAILURE); } while (0)
// pipe_fd[0] -> for reading
// pipe_fd[1] -> for writing
int pipe_fd[2];
EXIT_ON_COND(pipe(pipe_fd) < 0); // Create the pipe
pid_t pid = fork(); // Fork to create a child process
EXIT_ON_COND(pid < 0); // Check for fork() failure
if (pid == 0) { // Child process
    EXIT_ON_COND(close(pipe_fd[0]) != 0); // Close the read end
    EXIT_ON_COND(write(pipe_fd[1], "hi", 2) < 0); // Write "hi" to the pipe
    EXIT_ON_COND(close(pipe_fd[1]) != 0); // Close the write end
} else { // Parent process
    char buf[BUFSIZ];
    EXIT_ON_COND(close(pipe_fd[1]) != 0); // Close the write end
    ssize_t n = read(pipe_fd[0], buf, sizeof(buf) - 1); // Read from the pipe, leaving room for '\0'
    EXIT_ON_COND(n < 0); // Check for read() failure
    buf[n] = '\0'; // Null-terminate the string
    printf("Received: %s\n", buf); // Output the received message
}
In summary, the process creates the pipe and then calls fork() to create a child process.
By default, the file descriptors created by pipe() are shared with the child because the file descriptor table is copied when the child is created.
To better understand how this works, please refer to this guide on the File Descriptor Table (FDT).
You can test your understanding of anonymous pipes by completing the Anonymous Pipes Communication task.
Check your understanding by identifying the limitations of anonymous pipes
Named Pipes (FIFOs)
As we discussed, anonymous pipes are named so because they lack identifiers. Named pipes address this limitation by creating a special file on disk that serves as an identifier for the pipe.
You might think that interacting with a file would result in a performance loss compared to anonymous pipes, but this is not the case. The FIFO file acts merely as a handler within the filesystem, which is used to write data to a buffer inside the kernel. This buffer is responsible for holding the data that is passed between processes, not the filesystem itself.
Keep in mind that reading from and writing to a FIFO is not the same as interacting with a regular file: read() will block if the pipe is empty and will return EOF when the peer closes the pipe.
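A minimal sketch of this behaviour is shown below. The function fifo_roundtrip() is hypothetical and uses fork() only to keep the example in one file; in practice the two ends of a FIFO are usually opened by unrelated processes that agree on the path.

```c
#include <fcntl.h>
#include <signal.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: a child writes `len` bytes of `msg` into the FIFO at
 * `path`, and the parent reads them into `out` until EOF.
 * Returns the number of bytes received, or -1 on error. */
ssize_t fifo_roundtrip(const char *path, const char *msg, size_t len,
		       char *out, size_t out_size)
{
	size_t total = 0;
	ssize_t rc = 0;
	pid_t pid;
	int rfd;

	unlink(path);				/* remove a stale FIFO, if any */
	if (mkfifo(path, 0600) < 0)
		return -1;

	pid = fork();
	if (pid < 0) {
		unlink(path);
		return -1;
	}

	if (pid == 0) {				/* child: the writer */
		int wfd = open(path, O_WRONLY);	/* blocks until a reader opens */

		if (wfd < 0)
			_exit(1);
		rc = write(wfd, msg, len);
		close(wfd);			/* the reader will now see EOF */
		_exit(rc == (ssize_t)len ? 0 : 1);
	}

	rfd = open(path, O_RDONLY);		/* blocks until the writer opens */
	if (rfd < 0) {
		kill(pid, SIGKILL);		/* do not leave the child blocked */
		waitpid(pid, NULL, 0);
		unlink(path);
		return -1;
	}

	/* read() returns 0 (EOF) once the writer has closed its end. */
	while (total < out_size &&
	       (rc = read(rfd, out + total, out_size - total)) > 0)
		total += (size_t)rc;

	close(rfd);
	waitpid(pid, NULL, 0);
	unlink(path);

	return rc < 0 ? -1 : (ssize_t)total;
}
```

The data itself never touches the disk: the path only lets the two processes find the same kernel buffer.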
You can practice working with named pipes by completing the Named Pipes Communication task.
Redirections
Although not directly related, redirections (e.g., ls > file.txt) operate similarly to pipes.
A process creates a new file descriptor, updates its stdout, and then creates the child process.
You can explore the similarities with pipes further in this guide on redirections.
Guide: Simple File Operations
To manipulate the file (read its contents, modify them, change its size etc.), each process must first get a handler to this file. Think of this handler as an object by which the process can identify and refer to the file.
Now take a look at the code examples in file-descriptors/guides/simple-file-operations/support.
Each of them reads the contents of file.txt, modifies them, and then reads the previously modified file again.
Use make to compile the C code, and make java-file-operations to compile the Java code.
Now run the programs repeatedly in whatever order you wish:
student@os:~/.../simple-file-operations/support$ python3 file_operations.py
File contents are: OS Rullz!
Wrote new data to file
File contents are: Python was here!
student@os:~/.../simple-file-operations/support$ ./file_operations # from the C code
File contents are: Python was here!
Wrote new data to file
File contents are: C was here!
student@os:~/.../simple-file-operations/support$ java FileOperations
File contents are: Python was here!
Wrote new data to file
File contents are: Java was here!
Note that each piece of code creates a variable, which is then used as a handler.
Guide: Redirections
In the File Descriptors section, we mentioned redirections such as echo "OS Rullz!" > newfile.txt.
We said the target file (file.txt in the example below) has to be opened at some point.
Let’s explore the relevant system calls (open(), openat()) to see this in action:
student@os:~/.../guides/redirections$ strace -e trace=open,openat,execve,dup2 -f sh -c "ls > file.txt"
execve("/usr/bin/sh", ["sh", "-c", "ls > file.txt"], 0x7fffe1383e78 /* 36 vars */) = 0
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "file.txt", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
dup2(3, 1) = 1
strace: Process 77547 attached
[pid 77547] execve("/usr/bin/ls", ["ls"], 0x55ebb9b2dbf8 /* 36 vars */) = 0
[...]
Notice that we used sh -c to run ls > file.txt.
Running strace -e trace=open,openat,execve,dup2 -f ls > file.txt would instead make our own shell redirect the stdout of strace (and of the ls it launches) before tracing even starts, hiding any system calls related to file.txt.
This happens because, as we discussed earlier, redirection is transparent for the process being redirected.
The process still writes to its stdout, but stdout itself is now directed to the specified file.
Remember how processes are created using fork() and exec(), as shown in this diagram:
In our case, the main process is sh -c "ls > file.txt".
In the strace output, we see it opens file.txt on file descriptor 3, then uses dup2(3, 1) to redirect file descriptor 1 to the same open file structure.
It then forks a child process and calls execve().
execve() replaces the virtual address space (VAS) of the current process but retains the file descriptor table.
This preserves the stdout of the parent process, thus the redirection to file.txt remains effective in the new process as well.
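The same mechanism can be sketched in C. The function run_redirected() is a hypothetical name, and unlike the trace above, this sketch performs open() and dup2() in the child after fork(), so that the caller's own stdout stays untouched.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with its stdout redirected to `out_path`.
 * Returns the exit status of the command, or -1 on error. */
int run_redirected(const char *out_path, char *const argv[])
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {				/* child */
		int fd = open(out_path, O_WRONLY | O_CREAT | O_TRUNC, 0644);

		if (fd < 0)
			_exit(127);
		if (fd != STDOUT_FILENO) {
			/* Make fd 1 point to the file's open file structure. */
			if (dup2(fd, STDOUT_FILENO) < 0)
				_exit(127);
			close(fd);		/* fd 1 keeps the file open */
		}
		execvp(argv[0], argv);		/* keeps the file descriptor table */
		_exit(127);			/* reached only if exec failed */
	}

	if (waitpid(pid, &status, 0) < 0)
		return -1;

	return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The executed program is unaware of the redirection: it writes to file descriptor 1 as usual, and the kernel delivers the data to the file.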
dup()/dup2() - Atomic IO
If you're not familiar with the dup() syscall, it essentially creates a new file descriptor pointing to an existing open file structure.
Unlike open(), as discussed in the file descriptor table guide, dup() doesn’t create a fresh open file structure.
The dup2(old_fd, new_fd) variant closes new_fd before making it point to the same open file structure as old_fd.
While this might seem like a combination of close(new_fd) followed by dup(old_fd), dup2() is actually atomic, which prevents race conditions.
To see why atomicity matters, review the code in support/redirect_parallel.c, compile it, and run it.
You’ll find that redirect_stderr_file.txt contains Message for STDOUT, and redirect_stdout_file.txt contains Message for STDERR.
Investigate the code to understand where the race condition occurred.
While a mutex around the close() and open() sequence could fix this, it can make the code cumbersome.
Instead, follow the FIXME comments for a more elegant solution using dup2().
Guide: File Descriptor Table
Just as each process has its own Virtual Address Space for memory access, it also maintains its own File Descriptor Table (FDT) for managing open files.
In this section we will explore how the process structures change when executing syscalls like open(), read(), write(), and close().
Upon startup, every process has three file descriptors that correspond to standard input (stdin), standard output (stdout), and standard error (stderr).
These descriptors are inherited from the parent process and occupy the first three entries in the process's FDT.
fd | Open File Struct |
---|---|
0 | stdin |
1 | stdout |
2 | stderr |
Each entry points to an open file structure which stores data about the current session:
- Permissions: define how the file can be accessed (read, write, or execute); these are the options passed to open().
- Offset: the current position inside the file from which the next read or write operation will occur.
- Reference Count: The number of file descriptors referencing this open file structure.
- Inode Pointer: A pointer to the inode structure that contains both the data and metadata associated with the file.
These Open File Structures are held in the Open File Table (OFT), which is global across the system. Whenever a new file is opened, a new entry is created in this table.
To illustrate this, let's consider a code snippet and examine how the File Descriptor Table and Open File Table would appear.
We will focus on fd, permissions, offset, and reference count, as the inode pointer is not relevant at this moment.
For simplicity, we'll also omit the standard file descriptors, as they remain unchanged.
int fd = open("file.txt", O_RDONLY);
int fd2 = open("file2.txt", O_WRONLY | O_APPEND);
int fd3 = open("file.txt", O_RDWR);
OFT index | Path | Perm | Offset | RefCount |
---|---|---|---|---|
... | ... | ... | ... | ... |
123 | file.txt | r-- | 0 | 1 |
140 | file2.txt | -w- | 150 | 1 |
142 | file.txt | rw- | 0 | 1 |
fd | Open File Struct (OFT index) |
---|---|
3 | 123 |
4 | 140 |
5 | 142 |
Let's discuss the changes from the OFT and FDT to understand what happened:
- open("file.txt", O_RDONLY) created a new open file structure in the Open File Table for file.txt. The entry has read-only (O_RDONLY) permissions and offset 0, representing the start of the file. Subsequently, file descriptor 3 was assigned to point to this OFT entry, and the reference counter was set to 1.
- open("file2.txt", O_WRONLY | O_APPEND) created a similar structure in the OFT for file2.txt, but with write-only (O_WRONLY) permissions and an offset of 150, representing the end of the file (O_APPEND). It then assigned this entry to file descriptor 4.
- open("file.txt", O_RDWR) created a new open file structure for file.txt and assigned it to file descriptor 5.
At this point, one might wonder why the last open() call didn't reuse the entry at file descriptor 3 and increase its reference counter instead.
It might seem logical, but doing so would lead to conflicts with the permissions and offset of the two open file structures.
Remember: each open() call creates a new open file structure in the Open File Table.
This raises the question of why a reference counter is needed at all.
The short answer is the dup() (or dup2()) syscall, which duplicates a file descriptor.
Let's continue our previous example with the following snippet:
// fd = 3 (from the previous snippet)
int fd4 = dup(fd);
OFT index | Path | Perm | Offset | RefCount |
---|---|---|---|---|
... | ... | ... | ... | ... |
123 | file.txt | r-- | 0 | 2 |
140 | file2.txt | -w- | 150 | 1 |
142 | file.txt | rw- | 0 | 1 |
fd | Open File Struct (OFT index) |
---|---|
3 | 123 |
4 | 140 |
5 | 142 |
6 | 123 |
The call to dup(fd) created a new file descriptor (6) that points to the same open file structure as its argument fd (which equals 3 in our example).
This operation also incremented the reference counter for the entry 123 in the Open File Table.
As a result, operations performed on file descriptor 3 and file descriptor 6 are equivalent.
For instance, read(3) and read(6) will both increment the shared file offset, while the offset of file descriptor 5 will remain unchanged.
If you want to see a concrete example of when duplicating file descriptors is useful, check out file-descriptors/guides/fd-table/support/redirect-stdout.c.
Now that you know how to create entries in the File Descriptor Table and the Open File Table, it’s important to understand how to remove them.
Calling close() will always free a file descriptor, but it will only decrement the reference counter of the open file structure.
The actual closing of the file occurs when the reference counter reaches 0.
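The difference between open() and dup() can be observed from user space by querying offsets with lseek(). The sketch below is a hypothetical demo (offsets_after_read() is not part of the guide's support code) that assumes the file at path holds at least 4 bytes.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: read 4 bytes through one descriptor, then report the
 * offsets seen through (0) that descriptor, (1) its dup() and (2) a second,
 * independent open() of the same path. Returns 0 on success, -1 on error. */
int offsets_after_read(const char *path, off_t offsets[3])
{
	char buf[4];
	int rc = -1;
	int fd_a = open(path, O_RDONLY);	/* new open file structure */
	int fd_b = open(path, O_RDONLY);	/* another open file structure */
	int fd_c = dup(fd_a);			/* shares fd_a's structure */

	if (fd_a >= 0 && fd_b >= 0 && fd_c >= 0 &&
	    read(fd_a, buf, sizeof(buf)) >= 0) {
		offsets[0] = lseek(fd_a, 0, SEEK_CUR);
		offsets[1] = lseek(fd_c, 0, SEEK_CUR);
		offsets[2] = lseek(fd_b, 0, SEEK_CUR);
		rc = 0;
	}

	/* Each close() frees one descriptor; the structure shared by fd_a and
	 * fd_c is released only after both of them are closed. */
	if (fd_a >= 0)
		close(fd_a);
	if (fd_b >= 0)
		close(fd_b);
	if (fd_c >= 0)
		close(fd_c);

	return rc;
}
```

After the call, the first two offsets should be equal (the read advanced the shared offset), while the third should still be 0.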
Guide: libc FILE struct
Now, we will take a short look at how the file descriptors are handled in libc. The Software Stack chapter has taught us that applications generally interact with libraries which expose wrappers on top of syscalls. The most important library in a POSIX system (such as Linux) is libc. Among many others, it provides higher-level abstractions over I/O-related syscalls.
Musl (read just like "muscle") is a lightweight implementation of libc, which exposes the same API that you have used so far, while also being fit for embedded and OS development. For example, Unikraft unikernels may use musl.
First, it provides a struct that groups together the data needed when handling files.
We know from the example in support/simple-file-operations/file_operations.c that the file handler employed by libc is FILE *.
FILE is just a typedef for struct _IO_FILE.
Here are the most important fields in struct _IO_FILE:
struct _IO_FILE {
int fd; /* File descriptor */
unsigned flags; /* Flags with which `open()` was called */
int mode; /* File permissions; passed to `open()` */
off_t off; /* File offset from where to read / write */
/**
* Internal buffer used to make fewer costly `read()`/`write()`
* syscalls.
*/
unsigned char *buf;
size_t buf_size;
/* Pointers for reading and writing from/to the buffer defined above. */
unsigned char *rpos, *rend;
unsigned char *wend, *wpos;
/* Function pointers to syscall wrappers. */
size_t (*read)(FILE *, unsigned char *, size_t);
size_t (*write)(FILE *, const unsigned char *, size_t);
off_t (*seek)(FILE *, off_t, int);
int (*close)(FILE *);
/* Lock for concurrent file access. */
volatile int lock;
};
As you might have imagined, this structure contains the underlying file descriptor, the mode (read, write, truncate etc.) with which the file was opened, as well as the offset within the file from which the next read / write will start.
Libc also defines its own wrappers over commonly-used syscalls, such as read(), write(), close() and lseek().
These syscalls themselves need to be implemented by the driver for each file system.
This is done by writing the required functions for each syscall and then populating this structure with pointers to them.
You will recognise quite a few syscalls: open(), close(), read(), write(), mmap() etc.
printf() Buffering
- Navigate to buffering/support/printf_buffering.c. Those printf() calls obviously end up calling write() at some point. Run the code under strace.
Since there is only one write() syscall despite multiple calls to printf(), it means that the strings given to printf() as arguments are kept somewhere until the syscall is made.
That somewhere is precisely that buffer inside struct _IO_FILE that we highlighted above.
Remember that syscalls cause the system to change from user mode to kernel mode, which is time-consuming.
Instead of performing one write() syscall per call to printf(), it is more efficient to copy the string passed to printf() to an internal buffer inside libc (the unsigned char *buf from above) and then at a given time (like when the buffer is full for example) write() the whole buffer.
This results in far fewer write() syscalls.
- Now, it is interesting to see how we can force libc to dump that internal buffer. The most direct way is by using the fflush() library call, which is made for this exact purpose. But we can be more subtle. Add a \n in some of the strings printed in buffering/support/printf_buffering.c. Place them wherever you want (at the beginning, at the end, in the middle). Recompile the code and observe its change in behaviour under strace.
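The effect of fflush() can also be checked without strace, by asking the kernel for the file size before and after the flush. This is a minimal sketch: size_around_fflush() is a hypothetical helper, it requests full buffering explicitly through setvbuf(), and it assumes the text is shorter than BUFSIZ so that it fits in the buffer.

```c
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical demo: write `text` to `path` through stdio and record the
 * file size reported by the kernel before and after fflush().
 * Returns 0 on success, -1 on error. */
int size_around_fflush(const char *path, const char *text,
		       off_t *before, off_t *after)
{
	struct stat st;
	int rc = -1;
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;

	/* Full buffering: data leaves libc only when the buffer fills up,
	 * on fflush(), or on fclose(). A '\n' does not trigger a write here. */
	if (setvbuf(f, NULL, _IOFBF, BUFSIZ) != 0)
		goto out;
	if (fputs(text, f) == EOF)
		goto out;

	if (fstat(fileno(f), &st) < 0)
		goto out;
	*before = st.st_size;		/* text is still inside libc's buffer */

	if (fflush(f) != 0)		/* forces the write() syscall */
		goto out;

	if (fstat(fileno(f), &st) < 0)
		goto out;
	*after = st.st_size;		/* now the kernel has seen the data */
	rc = 0;

out:
	fclose(f);
	return rc;
}
```

With a short string, before should be 0 and after should equal the length of the string. A stream connected to a terminal is line-buffered instead, which is why adding \n changes what strace shows for printf().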
Now we know that I/O buffering does happen within libc.
If you need further convincing, check out the Musl implementation of fread(), for example.
It first copies the data previously saved in the internal buffer:
if (f->rpos != f->rend) {
/* First exhaust the buffer. */
k = MIN(f->rend - f->rpos, l);
memcpy(dest, f->rpos, k);
f->rpos += k;
dest += k;
l -= k;
}
Then, if more data is requested and the internal buffer isn't full, it refills it using the internal read() wrapper.
This wrapper also places the data inside the destination buffer.
Guide: File Mappings
Mapping a file to the VAS of a process is similar to how shared libraries are loaded into the same VAS.
It's a fancier way of saying that the contents of a file are copied from a given offset within that file to a given address.
What's nice about this is that the OS handles all offsets, addresses and memory allocations on its own, with a single highly versatile syscall: mmap().
Let's run a sleep process and inspect its memory zones:
student@os:~$ sleep 1000 & # start a `sleep` process in the background
[1] 17579
student@os:~$ cat /proc/$(pidof sleep)/maps
55b7b646f000-55b7b6471000 r--p 00000000 103:07 6423964 /usr/bin/sleep
55b7b6471000-55b7b6475000 r-xp 00002000 103:07 6423964 /usr/bin/sleep
55b7b6475000-55b7b6477000 r--p 00006000 103:07 6423964 /usr/bin/sleep
55b7b6478000-55b7b6479000 r--p 00008000 103:07 6423964 /usr/bin/sleep
55b7b6479000-55b7b647a000 rw-p 00009000 103:07 6423964 /usr/bin/sleep
55b7b677c000-55b7b679d000 rw-p 00000000 00:00 0 [heap]
7fe442f61000-7fe44379d000 r--p 00000000 103:07 6423902 /usr/lib/locale/locale-archive
7fe44379d000-7fe4437bf000 r--p 00000000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe4437bf000-7fe443937000 r-xp 00022000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443937000-7fe443985000 r--p 0019a000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443985000-7fe443989000 r--p 001e7000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe443989000-7fe44398b000 rw-p 001eb000 103:07 6432810 /usr/lib/x86_64-linux-gnu/libc-2.31.so
7fe44398b000-7fe443991000 rw-p 00000000 00:00 0
7fe4439ad000-7fe4439ae000 r--p 00000000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439ae000-7fe4439d1000 r-xp 00001000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439d1000-7fe4439d9000 r--p 00024000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439da000-7fe4439db000 r--p 0002c000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439db000-7fe4439dc000 rw-p 0002d000 103:07 6429709 /usr/lib/x86_64-linux-gnu/ld-2.31.so
7fe4439dc000-7fe4439dd000 rw-p 00000000 00:00 0
7ffd07aeb000-7ffd07b0c000 rw-p 00000000 00:00 0 [stack]
7ffd07b8b000-7ffd07b8e000 r--p 00000000 00:00 0 [vvar]
7ffd07b8e000-7ffd07b8f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
In the output above, you can see that the .text, .rodata, and .data sections for each dynamic library are mapped into the process’s VAS, along with the sections of the main executable.
To understand how these mappings are created, let’s explore a simpler example.
Below is an illustration of how libc is loaded (or mapped) into the VAS of an ls process.
student@os:~$ strace ls
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
[...]
mmap(NULL, 2037344, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fb313c9c000
mmap(0x7fb313cbe000, 1540096, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x22000) = 0x7fb313cbe000
mmap(0x7fb313e36000, 319488, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x19a000) = 0x7fb313e36000
mmap(0x7fb313e84000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7fb313e84000
For a quick recap on mmap(addr, length, prot, flags, fd, offset), the fifth argument specifies the file descriptor to copy data from, while the sixth is the offset within the file from where to start copying.
In summary, when an executable runs, the loader uses mmap() to reserve memory zones for its shared libraries.
Performance is not affected by this since pages are populated on-demand, when they’re accessed for the first time.
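The same call works for any regular file. As a minimal sketch, the hypothetical helper count_byte_mmap() below maps a whole file read-only and scans it like an ordinary array; it is only an illustration of the mmap() arguments, not the loader's actual code.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: count how many times byte `c` appears in `path`,
 * reading the file through a private, read-only mapping.
 * Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, char c)
{
	struct stat st;
	long count = 0;
	char *data;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	if (st.st_size == 0) {		/* mmap() rejects a length of 0 */
		close(fd);
		return 0;
	}

	/* addr = NULL lets the kernel pick the address; offset 0 maps the
	 * file from its beginning. */
	data = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);			/* the mapping outlives the descriptor */
	if (data == MAP_FAILED)
		return -1;

	/* Pages are brought in on demand as the loop touches them. */
	for (off_t i = 0; i < st.st_size; i++)
		if (data[i] == c)
			count++;

	munmap(data, (size_t)st.st_size);
	return count;
}
```

Note that there is no read() call anywhere: accessing data[i] is enough for the kernel to fetch the corresponding page.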
File I/O vs mmap()
When it comes to dynamic libraries, mmap() is unmatched.
With a single call, it handles address mapping, permission setting, and leverages demand paging to populate pages only when accessed.
Additionally, mmap() fully supports copy-on-write (COW), allowing libraries to share the same physical frames across multiple processes, which conserves memory and reduces load time.
In contrast, using read() and write() would require loading the entire library into physical memory for each process individually, missing out on both the copy-on-write and demand paging benefits.
For regular files, however, the choice isn’t always straightforward.
The main sources of overhead for mmap() include managing virtual memory mappings - which can lead to TLB flushes - and the cost of page faults due to demand paging.
On the plus side, mmap() excels with random access patterns, efficiently reusing mapped pages.
It is also great for operating on large amounts of data, as it enables the kernel to automatically unload and reload pages as needed under memory pressure.
A concrete scenario where these downsides outweigh the benefits of mmap() is one-time, sequential I/O.
If you’re simply planning to read or write a file in one go, read() and write() are the way to go.
Guide: Reading Linux Directories
Everything in Linux is a file. This statement says that the Linux OS treats every entry in a file system (regular file, directory, block device, char device, link, UNIX socket) as a file. This unified approach simplifies file handling, allowing a single interface to interact with various types of entries. Let's see how this works in practice:
- Navigate to guides/reading-linux-dirs/support/ and check out dir_ops.c. This code creates a directory dir, if it does not exist, and attempts to open it the same way we would open a regular file. Compile and run the code.
student@os:~/.../reading-linux-dirs/support$ ./dir_ops
12:45:34 FATAL dir_ops.c:17: fopen: Is a directory
  The error message is crystal clear: we cannot use fopen() on directories. So the FILE structure is unsuited for directories. Therefore, this handler is not generic enough for a regular Linux filesystem, and we have to use a lower-level function.
- Now that we know that fopen() relies on openat(), let's try using open(), which wraps openat() but offers a simpler interface. Inspect, compile and run the code dir_ops_syscalls.c.
student@os:~/...reading-linux-dirs/support$ ./dir_ops_syscalls
Directory file descriptor is: 3
  This output proves that the open() syscall is capable of also handling directories, so it's closer to what we want.
  Note: it is rather uncommon to use open() for directories. Most of the time, opendir() is used instead.
In conclusion, the key difference between fopen() and open() is in the type of handler they return.
The FILE structure from fopen() is suited only for regular files, while the file descriptor returned by open() is more flexible.
The differences between the two handlers are explored in the file descriptors section.
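Since opendir() is the usual way to work with directories, here is a minimal sketch of how it is typically paired with readdir() and closedir(). The helper count_dir_entries() is a hypothetical name chosen for illustration and does not appear in the guide's support code.

```c
#include <dirent.h>
#include <string.h>

/* Hypothetical helper: count the entries of the directory at `path`,
 * skipping the "." and ".." entries. Returns the count, or -1 on error. */
int count_dir_entries(const char *path)
{
	struct dirent *entry;
	int count = 0;
	DIR *dir = opendir(path);	/* fails if path is not a directory */

	if (!dir)
		return -1;

	/* readdir() returns one entry per call and NULL when there are
	 * no more entries. */
	while ((entry = readdir(dir)) != NULL) {
		if (strcmp(entry->d_name, ".") == 0 ||
		    strcmp(entry->d_name, "..") == 0)
			continue;
		count++;
	}

	closedir(dir);
	return count;
}
```

Much like FILE wraps a file descriptor for regular files, the DIR handler returned by opendir() wraps a descriptor for the directory.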