Lab 7 - Copy-on-Write

Task: Investigate apache2 Using strace

Enter the chapters/compute/processes-threads-apache2/drills/tasks/apache2/support/ folder and go through the practice items below.

  1. Use strace to discover the server document root. The document root is the path in the filesystem from where httpd serves all the files requested by the clients.

First, you will have to stop the running container using make stop, then restart it with make run-privileged.

  1. Use strace inside the container to attach to the worker processes (use the -p option for this). You will also have to use the -f flag with strace, so that it will follow all the threads inside the processes.

  2. After you have attached successfully to all worker processes, use the curl command to send a request.

  3. Then check the strace output to see what files were opened by the server.

Quiz

If you're having difficulties solving this exercise, go through this reading material.

Task: Minor and Major Page Faults

The code in chapters/compute/copy-on-write/drills/tasks/page-faults/support/page_faults.c generates some minor and major page faults. Open 2 terminals: one in which you will run the program, and one which will monitor the page faults of the program. In the monitoring terminal, run the following command:

watch -n 1 'ps -eo min_flt,maj_flt,cmd | grep ./page_faults | head -n 1'

Compile the program and run it in the other terminal. You have to press Enter once before the program starts prompting you to press Enter again. Watch the first number in the monitoring terminal increase. Those are the minor page faults.

Minor Page Faults

A minor page fault is generated whenever a requested page is present in the physical memory, as a frame, but that frame isn't allocated to the process generating the request. These types of page faults are the most common, and they happen when calling functions from dynamic libraries, allocating heap memory, loading programs, reading files that have been cached, and many more situations. Now back to the program.

The monitoring command already starts with some minor page faults, generated when loading the program.

After pressing enter, the number increases, because a function from a dynamic library (libc) is fetched when the first printf() is executed. Subsequent calls to functions that are in the same memory page as printf() won't generate other page faults.

After allocating the 100 bytes, you might not see the number of page faults increase. This is because the "bookkeeping" data allocated by malloc() was able to fit in an already mapped page. The second allocation, the 1 GB one, will always generate one minor page fault - for the bookkeeping data about the allocated memory zone. Notice that not all the pages for the 1 GB are allocated. They are allocated - and generate page faults - when modified. Remember from the Data chapter that this lazy allocation mechanism is called demand paging.
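The lazy allocation described above can also be observed from inside a program. The following is a minimal sketch, not the code of page_faults.c: it assumes Linux and reads the process's own minor fault counter with getrusage(). The helper names minor_faults() and demand_paging_faults() are made up for this illustration.

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Minor page fault count of the calling process, or -1 on error. */
static long minor_faults(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) < 0)
        return -1;
    return usage.ru_minflt;
}

/*
 * Maps `pages` anonymous pages, then writes one byte into each of them.
 * Stores the faults counted across the mmap() call in *at_map and the
 * faults counted across the writes in *at_touch.
 * Returns 0 on success, -1 if the mapping fails.
 */
int demand_paging_faults(size_t pages, long *at_map, long *at_touch)
{
    size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = pages * page_size;

    long start = minor_faults();
    char *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    long mapped = minor_faults();

    /* The first write to each page is what makes the kernel back it with a frame. */
    for (size_t i = 0; i < pages; i++)
        mem[i * page_size] = 1;
    long touched = minor_faults();

    *at_map = mapped - start;
    *at_touch = touched - mapped;
    munmap(mem, len);
    return 0;
}
```

Calling demand_paging_faults(1024, &at_map, &at_touch) from main() and printing both counters should typically show few or no faults for the mmap() call itself and many more for the writes. The exact numbers depend on the kernel and its configuration (for example, transparent huge pages).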

Continue pressing Enter and observing the effects until you reach the point where file.txt is opened.

Note that neither opening a file, getting information about it, nor mapping it in memory using mmap() generates page faults. Also note the posix_fadvise() call after the one to fstat(). With this call we ask the OS not to keep the file cached, so that we can generate a major page fault.

Major Page Faults

Major page faults happen when a page is requested, but it isn't present in the physical memory. These types of page faults happen in 2 situations:

  • a page that was swapped out (to the disk), due to lack of memory, is now accessed - this case is harder to show
  • the OS needs to read a file from the disk, because the file contents aren't present in the cache - the case we are showing now

Press Enter to print the file contents. Notice the second number going up in the monitoring terminal.

Comment out the posix_fadvise() call, recompile the program, and run it again. You won't get any major page faults, because the file contents are cached by the OS, precisely to avoid those page faults. As a rule, the OS will avoid major page faults whenever possible, because they are very costly in terms of running time.
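The sequence described in this task (drop the cached pages, map the file, read it) can be condensed into one function. This is only a sketch under assumptions: it uses the POSIX_FADV_DONTNEED advice, which may or may not be the one used in page_faults.c, and the names major_faults() and major_faults_reading() are hypothetical. Since posix_fadvise() is only advice, the kernel is free to keep the pages cached, in which case the function reports 0.

```c
#define _DEFAULT_SOURCE /* exposes posix_fadvise() under strict -std= modes */
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <unistd.h>

/* Major page fault count of the calling process, or -1 on error. */
static long major_faults(void)
{
    struct rusage usage;

    if (getrusage(RUSAGE_SELF, &usage) < 0)
        return -1;
    return usage.ru_majflt;
}

/*
 * Reads one byte from every page of the file at `path` through a mapping
 * and returns how many major page faults that caused, or -1 on error.
 */
long major_faults_reading(const char *path)
{
    struct stat st;
    long page_size = sysconf(_SC_PAGESIZE);
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }

    /* Advice only: ask the kernel to drop the cached pages of this file. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    const volatile char *data = mmap(NULL, st.st_size, PROT_READ,
                                     MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after close() */
    if (data == MAP_FAILED)
        return -1;

    long before = major_faults();
    for (off_t off = 0; off < st.st_size; off += page_size)
        (void)data[off]; /* volatile read, so the access is not optimized away */
    long after = major_faults();

    munmap((void *)data, st.st_size);
    return after - before;
}
```

A call such as major_faults_reading("file.txt") returns a count that can be compared with the second column of the ps output. A result of 0 means the pages were served from the cache.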

If you're having difficulties solving this exercise, go through this reading material.

Task: Shared Memory

Navigate to the chapters/compute/copy-on-write/drills/tasks/shared-memory/ directory, run make skels to generate the support/ folder, enter the support/src/ folder, open shared_memory.c and go through the practice items below.

Use the support/tests/checker.sh script to check your solution.

./checker.sh
mmap ............................ passed ... 25
sem_wait ........................ passed ... 25
sem_post ........................ passed ... 25
match value ..................... passed ... 25
Total: 100 / 100

As you remember from the Data chapter, one way to allocate a given number of pages is to use the mmap() syscall.

Let's look at its man page, specifically at the flags argument. Its main purpose is to determine the way in which child processes interact with the mapped pages.

Quiz

Now let's test this flag (MAP_PRIVATE), as well as its opposite, MAP_SHARED. Compile and run the code in shared-memory/support/src/shared_memory.c.

  1. Notice that the value read by the parent is different from the one written by the child. Modify the flags parameter of mmap() so that they are the same.

  2. Create a semaphore in the shared page and use it to make the parent signal the child before it can exit. Use the API defined in semaphore.h.

    Be careful! The value written and read previously by the child and the parent, respectively, must not change.

    One way of creating a shared semaphore is to place it within a shared memory area, as we've just done. This only works between "related" processes. If you want to share a semaphore or other types of memory between any two processes, you need filesystem support. For this, you should use named semaphores, created using sem_open() (https://man7.org/linux/man-pages/man3/sem_open.3.html). You'll get more accustomed to such functions in the Application Interaction chapter.

Task: Mini-shell

As you might remember, to create a new process you need to use the fork (or clone) and exec system calls. If you don't remember, take a look at what happens under the hood when you use system.

Enter the chapters/compute/processes/drills/tasks/mini-shell directory, run make skels, open the support/src folder and go through the practice items below.

Use the tests/checker.sh script to check your solutions.

./checker.sh
mini_shell: ls ................ passed ... 50
mini_shell: pwd ................ passed ... 25
mini_shell: echo hello ................ passed ... 25
100 / 100

  1. With this knowledge in mind, let's implement our own mini-shell.

    Start from the skeleton code in mini_shell.c. We're already running our Bash interpreter from the command-line, so there's no need to exec another Bash from it.

    Simply exec the command.

    Quiz

    So we need a way to "save" the mini_shell process before exec()-ing our command. Find a way to do this.

    Hint: You can see what sleepy does and draw inspiration from there. Use strace to also list the calls to clone() performed by sleepy or its children. Remember what clone() is used for and use its parameters to deduce which of the two scenarios happens to sleepy.

Usage of Processes and Threads in apache2

We'll take a look at how a real-world application - the apache2 HTTP server - makes use of processes and threads. Since the server must be able to handle multiple clients at the same time, it has to use some form of concurrency. When a new client arrives, the server offloads the work of interacting with that client to another process or thread.

The choice of whether to use multiple processes or threads is not baked into the code. Instead, apache2 provides several modules called MPMs (Multi-Processing Modules). Each module implements a different concurrency model, and users can pick whichever module best fits their needs by editing the server configuration files.

The most common MPMs are:

  • prefork: there are multiple worker processes, each process is single-threaded and handles one client request at a time
  • worker: there are multiple worker processes, each process is multi-threaded, and each thread handles one client request at a time
  • event: same as worker but designed to better handle some particular use cases

In principle, prefork provides more stability and backwards compatibility, but it has a bigger overhead. On the other hand, worker and event are more scalable, and thus able to handle more simultaneous connections, due to the usage of threads. On modern systems, event is almost always the default.

Conclusion

So far, you've probably seen that spawning a process can "use" a different program (hence the path in the args of system or Popen), but some languages such as Python allow you to spawn a process that executes a function from the same script. A thread, however, can only start from a certain entry point within the current address space, as it is bound to the same process. Concretely, a process is but a group of threads. For this reason, when we talk about scheduling or synchronization, we talk about threads. A thread is, thus, an abstraction of a task running on a CPU core. A process is a logical group of such tasks.

We can sum up what we've learned so far by saying that processes are better used for separate, independent work, such as the different connections handled by a server. Conversely, threads are better suited for replicated work: when the same task has to be performed on multiple cores. Replicated work can also be suited for processes, though. Distributed applications, for instance, leverage different processes because this allows them to run on multiple physical machines at once, which is required by the very large workloads such applications commonly have to process.

These rules are not set in stone, though. Like we saw in the apache2 example, the server uses multiple threads as well as multiple processes. This provides a degree of stability - if one worker thread crashes, it will only crash the other threads belonging to the same process - while still taking advantage of the light resource usage inherent to threads.

These kinds of trade-offs are a normal part of the development of real-world applications.
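The stability argument for processes can be illustrated with a small sketch. It is not taken from apache2: the function name signal_that_killed_worker() is hypothetical and the crash is simulated with raise(SIGSEGV). The point is only that a crash in a forked worker is reported to the parent, which keeps running.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a worker process that crashes immediately.
 * Returns the number of the signal that terminated the worker,
 * 0 if the worker exited normally, or -1 if fork() or waitpid() failed.
 */
int signal_that_killed_worker(void)
{
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Worker: make sure the default action applies, then "crash". */
        signal(SIGSEGV, SIG_DFL);
        raise(SIGSEGV);
        _exit(0); /* only reached if the signal was not delivered */
    }

    /* Parent: still alive, and able to find out how the worker ended. */
    int status;

    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Called from main(), the function is expected to return SIGSEGV while the calling process carries on. Had the same crash happened in a thread, the whole process would have been terminated together with all its other threads.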

Copy-on-Write

So far, you know that the parent and child process have separate virtual address spaces (VAS). But how are they created, and how exactly are they "separated"? And what about the physical address space (PAS)? Of course, we would like the stack of the parent, for example, to be physically distinct from that of the child, so they can execute different functions and use different local variables.

But should all memory sections from the PAS of the parent be distinct from those of the child? What about some read-only memory sections, such as .text and .rodata? And what about the heap, where the child may use some data previously written by the parent and then overwrite it with its own data?

The answer to all of these questions is a core mechanism of multiprocess operating systems called Copy-on-Write. It works according to one very simple principle:

The VAS of the child process initially points to the same PAS as that of the parent. A (physical) frame is only duplicated when one of the processes attempts to write data to it.

This ensures that read-only sections remain shared, while writable sections are shared as long as their contents remain unchanged. When changes happen, the process making the change receives a unique frame as a modified copy of the original frame on demand.

In the image below, we have the state of the child and parent processes right after fork() returns in both of them. See how each has its own VAS, both of them being mapped to (mostly) the same PAS.

[Image: Copy-on-Write]

When one process writes data to a writeable page (in our case, the child writes to a heap page), the frame to which it corresponds is first duplicated. Then the process' page table points the page to the newly copied frame, as you can see in the image below.

[Image: Copy-on-Write]

Be careful! Do not confuse copy-on-write with demand paging. Remember from the Data chapter that demand paging means that when you allocate memory, the OS allocates virtual memory that remains unmapped to physical memory until it's used. On the other hand, copy-on-write posits that the virtual memory is already mapped to some frames. These frames are only duplicated when one of the processes attempts to write data to them.
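The effect of copy-on-write can be observed with a few lines of C. This is a minimal sketch and not the code of fork_faults.c; the function name parent_view_after_child_write() is made up for this illustration. The copy itself is done by the kernel and is invisible to the program, so the sketch only shows its consequence: the child's write does not reach the parent.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Forks a child that overwrites a heap value inherited from the parent.
 * Returns the value the parent still sees afterwards, or -1 on error.
 */
int parent_view_after_child_write(void)
{
    int *value = malloc(sizeof(*value));

    if (value == NULL)
        return -1;
    *value = 1;

    pid_t pid = fork();

    if (pid < 0) {
        free(value);
        return -1;
    }

    if (pid == 0) {
        /*
         * Right after fork() this page is backed by the same frame in both
         * processes. The write below faults, and the kernel gives the child
         * its own copy of the frame before letting the write proceed.
         */
        *value = 2;
        _exit(*value == 2 ? 0 : 1);
    }

    int status;

    if (waitpid(pid, &status, 0) < 0 ||
        !WIFEXITED(status) || WEXITSTATUS(status) != 0) {
        free(value);
        return -1;
    }

    /* The parent's frame was never touched by the child. */
    int seen = *value;

    free(value);
    return seen;
}
```

Called from main(), the function is expected to return 1: the child changed only its own copy of the heap page.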

Guide: apache2 Live Action

Let's run an actual instance of apache2 and see how everything works. Go to apache2/support and run make run. This will start a container with apache2 running inside.

Check that the server runs as expected:

student@os:~$ curl localhost:8080
<html><body><h1>It works!</h1></body></html>

Now go inside the container and take a look at running processes:

student@os:~/.../apache2/support$ docker exec -it apache2-test bash

root@56b9a761d598:/usr/local/apache2# ps -ef
UID PID PPID C STIME TTY TIME CMD
root 1 0 0 20:38 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 0 20:38 pts/0 00:00:00 httpd -DFOREGROUND
www-data 10 1 0 20:38 pts/0 00:00:00 httpd -DFOREGROUND
root 25 0 0 20:40 pts/1 00:00:00 bash
root 31 25 0 20:40 pts/1 00:00:00 ps -ef

We see 3 httpd processes. The first one, running as root, is the main process, while the other 2 are the workers.

Let's confirm that we are using the event MPM:

root@56b9a761d598:/usr/local/apache2# grep mod_mpm conf/httpd.conf
LoadModule mpm_event_module modules/mod_mpm_event.so
LoadModule mpm_prefork_module modules/mod_mpm_prefork.so
LoadModule mpm_worker_module modules/mod_mpm_worker.so

The event MPM is enabled, so we expect each worker to be multithreaded. Let's check:

root@56b9a761d598:/usr/local/apache2# ps -efL
UID PID PPID LWP C NLWP STIME TTY TIME CMD
root 1 0 1 0 1 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 8 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 11 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 12 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 16 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 17 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 18 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 8 1 19 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 9 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 14 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 15 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 20 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 21 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 22 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 9 1 23 0 7 20:56 pts/0 00:00:00 httpd -DFOREGROUND
root 24 0 24 1 1 20:56 pts/1 00:00:00 bash
root 30 24 30 0 1 20:56 pts/1 00:00:00 ps -efL

Indeed, each worker has 7 threads. In fact, the number of threads per worker is configurable, as well as the number of initial workers.

When a new connection is created, it will be handled by whatever thread is available from any worker. If all the threads are busy, then the server will spawn more worker processes (and therefore more threads), as long as the total number of threads is below some threshold, which is also configurable.

Let's see this dynamic scaling in action. We need to create a number of simultaneous connections that is larger than the current number of threads. There is a simple script in apache2/support/make_conn.py to do this:

student@os:~/.../apache2/support$ python3 make_conn.py localhost 8080
Press ENTER to exit

The script has created 100 connections and will keep them open until we press Enter.

Now, in another terminal, let's check the situation inside the container:

student@os:~/.../apache2/support$ docker exec -it apache2-test bash

root@56b9a761d598:/usr/local/apache2# ps -efL
UID PID PPID LWP C NLWP STIME TTY TIME CMD
root 1 0 1 0 1 20:56 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 40 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 45 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 46 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 51 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 52 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 53 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 40 1 54 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 55 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 58 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 60 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 62 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 63 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 65 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 55 1 66 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
[...]
www-data 109 1 109 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 115 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 116 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 121 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 122 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 123 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
www-data 109 1 124 0 7 21:07 pts/0 00:00:00 httpd -DFOREGROUND
root 146 0 146 0 1 21:10 pts/1 00:00:00 bash
root 152 146 152 0 1 21:10 pts/1 00:00:00 ps -efL

We see a much larger number of threads, as expected.

Guide: Fork Faults

Now let's see the copy-on-write mechanism in practice. Keep in mind that fork() is a function used to create a process.

Open two terminals (or better: use tmux). In one of them, compile and run the code in fork-faults/support/fork_faults.c. Each time you press Enter in the first terminal window, run the following command in the second window:

student@os:~/.../fork-faults/support$ ps -o min_flt,maj_flt -p $(pidof fork_faults)

It will show you the number of minor and major page faults incurred by the fork_faults process and its child.

Quiz 1

Note that after fork()-ing, there is a second row in the output of ps. That corresponds to the child process. The first one still corresponds to the parent.

Quiz 2

Now it should be clear how demand paging differs from copy-on-write. Shared memory is a similar concept. It's a way of marking certain allocated pages so that copy-on-write is disabled. As you may imagine, changes made by the parent to this memory are visible to the child and vice-versa.
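The difference between a private (copy-on-write) mapping and a shared one can be sketched as follows. This is an illustration under assumptions rather than the code of shared_memory.c: the function name value_seen_by_parent() is hypothetical, and the mapping is anonymous, so it can only be shared with processes created by fork().

```c
#define _DEFAULT_SOURCE /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Maps one int using `sharing_flag` (MAP_SHARED or MAP_PRIVATE), lets a
 * child write 42 into it and returns the value the parent reads afterwards.
 * Returns -1 on error.
 */
int value_seen_by_parent(int sharing_flag)
{
    int *value = mmap(NULL, sizeof(*value), PROT_READ | PROT_WRITE,
                      sharing_flag | MAP_ANONYMOUS, -1, 0);

    if (value == MAP_FAILED)
        return -1;
    *value = 0;

    pid_t pid = fork();

    if (pid < 0) {
        munmap(value, sizeof(*value));
        return -1;
    }

    if (pid == 0) {
        /* MAP_SHARED: writes to the common frame. MAP_PRIVATE: triggers a copy. */
        *value = 42;
        _exit(0);
    }

    /* Waiting for the child guarantees that its write has already happened. */
    if (waitpid(pid, NULL, 0) < 0) {
        munmap(value, sizeof(*value));
        return -1;
    }

    int seen = *value;

    munmap(value, sizeof(*value));
    return seen;
}
```

Called from main(), value_seen_by_parent(MAP_SHARED) is expected to return 42, while value_seen_by_parent(MAP_PRIVATE) is expected to return 0, because in the private case the child's write lands in its own copy of the frame.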