
Lab 6 - Multiprocess and Multithread

Task: Creating a process

Higher level - Python

Enter the chapters/compute/processes/drills/tasks/sleepy directory, run make skels, open the support/src folder and go through the practice items below.

Use the tests/checker.sh script to check your solutions.

./checker.sh
sleepy_creator ...................... passed ... 30
sleepy_creator_wait ................. passed ... 30
sleepy_creator_c .................... passed ... 40
100 / 100

Head over to sleepy_creator.py.

  1. Solve the TODO: use subprocess.Popen() to spawn 10 sleep 1000 processes.

    Start the script:

    student@os:~/.../tasks/sleepy/support$ python3 sleepy_creator.py

    Look for the parent process:

    student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep "python3 sleepy_creator.py")

    It is a python3 process, as this is the interpreter that runs the script, but we call it the sleepy_creator.py process for simplicity. No output will be provided by the above command, as the parent process (sleepy_creator.py) dies before its child processes (the 10 sleep 1000 subprocesses) finish their execution. The parent process of the newly created child processes is an init-like process: either systemd/init or another system process that adopts orphan processes. Look for the sleep child processes using:

    student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep sleep)
    PID PPID CMD
    4164 1680 sleep 1000
    4165 1680 sleep 1000
    4166 1680 sleep 1000
    4167 1680 sleep 1000
    4168 1680 sleep 1000
    4169 1680 sleep 1000
    4170 1680 sleep 1000
    4171 1680 sleep 1000
    4172 1680 sleep 1000
    4173 1680 sleep 1000

    Notice that the child processes do not have sleepy_creator.py as a parent. What's more, as you saw above, sleepy_creator.py doesn't even exist anymore. The child processes have been adopted by an init-like process (in the output above, that process has PID 1680 - PPID stands for parent process ID).

  2. Solve the TODO: change the code in sleepy_creator_wait.py so that the sleep 1000 processes remain the children of sleepy_creator_wait.py. This means that the parent / creator process must not exit until its children have finished their execution. In other words, the parent / creator process must wait for the termination of its children. Check out Popen.wait() and add the code that makes the parent / creator process wait for its children. Before anything, terminate the sleep processes created above:

    student@os:~$ pkill sleep

    Start the program, again, as you did before:

    student@os:~/.../tasks/sleepy/support$ python3 sleepy_creator.py

    On another terminal, verify that sleepy_creator_wait.py remains the parent of the sleep processes it creates:

    student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep sleep)
    PID PPID CMD
    16107 9855 python3 sleepy_creator.py
    16108 16107 sleep 1000
    16109 16107 sleep 1000
    16110 16107 sleep 1000
    16111 16107 sleep 1000
    16112 16107 sleep 1000
    16113 16107 sleep 1000
    16114 16107 sleep 1000
    16115 16107 sleep 1000
    16116 16107 sleep 1000
    16117 16107 sleep 1000

    Note that the parent process sleepy_creator_wait.py (PID 16107) is still alive, and its child processes (the 10 sleep 1000) have its ID as their PPID. You've successfully waited for the child processes to finish their execution.

    If you're having difficulties solving this exercise, go through this reading material.
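The spawn-then-wait pattern described above can be sketched as follows. This is only a minimal illustration, not the reference solution: the helper name spawn_and_wait() is hypothetical, and a short-lived Python child stands in for sleep 1000 so that the example finishes quickly.

```python
import subprocess
import sys


def spawn_and_wait(count=3):
    """Spawn `count` short-lived children and wait for each of them."""
    # Stand-in for `sleep 1000`: each child is an interpreter that exits at once.
    cmd = [sys.executable, "-c", "pass"]
    children = [subprocess.Popen(cmd) for _ in range(count)]
    # Popen.wait() blocks until the child terminates and returns its exit code.
    # While the parent is blocked here, it remains the parent of its children.
    return [child.wait() for child in children]


if __name__ == "__main__":
    print(spawn_and_wait())
```

Without the calls to wait(), the parent would reach the end of the script and exit while its children were still running, which is exactly the orphaning seen in the first exercise.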

Lower level - C

Now let's see how to create a child process in C. There are multiple ways of doing this. For now, we'll start with a higher-level approach.

Go to sleepy_creator.c and use system() to create a sleep 1000 process.

The man page also mentions that system() calls fork() and exec() to run the command it's given. If you want to find out more about them, head over to the Arena and create your own mini-shell.

Task: Wait for Me

Enter the chapters/compute/processes/drills/tasks/wait-for-me-processes/ directory, run make skels, open the support/src folder and go through the practice items below.

Use the tests/checker.sh script to check your solutions.

wait_for_me_processes ...................... passed ... 100
100 / 100

  1. Run the code in wait_for_me_processes.py (e.g. python3 wait_for_me_processes.py). The parent process creates one child that writes a message to the given file. Then the parent reads that message. Simple enough, right? But running the code raises a FileNotFoundError. If you inspect the file you gave the script as an argument, it does contain a string. What's going on?

    The parent and the child race each other: the parent may try to read the file before the child has created it. In order to solve race conditions, we need **synchronization**. This is a mechanism similar to a set of traffic lights in a crossroads. Just like traffic lights allow some cars to pass only after others have already passed, synchronization is a means for threads to communicate with each other and tell each other to access a resource or not.

    The most basic form of synchronization is **waiting**. Concretely, if the parent process **waits** for the child to end, we are sure the file is created and its contents are written.

  2. Use join() to make the parent wait for its child before reading the file.
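As a rough sketch of waiting as synchronization, the snippet below has a child write a file and a parent read it only after join() returns. The function names and the temporary file are hypothetical, and the "fork" start method is an assumption that holds on Linux and similar systems.

```python
import multiprocessing
import os
import tempfile


def child_write(path):
    # Runs in the child process: create the file and write the message.
    with open(path, "w") as f:
        f.write("hello from child")


def parent_read_after_join():
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = multiprocessing.get_context("fork")
    path = os.path.join(tempfile.mkdtemp(), "message.txt")
    child = ctx.Process(target=child_write, args=(path,))
    child.start()
    child.join()  # block until the child has ended, so the file must exist
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    print(parent_read_after_join())
```

Removing the call to join() reintroduces the race: the parent may reach open() before the child has created the file.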

Task: Create Process

Enter the chapters/compute/processes/drills/tasks/create-process/ directory, run make skels, open the support/src folder and go through the practice items below.

Use the tests/checker.sh script to check your solutions.

./checker.sh
exit_code22 ...................... passed ... 50
second_fork ...................... passed ... 50
100 / 100

  1. Change the return value of the child process to 22 so that the value displayed by the parent is changed.

  2. Create a child process of the newly created child.

Use a similar logic and a similar set of prints to those in the support code. Take a look at the printed PIDs. Make sure the PPID of the "grandchild" is the PID of the child, whose PPID is, in turn, the PID of the parent.
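The task itself is solved in C, but the parent, child and grandchild relationship can be illustrated with Python's os.fork(), which wraps the same mechanism. The sketch below is an assumption-laden illustration (the function name fork_chain() and the pipe-based reporting are not part of the support code): each descendant reports its PPID through a pipe so that the parent can compare the values.

```python
import os


def fork_chain():
    """Fork a child that forks a grandchild; collect the PIDs and PPIDs."""
    read_fd, write_fd = os.pipe()
    child_pid = os.fork()
    if child_pid == 0:
        # Child process: create the grandchild, then report through the pipe.
        try:
            os.close(read_fd)
            grandchild_pid = os.fork()
            if grandchild_pid == 0:
                os.write(write_fd, f"grandchild_ppid={os.getppid()}\n".encode())
                os._exit(0)
            os.waitpid(grandchild_pid, 0)
            os.write(write_fd, f"child_ppid={os.getppid()}\n".encode())
        finally:
            os._exit(22)  # the exit code the parent will observe
    # Parent process: read both reports, then collect the child's exit code.
    os.close(write_fd)
    with os.fdopen(read_fd) as pipe:
        report = dict(line.split("=") for line in pipe.read().splitlines())
    _, status = os.waitpid(child_pid, 0)
    return {
        "parent_pid": os.getpid(),
        "child_pid": child_pid,
        "child_ppid": int(report["child_ppid"]),
        "grandchild_ppid": int(report["grandchild_ppid"]),
        "child_exit_code": os.WEXITSTATUS(status),
    }


if __name__ == "__main__":
    print(fork_chain())
```

The child waits for the grandchild before exiting, which guarantees that the grandchild still sees the child as its parent when it calls getppid().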

Task: Multithreaded

Enter the chapters/compute/threads/drills/tasks/multithreaded/ folder, run make skels, and go through the practice items below in the support/ directory.

  1. Use the Makefile to compile multithreaded.c, run it and follow the instructions.

    The aim of this task is to familiarize you with the pthreads library. In order to use it, you have to add #include <pthread.h> in multithreaded.c and -lpthread in the compiler options.

    The executable creates 5 threads besides the main thread, puts each of them to sleep for 5 seconds, then waits for all of them to finish. Give it a run and notice that the total waiting time is around 5 seconds since you started the last thread. That is the whole point - they each run in parallel.

  2. Make each thread print its ID once it is done sleeping.

    Create a new function sleep_wrapper2() identical to sleep_wrapper() to organize your work. So far, the data argument is unused (mind the __unused attribute), so that is your starting point. You cannot change the definition of sleep_wrapper2(), since pthread_create() expects a pointer to a function that receives a void * argument. What you can and should do is pass a pointer to an int as the argument, and then cast data to int * inside sleep_wrapper2().

    Note: Do not simply pass &i as the argument to the function. This will make all threads use the same integer as their ID.

    Note: Do not use global variables.

    If you get stuck you can google pthread example and you will probably stumble upon this.

  3. On top of printing its ID upon completion, make each thread sleep for a different amount of time.

    Create a new function sleep_wrapper3() identical to sleep_wrapper() to organize your work. The idea is to repeat what you did on the previous exercise and use the right argument for sleep_wrapper3(). Keep in mind that you cannot change its definition. Bonus points if you do not use the thread's ID as the sleeping amount.
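The exercises above use pthreads in C. As a loose analogue only, the Python sketch below gives every thread its own ID and its own sleeping time through the args tuple. The names sleep_wrapper() and run_threads() are reused or invented purely for illustration, and the sleeping times are arbitrary.

```python
import threading
import time


def sleep_wrapper(thread_id, seconds, finished):
    time.sleep(seconds)
    # In CPython, list.append() is safe to call from several threads.
    finished.append(thread_id)


def run_threads(count=5):
    finished = []
    threads = []
    for i in range(count):
        # The value of i at this moment is stored in args, so each thread
        # gets a distinct ID even though the loop variable keeps changing.
        seconds = 0.01 * (count - i)
        t = threading.Thread(target=sleep_wrapper, args=(i, seconds, finished))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return finished


if __name__ == "__main__":
    print(run_threads())
```

The order in which the IDs are appended depends on scheduling, so only the set of IDs is predictable, not their order.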

Task: Libraries for Parallel Processing

In chapters/compute/threads/drills/tasks/sum-array/support/c/sum_array_threads.c we spawned threads "manually" by using the pthread_create() function. pthread_create() is not a syscall itself, but a library wrapper. Under the hood, it uses the same syscall as fork() (which is also not a syscall).

To see which syscall pthread_create() uses, check out this section.

Most programming languages provide a more advanced API for handling parallel computation.

Array Sum in Python

Let's first probe this by implementing two parallel versions of the code in sum-array/support/python/sum_array_sequential.py. One version should use threads and the other should use processes. Run each of them using 1, 2, 4, and 8 threads / processes respectively and compare the running times. Notice that the running times of the multithreaded implementation do not decrease. This is because the GIL makes it so that those threads that you create essentially run sequentially.
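One possible shape for the two parallel versions is sketched below. It is not the reference solution: the helper names are hypothetical, the array is assumed to be a non-empty list of integers, and the "fork" start method is assumed to be available (as on Linux).

```python
import multiprocessing
import threading


def split(arr, parts):
    """Split a non-empty list into at most `parts` contiguous chunks."""
    size = (len(arr) + parts - 1) // parts
    return [arr[i:i + size] for i in range(0, len(arr), size)]


def sum_with_threads(arr, num_threads):
    pieces = split(arr, num_threads)
    partial = [0] * len(pieces)

    def worker(idx):
        # Each thread writes only to its own slot of the shared list.
        partial[idx] = sum(pieces[idx])

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(pieces))]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial)


def sum_with_processes(arr, num_procs):
    # Assumption: the "fork" start method is available on this system.
    ctx = multiprocessing.get_context("fork")
    with ctx.Pool(num_procs) as pool:
        # Each worker process sums one chunk and sends the result back.
        return sum(pool.map(sum, split(arr, num_procs)))
```

Timing each function (for example with time.perf_counter()) for 1, 2, 4 and 8 workers is what reveals the difference discussed above.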

The GIL also makes it so that individual Python bytecode instructions are atomic. Run the code in chapters/compute/synchronization/drills/tasks/race-condition/support/python/race_condition.py. Every time, var will be 0 because the GIL doesn't allow the two threads to run in parallel and reach the critical section at the same time. In practice, this makes the statements var += 1 and var -= 1 behave atomically here, although the language itself does not guarantee it.

If you're having difficulties solving this exercise, go through this reading material.

Task: Wait for It

The process that spawns all the others and subsequently calls waitpid to wait for them to finish can also get their return codes. Update the code in chapters/compute/threads/drills/tasks/sum-array-bugs/support/seg-fault/sum_array_processes.c and modify the call to waitpid to obtain and investigate this return code. Display an appropriate message if one of the child processes returns an error.

Remember to use the appropriate macros for handling the status variable that is modified by waitpid(), as it is a bit-field. When a process runs into a system error, it receives a signal. A signal is a means to interrupt the normal execution of a program from the outside. It is associated with a number. Use kill -l to find the full list of signals.
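Python exposes the same status macros as C through the os module, so the logic can be sketched there before writing it in C. In the hypothetical example below, the child sends itself SIGSEGV to imitate a segmentation fault; the function names are assumptions made for illustration.

```python
import os
import signal


def describe_status(status):
    """Decode the status value filled in by waitpid()."""
    if os.WIFEXITED(status):
        return f"exited with code {os.WEXITSTATUS(status)}"
    if os.WIFSIGNALED(status):
        return f"killed by signal {signal.Signals(os.WTERMSIG(status)).name}"
    return "stopped or continued"


def run_child(action):
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            os._exit(0)  # reached only if action() returns or raises
    _, status = os.waitpid(pid, 0)
    return describe_status(status)


def crash():
    # Imitate a segfault: restore the default action, then signal ourselves.
    signal.signal(signal.SIGSEGV, signal.SIG_DFL)
    os.kill(os.getpid(), signal.SIGSEGV)


if __name__ == "__main__":
    print(run_child(crash))
    print(run_child(lambda: os._exit(3)))
```

The order of the checks matters: WEXITSTATUS is only consulted after WIFEXITED has confirmed that the child ended on its own.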

So up to this point we've seen that one advantage of processes is that they offer better safety than threads. Because they use separate virtual address spaces, sibling processes are better isolated than threads. Thus, an application that uses processes can be more robust to errors than if it were using threads.

Memory Corruption

Because they share the same address space, threads run the risk of corrupting each other's data. Take a look at the code in sum-array-bugs/support/memory-corruption/python/. The two programs only differ in how they spread their workload. One uses threads while the other uses processes.

Run both programs with and without memory corruption. Pass any value as a third argument to trigger the corruption.

student@os:~/.../sum-array-bugs/support/memory-corruption/python$ python3 memory_corruption_processes.py <number_of_processes>  # no memory corruption
[...]

student@os:~/.../sum-array-bugs/support/memory-corruption/python$ python3 memory_corruption_processes.py <number_of_processes> 1 # do memory corruption
[...]

The one using threads will most likely print a negative sum, while the other displays the correct sum. This happens because all threads refer to the same memory for the array arr. What happens to the processes is a bit more complicated.

Later in this lab, we will see that initially, the page tables of all processes point to the same physical frames of arr. When the malicious process tries to corrupt this array by writing data to it, the OS duplicates the original frames of arr so that the malicious process writes the corrupted values to these new frames, while leaving the original ones untouched. This mechanism is called Copy-on-Write and is an OS optimisation so that memory is shared between the parent and the child process, until one of them attempts to write to it. At this point, this process receives its own separate copies of the previously shared frames.
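The visible effect of this isolation can be reproduced with a small, self-contained sketch. It is not the code from the support folder; the function names are hypothetical, and the example only shows what the parent observes, not the page-level mechanics.

```python
import os
import threading


def corrupt(arr):
    for i in range(len(arr)):
        arr[i] = -1


def corrupt_in_thread(arr):
    # The thread shares the address space, so it overwrites the caller's list.
    t = threading.Thread(target=corrupt, args=(arr,))
    t.start()
    t.join()
    return arr


def corrupt_in_process(arr):
    pid = os.fork()
    if pid == 0:
        try:
            corrupt(arr)  # the writes land in the child's private copies
        finally:
            os._exit(0)
    os.waitpid(pid, 0)
    return arr  # the parent's list is untouched


if __name__ == "__main__":
    print(corrupt_in_process([1, 2, 3]))
    print(corrupt_in_thread([1, 2, 3]))
```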

Note that in order for the processes to share the sums dictionary, it is not created as a regular dictionary, but using the Manager module. This module provides some special data structures that are allocated in shared memory so that all processes can access them. You can learn more about shared memory and its various implementations in this section.
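A minimal sketch of sharing a dictionary between processes through a Manager is shown below. The function names and the chunked input are assumptions made for illustration, and the "fork" start method is assumed to be available.

```python
import multiprocessing


def store_partial_sum(sums, idx, chunk):
    # `sums` is a proxy object; the assignment is forwarded to the manager.
    sums[idx] = sum(chunk)


def partial_sums(chunks):
    # Assumption: the "fork" start method is available on this system.
    ctx = multiprocessing.get_context("fork")
    with ctx.Manager() as manager:
        sums = manager.dict()
        procs = [ctx.Process(target=store_partial_sum, args=(sums, i, chunk))
                 for i, chunk in enumerate(chunks)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        # Copy the contents into a regular dict before the manager shuts down.
        return sums.copy()


if __name__ == "__main__":
    print(partial_sums([[1, 2], [3, 4]]))
```

A regular dict would not work here: each child would fill in its own private copy and the parent would see an empty dictionary.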

If you're having difficulties solving this exercise, go through this reading material.

Hardware Perspective

The main criterion we use to rank CPUs is their computation power, i.e. their ability to crunch numbers and do math. Numerous benchmarks exist out there, and they are publicly displayed on sites such as CPUBenchmark.

For example, a benchmark can measure the performance of the computer's CPU in a variety of scenarios:

  • its ability to perform integer operations
  • its speed in floating point arithmetic
  • data encryption and compression
  • sorting algorithms and others

You can take a look at what exactly is measured using this link. It displays the scores obtained by a high-end CPU. Apart from the tests above, other benchmarks might focus on different performance metrics, such as branch prediction or prefetching.

Other approaches are less artificial, measuring performance on real-world applications such as compile times and performance in the latest (and most resource-demanding) video games. The latter metric revolves around how many average FPS (frames per second) a given CPU is able to crank out in a specific video game. You can find a lot of articles online on how CPU benchmarking is done.

Most benchmarks, unfortunately, are not open source, especially the more popular ones, such as Geekbench 5. Despite this shortcoming, benchmarks are widely used to compare the performance of various computer hardware, CPUs included.

The Role of the Operating System

As you've seen so far, the CPU provides the "muscle" required for fast computation, i.e. the highly optimised hardware and multiple ALUs, FPUs and cores necessary to perform those computations. However, it is the operating system that provides the "brains" for this computation. Specifically, modern CPUs have the capacity to run multiple tasks in parallel. But they do not provide a means to decide which task to run at each moment. The OS comes as an orchestrator to schedule the way these tasks (that we will later call threads) are allowed to run and use the CPU's resources. This way, the OS tells the CPU what code to run on each CPU core so that it reaches a good balance between high throughput (running many instructions) and fair access to CPU cores.

It is cumbersome for a user-level application to interact directly with the CPU. The developer would have to write hardware-specific code, which is not scalable and is difficult to maintain. In addition, doing so would leave it up to the developer to isolate their application from the others that are present on the system. This leaves applications vulnerable to countless bugs and exploits.

To guard apps from these pitfalls, the OS comes and mediates interactions between regular programs and the CPU by providing a set of abstractions. These abstractions offer a safe, uniform and also isolated way to leverage the CPU's resources, i.e. its cores. There are 2 main abstractions: processes and threads.

Interaction between applications, OS and CPU

As we can see from the image above, an application can spawn one or more processes. Each of these is handled and maintained by the OS. Similarly, each process can spawn however many threads, which are also managed by the OS. The OS decides when and on what CPU core to make each thread run. This is in line with the general interaction between an application and the hardware: it is always mediated by the OS.

Processes

A process is simply a running program. Let's take the ls command as a trivial example. ls is a program on your system. It has a binary file which you can find and inspect with the help of the which command:

student@os:~$ which ls
/usr/bin/ls

student@os:~$ file /usr/bin/ls
/usr/bin/ls: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e3da6f0bc36b6398b8651bbc2e08831a21a90da, for GNU/Linux 3.2.0, stripped

When you run it, the ls binary stored on the disk at /usr/bin/ls is read by another application called the loader. The loader spawns a process by copying some of the contents of /usr/bin/ls into memory (such as the .text, .rodata and .data sections). Using strace, we can see the execve system call:

student@os:~$ strace -s 100 ls -a  # -s 100 limits strings to 100 bytes instead of the default 32
execve("/usr/bin/ls", ["ls", "-a"], 0x7fffa7e0d008 /* 61 vars */) = 0
[...]
write(1, ". .. content\tCONTRIBUTING.md COPYING.md .git .gitignore README.md REVIEWING.md\n", 86. .. content CONTRIBUTING.md COPYING.md .git .gitignore README.md REVIEWING.md
) = 86
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++

Look at its parameters:

  • the path to the program: /usr/bin/ls
  • the list of arguments: "ls", "-a"
  • the environment variables: the rest of the syscall's arguments

execve invokes the loader to load the VAS of the ls process by replacing that of the existing process. All subsequent syscalls are performed by the newly spawned ls process. We will get into more details regarding execve towards the end of this lab.
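To make the "replacing" part more tangible, the hedged sketch below runs a tiny Python program that prints its PID and then calls os.execv() to turn into a shell that prints its own PID. The script text, the function name and the presence of /bin/sh are assumptions; the point is that both PIDs belong to the same process.

```python
import subprocess
import sys

# Program run in a separate interpreter so that exec does not replace us.
SCRIPT = r'''
import os
print(os.getpid(), flush=True)                # printed by the Python program
os.execv("/bin/sh", ["sh", "-c", "echo $$"])  # the shell prints its own PID
print("never reached")                        # the old program is gone by now
'''


def pids_before_and_after_exec():
    result = subprocess.run([sys.executable, "-c", SCRIPT],
                            capture_output=True, text=True, check=True)
    return result.stdout.split()


if __name__ == "__main__":
    print(pids_before_and_after_exec())
```

The two values are equal because exec does not create a process; it only swaps the program that the existing process runs.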

Loading of `ls` Process

Fork

Up to now we've been creating processes using various high-level APIs, such as Popen(), Process() and system(). Yes, despite being a C function, as you've seen from its man page, system() itself calls 2 other functions: fork() to create a process and execve() to execute the given command. As you already know from the Software Stack chapter, library functions may call one or more underlying system calls or other functions. Now we will move one step lower on the call stack and call fork() ourselves.

fork() creates one child process that is almost identical to its parent. We say that fork() returns twice: once in the parent process and once more in the child process. This means that after fork() returns, assuming no error has occurred, both the child and the parent resume execution from the same place: the instruction following the call to fork(). What's different between the two processes is the value returned by fork():

  • child process: fork() returns 0
  • parent process: fork() returns the PID of the child process (> 0)
  • on error: fork() returns -1, only once, in the initial process

Therefore, the typical code for handling a fork() is available in create-process/support/fork.c. Take a look at it and then run it. Notice what each of the two processes prints:

  • the PID of the child is also known by the parent
  • the PPID of the child is the PID of the parent

Unlike system(), which also waits for its child, when using fork() we must do the waiting ourselves. In order to wait for a process to end, we use the waitpid() syscall. It places the exit code of the child process in the status parameter. This argument is actually a bit-field containing more information than merely the exit code. To retrieve the exit code, we use the WEXITSTATUS macro. Keep in mind that WEXITSTATUS only makes sense if WIFEXITED is true, i.e. if the child process finished on its own and wasn't killed by another process or by an illegal action (such as a segfault or illegal instruction). Otherwise, WEXITSTATUS will return something meaningless. You can view the rest of the information stored in the status bit-field in the man page.
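The same fork-then-wait flow can be sketched in Python, whose os module wraps these calls. One difference worth noting: os.fork() raises OSError on failure instead of returning -1. The function name below is hypothetical.

```python
import os


def run_child_and_get_exit_code(exit_code):
    pid = os.fork()
    if pid == 0:
        # Child process: fork() returned 0.
        os._exit(exit_code)
    # Parent process: fork() returned the PID of the child (> 0).
    waited_pid, status = os.waitpid(pid, 0)
    assert waited_pid == pid
    # Only trust WEXITSTATUS when the child ended on its own.
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return None


if __name__ == "__main__":
    print(run_child_and_get_exit_code(22))
```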

Moral of the story: Usually the execution flow is:

  1. fork(), followed by

  2. wait() (called by the parent)

  3. exit(), called by the child.

The order of the last 2 steps may be swapped.

Threads

Threads vs Processes

So why use the implementation that spawns more processes if it's slower than the one using threads? The table below lists the differences between threads and processes. Generally, if we only want to do some computing, we use threads. If we need to drastically change the behaviour of the program, we need a new program altogether, or we need more than computing (e.g. communication on the network to create a computing cluster), we use processes.

| PROCESS | THREAD |
| --- | --- |
| independent | part of a process |
| collection of threads | shares VAS with other threads |
| slower creation (new page table must be created) | faster creation |
| longer context switch duration (TLB must be flushed) | shorter context switch duration (part of the same process, so same TLB) |
| ending means ending all threads | other threads continue when finished |

Safety

Compile and run the two programs in chapters/compute/threads/drills/tasks/sum-array-bugs/support/seg-fault/, first with 2 processes and threads and then with 4. They do the same thing as before: compute the sum of the elements in an array, but with a twist: each of them contains a bug causing a segfault. Notice that sum_array_threads doesn't print anything with 4 threads, but merely a "Segmentation fault" message. On the other hand, sum_array_processes prints a sum and a running time, albeit different from the sums we've seen so far.

The reason is that signals such as SIGSEGV, which is sent when a segmentation fault happens, affect the entire process that receives them. Therefore, when we split our workload between several threads and one of them causes an error such as a segfault, that error is going to terminate the entire process, along with all its threads. When we use processes instead of threads, the same kind of error only kills the process that caused it, while the other processes continue their work unhindered. This is why we end up with a lower sum in the end: because one process died too early and didn't manage to write the partial sum it had computed to the results array.

Memory Layout of Multithreaded Programs

When a new thread is created, a new stack is allocated for it. The default stack size is 8 MB / 8192 KB:

student@os:~$ ulimit -s
8192

Enter the chapters/compute/threads/drills/tasks/multithreaded/support/ directory to observe the update of the memory layout when creating new threads.

Build the multithreaded executable:

student@os:~/.../multithreaded/support$ make

Start the program:

student@os:~/.../multithreaded/support$ ./multithreaded
Press key to start creating threads ...
[...]

And investigate it with pmap on another console, while pressing a key to create new threads.

As you can see, there is a new 8192 KB area created for every thread, also increasing the total virtual size.

Guide: Baby steps - Python

Run the code in chapters/compute/processes/guides/create-process/support/popen.py. It simply spawns a new process running the ls command using subprocess.Popen(). Do not worry about the huge list of arguments that Popen() takes. They are used for inter-process communication. You'll learn more about this in the Application Interaction chapter.

Note that this usage of Popen() is not entirely correct. You'll discover why later, but for now focus on simply understanding how to use Popen() on its own.

Now change the command to anything you want. Also give it some arguments. From the outside, it's as if you were running these commands from the terminal.
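For orientation, a minimal and hedged sketch of running a command with arguments through Popen() is given below. It is not the content of popen.py: the function name is hypothetical, and stdout=subprocess.PIPE is just one of the communication-related arguments mentioned above.

```python
import subprocess
import sys


def run_and_capture(args):
    """Run a command given as a list (program first, then its arguments)."""
    # stdout=PIPE asks Popen to connect the child's output to a pipe
    # that the parent can read from.
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, text=True)
    output, _ = proc.communicate()  # read the output and wait for the child
    return proc.returncode, output


if __name__ == "__main__":
    print(run_and_capture([sys.executable, "-c", "print('hi')"]))
```

Calling run_and_capture(["ls", "-l"]) would behave as if ls -l had been typed in the terminal, with the listing returned as a string.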

Guide: Sum Array Processes

Sum of the Elements in an Array

Let's assume we only have one process on our system, and that process knows how to add the numbers in an array. It can use however many resources it wants, since there is no other process to contest it. It would probably look like the code in chapters/compute/processes/guides/sum-array-processes/support/c/sum_array_sequential.c. The program also measures the time spent computing the sum. Let's compile and run it:

student@os:~/.../sum-array/support/c$ ./sum_array_sequential
Array sum is: 49945994146
Time spent: 127 ms

You will most likely get a different sum (because the array is made up of random numbers) and a different time than the ones shown above. This is perfectly fine. Use these examples qualitatively, not quantitatively.

Spreading the Work Among Other Processes

Due to how it's implemented so far, our program only uses one of our CPU's cores. We never tell it to distribute its workload to other cores. This is wasteful as the rest of our cores remain unused:

student@os:~$ lscpu | grep ^CPU\(s\):
CPU(s): 8

We have 7 more cores waiting to add numbers in our array.

What if we used 100% of the CPU?

What if we use 7 more processes and spread the task of adding the numbers in this array between them? If we split the array into several equal parts and designate a separate process to calculate the sum of each part, we should get a speedup because now the work performed by each individual process is reduced.

Let's take it methodically. Compile and run sum_array_processes.c using 1, 2, 4 and 8 processes respectively. If your system only has 4 cores (hyperthreading included), limit your runs to 4 processes. Note the running times for each number of processes. We expect the speedups compared to our reference run to be 1, 2, 4 and 8 respectively, right?
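The splitting strategy can be sketched in a few lines. The guide's implementation is in C, so the Python version below is only an analogue under stated assumptions: the function names mirror the guide loosely, the results live in a shared array of 64-bit integers, and the "fork" start method is assumed to be available.

```python
import ctypes
import multiprocessing
import os


def calculate_array_part_sum(arr, start, end, results, idx):
    # Each process writes its partial sum into its own slot.
    results[idx] = sum(arr[start:end])


def sum_array_processes(arr, num_procs=None):
    ctx = multiprocessing.get_context("fork")
    num_procs = num_procs or os.cpu_count() or 1
    results = ctx.Array(ctypes.c_int64, num_procs)  # shared between processes
    step = (len(arr) + num_procs - 1) // num_procs  # chunk size, rounded up
    procs = []
    for i in range(num_procs):
        start, end = i * step, min((i + 1) * step, len(arr))
        p = ctx.Process(target=calculate_array_part_sum,
                        args=(arr, start, end, results, i))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    return sum(results[:])


if __name__ == "__main__":
    print(sum_array_processes(list(range(1_000_000)), 4))
```

By default the sketch uses os.cpu_count() processes, in line with the idea of giving each core one part of the array.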

You most likely did get some speedup, especially when using 8 processes. Now we will try to improve this speedup by using threads instead.

Also notice that we're not using hundreds or thousands of processes. Assuming our system has 8 cores, only 8 threads can run at the same time. In general, the maximum number of threads that can run at the same time is equal to the number of cores. In our example, each process only has one thread: its main thread. So by consequence and by forcing the terminology (because it's the main thread of these processes that is running, not the processes themselves), we can only run in parallel a number of processes equal to at most the number of cores.

Guide: system Dissected

You already know that system() calls fork() and execve() to create the new process. Let's see how and why. First, we run the following command to trace the execve() syscalls used by sleepy_creator. We'll leave fork() for later.

student@os:~/.../sleepy/support$ strace -e execve -ff -o syscalls ./sleepy_creator

At this point, you will get two files whose names start with syscalls, followed by some numbers. Those numbers are the PIDs of the parent and the child process. The child is created after the parent, so it typically receives the higher PID. Therefore, the file with the lower number contains the execve syscalls issued by the parent process, while the other logs the execve syscalls made by the child process. Let's take a look at them. The numbers below will differ from those on your system:

student@os:~/.../sleepy/support$ cat syscalls.2523393  # syscalls from parent process
execve("sleepy_creator", ["sleepy_creator"], 0x7ffd2c157758 /* 39 vars */) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2523394, si_uid=1052093, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++

student@os:~/.../sleepy/support$ cat syscalls.2523394 # syscalls from child process
execve("/bin/sh", ["sh", "-c", "sleep 10"], 0x7ffd36253be8 /* 39 vars */) = 0
execve("/usr/bin/sleep", ["sleep", "10"], 0x560f41659d40 /* 38 vars */) = 0
+++ exited with 0 +++

Now notice that the child process doesn't simply call execve("/usr/bin/sleep" ...). It first changes its virtual address space (VAS) to that of a shell process (execve("/bin/sh" ...)) and then that shell process switches its VAS to that of sleep. Therefore, calling system(<some_command>) is equivalent to running <some_command> in the command-line.

Moral of the story: When spawning a new command, the call order is:

  • parent: fork(), wait()
  • child: exec(), exit()
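Based on the strace output above, where the child is the one issuing execve("/bin/sh", ["sh", "-c", ...]), a simplified imitation of system() can be sketched as follows. The name my_system() is hypothetical, and the real system() also handles signals, which this sketch ignores.

```python
import os


def my_system(command):
    """Run `command` through /bin/sh and return its exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process with a shell that runs the command.
        try:
            os.execv("/bin/sh", ["sh", "-c", command])
        finally:
            os._exit(127)  # only reached if execv() failed
    # Parent: wait for the child, just as system() does.
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) if os.WIFEXITED(status) else -1


if __name__ == "__main__":
    print(my_system("exit 7"))
```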

Guide: Sum array Threads

Spreading the Work Among Other Threads

Compile the code in chapters/compute/threads/guides/sum-array-threads/support/c/sum_array_threads.c and run it using 1, 2, 4 and 8 threads as you did before. Each thread runs the calculate_array_part_sum() function and then finishes. Running times should be slightly smaller than those of the implementation using processes. This slight time difference is caused by process creation actions, which are costlier than thread creation actions. Because a process needs a separate virtual address space (VAS) and needs to duplicate some internal structures such as the file descriptor table and page table, it takes the operating system more time to create it than to create a thread. On the other hand, threads belonging to the same process share the same VAS and, implicitly, the same OS-internal structures. Therefore, they are more lightweight than processes.

std.parallelism in D

D language's standard library exposes the std.parallelism module, which provides a series of parallel processing functions. One such function is reduce(), which splits an array between a given number of threads and applies a given operation to these chunks. In our case, the operation simply adds the elements to an accumulator: a + b. Follow and run the code in chapters/compute/threads/guides/sum-array-threads/support/d/sum_array_threads_reduce.d.

The number of threads is used within a TaskPool. This structure is a thread manager (not scheduler). It silently creates the number of threads we request and then reduce() spreads its workload between these threads.
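The idea of a pool that owns the threads while a higher-level function spreads the work can be approximated in Python with concurrent.futures. This is only an analogue of the D code, not a translation of it; the function name pooled_sum() is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def pooled_sum(arr, num_threads):
    # Split the array into at most num_threads contiguous chunks.
    size = max(1, (len(arr) + num_threads - 1) // num_threads)
    chunks = [arr[i:i + size] for i in range(0, len(arr), size)]
    # The pool creates and manages the threads; we only submit work to it.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partial_sums = pool.map(sum, chunks)  # one partial sum per chunk
        return sum(partial_sums)              # the final reduction step


if __name__ == "__main__":
    print(pooled_sum(list(range(101)), 4))
```

As with TaskPool, the caller never starts or joins a thread explicitly; the pool decides which thread handles which chunk.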

Now that you've seen how parallelism works in D, go in chapters/compute/threads/guides/sum-array-threads/support/java/SumArrayThreads.java and follow the TODOs. The code is similar to the one written in D, and it uses ThreadPoolExecutor. More about that here. To run the code use:

javac SumArrayThreads.java
java SumArrayThreads 4

4 is the number of threads used, but you can replace the value with any number less than or equal to the number of available cores.

OpenMP for C

Unlike D, C does not support parallel computation by design. It needs a library to do advanced things, like reduce() from D. We have chosen to use the OpenMP library for this. Follow the code in chapters/compute/threads/guides/sum-array-threads/support/c/sum_array_threads_openmp.c.

The #pragma used in the code instructs the compiler to enable the omp module, and to parallelise the code. In this case, we instruct the compiler to perform a reduce of the array, using the + operator, and to store the results in the result variable. This reduction uses threads to calculate the sum, similar to sum_array_threads.c, but in a much more optimised form.

One of the advantages of OpenMP is that it is relatively easy to use. The syntax requires only a few additional lines of code and compiler options, thus converting sequential code into parallel code quickly. For example, using #pragma omp parallel for, a developer can parallelize a for loop, enabling iterations to run across multiple threads.

OpenMP uses a shared-memory model, meaning all threads can access a common memory space. This model is particularly useful for tasks that require frequent access to shared data, as it avoids the overhead of transferring data between threads. However, shared memory can also introduce challenges, such as race conditions or synchronization issues, which can occur when multiple threads attempt to modify the same data simultaneously, but we'll talk about that later. OpenMP offers constructs such as critical sections, atomic operations, and reductions to help manage these issues and ensure that parallel code executes safely and correctly.
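The critical-section idea is not specific to OpenMP. As a rough analogue in Python (not OpenMP syntax), the sketch below protects a shared counter with a lock so that only one thread at a time executes the update. The function name and the default values are hypothetical.

```python
import threading


def count_with_lock(num_threads=4, increments=10000):
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # Critical section: only one thread at a time may run this block.
            with lock:
                counter += 1

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter


if __name__ == "__main__":
    print(count_with_lock())
```

With the lock in place, the final value is always num_threads * increments, regardless of how the threads are interleaved.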

Now compile and run the sum_array_threads_openmp binary using 1, 2, 4, and 8 threads as before. You'll see lower running times than sum_array_threads due to the highly-optimised code emitted by the compiler. For this reason and because library functions are usually much better tested than your own code, it is always preferred to use a library function for a given task.

For a challenge, enter chapters/compute/threads/guides/sum-array-threads/support/c/add_array_threads_openmp.c. Use what you've learned from the previous exercise and add the value 100 to an array using OpenMP.

Guide: Threads and Processes: clone

Let's go back to our initial demos that used threads and processes. We'll see that in order to create both threads and processes, the underlying Linux syscall is clone. For this, we'll run both sum_array_threads and sum_array_processes under strace. As we've already established, we're only interested in the clone syscall:

student@os:~/.../sum-array/support/c$ strace -e clone,clone3 ./sum_array_threads 2
clone(child_stack=0x7f60b56482b0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1819693], tls=0x7f60b5649640, child_tidptr=0x7f60b5649910) = 1819693
clone(child_stack=0x7f60b4e472b0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1819694], tls=0x7f60b4e48640, child_tidptr=0x7f60b4e48910) = 1819694

student@os:~/.../sum-array/support/c$ strace -e clone,clone3 ./sum_array_processes 2
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7a4e346650) = 1820599
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7a4e346650) = 1820600

We ran each program with an argument of 2, so we have 2 calls to clone. Notice that in the case of threads, the clone syscall receives more arguments. The relevant flags passed as arguments when creating threads are documented in clone's man page:

  • CLONE_VM: the child and the parent process share the same VAS
  • CLONE_{FS,FILES,SIGHAND}: the new thread shares the filesystem information, file descriptors and signal handlers with the one that created it. The syscall also receives valid pointers to the new thread's stack and TLS, i.e. the only parts of the VAS that are distinct between threads (although they are technically accessible from all threads).

By contrast, when creating a new process, the arguments of the clone syscall are simpler (i.e. fewer flags are present). Remember that in both cases clone creates a new thread. When creating a process, clone creates this new thread within a new separate address space.
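The difference can also be observed from user space without strace. In the hedged sketch below (the function name is hypothetical, and threading.get_native_id() requires Python 3.8 or newer), a new thread reports the same PID as its creator but a different kernel thread ID, while a forked child gets a different PID altogether.

```python
import os
import threading


def compare_ids():
    ids = {"main_pid": os.getpid(), "main_tid": threading.get_native_id()}

    def worker():
        # A thread lives in the same process, so the PID is unchanged,
        # but the kernel assigns it its own thread ID.
        ids["thread_pid"] = os.getpid()
        ids["thread_tid"] = threading.get_native_id()

    t = threading.Thread(target=worker)
    t.start()
    t.join()

    child_pid = os.fork()
    if child_pid == 0:
        os._exit(0)  # the child has nothing else to do
    os.waitpid(child_pid, 0)
    ids["child_pid"] = child_pid  # a separate process with its own PID
    return ids


if __name__ == "__main__":
    print(compare_ids())
```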