Lab 6 - Multiprocess and Multithread
Task: Creating a process
Higher level - Python
Enter the chapters/compute/processes/drills/tasks/sleepy directory, run make skels, open the support/src folder and go through the practice items below.
Use the tests/checker.sh script to check your solutions.
./checker.sh
sleepy_creator ...................... passed ... 30
sleepy_creator_wait ................. passed ... 30
sleepy_creator_c .................... passed ... 40
100 / 100
Head over to sleepy_creator.py.
Solve the TODO: use subprocess.Popen() to spawn 10 sleep 1000 processes.
Start the script:
student@os:~/.../tasks/sleepy/support$ python3 sleepy_creator.py
Look for the parent process:
student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep "python3 sleepy_creator.py")
It is a python3 process, as this is the interpreter that runs the script, but we call it the sleepy_creator.py process for simplicity. No output will be provided by the above command, as the parent process (sleepy_creator.py) dies before its child processes (the 10 sleep 1000 subprocesses) finish their execution. The parent process of the newly created child processes is an init-like process: either systemd/init or another system process that adopts orphan processes. Look for the sleep child processes using:
student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep sleep)
PID PPID CMD
4164 1680 sleep 1000
4165 1680 sleep 1000
4166 1680 sleep 1000
4167 1680 sleep 1000
4168 1680 sleep 1000
4169 1680 sleep 1000
4170 1680 sleep 1000
4171 1680 sleep 1000
4172 1680 sleep 1000
4173 1680 sleep 1000
Notice that the child processes do not have sleepy_creator.py as a parent. What's more, as you saw above, sleepy_creator.py doesn't even exist anymore. The child processes have been adopted by an init-like process (in the output above, that process has PID 1680; PPID stands for parent process ID).
Solve the TODO: change the code in sleepy_creator_wait.py so that the sleep 1000 processes remain the children of sleepy_creator_wait.py. This means that the parent / creator process must not exit until its children have finished their execution. In other words, the parent / creator process must wait for the termination of its children. Check out Popen.wait() and add the code that makes the parent / creator process wait for its children. Before anything, terminate the sleep processes created above:
student@os:~$ pkill sleep
Start the program, again, as you did before:
student@os:~/.../tasks/sleepy/support$ python3 sleepy_creator.py
On another terminal, verify that sleepy_creator_wait.py remains the parent of the sleep processes it creates:
student@os:~$ ps -e -H -o pid,ppid,cmd | (head -1; grep sleep)
PID PPID CMD
16107 9855 python3 sleepy_creator.py
16108 16107 sleep 1000
16109 16107 sleep 1000
16110 16107 sleep 1000
16111 16107 sleep 1000
16112 16107 sleep 1000
16113 16107 sleep 1000
16114 16107 sleep 1000
16115 16107 sleep 1000
16116 16107 sleep 1000
16117 16107 sleep 1000
Note that the parent process sleepy_creator_wait.py (PID 16107) is still alive, and its child processes (the 10 sleep 1000) have its ID as their PPID. You've successfully waited for the child processes to finish their execution.
If you're having difficulties solving this exercise, go through this reading material.
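The spawn-then-wait pattern described above can be sketched as follows. This is only an illustration, not the lab's reference solution: the child command is a hypothetical short-lived stand-in for sleep 1000, and spawn_and_wait() is a name made up for this sketch.

```python
import subprocess
import sys

# Hypothetical stand-in for `sleep 1000`: a short-lived Python child,
# so the sketch finishes quickly.
CHILD_CMD = [sys.executable, "-c", "import time; time.sleep(0.1)"]


def spawn_and_wait(count):
    """Spawn `count` children, then block until each one has exited."""
    children = [subprocess.Popen(CHILD_CMD) for _ in range(count)]
    # Popen.wait() blocks until the child terminates and returns its exit code.
    # While the parent is blocked here, it remains the parent of its children.
    return [child.wait() for child in children]


if __name__ == "__main__":
    print(spawn_and_wait(3))
```

All children are started before the first wait() call, so they run concurrently and the total time is close to that of a single child.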
Lower level - C
Now let's see how to create a child process in C. There are multiple ways of doing this. For now, we'll start with a higher-level approach.
Go to sleepy_creator.c and use system to create a sleep 1000 process.
The man page also mentions that system calls fork() and exec() to run the command it's given.
If you want to find out more about them, head over to the Arena and create your own mini-shell.
Task: Wait for Me
Enter the chapters/compute/processes/drills/tasks/wait-for-me-processes/ directory, run make skels, open the support/src folder and go through the practice items below.
Use the tests/checker.sh script to check your solutions.
wait_for_me_processes ...................... passed ... 100
100 / 100
- Run the code in wait_for_me_processes.py (e.g. python3 wait_for_me_processes.py). The parent process creates one child that writes a message to the given file. Then the parent reads that message. Simple enough, right? But running the code raises a FileNotFoundError. If you inspect the file you gave the script as an argument, it does contain a string. What's going on?
In order to solve race conditions, we need **synchronization**.
This is a mechanism similar to a set of traffic lights in a crossroads.
Just like traffic lights allow some cars to pass only after others have already passed, synchronization is a means for threads to communicate with each other and tell each other to access a resource or not.
The most basic form of synchronization is **waiting**.
Concretely, if the parent process **waits** for the child to end, we are sure the file is created and its contents are written.
- Use join() to make the parent wait for its child before reading the file.
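As a rough sketch of waiting as synchronization, the snippet below has a child process write a file and makes the parent call join() before reading it. The function names are hypothetical and the code is not taken from wait_for_me_processes.py. It also assumes a Unix system, because it explicitly requests the "fork" start method.

```python
import multiprocessing as mp
import os
import tempfile


def write_message(path, message):
    """Runs in the child: create the file and write the message."""
    with open(path, "w") as f:
        f.write(message)


def child_writes_parent_reads(path, message):
    ctx = mp.get_context("fork")  # assumption: Unix-only start method
    child = ctx.Process(target=write_message, args=(path, message))
    child.start()
    # Without this join(), the parent could try to open the file
    # before the child has created it.
    child.join()
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "message.txt")
    print(child_writes_parent_reads(demo_path, "hello from the child"))
```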
Task: Create Process
Enter the chapters/compute/processes/drills/tasks/create-process/ directory, run make skels, open the support/src folder and go through the practice items below.
Use the tests/checker.sh script to check your solutions.
./checker.sh
exit_code22 ...................... passed ... 50
second_fork ...................... passed ... 50
100 / 100
Change the return value of the child process to 22 so that the value displayed by the parent is changed.
Create a child process of the newly created child.
Use a similar logic and a similar set of prints to those in the support code. Take a look at the printed PIDs. Make sure the PPID of the "grandchild" is the PID of the child, whose PPID is, in turn, the PID of the parent.
Task: Multithreaded
Enter the chapters/compute/threads/drills/tasks/multithreaded/ folder, run make skels, and go through the practice items below in the support/ directory.
Use the Makefile to compile multithreaded.c, run it and follow the instructions.
The aim of this task is to familiarize you with the pthreads library. In order to use it, you have to add #include <pthread.h> in multithreaded.c and -lpthread in the compiler options.
The executable creates 5 threads besides the main thread, puts each of them to sleep for 5 seconds, then waits for all of them to finish. Give it a run and notice that the total waiting time is around 5 seconds since you started the last thread. That is the whole point: they all run in parallel.
Make each thread print its ID once it is done sleeping.
Create a new function sleep_wrapper2() identical to sleep_wrapper() to organize your work. So far, the data argument is unused (mind the __unused attribute), so that is your starting point. You cannot change the definition of sleep_wrapper2(), since pthread_create() expects a pointer to a function that receives a void * argument. What you can and should do is to pass a pointer to an int as argument, and then cast data to int * inside sleep_wrapper2().
Note: Do not simply pass &i as argument to the function. This will make all threads use the same integer as their ID.
Note: Do not use global variables.
If you get stuck you can google pthread example and you will probably stumble upon this.
On top of printing its ID upon completion, make each thread sleep for a different amount of time.
Create a new function sleep_wrapper3() identical to sleep_wrapper() to organize your work. The idea is to repeat what you did on the previous exercise and use the right argument for sleep_wrapper3(). Keep in mind that you cannot change its definition. Bonus points if you do not use the thread's ID as the sleeping amount.
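The idea of handing each thread its own argument can also be illustrated in Python. The sketch below is only an analogue of the pthreads exercise, not its C solution, and the names sleep_worker() and run_workers() are invented here. Each thread receives its own ID and its own sleep duration through the args tuple, without global variables.

```python
import threading
import time


def sleep_worker(thread_id, seconds, finished):
    """Sleep for `seconds`, then record this thread's ID."""
    time.sleep(seconds)
    finished.append(thread_id)  # appending to a list is thread-safe in CPython


def run_workers(count):
    finished = []
    threads = []
    for i in range(count):
        # The args tuple is built now, so each thread gets its own value of i
        # and its own sleep duration, unrelated to later loop iterations.
        t = threading.Thread(target=sleep_worker,
                             args=(i, 0.05 * (count - i), finished))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    return finished


if __name__ == "__main__":
    print(run_workers(4))
```

Because the sleep durations differ, the IDs are usually recorded in a different order than the one in which the threads were started.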
Task: Libraries for Parallel Processing
In chapters/compute/threads/drills/tasks/sum-array/support/c/sum_array_threads.c we spawned threads "manually" by using the pthread_create() function.
pthread_create() is not a syscall itself, but a library wrapper over a syscall. The same syscall is also used by fork() (which is not a syscall either).
In order to see what syscall pthread_create() uses, check out this section.
Most programming languages provide a more advanced API for handling parallel computation.
Array Sum in Python
Let's first probe this by implementing two parallel versions of the code in sum-array/support/python/sum_array_sequential.py.
One version should use threads and the other should use processes.
Run each of them using 1, 2, 4, and 8 threads / processes respectively and compare the running times.
Notice that the running times of the multithreaded implementation do not decrease.
This is because the GIL (Global Interpreter Lock) allows only one thread at a time to execute Python code, so the threads that you create essentially run sequentially.
The GIL also makes it so that individual Python bytecode instructions are atomic.
Run the code in chapters/compute/synchronization/drills/tasks/race-condition/support/python/race_condition.py.
Every time, var will be 0 because the GIL doesn't allow the two threads to run in parallel and reach the critical section at the same time.
This means that, in this program, the instructions var += 1 and var -= 1 behave as if they were atomic.
If you're having difficulties solving this exercise, go through this reading material.
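As an example of the higher-level APIs mentioned above, Python's standard library offers concurrent.futures. The sketch below splits an array into chunks and sums them on a thread pool. The function name and the chunking strategy are assumptions made for illustration, not the code from the lab. On CPython, expect no speedup from it for this CPU-bound work because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor


def threaded_sum(arr, num_threads):
    """Sum `arr` by giving each pool thread one contiguous chunk."""
    if not arr:
        return 0
    chunk_size = (len(arr) + num_threads - 1) // num_threads  # ceiling division
    chunks = [arr[i:i + chunk_size] for i in range(0, len(arr), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as executor:
        # executor.map() runs sum() on every chunk and yields the partial sums.
        return sum(executor.map(sum, chunks))


if __name__ == "__main__":
    print(threaded_sum(list(range(1000)), 4))
```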
Task: Wait for It
The process that spawns all the others and subsequently calls waitpid to wait for them to finish can also get their return codes.
Update the code in chapters/compute/threads/drills/tasks/sum-array-bugs/support/seg-fault/sum_array_processes.c and modify the call to waitpid to obtain and investigate this return code.
Display an appropriate message if one of the child processes returns an error.
Remember to use the appropriate macros for handling the status variable that is modified by waitpid(), as it is a bit-field.
When a process runs into a system error, it receives a signal.
A signal is a means to interrupt the normal execution of a program from the outside.
It is associated with a number.
Use kill -l to find the full list of signals.
So up to this point we've seen that one advantage of processes is that they offer better safety than threads. Because they use separate virtual address spaces, sibling processes are better isolated than threads. Thus, an application that uses processes can be more robust to errors than if it were using threads.
Memory Corruption
Because they share the same address space, threads run the risk of corrupting each other's data.
Take a look at the code in sum-array-bugs/support/memory-corruption/python/.
The two programs only differ in how they spread their workload.
One uses threads while the other uses processes.
Run both programs with and without memory corruption. Pass any value as a third argument to trigger the corruption.
student@os:~/.../sum-array-bugs/support/memory-corruption/python$ python3 memory_corruption_processes.py <number_of_processes> # no memory corruption
[...]
student@os:~/.../sum-array-bugs/support/memory-corruption/python$ python3 memory_corruption_processes.py <number_of_processes> 1 # do memory corruption
[...]
The one using threads will most likely print a negative sum, while the other displays the correct sum.
This happens because all threads refer to the same memory for the array arr.
What happens to the processes is a bit more complicated.
Later in this lab, we will see that initially, the page tables of all processes point to the same physical frames of arr.
When the malicious process tries to corrupt this array by writing data to it, the OS duplicates the original frames of arr so that the malicious process writes the corrupted values to these new frames, while leaving the original ones untouched.
This mechanism is called Copy-on-Write and is an OS optimisation so that memory is shared between the parent and the child process, until one of them attempts to write to it.
At this point, this process receives its own separate copies of the previously shared frames.
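The difference can be observed with a small experiment. The sketch below is not the lab's memory_corruption code, and its function names are hypothetical. It lets a thread and a forked child each overwrite the first element of a list, then checks what the parent sees afterwards. It assumes a Unix system where os.fork() is available.

```python
import os
import threading


def corrupt(arr):
    """Simulated memory corruption: overwrite the first element."""
    arr[0] = -1


def corrupt_in_thread(arr):
    t = threading.Thread(target=corrupt, args=(arr,))
    t.start()
    t.join()
    return arr[0]  # the thread wrote to the very same list


def corrupt_in_child_process(arr):
    pid = os.fork()
    if pid == 0:
        corrupt(arr)  # modifies only the child's private copy of the memory
        os._exit(0)
    os.waitpid(pid, 0)
    return arr[0]  # the parent's list is untouched


if __name__ == "__main__":
    print(corrupt_in_thread([1, 2, 3]))         # the write is visible
    print(corrupt_in_child_process([1, 2, 3]))  # the write is not visible
```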
Note that in order for the processes to share the sums dictionary, it is not created as a regular dictionary, but through the Manager class from Python's multiprocessing module.
A Manager provides special data structures that are shared between processes so that all of them can access the same data.
You can learn more about shared memory and its various implementations in this section.
If you're having difficulties solving this exercise, go through this reading material.
Hardware Perspective
The main criterion we use to rank CPUs is their computation power, i.e. their ability to crunch numbers and do math. Numerous benchmarks exist out there, and they are publicly displayed on sites such as CPUBenchmark.
For example, a benchmark can measure the performance of the computer's CPU in a variety of scenarios:
- its ability to perform integer operations
- its speed in floating point arithmetic
- data encryption and compression
- sorting algorithms and others
You can take a look at what exactly is measured using this link. It displays the scores obtained by a high-end CPU. Apart from the tests above, other benchmarks might focus on different performance metrics, such as branch prediction or prefetching.
Other approaches are less artificial, measuring performance on real-world applications such as compile times and performance in the latest (and most resource-demanding) video games. The latter metric revolves around how many average FPS (frames per second) a given CPU is able to crank out in a specific video game. You can find a lot of articles online on how CPU benchmarking is done.
Most benchmarks, unfortunately, are not open source, especially the more popular ones, such as Geekbench 5. Despite this shortcoming, benchmarks are widely used to compare the performance of various computer hardware, CPUs included.
The Role of the Operating System
As you've seen so far, the CPU provides the "muscle" required for fast computation, i.e. the highly optimised hardware and multiple ALUs, FPUs and cores necessary to perform those computations. However, it is the operating system that provides the "brains" for this computation. Specifically, modern CPUs have the capacity to run multiple tasks in parallel. But they do not provide a means to decide which task to run at each moment. The OS comes as an orchestrator to schedule the way these tasks (that we will later call threads) are allowed to run and use the CPU's resources. This way, the OS tells the CPU what code to run on each CPU core so that it reaches a good balance between high throughput (running many instructions) and fair access to CPU cores.
It is cumbersome for a user-level application to interact directly with the CPU. The developer would have to write hardware-specific code, which is not scalable and is difficult to maintain. In addition, doing so would leave it up to the developer to isolate their application from the others that are present on the system. This leaves applications vulnerable to countless bugs and exploits.
To guard apps from these pitfalls, the OS comes and mediates interactions between regular programs and the CPU by providing a set of abstractions. These abstractions offer a safe, uniform and also isolated way to leverage the CPU's resources, i.e. its cores. There are 2 main abstractions: processes and threads.
As we can see from the image above, an application can spawn one or more processes. Each of these is handled and maintained by the OS. Similarly, each process can spawn however many threads, which are also managed by the OS. The OS decides when and on what CPU core to make each thread run. This is in line with the general interaction between an application and the hardware: it is always mediated by the OS.
Processes
A process is simply a running program.
Let's take the ls command as a trivial example.
ls is a program on your system.
It has a binary file which you can find and inspect with the help of the which command:
student@os:~$ which ls
/usr/bin/ls
student@os:~$ file /usr/bin/ls
/usr/bin/ls: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=6e3da6f0bc36b6398b8651bbc2e08831a21a90da, for GNU/Linux 3.2.0, stripped
When you run it, the ls binary stored on the disk at /usr/bin/ls is read by another application called the loader.
The loader spawns a process by copying some of the contents of /usr/bin/ls into memory (such as the .text, .rodata and .data sections).
Using strace, we can see the execve system call:
student@os:~$ strace -s 100 ls -a # -s 100 limits strings to 100 bytes instead of the default 32
execve("/usr/bin/ls", ["ls", "-a"], 0x7fffa7e0d008 /* 61 vars */) = 0
[...]
write(1, ". .. content\tCONTRIBUTING.md COPYING.md .git .gitignore README.md REVIEWING.md\n", 86. .. content CONTRIBUTING.md COPYING.md .git .gitignore README.md REVIEWING.md
) = 86
close(1) = 0
close(2) = 0
exit_group(0) = ?
+++ exited with 0 +++
Look at its parameters:
- the path to the program: /usr/bin/ls
- the list of arguments: "ls", "-a"
- the environment variables: the rest of the syscall's arguments
execve invokes the loader to load the VAS of the ls process by replacing that of the existing process.
All subsequent syscalls are performed by the newly spawned ls process.
We will get into more details regarding execve towards the end of this lab.
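The three parameters of execve (path, argument list, environment) can be seen in a short Python sketch that uses os.execve(), a thin wrapper over the same syscall. The function name and the DEMO_CODE environment variable are made up for this illustration. Because execve replaces the calling process, the sketch forks first and lets the child call it.

```python
import os
import sys


def run_with_execve(exit_code):
    """Fork, replace the child with a new program, return its exit code."""
    pid = os.fork()
    if pid == 0:
        program = sys.executable                      # path to the program
        argv = [program, "-c",                        # list of arguments
                "import os, sys; sys.exit(int(os.environ['DEMO_CODE']))"]
        env = dict(os.environ, DEMO_CODE=str(exit_code))  # environment variables
        try:
            os.execve(program, argv, env)  # does not return on success
        finally:
            os._exit(127)  # reached only if execve failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_with_execve(7))
```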
Fork
Up to now we've been creating processes using various high-level APIs, such as Popen(), Process() and system().
Yes, despite being a C function, as you've seen from its man page, system() itself calls 2 other functions: fork() to create a process and execve() to execute the given command.
As you already know from the Software Stack chapter, library functions may call one or more underlying system calls or other functions.
Now we will move one step lower on the call stack and call fork() ourselves.
fork() creates one child process that is almost identical to its parent.
We say that fork() returns twice: once in the parent process and once more in the child process.
This means that after fork() returns, assuming no error has occurred, both the child and the parent resume execution from the same place: the instruction following the call to fork().
What's different between the two processes is the value returned by fork():
- child process: fork() returns 0
- parent process: fork() returns the PID of the child process (> 0)
- on error: fork() returns -1, only once, in the initial process
Therefore, the typical code for handling a fork() is available in create-process/support/fork.c.
Take a look at it and then run it.
Notice what each of the two processes prints:
- the PID of the child is also known by the parent
- the PPID of the child is the PID of the parent
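The same observations can be reproduced with Python's os.fork(), which wraps the C function. The sketch below is not the code in fork.c, and the name fork_once() is hypothetical. The child reports its PID and PPID to the parent through a pipe. One difference from C: on error, os.fork() raises OSError instead of returning -1.

```python
import os


def fork_once():
    """Fork one child and collect the PIDs seen by both processes."""
    read_fd, write_fd = os.pipe()
    parent_pid = os.getpid()
    pid = os.fork()  # returns 0 in the child, the child's PID in the parent
    if pid == 0:
        os.close(read_fd)
        os.write(write_fd, f"{os.getpid()} {os.getppid()}".encode())
        os._exit(0)
    os.close(write_fd)
    child_pid, child_ppid = map(int, os.read(read_fd, 64).split())
    os.close(read_fd)
    os.waitpid(pid, 0)
    return {"fork_returned": pid, "child_pid": child_pid,
            "child_ppid": child_ppid, "parent_pid": parent_pid}


if __name__ == "__main__":
    print(fork_once())
```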
Unlike system(), which also waits for its child, when using fork() we must do the waiting ourselves.
In order to wait for a process to end, we use the waitpid() syscall.
It places the exit code of the child process in the status parameter.
This argument is actually a bit-field containing more information than merely the exit code.
To retrieve the exit code, we use the WEXITSTATUS macro.
Keep in mind that WEXITSTATUS only makes sense if WIFEXITED is true, i.e. if the child process finished on its own and wasn't killed by another one or by an illegal action (such as a segfault or illegal instruction) for example.
Otherwise, WEXITSTATUS will return something meaningless.
You can view the rest of the information stored in the status bit-field in the man page.
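Python exposes the same status macros as functions in the os module, so the decoding logic can be sketched as below. The function name and the two child behaviours ("exit" and "signal") are assumptions chosen for this illustration. The child either exits with code 22 or kills itself with SIGKILL.

```python
import os
import signal


def describe_child(action):
    """Fork a child that exits or gets killed, then decode its status."""
    pid = os.fork()
    if pid == 0:
        if action == "exit":
            os._exit(22)
        os.kill(os.getpid(), signal.SIGKILL)  # the child kills itself
        os._exit(0)  # not reached
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        # The exit code is meaningful only in this branch.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)


if __name__ == "__main__":
    print(describe_child("exit"))
    print(describe_child("signal"))
```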
Moral of the story: Usually the execution flow is:
- fork(), followed by
- wait() (called by the parent)
- exit(), called by the child.
The order of the last 2 steps may be swapped.
Threads
Threads vs Processes
So why use the implementation that spawns more processes if it's slower than the one using threads? The table below lists the differences between threads and processes. Generally, if we only want to do some computing, we use threads. If we need to drastically change the behaviour of the program, we need a new program altogether, or we need more than computing (e.g. communication on the network to create a computing cluster), we use processes.
PROCESS | THREAD |
---|---|
independent | part of a process |
collection of threads | shares VAS with other threads |
slower creation (new page table must be created) | faster creation |
longer context switch duration (TLB must be flushed) | shorter context switch duration (part of the same process, so same TLB) |
ending means ending all threads | other threads continue when finished |
Safety
Compile and run the two programs in chapters/compute/threads/drills/tasks/sum-array-bugs/support/seg-fault/, first with 2 processes and threads and then with 4.
They do the same thing as before: compute the sum of the elements in an array, but with a twist: each of them contains a bug causing a segfault.
Notice that sum_array_threads doesn't print anything with 4 threads, but merely a "Segmentation fault" message.
On the other hand, sum_array_processes prints a sum and a running time, albeit different from the sums we've seen so far.
The reason is that signals such as SIGSEGV, which is sent when a segmentation fault happens, affect the entire process that receives them.
Therefore, when we split our workload between several threads and one of them causes an error such as a segfault, that error is going to terminate the entire process.
When we use processes instead of threads, the same mechanism applies to a single process: one process causes an error, which gets it killed, but the other processes continue their work unhindered.
This is why we end up with a lower sum in the end: because one process died too early and didn't manage to write the partial sum it had computed to the results array.
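The "lower sum" effect can be simulated in a few lines. This is a hypothetical sketch, not the code in seg-fault/: each worker process reports its partial sum through a pipe, and one worker can be told to die from SIGSEGV before reporting. It assumes that the array length is a multiple of the number of processes.

```python
import os
import signal


def sum_with_crash(arr, num_procs, crashing_worker=None):
    """Return (total, number_of_killed_workers)."""
    chunk = len(arr) // num_procs  # assumption: divides evenly
    children = []
    for idx in range(num_procs):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            os.close(read_fd)
            if idx == crashing_worker:
                # Simulate the bug: die from SIGSEGV before reporting.
                signal.signal(signal.SIGSEGV, signal.SIG_DFL)
                os.kill(os.getpid(), signal.SIGSEGV)
            part = sum(arr[idx * chunk:(idx + 1) * chunk])
            os.write(write_fd, str(part).encode())
            os._exit(0)
        os.close(write_fd)
        children.append((pid, read_fd))
    total, killed = 0, 0
    for pid, read_fd in children:
        data = os.read(read_fd, 64)  # empty if the worker died first
        os.close(read_fd)
        _, status = os.waitpid(pid, 0)
        if os.WIFSIGNALED(status):
            killed += 1
        elif data:
            total += int(data)
    return total, killed


if __name__ == "__main__":
    print(sum_with_crash(list(range(8)), 4))
    print(sum_with_crash(list(range(8)), 4, crashing_worker=3))
```

The surviving workers still deliver their partial sums, so the parent prints a result, only a smaller one.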
Memory Layout of Multithreaded Programs
When a new thread is created, a new stack is allocated for it.
The default stack size is 8 MB (8192 KB):
student@os:~$ ulimit -s
8192
Enter the chapters/compute/threads/drills/tasks/multithreaded/support/ directory to observe the update of the memory layout when creating new threads.
Build the multithreaded executable:
student@os:~/.../multithreaded/support$ make
Start the program:
student@os:~/.../multithreaded/support$ ./multithreaded
Press key to start creating threads ...
[...]
And investigate it with pmap on another console, while pressing a key to create new threads.
As you can see, there is a new 8192 KB area created for every thread, also increasing the total virtual size.
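The value printed by ulimit -s is the soft limit of the RLIMIT_STACK resource, expressed in KB. As a small sketch, the same limit can be read from a program through Python's resource module. The helper name below is made up for this example.

```python
import resource


def stack_limit_kb():
    """Return the soft stack size limit in KB, or None if it is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return None
    return soft // 1024  # getrlimit() reports bytes


if __name__ == "__main__":
    print(stack_limit_kb())
```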
Guide: Baby steps - Python
Run the code in chapters/compute/processes/guides/create-process/support/popen.py.
It simply spawns a new process running the ls command using subprocess.Popen().
Do not worry about the huge list of arguments that Popen() takes.
They are used for inter-process communication.
You'll learn more about this in the Application Interaction chapter.
Note that this usage of Popen() is not entirely correct.
You'll discover why later, but for now focus on simply understanding how to use Popen() on its own.
Now change the command to anything you want. Also give it some arguments. From the outside, it's as if you were running these commands from the terminal.
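For experimenting, a command with arguments can be passed to Popen() as a list. The sketch below is not the content of popen.py, and run_and_capture() is a hypothetical helper. It also shows one of the inter-process communication arguments: stdout=subprocess.PIPE lets the parent read what the child prints.

```python
import subprocess
import sys


def run_and_capture(argv):
    """Run argv as a child process and return (exit_code, stdout_text)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, text=True)
    # communicate() reads the pipe until EOF and then waits for the child.
    out, _ = proc.communicate()
    return proc.returncode, out


if __name__ == "__main__":
    # The first list element is the program, the rest are its arguments.
    print(run_and_capture([sys.executable, "-c", "print('hello from the child')"]))
```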
Guide: Sum Array Processes
Sum of the Elements in an Array
Let's assume we only have one process on our system, and that process knows how to add the numbers in an array.
It can use however many resources it wants, since there is no other process to contest it.
It would probably look like the code in chapters/compute/processes/guides/sum-array-processes/support/c/sum_array_sequential.c.
The program also measures the time spent computing the sum.
Let's compile and run it:
student@os:~/.../sum-array/support/c$ ./sum_array_sequential
Array sum is: 49945994146
Time spent: 127 ms
You will most likely get a different sum (because the array is made up of random numbers) and a different time than the ones shown above. This is perfectly fine. Use these examples qualitatively, not quantitatively.
Spreading the Work Among Other Processes
Due to how it's implemented so far, our program only uses one of our CPU's cores. We never tell it to distribute its workload to other cores. This is wasteful as the rest of our cores remain unused:
student@os:~$ lscpu | grep ^CPU\(s\):
CPU(s): 8
We have 7 more cores waiting to add numbers in our array.
What if we use 7 more processes and spread the task of adding the numbers in this array between them? If we split the array into several equal parts and designate a separate process to calculate the sum of each part, we should get a speedup because now the work performed by each individual process is reduced.
Let's take it methodically.
Compile and run sum_array_processes.c using 1, 2, 4 and 8 processes respectively.
If your system only has 4 cores (hyperthreading included), limit your runs to 4 processes.
Note the running times for each number of processes.
We expect the speedups compared to our reference run to be 1, 2, 4 and 8 respectively, right?
You most likely did get some speedup, especially when using 8 processes. Now we will try to improve this speedup by using threads instead.
Also notice that we're not using hundreds or thousands of processes. Assuming our system has 8 cores, only 8 threads can run at the same time. In general, the maximum number of threads that can run at the same time is equal to the number of cores. In our example, each process only has one thread: its main thread. So by consequence and by forcing the terminology (because it's the main thread of these processes that is running, not the processes themselves), we can only run in parallel a number of processes equal to at most the number of cores.
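The strategy of splitting the array between worker processes, with no more workers than cores, can be sketched in Python with a process pool. This is an illustration rather than a port of sum_array_processes.c. The name pool_sum() is hypothetical and the "fork" start method assumes a Unix system.

```python
import multiprocessing as mp
import os


def pool_sum(arr, num_procs):
    """Sum a non-empty list by splitting it between worker processes."""
    # Running more workers than cores brings no extra parallelism.
    num_procs = max(1, min(num_procs, os.cpu_count() or 1))
    chunk_size = (len(arr) + num_procs - 1) // num_procs  # ceiling division
    chunks = [arr[i:i + chunk_size] for i in range(0, len(arr), chunk_size)]
    ctx = mp.get_context("fork")  # assumption: Unix-only start method
    with ctx.Pool(num_procs) as pool:
        # Each worker computes the sum of one chunk; the parent adds them up.
        return sum(pool.map(sum, chunks))


if __name__ == "__main__":
    print(pool_sum(list(range(1000)), 4))
```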
Guide: system Dissected
You already know that system calls fork() and execve() to create the new process.
Let's see how and why.
First, we run the following command to trace the execve() syscalls used by sleepy_creator.
We'll leave fork() for later.
student@os:~/.../sleepy/support$ strace -e execve -ff -o syscalls ./sleepy_creator
At this point, you will get two files whose names start with syscalls, followed by some numbers.
Those numbers are the PIDs of the parent and the child process.
Since PIDs are usually assigned in increasing order, the file with the lower number contains the logs of the execve syscalls issued by the parent process, while the other logs those issued by the child process.
Let's take a look at them.
The numbers below will differ from those on your system:
student@os:~/.../sleepy/support:$ cat syscalls.2523393 # syscalls from parent process
execve("sleepy_creator", ["sleepy_creator"], 0x7ffd2c157758 /* 39 vars */) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=2523394, si_uid=1052093, si_status=0, si_utime=0, si_stime=0} ---
+++ exited with 0 +++
student@os:~/.../sleepy/support:$ cat syscalls.2523394 # syscalls from child process
execve("/bin/sh", ["sh", "-c", "sleep 10"], 0x7ffd36253be8 /* 39 vars */) = 0
execve("/usr/bin/sleep", ["sleep", "10"], 0x560f41659d40 /* 38 vars */) = 0
+++ exited with 0 +++
Now notice that the child process doesn't simply call execve("/usr/bin/sleep" ...).
It first changes its virtual address space (VAS) to that of a shell process (execve("/bin/sh" ...)) and then that shell process switches its VAS to sleep.
Therefore, calling system(<some_command>) is equivalent to running <some_command> in the command-line.
Moral of the story: When spawning a new command, the call order is:
- parent: fork(), wait()
- child: exec(), exit()
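Python's os.system() calls the C system() function, so it can be used to observe the sh -c behaviour discussed above. In the sketch below, the commands are shell syntax that only a shell can interpret, and the returned value is a wait()-style status that must be decoded. The helper name shell_exit_code() is hypothetical.

```python
import os


def shell_exit_code(command):
    """Run `command` through the shell and return its exit code."""
    status = os.system(command)  # on Unix: a status encoded like wait()'s
    return os.WEXITSTATUS(status)


if __name__ == "__main__":
    # `exit` and `&&` are interpreted by the shell started by system().
    print(shell_exit_code("exit 3"))
    print(shell_exit_code("true && exit 5"))
```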
Guide: Sum array Threads
Spreading the Work Among Other Threads
Compile the code in chapters/compute/threads/guides/sum-array-threads/support/c/sum_array_threads.c and run it using 1, 2, 4 and 8 threads as you did before.
Each thread runs the calculate_array_part_sum() function and then finishes.
Running times should be slightly smaller than those of the implementation using processes.
This slight time difference is caused by process creation actions, which are costlier than thread creation actions.
Because a process needs a separate virtual address space (VAS) and needs to duplicate some internal structures such as the file descriptor table and page table, it takes the operating system more time to create it than to create a thread.
On the other hand, threads belonging to the same process share the same VAS and, implicitly, the same OS-internal structures.
Therefore, they are more lightweight than processes.
std.parallelism in D
The D language's standard library exposes the std.parallelism module, which provides a series of parallel processing functions.
One such function is reduce(), which splits an array between a given number of threads and applies a given operation to these chunks.
In our case, the operation simply adds the elements to an accumulator: a + b.
Follow and run the code in chapters/compute/threads/guides/sum-array-threads/support/d/sum_array_threads_reduce.d.
The number of threads is used within a TaskPool.
This structure is a thread manager (not scheduler).
It silently creates the number of threads we request and then reduce() spreads its workload between these threads.
Now that you've seen how parallelism works in D, go to chapters/compute/threads/guides/sum-array-threads/support/java/SumArrayThreads.java and follow the TODOs.
The code is similar to the one written in D, and it uses ThreadPoolExecutor.
More about that here.
To run the code use:
javac SumArrayThreads.java
java SumArrayThreads 4
4 is the number of threads used, but you can replace the value with any number less than or equal to the number of available cores.
OpenMP for C
Unlike D, C does not support parallel computation by design.
It needs a library to do advanced things, like reduce() from D.
We have chosen to use the OpenMP library for this.
Follow the code in chapters/compute/threads/guides/sum-array-threads/support/c/sum_array_threads_openmp.c.
The #pragma used in the code instructs the compiler to enable the omp module, and to parallelise the code.
In this case, we instruct the compiler to perform a reduce of the array, using the + operator, and to store the results in the result variable.
This reduction uses threads to calculate the sum, similar to sum_array_threads.c, but in a much more optimised form.
One of the advantages of OpenMP is that it is relatively easy to use.
The syntax requires only a few additional lines of code and compiler options, thus converting sequential code into parallel code quickly.
For example, using #pragma omp parallel for, a developer can parallelize a for loop, enabling iterations to run across multiple threads.
OpenMP uses a shared-memory model, meaning all threads can access a common memory space.
This model is particularly useful for tasks that require frequent access to shared data, as it avoids the overhead of transferring data between threads.
However, shared memory can also introduce challenges, such as race conditions or synchronization issues, which can occur when multiple threads attempt to modify the same data simultaneously, but we'll talk about that later.
OpenMP offers constructs such as critical sections, atomic operations, and reductions to help manage these issues and ensure that parallel code executes safely and correctly.
Now compile and run the sum_array_threads_openmp binary using 1, 2, 4, and 8 threads as before.
You'll see lower running times than sum_array_threads due to the highly-optimised code emitted by the compiler.
For this reason and because library functions are usually much better tested than your own code, it is generally preferable to use a library function for a given task.
For a challenge, open chapters/compute/threads/guides/sum-array-threads/support/c/add_array_threads_openmp.c.
Use what you've learned from the previous exercise and add the value 100 to an array using OpenMP.
Guide: Threads and Processes: clone
Let's go back to our initial demos that used threads and processes.
We'll see that in order to create both threads and processes, the underlying Linux syscall is clone.
For this, we'll run both sum_array_threads and sum_array_processes under strace.
As we've already established, we're only interested in the clone syscall:
student@os:~/.../sum-array/support/c$ strace -e clone,clone3 ./sum_array_threads 2
clone(child_stack=0x7f60b56482b0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1819693], tls=0x7f60b5649640, child_tidptr=0x7f60b5649910) = 1819693
clone(child_stack=0x7f60b4e472b0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tid=[1819694], tls=0x7f60b4e48640, child_tidptr=0x7f60b4e48910) = 1819694
student@os:~/.../sum-array/support/c$ strace -e clone,clone3 ./sum_array_processes 2
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7a4e346650) = 1820599
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f7a4e346650) = 1820600
We ran each program with an argument of 2, so we have 2 calls to clone.
Notice that in the case of threads, the clone syscall receives more arguments.
The relevant flags passed as arguments when creating threads are documented in clone's man page:
- CLONE_VM: the child and the parent process share the same VAS
- CLONE_{FS,FILES,SIGHAND}: the new thread shares the filesystem information, file descriptors and signal handlers with the one that created it.
The syscall also receives valid pointers to the new thread's stack and TLS, i.e. the only parts of the VAS that are distinct between threads (although they are technically accessible from all threads).
By contrast, when creating a new process, the arguments of the clone syscall are simpler (i.e. fewer flags are present).
Remember that in both cases clone creates a new thread.
When creating a process, clone creates this new thread within a new separate address space.
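One visible consequence of these flags can be checked from Python without calling clone directly. In the sketch below, whose function names are hypothetical, a new thread reports the same PID as its creator but a different kernel thread ID, while a forked child reports a different PID. It assumes Python 3.8 or newer for threading.get_native_id() and a Unix system for os.fork().

```python
import os
import threading


def observe_thread():
    """Return the PID and kernel thread ID seen from inside a new thread."""
    seen = {}

    def worker():
        seen["pid"] = os.getpid()
        seen["tid"] = threading.get_native_id()

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return seen


def observe_child_process():
    """Return the PID seen from inside a forked child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os.write(write_fd, str(os.getpid()).encode())
        os._exit(0)
    os.close(write_fd)
    child_pid = int(os.read(read_fd, 32))
    os.close(read_fd)
    os.waitpid(pid, 0)
    return child_pid


if __name__ == "__main__":
    print(os.getpid(), threading.get_native_id())
    print(observe_thread())
    print(observe_child_process())
```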