Lab 8 - Synchronization
Task: Race Conditions
C - TLS on Demand
The perspective of C towards TLS is the following: everything is shared by default. This makes multithreading easier and more lightweight to implement than in other languages, like D, because synchronization is left entirely up to the developer, at the cost of potential unsafety.
Of course, we can specify that some data belongs to the TLS by preceding the declaration of a variable with the __thread keyword.
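For illustration, here is a minimal standalone sketch (not part of the support code) of a TLS variable: each thread sees its own copy of var, with its own address. Compile it with gcc -pthread.

#include <stdio.h>
#include <pthread.h>

/* Each thread gets its own copy of var; without __thread it would be shared. */
static __thread int var = 0;

static void *worker(void *arg)
{
    var += 1;   /* modifies only this thread's copy */
    printf("thread %ld: var = %d, stored at %p\n", (long) arg, var, (void *) &var);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, (void *) 1);
    pthread_create(&t2, NULL, worker, (void *) 2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* main's copy is unaffected by the other threads */
    printf("main: var = %d, stored at %p\n", var, (void *) &var);
    return 0;
}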
First, compile and run the code in chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_tls.c a few times.
As expected, the result is different each time.
1. Modify the declaration of var and add the __thread keyword to place the variable in the TLS of each thread. Recompile and run the code a few more times. You should see that in the end, var is 0.
2. Print the address and value of var in each thread. See that they differ.
3. Modify the value of var in the main() function before calling pthread_create(). Notice that the value doesn't propagate to the other threads. This is because, upon creating a new thread, its TLS is initialised.
Atomic Assembly
No, this section is not about nukes, sadly :(. Instead, we aim to get accustomed to the way in which the x86 ISA provides atomic instructions.
This mechanism looks very simple.
It is but one instruction prefix: lock.
It is not an instruction with its own separate opcode, but a prefix that slightly modifies the opcode of the following instruction to make the CPU execute it atomically (i.e. with exclusive access to the data bus).
lock must only be placed before an instruction that executes a read-modify-write action.
For example, we cannot place it before a mov instruction, as the action of a mov is simply a read or a write.
Instead, we can place it in front of an inc instruction if its operand is memory.
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/.
Look at the code in asm/race_condition_lock.S.
It's an Assembly equivalent of the code you've already seen many times so far (such as c/race_condition.c).
Assemble and run it a few times.
Notice the different results you get.
Now add the lock prefix before inc and dec.
Reassemble and rerun the code.
And now we have synchronized the two threads by leveraging CPU support.
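For reference, the same lock-prefixed instructions can also be emitted from C via inline assembly. The snippet below is a minimal sketch (GCC/Clang syntax on x86, not the code from the support directory):

#include <pthread.h>
#include <stdio.h>

static int var;

static void *increase(void *arg)
{
    for (int i = 0; i < 10000000; i++)
        __asm__ volatile("lock incl %0" : "+m"(var));   /* atomic var++ */
    return NULL;
}

static void *decrease(void *arg)
{
    for (int i = 0; i < 10000000; i++)
        __asm__ volatile("lock decl %0" : "+m"(var));   /* atomic var-- */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increase, NULL);
    pthread_create(&t2, NULL, decrease, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("var = %d\n", var);   /* always 0, thanks to the lock prefix */
    return 0;
}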
Task: Synchronization - Thread-Safe Data Structure
Now it's time for a fully practical exercise.
Go to chapters/compute/synchronization/drills/tasks/threadsafe-data-struct/support/.
In the file clist.c, you'll find a simple implementation of an array list.
Although correct, it is not (yet) thread-safe.
The code in test.c verifies its single-threaded correctness, while the one in test_parallel.c verifies it works properly with multiple threads.
Your task is to synchronize this data structure using whichever primitives you like.
Try to keep the implementation efficient.
Aim to decrease your running times as much as you can.
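One straightforward approach is coarse-grained locking: embed one mutex in the list and wrap every operation in it. The sketch below uses hypothetical struct and function names, so adapt the idea to the actual API in clist.c:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical array list; the real structure lives in clist.c. */
struct clist {
    int *data;
    size_t size, capacity;
    pthread_mutex_t lock;   /* one lock protecting the whole list */
};

/* Hypothetical add operation, wrapped in a critical section. */
int clist_add(struct clist *l, int value)
{
    pthread_mutex_lock(&l->lock);

    if (l->size == l->capacity) {
        pthread_mutex_unlock(&l->lock);
        return -1;   /* or grow the array while still holding the lock */
    }
    l->data[l->size++] = value;

    pthread_mutex_unlock(&l->lock);
    return 0;
}

A single lock is easy to get right, but it serialises all operations; finer-grained schemes (for example read-write locks, or one lock per chunk of the array) may lower the running times measured by test_parallel.c.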
Task: Another Time Slice
Enter the chapters/compute/scheduling/drills/tasks/libult/support/
folder and go through the practice items below.
1. Modify the time slice set on the timer to 2 seconds. Re-run the code in test_ult.c. Notice that now no context switch happens between the 2 created threads, because they end before the timer can fire.
2. Now change the printer_thread() function in test_ult.c to make it run for more than 2 seconds. See that now the prints from the two threads appear intermingled. Add prints to the handle_sigprof() function in threads.c to see the context switch happen.
If you're having difficulties solving this exercise, go through this reading material.
Synchronization
So far, we've used threads and processes without wondering how to "tell" them how to access shared data. Moreover, in order to make threads wait for each other, we simply had the main thread wait for the others to finish all their work. But what if we want one thread to wait until another one simply performs some specific action, after which it resumes its execution? For this, we need to use some more complex synchronization mechanisms.
Race Conditions
For example, what if one thread wants to increase a global variable while another one wants to decrease it? Let's say the assembly code for increasing and decreasing the variable looks like the one in the snippet below.
increase:
mov eax, [var]
inc eax
mov [var], eax
decrease:
mov eax, [var]
dec eax
mov [var], eax
Imagine both threads executed mov eax, [var] at the same time.
Then each would independently increase its (non-shared) eax register.
In the end, the final value of var depends on which thread executes mov [var], eax last.
So it's kind of a reversed race.
The thread that runs the slowest "wins" this race and writes the final value of var.
But this is up to the scheduler and is non-deterministic.
Such undefined behaviours can cripple the execution of a program if var is some critical variable.
Let's see this bug in action.
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition.c, compile and run the code a few times.
It spawns two threads that do exactly what we've talked about so far: one thread increments var 10 million times, while the other decrements it 10 million times.
As you can see from running the program, the differences between subsequent runs can be substantial.
To fix this, we must ensure that only one thread can execute either var++ or var-- at any time.
We call these code sections critical sections.
A critical section is a piece of code that can only be executed by one thread at a time.
So we need some sort of mutual exclusion mechanism so that when one thread runs the critical section, the other has to wait before entering it.
This mechanism is called a mutex, whose name comes from "mutual exclusion".
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_mutex.c and notice the differences between this code and the buggy one.
We now use a pthread_mutex_t variable, which we lock at the beginning of a critical section, and we unlock at the end.
Generally speaking, locking a mutex makes a thread enter a critical section, while calling pthread_mutex_unlock() makes the thread leave said critical section.
Therefore, as we said previously, the critical sections in our code are var-- and var++.
Run the code multiple times to convince yourself that in the end, the value of var will always be 0.
Mutexes contain an internal variable which can be either 1 (locked) or 0 (unlocked).
When a thread calls pthread_mutex_lock(), it attempts to set that variable to 1.
If it was 0, the thread sets it to 1 and proceeds to execute the critical section.
Otherwise, it suspends its execution and waits until that variable is set to 0 again.
When calling pthread_mutex_unlock(), the internal variable is set to 0 and all waiting threads are woken up to try to acquire the mutex again.
Be careful: it is generally considered unsafe, and in many cases undefined behaviour, to call pthread_mutex_unlock() from a different thread than the one that acquired the lock.
So the general workflow should look something like this:
within a single thread:
pthread_mutex_lock(&mutex)
// do atomic stuff
pthread_mutex_unlock(&mutex)
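Put together, a minimal standalone sketch (not the exact support code, which may differ in details) looks like this:

#include <pthread.h>
#include <stdio.h>

static int var;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *decrement(void *arg)
{
    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&lock);
        var--;                       /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, decrement, NULL);

    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&lock);
        var++;                       /* critical section */
        pthread_mutex_unlock(&lock);
    }

    pthread_join(tid, NULL);
    printf("var = %d\n", var);       /* always 0 */
    return 0;
}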
Synchronization - Overhead
There ain't no such thing as a free lunch
This saying is also true for multithreading.
Running threads in parallel is nice and efficient, but synchronization always comes with a penalty: overhead.
Use the time command to record the running times of race_condition and race_condition_mutex.
Notice that those of race_condition_mutex are larger than those of race_condition.
The cause of this is that now when one thread is executing the critical section, the other has to wait and do nothing. Waiting means changing its state from RUNNING to WAITING, which brings further overhead from the scheduler. This latter overhead comes from the context switch that is necessary for a thread to switch its state from RUNNING to WAITING and back.
Practice: Wrap the Whole for Statements in Critical Sections
Move the calls to pthread_mutex_lock() and pthread_mutex_unlock() outside the for statements so that the critical sections become the entire statements.
Measure the new time spent by the code and compare it with the execution times recorded when the critical sections were made up of only var-- and var++.
Atomics
So now we know how to use mutexes.
And we know that mutexes work by using an internal variable that can be either 1 (locked) or 0 (unlocked).
But how does pthread_mutex_lock() actually set that variable to 1?
How does it avoid a race condition in case another thread also wants to set it to 1?
We need a guarantee that anyone "touching" that variable does so "within its own critical section". But now we need a critical section to implement a critical section... To solve this circular problem, we make use of a very common Deus ex Machina: hardware support.
Modern processors are capable of atomically accessing data, either for reads or writes.
An atomic action is an indivisible sequence of operations that a thread runs without interference from others.
Concretely, before initiating an atomic transfer on one of its data buses, the CPU first makes sure all other transfers have ended, then locks the data bus by stalling all cores attempting to transfer data on it.
This way, one thread obtains exclusive access to the data bus while accessing data.
As a side note, the critical sections in chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_mutex.c are also atomic once they are wrapped between calls to pthread_mutex_lock() and pthread_mutex_unlock().
As with every hardware feature, the x86 ISA exposes an instruction for atomic operations.
In particular, this instruction is a prefix, called lock.
It makes the instruction that follows it run atomically.
The lock prefix ensures that the core performing the instruction has exclusive ownership of the cache line from which the data is transferred, for the entire operation.
This is how the increment is made into an indivisible unit.
For example, inc dword [x] can be made atomic, like so: lock inc dword [x].
Compilers provide support for such hardware-level atomic operations.
GCC exposes built-ins such as __atomic_load(), __atomic_store(), __atomic_compare_exchange() and many others.
All of them rely on the mechanism described above.
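As a quick illustration (a sketch, not necessarily the code the task expects), an atomic increment and decrement using these builtins could look like this:

static int var;

static void increment_var(void)
{
    /* Atomically performs var++, with sequentially consistent ordering. */
    __atomic_fetch_add(&var, 1, __ATOMIC_SEQ_CST);
}

static void decrement_var(void)
{
    /* Atomically performs var-- */
    __atomic_fetch_sub(&var, 1, __ATOMIC_SEQ_CST);
}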
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_atomic.c and complete the function decrement_var().
Compile and run the code.
Now measure its running time against the mutex implementations.
It should be somewhere between race_condition and race_condition_mutex.
The C standard library also provides atomic data types.
Access to these variables can be done only by one thread at a time.
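These types live in the stdatomic.h header (C11). A minimal sketch, separate from the support code:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int var;   /* equivalent to _Atomic int */

int main(void)
{
    atomic_fetch_add(&var, 1);   /* atomic var++ */
    atomic_fetch_sub(&var, 1);   /* atomic var-- */
    var++;                       /* plain ++/-- on an atomic type is also atomic */
    var--;

    printf("var = %d\n", atomic_load(&var));
    return 0;
}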
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_atomic2.c, compile and run the code.
Now measure its running time against the other implementations.
Notice that the time is similar to race_condition_atomic.
So using the hardware support is more efficient, but it usually is leveraged only for simple, individual instructions, such as loads and stores. And the fact that high-level languages also expose an API for atomic operations shows how useful these operations are for developers.
Semaphores
Up to now, we've learned how to create critical sections that can be accessed by only one thread at a time.
These critical sections revolved around data.
Whenever we define a critical section, there is some specific data to which we cannot allow parallel access.
The reason why we can't allow it is, in general, data integrity, as we've seen in our examples in chapters/compute/synchronization/drills/tasks/race-condition/support/.
But what if threads need to count?
Counting is inherently thread-unsafe because it's a read-modify-write operation.
We read the counter, increment (modify) it and then write it back.
Think about our example with apache2.
Let's say a worker has created a pool of 3 threads.
They are not doing any work initially;
they are in the WAITING state.
As clients initiate connections, these threads are picked up and are used to serve at most 3 connections at a time.
But the number of connections may be arbitrarily large.
Therefore, we need a way to keep track of it.
When serving a client, a thread should decrement it to inform the others that a connection has been handled.
In short, we need a counter that the dispatcher increments and that worker threads decrement.
Such a counter could be implemented using a semaphore.
For simplicity's sake, you can view a semaphore as simply a mutex whose internal variable can take any value and acts like a counter.
When a thread attempts to acquire() a semaphore, it will wait if this counter is less than or equal to 0.
Otherwise, the thread decrements the internal counter and the function returns.
The opposite of acquire() is release(), which increases the internal counter by a given value (by default 1).
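With POSIX semaphores (semaphore.h), acquire() and release() correspond to sem_wait() and sem_post(). The sketch below models the counter described above in a simplified way (it is not the apache2 code); note that sem_post() increments the counter by exactly 1:

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t pending_connections;   /* counts connections waiting to be served */

static void *worker(void *arg)
{
    for (int i = 0; i < 3; i++) {
        sem_wait(&pending_connections);   /* wait for a connection, then decrement */
        printf("worker %ld: serving one connection\n", (long) arg);
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[3];

    sem_init(&pending_connections, 0, 0);   /* the counter starts at 0 */

    for (long i = 0; i < 3; i++)
        pthread_create(&workers[i], NULL, worker, (void *) i);

    /* The "dispatcher" announces 9 connections; each post increments the counter. */
    for (int i = 0; i < 9; i++)
        sem_post(&pending_connections);

    for (int i = 0; i < 3; i++)
        pthread_join(workers[i], NULL);

    sem_destroy(&pending_connections);
    return 0;
}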
Conditions
Another way we can implement our apache2
simulator is to use a condition variable.
This one is probably the most intuitive synchronization primitive.
It's a means by which a thread can tell another one: "Hey, wake up, this happened!".
So it's a way for threads to notify each other.
For this reason, the main methods associated with conditions are notify() and wait().
As you might expect, they are complementary:
- wait() puts the thread in the WAITING state until it's woken up by another one
- notify() wakes up one or more wait()-ing threads. If notify() is called before any thread has called wait(), the first thread that calls it will continue its execution unhindered.
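With POSIX threads, the corresponding calls are pthread_cond_wait() and pthread_cond_signal() / pthread_cond_broadcast(). A condition variable is always paired with a mutex and a predicate, as in the sketch below (a minimal example, separate from the support code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int event_happened;   /* the predicate, protected by the mutex */

static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!event_happened)                 /* re-check: wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);    /* wait(): releases the mutex while sleeping */
    printf("waiter: the event happened, resuming\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, waiter, NULL);

    pthread_mutex_lock(&lock);
    event_happened = 1;
    pthread_cond_signal(&cond);   /* notify(): wake up one waiting thread */
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}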
Thread-Local Storage (TLS)
First things first: what if we don't want data to be shared between threads? Are we condemned to have to worry about race conditions? Well, no.
To protect data from race conditions "by design", we can place it in what's called Thread-Local Storage (TLS).
As its name implies, this is a type of storage that is "owned" by individual threads, as opposed to being shared among all threads.
Do not confuse it with copy-on-write: TLS pages are always duplicated when creating a new thread and their contents are reinitialised.
User-Level Threads
User-level threads differ from the threads you are used to (kernel-level threads, those created by pthread_create()).
This kind of thread is scheduled by a user-level scheduler, and multiple such threads can run on the same kernel-level thread.
From now on, we will refer to user-level threads as fibers and to kernel-level threads simply as threads.
We will use the fiber implementation from libboost.
This implementation uses a cooperative scheduler on each thread, meaning that each fiber has to yield in order for another fiber to be executed.
We will also use C++ and the standard thread implementation.
Prerequisites
Unless you are using the OS docker image, you will need to install cmake and libboost.
You can do this with the following command:
student@os:~$ sudo apt-get install cmake libboost-context-dev libboost-fiber-dev
Creation
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/simple.cc implementation.
It creates NUM_FIBERS fibers, each of which prints "Hello World".
To compile and run the program, do the following steps:
student@os:~/.../user-level-threads/support$ mkdir build/
student@os:~/.../user-level-threads/support$ cd build/
student@os:~/.../user-level-threads/support$ cmake -S .. -B .
student@os:~/.../user-level-threads/support$ make
student@os:~/.../user-level-threads/support$ ./simple
The cmake step must be executed only once.
After modifying the source files, it is enough to run make.
Practice: Sleeper Fiber
Add a fiber to user-level-threads/support/simple.cc that sleeps for 5 seconds before the other ones are created.
What happens?
Answer in this quiz.
No system calls
Use strace to find calls to clone() in the execution of simple.
Can you find any?
Provide your answer in this quiz.
Remember that clone() is the system call used to create kernel-level threads, as pointed out here.
Synchronization
By default, the fibers that run on the same thread are synchronized - no race conditions can occur.
This is illustrated by the chapters/compute/user-level-threads/guides/user-level-threads/support/sum.cc implementation.
The user can, however, implement further synchronization by using the yield() call, or classic synchronization methods, like mutexes, barriers and condition variables.
Yielding
As the scheduler is cooperative, each fiber can yield (or not), to allow another fiber to run.
Follow the compute/user-level-threads/guides/user-level-threads/support/yield_launch.cc implementation and run it.
Note the boost::fibers::launch::dispatch parameter provided to the fiber constructor.
It notifies the scheduler to start the fiber as soon as it is created.
In order to explain the output, we must consider that the fibers are created by a main fiber, which, in this case, is scheduled along with the others.
Practice
Modify the launch parameter into boost::fibers::launch::post, compile and notice the differences.
The post parameter notifies the scheduler not to start the fibers immediately, but rather to place them into an execution queue.
Their execution will start after the main fiber calls the join() function.
Barriers
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/yield_barrier.cc implementation.
It uses a barrier to achieve the same result as the previous implementation, which used post as the launch parameter.
C++ unique_lock
unique_lock is not a mutex itself, but a wrapper that locks a mutex and automatically unlocks it when it goes out of scope (at the end of the enclosing function or block).
Scheduling
Up to now, we know that the OS decides which thread (not process) runs on each CPU core at each time. Now we'll learn about the component that performs this task specifically: the scheduler.
There are thousands of threads running at any time in a modern OS. The job of the scheduler is to run and pause threads as well as allocate them to the CPU cores, with the following goals:
- fairness: we do want all threads to get the same chance to run, i.e. run for about the same amount of time
- throughput: we want to run as many threads to completion so as to complete as many tasks as we can
To do this, the scheduler must decide, at given times, to suspend a thread, save its current state and let another one run in its place. This event is called a context switch. A context switch means changing the state of one thread (the replaced thread) from RUNNING to WAITING, and the state of the replacement thread from READY / WAITING to RUNNING.
- Quiz?
User-Level vs Kernel-Level Threads
There are two types of threads. The threads you've used so far are kernel-level threads (KLT). They are created and scheduled in the kernel of the OS. One of the most important of their features is that they offer true parallelism. With KLTs, we can truly run a program on all the cores of our CPU at once. But we must pay a price for this: scheduling them is very complex, and context switches are costly (in terms of time), especially when switching threads belonging to different processes.
By contrast, user-level threads (ULT) are managed in user space. Multiple ULTs created by a program are generally mapped to the same kernel-level thread. If a process only creates ULTs, then they will all be mapped to the single, main kernel thread of the process. So if we cannot run code in parallel with ULTs, then why use them? Well, programs that create many context switches may suffer from the larger overhead if they use kernel-level threads. In such cases, user-level threads may be useful, as context switches between user-level threads bring less overhead.
Thread Control Block
Let's dissect the threads_create() function a bit.
It first initialises its queues and the timer for preemption.
We'll discuss preemption in the next section.
After performing initialisations, the function creates a TCB object.
TCB stands for Thread Control Block.
During the lecture, you saw that the kernel stores one instance of a task_struct for each thread.
Remember that its most important fields are:
struct task_struct {
unsigned int __state;
void *stack;
unsigned int flags;
int on_cpu;
int prio;
/* Scheduler information */
struct sched_entity se;
const struct sched_class *sched_class;
/* The VAS: memory mappings */
struct mm_struct *mm;
int exit_state;
int exit_code;
pid_t pid;
struct task_struct __rcu *parent;
/* Child processes */
struct list_head children;
/* Open file information */
struct files_struct *files;
};
As you can see, this struct stores metadata regarding a given thread.
The same is true about the TCB in libult.so:
typedef struct {
int id;
ucontext_t context;
bool has_dynamic_stack;
void *(*start_routine) (void *);
void *argument;
void *return_value;
} TCB;
It stores the thread ID (tid, the id field), similar to the PID of a process.
It stores a pointer to the function passed as argument to threads_create() (start_routine), as well as the argument (argument) and the return value (return_value) of said function.
In addition, the TCB stores a context.
From the man page of the ucontext.h header, we can see this type is a struct that stores a pointer to the stack of the current thread (uc_stack).
This is similar to the stack pointer in the task_struct above.
In short, we can say a context defines an execution unit, such as a thread.
This is why changing the running thread is called a context switch.
Let's compare this context with another thread implementation, from Unikraft.
We'll look at the uk_thread struct, which is the TCB used in Unikraft:
struct uk_thread {
const char *name;
void *stack;
void *tls;
void *ctx;
UK_TAILQ_ENTRY(struct uk_thread) thread_list;
uint32_t flags;
__snsec wakeup_time;
bool detached;
struct uk_waitq waiting_threads;
struct uk_sched *sched;
void (*entry)(void *);
void *arg;
void *prv;
};
There are some visible similarities between the two TCBs.
Therefore, the workflow for creating and running a thread goes like this:
main thread
|
`--> threads_create()
|
|--> tcb_new()
`--> makecontext()
|
`--> handle_thread_start() - called using the context
|
|--> start_routine() - the thread runs
`--> threads_exit()
Compile and run the code in libult/support/test_ult.c.
If you encounter the following error when running test_ult, remember what you learned about the loader and using custom shared libraries in the Software Stack lab.
./test_ult: error while loading shared libraries: libult.so: cannot open shared object file: No such file or directory
Hint: Use the LD_LIBRARY_PATH variable.
Notice that the threads run their code alternately, as their prints appear interleaved.
In the next section, we'll see how this is done.
Scheduling - How is it done?
There are two types of schedulers: preemptive and cooperative.
When discussing this distinction, we need to first define the notion of yielding.
Yielding the CPU means that a thread suspends its own execution and enters the WAITING or READY state, either as a result of a blocking call (such as an I/O operation) or by calling the scheduler's yield() function directly.
So, yielding the CPU triggers a context switch whereby the current thread stops running and another one resumes or starts running in its place.
Cooperative Scheduling
Cooperative scheduling relies on the fact that threads themselves would yield the CPU at some point. If threads don't abide by this convention, they end up monopolising the CPU (since there is no one to suspend them) and end up starving the others. You can get a feel of this behaviour by running the cooperative scheduler from Unikraft in the [lecture demos].
This type of scheduler has the advantage of being lightweight, thus resulting in less overhead caused by context switches. However, as we've already stated, such schedulers rely on the "good will" of threads to yield the CPU at some point.
Preemptive Scheduling
Preemptive scheduling solves the issue stated above by leaving the task of suspending the currently RUNNING thread and replacing it with another one from the READY queue up to the scheduler. This increases its complexity and the duration of context switches, but threads now are not required to worry about yielding themselves and can focus on running their code and performing the task for which they are created.
Preemptive schedulers allow threads to run only for a maximum amount of time, called a time slice (usually a few milliseconds). They use timers which fire when a new time slice passes. The firing of one such timer causes a context switch whereby the currently RUNNING thread is preempted (i.e. suspended) and replaced with another one.
Look at the init_profiling_timer() function.
It creates a timer that generates a SIGPROF signal and then defines a handler (the handle_sigprof() function) that is executed whenever the SIGPROF signal is received.
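For reference, such a timer can be set up with sigaction() and setitimer() using ITIMER_PROF. The sketch below shows the general shape only; the actual interval and flags used in the support code may differ:

#include <signal.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

/* The handler receives the interrupted thread's context as its third argument. */
extern void handle_sigprof(int signum, siginfo_t *info, void *context);

static void init_profiling_timer(void)
{
    struct sigaction action;
    struct itimerval timer;

    memset(&action, 0, sizeof(action));
    action.sa_sigaction = handle_sigprof;
    action.sa_flags = SA_SIGINFO;
    if (sigaction(SIGPROF, &action, NULL) == -1)
        abort();

    /* The time slice: SIGPROF fires every 10 ms of consumed CPU time. */
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 10000;
    timer.it_value = timer.it_interval;
    if (setitimer(ITIMER_PROF, &timer, NULL) == -1)
        abort();
}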
It is this handler that performs the context switch per se. Look at its code. It first saves the context of the current thread:
ucontext_t *stored = &running->context;
ucontext_t *updated = (ucontext_t *) context;
stored->uc_flags = updated->uc_flags;
stored->uc_link = updated->uc_link;
stored->uc_mcontext = updated->uc_mcontext;
stored->uc_sigmask = updated->uc_sigmask;
Then it places the current thread in the ready queue and replaces it with the first thread in the same queue.
This algorithm (which schedules the first thread in the READY queue) is called Round-Robin:
if (queue_enqueue(ready, running) != 0) {
abort();
}
if ((running = queue_dequeue(ready)) == NULL) {
abort();
}
The new running thread is resumed upon setting the current context to it:
if (setcontext(&running->context) == -1) {
abort();
}
This is how scheduling is done!
Guide: Interaction Between Threads and Fibers
As we mentioned before, multiple fibers can run on the same thread, and a scheduler is implemented on each thread.
By default, the scheduling algorithm is round_robin.
It runs the fibers in the order of their creation, until they yield or finish their work.
If a fiber yields, it is placed at the back of the round-robin queue.
Using this scheduler, each thread only runs its own fibers; if one thread has more work to do than another, bad luck.
This may lead to starvation.
But there are other scheduler implementations, such as shared_work and work_stealing.
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/threads_and_fibers.cc implementation.
It creates multiple fibers and threads, and uses the shared_work scheduler to balance the workload between the threads.
The main fiber of each thread is suspended until all worker fibers have completed their work, using a condition variable.
cnd_count.wait( lk, []{ return 0 == fiber_count; } );
The program also uses thread local storage and fiber local storage to store the ID of each thread / fiber.
Now change the shared_work scheduler into the work_stealing one.
It takes a parameter: the number of threads that will use that scheduler.
Compile, rerun and note the differences.
The work_stealing scheduler, as the name suggests, will "steal" fibers from other schedulers.
So, while the shared_work scheduler tries to balance the available work between the available threads, the work_stealing one focuses on keeping as many threads as possible at 100% workload.
Vary the number of threads and fibers, and the workload (for example, make each fiber do some computation-intensive work), and observe the results.
Guide: User-Level Threads Scheduler
Go to chapters/compute/scheduling/guides/libult/support.
It contains a minimalist user-level threads scheduler.
Compiling it produces a shared library called libult.so.
You can also consult its documentation.
It explains the API as well as its implementation.
The API exposed by the scheduling library is very simple.
It is only made up of 3 functions:
- threads_create() creates a new ULT
- threads_exit() moves the current ULT to the COMPLETED state
- threads_join() waits for a given thread to end and saves its return value in the result argument
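A minimal usage sketch of this API is shown below. The header name and exact signatures are assumptions based on the description above, so check them against threads.c and its header in the support directory. Link the program against libult.so (and set LD_LIBRARY_PATH as described earlier) to run it.

#include <stdio.h>
#include "threads.h"   /* assumed header name exposing the libult API */

static void *worker(void *arg)
{
    printf("hello from a ULT, argument: %s\n", (char *) arg);
    return "done";     /* the return value is retrieved by threads_join() */
}

int main(void)
{
    void *result;
    int tid;

    tid = threads_create(worker, "ULT 1");   /* assumed to return the new ULT's id */
    threads_join(tid, &result);              /* waits and stores the return value */

    printf("ULT %d returned: %s\n", tid, (char *) result);
    return 0;
}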
Look inside chapters/compute/scheduling/guides/libult/support/threads.c.
Here you will find the 3 functions mentioned above.
The scheduler only uses 3 states: RUNNING, READY, COMPLETED.
The threads in the READY and COMPLETED states are kept in 2 separate queues. When the scheduler wants to run a new thread, it retrieves it from the READY queue. When a thread ends its execution, it is added to the COMPLETED queue, together with its return value.