Lab 8 - Synchronization
Task: Race Conditions
C - TLS on Demand
The perspective of C towards TLS is the following: everything is shared by default. This makes multithreading easier and more lightweight to implement than in other languages, like D, because synchronization is left entirely up to the developer, at the cost of potential unsafety.
Of course, we can specify that some data belongs to the TLS by preceding the declaration of a variable with the __thread keyword.
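For illustration, here is a minimal standalone sketch (not part of the support code) of a TLS variable: each thread sees its own copy of var, with its own address. Compile it with gcc -pthread.

#include <stdio.h>
#include <pthread.h>

/* Each thread gets its own copy of var; without __thread it would be shared. */
static __thread int var = 0;

static void *worker(void *arg)
{
    var += 1;   /* modifies only this thread's copy */
    printf("thread %ld: var = %d, stored at %p\n", (long) arg, var, (void *) &var);
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, worker, (void *) 1);
    pthread_create(&t2, NULL, worker, (void *) 2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* main's copy is unaffected by the other threads */
    printf("main: var = %d, stored at %p\n", var, (void *) &var);
    return 0;
}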
First, compile and run the code in chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_tls.c a few times.
As expected, the result is different each time.
1. Modify the declaration of var and add the __thread keyword to place the variable in the TLS of each thread. Recompile and run the code a few more times. You should see that in the end, var is 0.
2. Print the address and value of var in each thread. See that they differ.
3. Modify the value of var in the main() function before calling pthread_create(). Notice that the value doesn't propagate to the other threads. This is because, upon creating a new thread, its TLS is initialised.
Atomic Assembly
No, this section is not about nukes, sadly :(. Instead, we aim to get accustomed to the way in which the x86 ISA provides atomic instructions.
This mechanism looks very simple.
It is but one instruction prefix: lock.
It is not an instruction with its own separate opcode, but a prefix that slightly modifies the opcode of the following instruction to make the CPU execute it atomically (i.e. with exclusive access to the data bus).
lock must only be placed before an instruction that executes a read-modify-write action.
For example, we cannot place it before a mov instruction, as the action of a mov is simply a read or a write.
Instead, we can place it in front of an inc instruction if its operand is memory.
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/.
Look at the code in asm/race_condition_lock.S.
It's an Assembly equivalent of the code you've already seen many times so far (such as c/race_condition.c).
Assemble and run it a few times.
Notice the different results you get.
Now add the lock prefix before inc and dec.
Reassemble and rerun the code.
And now we have synchronized the two threads by leveraging CPU support.
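For reference, the same lock-prefixed instructions can also be emitted from C via inline assembly. The snippet below is a minimal sketch (GCC/Clang syntax on x86, not the code from the support directory):

#include <pthread.h>
#include <stdio.h>

static int var;

static void *increase(void *arg)
{
    for (int i = 0; i < 10000000; i++)
        __asm__ volatile("lock incl %0" : "+m"(var));   /* atomic var++ */
    return NULL;
}

static void *decrease(void *arg)
{
    for (int i = 0; i < 10000000; i++)
        __asm__ volatile("lock decl %0" : "+m"(var));   /* atomic var-- */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, increase, NULL);
    pthread_create(&t2, NULL, decrease, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    printf("var = %d\n", var);   /* always 0, thanks to the lock prefix */
    return 0;
}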
Task: Synchronization - Thread-Safe Data Structure
Now it's time for a fully practical exercise.
Go to chapters/compute/synchronization/drills/tasks/threadsafe-data-struct/support/.
In the file clist.c, you'll find a simple implementation of an array list.
Although correct, it is not (yet) thread-safe.
The code in test.c verifies its single-threaded correctness, while the one in test_parallel.c verifies it works properly with multiple threads.
Your task is to synchronize this data structure using whichever primitives you like.
Try to keep the implementation efficient.
Aim to decrease your running times as much as you can.
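One straightforward approach is coarse-grained locking: embed one mutex in the list and wrap every operation in it. The sketch below uses hypothetical struct and function names, so adapt the idea to the actual API in clist.c:

#include <pthread.h>
#include <stddef.h>

/* Hypothetical array list; the real structure lives in clist.c. */
struct clist {
    int *data;
    size_t size, capacity;
    pthread_mutex_t lock;   /* one lock protecting the whole list */
};

/* Hypothetical add operation, wrapped in a critical section. */
int clist_add(struct clist *l, int value)
{
    pthread_mutex_lock(&l->lock);

    if (l->size == l->capacity) {
        pthread_mutex_unlock(&l->lock);
        return -1;   /* or grow the array while still holding the lock */
    }
    l->data[l->size++] = value;

    pthread_mutex_unlock(&l->lock);
    return 0;
}

A single lock is easy to get right, but it serialises all operations; finer-grained schemes (for example read-write locks, or one lock per chunk of the array) may lower the running times measured by test_parallel.c.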
Task: Another Time Slice
Enter the chapters/compute/scheduling/drills/tasks/libult/support/
folder and go through the practice items below.
1. Modify the time slice set on the timer to 2 seconds. Re-run the code in test_ult.c. Notice that now no context switch happens between the 2 created threads, because they end before the timer can fire.
2. Now change the printer_thread() function in test_ult.c to make it run for more than 2 seconds. See that now the prints from the two threads appear intermingled. Add prints to the handle_sigprof() function in threads.c to see the context switch happen.
If you're having difficulties solving this exercise, go through this reading material.
Synchronization
So far, we've used threads and processes without wondering how to "tell" them how to access shared data. Moreover, in order to make threads wait for each other, we simply had the main thread wait for the others to finish all their work. But what if we want one thread to wait until another one simply performs some specific action, after which it resumes its execution? For this, we need to use some more complex synchronization mechanisms.
Race Conditions
For example, what if one thread wants to increase a global variable while another one wants to decrease it? Let's say the assembly code for increasing and decreasing the variable looks like the one in the snippet below.
increase:
mov eax, [var]
inc eax
mov [var], eax
decrease:
mov eax, [var]
dec eax
mov [var], eax
Imagine both threads executed mov eax, [var] at the same time.
Then each would independently increase its (non-shared) eax register.
In the end, the final value of var depends on which thread executes mov [var], eax last.
So it's kind of a reversed race.
The thread that runs the slowest "wins" this race and writes the final value of var.
But this is up to the scheduler and is non-deterministic.
Such undefined behaviours can cripple the execution of a program if var is some critical variable.
Let's see this bug in action.
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition.c, compile and run the code a few times.
It spawns two threads that do exactly what we've talked about so far: one thread increments var 10 million times, while the other decrements it 10 million times.
As you can see from running the program, the differences between subsequent runs can be substantial.
To fix this, we must ensure that only one thread can execute either var++ or var-- at any time.
We call these code sections critical sections.
A critical section is a piece of code that can only be executed by one thread at a time.
So we need some sort of mutual exclusion mechanism so that when one thread runs the critical section, the other has to wait before entering it.
This mechanism is called a mutex, whose name comes from "mutual exclusion".
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_mutex.c and notice the differences between this code and the buggy one.
We now use a pthread_mutex_t variable, which we lock at the beginning of a critical section, and we unlock at the end.
Generally speaking, locking a mutex makes a thread enter a critical section, while calling pthread_mutex_unlock() makes the thread leave said critical section.
Therefore, as we said previously, the critical sections in our code are var-- and var++.
Run the code multiple times to convince yourself that in the end, the value of var will always be 0.
Mutexes contain an internal variable which can be either 1 (locked) or 0 (unlocked).
When a thread calls pthread_mutex_lock(), it attempts to set that variable to 1.
If it was 0, the thread sets it to 1 and proceeds to execute the critical section.
Otherwise, it suspends its execution and waits until that variable is set to 0 again.
When calling pthread_mutex_unlock(), the internal variable is set to 0 and all waiting threads are woken up to try to acquire the mutex again.
Be careful: it is generally considered unsafe, and in many cases undefined behaviour, to call pthread_mutex_unlock() from a different thread than the one that acquired the lock.
So the general workflow should look something like this:
within a single thread:
pthread_mutex_lock(&mutex)
// do atomic stuff
pthread_mutex_unlock(&mutex)
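Put together, a minimal standalone sketch (not the exact support code, which may differ in details) looks like this:

#include <pthread.h>
#include <stdio.h>

static int var;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *decrement(void *arg)
{
    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&lock);
        var--;                       /* critical section */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, decrement, NULL);

    for (int i = 0; i < 10000000; i++) {
        pthread_mutex_lock(&lock);
        var++;                       /* critical section */
        pthread_mutex_unlock(&lock);
    }

    pthread_join(tid, NULL);
    printf("var = %d\n", var);       /* always 0 */
    return 0;
}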
Synchronization - Overhead
There ain't no such thing as a free lunch
This saying is also true for multithreading.
Running threads in parallel is nice and efficient, but synchronization always comes with a penalty: overhead.
Use the time command to record the running times of race_condition and race_condition_mutex.
Notice that those of race_condition_mutex are larger than those of race_condition.
The cause of this is that now when one thread is executing the critical section, the other has to wait and do nothing. Waiting means changing its state from RUNNING to WAITING, which brings further overhead from the scheduler. This latter overhead comes from the context switch that is necessary for a thread to switch its state from RUNNING to WAITING and back.
Practice: Wrap the Whole for Statements in Critical Sections
Move the calls to pthread_mutex_lock() and pthread_mutex_unlock() outside the for statements so that the critical sections become the entire statements.
Measure the new time spent by the code and compare it with the execution times recorded when the critical sections were made up of only var-- and var++.
Atomics
So now we know how to use mutexes.
And we know that mutexes work by using an internal variable that can be either 1 (locked) or 0 (unlocked).
But how does pthread_mutex_lock() actually set that variable to 1?
How does it avoid a race condition in case another thread also wants to set it to 1?
We need a guarantee that anyone "touching" that variable does so "within its own critical section". But now we need a critical section to implement a critical section... To solve this circular problem, we make use of a very common Deus ex Machina: hardware support.
Modern processors are capable of atomically accessing data, either for reads or writes.
An atomic action is an indivisible sequence of operations that a thread runs without interference from others.
Concretely, before initiating an atomic transfer on one of its data buses, the CPU first makes sure all other transfers have ended, then locks the data bus by stalling all cores attempting to transfer data on it.
This way, one thread obtains exclusive access to the data bus while accessing data.
As a side note, the critical sections in chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_mutex.c are also atomic once they are wrapped between calls to pthread_mutex_lock() and pthread_mutex_unlock().
As with every hardware feature, the x86 ISA exposes an instruction for atomic operations.
In particular, this instruction is a prefix, called lock.
It makes the instruction that follows it run atomically.
The lock prefix ensures that the core performing the instruction has exclusive ownership of the cache line from which the data is transferred, for the entire operation.
This is how the increment is made into an indivisible unit.
For example, inc dword [x] can be made atomic, like so: lock inc dword [x].
Compilers provide support for such hardware-level atomic operations.
GCC exposes built-ins such as __atomic_load(), __atomic_store(), __atomic_compare_exchange() and many others.
All of them rely on the mechanism described above.
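As a quick illustration (a sketch, not necessarily the code the task expects), an atomic increment and decrement using these builtins could look like this:

static int var;

static void increment_var(void)
{
    /* Atomically performs var++, with sequentially consistent ordering. */
    __atomic_fetch_add(&var, 1, __ATOMIC_SEQ_CST);
}

static void decrement_var(void)
{
    /* Atomically performs var-- */
    __atomic_fetch_sub(&var, 1, __ATOMIC_SEQ_CST);
}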
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_atomic.c and complete the function decrement_var().
Compile and run the code.
Now measure its running time against the mutex implementations.
It should be somewhere between race_condition and race_condition_mutex.
The C standard library also provides atomic data types.
Access to these variables can be done only by one thread at a time.
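These types live in the stdatomic.h header (C11). A minimal sketch, separate from the support code:

#include <stdatomic.h>
#include <stdio.h>

static atomic_int var;   /* equivalent to _Atomic int */

int main(void)
{
    atomic_fetch_add(&var, 1);   /* atomic var++ */
    atomic_fetch_sub(&var, 1);   /* atomic var-- */
    var++;                       /* plain ++/-- on an atomic type is also atomic */
    var--;

    printf("var = %d\n", atomic_load(&var));
    return 0;
}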
Go to chapters/compute/synchronization/drills/tasks/race-condition/support/c/race_condition_atomic2.c, compile and run the code.
Now measure its running time against the other implementations.
Notice that the time is similar to race_condition_atomic.
So using the hardware support is more efficient, but it usually is leveraged only for simple, individual instructions, such as loads and stores. And the fact that high-level languages also expose an API for atomic operations shows how useful these operations are for developers.
Semaphores
Up to now, we've learned how to create critical sections that can be accessed by only one thread at a time.
These critical sections revolved around data.
Whenever we define a critical section, there is some specific data to which we cannot allow parallel access.
The reason why we can't allow it is, in general, data integrity, as we've seen in our examples in chapters/compute/synchronization/drills/tasks/race-condition/support/.
But what if threads need to count?
Counting is inherently thread-unsafe because it's a read-modify-write operation.
We read the counter, increment (modify) it and then write it back.
Think about our example with apache2.
Let's say a worker has created a pool of 3 threads.
They are not doing any work initially;
they are in the WAITING state.
As clients initiate connections, these threads are picked up and are used to serve at most 3 connections at a time.
But the number of connections may be arbitrarily large.
Therefore, we need a way to keep track of it.
When serving a client, a thread should decrement it to inform the others that a connection has been handled.
In short, we need a counter that the dispatcher increments and that worker threads decrement.
Such a counter could be implemented using a semaphore.
For simplicity's sake, you can view a semaphore as simply a mutex whose internal variable can take any value and acts like a counter.
When a thread attempts to acquire() a semaphore, it will wait if this counter is less than or equal to 0.
Otherwise, the thread decrements the internal counter and the function returns.
The opposite of acquire() is release(), which increases the internal counter by a given value (by default 1).
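With POSIX semaphores (semaphore.h), acquire() and release() correspond to sem_wait() and sem_post(). The sketch below models the counter described above in a simplified way (it is not the apache2 code); note that sem_post() increments the counter by exactly 1:

#include <semaphore.h>
#include <pthread.h>
#include <stdio.h>

static sem_t pending_connections;   /* counts connections waiting to be served */

static void *worker(void *arg)
{
    for (int i = 0; i < 3; i++) {
        sem_wait(&pending_connections);   /* wait for a connection, then decrement */
        printf("worker %ld: serving one connection\n", (long) arg);
    }
    return NULL;
}

int main(void)
{
    pthread_t workers[3];

    sem_init(&pending_connections, 0, 0);   /* the counter starts at 0 */

    for (long i = 0; i < 3; i++)
        pthread_create(&workers[i], NULL, worker, (void *) i);

    /* The "dispatcher" announces 9 connections; each post increments the counter. */
    for (int i = 0; i < 9; i++)
        sem_post(&pending_connections);

    for (int i = 0; i < 3; i++)
        pthread_join(workers[i], NULL);

    sem_destroy(&pending_connections);
    return 0;
}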
Conditions
Another way we can implement our apache2
simulator is to use a condition variable.
This one is probably the most intuitive synchronization primitive.
It's a means by which a thread can tell another one: "Hey, wake up, this happened!".
So it's a way for threads to notify each other.
For this reason, the main methods associated with conditions are notify() and wait().
As you might expect, they are complementary:
- wait() puts the thread in the WAITING state until it's woken up by another one
- notify() wakes up one or more wait()-ing threads. If notify() is called before any thread has called wait(), the first thread that calls it will continue its execution unhindered.
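With POSIX threads, the corresponding calls are pthread_cond_wait() and pthread_cond_signal() / pthread_cond_broadcast(). A condition variable is always paired with a mutex and a predicate, as in the sketch below (a minimal example, separate from the support code):

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int event_happened;   /* the predicate, protected by the mutex */

static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (!event_happened)                 /* re-check: wakeups can be spurious */
        pthread_cond_wait(&cond, &lock);    /* wait(): releases the mutex while sleeping */
    printf("waiter: the event happened, resuming\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t tid;

    pthread_create(&tid, NULL, waiter, NULL);

    pthread_mutex_lock(&lock);
    event_happened = 1;
    pthread_cond_signal(&cond);   /* notify(): wake up one waiting thread */
    pthread_mutex_unlock(&lock);

    pthread_join(tid, NULL);
    return 0;
}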
Thread-Local Storage (TLS)
First things first: what if we don't want data to be shared between threads? Are we condemned to have to worry about race conditions? Well, no.
To protect data from race conditions "by design", we can place it in what's called Thread-Local Storage (TLS).
As its name implies, this is a type of storage that is "owned" by individual threads, as opposed to being shared among all threads.
Do not confuse it with copy-on-write: TLS pages are always duplicated when creating a new thread and their contents are reinitialised.
User-Level Threads
User-level threads differ from the threads you are used to (kernel-level threads, those created by pthread_create()).
This kind of thread is scheduled by a user-level scheduler, and multiple such threads can run on the same kernel-level thread.
From now on, we will refer to user-level threads as fibers and to kernel-level threads simply as threads.
We will use the fiber implementation from libboost.
This implementation uses a cooperative scheduler on each thread, meaning that each fiber has to yield in order for another fiber to be executed.
We will also use C++ and the standard thread implementation.
Prerequisites
Unless you are using the OS docker image, you will need to install cmake and libboost.
You can do this with the following command:
student@os:~$ sudo apt-get install cmake libboost-context-dev libboost-fiber-dev
Creation
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/simple.cc implementation.
It creates NUM_FIBERS fibers, each of which prints "Hello World".
To compile and run the program, do the following steps:
student@os:~/.../user-level-threads/support$ mkdir build/
student@os:~/.../user-level-threads/support$ cd build/
student@os:~/.../user-level-threads/support$ cmake -S .. -B .
student@os:~/.../user-level-threads/support$ make
student@os:~/.../user-level-threads/support$ ./simple
The cmake step must be executed only once.
After modifying the source files, it is enough to run make.
Practice: Sleeper Fiber
Add a fiber to user-level-threads/support/simple.cc that sleeps for 5 seconds before the other ones are created.
What happens?
Answer in this quiz.
No system calls
Use strace to find calls to clone() in the execution of simple.
Can you find any?
Provide your answer in this quiz.
Remember that clone() is the system call used to create kernel-level threads, as pointed out here.
Synchronization
By default, the fibers that run on the same thread are synchronized - no race conditions can occur.
This is illustrated by the chapters/compute/user-level-threads/guides/user-level-threads/support/sum.cc implementation.
The user can, however, implement further synchronization by using the yield() call, or classic synchronization methods, like mutexes, barriers and condition variables.
Yielding
As the scheduler is cooperative, each fiber can yield (or not), to allow another fiber to run.
Follow the compute/user-level-threads/guides/user-level-threads/support/yield_launch.cc implementation and run it.
Note the boost::fibers::launch::dispatch parameter provided to the fiber constructor.
It notifies the scheduler to start the fiber as soon as it is created.
In order to explain the output, we must consider that the fibers are created by a main fiber, which, in this case, is scheduled along with the others.
Practice
Modify the launch parameter into boost::fibers::launch::post, compile and notice the differences.
The post parameter notifies the scheduler not to start the fibers immediately, but rather to place them into an execution queue.
Their execution will start after the main fiber calls the join() function.
Barriers
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/yield_barrier.cc implementation.
It uses a barrier to achieve the same result as the previous implementation, which used post as the launch parameter.
C++ unique_lock
unique_lock is not a mutex itself, but a wrapper that locks a mutex and automatically unlocks it when it goes out of scope (at the end of the enclosing function or block).
Scheduling
Up to now, we know that the OS decides which thread (not process) runs on each CPU core at each time. Now we'll learn about the component that performs this task specifically: the scheduler.
There are thousands of threads running at any time in a modern OS. The job of the scheduler is to run and pause threads as well as allocate them to the CPU cores, with the following goals:
- fairness: we do want all threads to get the same chance to run, i.e. run for about the same amount of time
- throughput: we want to run as many threads to completion so as to complete as many tasks as we can
To do this, the scheduler must decide, at given times, to suspend a thread, save its current state and let another one run in its place. This event is called a context switch. A context switch means changing the state of one thread (the replaced thread) from RUNNING to WAITING, and the state of the replacement thread from READY / WAITING to RUNNING.
- Quiz?
User-Level vs Kernel-Level Threads
There are two types of threads. The threads you've used so far are kernel-level threads (KLT). They are created and scheduled in the kernel of the OS. One of the most important of their features is that they offer true parallelism. With KLTs, we can truly run a program on all the cores of our CPU at once. But we must pay a price for this: scheduling them is very complex, and context switches are costly (in terms of time), especially when switching threads belonging to different processes.
By contrast, user-level threads (ULT) are managed in user space. Multiple ULTs created by a program are generally mapped to the same kernel-level thread. If a process only creates ULTs, then they will all be mapped to the single, main kernel thread of the process. So if we cannot run code in parallel with ULTs, then why use them? Well, programs that create many context switches may suffer from the larger overhead if they use kernel-level threads. In such cases, user-level threads may be useful, as context switches between user-level threads bring less overhead.
Thread Control Block
Let's dissect the threads_create() function a bit.
It first initialises its queues and the timer for preemption.
We'll discuss preemption in the next section.
After performing initialisations, the function creates a TCB object.
TCB stands for Thread Control Block.
During the lecture, you saw that the kernel stores one instance of a task_struct for each thread.
Remember that its most important fields are:
struct task_struct {
unsigned int __state;
void *stack;
unsigned int flags;
int on_cpu;
int prio;
/* Scheduler information */
struct sched_entity se;
const struct sched_class *sched_class;
/* The VAS: memory mappings */
struct mm_struct *mm;
int exit_state;
int exit_code;
pid_t pid;
struct task_struct __rcu *parent;
/* Child processes */
struct list_head children;
/* Open file information */
struct files_struct *files;
};
As you can see, this struct stores metadata regarding a given thread.
The same is true about the TCB in libult.so:
typedef struct {
int id;
ucontext_t context;
bool has_dynamic_stack;
void *(*start_routine) (void *);
void *argument;
void *return_value;
} TCB;
It stores the thread ID (tid, the id field), similar to the PID of a process.
It stores a pointer to the function passed as argument to threads_create() (start_routine), as well as the argument (argument) and the return value (return_value) of said function.
In addition, the TCB stores a context.
From the man page of the ucontext.h header, we can see this type is a struct that stores a pointer to the stack of the current thread (uc_stack).
This is similar to the stack pointer in the task_struct above.
In short, we can say a context defines an execution unit, such as a thread.
This is why changing the running thread is called a context switch.
Let's compare this context with another thread implementation, from Unikraft.
We'll look at the uk_thread struct, which is the TCB used in Unikraft:
struct uk_thread {
const char *name;
void *stack;
void *tls;
void *ctx;
UK_TAILQ_ENTRY(struct uk_thread) thread_list;
uint32_t flags;
__snsec wakeup_time;
bool detached;
struct uk_waitq waiting_threads;
struct uk_sched *sched;
void (*entry)(void *);
void *arg;
void *prv;
};
There are some visible similarities between the two TCBs.
Therefore, the workflow for creating and running a thread goes like this:
main thread
|
`--> threads_create()
|
|--> tcb_new()
`--> makecontext()
|
`--> handle_thread_start() - called using the context
|
|--> start_routine() - the thread runs
`--> threads_exit()
Compile and run the code in libult/support/test_ult.c.
If you encounter the following error when running test_ult, remember what you learned about the loader and using custom shared libraries in the Software Stack lab.
./test_ult: error while loading shared libraries: libult.so: cannot open shared object file: No such file or directory
Hint: Use the LD_LIBRARY_PATH variable.
Notice that the threads run their code alternately, as their prints appear interleaved.
In the next section, we'll see how this is done.
Scheduling - How is it done?
There are two types of schedulers: preemptive and cooperative.
When discussing this distinction, we need to first define the notion of yielding.
Yielding the CPU means that a thread suspends its own execution and enters the WAITING or READY state, either as a result of a blocking call (such as an I/O operation) or by calling the scheduler's yield() function directly.
So, yielding the CPU triggers a context switch whereby the current thread stops running and another one resumes or starts running in its place.
Cooperative Scheduling
Cooperative scheduling relies on the fact that threads themselves would yield the CPU at some point. If threads don't abide by this convention, they end up monopolising the CPU (since there is no one to suspend them) and end up starving the others. You can get a feel of this behaviour by running the cooperative scheduler from Unikraft in the [lecture demos].
This type of scheduler has the advantage of being lightweight, thus resulting in less overhead caused by context switches. However, as we've already stated, such schedulers rely on the "good will" of threads to yield the CPU at some point.
Preemptive Scheduling
Preemptive scheduling solves the issue stated above by leaving the task of suspending the currently RUNNING thread and replacing it with another one from the READY queue up to the scheduler. This increases its complexity and the duration of context switches, but threads now are not required to worry about yielding themselves and can focus on running their code and performing the task for which they are created.
Preemptive schedulers allow threads to run only for a maximum amount of time, called a time slice (usually a few milliseconds). They use timers which fire when a new time slice passes. The firing of one such timer causes a context switch whereby the currently RUNNING thread is preempted (i.e. suspended) and replaced with another one.
Look at the init_profiling_timer() function.
It creates a timer that generates a SIGPROF signal and then defines a handler (the handle_sigprof() function) that is executed whenever the SIGPROF signal is received.
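For reference, such a timer can be set up with sigaction() and setitimer() using ITIMER_PROF. The sketch below shows the general shape only; the actual interval and flags used in the support code may differ:

#include <signal.h>
#include <sys/time.h>
#include <stdlib.h>
#include <string.h>

/* The handler receives the interrupted thread's context as its third argument. */
extern void handle_sigprof(int signum, siginfo_t *info, void *context);

static void init_profiling_timer(void)
{
    struct sigaction action;
    struct itimerval timer;

    memset(&action, 0, sizeof(action));
    action.sa_sigaction = handle_sigprof;
    action.sa_flags = SA_SIGINFO;
    if (sigaction(SIGPROF, &action, NULL) == -1)
        abort();

    /* The time slice: SIGPROF fires every 10 ms of consumed CPU time. */
    timer.it_interval.tv_sec = 0;
    timer.it_interval.tv_usec = 10000;
    timer.it_value = timer.it_interval;
    if (setitimer(ITIMER_PROF, &timer, NULL) == -1)
        abort();
}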
It is this handler that performs the context switch per se. Look at its code. It first saves the context of the current thread:
ucontext_t *stored = &running->context;
ucontext_t *updated = (ucontext_t *) context;
stored->uc_flags = updated->uc_flags;
stored->uc_link = updated->uc_link;
stored->uc_mcontext = updated->uc_mcontext;
stored->uc_sigmask = updated->uc_sigmask;
Then it places the current thread in the ready queue and replaces it with the first thread in the same queue.
This algorithm (which schedules the first thread in the READY queue) is called Round-Robin:
if (queue_enqueue(ready, running) != 0) {
abort();
}
if ((running = queue_dequeue(ready)) == NULL) {
abort();
}
The new running thread is resumed upon setting the current context to it:
if (setcontext(&running->context) == -1) {
abort();
}
This is how scheduling is done!
Guide: Interaction Between Threads and Fibers
As we mentioned before, multiple fibers can run on the same thread, and a scheduler is implemented on each thread.
By default, the scheduling algorithm is round_robin.
It runs the fibers in the order of their creation, until they yield or finish their work.
If a fiber yields, it is placed at the back of the round-robin queue.
Using this scheduler, each thread only runs its own fibers; if one thread has more work to do than another, bad luck.
This may lead to starvation.
But there are other scheduler implementations, such as shared_work and work_stealing.
Follow the chapters/compute/user-level-threads/guides/user-level-threads/support/threads_and_fibers.cc implementation.
It creates multiple fibers and threads, and uses the shared_work scheduler to balance the workload between the threads.
The main fiber of each thread is suspended until all worker fibers have completed their work, using a condition variable.
cnd_count.wait( lk, []{ return 0 == fiber_count; } );
The program also uses thread local storage and fiber local storage to store the ID of each thread / fiber.
Now change the shared_work scheduler into the work_stealing one.
It takes a parameter: the number of threads that will use that scheduler.
Compile, rerun and note the differences.
The work_stealing scheduler, as the name suggests, will "steal" fibers from other schedulers.
So, while the shared_work scheduler tries to balance the available work between the available threads, the work_stealing one focuses on keeping as many threads as possible at 100% workload.
Vary the number of threads and fibers, and the workload (for example, make each fiber do some computation-intensive work), and observe the results.
Guide: User-Level Threads Scheduler
Go to chapters/compute/scheduling/guides/libult/support.
It contains a minimalist user-level threads scheduler.
Compiling it produces a shared library called libult.so.
You can also consult its documentation.
It explains the API as well as its implementation.
The API exposed by the scheduling library is very simple.
It is only made up of 3 functions:
- threads_create() creates a new ULT
- threads_exit() moves the current ULT to the COMPLETED state
- threads_join() waits for a given thread to end and saves its return value in the result argument
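A minimal usage sketch of this API is shown below. The header name and exact signatures are assumptions based on the description above, so check them against threads.c and its header in the support directory. Link the program against libult.so (and set LD_LIBRARY_PATH as described earlier) to run it.

#include <stdio.h>
#include "threads.h"   /* assumed header name exposing the libult API */

static void *worker(void *arg)
{
    printf("hello from a ULT, argument: %s\n", (char *) arg);
    return "done";     /* the return value is retrieved by threads_join() */
}

int main(void)
{
    void *result;
    int tid;

    tid = threads_create(worker, "ULT 1");   /* assumed to return the new ULT's id */
    threads_join(tid, &result);              /* waits and stores the return value */

    printf("ULT %d returned: %s\n", tid, (char *) result);
    return 0;
}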
Look inside chapters/compute/scheduling/guides/libult/support/threads.c.
Here you will find the 3 functions mentioned above.
The scheduler only uses 3 states: RUNNING, READY, COMPLETED.
The threads in the READY and COMPLETED states are kept in 2 separate queues. When the scheduler wants to run a new thread, it retrieves it from the READY queue. When a thread ends its execution, it is added to the COMPLETED queue, together with its return value.