
Lab 11 - IO Optimizations

Task: Ordered Client-Server Communication

Navigate to chapters/io/ipc/drills/tasks/client-server/ and run make to generate the support directory. This exercise will guide you in creating a basic messaging protocol between a server and a client. Although in real-world applications a server typically handles multiple connections at once, here we focus on a single connection. Handling multiple connections is further explored in I/O multiplexing.

Our application protocol is defined as follows:

  • The server listens for connections on localhost and a specified port.
  • The client connects to localhost:port.
  • The client sends a message; the server prints it and sends back a reply; the client then prints the reply. This sequence repeats in a loop.
  • The communication ends when either the client or the server sends the message exit.

Since we are blocking on recv(), the message order is fixed - the client must initiate communication. In real-world applications, this constraint can be avoided with I/O multiplexing.
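To make the fixed ordering concrete, the client-side loop could look roughly like the sketch below. It assumes sockfd is a socket already connected to the server and that messages fit in a fixed-size buffer; the actual support code may structure this differently.

#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

#define MSG_LEN 1024

// Ordered exchange, client side: the client always sends first,
// then blocks in recv() waiting for the server's reply.
void exchange_messages(int sockfd)
{
    char buf[MSG_LEN];

    while (1) {
        if (!fgets(buf, sizeof(buf), stdin)) // read the user's message
            break;
        send(sockfd, buf, strlen(buf), 0);   // the client initiates
        if (!strncmp(buf, "exit", 4))
            break;

        ssize_t n = recv(sockfd, buf, sizeof(buf) - 1, 0); // block for the reply
        if (n <= 0)
            break;
        buf[n] = '\0';
        printf("Server: %s", buf);
        if (!strncmp(buf, "exit", 4))
            break;
    }
}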

  1. Open support/client.c and complete the TODOs to enable message exchange with the server. Test your client by running python server.py in one terminal and then ./client in another. If correctly implemented, you should be able to exchange messages as outlined above.

    Bonus Question: Why is it OK for the client to be implemented in C while the server is implemented in Python?

  2. Open support/server.c and complete the TODOs to enable message exchange with the client. Test your server by running ./server in one terminal and then python client.py in another. If implemented correctly, you should be able to exchange messages as outlined above.

If you're having difficulties solving this exercise, go through the sockets API and the client-server model.

Task: Multiplexed Client Server

Navigate to chapters/io/optimizations/drills/tasks/multiplexed-client-server and run make to generate the support files.

This task builds on the previous task's ordered client-server communication. Previously, the client and server followed a strict, sequential pattern: each peer had to send a message and wait for a reply before proceeding.

We plan to build a group chat where clients can send messages at any time, and each message is broadcast to all other connected clients. To accomplish this, we’ll implement I/O multiplexing mechanisms that notify us only when data is available on a file descriptor. This non-blocking approach allows for smooth, unhindered communication between clients.

  1. We’ll use the epoll interface to manage multiple file descriptors. Begin by opening w_epoll.h and completing the TODOs. We will define wrapper functions to add and remove file descriptors from the epoll instance, making the main code more readable. Note: Ensure that each wrapper returns the result of the epoll_ctl() call for error handling. A sketch of such wrappers is shown after this list.

  2. Complete the TODOs in support/client.c to enable multiplexing of the available file descriptors. The file descriptors are stdin (for receiving user messages) and sockfd (for communication with the server). Use epoll_create(), epoll_wait(), and the wrappers defined in w_epoll.h to handle these descriptors without blocking. Remember to close the sockets before exiting.

    To test, start python server.py in one terminal and run your client implementation in two separate terminals. If successful, the clients should be able to communicate through the server.

  3. Complete the TODOs in support/server.c to multiplex I/O with clients. You will need to create an epoll instance and dynamically add and remove clients as they connect and disconnect. Use epoll_create(), epoll_wait(), and the wrappers defined in w_epoll.h to achieve this functionality. Remember to remove the client sockets from the epoll instance before closing them.

    To test your implementation, run ./server in one terminal and ./client (or python client.py) in two separate terminals. If everything works correctly, the clients should be able to communicate with each other via the server.
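The wrapper functions mentioned in step 1 could look roughly like the sketch below. The names w_epoll_add_fd_in() and w_epoll_remove_fd() are only illustrative (the generated skeleton may use different ones); the important part is that each wrapper returns epoll_ctl()'s result so the caller can handle errors.

#ifndef W_EPOLL_H_
#define W_EPOLL_H_

#include <string.h>
#include <sys/epoll.h>

static inline int w_epoll_add_fd_in(int epollfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;   // notify when fd becomes readable
    ev.data.fd = fd;

    return epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
}

static inline int w_epoll_remove_fd(int epollfd, int fd)
{
    struct epoll_event ev;

    memset(&ev, 0, sizeof(ev));
    ev.data.fd = fd;

    // The event argument is ignored for EPOLL_CTL_DEL on modern kernels,
    // but a valid pointer keeps older kernels happy.
    return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, &ev);
}

#endif /* W_EPOLL_H_ */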

If you're having difficulties solving this exercise, go through the epoll API from I/O Multiplexing.

Task: Async Server

Navigate to chapters/io/optimizations/drills/tasks/async-server and run make to generate the support files. Enter support and run make test-file.txt to generate the test file.

This task builds on the previous example of a multiplexed client-server. The server accepts connections from clients and receives a 1 GB file from each one. Once the upload finishes, each client closes the connection.

  1. Open server.c and complete the TODOs in the main function to set up I/O multiplexing using epoll. Use epoll_create(), epoll_wait(), and the wrappers defined in w_epoll.h to handle descriptors without blocking. Remember to remove the client sockets from the epoll instance before closing them.

    To test, run ./server in one terminal and ./client in another terminal. If successful, the client should print the upload progress.

  2. There is a problem with our current implementation. Try to start two clients at the same time - the first one will start uploading, and the second one will block at connect(). This happens because, even though we are multiplexing file descriptors on the server side, the server is busy with another client. To account for this, complete the TODOs in handle_client() to serve each client in a separate process.

    To test, run ./server in one terminal and ./client in two separate terminals. If successful, the clients should be able to upload at the same time.

If you're having difficulties solving this exercise, go through Async I/O and I/O Multiplexing.

I/O Multiplexing

I/O multiplexing is the ability to serve multiple I/O channels (or anything that can be referenced via a file descriptor / handle) simultaneously. If a given application, such as a server, has multiple sockets on which it serves connections, it may be the case that operating on one socket blocks the server. One solution is using asynchronous operations, with different backends. The other solution is using I/O multiplexing.

The classical functions for I/O multiplexing are select and poll. Due to several limitations, modern operating systems provide advanced (non-portable) alternatives; on Linux, the most prominent one is epoll, described below.

Note that I/O multiplexing is orthogonal to asynchronous I/O. You could tie them together if the completion of the asynchronous operation sends a notification that can be handled via a file descriptor / handle. This is the case with Windows asynchronous I/O (called overlapped I/O).

The epoll API

The epoll API allows user-space programs to efficiently monitor multiple file descriptors and be notified when one of them has data to read. It provides a powerful, event-driven interface concentrated in three primary functions:

  • int epoll_create1(int flags): Creates an epoll instance. The flags argument specifies additional options for the instance; passing 0 selects the default behavior.
  • int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout): Waits for events on the monitored file descriptors.
    • epfd: The file descriptor returned by epoll_create1().
    • events: An array of struct epoll_event that will store the events that have occurred. It only contains events that are ready (i.e., received data).
    • maxevents: The maximum number of events that can be stored in the events array.
    • timeout: The maximum time (in milliseconds) that epoll_wait() will block. A value of -1 means it will block indefinitely.
  • int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event): Modifies the set of file descriptors monitored by epoll.
    • epfd: The file descriptor returned by epoll_create1().
    • The op argument specifies the operation to perform, which can be:
      • EPOLL_CTL_ADD: Adds a file descriptor to the monitoring list.
      • EPOLL_CTL_MOD: Modifies an existing file descriptor’s event list.
      • EPOLL_CTL_DEL: Removes a file descriptor from the monitoring list.
    • The fd argument is the file descriptor to be added, modified, or removed.
    • The event argument is a pointer to a struct epoll_event that defines the events associated with the file descriptor.

The struct epoll_event is the core structure used to interact with epoll. It is used to return events to user space after epoll_wait() is called and to pass parameters to epoll_ctl() when modifying the set of monitored file descriptors. While the internal workings of epoll are complex, understanding how to use these functions and structures will cover most use cases.
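For reference, the structure is declared roughly as follows (see man epoll_ctl). The data member is a union, so you can store either the file descriptor itself or a pointer to your own bookkeeping structure:

typedef union epoll_data {
    void     *ptr;
    int       fd;
    uint32_t  u32;
    uint64_t  u64;
} epoll_data_t;

struct epoll_event {
    uint32_t     events;  // bitmask of events, e.g. EPOLLIN, EPOLLOUT
    epoll_data_t data;    // user data, returned unchanged by epoll_wait()
};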

Here is an example demonstrating how to use the epoll interface:

int efd, rc, n;

efd = epoll_create1(0);
if (efd < 0) {...} // handle error

// Add fd (an already-open descriptor) to the monitored set
struct epoll_event ev;
ev.events = EPOLLIN; // monitor fd for reading
ev.data.fd = fd;
rc = epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
if (rc < 0) {...} // handle error

struct epoll_event events[10];
n = epoll_wait(efd, events, 10, -1); // Wait indefinitely
if (n < 0) {...} // handle error

// Iterate through the events to get the active file descriptors
for (int i = 0; i < n; i++)
    printf("%d received data\n", events[i].data.fd);

Test your epoll understanding by implementing I/O multiplexing in a client-server app.

Asynchronous I/O

Asynchronous I/O (async I/O) provides an efficient way to handle input/output operations that are typically slower than CPU operations by allowing programs to continue executing while waiting for I/O operations to complete. Here’s a breakdown of I/O operation types and how asynchronous I/O compares to synchronous operations:

  1. Synchronous Blocking Operation:

    • This is the simplest and most common form of I/O (e.g., read() and write()).
    • It’s synchronous, meaning the program waits for a response to proceed.
    • It’s blocking, so if the requested data isn’t available (e.g., no data in the buffer for read()), the program waits for the operation to finish.
  2. Synchronous Non-blocking Operation:

    • This gives more control, especially in situations where waiting isn’t ideal.
    • Opening a file with the O_NONBLOCK flag passed to open() makes subsequent operations return immediately instead of blocking (see the sketch after this list).
    • If data isn’t available right away, the call fails with an error such as EAGAIN, and the program can try again later, avoiding unnecessary waiting.
  3. Asynchronous Operation:

    • Here, the function call returns immediately, allowing the program to continue without waiting for the result.
    • A notification or callback is sent when the operation completes, or the program can check its status periodically.
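As a small illustration of the non-blocking case, the sketch below opens a FIFO with O_NONBLOCK and handles the "no data yet" outcome explicitly. The my_fifo path is just an assumption (created beforehand with mkfifo); regular files rarely block on read(), so pipes and sockets are where O_NONBLOCK really matters.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char buf[256];

    int fd = open("my_fifo", O_RDONLY | O_NONBLOCK); // do not block waiting for data
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ssize_t n = read(fd, buf, sizeof(buf));
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("No data available yet, try again later\n");
    else if (n < 0)
        perror("read");
    else
        printf("Read %zd bytes\n", n);

    close(fd);
    return 0;
}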

Keep in mind that async I/O is not the same thing as I/O multiplexing. While both techniques improve I/O efficiency, they’re conceptually different:

  • Asynchronous I/O schedules operations concurrently, allowing the program to proceed without blocking.
  • I/O Multiplexing (e.g., select(), poll(), epoll) monitors multiple channels simultaneously and informs the program when a channel has data ready. Blocking can still occur if the program cannot proceed without data from a channel.

Think of them as complementary: multiplexing helps monitor multiple channels, while async I/O allows the program to do other things while waiting.

There are several ways asynchronous I/O can be implemented in practice:

  1. Multiprocess Backend: Each request runs in a separate process, isolating tasks and preventing blocking between them.

    // Handle requests with processes
    void handle_client_proc(int client_sockfd) {
        pid_t pid = fork();

        if (pid < 0) {...} // handle error

        if (pid == 0) { // Child process: handle client connection
            close(server_sockfd); // Close the listening socket in the child

            {...} // compute and send answer

            close(client_sockfd); // Close client socket when done
            exit(0); // Exit child process
        }

        close(client_sockfd); // Close client socket in parent
    }
  2. Multithreaded Backend: Each request runs in a separate thread, allowing concurrent operations within the same process.

    // Handle requests with threads
    void *handler(void *arg) {
        int client_sockfd = *(int *)arg;

        free(arg); // release the heap buffer allocated by the caller

        {...} // compute and send answer
        close(client_sockfd); // Close client socket when done

        return NULL;
    }

    void handle_client_thread(int sockfd) {
        pthread_t thread_id;
        int *sockfd_p = malloc(sizeof(int)); // use the heap to pass the address

        *sockfd_p = sockfd;

        int rc = pthread_create(&thread_id, NULL, handler, sockfd_p);
        if (rc != 0) {...} // handle error; pthread_create() returns an error number

        pthread_detach(thread_id); // Let the thread clean up after itself
    }
  3. Event-based Backend: An action is scheduled with a callback, which is invoked upon completion, using event loops to manage tasks. A callback is simply a function pointer, allowing the system to execute the function later or when a specific event is triggered.
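To make the "callback is a function pointer" idea concrete, here is a toy sketch that is not tied to any particular event-loop library: an event structure stores a descriptor together with the function to call when that descriptor becomes readable.

#include <stdio.h>

typedef void (*event_cb)(int fd); // callback type: invoked when the event fires

struct event {
    int fd;           // descriptor being watched
    event_cb on_read; // what to do when fd becomes readable
};

static void handle_client_data(int fd)
{
    printf("fd %d is readable; compute and send the answer here\n", fd);
}

int main(void)
{
    struct event ev = { .fd = 0 /* stdin */, .on_read = handle_client_data };

    // A real backend would epoll_wait() (or similar) and dispatch;
    // here we simulate a single "readable" notification.
    ev.on_read(ev.fd);

    return 0;
}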

Test your understanding by solving the Async Server task.

Zero-Copy

Imagine a server that responds with files that it stores locally. Its actions would be those highlighted in the image below:

  1. Receive a new request and extract the filename
  2. Read the file from the disk into memory
  3. Send the file from memory to the client

Client-Server Steps

Quiz: How many copies does the OS make?

As you might have guessed, 2 of these copies are useless. Since the app doesn't modify the file, there's no need for it to store the file in its own buffer. It would be more efficient to use a single kernel buffer as intermediate storage between the disk and the NIC (Network Interface Card), as shown in the image below.

Server Copies - Zero-Copy

For an easier comparison with the "classic" read() + send() model, here's the first version again:

Server Copies - Read-Send

It should be obvious that the zero-copy approach is more efficient than the read() + send() one.

Quiz: Almost zero copies

These diagrams capture the essence of zero-copy: transferring data directly between kernel buffers, avoiding intermediate user-space buffers. This approach is ideal for serving requests, whether forwarding data between clients or reading from disk. It relies on the OS to retrieve and send data efficiently without extra copying steps.

sendfile()

The syscall with which we can leverage zero-copy is called sendfile(). Here are some practical examples of how to use it:

// file to file
int in_fd = open("input_file.txt", O_RDONLY); // src
int out_fd = open("output_file.txt", O_WRONLY); // dst
off_t offset = 0; // updated by sendfile() as data is transferred

ssize_t bytes_sent = sendfile(out_fd, in_fd, &offset, 4096); // Transfer 4096 bytes
if (bytes_sent < 0) {...} // handle error

// file to network
int in_fd = open("input_file.txt", O_RDONLY); // src
int sockfd = socket(AF_INET, SOCK_STREAM, 0); // dst
off_t offset = 0;

int rc = connect(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr)); // server_addr filled in beforehand
if (rc < 0) {...} // handle error

ssize_t bytes_sent = sendfile(sockfd, in_fd, &offset, 4096); // Transfer 4096 bytes
if (bytes_sent < 0) {...} // handle error
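One caveat: sendfile() may transfer fewer bytes than requested, so sending a whole (large) file requires a loop. A minimal sketch, assuming in_fd is an open file and sockfd an already-connected socket:

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <sys/types.h>

int send_whole_file(int sockfd, int in_fd)
{
    struct stat st;
    off_t offset = 0;

    if (fstat(in_fd, &st) < 0)
        return -1; // handle error

    while (offset < st.st_size) {
        ssize_t sent = sendfile(sockfd, in_fd, &offset, st.st_size - offset);
        if (sent < 0)
            return -1; // handle error
    }

    return 0;
}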

You can read a slightly more detailed article about zero-copy here.

Guide: Benchmarking sendfile()

Navigate to chapters/io/optimizations/guides/benchmarking-sendfile/support and run make to prepare the test file.

sendfile() transfers data between two file descriptors directly within the kernel, bypassing user-space buffers. This process is known as zero-copy. Having established that sendfile() is likely faster than traditional I/O operations, it's time to put this theory to the test!

The code in server.py creates two threads that behave nearly identically. One listens on port 8081 and handles connections using read() and send(), while the other listens on port 8082 and uses sendfile(). Start the server in one terminal with python server.py. In a second terminal, run benchmark_client.py read-send followed by benchmark_client.py sendfile to compare the performance.

The results below are generic, and your outcomes may vary significantly depending on factors such as disk performance, network interface card (NIC), kernel version, Python version, and system load.

student@os:/.../benchmarking-sendfile/support$ python3 benchmark_client.py read-send
Time taken: 7.175773588009179 seconds

student@os:/.../benchmarking-sendfile/support$ python3 benchmark_client.py sendfile
Time taken: 3.71454380400246 seconds

Using sendfile() halves the number of copies required, reducing it from 4 to 2, which matches the roughly halved running time observed above and makes sendfile() a clear performance improvement over traditional methods.

You can explore another example of zero-copy in practice in this Quiz: Why is cp faster than mmap()-based cp.

Guide: Kernel Caching

I/O is critical to system efficiency, but also often its weakest link. Techniques to improve I/O performance include buffering, zero-copy, and async I/O. Among these, buffering is the most common and powerful.

Remember buffering/support/benchmark_buffering.sh or file-mappings/support/benchmark_cp.sh. They both used this line:

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

The line invalidates caches, forcing the OS to perform I/O operations directly from the disk. This ensures that the scripts benchmark the C code's performance alone, without any speedup from cached data.

The kernel goes even further with buffering. This time, it’s at a level beyond syscalls like read() and write(), using a strategy known as caching. While buffering helps with handling the next data efficiently by reading in advance or delaying writes, caching is about speeding up repeated access to the same data. Just as your browser caches frequently visited pages or your CPU caches recent addresses, the OS caches files that are accessed often, such as logs or configuration files.

When the OS encounters a file access, it stores portions of that file in memory so that subsequent requests can read or modify data from RAM rather than waiting on the slower disk. This makes I/O faster.

Caching in action

Navigate to chapters/io/optimizations/guides/kernel-caching/support and run make to create a large file that we'll use for benchmarking. We have two scripts to benchmark the cp command with and without caching:

student@os:/.../kernel-caching/support$ ./benchmark_cp.sh
make: 'large-file.txt' is up to date.
Benchmarking cp on a 1 GB file...

real 0m1.473s
user 0m0.001s
sys 0m0.985s

student@os:/.../kernel-caching/support$ ./benchmark_cp_allow_caching.sh
make: 'large-file.txt' is up to date.
Benchmarking cp on a 1 GB file...

real 0m0.813s
user 0m0.000s
sys 0m0.837s

Each subsequent benchmark actually reads the data from the caches populated or refreshed by the previous one. So running the script multiple times might improve the results.

You can use free -h to view how much data your kernel is caching. Look at the buff/cache column. One possible output is shown below. It says the OS is caching 7 GB of data.

student@os:~$ free -h
               total        used        free      shared  buff/cache   available
Mem:            15Gi       8,1Gi       503Mi       691Mi       7,0Gi       6,5Gi
Swap:          7,6Gi       234Mi       7,4Gi

Guide: Async

Enter the chapters/io/optimizations/guide/async/ folder for some implementations of a simple request-reply server in C: a server that computes the n-th Fibonacci number. The server serves requests in different ways:

  • synchronous server: server.c
  • multiprocess server: mp_server.c
  • multithreaded server: mt_server.c

Note: There is no asynchronous C variant, because of the unstable API of libaio and io_uring.

  1. Benchmark the synchronous server to have a reference point. Run ./server 2999 in one terminal and time ./client_bench.sh 2999 in another. client_bench.sh will run 8 instances of client.py, each making a request.

    student@os:~/async/support$ time ./client_bench.sh 2000
    [...]
    root: Connected to localhost:2000
    root: Sending 30
    function(30): 1346269

    real 0m1.075s
    user 0m0.301s
    sys 0m0.029s

    The value you obtain might be different, but you should observe a speed-up when benchmarking the other two solutions.

  2. Benchmark the multiprocess and multithreaded alternatives and see which one got the best speed increase. You can obtain more relevant values by tweaking parameters in client_bench.sh. For example, you could increase the number of clients or the fibonacci value to compute.

If you're having difficulties understanding the support code, go through this reading material. If you want to practice this yourself, go through the Async Server task.