# Lab 11 - I/O Optimizations
## Task: Ordered Client-Server Communication
Navigate to `chapters/io/ipc/drills/tasks/client-server/` and run `make` to generate the `support` directory.
This exercise will guide you in creating a basic messaging protocol between a server and a client.
Although in real-world applications a server typically handles multiple connections at once, here we focus on a single connection.
Handling multiple connections is further explored in I/O multiplexing.
Our application protocol is defined as follows:
- The server listens for connections on localhost and a specified port.
- The client connects to `localhost:port`.
- The client sends a message, which the server prints and responds to; the client then prints the reply. This sequence repeats in a loop.
- The communication ends when either the client or the server sends the message `exit`.
Since we are blocking on `recv()`, the message order is fixed: the client must initiate communication.
In real-world applications, this constraint can be avoided with I/O multiplexing.
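To make the fixed ordering concrete, here is a minimal sketch of the client's loop; it assumes an already-connected socket `sockfd` and simplified error handling, and the skeleton in `support/client.c` may structure things differently:

```c
/* Minimal sketch of the client's ordered exchange loop.
 * Assumes sockfd is an already-connected TCP socket. */
char buf[BUFSIZ];

while (1) {
    /* The client initiates: read a line from the user and send it. */
    if (fgets(buf, sizeof(buf), stdin) == NULL)
        break;
    send(sockfd, buf, strlen(buf), 0);
    if (strncmp(buf, "exit", 4) == 0)
        break;

    /* Then block in recv() until the server replies. */
    ssize_t n = recv(sockfd, buf, sizeof(buf) - 1, 0);

    if (n <= 0)
        break;
    buf[n] = '\0';
    printf("Server: %s", buf);
    if (strncmp(buf, "exit", 4) == 0)
        break;
}
```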
Open `support/client.c` and complete the TODOs to enable message exchange with the server. Test your client by running `python server.py` in one terminal and then `./client` in another. If correctly implemented, you should be able to exchange messages as outlined above.

Bonus Question: Why is it OK for the client to be implemented in C while the server is implemented in Python?
Open `support/server.c` and complete the TODOs to enable message exchange with the client. Test your server by running `./server` in one terminal and then `python client.py` in another. If implemented correctly, you should be able to exchange messages as outlined above.
If you're having difficulties solving this exercise, go through the sockets API and the client-server model.
## Task: Multiplexed Client-Server
Navigate to `chapters/io/optimizations/drills/tasks/multiplexed-client-server` and run `make` to generate the `support` files.
This task builds on the previous implementation of a client-server ordered communication. Previously, the client and server followed a strict, sequential communication pattern: each peer had to send a message and wait for a reply before proceeding.
We plan to build a group chat where clients can send messages at any time, and each message is broadcast to all other connected clients. To accomplish this, we’ll implement I/O multiplexing mechanisms that notify us only when data is available on a file descriptor. This non-blocking approach allows for smooth, unhindered communication between clients.
We’ll use the `epoll` interface to manage multiple file descriptors. Begin by opening `w_epoll.h` and completing the TODOs. We will define wrapper functions to add and remove file descriptors from the `epoll` instance, making the main code more readable. Note: ensure that each wrapper returns the result of the `epoll_ctl()` call for error handling.
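The wrappers might look like the sketch below; the exact names and signatures in the provided `w_epoll.h` skeleton may differ:

```c
// Possible shape of the epoll wrappers; the names and signatures in the
// provided w_epoll.h skeleton may differ.
#include <sys/epoll.h>

static inline int w_epoll_add_fd_in(int epollfd, int fd) {
    struct epoll_event ev;

    ev.events = EPOLLIN; // notify when fd is ready for reading
    ev.data.fd = fd;

    // Return epoll_ctl()'s result so callers can handle errors.
    return epoll_ctl(epollfd, EPOLL_CTL_ADD, fd, &ev);
}

static inline int w_epoll_remove_fd(int epollfd, int fd) {
    // A NULL event is allowed for EPOLL_CTL_DEL on Linux >= 2.6.9.
    return epoll_ctl(epollfd, EPOLL_CTL_DEL, fd, NULL);
}
```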
Complete the TODOs in `support/client.c` to enable multiplexing of the available file descriptors. The file descriptors are `stdin` (for receiving user messages) and `sockfd` (for communication with the server). Use `epoll_create()`, `epoll_wait()`, and the wrappers defined in `w_epoll.h` to handle these descriptors without blocking. Remember to close the sockets before exiting.

To test, start `python server.py` in one terminal and run your client implementation in two separate terminals. If successful, the clients should be able to communicate through the server.
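A possible shape for the client's event loop follows; it is only a sketch, assuming an `epoll` instance `epollfd` that already monitors `STDIN_FILENO` and `sockfd`, and a `buf` array:

```c
/* Sketch of the client's event loop: epollfd monitors both stdin and
 * the server socket; error handling is trimmed for brevity. */
struct epoll_event rev;

while (1) {
    if (epoll_wait(epollfd, &rev, 1, -1) < 0)
        break; /* handle error */

    if (rev.data.fd == STDIN_FILENO) {
        /* The user typed a message: forward it to the server. */
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));

        if (n <= 0)
            break;
        send(sockfd, buf, n, 0);
    } else if (rev.data.fd == sockfd) {
        /* The server sent a message: print it. */
        ssize_t n = recv(sockfd, buf, sizeof(buf) - 1, 0);

        if (n <= 0)
            break;
        buf[n] = '\0';
        printf("%s", buf);
    }
}
```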
Complete the TODOs in `support/server.c` to multiplex I/O with clients. You will need to create an `epoll` instance and dynamically add and remove clients as they connect and disconnect. Use `epoll_create()`, `epoll_wait()`, and the wrappers defined in `w_epoll.h` to achieve this functionality. Remember to remove the client sockets from the `epoll` instance before closing them.

To test your implementation, run `./server` in one terminal and `./client` (or `python client.py`) in two separate terminals. If everything works correctly, the clients should be able to communicate with each other via the server.
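On the server side, the loop must also accept new connections and register them with `epoll`. Here is a sketch of the dispatch step, reusing the hypothetical wrappers sketched above:

```c
/* Sketch of the server's dispatch step: the listening socket and all
 * client sockets are monitored by the same epoll instance. */
if (rev.data.fd == listenfd) {
    /* New connection: accept it and start monitoring it. */
    int connfd = accept(listenfd, NULL, NULL);

    if (connfd >= 0)
        w_epoll_add_fd_in(epollfd, connfd);
} else {
    ssize_t n = recv(rev.data.fd, buf, sizeof(buf), 0);

    if (n <= 0) {
        /* Client disconnected: remove it from epoll before closing. */
        w_epoll_remove_fd(epollfd, rev.data.fd);
        close(rev.data.fd);
    } else {
        /* Broadcast the message to every other connected client. */
    }
}
```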
If you're having difficulties solving this exercise, go through the `epoll` API from I/O Multiplexing.
## Task: Async Server
Navigate to `chapters/io/optimizations/drills/tasks/async-server` and run `make` to generate the `support` files. Enter `support` and run `make test-file.txt` to generate the test file.
This task builds on the previous example of a multiplexed client-server.
The server accepts connections from clients and downloads a 1 GB file from each. After uploading the file, the clients close the connection.
Open `server.c` and complete the TODOs in the main function to set up I/O multiplexing using `epoll`. Use `epoll_create()`, `epoll_wait()`, and the wrappers defined in `w_epoll.h` to handle descriptors without blocking. Remember to remove the client sockets from the `epoll` instance before closing them.

To test, run `./server` in one terminal and `./client` in another terminal. If successful, the clients should print the upload progress.

There is a problem with our current implementation. Try to start two clients at the same time: the first one will start uploading, and the second one will block at `connect()`. This happens because, even though we are multiplexing file descriptors on the server side, the server is busy with another client. To account for this, complete the TODOs in `handle_client()` to serve each client in a separate process.

To test, start `./server` in one terminal and run your client implementation in two separate terminals. If successful, the clients should be able to upload at the same time.
If you're having difficulties solving this exercise, go through Async I/O and I/O Multiplexing.
## I/O Multiplexing
I/O multiplexing is the ability to serve multiple I/O channels (or anything that can be referenced via a file descriptor / handle) simultaneously. If a given application, such as a server, has multiple sockets on which it serves connections, operating on one socket may block the server. One solution is using asynchronous operations, with different backends. The other solution is using I/O multiplexing.
The classical functions for I/O multiplexing are `select` and `poll`. Due to several limitations, modern operating systems provide advanced (non-portable) variants of these:

- Windows provides I/O completion ports (IOCP).
- BSD provides `kqueue`.
- Linux provides `epoll`.
Note that I/O multiplexing is orthogonal to asynchronous I/O. You could tie them together if the completion of the asynchronous operation sends a notification that can be handled via a file descriptor / handle. This is the case with Windows asynchronous I/O (called overlapped I/O).
## The `epoll` API
The `epoll` API allows user-space programs to efficiently monitor multiple file descriptors and be notified when one of them has data to read. It provides a powerful, event-driven interface concentrated in three primary functions:
- `int epoll_create1(int flags)`: Creates an `epoll` instance. The `flags` argument specifies additional options for the instance. The default value is `0`.
- `int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout)`: Waits for events on the monitored file descriptors.
  - `epfd`: The file descriptor returned by `epoll_create1()`.
  - `events`: An array of `struct epoll_event` that will store the events that have occurred. It only contains events that are ready (i.e., received data).
  - `maxevents`: The maximum number of events that can be stored in the `events` array.
  - `timeout`: The maximum time (in milliseconds) that `epoll_wait()` will block. A value of `-1` means it will block indefinitely.
- `int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event)`: Modifies the set of file descriptors monitored by `epoll`.
  - `epfd`: The file descriptor returned by `epoll_create1()`.
  - The `op` argument specifies the operation to perform, which can be:
    - `EPOLL_CTL_ADD`: Adds a file descriptor to the monitoring list.
    - `EPOLL_CTL_MOD`: Modifies an existing file descriptor's event list.
    - `EPOLL_CTL_DEL`: Removes a file descriptor from the monitoring list.
  - The `fd` argument is the file descriptor to be added, modified, or removed.
  - The `event` argument is a pointer to a `struct epoll_event` that defines the events associated with the file descriptor.
The `struct epoll_event` is the core structure used to interact with `epoll`. It is used to return events to user space after `epoll_wait()` is called and to pass parameters to `epoll_ctl()` when modifying the set of monitored file descriptors. While the internal workings of `epoll` are complex, understanding how to use these functions and structures will cover most use cases.
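For reference, `struct epoll_event` and its data union are declared in `<sys/epoll.h>` roughly as follows:

```c
typedef union epoll_data {
    void *ptr;
    int fd;
    uint32_t u32;
    uint64_t u64;
} epoll_data_t;

struct epoll_event {
    uint32_t events;   /* bitmask of events, e.g. EPOLLIN, EPOLLOUT */
    epoll_data_t data; /* user data, returned as-is by epoll_wait() */
};
```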
Here is an example demonstrating how to use the `epoll` interface:
```c
int efd, rc, n;

efd = epoll_create1(0);
if (efd < 0) {...} // handle error

// Add fd (an already-open file descriptor) to the monitored set
struct epoll_event ev;

ev.events = EPOLLIN; // monitor fd for reading
ev.data.fd = fd;
rc = epoll_ctl(efd, EPOLL_CTL_ADD, fd, &ev);
if (rc < 0) {...} // handle error

struct epoll_event events[10];

n = epoll_wait(efd, events, 10, -1); // Wait indefinitely
if (n < 0) {...} // handle error

// Iterate through the events to get the active file descriptors
for (int i = 0; i < n; i++)
    printf("%d received data\n", events[i].data.fd);
```
Test your `epoll` understanding by implementing I/O multiplexing in a client-server app.
## Asynchronous I/O
Asynchronous I/O (async I/O) provides an efficient way to handle input/output operations that are typically slower than CPU operations by allowing programs to continue executing while waiting for I/O operations to complete. Here’s a breakdown of I/O operation types and how asynchronous I/O compares to synchronous operations:
Synchronous Blocking Operation:
- This is the simplest and most common form of I/O (e.g., `read()` and `write()`).
- It's synchronous, meaning the program waits for a response to proceed.
- It's blocking, so if the requested data isn't available (e.g., no data in the buffer for `read()`), the program waits for the operation to finish.
Synchronous Non-blocking Operation:
- This gives more control, especially in situations where waiting isn’t ideal.
- Opening a file with the `O_NONBLOCK` flag passed to `open()` ensures operations return immediately instead of blocking (see the sketch after this list).
- If data isn't available right away, the operation notifies the program, which can try again later, avoiding unnecessary waiting.
Asynchronous Operation:
- Here, the function call returns immediately, allowing the program to continue without waiting for the result.
- A notification or callback is sent when the operation completes, or the program can check its status periodically.
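To illustrate the non-blocking case, here is a minimal sketch; the FIFO path is hypothetical, and note that `O_NONBLOCK` matters for FIFOs, sockets, and devices rather than regular files:

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* "/tmp/demo_fifo" is a hypothetical FIFO created beforehand
     * (e.g., with mkfifo). */
    int fd = open("/tmp/demo_fifo", O_RDONLY | O_NONBLOCK);
    char buf[4096];

    ssize_t n = read(fd, buf, sizeof(buf));

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        printf("No data yet; do other work and retry later\n");

    close(fd);
    return 0;
}
```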
Keep in mind that async I/O is not the same thing as I/O multiplexing. While both techniques improve I/O efficiency, they're conceptually different:
- Asynchronous I/O schedules operations concurrently, allowing the program to proceed without blocking.
- I/O Multiplexing (e.g., `select()`, `poll()`, `epoll`) monitors multiple channels simultaneously and informs the program when a channel has data ready. Blocking could still occur if the program cannot proceed without data from a channel.
Think of them as complementary: multiplexing helps monitor multiple channels, while async I/O allows the program to do other things while waiting.
There are several ways asynchronous I/O can be implemented in practice:
Multiprocess Backend: Each request runs in a separate process, isolating tasks and preventing blocking between them.
```c
// Handle requests with processes
void handle_client_proc(int client_sockfd) {
    pid_t pid = fork();

    if (pid < 0) {...} // handle error
    if (pid == 0) { // Child process: handle client connection
        close(server_sockfd); // Close the server socket in the child
        {...} // compute and send answer
        close(client_sockfd); // Close client socket when done
        exit(0); // Exit child process
    }
    close(client_sockfd); // Close client socket in parent
}
```

Multithreaded Backend: Each request runs in a separate thread, allowing concurrent operations within the same process.
```c
// Handle requests with threads
void *handler(void *arg) {
    int client_sockfd = *(int *)arg;

    free(arg); // the argument was allocated on the heap by the caller
    {...} // compute and send answer
    close(client_sockfd); // Close client socket when done
    return NULL;
}

void handle_client_thread(int sockfd) {
    pthread_t thread_id;
    int *sockfd_p = malloc(sizeof(int)); // use the heap to pass the address

    *sockfd_p = sockfd;
    int rc = pthread_create(&thread_id, NULL, handler, sockfd_p);

    if (rc != 0) {...} // handle error (pthread_create returns an errno value)
    pthread_detach(thread_id); // Let the thread clean up after itself
}
```

Event-based Backend: An action is scheduled with a callback, which is invoked upon completion, using event loops to manage tasks. A callback is simply a function pointer, allowing the system to execute the function later or when a specific event is triggered.
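To make the callback idea concrete, here is a minimal sketch; the type and function names (`callback_t`, `struct event`, `on_request`) are hypothetical:

```c
// A callback is just a function pointer stored next to the file
// descriptor it serves; the event loop invokes it when the fd is ready.
typedef void (*callback_t)(int fd);

struct event {
    int fd;
    callback_t on_ready; // invoked when fd becomes ready
};

void on_request(int fd) {
    {...} // compute and send the answer on fd
}

/* In the event loop: once epoll_wait() (or a similar mechanism)
 * reports ev->fd as ready, run the registered callback. */
void dispatch(struct event *ev) {
    ev->on_ready(ev->fd);
}
```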
Test your understanding by solving the Async Server task.
## Zero-Copy
Imagine a server that responds with files that it stores locally. Its actions would be those highlighted in the image below:
- Receive a new request and extract the filename
- Read the filename from the disk into memory
- Send the file from memory to the client
Quiz: How many copies does the OS make?
As you might have guessed, 2 of these copies are useless. Since the app doesn't modify the file, there's no need for it to store the file in its own buffer. It would be more efficient to use a single kernel buffer as intermediate storage between the disk and the NIC (Network Interface Card), as shown in the image below.
For an easier comparison with the "classic" `read()` + `send()` model, here's the first version again:
It should be obvious that the zero-copy approach is more efficient than the classic one.
These diagrams capture the essence of zero-copy: transferring data directly between kernel buffers, avoiding intermediate user-space buffers. This approach is ideal for serving requests, whether forwarding data between clients or reading from disk. It relies on the OS to retrieve and send data efficiently without extra copying steps.
### `sendfile()`
The syscall with which we can leverage zero-copy is called `sendfile()`. Here are some practical examples of how to use it:
```c
// file to file
off_t offset = 0;
int in_fd = open("input_file.txt", O_RDONLY);                   // src
int out_fd = open("output_file.txt", O_WRONLY | O_CREAT, 0644); // dst

ssize_t bytes_sent = sendfile(out_fd, in_fd, &offset, 4096); // Transfer 4096 bytes
if (bytes_sent < 0) {...} // handle error
```

```c
// file to network
off_t offset = 0;
int in_fd = open("input_file.txt", O_RDONLY); // src
int sockfd = socket(AF_INET, SOCK_STREAM, 0); // dst

int rc = connect(sockfd, (struct sockaddr *)&server_addr, sizeof(server_addr));
if (rc < 0) {...} // handle error

ssize_t bytes_sent = sendfile(sockfd, in_fd, &offset, 4096); // Transfer 4096 bytes
if (bytes_sent < 0) {...} // handle error
```
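Note that `sendfile()` may transfer fewer bytes than requested, so real code typically calls it in a loop. A sketch, assuming the descriptors above and that `file_size` was obtained beforehand (e.g., via `fstat()`):

```c
// Keep calling sendfile() until the whole file has been transferred.
// sendfile() advances offset by the number of bytes it actually sent.
off_t offset = 0;

while (offset < file_size) {
    ssize_t sent = sendfile(sockfd, in_fd, &offset, file_size - offset);

    if (sent < 0) {...} // handle error
}
```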
You can read a slightly more detailed article about zero-copy here.
## Guide: Benchmarking `sendfile()`
Navigate to `chapters/io/optimizations/guides/benchmarking-sendfile/support` and run `make` to prepare the test file.
`sendfile()` transfers data between two file descriptors directly within the kernel, bypassing user-space buffers. This process is known as zero-copy.
Having established that `sendfile()` is likely faster than traditional I/O operations, it's time to put this theory to the test!
The code in `server.py` creates two threads that behave nearly identically. One listens on port `8081` and handles connections using `read()` and `send()`, while the other listens on port `8082` and uses `sendfile()`.
Start the server in one terminal with `python server.py`. In a second terminal, run `benchmark_client.py read-send` followed by `benchmark_client.py sendfile` to compare the performance.
The results below are only indicative; your outcomes may vary significantly depending on factors such as disk performance, network interface card (NIC), kernel version, Python version, and system load.
```console
student@os:/.../benchmarking-sendfile/support$ python3 benchmark_client.py read-send
Time taken: 7.175773588009179 seconds
student@os:/.../benchmarking-sendfile/support$ python3 benchmark_client.py sendfile
Time taken: 3.71454380400246 seconds
```
Using `sendfile()` halves the number of copies required, reducing it from 4 to 2. This should translate to a roughly halved running time as well, making `sendfile()` a clear performance improvement over traditional methods.
You can explore another example of zero-copy in practice in this quiz: Why is `cp` faster than `mmap()`-based `cp`?
## Guide: Kernel Caching
I/O is critical to system efficiency, but also often its weakest link. Techniques to improve I/O performance include buffering, zero-copy, and async I/O. Among these, buffering is the most common and powerful.
Remember `buffering/support/benchmark_buffering.sh` or `file-mappings/support/benchmark_cp.sh`? They both used this line:

```bash
sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
```
The line invalidates caches, forcing the OS to perform I/O operations directly from the disk. This ensures that the scripts benchmark the C code's performance alone, without any speedup from cached data.
The kernel goes even further with buffering. This time, it’s at a level beyond syscalls like `read()` and `write()`, using a strategy known as caching.
While buffering helps with handling the next data efficiently by reading in advance or delaying writes, caching is about speeding up repeated access to the same data.
Just as your browser caches frequently visited pages or your CPU caches recent addresses, the OS caches files that are accessed often, such as logs or configuration files.
When the OS encounters a file access, it stores portions of that file in memory so that subsequent requests can read or modify data from RAM rather than waiting on the slower disk. This makes I/O faster.
### Caching in action
Navigate to `chapters/io/optimizations/guides/kernel-caching/support` and run `make` to create a large file that we'll use for benchmarking.
We have two scripts to benchmark the `cp` command with and without caching:
```console
student@os:/.../kernel-caching/support$ ./benchmark_cp.sh
make: 'large-file.txt' is up to date.
Benchmarking cp on a 1 GB file...

real	0m1.473s
user	0m0.001s
sys	0m0.985s

student@os:/.../kernel-caching/support$ ./benchmark_cp_allow_caching.sh
make: 'large-file.txt' is up to date.
Benchmarking cp on a 1 GB file...

real	0m0.813s
user	0m0.000s
sys	0m0.837s
```
Each subsequent benchmark actually reads the data from the caches populated or refreshed by the previous one. So running the script multiple times might improve the results.
You can use `free -h` to view how much data your kernel is caching. Look at the `buff/cache` column. One possible output is shown below; it says the OS is caching 7 GB of data.
```console
student@os:~$ free -h
              total        used        free      shared  buff/cache   available
Mem:           15Gi       8,1Gi       503Mi       691Mi       7,0Gi       6,5Gi
Swap:          7,6Gi       234Mi       7,4Gi
```
## Guide: Async
Enter the `chapters/io/optimizations/guide/async/` folder for some implementations of a simple request-reply server in C.
Here we have the implementation of a server that computes the n-th Fibonacci number.
The server serves requests in different ways:
- synchronous server: `server.c`
- multiprocess server: `mp_server.c`
- multithreaded server: `mt_server.c`
Note: There is no asynchronous C variant because of the unstable APIs of `libaio` and `io_uring`.
. For example, you could increase the number of clients or the fibonacci value to compute.Begin by benchmarking the synchronous server to establish a baseline. Run
./server 2999
in one terminal andtime ./client_bench.sh 2999
in another. Theclient_bench.sh
script will initiate8
instances ofclient.py
, each making a request. The output might look like this:student@os:~/async/support$ time ./client_bench.sh 2000
[...]
root: Connected to localhost:2000
root: Sending 30
function(30): 1346269
real 0m1.075s
user 0m0.301s
sys 0m0.029sAlthough the actual value may vary, you should observe a noticeable speed-up when testing the other two solutions.
Next, benchmark the multiprocess and multithreaded alternatives to determine which offers the best performance improvement. To obtain more meaningful results, adjust parameters in
client_bench.sh
, such as increasing the number of clients or the Fibonacci value to compute.
If you're having difficulties understanding the support code, go through this reading material. If you want to practice this yourself, go through the Async Server task.