# IO

1. File Interface
1. Interprocess Communication
1. IO Optimizations

---

## Backstory

---

- You finish your summer internship
- You explain the cool internal parts of the project to your friends
- They hit you with: "How do I interact with it?"
- You describe the data gathering algorithms again and see them lose interest

What was missing?

----

### Interactive Software is Good Software

![interactive software](overview/media/interactive-software.png)

----

### Interactivity

- Software does computation on data and returns data
- The more data we provide to our application, the more relevant its result becomes

---

### Roadmap

----

![roadmap](overview/media/roadmap.svg)

----

![roadmap Data](overview/media/roadmap-Data.svg)

----

![roadmap Compute](overview/media/roadmap-Compute.svg)

----

![roadmap IO](overview/media/roadmap-IO.svg)

----

### What We Want

- Means for our application to communicate
- Compatibility with various devices
- Responsiveness (Non-blocking I/O)
- Performance (fast transfer, I/O multiplexing)

---

## File Interface

---

### File Interface - `open()`

![open](file-descriptors/media/file-interface-open.svg)

----

- Creates a new session to communicate with the file
- Returns a **file handle** - used to reference the **communication channel**

---

### File Descriptor

- In higher-level languages the file handle is an object that allows working with files
- Implementation-wise, it is a small non-negative integer (between **0** and **1023** with the usual default limit of 1024 open files per process)
- To keep track of open files, each process holds a **FDT (File Descriptor Table)**
- Each entry in the **FDT** is either null or a pointer to an **Open File Structure**

----

### File Descriptor Table

![File Descriptor Table](file-descriptors/media/file-descriptor-table.svg)

----

### Open File Structure

- Contains:
  - Permissions
  - The offset inside the file
  - Number of handles referencing it
  - **inode** (pointer to data and metadata)
- The OS keeps track of all **Open File Structures** and stores them in the **Open File Table**

---

### 
File Interface - `read()`/`write()`

![read write](file-descriptors/media/file-interface-read-write.svg)

----

### File Interface - `read()`

- Uses the **file handle** to read bytes
- Returns the number of bytes read
  - **-1** - read failed
  - **0** - EOF reached
  - **< `num`** - partial read

----

### File Interface - `write()`

- Uses the **file handle** to write bytes
- Returns the number of bytes written
  - **-1** - write failed
  - **< `num`** - partial write

---

### File Interface - `close()`

![close](file-descriptors/media/file-interface-close.svg)

----

- Decrements the reference count in the Open File Structure
- Deletes the Open File Structure if the reference count is 0
- Does not guarantee that buffered data has reached the disk (use `fsync()` for that)

---

### File Interface - `lseek()`

![lseek](file-descriptors/media/file-interface-lseek.svg)

----

- Uses the **file handle** to update the **offset** in the file
- Returns the new **offset** in the file
  - **-1** on error

----

### Sparse file

What happens if the **offset** is bigger than the file size?

- `demo/file-interface/sparse-file.c`

---

### File Interface - `ftruncate()`

![ftruncate](file-descriptors/media/file-interface-ftruncate.svg)

----

- Truncates the file indicated by the **file handle** to the indicated **length**
- If the file size is smaller than **length**, the file is extended and "filled" with binary zeros up to the indicated **length**

----

### `ftruncate()` demo

- `demo/file-interface/truncate.c`

---

### File Interface in Python

- `demo/file-interface/py-file-ops.py`

----

### File Interface in C

- `demo/file-interface/c-file-ops.c`

---

### File Interface - `dup()`

![dup](file-descriptors/media/file-interface-dup.svg)

----

- Results in two **file handles** that refer to the same Open File Structure
- Duplicates a file descriptor into the smallest unused file descriptor

----

### `open()` vs `dup()` demo

- `demo/file-interface/open-vs-dup.c`
- `demo/file-interface/close-stdout.c`
- What could go wrong between **closing** `stdout` and **calling dup**?
  - "Everything."
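----

### `open()` vs `dup()` sketch

A minimal sketch of the difference described above, not the course demo itself. It assumes a scratch file can be created under `/tmp`; the function name `compare_offsets` and the file name pattern are made up for illustration. A descriptor obtained with `dup()` shares the Open File Structure (and so the offset), while a second `open()` gets its own.

```c
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Writes 5 bytes through one descriptor, then reports the offset seen
 * through a dup()'d descriptor and through a separately open()'d one.
 * Returns 0 on success, -1 on error.
 */
int compare_offsets(off_t *dup_offset, off_t *open_offset)
{
    char path[] = "/tmp/open-vs-dup-XXXXXX"; /* hypothetical scratch file */
    int status = -1;

    int fd = mkstemp(path);
    if (fd < 0)
        return -1;

    int fd_dup = dup(fd);               /* same Open File Structure as fd */
    int fd_open = open(path, O_RDONLY); /* new Open File Structure */

    if (fd_dup >= 0 && fd_open >= 0 && write(fd, "hello", 5) == 5) {
        /* lseek(fd, 0, SEEK_CUR) only queries the current offset. */
        *dup_offset = lseek(fd_dup, 0, SEEK_CUR);
        *open_offset = lseek(fd_open, 0, SEEK_CUR);
        status = 0;
    }

    if (fd_dup >= 0)
        close(fd_dup);
    if (fd_open >= 0)
        close(fd_open);
    close(fd);
    unlink(path);
    return status;
}
```

After the write, the `dup()`'d descriptor reports the advanced offset, while the descriptor from the second `open()` still sits at the start of the file.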
----

### File Interface - `dup2()`

- Duplicates a file descriptor into a **designated** file descriptor
- If the new file descriptor is open, it will be closed before being reused
- This action will be performed **atomically**

---

## Devices

---

### I/O Devices

- A hardware component that communicates with our application through bytes
- Various devices fit this description

----

![mouse keyboard](file-descriptors/media/dev-mouse-keyboard.png)

----

![hard disk](file-descriptors/media/dev-storage.png)

----

![sensors](file-descriptors/media/dev-sensors.png)

----

![devices](file-descriptors/media/devices.png)

How does the OS operate them?

---

### Device Types

| Char Devices          | Block Devices             |
| :-------------------: | :-----------------------: |
| read/write one byte   | read/write blocks of data |
| slow                  | fast                      |
| no seek               | seek                      |
| no buffering          | buffering                 |

----

### Representing I/O Devices

![char and block devices](file-descriptors/media/char-block-devices.svg)

Linux abstracts an I/O device as a **special file**

----

### Device Special Files

- Device files are in `/dev`
- The letter before permissions describes the file type (`c` for char devices, `b` for block devices)

```console
student@OS:~$ ls -l /dev/
crw-rw-rw-  1 root root   1, 3 nov 6 20:31 null
crw-rw-rw-  1 root root   1, 8 nov 6 20:31 random
brw-rw----  1 root disk   8, 1 nov 6 20:31 sda1
brw-rw----  1 root disk   8, 2 nov 6 20:31 sda2
brw-rw----  1 root disk   8, 5 nov 6 20:31 sda5
brw-rw----  1 root disk   8, 6 nov 6 20:31 sda6
crw-rw-rw-  1 root tty    5, 0 nov 6 20:31 tty
crw-rw-rw-  1 root root   1, 9 nov 6 20:31 urandom
crw-rw----+ 1 root video 81, 0 nov 6 20:31 video0
crw-rw----+ 1 root video 81, 1 nov 6 20:31 video1
crw-rw-rw-  1 root root   1, 5 nov 6 20:31 zero
...
```

----

### Device Types demo

- `demo/devices/read-from-device.sh`

---

### Using Devices

- Devices are abstracted as files and follow the **File Interface**
- Can we simply read from a device?

```console
student@OS:~$ cat /dev/input/mouse1
```

----

If not friend, why friend shaped?
![snow leopard](file-descriptors/media/snow-leopard.png)

----

### Software Stacks All Over Again

![Device Software Stack](file-descriptors/media/device-software-stack.svg)

----

- The **file abstraction** and **file interface** are only the middle part
- The overlay is a **communication protocol** describing how to **encode**/**decode** data
- The underlay is the **driver interface** communicating with the device through control codes

----

### `ioctl()` - I/O Control

- General purpose interface that uses **control codes** to communicate directly with the device driver
- Can be used for any I/O operation the driver supports
- **Not intuitive**: requires knowledge of how the device communicates
- **Not portable**: arguments depend on the OS version and the device

----

### `ioctl()` demo

- `demo/devices/hwaddr-ioctl.c`
- `demo/devices/filesystem-size.c`

---

### Virtual Devices

- Not all device files in `/dev` have a corresponding physical device
  - e.g.: `/dev/zero`, `/dev/null`
- We can fully cover their behaviour using software

----

### `read()` from `/dev/null`

```c
static ssize_t read_null(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    /* Always report EOF. */
    return 0;
}
```

----

### `write()` to `/dev/null`

```c
static ssize_t write_null(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
    /* Discard the data, but report that everything was written. */
    return count;
}
```

----

### `read()` from `/dev/zero` (simplified)

```c
static ssize_t read_zero(struct file *file, char __user *buf,
                         size_t count, loff_t *ppos)
{
    size_t cleared = 0;

    while (count) {
        /* Zero the user buffer at most one page at a time. */
        size_t chunk = min_t(size_t, count, PAGE_SIZE);
        /* clear_user() returns the number of bytes it could NOT zero. */
        size_t left = clear_user(buf + cleared, chunk);

        if (left) {
            cleared += chunk - left;
            if (!cleared)
                return -EFAULT;
            break;
        }
        cleared += chunk;
        count -= chunk;
    }

    return cleared;
}
```

----

### `write()` to `/dev/zero`

```c
static ssize_t write_zero(struct file *file, const char __user *buf,
                          size_t count, loff_t *ppos)
{
    /* Same behaviour as /dev/null: the data is discarded. */
    return count;
}
```

---

## Inter-Process Communication

---

### Inter-Process Communication

- A communication channel involves a process and a data endpoint
- We can use persistent data and I/O 
devices as endpoints with the **file interface**
- Another possible endpoint is a **process**
  - [Law of the hammer](https://en.wikipedia.org/wiki/Law_of_the_instrument)
  - Socket interface

----

### IPC - File Interface

![IPC through File Interface](ipc/media/IPC-file-interface.svg)

- Writing to disk and reading from disk is tedious and slow

----

### IPC - Pipe

![IPC through pipe](ipc/media/IPC-pipe.svg)

- Same idea as above

----

- **Proc1** writes at one end
- **Proc2** reads from the other end
- Why is it better?
  - The **pipe** is stored in Kernel Space, we no longer use the disk
- Only works for related processes ([fork](https://man7.org/linux/man-pages/man2/fork.2.html))

----

### Pipe - Walkthrough

![Pipe walkthrough - 1](ipc/media/pipe-walkthrough-1.svg)

----

### Pipe - Walkthrough

![Pipe walkthrough - 2](ipc/media/pipe-walkthrough-2.svg)

----

### Pipe - Walkthrough

![Pipe walkthrough - 3](ipc/media/pipe-walkthrough-3.svg)

----

### Pipe - Walkthrough

![Pipe walkthrough - 4](ipc/media/pipe-walkthrough-4.svg)

----

### Pipe - Walkthrough

![Pipe walkthrough - 5](ipc/media/pipe-walkthrough-5.svg)

----

### `pipe()` demo

- `demo/IPC/pipe.c`

---

### IPC - Named Pipe (FIFO)

- Bypasses the limitations of unnamed pipes:
  - Can be used by more than 2 processes
  - Can be used by unrelated processes
- Stored on disk as a **special file**

```console
student@os:~$ ls -l my_fifo
prw-r--r-- 1 student student 0 nov 22 18:38 my_fifo
```

----

### `mkfifo()` demo

- `demo/IPC/fifo.c`

---

### IPC - Socket Interface

![Socket Interface](ipc/media/socket-interface.svg)

----

### Socket

- The endpoint in the inter-process communication
- Uniquely identifies the process in the communication
- Supports multiple transmission types
  - e.g.: `stream`, `datagram`

----

### Socket Interface - `socket()`

![socket()](ipc/media/socket-interface-socket.svg)

----

- Creates a socket with given **domain** and **type**
- Returns a **file descriptor**
- Compatible with `read()`/`write()` operations
- Does not support `lseek()`
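----

### Socket as File Descriptor sketch

A minimal sketch of the two claims above: a socket works with plain `read()`/`write()`, and seeking on it fails. To stay self-contained it uses `socketpair()` as a shortcut for getting two already-connected `AF_UNIX` stream sockets, which is an assumption of this example and not part of the workflow in these slides. The function name `socket_roundtrip` is hypothetical.

```c
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Sends msg across a connected pair of AF_UNIX stream sockets using plain
 * write()/read(), then tries to lseek() on one of them.
 * Returns the number of bytes received, or -1 on error.
 * *seek_errno receives the errno of the failed lseek() (0 if it succeeded).
 */
ssize_t socket_roundtrip(const char *msg, char *out, size_t out_size,
                         int *seek_errno)
{
    int sv[2];
    ssize_t received = -1;
    size_t len = strlen(msg);

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* The socket descriptors are used exactly like file descriptors. */
    if (write(sv[0], msg, len) == (ssize_t)len)
        received = read(sv[1], out, out_size);

    /* Sockets have no file offset, so seeking is rejected. */
    errno = 0;
    *seek_errno = (lseek(sv[0], 0, SEEK_SET) == (off_t)-1) ? errno : 0;

    close(sv[0]);
    close(sv[1]);
    return received;
}
```

On Linux the rejected `lseek()` reports `ESPIPE`, the same error a pipe or FIFO would give.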
----

### Socket Attributes

- **Domain**
  - `AF_UNIX` for communication on the **same host**
  - `AF_INET` for communication over a network (IPv4)
- **Type**
  - **Stream** establishes a reliable connection and ensures all data is transmitted, in order
  - **Datagram** sends independent messages with less overhead, but without guaranteeing delivery or ordering

----

### Stream and Datagram

![Stream and Datagram simplified](ipc/media/stream-datagram-simplified.png)

---

### Client-Server

- Socket communication uses the **client-server** model
- **Server**
  - **Bound** to an **address** and **accepts** connections
  - Answers queries from clients
- **Client**
  - **Connects** to a server using its **address**
  - Sends queries to the server

---

### Server Workflow

----

#### Server - `socket()`

![socket()](ipc/media/socket-interface-socket.svg)

- Create a socket

----

#### Server - `bind()`

![bind()](ipc/media/socket-interface-bind.svg)

- Assign an address to the socket

----

#### Server - `listen()`

![listen()](ipc/media/socket-interface-listen.svg)

- Mark the socket **passive** - it will be used to accept connections

----

#### Server - `accept()`

![accept()](ipc/media/socket-interface-accept.svg)

- Wait for a connection
- Accept the connection and create a socket for it

---

### Client Workflow

----

#### Client - `socket()`

![socket()](ipc/media/socket-interface-socket.svg)

- Create a socket

----

#### Client - `connect()`

![TCP socket](ipc/media/socket-interface-connect.svg)

- Establish a connection with the server

---

### `send()`/`recv()`

![Socket `send()/recv()`](ipc/media/socket-interface-send-recv.svg)

- Behave like `write()`/`read()`, with an extra `flags` argument

----

#### `send()`

- Uses the socket to send bytes
- Returns the number of bytes sent
  - **-1** - `send()` failed
  - **< `num`** - partial `send()`

----

#### `recv()`

- Uses the socket to receive bytes
- Returns the number of bytes received
  - **-1** - `recv()` failed
  - **0** - connection closed by the peer (similar to EOF)
  - **< `num`** - partial `recv()`

---

### Unix Socket

- Used in communication on the **same host**
- Uses a **special file** on disk as the 
address
  - Server is `bound` to the file
  - Client `connects` to the server using the file

```console
student@os:~/.../demo/IPC$ ls -l
srwxrwxr-x 1 student student 0 dec 2 16:03 unix_socket
```

----

### Unix Socket demo

- `demo/IPC/unix_socket_server.c`
- `demo/IPC/unix_socket_client.c`

---

### IP Sockets

- Used in communication between **different hosts**
- Each host is identified by an [**IP address**](https://ocw.cs.pub.ro/courses/uso/laboratoare/laborator-05)
- For a host to support multiple applications that require a network connection we use **ports**
- A port is an integer between **0** and **65535**
- An application is identified by an IP address and a port
  - e.g.: `217.182.27.243:25565`

----

### Stream Socket demo

- `demo/IPC/stream_socket_server.c`
- `demo/IPC/stream_socket_client.c`

----

### Datagram Socket demo

- `demo/IPC/datagram_socket_server.c`
- `demo/IPC/datagram_socket_client.c`

---

### Socket Summary

![Socket Summary](ipc/media/socket-summary-generated.gif)

---

## IO Buffering

---

## `libc` Buffering

----

### `stdout` vs `stderr`

- `demo/optimizations/stdout-vs-stderr.c`

----

### Investigate

```console
student@os:~/.../demo/optimizations$ ./stdout-vs-stderr
Join the dark side!
Hello from the other side!
```

- Not what we expected

```console
student@os:~/.../demo/optimizations$ strace --trace=write ./stdout-vs-stderr
```

----

### Observations

- Each print to `stderr` results in a `write()`
- Prints to `stdout` result in one single `write()`

----

### Behind The Scenes

- Printing to `stdout` is **buffered** to avoid a system call (and its context switches) for every print
- The same is not true for `stderr` as we want to see errors as they occur

----

### Does It Work?

- Attempt to write **10000 bytes** to a file, **1 byte** at a time
  - `demo/optimizations/fwrite-10000.c`
  - `demo/optimizations/write-10000.c`

----

### Can We Do More?
---

## Kernel Buffering

----

### Context

- `libc` buffering reduces the number of context switches
- But **synchronizing** the disk after every `write()` is a major bottleneck

----

### Kernel Buffer

- Same idea as `libc` buffering
- Each `write()` fills a buffer in kernel space
- Transfer data to disk when
  - Buffer is full
  - `fsync()` is called
  - Enough time has passed since the data was written (`dirty_expire_centisecs`, in hundredths of a second)

```console
student@os:~$ cat /proc/sys/vm/dirty_expire_centisecs
3000
```

----

### Overview

![IO Buffering Overview](optimizations/media/io-buffering-overview.svg)

----

### Does It Work?

- Write **10000 bytes** to disk, **1 byte** at a time - with buffering
  - `demo/optimizations/write-10000.c`
- Write **1000 bytes** to disk, **1 byte** at a time - without buffering
  - `demo/optimizations/write-1000-unbuf.c`

---

## IO Buffering Drawbacks

- Transferring data between two kernel buffers requires a round trip through user space
- Operations might **block**
  - Read when buffer is empty
  - Write when buffer is full

---

## Kernel Buffers Transfer

![Repeated Copy](optimizations/media/repeated-copy.svg)

----

### Obvious Problems

- We have an intermediary buffer
  - Data from **read buffer** is copied to **application buffer**
  - Data from **application buffer** is copied to **socket buffer**
- We perform two system calls (`read()` and `write()`), each with its own context switches

----

### `zero-copy`

![Zero-copy](optimizations/media/zero-copy.svg)

----

- `sendfile()` instructs the kernel to copy data from one of its buffers to another
- We perform a single system call
- Data is copied a single time

---

## Blocking IO

- Reading from an empty buffer
  - The kernel buffer is filled with information from the device
  - The library buffer is filled from the kernel buffer
  - `read()` operation resumes
- What happens if the device has no information to share?
  - `read()` **blocks**

----

### Non-Blocking IO

- `O_NONBLOCK` makes operations return immediately (with `EAGAIN`) rather than block
  - `SOCK_NONBLOCK` for sockets
- Allows handling of input on multiple file descriptors
- Does not scale with the number of file descriptors
- The thread is busy waiting instead of entering WAITING state
- Ugly

---

## `epoll` Interface

- Linux interface, non-portable
- Kernel keeps an internal structure to monitor file descriptors
- The thread enters WAITING state until an event occurs on a monitored file descriptor (e.g. a new connection)
- User updates the monitored set through the exposed interface
  - `epoll_create()`
  - `epoll_ctl()`
  - `epoll_wait()`

----

### `epoll_create()`

![epoll_create()](optimizations/media/epoll-create.svg)

---

### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-1.svg)

----

#### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-2.svg)

----

#### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-3.svg)

----

#### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-4.svg)

----

#### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-5.svg)

----

#### `epoll_ctl()`

![epoll_ctl()](optimizations/media/epoll-ctl-6.svg)

---

### `epoll_wait()`

![epoll_wait()](optimizations/media/epoll-wait.svg)

---

## Asynchronous IO

---

## Client-Server 2.0

- Real-life applications use the **Client-Server** model
- But they take scalability into consideration
- Let us see what it takes to go from 1 client to **8**
- Use `demo/optimizations/client.py` for the client implementation
- Use `demo/optimizations/client_bench.sh` to compare implementations

----

### Trivial Server

- Run server in a loop
- `demo/optimizations/server.py`

----

### Good Old Processes

- Create a process to handle each new client
- `demo/optimizations/mp_server.py`

----

### Good Old Lightweight Processes

- Create a thread to handle each new client
- `demo/optimizations/mt_server.py`

---

### Asynchronous IO

- `demo/optimizations/async_server.py`

----

- The Python asynchronous interface abstracts a lot of the 
functionality
- The idea behind it:
  - The server socket is added to an **`epoll` instance**
  - The specified **handler** is called for each event (new connection)
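
----

### `epoll` Event Loop sketch

The two steps above can be sketched directly on top of the `epoll` calls, in C instead of Python. This is only an illustration under stated assumptions: a pipe stands in for the server socket so the example needs no network setup, a single round of `epoll_wait()` replaces the real event loop, and the names `on_readable` and `epoll_dispatch_once` are hypothetical.

```c
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical handler: consumes what is readable, returns the byte count. */
static ssize_t on_readable(int fd)
{
    char buf[64];

    return read(fd, buf, sizeof(buf));
}

/*
 * Registers the read end of a pipe with an epoll instance, makes it
 * readable by writing msg, then waits once and calls the handler for
 * every reported event.
 * Returns the number of bytes handled, or -1 on error.
 */
ssize_t epoll_dispatch_once(const char *msg)
{
    int fds[2];
    int epfd;
    ssize_t handled = -1;
    struct epoll_event ev;
    struct epoll_event ready[4];

    if (pipe(fds) < 0)
        return -1;

    epfd = epoll_create1(0);
    memset(&ev, 0, sizeof(ev));
    ev.events = EPOLLIN;        /* notify when data can be read */
    ev.data.fd = fds[0];

    if (epfd >= 0 && epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], msg, strlen(msg)) == (ssize_t)strlen(msg)) {
        /* The thread sleeps here (for at most 1 s) until an event occurs. */
        int n = epoll_wait(epfd, ready, 4, 1000);

        handled = 0;
        for (int i = 0; i < n; i++)
            if (ready[i].events & EPOLLIN)
                handled += on_readable(ready[i].data.fd);
    }

    if (epfd >= 0)
        close(epfd);
    close(fds[0]);
    close(fds[1]);
    return handled;
}
```

A real server would register the listening socket the same way and keep calling `epoll_wait()` in a loop, adding each accepted connection with `epoll_ctl()`.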