This post explores some design and implementation options for a multithreaded client in a client-server application, in the context of Linux and the POSIX API, including POSIX threads and signals. Moreover, we are interested in writing good code and creating a good engineering product, one that acquires resources and releases them when they are no longer needed. These resources include signal handlers, mutexes, semaphores, sockets, and threads.
And, even though it is simple to acquire and release these resources when they are used individually, when they are used together, the problem of graceful exit, one in which all acquired resources are properly released, becomes far from trivial. And it is such a shame that this is not a trivial problem to solve, given that this seems to be one of the most common use cases in Linux daemon/server programming.
A word on POSIX...
Part 1 - 3 threads: master, input, and output
Let's start with a simple design of our client application. We are going to collect user input from the standard input, which is connected to a terminal, and send it to the server. At the same time, we are going to collect replies from the server and send them to the standard output, which is also connected to a terminal. This is a fairly simple setup for an initial design.
Let's draw some inspiration from the TTY terminals and let's implement the application as 3 threads: (1) master thread, responsible for acquiring the resources when the application starts and releasing them before the application terminates, (2) input thread, responsible for collecting user input from the standard input and sending it to the server, and (3) output thread, responsible for collecting replies from the server and sending them to the standard output.
So, in a typical run, the master thread will create the mutexes, semaphores, and sockets, and it will spawn the input and output threads. Then, the master thread will wait for the other threads to finish. Meanwhile, the input and output threads will work on the user input and server replies. At some point, these threads will terminate and will join with the master thread. Then, the master thread proceeds to release the acquired resources and, finally, it terminates the process.
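As a rough illustration of this life cycle, here is a minimal sketch of the master thread. The names "client_ctx", "worker", and "run_master" are made up for this example, and the workers are placeholders that return immediately instead of doing any IO.

```c
#include <pthread.h>

/* Hypothetical shared state handed to both worker threads. */
struct client_ctx {
    pthread_mutex_t lock;   /* protects 'finished' */
    int finished;           /* number of workers that have returned */
};

/* Placeholder worker: a real input or output thread would loop on IO here. */
static void *worker(void *arg)
{
    struct client_ctx *ctx = arg;

    pthread_mutex_lock(&ctx->lock);
    ctx->finished++;
    pthread_mutex_unlock(&ctx->lock);
    return NULL;
}

/* Master: acquire resources, spawn, join, release.
 * Returns the number of workers that finished, or -1 on failure. */
int run_master(void)
{
    struct client_ctx ctx;
    pthread_t input, output;
    int finished = -1;

    ctx.finished = 0;
    if (pthread_mutex_init(&ctx.lock, NULL) != 0)
        return -1;

    if (pthread_create(&input, NULL, worker, &ctx) != 0)
        goto out;
    if (pthread_create(&output, NULL, worker, &ctx) != 0) {
        pthread_join(input, NULL);      /* do not leak the first thread */
        goto out;
    }

    /* Wait for both workers; this only returns if both of them terminate. */
    pthread_join(input, NULL);
    pthread_join(output, NULL);
    finished = ctx.finished;

out:
    pthread_mutex_destroy(&ctx.lock);   /* release what was acquired */
    return finished;
}
```

Calling "run_master()" from "main" and returning its status would be enough to exercise the skeleton.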
Let's see what it means for the input and output threads to terminate. Let's start with the simplest case, which is the output thread. The output thread receives replies from the server and sends the data to the standard output. Since we are using blocking IO, the output thread is blocked in a call to "recv" and, if the server closes the connection, this call will either return "0", for an orderly shutdown, or return "-1" with "errno" set to a code indicating that the connection is closed or broken. In any case, this means there are no more replies to receive and the output thread can terminate.
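The loop of the output thread could look roughly like the sketch below. The function name "pump_replies" and the buffer size are assumptions made for this example; only "recv" and "write" are real API calls.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Copy server replies from sock_fd to out_fd until the connection ends.
 * Returns 0 on an orderly shutdown, -1 on a broken connection or write error. */
int pump_replies(int sock_fd, int out_fd)
{
    char buf[4096];

    for (;;) {
        ssize_t n = recv(sock_fd, buf, sizeof buf, 0);
        if (n == 0)
            return 0;               /* the peer performed an orderly shutdown */
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: try again */
            return -1;              /* for example ECONNRESET */
        }
        /* write() may be partial, so loop until the whole chunk is out */
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += w;
        }
    }
}
```

The output thread would call something like "pump_replies(sock_fd, STDOUT_FILENO)" and terminate when it returns.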
On the other hand, the input thread is a bit more complicated because we have to deal with two possible cases. The first, the simplest, is when there is no more user data to read from the standard input. In this case, the input thread will be blocked in a call to one of the standard C library reading functions, for example, "fgets", and if this function returns "NULL" because the end of the input was reached, then there is no more user input and the thread can terminate. The other case is when the input thread tries to send the data entered by the user to the server. Here, it is possible that the server has closed the connection, and the call to "send" will fail with an error code indicating that the connection is broken. In this situation, the input thread can also terminate.
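Both cases can be sketched as follows. The names "send_all" and "pump_input" are invented for this example. The sketch also assumes that we pass "MSG_NOSIGNAL" to "send", so that a broken connection is reported as an error ("EPIPE") rather than raising "SIGPIPE"; the post itself does not say how "SIGPIPE" is handled.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Send every byte of buf, coping with partial sends. Returns 0 or -1. */
static int send_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        /* MSG_NOSIGNAL: get EPIPE on a broken connection, not SIGPIPE */
        ssize_t n = send(fd, buf, len, MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted by a signal: try again */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Forward lines from 'in' to the server.
 * Returns 0 when the input is exhausted, -1 when the connection is
 * broken or reading failed. */
int pump_input(FILE *in, int sock_fd)
{
    char line[1024];

    while (fgets(line, sizeof line, in) != NULL) {
        if (send_all(sock_fd, line, strlen(line)) != 0)
            return -1;              /* case 2: the server is gone */
    }
    return ferror(in) ? -1 : 0;     /* case 1: end of input (or read error) */
}
```

The input thread would call something like "pump_input(stdin, sock_fd)" and terminate when it returns.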
Now that we have covered the situations in which the input and output threads terminate, we have to think about how the master thread can join these threads. And this is when things get complicated...
Part 2 - master joins other threads
As I said before, the master is waiting for the input and output threads to terminate. However, we saw that any of the threads can terminate or even both threads can terminate at the same time. And the master does not know which thread will terminate first. Therefore, the master thread cannot simply call "pthread_join" on one of the threads and then on the other because it might block indefinitely.
Moreover, imagine the situation in which the server closed the connection and the output thread has terminated, but the input thread is still blocked in the call to "fgets", waiting for user input. In this situation, we clearly want the input thread to terminate as well, otherwise the user will enter input that cannot be sent to the server and will only be notified that the connection has been closed long after it happened. In order to overcome this problem, we are going to look at a solution that uses signals to wake up threads that are blocked in system calls.
Part 3 - signalling input and output threads
Let's revisit the scenario introduced in the previous section. The input thread is blocked on a call to read user input, the output thread is about to terminate due to an orderly shutdown from the server, and the master thread is waiting to join the other threads. At this moment, we need to wake up the input thread and tell it to terminate as well. The output thread has to be the one responsible for waking up the input thread because it is the only running thread. In order to achieve this, before terminating, the output thread will change some global state to indicate that the input thread is to terminate, and then send a signal, for example "SIGUSR1", to the input thread to wake it up. Then, the input thread wakes up, checks this global state, and terminates. Because both threads terminate, the master joins both threads, releases the acquired resources and terminates as well. The world is fantastic! Or is it?
This plan seems simple, but in order to get there we need to tackle some implementation details. The first one is signal handling. If a thread is blocked in a system call and receives a signal, the call will be interrupted and will fail with the error code "EINTR". Please note that this does not necessarily mean that we want the thread to terminate. It can also be the case that the call was interrupted for another reason. Therefore, at all call sites to system calls, or to functions calling system calls, we must check if the error code is "EINTR" and at the same time check the global state to determine whether the call should be repeated or whether the thread should terminate. And, by the way, it is not only cumbersome to update all call sites, it is also easy to forget to check a given call site for the error code and the global state. Finally, we also have to install the signal handler, which is rather obvious, with one caveat: it has to be installed without the "SA_RESTART" flag, otherwise many interrupted calls are restarted automatically instead of failing with "EINTR".
Another implementation detail we have to tackle is the change in the global state. We have to consider that either thread may be the one that wakes up the other, and that both may try to do so at the same time. Even though it is not a real problem to send multiple signals to the other threads, this situation should be avoided. Therefore, the change in global state should be synchronized. We also have to remember that the synchronization calls themselves can be interrupted: "sem_wait" can fail with "EINTR" and its call sites must be checked, whereas "pthread_mutex_lock" never returns "EINTR".
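One simple way to synchronize the change, sketched here under the assumption that a plain mutex-protected flag is enough, is to let only the first thread that announces the shutdown send the signal. The names "state_lock", "shutting_down", and "begin_shutdown" are made up for this example.

```c
#include <pthread.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int shutting_down = 0;   /* global state, protected by state_lock */

/* Returns 1 for the first caller only, 0 for every later caller.
 * Only the caller that gets 1 should go on to signal the other thread. */
int begin_shutdown(void)
{
    pthread_mutex_lock(&state_lock);
    int first = !shutting_down;
    shutting_down = 1;
    pthread_mutex_unlock(&state_lock);
    return first;
}
```

A terminating thread would then write something like "if (begin_shutdown()) pthread_kill(other, SIGUSR1);".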
So this looks like a perfectly reasonable plan, and you can even find it in forums throughout the Internet, with people advising this course of action. Unfortunately, it does not work! The reason is that there is always a race condition that cannot be eliminated this way. The race condition occurs between checking the global state and entering the system call that will block for IO. In other words, when we send a signal to another thread to wake it up, we can't be sure whether the thread is already blocked in the IO call or is still on its way there. If the signal arrives in between, the handler runs, nothing is interrupted, and the thread then blocks anyway. Of course, you could "sleep" and wait for the thread to block. But that's hack-and-slash, not proper engineering. So let's take a look at other options.
Part 4 - control channel
The input and output threads block on some file descriptor until there is data available to read or the file descriptor is ready to send data. We could create another file descriptor (e.g., a pipe or a socket) that would function as a control channel. This way, the input or output threads could "poll" or "select" on 2 file descriptors simultaneously, namely, the data channel and the control channel, and when "poll" or "select" returns they would check whether the control channel is the one that is ready, in which case the thread would terminate. This seems like it could work, although I have not tried to implement it. The problem is that you end up with one additional file descriptor per thread. And if you want your application to scale, you need to save on resources, especially file descriptors. Naturally, this is just a client application and it does not make much sense to think about scalability here, but if you are planning to port some of the design decisions in this post to a server, then you have to think about it.
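Since this option is untested in the post, the following is only a sketch of what the waiting step might look like with "poll". The function name "wait_data_or_stop" and its return convention are assumptions made for this example.

```c
#include <errno.h>
#include <poll.h>

/* Block until either descriptor is ready.
 * Returns 1 if the data channel is ready (data, end of file, or error),
 * 0 if the control channel is ready (time to terminate), -1 on failure. */
int wait_data_or_stop(int data_fd, int ctl_fd)
{
    struct pollfd fds[2] = {
        { .fd = data_fd, .events = POLLIN },
        { .fd = ctl_fd,  .events = POLLIN },
    };

    for (;;) {
        int rc = poll(fds, 2, -1);      /* -1: no timeout */
        if (rc < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        if (fds[1].revents != 0)
            return 0;                   /* the control channel wins */
        if (fds[0].revents != 0)
            return 1;
    }
}
```

A thread would call this before each "recv" or "read", and the thread that terminates first would write a byte to the control channel. Because the request stays readable until someone consumes it, there is no window in which it can be missed, unlike with the signal.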
Part 5 - detach and cancel
While exploring the POSIX threads API, I came across thread detaching and cancellation. These seem to go well together. Thread detaching makes a thread "garbage collectable", in the sense that it does not need to be joined by another thread in order to free up its resources; they are instead freed automatically by the system when the thread terminates. And thread cancellation means that a thread can be forcibly terminated by another thread when it is at a cancellation point, for example, blocked in an IO call.
With these features, the threads can be created as detached threads (i.e., "garbage collectable") and then, for example, when the input thread is about to terminate, it will cancel the output thread, but synchronized with a change in global state so that the two threads don't cancel each other. The master thread also has to be adapted: instead of joining the threads, it waits on a semaphore that is posted together with the synchronized change in global state.
There is still a race condition, but it may or may not be a problem depending on what you are doing. The race condition occurs because cancelling a thread is not an immediate action: thread cancellation is a promise that when a thread enters a cancellation point, such as a blocking IO system call, it will be forcibly terminated. This can be a problem if the master thread, which might be running by now, releases resources that can still be used by the thread that will be cancelled.
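The sketch below puts the pieces of this part together. It is only an illustration: all names are invented, the input thread blocks on a pipe that never receives data instead of reading the terminal, and the output thread pretends that the server closed the connection immediately. The peer is cancelled while the mutex is held, an assumption made here so that a detached peer cannot have exited before "pthread_cancel" is called on it.

```c
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;
static int shutting_down;       /* global state, protected by state_lock */
static sem_t done;              /* the master waits on this */
static pthread_t peers[2];      /* [0] = input thread, [1] = output thread */
static int idle_pipe[2];        /* stand-in for a descriptor with no data */

/* Called by a thread that is about to terminate. Only the first caller
 * cancels its peer and wakes up the master. */
static void finish(int self)
{
    pthread_mutex_lock(&state_lock);
    if (!shutting_down) {
        shutting_down = 1;
        pthread_cancel(peers[1 - self]);
        sem_post(&done);
    }
    pthread_mutex_unlock(&state_lock);
}

/* Stand-in input thread: blocks in read(), which is a cancellation point. */
static void *input_thread(void *arg)
{
    char c;

    (void)arg;
    while (read(idle_pipe[0], &c, 1) > 0) {
        /* a real thread would forward the data to the server */
    }
    finish(0);
    return NULL;
}

/* Stand-in output thread: behaves as if the server closed at once. */
static void *output_thread(void *arg)
{
    (void)arg;
    finish(1);
    return NULL;
}

/* Master: returns 0 once one worker finished and cancelled the other. */
int run_detached_client(void)
{
    pthread_attr_t attr;
    int rc;

    if (pipe(idle_pipe) != 0 || sem_init(&done, 0, 0) != 0)
        return -1;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    /* Hold the lock so no worker enters finish() before peers[] is filled. */
    pthread_mutex_lock(&state_lock);
    rc = pthread_create(&peers[0], &attr, input_thread, NULL);
    if (rc == 0) {
        rc = pthread_create(&peers[1], &attr, output_thread, NULL);
        if (rc != 0)
            pthread_cancel(peers[0]);   /* do not leave it blocked forever */
    }
    pthread_mutex_unlock(&state_lock);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return -1;

    while (sem_wait(&done) != 0 && errno == EINTR) {
        /* interrupted by a signal: wait again */
    }
    /* The cancelled peer may still be running at this point, so the pipe
     * and the semaphore are deliberately not released here. */
    return 0;
}
```

The comment at the end of "run_detached_client" marks the spot where the race described above bites: nothing tells the master when the cancelled thread is really gone.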