Blocking Sockets and Async

Using async in Rust can lead to bad surprises. I recently came across a particularly gnarly one that seemed interesting enough to write up. I think we are too used to the burden of separating async from blocking code falling on the programmer. Rust can and should do better, and so can operating system APIs, especially in subtle situations like the one I describe here.

Every async programmer learns early on not to call a blocking function from an async function. If you do, it is a hidden color violation, as I discuss in a previous post. By “hidden,” I mean that unlike other color violations, Rust gives you no compile-time help. You just have to use discipline. You just have to “make sure not to do it.” You just have to increase your cognitive load. It is a rule that the computer is no help with – which means that you’ll definitely mess it up at some point, possibly at many points.

Unfortunately, it’s also a gnarly problem to debug. The actual blocking function call will quite possibly work just fine. It’ll return when the resource is ready, and block until then – probably exactly what you wanted. It’s the rest of the system that falls apart – other tasks on the same thread starve, tasks that are depending on them for progress also starve, but meanwhile other tasks might proceed without a problem. Worse, there’s no guarantee that the bug will manifest every time, so the bug isn’t readily reproducible.
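To make the starvation concrete, here is a minimal sketch using only the standard library. The toy round-robin executor and the names `run_round_robin`, `YieldOnce`, and `polite_task_latency` are all mine, invented for illustration; a real runtime like tokio is far more sophisticated, but the failure mode has the same shape. One task blocks its thread with `thread::sleep`, and a second task that never waits on anything still finishes late.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::{Duration, Instant};

/// A waker that does nothing; enough for a busy-polling toy executor.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// A future that returns `Pending` once, then `Ready`: one cooperative yield.
struct YieldOnce(bool);
impl Future for YieldOnce {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Toy single-threaded executor (an assumption for this sketch, not how
/// tokio works): poll every task in turn until all of them have finished.
fn run_round_robin(mut tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    while !tasks.is_empty() {
        tasks.retain_mut(|task| task.as_mut().poll(&mut cx).is_pending());
    }
}

/// Returns how long the polite task took to finish while sharing a thread
/// with a task that blocks for `block_for`.
fn polite_task_latency(block_for: Duration) -> Duration {
    let start = Instant::now();
    let latency = Rc::new(Cell::new(Duration::ZERO));
    let latency_out = Rc::clone(&latency);

    // Yields once and then finishes; it never waits on any resource.
    let polite: Task = Box::pin(async move {
        YieldOnce(false).await;
        latency_out.set(start.elapsed());
    });
    // The hidden color violation: a blocking call inside an async block.
    let blocking: Task = Box::pin(async move {
        thread::sleep(block_for);
    });

    run_round_robin(vec![polite, blocking]);
    latency.get()
}

fn main() {
    let latency = polite_task_latency(Duration::from_millis(50));
    println!("polite task finished after {latency:?}");
}
```

The blocking task itself behaves exactly as intended. The cost lands on the polite task, which cannot be polled again until the sleeping one gives the thread back.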

You might think this is an easy problem to address, either through improvements in the programming language or better programming discipline.

At a programming language level, you could imagine Rust having some sort of generalization of unsafe, or maybe an effects system. Functions that block would have blocking as part of their signature. Calling a blocking function from an async function would then be an error, with a way out for functions like spawn_blocking.
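You can approximate the idea in today's Rust with a capability token, though only as a convention. The sketch below is hypothetical throughout: `MayBlock`, `run_blocking`, and `slow_lookup` are names I made up, and nothing forces existing blocking functions (including everything in std) to ask for the token. It only shows what "blocking as part of the signature" might feel like, with `run_blocking` playing the role of the `spawn_blocking` escape hatch.

```rust
use std::thread;
use std::time::Duration;

mod blocking {
    use std::thread::{self, JoinHandle};

    /// Hypothetical capability token. The private field means code outside
    /// this module cannot construct one.
    pub struct MayBlock {
        _private: (),
    }

    /// The escape hatch: run `f` on a dedicated thread, where blocking is
    /// fine, and lend it the token. A real runtime would hand back something
    /// awaitable rather than a `JoinHandle`.
    pub fn run_blocking<T, F>(f: F) -> JoinHandle<T>
    where
        F: FnOnce(&MayBlock) -> T + Send + 'static,
        T: Send + 'static,
    {
        thread::spawn(move || f(&MayBlock { _private: () }))
    }
}

use blocking::{run_blocking, MayBlock};

/// A blocking function advertises that fact by demanding the token.
fn slow_lookup(_proof: &MayBlock, key: u32) -> u32 {
    thread::sleep(Duration::from_millis(10)); // stands in for blocking I/O
    key * 2
}

// An async fn has no `MayBlock` in scope, so this would not compile:
//
// async fn handler(key: u32) -> u32 {
//     slow_lookup(/* no token available */, key)
// }

fn main() {
    let handle = run_blocking(|proof| slow_lookup(proof, 21));
    println!("lookup returned {}", handle.join().expect("blocking thread panicked"));
}
```

This is a long way from a real effects system, because the check is opt-in per function, but it moves the rule from the programmer's head into a type.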

Unfortunately, Rust doesn’t have this feature, so we have to rely on programmer discipline. The discipline seems easy enough: If you’re in an async function, and you call a function that’s going to take some time or do I/O, make sure you’re doing an async call, which in most cases means calling an async function and using the await keyword on the result.

Unfortunately, this doesn’t work 100% of the time, because the operating system isn’t on board. There are system calls that block sometimes, based on dynamic configuration. Does the recv system call block? Well, that depends on whether the socket is a blocking socket, or a non-blocking socket. Fundamentally, recv is run-time polymorphic on socket type, in a way that makes it a different color based on run-time information.
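Here is a small sketch of that run-time polymorphism using only std. The helper names `timed_read` and `demo` are mine. The same `read` call on the same socket either returns immediately or parks the thread, depending on a flag flipped somewhere else entirely. I set a read timeout for the blocking case only so the demo terminates; without it the second read would wait forever.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Calls `read` once and reports the result plus how long the call took.
fn timed_read(stream: &mut TcpStream) -> (io::Result<usize>, Duration) {
    let mut buf = [0u8; 16];
    let started = Instant::now();
    let result = stream.read(&mut buf);
    (result, started.elapsed())
}

/// Reads from an idle loopback socket twice: once in non-blocking mode,
/// once in blocking mode. Returns the error kind from the first read and
/// the time spent waiting in the second.
fn demo() -> io::Result<(ErrorKind, Duration)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let mut client = TcpStream::connect(listener.local_addr()?)?;
    let (_server, _) = listener.accept()?; // keep the peer open; it sends nothing

    // Non-blocking mode: the call returns right away with an error.
    client.set_nonblocking(true)?;
    let (result, _) = timed_read(&mut client);
    let kind = result.expect_err("no data was sent").kind();

    // Blocking mode: the identical call now parks the thread.
    client.set_nonblocking(false)?;
    client.set_read_timeout(Some(Duration::from_millis(100)))?;
    let (_, waited) = timed_read(&mut client);

    Ok((kind, waited))
}

fn main() -> io::Result<()> {
    let (kind, waited) = demo()?;
    println!("non-blocking read failed with {kind:?}");
    println!("blocking read held the thread for {waited:?}");
    Ok(())
}
```

Nothing in the type of `client` changes between the two reads, which is exactly the problem.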

This is bad design: BSD should have split recv into two system calls, recv and recv_nonblock. recv could error if given a non-blocking socket, and recv_nonblock could error if given a blocking one. Linux at least has a flag MSG_DONTWAIT that makes an individual recv call unconditionally non-blocking, but it’s non-standard. It’s not supported on macOS and tokio/mio understandably doesn’t use it.

Most of the time, this isn’t an issue. Sockets controlled through tokio or other async runtimes are always configured with the operating system to be non-blocking, as an invariant on those socket types. Sockets controlled through std or other libraries will be blocking, and will be contained in completely different Rust types. The Rust type system is used to keep track of the distinction even if the operating system won’t.

But this becomes an issue where these boundaries are broken, namely in conversion functions between them. These methods then have whether or not a socket is blocking as part of their contract. For example, the documentation for TcpStream::from_std says:

This function is intended to be used to wrap a TCP stream from the standard library in the Tokio equivalent. The conversion assumes nothing about the underlying stream; it is left up to the user to set it in non-blocking mode.

Thus, as a precondition of calling the from_std function, you must pass a “non-blocking” socket. If you instead did not set the socket as non-blocking – perhaps because you were making it with some extra options you needed, but assumed that tokio would handle the non-blocking part – bad things happen.

If blocking were considered a safety issue, this function would be marked unsafe. But it’s not, and so it’s simply an unchecked precondition – and we’re not used to those in Rust. Most safe functions check their preconditions, either returning a special value (like an Err) or panicking if something is wrong. The ones that don’t are typically marked unsafe. Unchecked preconditions still exist – they cause rogue behavior but not behavior deemed “unsafe” under Rust’s definition – but they are rare, and therefore surprising to a Rust programmer.

Why is it not a checked precondition? That’s easy to answer: Checking it would take an extra system call, as would having from_std unconditionally set the socket to non-blocking itself. System calls are slow, and that would be an unacceptable performance penalty for many applications.
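If you are willing to pay for that system call yourself, one option is a small wrapper on the std side that establishes the invariant before the socket goes anywhere near the runtime. This is only a sketch: `NonBlockingTcp` and `idle_read_error` are hypothetical names of mine, not part of std or tokio. The stream you get from `into_inner` is what you would then hand to tokio's `TcpStream::from_std`.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};

/// Hypothetical newtype: it can only be built through `new`, so holding one
/// means the socket was switched to non-blocking mode at that point.
pub struct NonBlockingTcp(TcpStream);

impl NonBlockingTcp {
    /// Pays the extra system call up front instead of trusting the caller.
    pub fn new(stream: TcpStream) -> io::Result<Self> {
        stream.set_nonblocking(true)?;
        Ok(Self(stream))
    }

    /// Gives the std stream back, ready to pass to a runtime's conversion
    /// function. After this the invariant is no longer protected by a type.
    pub fn into_inner(self) -> TcpStream {
        self.0
    }
}

/// Wraps one end of an idle loopback connection and reports what a read on
/// it fails with.
fn idle_read_error() -> io::Result<ErrorKind> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let (_peer, _) = listener.accept()?; // keep the peer open; it sends nothing

    let mut stream = NonBlockingTcp::new(client)?.into_inner();
    let mut buf = [0u8; 16];
    Ok(stream.read(&mut buf).expect_err("peer sent nothing").kind())
}

fn main() -> io::Result<()> {
    println!("read on wrapped socket failed with {:?}", idle_read_error()?);
    Ok(())
}
```

The wrapper does not make the precondition checked in general, since anyone holding the inner stream can still flip it back to blocking. It just gives you one audited place where the flag gets set.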

This leads to a disappointing end result, though. It’s not enough to simply make sure you don’t call I/O methods unless they come with an async version. To be disciplined enough to be an async Rust programmer, you also have to watch out for these extra unchecked preconditions.

Otherwise, you get a hidden color bug that’s even harder to track down because the blocking functions you’re calling don’t look blocking. tokio calls recv, thinking it’s not blocking, but it is. You expect tokio to be correct, but because of this broken invariant, it isn’t. These sorts of issues can be very hard and time-consuming to debug.

