Blocking Sockets and Async
Mixing blocking calls with async in Rust can lead to bad surprises. I recently came across
a particularly gnarly one, and I thought it was interesting enough to
share a little discussion. I think that we are too used to the burden
of separating async from blocking being on the programmer, and that
Rust can and should do better, and so can operating system APIs,
especially in subtle situations like the one I describe here.
Every async programmer learns early on not to call a blocking function
from an async function. If you do, it is a hidden color violation,
as I discuss in a previous post. By “hidden,” I
mean that unlike other color violations, Rust gives you no compile-time
help. You just have to use discipline. You just have to “make sure not
to do it.” You just have to increase your cognitive load. It is a rule
that the computer is no help with – which means that you’ll definitely
mess it up at some point, possibly at many points.
Unfortunately, it’s also a gnarly problem to debug. The actual blocking function call will quite possibly work just fine. It’ll return when the resource is ready, and block until then – probably exactly what you wanted. It’s the rest of the system that falls apart – other tasks on the same thread starve, tasks that are depending on them for progress also starve, but meanwhile other tasks might proceed without a problem. Worse, there’s no guarantee that the bug will manifest every time, so the bug isn’t readily reproducible.
You might think this is an easy problem to address, either through improvements in the programming language or better programming discipline.
At a programming language level, you could imagine Rust having
some sort of generalization of unsafe, or maybe an effects system.
Functions that block would have blocking as part of their
signature. Calling a blocking function from an async
function would then be an error, with an escape hatch for functions
whose whole job is to bridge the two worlds.
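To make the idea concrete, here is a rough library-level approximation, not the language feature itself. The names MayBlock, assert_dedicated_thread, slow_lookup and lookup_on_worker are all hypothetical, invented for this sketch. A blocking function demands a token in its signature, and the token can only be obtained through one explicit escape hatch.

```rust
use std::thread;
use std::time::Duration;

/// Hypothetical zero-sized token: holding one is "proof" that blocking is
/// acceptable here. The private field keeps code outside this module from
/// constructing it directly.
pub struct MayBlock(());

impl MayBlock {
    /// Hypothetical escape hatch: the caller promises this thread may block.
    /// Nothing verifies that promise, which is exactly the gap a real
    /// language feature would close.
    pub fn assert_dedicated_thread() -> Self {
        MayBlock(())
    }
}

/// A blocking function advertises that fact by demanding the token.
pub fn slow_lookup(_proof: &MayBlock, key: &str) -> String {
    thread::sleep(Duration::from_millis(10)); // stands in for blocking I/O
    format!("value-for-{key}")
}

/// Runs `slow_lookup` on its own thread, where blocking starves nobody.
pub fn lookup_on_worker(key: &'static str) -> String {
    thread::spawn(move || {
        let proof = MayBlock::assert_dedicated_thread();
        slow_lookup(&proof, key)
    })
    .join()
    .expect("worker thread panicked")
}
```

An async function has no MayBlock lying around, so calling slow_lookup directly from one would at least require visibly invoking the escape hatch. That is still only a convention, since the compiler does not stop anyone from creating the token in the wrong place.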
Unfortunately, Rust doesn’t have this feature, so we have to rely on
programmer discipline. The discipline seems easy enough: If you’re in
an async function, and you call a function that’s going to take some
time or do I/O, make sure you’re doing an async call, which in most
cases means using the version your async runtime provides.
Unfortunately, this doesn’t work 100% of the time, because the operating
system isn’t on board. There are system calls that block sometimes, based
on dynamic configuration. Does the recv system call block? Well, that
depends on whether the socket is a blocking socket or a non-blocking
one. recv is run-time polymorphic on socket type, in a way that gives
it a different color based on run-time information.
This is bad design: BSD should have split recv into two system calls,
so that recv could error if given a non-blocking socket and
recv_nonblock could error if given a blocking one.
Linux at least has a flag
MSG_DONTWAIT that makes an individual
recv call unconditionally non-blocking, but it’s non-standard. It’s
not supported on macOS and
mio understandably doesn’t use it.
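The run-time polymorphism described above can be observed with nothing but the standard library. This is a minimal sketch, assuming a localhost connection whose peer never sends anything; idle_pair and time_reads are hypothetical names of mine. The same read call (which on Unix, as far as I know, ends up in recv) is issued twice on the same socket, and only the socket's mode changes in between.

```rust
use std::io::{self, ErrorKind, Read};
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Connects a client to a throwaway listener on localhost and returns both
/// ends. Neither side ever writes, so a read has nothing to return.
fn idle_pair() -> io::Result<(TcpStream, TcpStream)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let (server, _addr) = listener.accept()?;
    Ok((client, server))
}

/// Issues the same `read` in both modes. Returns the error kind seen in
/// non-blocking mode and how long the call took in blocking mode.
pub fn time_reads() -> io::Result<(ErrorKind, Duration)> {
    // `_server` is kept alive so the connection stays open.
    let (mut client, _server) = idle_pair()?;
    let mut buf = [0u8; 16];

    // Non-blocking mode: the call returns immediately with an error.
    client.set_nonblocking(true)?;
    let nonblocking_kind = match client.read(&mut buf) {
        Err(e) => e.kind(),
        Ok(_) => return Err(io::Error::new(ErrorKind::Other, "peer unexpectedly sent data")),
    };

    // Blocking mode: the identical call now parks the thread. The timeout
    // exists only so this example terminates.
    client.set_nonblocking(false)?;
    client.set_read_timeout(Some(Duration::from_millis(100)))?;
    let start = Instant::now();
    let _ = client.read(&mut buf); // times out; the error kind is platform-dependent
    Ok((nonblocking_kind, start.elapsed()))
}
```

Nothing in the type of client changes between the two calls, which is the whole problem: the color of the call lives in kernel state that the type system cannot see.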
Most of the time, this isn’t an issue. Sockets controlled through tokio
or other async runtimes are always configured with the operating system to
be non-blocking, as an invariant on those socket types. Sockets controlled
through std or other libraries will be blocking, and will be contained
in completely different Rust types. The Rust type system is used to keep
track of the distinction even if the operating system won’t.
But this becomes an issue where these boundaries are
broken, namely in conversion functions between them. These
methods then have whether or not a socket is blocking
as part of their contract. For example, the documentation for tokio’s TcpStream::from_std says:
This function is intended to be used to wrap a TCP stream from the standard library in the Tokio equivalent. The conversion assumes nothing about the underlying stream; it is left up to the user to set it in non-blocking mode.
Thus, as a precondition of calling the
from_std function, you
must pass a “non-blocking” socket. If you instead did not set the
socket as non-blocking – perhaps because you were making it with some
extra options you needed, but assumed that
tokio would handle
the non-blocking part – bad things happen.
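One way to avoid the trap is to establish the precondition yourself, right before the conversion. A minimal sketch, using only the standard library; prepare_for_async_runtime is a hypothetical helper name, not something tokio provides.

```rust
use std::io;
use std::net::TcpStream;

/// Hypothetical helper: puts a std socket into non-blocking mode so that it
/// satisfies the documented precondition of a conversion like `from_std`.
/// This costs one extra system call per socket.
pub fn prepare_for_async_runtime(stream: TcpStream) -> io::Result<TcpStream> {
    stream.set_nonblocking(true)?;
    Ok(stream)
}
```

The returned stream is what you would then hand to from_std. Routing every conversion in a codebase through one helper like this turns the unchecked precondition into something you only have to get right once.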
If blocking were considered a safety issue, from_std would be marked
unsafe. But it’s not, and so it’s simply an unchecked
precondition – and we’re not used to those in Rust. Most
safe functions check their preconditions, either returning
a special value (like an
Err) or panicking if something is wrong.
The ones that don’t are typically marked
unsafe. Unchecked preconditions
still exist – they cause rogue behavior but not behavior deemed
“unsafe” under Rust’s definition – but they are rare, and
therefore surprising to a Rust programmer.
Why is it not a checked precondition? That’s easy to answer: checking it would take an extra system call, as would unconditionally setting the socket to non-blocking inside the conversion function itself. System calls are slow, and that would be an unacceptable performance penalty for many applications.
This leads to a disappointing end result, though. It’s not enough
to simply make sure you don’t call I/O methods unless they come
in an async version. To be disciplined enough to be an async
Rust programmer, you also have to watch out for these extra unchecked
preconditions.
Otherwise, you get a hidden color bug that’s even harder to track down
because the blocking functions you’re calling don’t look blocking. You
call recv, thinking it’s not blocking, but it is. You expect
tokio to be correct, but because of this broken invariant, it
isn’t. These sorts of issues can be very hard and time-consuming to debug.