Blocking Sockets and Async
Using async in Rust can lead to bad surprises. I recently came across a particularly gnarly one, and I thought it was interesting enough to share a little discussion. I think we are too used to the burden of separating async from blocking code falling on the programmer. Rust can and should do better, and so can operating system APIs, especially in subtle situations like the one I describe here.
Every async programmer learns early on not to call a blocking function from an async function. If you do, it is a hidden color violation, as I discuss in a previous post. By “hidden,” I mean that unlike other color violations, Rust gives you no compile-time help. You just have to use discipline. You just have to “make sure not to do it.” You just have to increase your cognitive load. It is a rule that the computer is no help with – which means that you’ll definitely mess it up at some point, possibly at many points.
Unfortunately, it’s also a gnarly problem to debug. The actual blocking function call will quite possibly work just fine. It’ll return when the resource is ready, and block until then – probably exactly what you wanted. It’s the rest of the system that falls apart – other tasks on the same thread starve, tasks that are depending on them for progress also starve, but meanwhile other tasks might proceed without a problem. Worse, there’s no guarantee that the bug will manifest every time, so the bug isn’t readily reproducible.
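To make the starvation concrete, here is a minimal sketch using only the standard library. The toy round-robin executor, the YieldNow future, and the function longest_tick_gap are all made up for illustration; a real runtime such as tokio is far more sophisticated, but a blocked worker thread stalls its other tasks in the same way. One task ticks and yields politely, while a sibling task on the same thread makes a single blocking call.

```rust
use std::cell::Cell;
use std::collections::VecDeque;
use std::future::Future;
use std::pin::Pin;
use std::rc::Rc;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};
use std::thread;
use std::time::{Duration, Instant};

type Task = Pin<Box<dyn Future<Output = ()>>>;

/// Waker that does nothing: the toy executor below simply re-polls everything.
struct NoopWake;
impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns Pending once, then Ready: a cooperative yield point.
struct YieldNow(bool);
impl Future for YieldNow {
    type Output = ();
    fn poll(mut self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        if self.0 {
            Poll::Ready(())
        } else {
            self.0 = true;
            Poll::Pending
        }
    }
}

/// Toy single-threaded executor: polls tasks round-robin until all finish.
fn run_all(tasks: Vec<Task>) {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut queue: VecDeque<Task> = tasks.into();
    while let Some(mut task) = queue.pop_front() {
        if task.as_mut().poll(&mut cx).is_pending() {
            queue.push_back(task);
        }
    }
}

/// Longest pause seen by a well-behaved ticking task while a sibling task
/// on the same thread makes one blocking call lasting `block_for`.
fn longest_tick_gap(block_for: Duration) -> Duration {
    let max_gap = Rc::new(Cell::new(Duration::ZERO));
    let recorder = Rc::clone(&max_gap);

    let ticker: Task = Box::pin(async move {
        let mut last = Instant::now();
        for _ in 0..100 {
            YieldNow(false).await;
            let now = Instant::now();
            recorder.set(recorder.get().max(now - last));
            last = now;
        }
    });
    let offender: Task = Box::pin(async move {
        YieldNow(false).await;
        thread::sleep(block_for); // blocking call inside async code
    });

    run_all(vec![ticker, offender]);
    max_gap.get()
}

fn main() {
    let gap = longest_tick_gap(Duration::from_millis(50));
    println!("longest gap between ticks: {gap:?}");
}
```

The offending task itself completes normally. The damage shows up only in the ticker, which cannot run for the whole duration of the blocking call.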
You might think this is an easy problem to address, either through improvements in the programming language or better programming discipline.
At a programming language level, you could imagine Rust having some sort of generalization of unsafe, or maybe an effects system. Functions that block would have blocking as part of their signature. Calling a blocking function from an async function would then be an error, with a way out for functions like spawn_blocking.
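Rust has no such feature, but a rough library-level approximation can hint at what it might feel like. In the sketch below, every name (MayBlock, run_blocking, slow_add) is hypothetical. A blocking function declares itself by demanding a token, and the only way to obtain a token is through a spawn_blocking-like escape hatch. This only covers functions that opt in; nothing stops code from calling std::thread::sleep directly, which is exactly the gap a real language feature would close.

```rust
mod blocking {
    use std::thread::{self, JoinHandle};
    use std::time::Duration;

    /// Hypothetical capability token: holding a reference to one means
    /// "blocking is allowed here". The private field keeps code outside
    /// this module from constructing it.
    pub struct MayBlock {
        _private: (),
    }

    /// Hypothetical stand-in for spawn_blocking: the only place a token
    /// is minted. The work runs on its own thread.
    pub fn run_blocking<T, F>(work: F) -> JoinHandle<T>
    where
        F: FnOnce(&MayBlock) -> T + Send + 'static,
        T: Send + 'static,
    {
        thread::spawn(move || work(&MayBlock { _private: () }))
    }

    /// A blocking function advertises that fact in its signature by
    /// requiring the token.
    pub fn slow_add(_proof: &MayBlock, a: u32, b: u32) -> u32 {
        thread::sleep(Duration::from_millis(10));
        a + b
    }
}

/// Code without a token cannot call `slow_add` directly; it has to go
/// through `run_blocking`. A real runtime would hand back something
/// awaitable instead of a handle that must be joined.
fn add_on_blocking_thread() -> u32 {
    let handle = blocking::run_blocking(|proof| blocking::slow_add(proof, 2, 3));
    handle.join().expect("blocking worker panicked")
}

fn main() {
    println!("sum computed off-thread: {}", add_on_blocking_thread());
}
```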
Unfortunately, Rust doesn’t have this feature, so we have to rely on programmer discipline. The discipline seems easy enough: If you’re in an async function, and you call a function that’s going to take some time or do I/O, make sure you’re doing an async call, which in most cases means the call ends in .await.
Unfortunately, this doesn’t work 100% of the time, because the operating system isn’t on board. There are system calls that block sometimes, based on dynamic configuration. Does the recv system call block? Well, that depends on whether the socket is a blocking socket, or a non-blocking socket. Fundamentally, recv is run-time polymorphic on socket type, in a way that makes it a different color based on run-time information.
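A small std-only sketch shows the same read call changing color at run time. The helper names (connected_pair, timed_read, demo) are made up for illustration, and the 200 ms read timeout is only there so the blocking case terminates instead of hanging forever. On Unix, as far as I know, the standard library's read on a TcpStream ends up in recv.

```rust
use std::io::{self, Read};
use std::net::{TcpListener, TcpStream};
use std::time::{Duration, Instant};

/// Build a connected TCP pair over loopback.
fn connected_pair() -> io::Result<(TcpStream, TcpStream)> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    let (server, _addr) = listener.accept()?;
    Ok((client, server))
}

/// Call `read` once with no data pending; report how long the call took
/// and which error kind, if any, came back.
fn timed_read(stream: &mut TcpStream) -> (Duration, Option<io::ErrorKind>) {
    let mut buf = [0u8; 16];
    let start = Instant::now();
    let result = stream.read(&mut buf);
    (start.elapsed(), result.err().map(|e| e.kind()))
}

/// Same socket, same call, two different behaviors.
fn demo() -> io::Result<(Duration, Duration, Option<io::ErrorKind>)> {
    // Keep the peer alive so the read sees "no data yet" rather than EOF.
    let (mut client, _server) = connected_pair()?;

    // Default mode is blocking. The timeout bounds the wait for this demo.
    client.set_read_timeout(Some(Duration::from_millis(200)))?;
    let (blocking_elapsed, _) = timed_read(&mut client);

    // Flip one run-time flag: the identical call now returns at once.
    client.set_nonblocking(true)?;
    let (nonblocking_elapsed, kind) = timed_read(&mut client);

    Ok((blocking_elapsed, nonblocking_elapsed, kind))
}

fn main() -> io::Result<()> {
    let (blocking, nonblocking, kind) = demo()?;
    println!("blocking read waited {blocking:?}");
    println!("non-blocking read returned after {nonblocking:?} with {kind:?}");
    Ok(())
}
```

Nothing in the signature of read, or in the type of the stream, distinguishes the two calls.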
This is bad design: BSD should have split recv into two system calls, recv and recv_nonblock. recv could error if given a non-blocking socket, and recv_nonblock could error if given a blocking one.
Linux at least has a flag MSG_DONTWAIT that makes an individual recv call unconditionally non-blocking, but it’s non-standard. It’s not supported on macOS, and tokio/mio understandably doesn’t use it.
Most of the time, this isn’t an issue. Sockets controlled through tokio or other async runtimes are always configured with the operating system to be non-blocking, as an invariant on those socket types. Sockets controlled through std or other libraries will be blocking, and will be contained in completely different Rust types. The Rust type system is used to keep track of the distinction even if the operating system won’t.
But this becomes an issue where these boundaries are broken, namely in conversion functions between them. Whether or not a socket is blocking then becomes part of the contract of these methods. For example, the documentation for TcpStream::from_std says:
This function is intended to be used to wrap a TCP stream from the standard library in the Tokio equivalent. The conversion assumes nothing about the underlying stream; it is left up to the user to set it in non-blocking mode.
Thus, as a precondition of calling the from_std function, you must pass a “non-blocking” socket. If you did not set the socket as non-blocking – perhaps because you were making it with some extra options you needed, but assumed that tokio would handle the non-blocking part – bad things happen.
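A minimal sketch of meeting that precondition, using only std so that it runs without extra dependencies. The helper names prepare_for_runtime and read_would_block are hypothetical, set_nodelay stands in for whatever extra options you needed, and the actual hand-off to tokio appears only as a comment.

```rust
use std::io::{self, Read};
use std::net::{TcpListener, TcpStream};

/// Hypothetical helper: apply the options you need, then establish the
/// non-blocking precondition before handing the socket to an async runtime.
fn prepare_for_runtime(stream: TcpStream) -> io::Result<TcpStream> {
    stream.set_nodelay(true)?; // stand-in for "some extra options you needed"
    stream.set_nonblocking(true)?; // the step that is easy to forget
    Ok(stream)
}

/// True if a read with no data pending comes back immediately with
/// WouldBlock, which is what a non-blocking socket does.
fn read_would_block(stream: &mut TcpStream) -> bool {
    let mut buf = [0u8; 1];
    matches!(stream.read(&mut buf), Err(e) if e.kind() == io::ErrorKind::WouldBlock)
}

fn demo() -> io::Result<bool> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let client = TcpStream::connect(listener.local_addr()?)?;
    // Keep the peer open so the read sees "no data yet" rather than EOF.
    let (_peer, _addr) = listener.accept()?;

    let mut ready = prepare_for_runtime(client)?;
    // With tokio as a dependency, the hand-off would follow here:
    // let stream = tokio::net::TcpStream::from_std(ready)?;
    Ok(read_would_block(&mut ready))
}

fn main() -> io::Result<()> {
    println!("socket is non-blocking before hand-off: {}", demo()?);
    Ok(())
}
```

Leaving out the set_nonblocking line still compiles and still converts without complaint, which is the whole problem.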
If blocking were considered a safety issue, this function would be marked unsafe. But it’s not, and so it’s simply an unchecked precondition – and we’re not used to those in Rust. Most safe functions check their preconditions, either returning a special value (like an Err) or panicking if something is wrong. The ones that don’t are typically marked unsafe. Unchecked preconditions still exist – they cause rogue behavior but not behavior deemed “unsafe” under Rust’s definition – but they are rare, and therefore surprising to a Rust programmer.
Why is it not a checked precondition? That’s easy to answer: Checking it would take an extra system call, as would unconditionally setting the socket to non-blocking inside the conversion function itself. System calls are slow, and that would be an unacceptable performance penalty for many applications.
This leads to a disappointing end result, though. It’s not enough to simply make sure you don’t call I/O methods unless they come with an async version. To be disciplined enough to be an async Rust programmer, you also have to watch out for these extra unchecked preconditions.
Otherwise, you get a hidden color bug that’s even harder to track down because the blocking functions you’re calling don’t look blocking. tokio calls recv, thinking it’s not blocking, but it is. You expect tokio to be correct, but because of this broken invariant, it isn’t. These sorts of issues can be very hard and time-consuming to debug.