I have been working on a serialization project recently that involves endianness (also known as byte order), and it caused me to explore parts of the Rust standard library that deals with endianness, and share my thoughts about how endianness should be represented in a programming language and its standard library, as I think this is also something that Rust does better than C++, and also makes for a good case study to talk about API design and polymorphism in Rust.

To start with, let’s discuss endianness a little. I assume most of my audience has some familiarity with endianness; nevertheless, I’d like to explain it from first principles. That way, we can subsequently apply the insights from the explanation to API design. That, and I want practice explaining concepts, even if they are basic. I’ll try to keep it interesting, but also feel free to skim the next section.

Big End, Little End#

I first encountered the concept of endianness when I was first learning to program using the DEBUG.EXE program on DOS. When a 16-bit value was displayed as a 16-bit value, it was just normal hexadecimal, but when it was displayed as two 8-bit bytes, something weird happened with the display.

Here’s a C++ snippet that demonstates the effect:

template<typename T>
void display_bytes(const T &val) {
    char bytes[sizeof(T)];
    memcpy(bytes, &val, sizeof(T));
    for (auto byte : bytes) {
        printf("%2x ", byte);
    }
    printf("\n");
}

int main() {
    int value = 0x12345678;
    printf("%x\n", value);
    display_bytes(value);
}

When run on any little-endian processor (the vast majority of processors), we get:

12345678
78 56 34 12

The least significant byte is first, so if you print out the individual bytes in order, you have to read it backwards – though each individual byte is still forwards.

If you were to run the same code on a big-endian processor, you would get:

12345678
12 34 56 78

At this point, little endianness as I understood it was a weird thing that Intel processors did for reasons I didn’t understand, that made me do a little extra work when reading hex dumps. At the time, this was fine: I thought that having to apply this extra arcane knowledge was cool for its own sake. But also, I thought of little endianness as the weird way to do things that required extra work, and big endianness as the more natural design. It wasn’t until much later that I got some nuance on that opinion.

See, the writing system we use for numbers is big endian. Instead of dividing a number into bytes (base 256), we divide it into digits. We consider the left of the page to come before the right of the page, and we write the most significant digit first. This is all taught explicitly in grade school:

1234 = 1*10^3 + 2*10^2 + 3*10^1 + 4*10^0

There’s a certain, very human logic to this system: the more important information comes first, then the details. Mathematically, though, we see decreasing numbers for the powers of 10: first we specify a factor for 10 to the third, then to the second, then to the first, then to the zeroth. We could instead imagine where we wrote our decimal numbers big endian, where the same number would be written 4321, and still mean “one thousand two hundred thirty-four,” where we’d count like this:

0
1
2
3
...
9
01
11
21
...
91
02
12
...
99
001

This would have the advantage that the first digit, digit zero, would be multiplied by 10^0, digit one by 10^1. Not what humans would normally decide to do, but it has a certain logic. And if you think about languages like Hebrew or Arabic, which are written right from left, but which write numbers the same direction we do, the least significant digit is actually reached first in the normal direction of reading: when they see “100” in the midst of the text, the zeroes are “before” the “1”. (I am told that this is not how most people think of it; that they instead just think of numbers as going the other way from other text, but it just goes to demonstrate how based on convention all of this stuff is).

So all of this is to say that, the weird effect we had before with big endian looking “normal” and little endian looking “weird” has nothing to do with the intrinsic logic of big vs little endian, but rather with the fact that we’re mixing a little endian processor with a big endian writing notation. If we instead were to print the digits of each number in increasing significance – that is, if we were to use little endian as our printing convention – we’d get:

# Little Endian Machine
87654321
87 65 43 21
# Big Endian Machine
87654321
21 43 65 87

The mismatch between writing the whole word in hex and writing the individual bytes in order, in hex is caused by a mismatch between the endianness of the system (normally little in practice) and the endianness of the writing system (normally big in practice). When the writing system isn’t a factor, little endian makes more mathematical sense, is easier to reason about in circuitry and code, and therefore has won out over big endian in every major processor architecture.

The only real exception is network byte order, which uses big endian. This is convenient for manually reading hex dumps of packets, but probably has more to do with the fact that the Internet developed when this question was much less settled. Due to the presence of network byte order, however, and the fact that the endianness of Intel and modern ARM is opposite of the endianness of most human writing systems, the concept remains with us.

When is endianness relevant?#

In writing numbers, a digit has no endianness: 8 means the same thing as a single digit number. Similarly, a byte is indivisible in a processor. Bytes are made up of bits, but outside of special instructions, the ordering of those bits is not relevant. One of them is most significant, one of them is least, but unless we’re indexing them for a special instruction, or sending them over a wire one by one, there is no way to say which such bit comes “first.”

Indeed, if we want to display a byte as a series of bits, we as the programmer get to choose the endianness, and the program runs identically on either a big endian or a little endian platform. The little endian version is a little more intuitive, as “2 to the N” is an operation that’s easy to write on computers, and in little endian the N increases as the index increases:

fn byte_as_bits_le(byte: u8) -> [u8; 8] {
    let mut res = [0u8; 8];
    for i in 0..8 {
        let mask = 1 << i;
        if byte & mask == 0 {
            res[i] = 0;
        } else {
            res[i] = 1;
        }
    }
    res
}

Nevertheless, at no point is this function relying on the endianness of the hardware, and it does the same thing on either types of hardware.

Why do I bring this up? Well, I don’t think it makes any sense to speak of the endianness of a (multi-byte) word per se. The endianness of the word only comes into play when it is stored as – and accessible as – a series of bytes.

So from that point of view, what are the operations where endianness is relevant? Given a word, what series of bytes comprises it? And then, given a series of bytes, what word is it?

In Rust terms, these are to_be_bytes (for big endian)/to_le_bytes (for little) in the one direction, and from_be_bytes/from_le_bytes in the other. These methods are all bundled together in the Rust documentation for – in this case – the primitive u32 type, along with ne which gives whatever the native endianness of the processor is.

These are the APIs I’m going to be discussing. But before discussing how they might be improved, I’m going to point out an API that I think doesn’t make as much sense: to_be. This method takes in a word, a u32, and outputs a u32, and yet claims to change the endianness of that word, which as I mentioned, does not have endianness per se, only in that it’s represented by bytes.

I know what they mean by it. On a little endian platform, it will replace 0x12345678 with 0x78563412. But what does that actually mean? In its form as a u32, as I have argued above, a number has no endianness. So what is this number 0x78563412? It is the number that, if stored in bytes in the native endianness, will store the original number in big endian.

That’s a mouthful, I know, because it’s actually a complicated concept. That is to say, it’s a hack. We want to write a number – say, 2000 – in big endian, but we don’t want to think of it as bytes, yet. We want to be able to load the whole number into a register, and when we write it, we want it to be 2000 in big endian. So we byte swap the number, and instead of storing 2000, we store 3490119680, so that if we write it using the processor’s normal mechanism for writing, it comes out to 2000 in big endian.

Basically, to_be does the equivalent of u32::from_le_bytes(input.to_be_bytes()), and using it looks like this:

let input: u32 = 2000;
// These two invocations do the same thing
let be = input.to_be();
let be2: u32 = u32::from_le_bytes(input.to_be_bytes());
println!("{} {}", be, be2); // 3490119680 for both

// The result can be written using native (little endian) byte
// order, and it will give 2000 in big endian byte order.
assert_eq!(be.to_le_bytes(), input.to_be_bytes());

This is arguably a useful hack – though I’m not fully convinced – but it is definitely a hack. I do not think the description is sufficiently rigorous. The output of to_be is not a number “in big endian,” it is a different number that resembles the big endian representation of the original number. The description is a simplification, and I think a conceptually incoherent one – which is understandable because the concept at play here is so hackish.

It appears that to_be was in Rust 1.0, and to_be_bytes was introduced later. This to me is a good sign, as to_be_bytes, I think, makes much more sense as an interface. And as to why we started out with the to_be type of interface in Rust, that makes sense as well, because in C the traditional (POSIX but not ANSI C) functions for these conversions have similar semantics, such as htonl (host to network long), where we have this conceit of storing a “big endian” or “network byte order” value in a uint32_t (C for u32). This always struck me as the wrong abstraction, but it is justified – or at least more understandable – for C as we simply can’t pass around things like char[4] (C for [u8; 4]) by value in C.

There are other technical and historical reasons why htonl and to_be and friends exist, even if conceptually messy, but in any case, since I’m talking about API design, and to_be_bytes and friends are a better match for the concepts at hand, I am now going to pretend to_be is deprecated (it is not), and move on to discussing the design of to_be_bytes and to_le_bytes.

Policies#

So the first thing I notice is that there’s six methods that deal with fundamentally one topic:

from_be_bytes
from_le_bytes
from_ne_bytes
to_be_bytes
to_le_bytes
to_ne_bytes

But really they vary in two ways, namely:

  • which operation is performed (from_X_bytes vs to_X_bytes)
  • which endianness is required (le, be, and ne)

For us humans, this is clear from the names, but to the compiler, these names do not form a pattern that it is capable of recognizing. There are simply 6 separate functions named with 6 separate combinations of characters.

Now, having separate functions for separate operations makes sense; that’s what functions are for. But for the same operation but with different endianness, it might make more sense to indicate that it is one operation with several possible endiannesses by making the endianness into a parameter.

The obvious way to do this would be via run-time parameter. A fairly literal translation of this API would be something like:

enum Endian {
    Little,
    Big,
    Native,
}

impl u32 {
    fn to_endian_bytes(self, endianness: Endian) -> [u8; 4] {
        match endianness {
            Endian::Little | Endian::Native => { ... }
            Endian::Big => { ... }
        }
    }

    fn from_endian_bytes([u8: 4], endianness: Endian) -> Self {
        match endianness {
            Endian::Little | Endian::Native => { ... }
            Endian::Big => { ... }
        }
    }
}

This would also allow us to implement the concept of “native” byte order a little differently, and create more names for byte orders:

enum Endian {
    Little,
    Big,
}
static NATIVE_ENDIAN: Endian = Endian::Little;
static NETWORK_ENDIAN: Endian = Endian::Big;

So, besides simplifying away the need for a separate implementation for the ne functions, and making the code more in sync with what’s happening, what other positive things have we accomplished? Well, given that we now have a parameter, we can now make more complicated code parametric on it. Imagine we have an entire structure to write out, and we want to write the entire structure as big-endian or little-endian, perhaps because the protocol in question changed endianness at some version. Or perhaps we just want to make clear to the reader that one endianness is used for the entire structure. We can now do something like this:

struct Structure {
    a: u32,
    b: u32,
    c: u32,
}

impl Structure {
    fn serialize(&self, endianness: Endian) -> [u8; 12] {
        let mut res = [0u8; 12];
        [0..4].copy_from_slice(self.a.to_endian_bytes(endianness));
        [4..8].copy_from_slice(self.b.to_endian_bytes(endianness));
        [8..12].copy_from_slice(self.c.to_endian_bytes(endianness));
        res
    }

    pub fn serialize_old_version(&self) -> [u8; 12] {
        self.serialize(Endian::Big)
    }

    pub fn serialize_new_version(&self) -> [u8; 12] {
        self.serialize(Endian::Little)
    }
}

The alternative would be to write two separate serializers, and duplicate all the logic of how to arrange the layout. Duplication is bad, because bug fixes don’t necessarily get to all the duplicate copies. So, to save on duplication, we’d have to basically wrap to_be_bytes and to_le_bytes in a version of this; it would be more convenient if the standard library had done this for us.

What is the downside of this? Well, the implementation didn’t really get any simpler. Actually, in the normal case, where you don’t change your mind about endianness, the implementation got more complicated. We now have a match expression in our two simplified functions, which theoretically indicates a run-time decision. We could trust the optimizer to fold the decision in through inlining and constant-propagation, but trusting the optimizer is suspicious and unnecessary.

Nothing we’ve done so far requires this decision to be made at run-time, and so we can instead make the decision at compile-time. Where we had a run-time parameter, we can now have a compile-time parameter.

Now, although Rust has rudimentary support for other kinds of compile-time parameters, the archetypical compile-time parameter is a type, bound by a trait. Our enum from before would then have to be lifted into the type space, as a trait and a few types:

trait Endianness { }

struct BigEndian;
impl Endianness for BigEndian { }

struct LittleEndian;
impl Endianness for LittleEndian { }

Now, we need a compile-time equivalent for match. This is a little harder, as at the time of this writing stable Rust does not have the most direct equivalent of match for implementors of traits, that is, “specialization.” But Rust does allow something similar: the code for each branch of the match must go in each type’s implementation of that trait, and the fact of the match must be provided in the trait itself.

This will also help us simplify the implementation. This to_be_bytes/ to_le_bytes API is not just implemented for u32, but for all primitive types. Currently, these mostly-similar implementations are stamped out by a macro, along with other methods for primitive types. But we might imagine that there are two things going on in the implementation:

  • write out the type into an array of bytes
  • either swap the bytes, or not, based on whether we’re using the hardware endianness

We could then make the trait come into play – with the decision made at compile time – for the swapping part.

trait Endianness {
    fn possibly_swap(bytes: &mut [u8]);
}

struct BigEndian;
impl Endianness for BigEndian {
    fn possibly_swap(bytes: &mut [u8]) {
        // actually swap here
    }
}

impl Endianness for LittleEndian {
    fn possibly_swap(_: &mut [u8]) {
        // no need to do anything here
    }
}

We have now moved some of the implementation into a trait, where the specifics of the implementation are determined by which type implements that trait. This is an example of the policy pattern, where a portion of the code is abstracted out into a policy, and the policy and the main body of the function are sewn together – in this case, at compile-time – into many variations of a function that execute similarly to what an implementor might have written by hand.

Note that there is no possibility of doing run-time endianness determination in this version. This trait methods does not take a self parameter, and would have to be invoked as T::possibly_swap. This is possible in Rust because we are doing compile-time polymorphism, not run-time, so there is no need to make this trait object-safe.

Our previous example serializer, with the two versions, now looks something like this:

struct Structure {
    a: u32,
    b: u32,
    c: u32,
}

impl Structure {
    fn serialize<T: Endianness>(&self) -> [u8; 12] {
        let res = [0u8; 12];
        (&mut res[0..4]).copy_from_slice(self.a.to_endian_bytes::<T>());
        (&mut res[4..8]).copy_from_slice(self.b.to_endian_bytes::<T>());
        (&mut res[8..12]).copy_from_slice(self.c.to_endian_bytes::<T>());
    }

    pub fn serialize_old_version(&self) -> [u8; 12] {
        self.serialize::<BigEndian>()
    }

    pub fn serialize_new_version(&self) -> [u8; 12] {
        self.serialize::<LittleEndian>()
    }
}

The policy pattern is a fairly common pattern in generic programming just like in object-oriented programming, but when generic programming is implemented through monomorphization, as it is in Rust, it can be just as efficient as hand-implementing the combinations of policy and code, while allowing for more policies.

For example, if there were a platform where 4-byte chunks were split into 2-byte chunks little endian, but 2-byte chunks were split into 1-byte chunks big endian, we could write a new policy for this platform and all the existing code would support it.

A much more complicated example of the policy pattern is serde, where the generated serializers and deserializers for each structure are all polymorphic on what serialization format should be used. If a new serialization format comes out with serde support, all existing Serialize instances can then be used with the new format without modification.

Now, in practice, there are often processor instructions that do byte swaps. The hardware uses an interface analogous to the hackish, conceptually messy to_be(), which at a hardware level makes sense because elegance of abstraction is not as an important goal as performance. This converts 0x12345678 into 0x78563412, and similar. So, this implementation is not actually what the policy would look like in a production context. Nevertheless, the endianness argument could definitely be passed in by a trait-constrained type parameter; the implementation would just be more complicated.

Traits#

I mentioned before that u32 is not the only type that implements this set of methods, this convention, this informal protocol of to_be_bytes, to_le_bytes, etc. This means that if we were writing in C++, we would have enough from this informal protocol to write a function that did something like “write this value in big endian twice, and little endian twice, to different locations” that was agnostic to the type provided, as long as it implemented this informal interface. It would look something like this:

template <typename T>
void write_four_times(T val) {
    write_to_location_1(val.to_be_bytes());
    write_to_location_2(val.to_be_bytes());
    write_to_location_3(val.to_le_bytes());
    write_to_location_4(val.to_le_bytes());
}

This would allow you to call write_four_times on any type for which that code made sense, as C++ templates are literally templates, and the T is filled in before type-checking. The protocol here is implicit in the structure of the function – it is compile-time duck typing.

Rust generic functions are type-checked before monomorphization, so we can’t do this in Rust. Instead of defining to_le_bytes() and friends separately on each type, this function would require them to be in a trait, maybe EndianBytes:

fn write_four_times<T: EndianBytes>(val: T) {
    write_to_location_1(&val.to_be_bytes());
    write_to_location_2(&val.to_be_bytes());
    write_to_location_3(&val.to_le_bytes());
    write_to_location_4(&val.to_le_bytes());
}

EndianBytes would have to define at least those methods:

trait EndianBytes {
    fn to_be_bytes(self) -> [u8; ???];
    fn to_le_bytes(self) -> [u8; ???];
}

Unfortunately, as the ??? shows, the different output arrays have different lengths – a u16 would be 2 bytes and a u64 8 bytes – and so the Rust trait system at the time of this writing is (to my knowledge) not powerful enough to represent this trait as is. Instead, it would have to return a slice, which introduces an additional run-time value (the length) into the mix that we’d rather avoid in this exercise on compile-time generic programming.

Run-Time Endianness#

What if we want to make decisions about endianness at run-time, say, because we are implementing DBus? This is, as Linus Torvalds pointed out in one of his famously angry emails, a stupid idea for a protocol, but we don’t always get to choose what protocol we implement. Even though choosing one endianness and sticking to it would have avoided the run-time cost of making a decision (which as Torvalds points out is more than the cost of either decision), the developers of DBus did not do that. UTF-16 also didn’t – it also does run-time endianness adjustment with a sentinal character at the top of the text block to indicate the endianness.

The most obvious solution is to use the run-time parameterized version we discussed towards the beginning of this post, and have an enum Endianness parameter. This would be parsed in each message (or connection, or whatever duration of time endianness is configured) and then passed through to all the serializing and deserializing code, which would look something like our original serialization example:

fn serialize(&self, endianness: Endian) -> [u8; 12] {
    let res = [0u8; 12];
    (&mut res[0..4]).copy_from_slice(self.a.to_endian_bytes(endian));
    (&mut res[4..8]).copy_from_slice(self.b.to_endian_bytes(endian));
    (&mut res[8..12]).copy_from_slice(self.c.to_endian_bytes(endian));
}

pub fn serialize_old_version(&self) -> [u8; 12] {
    self.serialize(Endian::Big)
}

pub fn serialize_new_version(&self) -> [u8; 12] {
    self.serialize(Endian::Little)
}

We can do better than that, though. This has one copy of the serialization code in the source, and one copy in the binary. What we could do instead, is expand the more sophisticated compile-time version of the serialization code, and move the match into a wrapper serialize method:

fn serialize_impl<T: EndiannessTrait>(&self) -> [u8; 12] {
    let res = [0u8; 12];
    (&mut res[0..4]).copy_from_slice(self.a.to_endian_bytes::<T>());
    (&mut res[4..8]).copy_from_slice(self.b.to_endian_bytes::<T>());
    (&mut res[8..12]).copy_from_slice(self.c.to_endian_bytes::<T>());
}

pub fn serialize(&self, endianness: EndiannessEnum) -> [u8; 12] {
    match endianness {
        EndiannessEnum::Big => self.serialize_impl::<BigEndian>(),
        EndiannessEnum::Little => self.serialize_impl::<LittleEndian>(),
    }
}

This generates two serializers from one serializer function (thus mitigating the biggest problem with code duplication – that of maintainability), and makes the run-time decision further up in the call tree. This ability – to adjust between finer-grained run-time decisions and duplication of run-time code – is one of the greatest powers of C++ and of Rust. We can effectively – in the DBus case – create two entire DBus deserializers – one for little-endian, one for big endian – and then decide between the two deserializers at run-time on a per-message basis, which, because fewer run-time decisions are being made, will be much more efficient than making the run-time deserialization decision at every deserialization site.

Of course, for serialization we can simply write one serializer and always generate little-endian DBus messages.