2022-11-14

Rust enums: a bit about size and optimisations

🔗Under the hood

An enum is made up of zero or more variants and a discriminator.

enum Choice {
    A,        // <-|
    B,        // <-|- variants
    C         // <-|
}

The discriminator is an integer indicating which variant is held.

Given the following code: let a = Choice::A; the discriminant would be zero.

Given the following code: let c = Choice::C; the discriminant would be two.

🔗Fieldless enums

An enum without associated data is a "fieldless" enum.

It is possible to assign a custom discriminant value to a fieldless enum:

enum Choice {
    A = 10,
    B = 20,
    C,
}

assert_eq!(Choice::A as usize, 10);
assert_eq!(Choice::B as usize, 20);
assert_eq!(Choice::C as usize, 21);

If no specific value is given as a discriminant the value is that of the previous variant plus one.

Note that Choice::C is the same as Choice::B + 1.

🔗Zero variant enum

An enum with no variants can never be created. There are few use cases for such an enum but they do exist.

One such use case is the error type on a Result<T, E> that can never fail, but the return type still has to be a Result<T, E>.

There is currently an enum in the standard library that exists for this reason: Infallible.

🔗Size

The size of an enum is the size of its largest variant + the discriminant (unless the discriminant can be optimised away, more on this below).

Given the following enum, we can look at the size of each variant:

use std::mem::size_of; // returns the size of a type

enum Data {
    SingleByte(u8), // size of u8
    Bytes(Vec<u8>), // size of Vec<u8>
    NoData          // zero size
}

The size of Data::SingleByte(u8) is one (as one byte is needed to store a u8).

A Vec<T> contains more information:

On a 64 bit architecture a usize is a 64 bit integer, making it eight bytes in size.

use std::mem::size_of;

let size = size_of::<usize>()  // first the pointer
         + size_of::<usize>()  // then the capacity
         + size_of::<usize>(); // then the length

// usize = 8 bytes on a 64 bit architecture
assert_eq(size, 24);

This makes Data::Bytes(byte_vec) the largest variant, taking up 24 bytes (assuming 64 bit arch).

The values inside the vector is not taken into consideration when talking about the size of the vector here.

It might seem wasteful that [Data::NoData, Data::NoData] takes up the same amount of space as [Data::Bytes(bytes), Data::Bytes(more_bytes)], but it makes sense from the perspective of a stack frame, or storing multiple enums in a collection (like an array or a vector).

Given the following:

let data: Vec<Data> = get_data();
let some_payload = &data[4];

Without a fixed size it would be rather difficult to jump straight to the fifth entry in the vector, nor would it be possible to determine the size of a stack frame, something Rust needs to know at compile time.

🔗Optimisations

The compiler can perform optimisations on an enum depending on the variants.

If the size of an enum is the size of the largest variant plus the discriminant, and the discriminant is eight bytes by default, the following enum should have a size of 32 bytes (the Vec<u8> is 24, and the discriminant is eight).

enum Data {
    Bytes(Vec<u8>),
    NoData
}

However this enum has a size of 24 bytes.

To observe this it's possible to transmute the enum to bytes:

let data = Data::Bytes(vec![1, 2, 3]);
let bytes: [u8; mem::size_of::<Data>()]  = unsafe { mem::transmute(data) };
println!("{:?}", bytes);

// Bytes:
[
    // Pointer to the first value on the heap
    160, 10, 43, 168, 200, 85, 0, 0, 
    // Capacity
    1, 0, 0, 0, 0, 0, 0, 0,
    // Size
    1, 0, 0, 0, 0, 0, 0, 0
]

There is no discriminant in this data.

If the second variant was transmuted to bytes the output would look something like this:

[
    0, 0, 0, 0, 0, 0, 0, 0,
    47, 236, 38, 139, 176, 127, 0, 0,
    16, 170, 58, 244, 208, 85, 0, 0
]

Note that the first eight bytes here are zero, a value that would be invalid for the vector. Thus the discriminant can be determined by the first bytes, rather than adding an additional eight bytes specifically for it.

A vector is composed of three values.

The first value is a pointer to the heap. This value can never be zero, as that is reserved for the null pointer, and since Rust doesn't allow null values it would be an invalid vector if the pointer component was 0x0.

The compiler can perform something called "niche filling".

A niche in this context means a bit pattern that isn't valid.

Take the following enum:

enum State {
    Processing(u8, bool),
    Done,
}

The first variant has two types: u8 and bool. This means the first byte can not be used to represent the discriminant as any byte would be valid here.

However the second type is a boolean, and there are only two valid values for a boolean: zero or one. Therefore we can use two to represent the second variant: State::Done.

// State::Processing(42, false)
[42, 0]
// State::Processing(42, true)
[42, 1]
// State::Done
[16, 2] // here 16 is an arbitrary value, only 2 is relevant

The layout of types in Rust is not considered stable as Rust has an unstable ABI (Application Binary Interface).

This means that the output of std::mem::transmute might not be the same between different versions of the compiler.

The advantage of this is that new optimisations can be added to the compiler without breaking existing code.

The disadvantage is that it's not safe to transmute a series of bytes into a type as the byte order, or even the size of the type might change from one compiler to another.

🔗The end

Thanks to twitch.tv/museun for reading and correcting my mistakes