Mr. McCat's
Mysterious Coprolite Emporium |
||||||||||
Home | Blog | About |
2025-01-08
Historically, we have two major options for storing numbers as bytes. I'm talking, of course, about endianness.
This goes not only for their storage in-memory, but also for files as well.
Now, what does it have to do with bitstream parsing?
In simpler formats, it's largely a no-brainer because simpler formats operate on
standard data types, like bytes or DWORD
or WORD
or whatever. Just read and re-order,
what is your problem, right?
Right, but many formats employ bit-fields to compress their structures.
That's where the matter of endianness becomes interesting.
The commonly adopted convention is to label bits from MSB to LSB, e.g. the highest bit is #0, and the lowest is, well, #N.
With big-endian, it's all pretty straightforward, as bits are assumed to go from highest to lowest in the byte, and the bytes are assumed to go from the highest to the lowest in the stream, so it all flows in the same direction, so we just shift and overlay as we read, keeping track of how many bits we've already read.
Little-endian is a whole different game. Let's dig.
Now, how do we read, say, 13 bits?
Suppose we're skipping one bit and then reading another thirteen, so, technically speaking, bits from #1 to #13 inclusive.
First, since we're dealing with bits, MSB, LSB, et al, let's consider an arithmetic representation of our little two-byte stream:
Arithmetic representation:
*------------------------------------------------->*
[ 0] | [ 1][ 2][ 3][ 4][ 5][ 6][ 7][ 8][ 9][10][11][12][13] | [14][15]
high low
Okay, if we already have the u16
to read from, this is trivial cloak and dagger mask and shift thing.
Let's take a look at big-endian representation. What shall we find?
Big-endian:
*---------------------------------------------------->*
[ 0] | [ 1][ 2][ 3][ 4][ 5][ 6][ 7] [ 8][ 9][10][11][12][13] | [14][15]
high low
It all aligns really well. In big-endian, bit-reader is very straightforward with a few bitmasks and all that.
Now, how about little-endian? Things get a bit interesting. Assuming we use MSB-to-LSB bit labeling and mind the ordering, we would get a bit diagram looking like this:
Little-endian:
*--------------------->* *------------------------->*
[ 8][ 9][10][11][12][13][14][15] [ 0][ 1][ 2][ 3][ 4][ 5][ 6][ 7]
high low high low
See the problem yet?
That's right, we have to know in advance the number of bytes to consider, because the ones we need first live downstream from current position.
Basically, we have to read bytes and reorder first, to get an arithmetic representation. That is the one reason all the bit-diddling codecs use little-endian in their bitstream format, because many bitstreams are so tightly packed they have to assume continuous stream of bits from MSB to LSB, and not a delimited structure you have to read piece-wise and byte-swap, like word-stream or dword-stream or some such.
Is there a way around this? Could we still bit-read in a streaming mode?
Sadly, no. Because of the widely-adopted MSB-first convention, the first bits to be read would be downstream from our location. We would have to know the size of the underlying number, read it first, re-order accordingly, and then extract bits out of it, starting with MSB. And then keep it for further bit fields coming after the current one.
That's part of the reason most little-endian specifications supply bit diagrams in the description of their structures, usually being aligned to 32-bit integers or some other metric. This can get epsecially confusing because often the data is not specifically aligned to a fixed integer layout but is rather relying on the reader's intuition to discern where the appropriate byte boundary lies when it comes to bit fields.
In line with me never neglecting a chance to dunk on retarded formats, it is prudent to note that Microsoft specs are especially prone to containing this type of conundrum in non-trivial amounts.
All in all, using ambiguously aligned bit fields with little-endian storage is the very first
Mortal Sin of Binary Formats
, namely what we would from now on call
the Labyrinthian Sin
, because if you try to parse such a format without
a guiding thread of proper insight into its layout, which is often omitted in a
few corners and left as an exercise for the reader, you're as good as cooked and
are in for a bunch of disgusting trial-and-error guesswork.