CompSci question: Does endianness ever still matter?

I’ve always been a bit flabbergasted by endianness and wonder about it from time to time (exciting life, I know). Is it ever still an issue that regular IT people or hobbyists (meaning people aside from chip manufacturers) have to deal with? Like so many other little technical details, it just seems like one of those things that would’ve made everyone’s lives easier if one or the other won out early enough – doesn’t really matter which one.

Was there a holy war over this back in the day? Do people still care now?

Yes. Most modern chips (Intel, ARM) are little-endian, but by popular convention, transmission formats (most notably TCP/IP) usually specify big-endian as part of their protocol definition. Only relatively low-level programmers (interpreting a stream of bytes as numbers, for example) will care: for most devs, the APIs you’re using to communicate (or store) cross-platform data will take care of reordering for you when necessary, and it’ll all just “look right” to you at the end.
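
For example, the BSD sockets API gives you htonl()/ntohl() (“host to network long” and back) for exactly this. A minimal sketch of what that looks like, assuming a POSIX-ish system:

#include <arpa/inet.h>  /* htonl, ntohl */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    uint32_t host_value = 0x12345678;

    /* Convert to network byte order (big-endian) before putting it on the wire... */
    uint32_t wire_value = htonl(host_value);

    /* ...and back to host order after receiving. On a big-endian machine
       both calls are no-ops; on a little-endian machine they byte-swap. */
    uint32_t round_trip = ntohl(wire_value);

    printf("host=0x%08x wire=0x%08x back=0x%08x\n",
           (unsigned)host_value, (unsigned)wire_value, (unsigned)round_trip);
    return 0;
}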

Yep, endianness matters. A lot of the time the details are hidden from you, and most hardware these days is little-endian. But if you’re working with ARM processors, or doing network stacks, you still end up with big-endian stuff that you have to handle correctly.

ARM is actually an odd case: the chips are bi-endian (they can be configured to use either endianness). In my experience they're usually run little-endian these days, but not always.

And to answer the OP’s other question: Yes, IIRC, there were holy wars.

Very few software developers have to worry about it. I’m a software engineer and I don’t recall it ever being an issue. At this point it’s like worrying if the hard drive is spinning at the right speed: somebody has to make sure it does but I certainly don’t.

Yeah, it mostly comes up in the context of writing your own binary file format or network protocol. I work in embedded development, so I deal with it somewhat frequently. If you’re doing stuff like web development or enterprise administration software, then you can probably go for years without ever having to deal with it directly.

Still, it’s one of those things that most programmers will probably encounter at least a few times in their career, and if they’re not aware that it’s an issue, the results can be messy. E.g. I once had to deal with an in-house developed image file format, and it turned out that the guy who designed the format never made an explicit decision about endianness, so each file just had the endianness of the machine it had been created on, and there wasn’t any field in the header indicating which it was. So you had to go over all the numeric values in the file and make a statistical guess about which endianness led to a more plausible interpretation…

I haven’t had to deal with endianness since…well, it’s been a few days. But not many.
Then again, I’m one of those low-level programmers who actually deals with chip-to-chip communication, as well as process to processor communication.

I deal with streaming protocols for stock market data from dozens of various sources. Some send the data in text. Some in binary. For those that send it in binary, the endianness definitely matters.

I design industrial control equipment and we run into endian problems quite frequently. We had to do byte and word swaps to handle endianness differences between our main processor and an I/O interface that mapped into the system via shared memory.

We also run into it a lot using the Modbus TCP protocol. The protocol itself specifies bits and 16 bit words, but doesn’t specify floats. However, it has become common for the protocol to use two sequential 16 bit registers to store a 32 bit IEEE float. Since there is no standard for it though, we have had to deal with both byte swap problems and word swap problems. There’s even one I/O processor we deal with that has the floats in two different formats within the same controller. Those floats that are generated by the controller are in one format, and floats that are passed up through other I/O protocols are in another.
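
For anyone who hasn’t had the pleasure, here’s roughly the kind of glue we end up writing. A sketch only (function and parameter names made up), showing the word-swap half of the problem:

#include <stdint.h>
#include <string.h>

/* Reassemble a 32-bit IEEE float from two consecutive 16-bit Modbus
   registers. There's no standard for which register holds the high
   half, so the caller has to say whether this device word-swaps. */
float modbus_regs_to_float(uint16_t reg_a, uint16_t reg_b, int word_swapped)
{
    uint32_t bits = word_swapped
        ? ((uint32_t)reg_b << 16) | reg_a    /* low word first */
        : ((uint32_t)reg_a << 16) | reg_b;   /* high word first */

    float f;
    memcpy(&f, &bits, sizeof f);  /* reinterpret the bit pattern as a float */
    return f;
}

And that’s before you even get to devices that also byte-swap within each register.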

Back in the day, two of the most common chips were the Intel x86 family and the Motorola 68000 series. Intel was little-endian and Motorola was big-endian. It made for a lot of interfacing problems. Still does.

I remember in the 80’s/90’s, it mattered if you wrote in assembly language. Motorola 68XXX vs. Intel 808XX. Someone help me here -> The Intel architecture was also segmented (64K chunks in unprotected mode). Something to be concerned with if you were writing under MS-DOS. Maybe someone could elaborate.

Byte order isn’t really important except for reading encoded binary formats, which usually define which byte order they’re using. (And then there’s the .wav file format, whose header is partially little-endian and partially big-endian, because reasons.) Ideally, any code dealing with byte order should be shoved off into functions like [En|De]codeBinary(data, bufferEndianness), which automagically decode a stream into <your system’s endianness> and then never get thought about again. Meaning: as a developer, you really shouldn’t have to worry about the endianness of your machine in general practice.
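
Something along these lines, say (a minimal sketch, not any particular library’s API; decode_u32 and the enum are made-up names):

#include <stdint.h>

enum buf_endianness { BUF_LITTLE_ENDIAN, BUF_BIG_ENDIAN };

/* Decode a 32-bit unsigned integer from a byte buffer whose endianness
   is known, without ever asking what the host's endianness is. */
uint32_t decode_u32(const unsigned char *buf, enum buf_endianness e)
{
    if (e == BUF_LITTLE_ENDIAN)
        return (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
               ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    else
        return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
               ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}

Write that once, test it once, and the rest of your code never mentions byte order again.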

See The Byte Order Fallacy

The caveat is that this may be less true in fiddly embedded or micro devices or naked assembly.

The 8086 addressed everything as segment + offset. Since the segment registers were 16 bits and the offset was also 16 bits, that left you with 64k of offsets in each segment. The segments also overlapped. To form the “real” address, you shifted the segment left by 4 bits and added the offset to that, for a total of 20 addressing bits. That gave you 1 meg of address space. The boot rom and video interface and a few other things were in the upper part of that 1 meg space, and what remained below it was the infamous 640k of RAM.
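
In code form the translation is just a shift and an add; something like this toy illustration (B800 being the classic text-mode video segment):

#include <stdint.h>
#include <stdio.h>

/* 8086 real-mode address translation: (segment << 4) + offset
   yields a 20-bit physical address. */
static uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}

int main(void)
{
    /* B800:0000 -> physical 0xB8000 (the text-mode video buffer) */
    printf("0x%05X\n", (unsigned)real_mode_address(0xB800, 0x0000));

    /* Overlapping segments: two different segment:offset pairs can
       name the same physical byte. */
    printf("0x%05X 0x%05X\n",
           (unsigned)real_mode_address(0x1234, 0x0010),
           (unsigned)real_mode_address(0x1235, 0x0000));
    return 0;
}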

The Motorola 68k series used a single 32 bit offset for the address, so it had a “flat” memory model (instead of segment + offset). The original M68000 only had 24 address bits coming out of the processor, so the upper 8 bits were basically just ignored.

The 68k addressing was much simpler and easier to learn, but grouping things into segments did have advantages.

When the Intel 80286 came out, it used “protected mode” which, in addition to having “protected” code and data spaces, also had what is called the Global Descriptor Table. When in protected mode, everything still used a 16-bit “segment” and a 16-bit offset. However, instead of the “segment” being just shifted and used to form the real address, it was an index into the GDT, and the GDT entry contained the actual base address of the segment. This allowed direct addressing of significantly more than 1 meg of address space.
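
Conceptually the address calculation goes from “shift and add” to “table lookup, then add”. A very rough sketch (not the real descriptor-table layout, and ignoring the LDT, privilege levels, and limit checks):

#include <stdint.h>

/* Toy model of 286 protected-mode addressing: the "segment" value is
   now a selector whose upper 13 bits index a descriptor table, and the
   descriptor supplies the segment's actual base address. */
struct descriptor {
    uint32_t base;    /* where the segment really starts */
    uint16_t limit;   /* how big it is (checks omitted here) */
};

uint32_t protected_mode_address(const struct descriptor *gdt,
                                uint16_t selector, uint16_t offset)
{
    unsigned index = selector >> 3;   /* bits 3-15 of the selector */
    return gdt[index].base + offset;  /* base comes from the table */
}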

Windows version 3.x used 286 style 16 bit protected mode.

MS-DOS ran in 8086 mode. This is usually referred to as “real” mode, not unprotected mode. Later processors, all the way up to modern 64-bit processors, still boot into real mode for compatibility.

Ah, was just working on something that reminded me of the far more common modern reason why endianness matters: Unicode. For Unicode encoding formats larger than UTF-8 (which is just a byte stream), endianness matters, and lots of dev tools and editors use UTF-16.

16-bits gives you a good compromise of not wasting too much space, and getting a fairly large character set with minimal use of surrogates–a surrogate is basically a multi-chunk encoding that maps to a single character too long to fit in the standard character size for the format. For example, in UTF-8 (8 bits per “character”), there are only a couple hundred different characters you can encode in a single byte. Basically any non-latin text will start to use surrogates, in which you have to take two, three, or even more bytes to encode a single character. UTF-16 (two bytes per “character”) gives you thousands of characters that don’t require the surrogate treatment, but now you care about the endianness of each character. (There is also UTF-32 and a bunch of other encodings).

Ideally, a properly formed UTF-16 document should have a Byte Order Mark (BOM) as the first pair of bytes, and a properly conforming application is required to handle either endianness. In practice, lots of apps don’t write the BOM, and/or skip over it when reading and just “assume” the byte order is little-endian, sometimes with hilarious results. (Even lazier apps don’t bother handling surrogates, which makes them useless for many languages, which means they went to the bother of converting to Unicode for basically nothing.)
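
The annoying part is that checking the BOM is only a few lines of code. A sketch (utf16_byte_order is a made-up name, and real code still has to pick a policy for the no-BOM case):

#include <stddef.h>

/* Returns 1 for big-endian UTF-16, 0 for little-endian, -1 if there is
   no BOM and the application has to guess (or fall back to the spec's
   default of big-endian). The BOM itself is the codepoint U+FEFF. */
int utf16_byte_order(const unsigned char *buf, size_t len)
{
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
        return 1;    /* FE FF: U+FEFF stored big-endian */
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
        return 0;    /* FF FE: U+FEFF stored little-endian */
    return -1;
}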

UTF-8 doesn’t use surrogates. It’s a variable-length encoding with between one and four bytes per character. (The first 128 codepoints being encoded the same way as ASCII.)
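
This is also why byte order never comes up for UTF-8: the encoding is defined byte by byte, so there’s nothing to swap. Roughly (a sketch; validation of invalid codepoints omitted):

#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-8. Returns the number of bytes written (1-4). */
size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* up to 7 bits: same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                /* up to 11 bits -> 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {              /* up to 16 bits -> 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                                /* up to 21 bits -> 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}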

UTF-16 just encodes the BMP (codepoints 0000-FFFF) as two-byte characters, in either endianness, preferably marked with a BOM. Surrogate pairs are only needed for characters outside the BMP.
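
For concreteness, the surrogate-pair math for a codepoint above U+FFFF is just this (sketch, no validation):

#include <stdint.h>

/* Split a codepoint above U+FFFF into a UTF-16 surrogate pair.
   Example: U+1F600 becomes 0xD83D, 0xDE00. The file's endianness then
   applies to each of those 16-bit units individually. */
void utf16_surrogate_pair(uint32_t codepoint, uint16_t *high, uint16_t *low)
{
    uint32_t v = codepoint - 0x10000;           /* 20 bits left to encode */
    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
}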

UTF-8 and UTF-16 are the same size, in that you can encode all the same characters in both encoding formats. That’s why they’re both used. That’s the only sensible way to compare ‘size’ in this context.

Actually, for most Western (Latin alphabet) texts, where most of the characters used are in ASCII, UTF-8 is more efficient because UTF-16 always uses a minimum of 16 bits, so something in ASCII-only English will be twice the size if encoded in UTF-16 as it would be in UTF-8. French text will expand, too, but not quite as much, ditto for German and Spanish.

Also, calling what UTF-8 does with its bytestream scheme “surrogates” is pretty misleading: The word “surrogate” is already used in this context, in the form of the term “surrogate pairs”, which is what UTF-16 does to represent codepoints beyond the 16-bit Basic Multilingual Plane. UTF-8 doesn’t have or need surrogate pairs, because it is bytestream-oriented and goes to multi-byte encodings more gracefully.

I’ve often wondered, why was it only a 4 bit shift? A one byte shift would seem more natural. And it would have gotten you a 16 Mbyte address space, which would be enough for forever.

As an operating system developer, I wholeheartedly disagree with that article. Manually decoding messages in the manner that he describes is extremely error-prone and difficult to read. He offers a strawman to tear apart, gleefully ignoring the canonical way to handle byte ordering:


i = letoh32(*((uint32_t*)data));

Where letoh32 stands for “convert a little-endian 32-bit integer to host byte order”. Of course, this isn’t a standard function so the name will vary between implementations, but I guarantee you that all sane people are handling endianness in a similar way. So let’s review his arguments:

My way is even less code, and it’s pretty well impossible to screw up my way, whereas his way has all kinds of places for subtle bugs (hope he doesn’t typo 25 for 24 somewhere in his code…)

A valid point. The implementation that I deal with has a separate le32dec() function that decodes a mis-aligned byte array to a host-order integer. In the example above it would have been safer to use it. letoh32() is to be called when you already have an integer and know that it’s aligned safely. I should not be blithely casting to uint32_t.
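
For anyone following along, the two flavours look roughly like this (a sketch, assuming the GCC/Clang byte-order macros and bswap builtin; the real implementations differ in detail):

#include <stdint.h>
#include <string.h>

/* letoh32-style: you already hold a 32-bit value that was stored
   little-endian; swap it to host order only if the host is big-endian. */
static uint32_t my_letoh32(uint32_t x)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return __builtin_bswap32(x);    /* big-endian host: byte-swap */
#else
    return x;                       /* little-endian host: no-op */
#endif
}

/* le32dec-style: read four bytes from a buffer that may not be aligned,
   then convert. The memcpy keeps the access alignment-safe. */
static uint32_t my_le32dec(const void *buf)
{
    uint32_t x;
    memcpy(&x, buf, sizeof x);
    return my_letoh32(x);
}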

This is a total strawman. He wrote that code. He just as easily could have used the <stdint.h> types (as I did above) and eliminated the bug, but he didn’t.

Seriously, <stdint.h> has been a part of the C language for 15 years now. There’s no excuse for not knowing of its existence, and in the unlikely case that you are targeting a system that does not provide that header, it is not difficult to write your own.

My version automatically selects the highest-performance version of the code for the targeted architecture. If I can use betoh32() then I have better performance than him on little-endian and at worst tie him on big endian machines (and I may win, depending on how good the compiler is at optimizing his version). If I have to use le32dec() then we tie (in fact, le32dec() will almost certainly be defined exactly as his manual decoding).

By using a standard API, I don’t have to test the API myself. I can trust that the author of the API has tested it appropriately and I am only responsible for testing my own code.

His example is far too vague for me to make any sense of it. But in any case, if your code deals with data that has a specific byte ordering, you must test the behaviour on machines with different byte orderings. It’s all well and good to say, “I wrote my code to be independent of byte ordering” but that’s just a specific way of saying “Of course my code is bug-free”, which is never actually true.

I’m not an OS dev so I can’t refute anything you said, but I will point out that he’s also an OS dev. That’s from Rob Pike, who worked on Unix, Plan 9, and Inferno.

There are cases where the overlap is useful, or where you don’t need the full 64kB of a segment and it would be a shame to waste the remainder. Back then, using the existing 640kB of a typical PC as efficiently as possible was considered a more important design goal than worrying about how to address the 16MB that future generations of PCs might have.