Published on ONLamp.com (http://www.onlamp.com/)


How an Accident of Hardware Design Encouraged Open Source

by Mark Rosenthal
02/22/2007

I shall be telling this with a sigh
Somewhere ages and ages hence:
Two roads diverged in a wood, and I,
I took the one less traveled by,
And that has made all the difference.
—Robert Frost

Back in the early 1970s, the hardware engineers at Digital Equipment Corporation made a decision about how their new computer, the PDP-11, would address memory. I believe their decision had the unintended, butterfly-effect consequence of helping to bring the open source software movement into existence.

A Bit of History

In the 1970s, IBM was the behemoth of computing. It sold big, expensive mainframe computers known as the IBM 360 and, later, the IBM 370. By contrast, Digital Equipment Corporation was the dominant seller of small, inexpensive minicomputers. DEC's PDP-8 was to computing what Ford's Model T had been to automobiles. Although very limited in its capabilities (its opcode field was only 3 bits, and a typical configuration had either 8K or 12K 12-bit words of memory), it was cheap in comparison to IBM's big iron. It was inexpensive enough that virtually every college in the country could afford several of them. DEC followed the PDP-8 with the PDP-11; its low price and nicely orthogonal instruction set made it DEC's biggest seller throughout the 1970s.

For computers produced prior to IBM's 360 and DEC's PDP-11, the smallest addressable unit of memory was whatever the processor's word size happened to be, which generally wasn't a multiple of 8 bits. One architecture used a 36-bit word; another used a 12-bit word. This made it cumbersome to write software that manipulated text. For example, Figure 1 shows the PDP-8's representation of the string DIGITAL. To walk through the characters of a string, you fetched the low-order 8 bits of the first word for the first character, then the low-order 8 bits of the second word for the second character, and then, for the third character, glued the high-order 4 bits of the first word onto the high-order 4 bits of the second word. For the fourth character, you started the process all over again with the next pair of words.

word address     0             1             2             3             4
high 4 bits      G (hi bits)   G (low bits)  A (hi bits)   A (low bits)  (unused)
                 0100          0111          0100          0001          0000
low 8 bits       D             I             I             T             L
                 01000100      01001001      01001001      01010100      01001100

Figure 1. Character string representation of "DIGITAL" as packed into PDP-8 12-bit words.
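
To make the awkwardness concrete, here is a minimal C sketch of the extraction just described, using the packing shown in Figure 1. The function and the way the 12-bit words are held in 16-bit variables are my own illustration, not anything a real PDP-8 toolchain provided; actual PDP-8 code would have done this in assembly language.

#include <stdio.h>
#include <stdint.h>

/* Illustration of the packing in Figure 1: three 8-bit characters are
   squeezed into a pair of 12-bit PDP-8 words.  Here each 12-bit word
   is held in the low 12 bits of a uint16_t. */
static char packed_char(const uint16_t *words, int n)
{
    int pair  = (n / 3) * 2;    /* index of the word pair holding character n */
    int which = n % 3;

    switch (which) {
    case 0:  return (char)(words[pair] & 0xFF);      /* low 8 bits of 1st word */
    case 1:  return (char)(words[pair + 1] & 0xFF);  /* low 8 bits of 2nd word */
    default: /* glue the two high-order nibbles together */
        return (char)((((words[pair] >> 8) & 0x0F) << 4) |
                      ((words[pair + 1] >> 8) & 0x0F));
    }
}

int main(void)
{
    /* "DIGITAL" packed exactly as in Figure 1 */
    const uint16_t digital[] = { 0x444, 0x749, 0x449, 0x154, 0x04C };

    for (int i = 0; i < 7; i++)
        putchar(packed_char(digital, i));
    putchar('\n');              /* prints DIGITAL */
    return 0;
}

Getting at the third character of every group means masking and shifting pieces of two different words; on a byte-addressed machine the same operation is a single indexed load.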

Needless to say, writing code to manipulate strings on such an architecture was quite cumbersome. Although the idea may seem self-evident today, giving each byte its own unique address was an important innovation. The 360 and the PDP-11 were the first machines IBM and DEC had produced with memory organized into 8-bit groups known as bytes, and with an addressing scheme that allowed each byte of memory to be individually addressed. But the IBM and DEC hardware engineers made different decisions about how to number the bytes within words (larger groupings of bytes). The decision by the PDP-11's designers not to use the same scheme as the IBM 360 had far-reaching and unforeseeable consequences.

Little-Endian vs. Big-Endian Byte Order

The IBM 360 designers had numbered the bytes within a word (4 bytes on the 360) based on the way English words are written -- from left to right. So if text is stored in a word of memory, the first character is stored in the byte that would hold the most significant bits of an integer if that word were used to store a binary integer.

By convention, each bit in a word is numbered according to the power of 2 that it represents. Thus, in a four-byte word, the lowest-order bit is numbered 0 (because its value is 1, that is, 2^0), and the highest-order bit is numbered 31 (because its value in an unsigned integer is 2^31) (see Figure 2).
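
To tie the numbering to the example value used in the figures: 259 = 256 + 2 + 1 = 2^8 + 2^1 + 2^0, so only bits 8, 1, and 0 are set, and every other bit in the word is 0.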

This scheme, used by IBM's mainframes, numbered the bytes left-to-right (so the most significant byte of an integer has the lowest address) but the bits right-to-left; it came to be known as big-endian byte order.

character data (the word UNIX)   U           N           I           X
binary data (the number 259)     00000000    00000000    00000001    00000011
byte address                     0           1           2           3
bit number                       31-24       23-16       15-8        7-0

Figure 2. Bit-numbering and byte-numbering on the IBM-360, and the word
"UNIX" and the binary representation of 259 as stored in one word on the IBM-360

The PDP-11 designers, on the other hand, thought it more logical to number the bytes within a word in the same order as the bits within a byte. Being an inexpensive minicomputer, the PDP-11 had 16-bit words instead of 32-bit words. In a 2-byte integer, the byte that held the least significant bits had a lower-numbered address than the byte that held the most significant bits.

binary data (the number 259)     00000001    00000011
byte address                     1           0
bit number                       15-8        7-0

Figure 3. Bit-numbering and byte-numbering on the DEC PDP-11, and the
binary representation of 259 as stored in one word on the DEC PDP-11.

character data (the word UNIX)   X           I           N           U
binary data (the number 259)     00000000    00000000    00000001    00000011
byte address                     3           2           1           0
bit number                       15-8        7-0         15-8        7-0

Figure 4. Bit-numbering and byte-numbering on the DEC PDP-11, and the
word "UNIX" as stored in two words on the DEC PDP-11.

This scheme, used by DEC's minicomputers, numbered both the bytes and the bits right-to-left (so the least significant byte of an integer has the lowest address); it came to be known as little-endian byte order.

The PDP-11's byte order being exactly backward from the IBM 360's byte order is what created the little-endian/big-endian byte-ordering nightmare. In the 1980s, I ported a lot of C code from one machine to another, and at least half of my effort went into fixing byte-order dependencies in the code.
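
The kind of dependency I mean looks something like the following minimal C sketch (the function names and the "count" field are invented for illustration). Dumping an integer's in-memory bytes straight into a file works only until the file is read on a machine with the other byte order; the portable fix is to pick one order for the file format and assemble and disassemble the bytes explicitly.

#include <stdio.h>
#include <stdint.h>

/* Non-portable: the bytes land in the file in whatever order this
   CPU happens to use, so a machine of the opposite byte order reads
   back a scrambled value. */
static int write_count_raw(FILE *fp, uint32_t count)
{
    return fwrite(&count, sizeof count, 1, fp) == 1 ? 0 : -1;
}

/* Portable: serialize most-significant byte first, so the file means
   the same thing no matter which CPU wrote it. */
static int write_count_be(FILE *fp, uint32_t count)
{
    unsigned char buf[4] = {
        (unsigned char)(count >> 24), (unsigned char)(count >> 16),
        (unsigned char)(count >> 8),  (unsigned char)count
    };
    return fwrite(buf, sizeof buf, 1, fp) == 1 ? 0 : -1;
}

static uint32_t read_count_be(FILE *fp)
{
    unsigned char buf[4];
    if (fread(buf, sizeof buf, 1, fp) != 1)
        return 0;
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8)  |  (uint32_t)buf[3];
}

int main(void)
{
    FILE *fp = tmpfile();
    if (fp == NULL)
        return 1;

    write_count_raw(fp, 259);   /* readable only by machines of the same byte order */
    write_count_be(fp, 259);    /* readable everywhere */
    rewind(fp);

    printf("raw field, read as big-endian: %lu\n",
           (unsigned long)read_count_be(fp));   /* 259 only on a big-endian machine */
    printf("explicitly serialized field:   %lu\n",
           (unsigned long)read_count_be(fp));   /* 259 on every machine */
    fclose(fp);
    return 0;
}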

Portable OS and Incompatible Byte Orders Lead to a Preference for ASCII

As I think about it, I believe that this problem of byte-order incompatibility ended up being a major driving force behind the openness we find in the descendants of Unix. OSes prior to Unix were written in assembly language and were thus inextricably tied to the instruction set of the CPU they were written to run on. Unix was the first major OS written in a higher-level language, so it was easy to compile it for many different kinds of CPUs. Because other OSes ran only on a single architecture, they never encountered the byte-order incompatibility problem. But Unix's portability guaranteed that it would run into the problem, big time. When people tried to move data between big-endian and little-endian machines, bytes within words came through scrambled. The scrambled results became the name of the problem: it was known as either the XINU problem or the NUXI problem, depending on the specific architectures of the two machines.
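
A few lines of C show where the names came from. This sketch simply reinterprets the bytes of a word (or a pair of words) in memory on whichever machine runs it; it's an illustration of the scrambling, not a reconstruction of any particular historical transfer.

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    /* Read most-significant byte first, these values spell "UNIX". */
    uint32_t one_word     = 0x554E4958;            /* 'U' 'N' 'I' 'X' */
    uint16_t two_words[2] = { 0x554E, 0x4958 };    /* 'U' 'N',  'I' 'X' */

    char buf[5] = { 0 };

    /* Dump the bytes in increasing address order, exactly as they sit
       in memory on whatever machine runs this. */
    memcpy(buf, &one_word, 4);
    printf("one 32-bit word : %s\n", buf);

    memcpy(buf, two_words, 4);
    printf("two 16-bit words: %s\n", buf);

    return 0;
}

On a big-endian machine both lines print UNIX; on a little-endian machine the first prints XINU and the second prints NUXI.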

Programmers tried a variety of approaches to deal with the problem that data formats that depended on byte order didn't transfer well between architectures. A common approach was to build something into the data format to indicate the byte-order. The first two bytes of a TIFF image file, for example, are either "II" if integers in the file are in little-endian byte order or "MM" if integers in the file are in big-endian byte order.
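
Here's a minimal C sketch of what a reader of such a format has to do before interpreting any multi-byte field. The "II"/"MM" marker bytes and the 16-bit magic number 42 that follows them are part of the real TIFF header; the little helper that reads a 16-bit field according to the detected order is my own.

#include <stdio.h>

/* Read one 16-bit TIFF field according to the byte-order marker found
   at the start of the file.  Returns -1 on a short read. */
static long read_u16(FILE *fp, int big_endian)
{
    int b0 = getc(fp), b1 = getc(fp);
    if (b0 == EOF || b1 == EOF)
        return -1;
    return big_endian ? (b0 << 8) | b1 : (b1 << 8) | b0;
}

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;
    FILE *fp = fopen(argv[1], "rb");
    if (fp == NULL)
        return 1;

    int c0 = getc(fp), c1 = getc(fp);
    int big_endian;
    if (c0 == 'I' && c1 == 'I')
        big_endian = 0;                     /* little-endian integers follow */
    else if (c0 == 'M' && c1 == 'M')
        big_endian = 1;                     /* big-endian integers follow */
    else {
        fprintf(stderr, "not a TIFF file\n");
        fclose(fp);
        return 1;
    }

    /* The next 16-bit field in a TIFF header is the magic number 42. */
    printf("byte order: %s, magic: %ld\n",
           big_endian ? "big-endian (MM)" : "little-endian (II)",
           read_u16(fp, big_endian));
    fclose(fp);
    return 0;
}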

Another way to solve the byte-order problem was to avoid it entirely by dealing with data only in one-byte chunks. People quickly noticed that programs written to fetch their data from files containing printable ASCII worked fine without any extra programming effort. Representing numeric data as plain ASCII had the disadvantage of taking up considerably more space than a binary representation, which mattered far more under the memory constraints of the time than it does now. It also had the disadvantage of requiring additional CPU time to convert the data between ASCII and binary; again, given the comparatively slow CPU speeds of the time, this was a bigger deal then than it is now.
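
The tradeoff is easy to see in a small C sketch (the value is arbitrary): the binary form is compact but its bytes mean different things on different machines, while the ASCII form costs a few more bytes and a conversion in each direction but reads back identically everywhere.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    uint32_t value = 1234567;

    /* Binary: always 4 bytes, but the order of those bytes depends on
       the CPU that wrote them. */
    unsigned char binary[4];
    memcpy(binary, &value, sizeof binary);

    /* ASCII: more bytes plus a conversion each way, but every machine
       agrees on what the characters "1234567" mean. */
    char text[16];
    int len = snprintf(text, sizeof text, "%lu", (unsigned long)value);

    printf("binary form: %d bytes, byte-order dependent\n", (int)sizeof binary);
    printf("ASCII form:  %d bytes (\"%s\"), portable\n", len, text);
    printf("parsed back: %lu\n", strtoul(text, NULL, 10));
    return 0;
}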

But memory and disk capacities were increasing, so the advantage of ASCII quickly came to outweigh its disadvantages. In all but the most storage-intensive applications, file formats that were byte-order dependent tended to wither and die. The experience of cpio vs. tar illustrates this.

In the early '80s, right around the time a lot of commercial interest in using Unix began, cpio and tar were two common Unix archiving utilities. Because lots of different companies were building their own hardware, each inventing its own instruction set and addressing scheme, getting Unix to run on their machines required each of them to make the Unix source code compile, link, and then run properly on their hardware--a process known as porting.

Porting often meant copying many files at a time, some ASCII and some binary, back and forth between machines with different architectures. That meant aggregating many files into a single file for transfer among different machines, and both cpio and tar were available to do the job. However, where cpio's file format contained binary integers, tar's file format avoided binary and instead stored integers as ASCII strings. A cpio backup only worked when the destination machine had the same byte order and word size as the source machine. Tar format, unlike cpio format, could be used to move files between machines with different byte orders and word sizes. Although you still had to deal with byte-order dependencies inside any binary files within a tar archive once they reached the destination machine, this was vastly better than the situation with cpio, where you couldn't even transfer the files. Unsurprisingly, people gravitated toward tar. As a result, cpio fell into disuse. tar, on the other hand, is still in common use today, a quarter of a century later.
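
tar's choice is still visible in its header format: the numeric fields are stored as ASCII octal digit strings, so reading them takes nothing more than strtoul, and no byte swapping is ever needed. The struct below sketches only the leading fields of the classic header, and the sample values are invented; see the ustar specification for the full layout.

#include <stdio.h>
#include <stdlib.h>

/* Leading fields of a classic 512-byte tar header block.  Every
   numeric field (mode, uid, gid, size, mtime) is an ASCII octal
   string, which is why an archive written on a little-endian VAX
   could be unpacked on a big-endian 68000 without any byte swapping. */
struct tar_header {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];
    char mtime[12];
    /* ...checksum, type flag, link name, and so on follow... */
};

/* Convert one ASCII-octal header field to a number. */
static unsigned long tar_field(const char *field)
{
    return strtoul(field, NULL, 8);
}

int main(void)
{
    /* A made-up header fragment for illustration. */
    struct tar_header h = { .name = "notes.txt", .size = "00000001750" };

    printf("%s is %lu bytes\n", h.name, tar_field(h.size));  /* 1000 bytes */
    return 0;
}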

Storing Data in Binary Fosters Closed Proprietary Systems; ASCII Fosters Openness

Representing data in 1-byte chunks rather than multibyte chunks turned out to have an interesting side-effect. When storing numeric data in multibyte chunks (binary), you have a stream of groups of bytes that may be of varying size. You might have 1-, 2- or 4-byte integers, 4- or 8-byte floating point numbers, and character strings, all mixed in with each other. Unless you have documentation on the file format, there's no way to tell where one piece of data ends and the next begins.

If you're storing your data in 1-byte chunks, the most natural representation is displayable ASCII text. That means that the data typically contains line terminators (LF on Unix, CR/LF on DOS, CR on the Mac) and field delimiters (TAB, ",", etc.). It also means that a sequence of characters containing only digits is probably a number, and a sequence containing non-digits is clearly not a number.
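
That guessability is easy to demonstrate. The record below is invented, but the point is that a TAB-delimited ASCII line can be pulled apart, and its numeric fields recognized, with nothing beyond the standard C library and no documentation at all.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

/* Guess whether a field is numeric: every character is a digit. */
static int looks_numeric(const char *s)
{
    if (*s == '\0')
        return 0;
    for (; *s; s++)
        if (!isdigit((unsigned char)*s))
            return 0;
    return 1;
}

int main(void)
{
    /* An invented TAB-delimited record of the kind described above. */
    char line[] = "alice\t42\t/home/alice\n";

    line[strcspn(line, "\n")] = '\0';          /* strip the line terminator */

    for (char *field = strtok(line, "\t"); field; field = strtok(NULL, "\t")) {
        if (looks_numeric(field))
            printf("number: %ld\n", strtol(field, NULL, 10));
        else
            printf("text  : %s\n", field);
    }
    return 0;
}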

In an environment in which it's easy to make your data and configuration files incomprehensible (unless you make a point of providing detailed documentation on your file format), programmers and the companies they work for come to expect that everything can be kept secret, even from other programmers, unless they can gain some business advantage by documenting the format of a particular file.

On the other hand, in an environment in which data files, and especially configuration files, are commonly ASCII rather than binary, programmers come to expect that they'll be able to figure out a lot about what's in a file, even in the absence of detailed documentation on the file format.

Although Richard Stallman probably would have founded the GNU project anyway, GNU's progress would have been greatly impeded in an environment where all data was stored in binary. Many of the F/OSS projects that are GNU's progeny may never have happened in an environment in which undocumented binary file formats predominated. As just one example of how companies use undocumented binary formats to protect their turf and impede the progress of others, note that although Linux can read from an NTFS filesystem (the filesystem format that Microsoft has moved to), the ability to reliably create or modify files in an NTFS filesystem has been a longstanding problem for Linux. (See Does Linux Support NTFS? and HOWTO: NTFS with read/write support using the ntfs-3g.) Microsoft has managed to lock Linux out simply by using sufficiently complex structures inside NTFS and keeping the documentation secret.

Conclusion

The IBM 360 and DEC's PDP-11 were the predominant big and little computers of the 1970s. If they had both used the same byte-ordering scheme, it would have been possible to transfer multi-byte chunks of data between the two architectures without encountering byte-ordering problems. If the PDP-11's designers had decided to copy the byte ordering used by IBM, the Unix world would not have been pushed toward encoding data in ASCII, and the openness that arose in the Unix/GNU/Linux world as an unintended consequence might never have come about.

