The C++ source files for the stand-alone base64 encoder and decoder discussed in this post, plus a separate implementation of quoted-printable (RFC 2045, section 6.7), and the hex string converter I presented last year, can be found here:
There is a quote that goes “Standards are great! Everyone should have one.” or something along those lines. (Somewhat ironically, this quote, too, exists in many variations and with many attributions. The earliest I’ve found attributes it to George Morrow in InfoWorld, 21 Oct 1985.)
A case in point is the base64 encoding. Put simply, it’s a method of encoding an array of 8-bit bytes using an alphabet consisting of 64 different printable characters from the ASCII character set. This is done by taking three 8-bit bytes of source data, arranging them into a 24-bit word, and splitting that word into four 6-bit values, each of which maps onto one character of the 64-character alphabet (6 bits can represent the values 0-63).
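To make the three-bytes-to-four-characters step concrete, here is a minimal sketch of encoding a single complete triplet. The helper name encode_triplet and the constant kBase64Alphabet are made up for illustration and are not taken from the downloadable source; the alphabet is the standard one from RFC 4648, and padding of a short final group is deliberately left out.

```cpp
#include <array>
#include <cstdint>

// Standard base64 alphabet (RFC 4648, section 4): index 0 is 'A', index 63 is '/'.
constexpr char kBase64Alphabet[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/";

// Hypothetical helper: encode exactly three source bytes as four characters.
// Padding for a final group of one or two bytes is not handled here.
inline std::array<char, 4> encode_triplet(std::uint8_t b0, std::uint8_t b1, std::uint8_t b2)
{
    // Arrange the three bytes into one 24-bit word, first byte most significant.
    const std::uint32_t word =
        (std::uint32_t{b0} << 16) | (std::uint32_t{b1} << 8) | std::uint32_t{b2};

    // Slice the word into four 6-bit indices (0-63), most significant first,
    // and look each one up in the alphabet.
    return {
        kBase64Alphabet[(word >> 18) & 0x3f],
        kBase64Alphabet[(word >> 12) & 0x3f],
        kBase64Alphabet[(word >> 6) & 0x3f],
        kBase64Alphabet[word & 0x3f],
    };
}
```

With this sketch, the three bytes of the ASCII text "Man" come out as the four characters "TWFu".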
The original implementation was for privacy-enhanced e-mail (RFC 1421), then altered slightly for MIME (RFC 2045), and again in its own standard (RFC 4648).
When I was looking at base64, I was interested in three different varieties or flavours, namely the MIME version, the (per RFC 4648) standard base64, and base64url. These differ in how they handle line breaks and other illegal characters, what characters are used in the 64-character alphabet, and the use of padding at the end to make up an even triplet of bytes.
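Two of those differences are easy to show in code. In base64url (RFC 4648, section 5) the characters for the values 62 and 63 are '-' and '_' instead of '+' and '/', and the trailing '=' padding may be dropped when the length is known by other means. The following is only a sketch under those assumptions: to_base64url is a hypothetical name, it converts already-encoded standard base64 text, and it does nothing about line breaks.

```cpp
#include <string>

// Hypothetical helper: turn standard base64 text into base64url text.
// Assumes the input holds only standard-alphabet characters and '=' padding.
inline std::string to_base64url(std::string text, bool strip_padding = true)
{
    // Swap the two alphabet characters that differ between the variants.
    for (char& ch : text) {
        if (ch == '+') {
            ch = '-';
        } else if (ch == '/') {
            ch = '_';
        }
    }

    // Optionally drop the trailing '=' padding.
    if (strip_padding) {
        while (!text.empty() && text.back() == '=') {
            text.pop_back();
        }
    }
    return text;
}
```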
During the writing of my last post, I did the due diligence thing and considered alternative implementations and algorithms to solve the problem at hand (converting a string representation of an 8-bit hexadecimal value to an unsigned 8-bit integer value). Because I was, in effect, documenting code written some years ago, I can’t recall exactly what other options, if any, I tried at the time.
I think I first tried using a std::stringstream, but gave up on that as being too slow, and went with strtoul instead. I might also have played around with using a std::map lookup table, with all the headaches that brought in terms of storage and initialisation, and decided against it.
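For reference, a strtoul-based conversion could look roughly like the sketch below. This is a reconstruction for illustration, not the code from the earlier post; hex_byte_to_uint8 is a hypothetical name. Because std::strtoul on its own also accepts leading whitespace and a sign, the sketch checks both characters with std::isxdigit first.

```cpp
#include <cctype>
#include <cstdint>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert exactly two hex digits, e.g. "7f", to a byte.
inline std::uint8_t hex_byte_to_uint8(const std::string& text)
{
    // Reject anything that is not exactly two hexadecimal digits up front,
    // since strtoul would otherwise tolerate input such as " f" or "+f".
    if (text.size() != 2 ||
        !std::isxdigit(static_cast<unsigned char>(text[0])) ||
        !std::isxdigit(static_cast<unsigned char>(text[1]))) {
        throw std::invalid_argument("expected exactly two hexadecimal digits");
    }

    // Base 16; two validated digits always fit in the range 0-255.
    const unsigned long value = std::strtoul(text.c_str(), nullptr, 16);
    return static_cast<std::uint8_t>(value);
}
```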
What I didn’t try was a straight, non-clever switch-based lookup table to find the integer value of a hexadecimal character digit:
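#include <stdexcept>
inline unsigned char hex_digit_to_nybble(char ch)
{
    switch (ch) {
    case '0': return 0x0;
    case '1': return 0x1;
    case '2': return 0x2;
    // ... the cases between '2' and 'f' are not shown in this excerpt ...
    case 'f': return 0xf;
    case 'F': return 0xf;
    // std::invalid_argument has no default constructor, so pass a message.
    default: throw std::invalid_argument("not a hexadecimal digit");
    }
}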
Here’s a problem that tends to crop up in a lot of communication domains: how do you transfer binary data in a protocol which limits what characters are permitted? The answer is to encode it into permissible characters (for historical reasons often 7-bit printable ASCII), and because there are few things this wonderful industry likes more than re-inventing the wheel, there’s a plethora of binary-to-text encoding schemes around. Each has its own trade-offs in terms of speed and space efficiency, and almost every one has a more or less glorious history of being the favoured scheme on some platform, or in some protocol or application.
The simplest encoding is (in my opinion) the “hexadecimal text” encoding. It’s so simple, it doesn’t even have a fancy or clever name. You simply take each byte and type its value as a hexadecimal number. Working on the assumption that a byte is 8 bits, its value can be expressed in two hexadecimal digits, from 00 to ff. Assuming that a character occupies one byte, we see that the size of the data will double by writing it as hexadecimal text, so it’s not very efficient space-wise. But it is simple to understand and implement, and quite useful, so I wrote a pair of encoding/decoding functions.
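To illustrate the idea (this is a sketch, not the actual pair of functions from that post), an encoder and decoder for hexadecimal text might look something like this. The names to_hex_text, from_hex_text and nybble_value are made up for the example, as is the choice of lower-case output; the decoder assumes an ASCII-compatible character set where 'a'-'f' and 'A'-'F' are contiguous.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical encoder: every byte becomes two lower-case hex digits.
inline std::string to_hex_text(const std::vector<std::uint8_t>& bytes)
{
    static const char digits[] = "0123456789abcdef";
    std::string text;
    text.reserve(bytes.size() * 2);  // output is exactly twice the input size
    for (std::uint8_t byte : bytes) {
        text.push_back(digits[byte >> 4]);    // high nybble first
        text.push_back(digits[byte & 0x0f]);  // then the low nybble
    }
    return text;
}

// Hypothetical helper: value of one hex digit (assumes ASCII-style ordering).
inline std::uint8_t nybble_value(char ch)
{
    if (ch >= '0' && ch <= '9') return static_cast<std::uint8_t>(ch - '0');
    if (ch >= 'a' && ch <= 'f') return static_cast<std::uint8_t>(ch - 'a' + 10);
    if (ch >= 'A' && ch <= 'F') return static_cast<std::uint8_t>(ch - 'A' + 10);
    throw std::invalid_argument("not a hexadecimal digit");
}

// Hypothetical decoder: every pair of hex digits becomes one byte.
inline std::vector<std::uint8_t> from_hex_text(const std::string& text)
{
    if (text.size() % 2 != 0) {
        throw std::invalid_argument("odd number of hexadecimal digits");
    }
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() / 2);
    for (std::size_t i = 0; i < text.size(); i += 2) {
        const unsigned high = nybble_value(text[i]);
        const unsigned low = nybble_value(text[i + 1]);
        bytes.push_back(static_cast<std::uint8_t>((high << 4) | low));
    }
    return bytes;
}
```

Encoding the three bytes 0x00, 0xff and 0x4d with this sketch gives the six characters "00ff4d", and decoding that text gives the same three bytes back.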