All your base64 are different to us

Code, CodeProject

The C++ source files for the stand-alone base64 encoder and decoder discussed in this post, plus a separate implementation of quoted-printable (RFC 2045, section 6.7), and the hex string converter I presented last year, can be found here:

There is a quote that goes “Standards are great! Everyone should have one.” or something along those lines. (Somewhat ironically, this quote, too, has many different variations, and has many attributions. The earliest I’ve found attributes it to George Morrow in InfoWorld 21 Oct 1985).

A case in point is the base64 encoding. Put simply, it’s a method of encoding an array of 8-bit bytes using an alphabet consisting of 64 different printable characters from the ASCII character set. This is done by taking three 8-bit bytes of source data, arranging them into a 24-bit word, and converting that into four 6-bit characters that maps onto the 64-character alphabet (since 6 bits is 0-63).

The original implementation was for privacy-enhanced e-mail (RFC 1421), then altered slightly for MIME (RFC 2045), and again in its own standard (RFC 4648).

When I was looking at base64, I was interested in three different varieties or flavours, namely the MIME version, the (per RFC 4648) standard base64, and base64url. These differ in how they handle line breaks and other illegal characters, what characters are used in the 64-character alphabet, and the use of padding at the end to make up an even triplet of bytes.

That’s three mostly similar but slightly different algorithms, so how to design an implementation? A number of designs are possible, of course, like:

  1. three copy-pasted and tweaked sets of functions
  2. hugely parameterised functions where every variation in algorithm or data can be altered in the call
  3. an inheritance tree, where virtual overridden functions alter behaviour

I chose to use a design that in my opinion takes the best from all of those, with none of the disadvantages.

The Wikipedia article has a handy table giving a summary of the differences between the variants, which gives eight possible differences of data or algorithm. I’ve elected to ignore the last of those, which is the addition of a checksum only used for OpenPGP (RFC 4880), since that is easily added outside the base64 coding. The sixth difference on Wikipedia’s list – line separator – can also be ignored since it can be inferred from the maximum line length. The remaining areas of difference are:

  • Character for index 62
  • Character for index 63
  • Character to use for padding
  • Whether fixed line length is used
  • Maximum line length
  • Handling of illegal characters

Fortunately, these are all integer types (bool and char are both integer types), so can be used as template parameters. In other words, I can define a type to hold all possible variants:

  char Tchar62         // Character for index 62 in alphabet
  char Tchar63         // Character for index 63 in alphabet
  char TcharPad        // Character to use for padding, or 0 if 
                       // padding is not used
  bool TfixLineLength  // false if line length is fixed. If true, 
                       // maxLineLength is used
  int  TmaxLineLength  // Maximum (or fixed) line length. 0 if not 
                       // used. If used, CR+LF is used as linebreak
  bool TignoreIllegal> // if false, discards characters not in 
                       // alphabet; if true throws error on finding 
                       // illegal character
struct base64_variants
  // This type only has type information accessors
  // Get character for index 62 in alphabet
  static char char62()  
    { return Tchar62; } 
  // Get character for index 63 in alphabet
  static char char63()  
    { return Tchar63; }          
  // Get padding character, if any
  static char charPad() 
    { return TcharPad;  }
  // Check if padding is used
  static bool pad()     
    { return TcharPad > 0;  }
  // Check if line length is fixed
  static bool fixLineLength() 
    { return TfixLineLength;  }
  // Get maximum line length
  static int  maxLineLength() 
    { return TmaxLineLength;  }
  // Check if lines should be broken
  static bool lineBreaks()    
    { return TmaxLineLength>0;  }
  // Check if illegal characters may be ignored
  static bool ignoreIllegal() 
    { return TignoreIllegal;  }

What is the benefit of this? Well, it lets us express distinct sets of parameters to correspond to particular standards, like this:

// MIME variety of base64 (RFC 2045) 
typedef base64_variants<'+', '/', '=', false, 76, true> base64MIME;

// Plain standard (RFC 4648) base64
typedef base64_variants<'+', '/', '=', false, 0, false> base64;

// URL variety of base64 (RFC 4648)
typedef base64_variants<'-', '_', 0, false, 0, false> base64url;

This means that there is no risk of mixing up parameters in function calls – once a set is defined (and tested and verified to be correct) that can be used everywhere. The actual coding functions then get a very simple interface, with an input, an output, and variant type as only parameters:

// Encode C-style array of bytes
template<typename T>
void bytes_to_base64(const unsigned char* data, 
        size_t length, 
        std::string& str);

// Encode byte vector
template<typename T>
void bytes_to_base64(const std::vector<unsigned char>& data, 
        std::string& str);

// Decode to byte vector        
template<typename T>
void base64_to_bytes(const std::string& str, 
        std::vector<unsigned char>& data);

And this is how the use of these would look in code:

// Using test vectors from RFC 4648
const char* src = "fooba";
std::vector<unsigned char> data(src, src+5);
std::string result;

bytes_to_base64<base64>(data, result);

assert(result == "Zm9vYmE=");

base64_to_bytes<base64>(result, data);

result = std::string(data.begin(), data.end());
assert(result == "fooba");

Below is a short extract of the implementation, illustrating how the variant specification is used:

// Encode
template<typename T>
void bytes_to_base64(const unsigned char* data, size_t length, 
  std::string& str)
  // Calculate maximum expected size (bytes, padding, line breaks) 
  str.reserve((length * 4 ) / 3 + 3 + 
    (T::lineBreaks() ? 2 * length / T::maxLineLength() : 0) );

(I won’t post the whole implementation here. While it’s only 200 airy and thoroughly commented lines, it’s better to give you links to all the files, which I do at the beginning of this post.)

Now, the astute reader will note that I only declared the coding functions earlier, and talked about the implementation as if it was separate. That’s the normal way of doing things, but that won’t work with templates, right?

Here’s the thing with template functions and classes: they are not just the one thing. A normal, non-template function is one single thing, fully defined once and once only. Therefore, it can be compiled in one compilation unit, and then linked to. It can be declared elsewhere, and that declaration is essentially a promise that somewhere this thing is defined, which is all the compiler cares about.

A template function is effectively a new function for each template parameter (or combination of parameters) it is used with. (As far as the compiler is concerned, a template class is just a way of saying that all member functions have the same set of template parameters.) To the compiler, there’s no such thing as a “template function”. There is only “template function with these template parameters”.

This means that there can’t be a single compiled variant that can be linked to, only specialisations that use this or that set of template parameters. Even if only one variant, only one specialisation, is used in your project, the compiler can’t know that.

So instead, the compiler compiles these inline; it’s effectively replacing each call to a template function with a copy of that function, in which the template parameters are those used in that particular call.

In other words, C++ templates are just a way of bullying the compiler into doing the copy-paste programming you are ashamed of doing yourself.

As it happens, though, that also provides the solution to separationg definitions and declarations, provided you know what template parameters you’ll be using. All you need to do is declare the function with the parameters you want, in the same compilation unit as the template function definition.

Here’s a simple example:

-- In my_template.h file
template <typename T>
T my_temp_func(const T& t);

-- In my_template.cpp file

#include "my_template.h"

// Definition
template <typename T>
T my_temp_func(const T& t)
  return t;

// Instantiation declaration
int my_temp_func<int>(const int& t);

-- In using_my_template.cpp
#include "my_template.h"
int i = my_temp_func<int>(4); // works
string s = my_temp_func<string>("Abob"); // link error

In the example above, the declaration in the last line of my_template.cpp tells the compiler there’ll be a variant of the template function that uses int as template parameter. Okay, says the compiler, I’ll put an inline copy there. Since the generic definition is right there in the same compilation unit (ie my_template.cpp), this is something the compiler can do – it has all the information it needs.

The result of that is that in the compiled file (probably called my_template.obj) there is now a function that has the signature int my_temp_func(const int& t). This is a fully defined specialisation of a template function, so to the linker it looks just like a normal function.

However, the linker won’t be able to find a string specialisation, so this will generate a linker error.

This illustrates both how to use this trick, and its limitation. It only works if you list all specialisations you are going to use, which makes it unfeasible for generic libraries.

In this case, though, it’s ideal. I have my three variants of base64 defined – base64MIME, base64 and base64url – and those are the only one I’ll need.

Actually, I might as well add a definition for the original variant:

// PEM variety of base64 (RFC 1421)
typedef base64_variants<'+', '/', '=', true, 64, false> base64PEM;

So I have my four variations defined, and they are the only variations of base64 I’m interested in, so I’ll just have to declare them in the same .cpp file as the function definitions are in:

template void bytes_to_base64<base64>(const unsigned char* data, 
                    size_t length, std::string& str);
template void bytes_to_base64<base64>(const std::vector<unsigned char>& data, 
                    std::string& str);
template void base64_to_bytes<base64>(const std::string& str, 
                    std::vector<unsigned char>& data);

template void bytes_to_base64<base64MIME>(const unsigned char* data,
                    size_t length, std::string& str);

This lets the linker find compiled varieties for all those base64 definitions.

Should you want to implement a slightly different base64 coder, you could use the code I’ve written. It wouldn’t be enough to declare a new definition type, but you would also have to add a declaration using that type to the .cpp file. But the source code is both open and free, so help yourself.

(I should note that like with so much else, base64 is something there are lots and lots of implementations of available on the net, but most of the ones I’ve found tended to be very lax and lack strict checking of syntactical correctness, or implement just one flavour. Hence, writing my own.)

As always, if you found this interesting or useful, or have suggestions for improvements, please let me know.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s