Hex strings to raw data and back

Code, CodeProject

Here’s a problem that tends to crop up in a lot of communication domains: how do you transfer binary data in a protocol which limits what characters are permitted? The answer is to encode it into permissible characters (for historical reasons often 7-bit printable ASCII), and because there are few things this wonderful industry likes more than re-inventing the wheel, there’s a plethora of binary-to-text encoding schemes around. Each has its own trade-offs in terms of speed and space efficiency, and almost every one has a more or less glorious history of being the favoured scheme on some platform, or in some protocol or application.

The simplest encoding is (in my opinion) the “hexadecimal text” encoding. It’s so simple, it doesn’t even have a fancy or clever name. You simply take each byte and type its value as a hexadecimal number. Working on the assumption that a byte is 8 bits, its value can be expressed in two characters – 0x00-0xff. Assuming that a character occupies one byte, we see that the size of the data will double by writing it as hexadecimal text, so it’s not very efficient space-wise. But it is simple to understand and implement, and quite useful, so I wrote a pair of encoding/decoding functions.

Let’s start with the encoding function, as that’s simplest. I’ll use std::string to store the resulting string here, with the above assumptions. (Those assumptions – 8-bit memory bytes, 8-bit characters – are quite reasonable, in that if you’re working on a platform where they’re not true, you probably know about it.)

#include <climits> // For CHAR_BIT
#include <cstddef> // For size_t
#include <string>
#include <vector>

// Forward declaration, so the wrapper below can call it
void bytes_to_hex(const unsigned char* data, size_t length,
  std::string& str);

// Encode data buffer to string of hexadecimal values
void bytes_to_hex(const std::vector<unsigned char>& data, 
  std::string& str)
{
  // Just wrapping the more "raw" function; note that &data[0]
  // is not valid for an empty vector, so guard against that
  if (data.empty())
    str.clear();
  else
    bytes_to_hex(&data[0], data.size(), str);
}

void bytes_to_hex(const unsigned char* data, size_t length,
  std::string& str)
{
  // Sanity check
  static_assert(8 == CHAR_BIT, "8-bit bytes assumed");
  
  // Clear output
  str.clear();
  
  // No data? Then we're done
  if (0 == length)
    return;

  // Output is twice the input length
  str.resize(length * 2, ' ');
  
  // Working with 4-bit nybbles, we can use the value as
  // index to character
  static const std::string hex_char = "0123456789abcdef";

  for (size_t i = 0; i < length; ++i)
  {
    // High nybble
    str[i<<1] = hex_char[(data[i] >> 4) & 0x0f];
    // Low nybble
    str[(i<<1) + 1] = hex_char[data[i] & 0x0f];
  }
}

As you see, it’s very simple. Given a buffer of bytes {7, 233, 57, 42, 198} the string “07e9392ac6” is generated into the output parameter.

While there is a standard way of turning a number into a string – with std::stringstream – I elected to write the code to do it myself here, since it’s a very simple and safe algorithm.

First, I set up an array of the sixteen hexadecimal digits, and then simply use the high and low nybble as an index into this array to get the character corresponding to the value of the nybble. The high nybble has to be shifted down to get the correct range, and that’s all there is to it.

Since I had already resized the output string, I can write directly into the correct position, instead of appending (which would likely be significantly slower).

It’s tempting to write the decoding function as a straight reverse, but this stumbles on the character-to-value lookup. How do you get from a character to its corresponding nybble? The naive solution looks as follows:

std::string hex = "f3"; // For instance
...
char hi_nybble = hex[0];
char lo_nybble = hex[1];
unsigned char result = 0;

// First for high nybble, then low
// Numeric or alphabetic?
if (hi_nybble > '9')
  result |= (hi_nybble - 'a' + 0xa) << 4;
else
  result |= (hi_nybble - '0') << 4;
if (lo_nybble > '9')
  result |= (lo_nybble - 'a' + 0xa);
else
  result |= (lo_nybble - '0');
...

Ignoring for the moment that there’s no sanity checking of the input data, assuming only the characters [0-9,a-f] will be present, this code still fails the “good engineering” test by making assumptions about character ordering and values. It’s not safe to use, and may stop working when used with a different character set – in EBCDIC, for instance, the digits sort above the letters, so the comparison with '9' would pick the wrong branch.

An alternative would be to make a proper lookup table, mapping characters to nybble values, with separate entries for upper and lower case characters (a-f), either by populating a std::map or a big switch:

inline unsigned char hex_digit_to_nybble(char ch)
{
  switch (ch)
  {
    case '0': return 0x0;
    case '1': return 0x1;
    case '2': return 0x2;
...
    case 'f': return 0xf;
    case 'F': return 0xf;
    default: throw std::invalid_argument("Invalid hex digit");
  }
}

Then, after I have the nybbles, I could shift the high one up, and do a bitwise OR to join them. But frankly, while this works, it feels clunky. And besides, there are lots of standard ways to convert a string to a number; from the standard C library functions atoi and strtol, to the standard C++ std::stringstream (and even boost::lexical_cast which isn’t standard, but fairly popular). However, only two of those can handle numbers in bases other than decimal – strtol and std::stringstream – and of those, the latter is much more powerful, and therefore likely to be slower.

The strtol function expects a character string, so I’ll have to copy each pair of characters into a zero-terminated buffer, and use that as input to get a byte. That’s simple enough, but what do I do if there isn’t a pair of characters, but a single one?

In other words, if I have the hex string “3da” to convert, the function should treat it like “03da”, and produce {03, da} rather than {3d, a0}. Rather than making a copy of the string with an extra “0” prepended, I’ll treat this as a special case.

Any other potential problems with using strtol? Well, yes, the matter of what characters count as valid input. I’ll use the unsigned version, strtoul, since I’m expecting an unsigned output, but even this is far too lenient in what it accepts: [whitespace][{+|-}][0[{x|X}]][digits].

Since I’m not converting numbers, but encoding bytes, I can’t accept any whitespace, signs, or anything that isn’t a hexadecimal digit. Looking into this further, it’s less of a problem than it would first appear, as strtoul will let me know if there’s an un-parsed character at the end. In other words, “-3” is fully parsed, but for “3-” it will only parse the first digit and then stop, so that’s a simple thing to check for. Furthermore, by telling it explicitly what base to use, it will disallow an initial “0x”.

Still, we need to make sure the initial character is valid hex, which means breaking out isxdigit to make sure we only accept hexadecimal digits. Now, this is a function that is both available from the standard C library, via the <cctype> header, and from the standard C++ library, via the <locale> header. The difference is that the std::isxdigit takes a std::locale, which I assume is just for completeness’ sake, as the C isxdigit is not affected by any changes to locale. At best, it’s just a pass-through call only adding a level of indirection, and at worst, it adds more unnecessary computing, so I’ll just stick to the C version.

#include <cctype>    // For isxdigit
#include <climits>   // For CHAR_BIT
#include <cstdlib>   // For strtoul
#include <stdexcept> // For std::invalid_argument
#include <string>
#include <vector>

void hex_to_bytes(const std::string& str, 
  std::vector<unsigned char>& data)
{
  // Sanity check
  static_assert(8 == CHAR_BIT, "8-bit bytes assumed");
  
  // Clear output
  data.clear();
  
  // No data? Then we're done
  if (str.empty())
    return;

  // Must be prepared that string can have odd number of 
  // nybbles, in which case the first is treated like the low 
  // nybble of the first byte
  size_t lengthOverflow = str.length() % 2;

  // This also affects the length of the data buffer we
  // allocate (a lone nybble still needs a full byte)
  const size_t length = lengthOverflow + str.length() / 2;
  data.resize(length);

  // Buffer for byte conversion (not static, to keep the
  // function re-entrant)
  char buf[3];
  buf[2] = 0;
  // End of input
  char* pend = &buf[2];

  // Iterators for input and output
  size_t i = 0;
  size_t c = 0;

  // If the first nybble is a low, we'll do it separately
  if (1 == lengthOverflow)
  {
    buf[0] = '0';
    buf[1] = str[c++];
    unsigned char x = static_cast<unsigned char>
      (strtoul(buf, &pend, 16));
    
    // Parsing should stop at terminating zero
    if (pend != &buf[2])
    {
      std::string e = "Invalid character in hex string: \'";
      e += *(pend);
      e += "'";
      throw std::invalid_argument(e);
    }
    data[i++] = x;
  }
  
  // For each output byte, we use two input characters for 
  // high and low nybble, respectively
  for (; i < length; ++i)
  {
    buf[0] = str[c++];
    // strtoul accepts initial whitespace or sign, we can't
    if (!isxdigit(static_cast<unsigned char>(buf[0])))
    {
      std::string e = "Invalid character in hex string: \'";
      e += buf[0];
      e += "'";
      throw std::invalid_argument(e);
    }

    buf[1] = str[c++];
    unsigned char x = static_cast<unsigned char>
      (strtoul(buf, &pend, 16));

    // Parsing should stop at terminating zero
    if (pend != &buf[2])
    {
      std::string e = "Invalid character in hex string: \'";
      e += *(pend);
      e += "'";
      throw std::invalid_argument(e);
    }

    data[i] = x;
  }
}

(As it happens, when writing this post I came across a very interesting blog entry by Tino Didriksen testing different methods of converting strings to integers in C++, using decimal strings, so all the methods I mention above are timed. Looking at those results there’s little to recommend std::stringstream in terms of speed, which was a nice validation of code I wrote the first version of years ago.

I also used the benchmark code from Tino as a basis to compare the usage of strtol to the hex_digit_to_nybble outlined above, and found that the latter was almost twice as fast. I’m currently pondering what drawbacks there might be in using what was quickly conceived as a rhetorical strawman while writing this article.)


4 thoughts on “Hex strings to raw data and back”

  1. For the decode version, there’s always the somewhat insane (but ought-to-be-fast) method of declaring a static array of 1<<CHAR_BIT elements (ideally int), initializing all elements to -1, then do something like:

    decode_array['0'] = 0;

    decode_array['f'] = 15;

    decode_array['F'] = 15;

    Don't know if it will necessarily make things faster, but MAY be worth a try (at 256 bytes of static storage, that's small enough that I'd half consider doing it even under some memory pressure). Still needs a check to see if the array value is negative, but at least some error checking is possible.

    1. Yes, that ought to be as fast as or faster than the switch, but suffers from a need to set up the table in the first place. While it can be done as a static variable, so it only needs populating once, it would still be a hit the first call. Given how fast the lookup would be, the delay induced by setting the table up would, I guess, be at least a couple of orders of magnitude.

      1. It could PROBABLY be set up compile-time, if you really wanted. But that’d probably require either writing it for a specific character set or using a code generator.
