Dynamically sized struct

Code

It is quite common in C APIs to have structs that contain dynamically sized buffers. An initial call is made in which a size parameter is populated, after which a larger chunk of memory is allocated, the pertinent struct parameters are copied across, and a second call is made in which the data is retrieved.

This article presents a generic way to eliminate many of the risks inherent in writing code calling such APIs.

Consider the USB_NODE_CONNECTION_NAME struct, which is used in Windows to retrieve the link name of a connected USB hub.

typedef struct _USB_NODE_CONNECTION_NAME {
  ULONG ConnectionIndex;
  ULONG ActualLength;
  WCHAR NodeName[1];
} USB_NODE_CONNECTION_NAME, *PUSB_NODE_CONNECTION_NAME;

Using only minimal error checking for brevity (don’t try this at home), typical usage would look like this:

bool GetUsbConnectionName(HANDLE hDevice, 
                          ULONG index, 
                          std::wstring& name)
{
    ULONG nBytes;
    
    // One struct of default size to query required size
    USB_NODE_CONNECTION_NAME connectionName;
    
    // And one dynamically created struct
    PUSB_NODE_CONNECTION_NAME connectionNameP;

    // 1. Initialise struct
    connectionName.ConnectionIndex = index;

    // 2. Query actual length
    BOOL success = DeviceIoControl(hDevice,
        IOCTL_USB_GET_NODE_CONNECTION_NAME,
        &connectionName,    // input data (e.g. index)
        sizeof(connectionName),
        &connectionName,    // output data (e.g. length)
        sizeof(connectionName),
        &nBytes,
        NULL);
        
    if (!success)
        return false;

    // 3. Allocate required memory
    size_t required = sizeof(connectionName) + 
                      connectionName.ActualLength -
                      sizeof(WCHAR);
    connectionNameP = (PUSB_NODE_CONNECTION_NAME)malloc(required);
    
    // 4. Initialise struct
    connectionNameP->ConnectionIndex = index;
    
    // 5. Query name
    success = DeviceIoControl(hDevice,
        IOCTL_USB_GET_NODE_CONNECTION_NAME,
        connectionNameP,    // input data (e.g. index)
        required,
        connectionNameP,    // output data (e.g. name)
        required,
        &nBytes,
        NULL);
        
    if (!success)
    {
        // Don't leak the buffer on the failure path
        free(connectionNameP);
        return false;
    }

    // 6. Copy data (from the second struct, not the first)
    name = std::wstring(connectionNameP->NodeName, 
                        connectionNameP->NodeName + 
                        connectionName.ActualLength / sizeof(WCHAR));
                    
    // 7. Release memory
    free(connectionNameP);

    return true;
}

There are three problems with this approach: the struct we’re using is initialised twice, we must remember to free the memory, and we have two structures to keep track of.

In C++, we can use the power of templates and managed memory to improve on the code above. We can use a std::vector<char> to take the place of a buffer created dynamically on the heap, and take advantage of the fact that if we make it larger, the existing data is unchanged.

#include <cstddef>    // std::size_t
#include <stdexcept>  // std::invalid_argument
#include <vector>

template <typename T>
class dynamic_struct
{
    // Actual memory in which the struct is held
    std::vector<char> buffer;
public:
    // Contained type
    typedef T Type;
    
    // Default constructor ensures minimum buffer size
    dynamic_struct()
    : buffer(sizeof(T))
    {}
    
    // Parameterised constructor for when the size is known
    dynamic_struct(std::size_t size)
    {
        resize(size);
    }

    // Change size of buffer allocated for struct
    void resize(std::size_t size)
    {
        if (size < sizeof(T))
            throw std::invalid_argument("Size too small for struct");
        buffer.resize(size, 0);
    }

    // Get current buffer size (never less than struct_size)
    std::size_t size() const
    {
        return buffer.size();
    }

    // Get struct template type size
    static std::size_t struct_size()
    {
        return sizeof(T);
    }

    // Access struct
    const T& get() const
    {
        return *reinterpret_cast<const T*>(&buffer.front());
    }

    // Access struct
    T& get() 
    {
        return *reinterpret_cast<T*>(&buffer.front());
    }
};

Using this handy class, the function to get the name can be simplified:

bool GetUsbConnectionName(HANDLE hDevice, 
                          ULONG index, 
                          std::wstring& name)
{
    ULONG nBytes;
    dynamic_struct<USB_NODE_CONNECTION_NAME> connectionName;

    // 1. Initialise struct
    connectionName.get().ConnectionIndex = index;

    // 2. Query actual length
    BOOL success = DeviceIoControl(hDevice,
        IOCTL_USB_GET_NODE_CONNECTION_NAME,
        &connectionName.get(), // input data (e.g. index)
        connectionName.size(),
        &connectionName.get(), // output data (e.g. length)
        connectionName.size(),
        &nBytes,
        NULL);
        
    if (!success)
        return false;

    // 3. Allocate required memory (size of the struct itself,
    //    not of the wrapper, plus the length of the name)
    size_t required = connectionName.struct_size() + 
                      connectionName.get().ActualLength;
    connectionName.resize(required);
    
    // 4. Query name
    success = DeviceIoControl(hDevice,
        IOCTL_USB_GET_NODE_CONNECTION_NAME,
        &connectionName.get(), // input data (e.g. index)
        connectionName.size(),
        &connectionName.get(), // output data (e.g. name)
        connectionName.size(),
        &nBytes,
        NULL);
        
    if (!success)
        return false;

    // 5. Copy data
    name = std::wstring(connectionName.get().NodeName, 
                        connectionName.get().NodeName + 
                        connectionName.get().ActualLength / sizeof(WCHAR));

    return true;
}

In this case, there is no risk of forgetting to initialise the second struct, no risk of getting confused about which struct to copy data from, and no risk of memory leaks, even in the presence of exceptions.


Read user input or Enter

Code

In an interactive shell application, you might want to ask the user for a value, or let them stick with the default or current value. In such cases, it’s quite handy to be able to accept a press on the Enter key as a shorthand for keeping the current one.

Number of runs (5): 
Snoffle variant (pink): red

In the example above, the user has pressed enter without giving a value when prompted for number of runs, and has thus accepted the current setting of 5. In contrast, the user has decided to change the snoffle variant from pink to red. Presumably the user (and the programmer) knows the significance of all this.

The problem is that while the standard input stream std::cin is good at reading and parsing data, it will read as much as it needs, and no more. This means that for strings, it will read until the first whitespace character, so you only get one word. For ints and floats, it will ignore any initial whitespace characters, like space and newline, until it finds the beginning of a number (or an invalid character), and leave anything after the number, which may lead to there being a newline in the buffer when you come next to read a string.

The solution is to always read everything up to the newline into a string, which is what the standalone function std::getline is for, and then, once we have the whole user input, attempt to parse it. By using a std::stringstream, we rely on the same parsing routines that std::cin uses. Making it a template function is only natural, since the streams (both std::cin and std::stringstream) are designed to work with template types.

#include <iostream>
#include <sstream>
#include <string>

/*! Read user input until Enter is pressed. 
    \tparam T type of data to read
    \param val will hold the given input, if any
    \return false if Enter was pressed at once, true if data was given
*/
template <typename T>
bool get_user_input(T& val)
{
    std::string s;
    std::getline( std::cin, s);
    if (s.empty())
        return false;
    std::stringstream ss;
    ss << s;
    ss >> val;
    return true;
}

This is very simple and straightforward to use:

  int runs = get_number_of_runs();
  std::cout << "Number of runs (" << runs << "): ";
  if (get_user_input(runs))
    set_number_of_runs(runs);
  std::string snoffle = get_snoffle();
  std::cout << "Snoffle variant (" << snoffle << "): ";
  if (get_user_input(snoffle))
    set_snoffle(snoffle);

You’ll note that we get the current value, so we can display it, and only replace it if we have been given a new one.
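One thing the function above doesn't do is tell you whether the text actually parsed as a T. As a minimal sketch of one possible refinement (the name read_value_or_keep and the explicit stream parameter are my own assumptions, added so the function can be fed from something other than std::cin), the value can be left untouched unless parsing succeeds:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical variant: reads one line from 'in' and parses it as a T.
// Returns false (and leaves val unchanged) if the line is empty,
// the stream is exhausted, or the text cannot be parsed as a T.
template <typename T>
bool read_value_or_keep(std::istream& in, T& val)
{
    std::string line;
    if (!std::getline(in, line) || line.empty())
        return false;

    std::istringstream parser(line);
    T parsed;
    if (!(parser >> parsed))
        return false;   // e.g. "abc" when an int was expected

    val = parsed;
    return true;
}
```

Called as read_value_or_keep(std::cin, runs), it would behave like get_user_input for valid input, but would keep the current value if the user typed something unparseable.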

A synchronous observer of asynchronous events

Code, CodeProject

Introduction

In the Observer design pattern, a subject holds a list of interested parties – the observers – which it will notify about changes in status. Simply put, it’s a form of subscription, and this design comes up in all sorts of places (which is one of the definitions of the term ‘design pattern‘). It’s well suited for handling asynchronous events, like user interaction in a GUI, sensor information, and so on.

There is, however, often a need to re-synchronise asynchronous events. For instance, you might keep the latest status update until it’s actually needed for display, storage or some calculation. By doing this, you disregard the asynchronous nature of its source, and treat it as just another variable, as if it had been read from the subject right then. In other words, you synchronise a status from the past with the present. Sometimes, though, you don’t want the last value, but the next, which is a bit more complex, as it requires you to wait for the future to happen before you can say it’s the present.

In this article, we will write a simple multi-threaded example implementation of the Observer pattern, and show how to re-synchronise a past event to look current. Then we’ll demonstrate a technique to treat future events like they’re current, too.
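To give a rough idea of the two cases, here is a minimal sketch of an observer that caches whatever it was last told and can also block until it is told something new. It is not the implementation from the full article; the class name latest_value_observer and its member functions are hypothetical, and it assumes the subject calls notify(), possibly from another thread.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical observer that remembers the most recent notification.
template <typename T>
class latest_value_observer
{
    mutable std::mutex mutex_;
    std::condition_variable changed_;
    T value_{};
    unsigned long generation_ = 0;  // number of notifications so far
public:
    // Called by the subject whenever its status changes
    void notify(const T& value)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            value_ = value;
            ++generation_;
        }
        changed_.notify_all();
    }

    // Past event treated as current: return the last value seen
    T last() const
    {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

    // Future event treated as current: block until the next notify()
    T next()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        const unsigned long seen = generation_;
        changed_.wait(lock, [&] { return generation_ != seen; });
        return value_;
    }
};
```

The generation counter is there so that next() waits for a notification that arrives after the call was made, rather than returning at once because a value happens to be present.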

GetLastError as std::string

Code

If you haven’t a function for this already, feel free to re-use this. Putting it here so I don’t have to look around for it next time I need it.

// Needs Windows constant and type definitions
#include <windows.h>
#include <string>

// Create a string with last error message
std::string GetLastErrorStdStr()
{
  DWORD error = GetLastError();
  if (error)
  {
    LPVOID lpMsgBuf;
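    // Use the ANSI version explicitly, since the result is stored in
    // a std::string (plain FormatMessage is wide in Unicode builds)
    DWORD bufLen = FormatMessageA(
        FORMAT_MESSAGE_ALLOCATE_BUFFER | 
        FORMAT_MESSAGE_FROM_SYSTEM |
        FORMAT_MESSAGE_IGNORE_INSERTS,
        NULL,
        error,
        MAKELANGID(LANG_NEUTRAL, SUBLANG_DEFAULT),
        (LPSTR) &lpMsgBuf,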
        0, NULL );
    if (bufLen)
    {
      LPCSTR lpMsgStr = (LPCSTR)lpMsgBuf;
      std::string result(lpMsgStr, lpMsgStr+bufLen);
      
      LocalFree(lpMsgBuf);

      return result;
    }
  }
  return std::string();
}

This function retrieves the last error code, if any, and gets the text message associated with it, which is then converted to a standard string and returned. The main benefits of using this function are that it saves you from having to remember the syntax of FormatMessage, and that the memory used is tidied up.

Note that the FORMAT_MESSAGE_FROM_SYSTEM flag means only system error messages will be given. If you want to include error messages from your own modules, you’ll need to add the FORMAT_MESSAGE_FROM_HMODULE flag, and provide the handle to the module. See the FormatMessage documentation for details.

Splitting strings again – strtok redeemed

Code, CodeProject

The C++ source files for the string tokenisers discussed in this post and the Splitting strings post, plus the code for Removing whitespace and Static assert in C++, can be found here:
http://coolcowstudio.co.uk/source/cpp/utilities.zip.

One of the more curious omissions from the C++ standard library is a string splitter, i.e. a function that can take a string and split it up into its constituent parts, or tokens, based on some delimiter. There is one in other popular languages (C# – String.Split, Java – String.split, Python – string.split etc), but C++ programmers are left to roll their own, or use one from a third-party library like the boost::tokenizer (or the one I presented in Splitting strings).

There are many ways of doing this; the Stack Overflow question How do I tokenize a string in C++? has 23 answers at the time of writing, and those contain 20 different solutions (boost::tokenizer and strtok are suggested multiple times).

The strtok recommendations, however, all have comments pointing out the problems with this function – it’s destructive, and not reentrant (it can’t be nested or run in parallel on multiple strings). As functions go, strtok has a rather poor reputation – there’s even a popular reentrant version, strtok_r, available in many C library implementations, though it’s not a standard function.
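For comparison, here is a minimal sketch of a splitter that behaves roughly like strtok (any character in the delimiter string separates tokens, and empty tokens are skipped) without modifying its input or keeping hidden state. The name split_tokens is hypothetical, and this is not one of the tokenisers from the source files mentioned above.

```cpp
#include <string>
#include <vector>

// Hypothetical strtok-like splitter: non-destructive and reentrant,
// since it works on a const string and has no static state.
inline std::vector<std::string> split_tokens(const std::string& text,
                                             const std::string& delimiters)
{
    std::vector<std::string> tokens;

    // Skip any leading delimiters
    std::string::size_type start = text.find_first_not_of(delimiters);
    while (start != std::string::npos)
    {
        // The token ends at the next delimiter, or at the end of the text
        const std::string::size_type end =
            text.find_first_of(delimiters, start);
        tokens.push_back(end == std::string::npos
                             ? text.substr(start)
                             : text.substr(start, end - start));

        // Skip the run of delimiters before the next token, if any
        start = text.find_first_not_of(delimiters, end);
    }
    return tokens;
}
```

Because each call only reads its arguments, two such calls can be nested or run on different strings at the same time, which is exactly what plain strtok cannot do.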

All your base64 are different to us

Code, CodeProject

The C++ source files for the stand-alone base64 encoder and decoder discussed in this post, plus a separate implementation of quoted-printable (RFC 2045, section 6.7), and the hex string converter I presented last year, can be found here:
http://coolcowstudio.co.uk/source/cpp/coding.zip.

There is a quote that goes “Standards are great! Everyone should have one.” or something along those lines. (Somewhat ironically, this quote, too, has many different variations, and has many attributions. The earliest I’ve found attributes it to George Morrow in InfoWorld 21 Oct 1985).

A case in point is the base64 encoding. Put simply, it’s a method of encoding an array of 8-bit bytes using an alphabet consisting of 64 different printable characters from the ASCII character set. This is done by taking three 8-bit bytes of source data, arranging them into a 24-bit word, and converting that into four 6-bit values that map onto the 64-character alphabet (since 6 bits can represent the values 0-63).
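As a minimal sketch of that core step, assuming the standard alphabet from RFC 4648 and ignoring padding and line breaks entirely, three bytes can be turned into four characters like this (the function name base64_encode_triplet is hypothetical, and this is not the encoder from the source files above):

```cpp
#include <string>

// Hypothetical helper: encode exactly three bytes as four characters,
// using the standard base64 alphabet from RFC 4648.
inline std::string base64_encode_triplet(unsigned char b0,
                                         unsigned char b1,
                                         unsigned char b2)
{
    static const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/";

    // Arrange the three bytes into one 24-bit word
    const unsigned long word = (static_cast<unsigned long>(b0) << 16) |
                               (static_cast<unsigned long>(b1) << 8) |
                               static_cast<unsigned long>(b2);

    // Split the word into four 6-bit values and look each one up
    std::string out(4, '\0');
    out[0] = alphabet[(word >> 18) & 0x3f];
    out[1] = alphabet[(word >> 12) & 0x3f];
    out[2] = alphabet[(word >> 6) & 0x3f];
    out[3] = alphabet[word & 0x3f];
    return out;
}
```

For instance, base64_encode_triplet('M', 'a', 'n') gives "TWFu".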

The original implementation was for privacy-enhanced e-mail (RFC 1421), then altered slightly for MIME (RFC 2045), and again in its own standard (RFC 4648).

When I was looking at base64, I was interested in three different varieties or flavours, namely the MIME version, the (per RFC 4648) standard base64, and base64url. These differ in how they handle line breaks and other illegal characters, what characters are used in the 64-character alphabet, and the use of padding at the end to make up an even triplet of bytes.

Removing whitespace

Code, CodeProject

Here’s a std::string, please remove all whitespace from it. How would you do it? Despite its seeming simplicity, it’s an interesting question, because it can be done in so many ways.

To start with, how do you identify whitespace? Let’s have a look at some different approaches (all of which I’ve seen in the wild):

// Simple
bool iswhitespace1(char c)
{
  // Is it  space   or    tab      or    return   or    newline?
  return (c == ' ') || (c == '\t') || (c == '\r') || (c == '\n');
}
// Cute attempt at cleverness
bool iswhitespace2(char c)
{
  // Is it one of the whitespace characters?
  static const std::string spaces(" \t\r\n");
  return (std::string::npos != spaces.find(c));
}
// Probably ok, for English at least
bool iswhitespace3(char c)
{
  // Using C function, from <cctype>
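  // (cast first, as passing a negative char is undefined behaviour)
  return ::isspace(static_cast<unsigned char>(c)) != 0;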
}
// As above, but standard C++ instead of standard C
bool iswhitespace4(char c)
{
  // Using current locale, and std function from <locale>
  static const std::locale loc;
  return std::isspace(c, loc);
}

If we were to run through these four functions with values of c from 0 to 255, the first two would produce the same result, and the latter two would (probably) produce the same result, but those wouldn’t be the same as for the first two.
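To see where the difference comes from, here is a small sketch that collects every character value from 0 to 255 that a given predicate classifies as whitespace. The names classified_as_space, simple_space and c_space are hypothetical stand-ins for the approaches above, and the result for c_space assumes the default "C" locale is in effect.

```cpp
#include <cctype>
#include <string>

// Collect every char value (0-255) for which the predicate is true
template <typename Pred>
std::string classified_as_space(Pred is_space)
{
    std::string found;
    for (int i = 0; i < 256; ++i)
    {
        const char c = static_cast<char>(i);
        if (is_space(c))
            found.push_back(c);
    }
    return found;
}

// Hand-written check: space, tab, return and newline only
inline bool simple_space(char c)
{
    return (c == ' ') || (c == '\t') || (c == '\r') || (c == '\n');
}

// Library check, using the C function from <cctype>
inline bool c_space(char c)
{
    return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

In the "C" locale, classified_as_space(simple_space) holds four characters, whereas classified_as_space(c_space) holds six, the extra two being vertical tab ('\v') and form feed ('\f').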

Splitting strings

Code, CodeProject

Back in the dawn of time, when men were real men, bytes were real bytes, and floating point numbers were real, um, reals, the journeyman test of every aspiring programmer was to write their own text editor. (This was way before the concept of “life” had been invented, so no-one knew they were supposed to have one.)

Nowadays, we know better, and don’t write new code to solve problems that have already been solved. Well, unless we need an XML parser – everybody (including myself, but that’s a post for another time) has written one of those – or at least a string tokeniser (aka splitter).

Other languages get tokenisers for free (C# – String.Split, Java – String.split, Python – string.split, and so on, and even C has strtok), but not C++. Which is why it’s something almost every C++ programmer writes, at some point or other.

Of course, you can use the rather nifty boost::tokenizer, if the place where you work is okay with using Boost (a surprising number of places aren’t, for various reasons), or find one of the numerous example implementations out there. Like this one, for instance:

Redux: Hex strings to raw data and back

Code, CodeProject

During the writing of my last post, I did the due diligence thing and considered alternative implementations and algorithms to solve the problem at hand (converting a string representation of an 8-bit hexadecimal value to an unsigned 8-bit integer value). Because I was, in effect, documenting code written some years ago, I can’t recall exactly what other options, if any, I tried at the time.

I think I first tried using a std::stringstream, but gave up on that as being too slow, and went with strtoul instead. I might also have played around with using a std::map lookup table, with all the headaches that brought in terms of storage and initialisation, and decided against it.

What I didn’t try was a straight, non-clever switch-based lookup table to find the integer value of a hexadecimal character digit:

inline unsigned char hex_digit_to_nybble(char ch)
{
  switch (ch)
  {
    case '0': return 0x0;
    case '1': return 0x1;
    case '2': return 0x2;
...
    case 'f': return 0xf;
    case 'F': return 0xf;
    // std::invalid_argument has no default constructor (needs <stdexcept>)
    default: throw std::invalid_argument("Not a hexadecimal digit");
  }
}

Hex strings to raw data and back

Code, CodeProject

Here’s a problem that tends to crop up in a lot of communication domains: how do you transfer binary data in a protocol which limits what characters are permitted? The answer is to encode it into permissible characters (for historical reasons often 7-bit printable ASCII), and because there are few things this wonderful industry likes more than re-inventing the wheel, there’s a plethora of binary-to-text encoding schemes around. Each has its own trade-offs in terms of speed and space efficiency, and almost every one has a more or less glorious history of being the favoured scheme on some platform, or in some protocol or application.

The simplest encoding is (in my opinion) the “hexadecimal text” encoding. It’s so simple, it doesn’t even have a fancy or clever name. You simply take each byte and type its value as a hexadecimal number. Working on the assumption that a byte is 8 bits, its value can be expressed in two characters – 0x00-0xff. Assuming that a character occupies one byte, we see that the size of the data will double by writing it as hexadecimal text, so it’s not very efficient space-wise. But it is simple to understand and implement, and quite useful, so I wrote a pair of encoding/decoding functions.
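The functions from the full post are not reproduced in this excerpt, but as a minimal sketch of the idea, an encoder and decoder pair might look something like the following. The names to_hex_text, from_hex_text and hex_value are hypothetical, and the digit lookup assumes an ASCII-compatible character set.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical encoder: two lower-case hex characters per byte
inline std::string to_hex_text(const std::vector<unsigned char>& data)
{
    static const char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(data.size() * 2);
    for (unsigned char byte : data)
    {
        out.push_back(digits[byte >> 4]);    // high nybble
        out.push_back(digits[byte & 0x0f]);  // low nybble
    }
    return out;
}

// Value of a single hex digit (assumes ASCII-style contiguous letters)
inline unsigned char hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return static_cast<unsigned char>(ch - '0');
    if (ch >= 'a' && ch <= 'f') return static_cast<unsigned char>(ch - 'a' + 10);
    if (ch >= 'A' && ch <= 'F') return static_cast<unsigned char>(ch - 'A' + 10);
    throw std::invalid_argument("Not a hexadecimal digit");
}

// Hypothetical decoder: accepts upper- and lower-case digits
inline std::vector<unsigned char> from_hex_text(const std::string& text)
{
    if (text.size() % 2 != 0)
        throw std::invalid_argument("Odd number of hexadecimal digits");

    std::vector<unsigned char> data;
    data.reserve(text.size() / 2);
    for (std::string::size_type i = 0; i < text.size(); i += 2)
    {
        data.push_back(static_cast<unsigned char>(
            (hex_value(text[i]) << 4) | hex_value(text[i + 1])));
    }
    return data;
}
```

The doubling in size is easy to see here: to_hex_text pushes two characters for every input byte, and from_hex_text consumes two for every byte it produces.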