2012/10/17
Splitting strings again – strtok redeemed
The C++ source files for the string tokenisers discussed in this post and the Splitting strings post, plus the code for Removing whitespace and Static assert in C++, can be found here:
http://coolcowstudio.co.uk/source/cpp/utilities.zip.
One of the more curious omissions from the C++ standard library is a string splitter, i.e. a function that can take a string and split it up into its constituent parts, or tokens, based on some delimiter. There is one in other popular languages (C# – String.Split, Java – String.split, Python – string.split etc), but C++ programmers are left to roll their own, or use one from a third-party library like boost::tokenizer (or the one I presented in Splitting strings).
There are many ways of doing this; the Stack Overflow question How do I tokenize a string in C++? has 23 answers at the time of writing, and those contain 20 different solutions (boost::tokenizer and strtok are suggested multiple times).
The strtok recommendations, however, all have comments pointing out the problems with this function – it’s destructive, and not reentrant (it can’t be nested or run in parallel on multiple strings). As functions go, strtok has a rather poor reputation – there’s even a popular reentrant version, strtok_r, available in many C library implementations, though it’s not part of the C standard (it is specified by POSIX).
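To make the "destructive" complaint concrete, here is a minimal sketch. The helper names split_with_strtok and buffer_after_strtok are made up for this illustration and are not from the downloadable source. The first wraps strtok so that it works on a private copy; the second shows what strtok leaves behind in the buffer it was given.

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: splits text on any character in delims using strtok.
// It tokenises a private copy, because strtok overwrites the delimiters
// it finds with '\0'.
std::vector<std::string> split_with_strtok(const std::string& text,
                                           const char* delims)
{
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');                      // strtok needs a C string

    std::vector<std::string> tokens;
    for (char* tok = std::strtok(buffer.data(), delims);
         tok != nullptr;
         tok = std::strtok(nullptr, delims))     // nullptr: resume from hidden state
    {
        tokens.push_back(tok);
    }
    return tokens;
}

// Hypothetical helper: runs strtok to completion over a copy of text and
// returns what that copy looks like afterwards when read as a C string.
std::string buffer_after_strtok(const std::string& text, const char* delims)
{
    std::vector<char> buffer(text.begin(), text.end());
    buffer.push_back('\0');

    for (char* tok = std::strtok(buffer.data(), delims);
         tok != nullptr;
         tok = std::strtok(nullptr, delims))
    {
        // Nothing to do; only the side effect on the buffer matters here.
    }
    return std::string(buffer.data());           // stops at the first '\0' written
}
```

Given "a,b,c" and "," as the delimiter, the second function returns just "a", since the commas have been replaced by terminators. The hidden state behind the nullptr calls is what makes nesting two strtok loops, or running them on several threads, go wrong.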
2010/08/12
Removing whitespace
Here’s a std::string, please remove all whitespace from it. How would you do it? Despite its seeming simplicity, it’s an interesting question, because it can be done in so many ways.
To start with, how do you identify whitespace? Let’s have a look at some different approaches (all of which I’ve seen in the wild):
// Simple
bool iswhitespace1(char c)
{
    // Is it space or tab or return or newline?
    return (c == ' ') || (c == '\t') || (c == '\r') || (c == '\n');
}
// Cute attempt at cleverness
bool iswhitespace2(char c)
{
    // Is it one of the whitespace characters?
    static const std::string spaces(" \t\r\n");
    return (std::string::npos != spaces.find(c));
}
// Probably ok, for English at least
bool iswhitespace3(char c)
{
    // Using C function, from <cctype>; the cast avoids undefined
    // behaviour when char is signed and c is negative
    return ::isspace(static_cast<unsigned char>(c)) != 0;
}
// As above, but standard C++ instead of standard C
bool iswhitespace4(char c)
{
    // Using current locale, and std function from <locale>
    static const std::locale loc;
    return std::isspace(c, loc);
}
If we were to run through these four functions with values of c from 0 to 255, the first two would produce the same result, and the latter two would (probably) produce the same result, but those wouldn’t be the same as for the first two.
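A quick way to see where the two groups part ways is to loop over all 256 values and collect the ones that are classified differently. The sketch below is only an illustration (whitespace_disagreements is a made-up name). It compares the character-list approach with the C library's isspace, and assumes the program is running in the default "C" locale.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns every value in 0..255 that the hand-written
// list (space, tab, CR, LF) and the C library's isspace classify differently.
std::vector<int> whitespace_disagreements()
{
    static const std::string spaces(" \t\r\n");

    std::vector<int> differing;
    for (int value = 0; value <= 255; ++value)
    {
        const bool by_list =
            spaces.find(static_cast<char>(value)) != std::string::npos;
        // value is already in the unsigned char range, as isspace requires
        const bool by_library = std::isspace(value) != 0;

        if (by_list != by_library)
            differing.push_back(value);
    }
    return differing;
}
```

In the "C" locale the disagreements should be vertical tab (11) and form feed (12), which isspace counts as whitespace and the hand-written list does not. Other locales may add more.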
Read on…
2010/08/10
Splitting strings
Back in the dawn of time, when men were real men, bytes were real bytes, and floating point numbers were real, um, reals, the journeyman test of every aspiring programmer was to write their own text editor. (This was way before the concept of “life” had been invented, so no-one knew they were supposed to have one.)
Nowadays, we know better, and don’t write new code to solve problems that have already been solved. Well, unless we need an XML parser – everybody (including myself, but that’s a post for another time) has written one of those – or at least a string tokeniser (aka splitter).
Other languages get tokenisers for free (C# – String.Split, Java – String.split, Python – string.split, and so on, and even C has strtok), but not C++. Which is why it’s something almost every C++ programmer writes, at some point or other.
Of course, you can use the rather nifty boost::tokenizer, if the place where you work is okay with using Boost (a surprising number of places aren’t, for various reasons), or find one of the numerous example implementations out there. Like this one, for instance:
Read on…
2010/08/05
Redux: Hex strings to raw data and back
During the writing of my last post, I did the due diligence thing and considered alternative implementations and algorithms to solve the problem at hand (converting a string representation of an 8-bit hexadecimal value to an unsigned 8-bit integer value). Because I was, in effect, documenting code written some years ago, I can’t recall exactly what other options, if any, I tried at the time.
I think I first tried using a std::stringstream, but gave up on that as being too slow, and went with strtoul instead. I might also have played around with using a std::map lookup table, with all the headaches that brought in terms of storage and initialisation, and decided against it.
What I didn’t try was a straight, non-clever switch-based lookup table to find the integer value of a hexadecimal character digit:
inline unsigned char hex_digit_to_nybble(char ch)
{
    switch (ch)
    {
        case '0': return 0x0;
        case '1': return 0x1;
        case '2': return 0x2;
        ...
        case 'f': return 0xf;
        case 'F': return 0xf;
        // std::invalid_argument has no default constructor, so pass a message
        default: throw std::invalid_argument("not a hexadecimal digit");
    }
}
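For comparison, the strtoul route mentioned earlier in this post might look roughly like the following. This is a sketch, not the code from the previous post: hex_pair_to_byte is a made-up name, and the up-front isxdigit check is my own addition, there because strtoul on its own would also accept leading whitespace and a sign.

```cpp
#include <cctype>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts exactly two hexadecimal characters,
// e.g. "3f", to the byte value they represent, using strtoul.
unsigned char hex_pair_to_byte(const std::string& text)
{
    // Reject anything that is not two hex digits before strtoul sees it;
    // strtoul itself would tolerate input such as " f" or "-1".
    if (text.size() != 2 ||
        !std::isxdigit(static_cast<unsigned char>(text[0])) ||
        !std::isxdigit(static_cast<unsigned char>(text[1])))
    {
        throw std::invalid_argument("expected exactly two hexadecimal digits");
    }

    // Base 16; two hex digits always fit in an unsigned char of 8 bits.
    const unsigned long value = std::strtoul(text.c_str(), nullptr, 16);
    return static_cast<unsigned char>(value);
}
```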
2010/08/04
Hex strings to raw data and back
Here’s a problem that tends to crop up in a lot of communication domains: how do you transfer binary data in a protocol which limits what characters are permitted? The answer is to encode it into permissible characters (for historical reasons often 7-bit printable ASCII), and because there are few things this wonderful industry likes more than re-inventing the wheel, there’s a plethora of binary-to-text encoding schemes around. Each has its own trade-offs in terms of speed and space efficiency, and almost every one has a more or less glorious history of being the favoured scheme on some platform, or in some protocol or application.
The simplest encoding is (in my opinion) the “hexadecimal text” encoding. It’s so simple, it doesn’t even have a fancy or clever name. You simply take each byte and type its value as a hexadecimal number. Working on the assumption that a byte is 8 bits, its value can be expressed in two characters – 0x00-0xff. Assuming that a character occupies one byte, we see that the size of the data will double by writing it as hexadecimal text, so it’s not very efficient space-wise. But it is simple to understand and implement, and quite useful, so I wrote a pair of encoding/decoding functions. Read on…
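As a rough illustration of the scheme just described, a minimal encoder/decoder pair could look like the sketch below. The names to_hex_text, nybble_value and from_hex_text are invented for this example and are not the functions from the full post. It assumes 8-bit bytes and a character set where 'a' to 'f' and 'A' to 'F' are contiguous, as they are in ASCII.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical encoder: writes each byte as two lowercase hex characters.
std::string to_hex_text(const std::vector<unsigned char>& data)
{
    static const char digits[] = "0123456789abcdef";

    std::string text;
    text.reserve(data.size() * 2);               // output is twice the input size
    for (unsigned char byte : data)
    {
        text.push_back(digits[byte >> 4]);       // high nybble
        text.push_back(digits[byte & 0x0f]);     // low nybble
    }
    return text;
}

// Hypothetical helper: value of one hex digit; throws for anything else.
unsigned char nybble_value(char ch)
{
    if (ch >= '0' && ch <= '9') return static_cast<unsigned char>(ch - '0');
    if (ch >= 'a' && ch <= 'f') return static_cast<unsigned char>(ch - 'a' + 10);
    if (ch >= 'A' && ch <= 'F') return static_cast<unsigned char>(ch - 'A' + 10);
    throw std::invalid_argument("not a hexadecimal digit");
}

// Hypothetical decoder: turns pairs of hex characters back into bytes.
std::vector<unsigned char> from_hex_text(const std::string& text)
{
    if (text.size() % 2 != 0)
        throw std::invalid_argument("hex text must have an even length");

    std::vector<unsigned char> data;
    data.reserve(text.size() / 2);
    for (std::size_t i = 0; i < text.size(); i += 2)
    {
        data.push_back(static_cast<unsigned char>(
            (nybble_value(text[i]) << 4) | nybble_value(text[i + 1])));
    }
    return data;
}
```

Encoding the three bytes 0x00, 0x7f and 0xff gives the six-character string 007fff, and decoding accepts either letter case.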
