2010/08/10

Splitting strings

Posted in Code, CodeProject at 12:22 by Orjan

Back in the dawn of time, when men were real men, bytes were real bytes, and floating point numbers were real, um, reals, the journeyman test of every aspiring programmer was to write their own text editor. (This was way before the concept of “life” had been invented, so no-one knew they were supposed to have one.)

Nowadays, we know better, and don’t write new code to solve problems that have already been solved. Well, unless we need an XML parser – everybody (including myself, but that’s a post for another time) has written one of those – or at least a string tokeniser (aka splitter).

Other languages get tokenisers for free (C# – String.Split, Java – String.split, Python – string.split, and so on, and even C has strtok), but not C++. Which is why it’s something almost every C++ programmer writes, at some point or other.
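For comparison, here is a minimal sketch of how C's strtok can be wrapped from C++. The helper name split_with_strtok is made up for illustration and is not part of this post's code. Note that strtok treats its second argument as a set of delimiter characters, silently skips empty tokens, and overwrites the buffer it scans, which is why the sketch works on a copy.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not from the post): splits str with std::strtok.
// strtok writes NUL bytes into the buffer it scans, so we hand it a copy.
std::vector<std::string> split_with_strtok(const std::string& str,
                                           const char* delimiters)
{
  std::vector<char> buffer(str.begin(), str.end());
  buffer.push_back('\0');

  std::vector<std::string> tokens;
  for (char* token = std::strtok(&buffer[0], delimiters);
       token != 0;
       token = std::strtok(0, delimiters))
  {
    tokens.push_back(token);
  }
  return tokens;
}

// Usage: every character in ",;" is a delimiter, and the empty token
// between the two commas is dropped.
void demo_strtok()
{
  const std::vector<std::string> tokens = split_with_strtok("a,,b;c", ",;");
  assert(tokens.size() == 3);
  assert(tokens[0] == "a" && tokens[1] == "b" && tokens[2] == "c");
}
```

strtok also keeps its position in hidden static state, so it is not safe to interleave two tokenising loops this way.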

Of course, you can use the rather nifty boost::tokenizer, if the place where you work is okay with using Boost (a surprising number of places aren’t, for various reasons), or find one of the numerous example implementations out there. Like this one, for instance:

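#include <string>
#include <vector>

// Splits str at each occurrence of the whole separator string and appends
// the pieces to tokens. Empty tokens are kept only if empty is true.
void tokenise_string(const std::string& str, 
  const std::string& separator, 
  std::vector<std::string>& tokens, 
  bool empty /* = false */)
{
  const std::string::size_type strlength = str.length();
  const std::string::size_type seplength = separator.length();

  // An empty separator matches at every position, so the loop below
  // would never advance; treat the input as one single token instead
  if (0 == seplength)
  {
    if (empty || 0 != strlength)
      tokens.push_back(str);
    return;
  }

  std::string::size_type prev = 0;
  std::string::size_type next = str.find(separator, prev);

  while (std::string::npos != next)
  {
    if (empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + seplength;
    next = str.find(separator, prev);
  }
  // Whatever follows the last separator is the final token
  if (empty || prev != strlength)
    tokens.push_back(str.substr(prev, strlength - prev));
}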

There’s not that much to say about this. Pass in a string to split up into tokens, what separator to look for, and an output parameter which will hold the tokens when we’re done. What makes this implementation slightly different from some of the others is that the separator is a std::string, and treated as such. Other implementations I’ve seen take a char (or even std::string::value_type) as a separator, or a string which is treated as a list of possible separators (like “.!?” to split a text into sentences).

I dislike the latter, as it’s ambiguous – is the separator used as a full string or as an array of characters? Rather, I’d prefer to make it explicit by overloading the function:

void tokenise_string(const std::string& str, 
  const std::vector<std::string::value_type>& separators, 
  std::vector<std::string>& tokens, 
  bool empty /* = false */)
{
  const std::string::size_type strlength = str.length();
  const std::string::size_type seplength = 1;
  const std::string sep(separators.begin(), separators.end());

  std::string::size_type prev = 0;
  std::string::size_type next = str.find_first_of(sep, prev);

  while (std::string::npos != next)
  {
    if (empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + seplength;
    next = str.find_first_of(sep, prev);
  }
  if (empty || prev != strlength)
    tokens.push_back(str.substr(prev, strlength - prev));
}
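The only real difference between the two overloads is the search call: std::string::find looks for the whole separator, while std::string::find_first_of looks for any one of its characters. A small sketch of that difference follows; the two wrapper names are invented for this illustration.

```cpp
#include <cassert>
#include <string>

// Position of the first occurrence of sep as a whole string.
std::string::size_type first_as_string(const std::string& str,
                                       const std::string& sep)
{
  return str.find(sep);
}

// Position of the first character of str that occurs anywhere in sep.
std::string::size_type first_as_charset(const std::string& str,
                                        const std::string& sep)
{
  return str.find_first_of(sep);
}

// Usage: the same separator text gives different split points.
void demo_find_difference()
{
  const std::string input = "a,b, c";
  assert(first_as_string(input, ", ") == 3);   // the ", " before c
  assert(first_as_charset(input, ", ") == 1);  // the lone ',' after a
}
```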

However, there is a problem here, in that std::string can be implicitly created from a native array of characters, and std::vector can’t:

std::vector<std::string> output;
std::string input = "What, me worry? Nah.";
char separators[] = {'.','?'};

// Will call std::string separator version, which we probably don't intend
// (worse, the array isn't null-terminated, so the conversion reads past its end)
tokenise_string(input, separators, output);

// Must set up a vector explicitly
std::vector<char> sep_array(&separators[0], &separators[2]);

// Will call std::vector separator version
tokenise_string(input, sep_array, output);

For now, that is. The next C++ standard (C++0x) will give std::vector an initializer_list constructor, which will make things interesting here.
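As a rough sketch of what that could look like with a compiler that supports initializer lists: the functions below are condensed stand-ins for the two overloads above, renamed split_tokens and changed to return their tokens (both assumptions made to keep the example short), and they always skip empty tokens. Since std::string gains an initializer_list constructor as well, I would expect a bare {'.', '?'} argument to be ambiguous between the two overloads, so the sketch names the vector type explicitly.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the whole-string overload (hypothetical name and shape).
std::vector<std::string> split_tokens(const std::string& str,
                                      const std::string& separator)
{
  std::vector<std::string> tokens;
  if (separator.empty())  // nothing to split on; avoid a loop that never advances
  {
    if (!str.empty())
      tokens.push_back(str);
    return tokens;
  }
  std::string::size_type prev = 0;
  std::string::size_type next = str.find(separator, prev);
  while (std::string::npos != next)
  {
    if (prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + separator.length();
    next = str.find(separator, prev);
  }
  if (prev != str.length())
    tokens.push_back(str.substr(prev));
  return tokens;
}

// Stand-in for the character-set overload (hypothetical name and shape).
std::vector<std::string> split_tokens(const std::string& str,
                                      const std::vector<char>& separators)
{
  const std::string sep(separators.begin(), separators.end());
  std::vector<std::string> tokens;
  std::string::size_type prev = 0;
  std::string::size_type next = str.find_first_of(sep, prev);
  while (std::string::npos != next)
  {
    if (prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + 1;
    next = str.find_first_of(sep, prev);
  }
  if (prev != str.length())
    tokens.push_back(str.substr(prev));
  return tokens;
}

// Usage: naming the vector type picks the character-set overload without
// having to set up a separate array first.
void demo_initializer_list()
{
  const std::string input = "What, me worry? Nah.";
  const std::vector<std::string> sentences =
    split_tokens(input, std::vector<char>{'.', '?'});
  assert(sentences.size() == 2);
  assert(sentences[0] == "What, me worry");
  assert(sentences[1] == " Nah");
}
```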

By the way, the benefit of treating a separator string as one single separator is, of course, that it lets us parse telegrams:

std::vector<std::string> output;
std::string input = "NO TIME FOR WRENCHES STOP HAMMER TIME STOP";
std::string separator = "STOP";
tokenise_string(input, separator, output);
// Now output has two strings

I should probably mention, too, that the empty parameter lets us specify whether to include empty tokens in the output. In most cases, I don’t want to, but there are times it’s significant, if only to indicate whether the string started or ended with a separator.
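To see what the flag does with leading, doubled and trailing separators, here is a small self-contained sketch. The function split_on is a condensed variant of tokenise_string that returns its tokens; its name and return-by-value shape are assumptions of this example, not the post's API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Condensed, hypothetical variant of tokenise_string that returns its tokens.
std::vector<std::string> split_on(const std::string& str,
                                  const std::string& separator,
                                  bool empty)
{
  std::vector<std::string> tokens;
  if (separator.empty())  // nothing to split on; avoid a loop that never advances
  {
    if (empty || !str.empty())
      tokens.push_back(str);
    return tokens;
  }
  std::string::size_type prev = 0;
  std::string::size_type next = str.find(separator, prev);
  while (std::string::npos != next)
  {
    if (empty || prev != next)
      tokens.push_back(str.substr(prev, next - prev));
    prev = next + separator.length();
    next = str.find(separator, prev);
  }
  if (empty || prev != str.length())
    tokens.push_back(str.substr(prev));
  return tokens;
}

// Usage: with empty == true, the leading, doubled and trailing commas each
// contribute an empty token; with empty == false only "a" and "b" remain.
void demo_empty_flag()
{
  const std::vector<std::string> all = split_on(",a,,b,", ",", true);
  assert(all.size() == 5);
  assert(all[0].empty() && all[1] == "a" && all[2].empty());
  assert(all[3] == "b" && all[4].empty());

  const std::vector<std::string> filled = split_on(",a,,b,", ",", false);
  assert(filled.size() == 2);
  assert(filled[0] == "a" && filled[1] == "b");
}
```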

Finally, here’s a function you see implemented and talked about a lot less often than its counterpart. If you want to split, presumably you’ll also want to merge, at some point. While it’s a very simple function, I’ve found it handy to have it available, so the merging is consistently done:

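// Joins tokens into output, placing separator between adjacent tokens
void merge_tokens(const std::vector<std::string> &tokens, 
  const std::string& separator, 
  std::string& output)
{
  // Start from a clean slate, so an empty token list gives an empty string
  output.clear();
  if (!tokens.empty())
  {
    output = tokens.front();
    // begin() + 1 rather than ++(begin()): incrementing a temporary does
    // not compile where the iterator is a plain pointer
    for (std::vector<std::string>::const_iterator i = 
      tokens.begin() + 1; 
      i != tokens.end(); ++i)
    {
      output += separator + *i;
    }
  }
}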

Here we see the difference the empty flag makes in a call, by the way:

std::vector<std::string> split1, split2;
std::string input = "/usr/tmp";
std::string separator = "/";

tokenise_string(input, separator, split1);
tokenise_string(input, separator, split2, true);

std::string merged1, merged2;

merge_tokens(split1, separator, merged1);
merge_tokens(split2, separator, merged2);

assert(input != merged1);  // Initial / removed
assert(input == merged2);  // Initial / kept, thanks to the empty first token