Removing whitespace


Here’s a std::string, please remove all whitespace from it. How would you do it? Despite its seeming simplicity, it’s an interesting question, because it can be done in so many ways.

To start with, how do you identify whitespace? Let’s have a look at some different approaches (all of which I’ve seen in the wild):

// Simple
bool iswhitespace1(char c)
{
  // Is it  space   or    tab      or    return   or    newline?
  return (c == ' ') || (c == '\t') || (c == '\r') || (c == '\n');
}
// Cute attempt at cleverness
bool iswhitespace2(char c)
{
  // Is it one of the whitespace characters?
  static const std::string spaces(" \t\r\n");
  return (std::string::npos != spaces.find(c));
}
// Probably ok, for English at least
bool iswhitespace3(char c)
{
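  // Using C function, from <cctype>; the cast avoids undefined
  // behaviour when c is negative (char is often signed)
  return ::isspace(static_cast<unsigned char>(c)) != 0;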
}
// As above, but standard C++ instead of standard C
bool iswhitespace4(char c)
{
  // Using current locale, and std function from <locale>
  static const std::locale loc;
  return std::isspace(c, loc);
}

If we were to run through these four functions with values of c from 0 to 255, the first two would produce the same result, and the latter two would (probably) produce the same result, but those wouldn’t be the same as for the first two.

There are two reasons for this. First of all, the C and C++ isspace functions include a couple of often forgotten whitespace characters – the vertical tab ('\v', 0x0b) and the form feed ('\f', 0x0c). They don’t tend to see that much use nowadays, but are still defined as whitespace in both the C and C++ standards.

The second reason the results from isspace may differ from a hard-coded solution is that both versions depend on the locale in use. Changing the locale will never make any of the standard whitespace characters (" \t\r\n\v\f") stop counting as whitespace, but it may make some further characters count as whitespace too.
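To see the first difference concretely, here is a minimal sketch that compares the hard-coded check with the C isspace over all 256 character values. The names is_hardcoded_space and differing_whitespace are made up for illustration, and the result described below assumes the default "C" locale is in effect.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hard-coded check, same logic as iswhitespace1 above
bool is_hardcoded_space(char c)
{
  return (c == ' ') || (c == '\t') || (c == '\r') || (c == '\n');
}

// Hypothetical helper: collects every character value for which
// the hard-coded check and ::isspace disagree
std::vector<int> differing_whitespace()
{
  std::vector<int> result;
  for (int i = 0; i < 256; ++i)
  {
    const bool hard = is_hardcoded_space(static_cast<char>(i));
    const bool lib = ::isspace(i) != 0;  // i is already in 0..255
    if (hard != lib)
      result.push_back(i);
  }
  return result;
}
```

In the "C" locale this should return just two values, 0x0b and 0x0c: the vertical tab and the form feed.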

Since the functions already exist in the standard, it’s rather silly of us to write our own, so let’s just use isspace. Unless you muck about and change locales (and let’s not, if we can avoid it), the C and C++ versions behave the same way, so which you use is up to you.
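One caveat if you go with the C version: its argument must be representable as an unsigned char (or be EOF), so passing a negative char, which can happen wherever char is signed, is undefined behaviour. A minimal sketch of a wrapper that guards against this (the name is_space is made up for illustration):

```cpp
#include <cctype>

// Hypothetical wrapper around the C isspace from <cctype>.
// Converting to unsigned char first keeps the argument in the
// range isspace is defined for, even when char is signed.
bool is_space(char c)
{
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}
```

A wrapper like this can be used anywhere a predicate taking a char is expected.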

Knowing how to identify whitespace characters, we only need to remove them. How do we do that? Well, that depends on whether we want to modify the string, or create a copy. In either case, let’s again avoid the simplistic, completely hand-made solutions, such as these:

// Working on std::string str

// Altering original
std::string::size_type p = 0;
while (p < str.size())
{
  // If character at p is space erase it, otherwise go to next
  if (isspace(static_cast<unsigned char>(str[p])))
    str.erase(p, 1);
  else
    ++p;
}

...

// Making a copy
std::string output;
for (std::string::size_type i = 0; i < str.size(); ++i)
{
  if (!isspace(static_cast<unsigned char>(str[i])))
    output += str[i];
}

Both of these solutions work, but there are well-established, standardised ways of doing these things using algorithms:

// Working on std::string str

// Altering original
str.erase(std::remove_if(str.begin(), str.end(), 
  &::isspace), str.end());

...

// Making a copy
std::string output;
std::remove_copy_if(str.begin(), str.end(), 
  std::back_inserter(output), &::isspace);

Simple!

No? Ok, let’s break it up. The functions in the C++ <algorithm> header generally work on three types of parameters: iterators, predicates and function objects (aka functors). In the code above, we’re not using any functors, so we’ll put them aside for the moment.

&::isspace – predicate. This is simply a pointer to a function that takes one parameter and returns a value usable as a bool (isspace actually returns an int, with non-zero meaning true), in this case indicating whether a given character is whitespace or not, as discussed earlier.

str.begin(), str.end() – iterators, in this case indicating where to start and stop running the algorithm. We want to go through the whole string, so we start at the beginning, and end at the, well, end.

str.erase(std::remove_if(...), str.end()); – this is the erase-remove idiom. Because the remove_if function only takes iterators, it can’t actually remove anything. What it can do is re-shuffle: it moves all the elements (or characters in the string, in this case) that don’t match the predicate (i.e. aren’t whitespace) to the front of the given range, keeping them in their original order. It then returns an iterator to the new logical end, one past the last character kept; whatever sits between there and the old end is unspecified leftovers. This iterator is then given to the erase member function of the string, as the start of the characters to erase, and str.end() as the end.
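To make the two halves of the idiom visible, here is a small sketch that runs them separately. The helper names (is_ws, shuffle_only, shuffle_and_erase) are made up for illustration; is_ws adds the unsigned char conversion that the C isspace strictly requires.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Predicate with the unsigned char conversion that isspace requires
bool is_ws(char c)
{
  return std::isspace(static_cast<unsigned char>(c)) != 0;
}

// Hypothetical helper: performs only the remove_if step and returns
// how many characters were kept. The string's size is not changed.
std::string::size_type shuffle_only(std::string& str)
{
  std::string::iterator new_end =
    std::remove_if(str.begin(), str.end(), is_ws);
  return static_cast<std::string::size_type>(new_end - str.begin());
}

// Hypothetical helper: the full idiom, remove_if followed by erase
void shuffle_and_erase(std::string& str)
{
  str.erase(std::remove_if(str.begin(), str.end(), is_ws), str.end());
}
```

For the string "a b c", shuffle_only returns 3 and leaves the string five characters long, with "abc" at the front; only the erase call in shuffle_and_erase actually shortens it.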

std::back_inserter – iterator. This is a handy little helper that gives an output iterator for the given container (i.e. an iterator that can be used to insert elements in a container). (Unfortunately, Microsoft’s documentation still says the container given to it must be a std::vector, std::list or std::deque, which is not true. The only thing required is that the container has the member function push_back, which std::string does. Given how popular their development tools are, it’s surprising this hasn’t been amended.)

std::remove_copy_if – this is an amazingly poorly named function, which ought to be called std::copy_if_not. What it does is: go through the range given (i.e. begin to end), call the predicate (i.e. isspace) with each element in the range, and if the predicate returns true, don’t copy it. It doesn’t remove anything from the input range (it can’t, as it only has iterators), and in fact doesn’t change anything at all on the range it’s given. I guess that conceptually, it removes an element for which the predicate is true from a list of elements to copy. Except, there is no such list. In short: horrible name, copies elements not fulfilling the predicate.
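For what it’s worth, C++11 added std::copy_if, which copies the elements that do fulfil the predicate. Here is a sketch of the same copy written both ways, assuming a C++11 compiler (the function names are made up for illustration):

```cpp
#include <algorithm>
#include <cctype>
#include <iterator>
#include <string>

// Copy everything except whitespace, using remove_copy_if
std::string strip_with_remove_copy_if(const std::string& input)
{
  std::string output;
  std::remove_copy_if(input.begin(), input.end(),
    std::back_inserter(output),
    [](unsigned char c) { return std::isspace(c) != 0; });
  return output;
}

// The same thing with copy_if and the predicate negated
std::string strip_with_copy_if(const std::string& input)
{
  std::string output;
  std::copy_if(input.begin(), input.end(),
    std::back_inserter(output),
    [](unsigned char c) { return std::isspace(c) == 0; });
  return output;
}
```

Both functions leave the input untouched and should produce identical output; which one reads better is largely a matter of taste.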

So, there we are. Two simple and useful functions to remove whitespace:

void remove_whitespace(std::string& str)
{
  str.erase(std::remove_if(str.begin(), str.end(), 
    &::isspace), str.end());
}

void remove_whitespace(const std::string& input, std::string& output)
{
  output.clear();
  std::remove_copy_if(input.begin(), input.end(), 
    std::back_inserter(output), &::isspace);
}

(Of course, if you really want to use std::isspace with std::locale, things start to get a bit… well, complicated. I might return to that at some later point.)
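As a rough sketch of what that might look like, assuming a C++11 compiler and that the caller passes in the locale to use (remove_whitespace_loc is a made-up name, not something from the standard):

```cpp
#include <algorithm>
#include <locale>
#include <string>

// Hypothetical variant that uses std::isspace from <locale>.
// That isspace takes two arguments (character and locale), so it
// can't be used directly as a unary predicate; a lambda binds the
// locale instead.
void remove_whitespace_loc(std::string& str,
  const std::locale& loc = std::locale())
{
  str.erase(std::remove_if(str.begin(), str.end(),
    [&loc](char c) { return std::isspace(c, loc); }), str.end());
}
```

Called with one argument, it uses a copy of the current global locale; with std::locale::classic() it behaves like the plain "C" locale version.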
