How to Split Strings and Extract Tokens in C?

Robert Torok January 10, 2019
How to Split Strings and Extract Tokens in C?

Introduction

When working with strings, it is often necessary to split them by space or any other arbitrary character. Just think of processing CSV files — you'll need to use some tokenizer to extract its fields.

In C there's a function called strtok for this purpose, however, it has some drawbacks.

We will find out in this post when we can use strtok to break strings into tokens and when we should look for something else.

Additionally, for practicing purposes we'll also create our own split function. We'll discover and address some design issues during the implementation.

The strtok function

Signature

According to the manual page of strtok its signature looks like as follows:

char *strtok(char *str, const char *delim);

It's part of the string.h header file.

The first parameter str points to the string we want to tokenize by the set of characters pointed by delim. That is, multiple characters can be given as separator. For example, if delim is pointing to " ,;" then the string will be fragmented when whitespace, comma or semicolon character is hit.

This function is not that simple though: The documentation also says that the pointer to the source string should be passed in the first call. The return value will point to the first token. The subsequent calls will then need a NULL value in the first parameter and it will return the next token. This should be repeated until there's no more token to read. If that happens, the function returns NULL.

Next token and subsequent calls, huh? If you have Java or C# background, you might have expected that this function simply returns an array of the tokens so that you can reference them by typing result[0] and so on. However, that's not the case, it behaves really differently.

If it doesn't return an array then how it works? It can be best looked at through an example.

Example

Let's assume you want to parse a string like One Two Three. In order to receive the individual tokens separated by a whitespace character you have to call strtok three times:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int i = 0;
    char str[1024] = "One Two Three";
    char *p = strtok(str, " ");

    while (p) {
        printf("%i. token = %s\n", i, p);
        p = strtok(NULL, " ");
        i++;
    }
}

Compile and run the code above. You'll see the following:

~ $ gcc split4.c -Wall -o split && ./split 0. token = One
  1. . token = Two2. token = Three ~ $ If you read through the source code, you can interpret it as follows:
  2. The source string is stored in an array of chars (that's important, in the next chapter we'll see why)
  3. The first call passes the pointer to the source string and returns a pointer to the first token (One)
  4. If there's (more) token to read, then
    • Print the token
    • Call the function again, but this time passing NULL as the first parameter. It'll instruct the strtok to continue parsing the previous source string (that is, the string that was passed at the last invocation)
  5. Caveats

    Considering the signature of this function, it would be tempting to write something like that:

    char *str = "One Two Tree";
    char *ret = strtok(str, " ");
    printf("First token: %s\n", ret);
    

    As the first token, you expect the word One. But let's actually see what happens if you try to run this program:

    The result is segmentation fault

    The program crashes! Why? The reason is surprising but rather simple: strtok actually tries to modify the input array pointed by the first parameter. Usually, a string literal created like above lives in the read-only section of the memory. That is, when the strtok tries to modify that data, the program crashes.

    Keep that in mind and be very careful with that behavior because even when applying the -Wall flag (show all warnings), we did not get any warning message from the compiler!

    The first lesson here is that you should always call strtok with a parameter that points to a char array that can be modified.

    And that's not all. We run into another problem when it comes to parsing different strings at the same time.

    Let's consider the following example: We want to print the first names of employees and their corresponding job titles from a file. A line in the file might look like as follows:

    John Doe|Chief Executive Officer
    

    We expect the following output:

    John: Chief Executive Officer
    

    If we take into account these requirements and our current knowledge of the strtok function, we might wind up writing this:

    void extract_name(char *full_name, char *first_name)
    {
        char *token = strtok(full_name, " ");
        strcpy(first_name, token);
    }
    
    /* ... */
    
    p_line_token = strtok(line, "|");
    while (p_line_token)
    {
        if (cnt_token == 0)
        {
            extract_name(p_line_token, name);
        }
        else if (cnt_token == 1)
        {
            printf("%s: %s\n", name, p_line_token);
        }
    
        p_line_token = strtok(NULL, "|");
        cnt_token++;
    }
    

    Let's compile and see if it works!

    ~/work $ gcc split5.c -Wall -o split5 && ./split5
    John: Doe
    ~/work $
    

    As the output tells, something went wrong, this is not what we wanted to see. We got back the last name instead of the proper job title.

    Where's the problem? If we look into the code, we see that in the loop the extract_name function is called which also invokes the strtok function! Calling strtok again with the full_name parameter resets its internal state. This means, when the subsequent call happens in the main loop with NULL parameter to parse further the job title, it'll rather parse the string last pointed by the first parameter. Which is, in this case, the full_name and not the p_line_token.

    The second lesson is that strtok has an internal state and thus cannot parse multiple strings at the same time.

    In order to address the second problem, though, there's a variant called strtok_r which receives a third parameter to store the internal state.

    Summing up

    The best thing you can do is you don't use strtok at all. If possible, use its improved version, the strtok_r.

    Nevertheless, if you must use strtok (e.g. working on legacy systems where the newer version is not available), you should be very familiar with the behavior of strtok:

    • It modifies its input
      • The input string should be writable. That is, using string literals will likely lead to segmentation fault.
      • If you intend to reuse the input, you should copy it before using strtok
    • It has internal (global) state
      • Trying to use it in multiple contexts (i.e. different inputs) at the same time will produce (hard to catch!) errors.

    Custom Split Function

    Now that we see the limitations of strtok, let's write something that's more convenient to use and is similar to Java's split function.

    Suppose we want to get back an array containing the tokens by a simple call. Of course, we also intend to avoid side-effects, i.e. we do not modify the source string.

    The signature of our split function will be the following:

    char **split(char *src, char *sep, int *cnt_token)
    

    In general, the return value is just a two-dimensional array of strings. The parameter cnt_token is used to communicate to the caller how many elements the array has so that it can iterate it through.

    Here's an example of the intended usage of our brand-new split function:

    int i = 0;
    int cnt_token = 0;
    char **arr = split("Hello World", " ", &cnt_token);
    for (i = 0; i < cnt_token; i++)
    {
        printf("%d. %s\n", i, arr[i]);
    }
    

    Result:

    Writing Substring From Scratch

    Storing the Tokens

    Alternative Solution for Performance

    Benchmarking The Solutions

    Benchmarking the solutions

    It surprised me that even the non-optimized version with the memory allocation statements ran only in 553 ms.

    We won more than 100 ms by using the custom memory pool implementation.

    The Java version was also pretty fast though. The JVM is lightning fast compared to the old days.

    And then the variants complied with -O3 flag are taking the first two places in our comparison report. It'd be worth looking into assembly code the compiler generates because the first version is basically more than 4 times faster compared to its non-optimized counterpart.

    Conclusion

    We've seen how we can extract tokens in C by using the standard library's strtok function. We've also become familiar with the behavior of this function and learned that we have to be very cautious when using it. The best is to simply avoid it and use its refined strtok_r version.

    In order to practice, we have also created our own string tokenizer with different implementations.

    The source code of the article is available here.