Parsing CSV Files in C

Robert Torok September 14, 2018

Introduction

In this post we'll implement a CSV file parser in C. CSV stands for Comma Separated Values: a file format in which the data fields are separated by commas. Despite its age (defined in the early 80s), this format is still quite popular because it's simple. Really simple. It's text-based and non-proprietary, and it is often the common denominator when it comes to exchanging data between applications: the format that (almost) all programs understand.

There are of course many CSV parser libraries out there, and this post doesn't try to compete with them. The goal is just to show how CSV files can be parsed in C without any third-party libraries. Why? It helps you develop your algorithmic and general C programming skills.

This tutorial is aimed at beginner programmers who have already written their first programs. However, some parts of the article may be useful for a broader audience too.

We'll process an imaginary company's employee records file. During the journey we will:

  • Read the file as a simple text-file and print each record line
  • Start handling the file as a real CSV
  • Learn how to escape commas inside the text and properly process double quotes

What You'll Need

Just your favourite editor and a C compiler. This article assumes basic knowledge of writing and compiling C programs. As a quick recap, you can compile a source file as follows:

$ gcc source.c -g -Wall -o out

The output file can be executed then:

$ ./out

The Sample CSV File

Suppose the imaginary company's employee records file contains the following information:

  • Name (first and last name);
  • Phone number;
  • Job title.

These fields are separated by a comma, practically forming a CSV file. The sample file is as follows:

first_name,last_name,phone,job_title
John,Doe,555-444,CTO
Jane,Doe,444-555,Director

Notice that the file starts with a header line that lists each column name.

Even though this tutorial is about reading CSV files, this is the perfect opportunity to tell you to always add a header if you happen to create one. The header line is often missing from CSV files, but data without a header is not self-descriptive and is hard for other parties to understand.

In short, it's good practice to add a header to your CSV file.

Read the File and Display its Raw Content

Let's jump into some actual code. We'll open the sample CSV file and print each line:

#include <stdio.h>

int main(void) {
	FILE *fp = fopen("sample.csv", "r");
	char buf[1024];

	if (!fp) {
		printf("Can't open file\n");
		return 1;
	}

	while (fgets(buf, 1024, fp)) {
		printf("%s\n", buf);
	}

	fclose(fp);

	return 0;
}

The #include <stdio.h> line (more precisely, a preprocessing directive) tells the compiler that we're going to use functions declared in the C standard library's I/O header, namely printf and fopen. The former prints text to the standard output (stdout); the latter, as its name suggests, opens a file.

Next we define the main function. This is the entry point of the application: it runs when our program is executed. In this particular program we don't accept command-line arguments, and that's exactly what the void parameter in the function signature indicates.

Next, we open the file with the fopen function. The first parameter is the name of the file we want to open. The second one, "r", indicates that we open the file in read-only mode because we don't plan to change it.

The return value is a pointer to the file. If the file cannot be opened (for example, because it doesn't exist), fopen returns NULL. The following if block handles that case: if the fp variable is NULL (!fp), the program terminates with return code 1. By convention, a nonzero return code signals an error and 0 signals success.
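The error message can also say why the file could not be opened. The following is a minimal sketch, not part of the original program: open_for_reading is a hypothetical helper name, and it assumes a POSIX-like system where fopen sets errno on failure (ISO C alone does not promise that).

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from the article): opens a file for reading and
 * reports the reason on stderr if that fails. Returns NULL on failure. */
FILE *open_for_reading(const char *path) {
	errno = 0;
	FILE *fp = fopen(path, "r");
	if (!fp) {
		/* Assumption: fopen sets errno, as it does on POSIX systems. */
		fprintf(stderr, "Can't open %s: %s\n", path, strerror(errno));
	}
	return fp;
}
```

In main this could replace the fopen call, for example FILE *fp = open_for_reading("sample.csv"); followed by the same if (!fp) return 1; check.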

In the while loop each line is read and displayed on the standard output. The fgets function reads at most 1023 characters (one less than the size we pass) from the file, stores them in the buf variable, and appends the terminating null character. That's why we previously defined buf as an array of 1024 characters. Note that fgets stops reading when it encounters a newline or the end of the file; the newline itself is stored in buf.

Then we close the file and terminate the application.

I'd like to mention the importance of the fclose function and, more generally, of closing and freeing up resources. If you happen to be a beginner developer, you may forget to call cleanup functions like fclose. It doesn't really have any visible effect, right? Your program runs perfectly without it too.

However, resources and memory that are never released lead to leaks (of file handles as well as memory) and might eventually result in serious problems that are extremely hard to debug. I've seen enterprise systems where memory leaks started causing problems after weeks of running.

With that said, always pay extra attention to freeing up the resources and memory and call the appropriate methods, even if you write a five-liner program. Why? Because, if you always call the cleanup methods it'll eventually become a routine and will minimize the risk that one day you'll forget it in an important production application. If you develop the habit of properly closing the resources and freeing up the memory, you'll write better programs.

Now that we've understood the code, compile and run it to see the results:

$ gcc csv_reader1.c -g -pedantic -Wall -o csv_reader1
$ ./csv_reader1
first_name,last_name,phone,job_title

John,Doe,555-444,CTO

Jane,Doe,444-555,Director

$

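Great! We managed to print the raw contents of the sample file. The blank lines appear because each line in buf still ends with its own newline and printf adds a second one.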
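If the extra blank lines are unwanted, the trailing newline can be cut off before printing. This is a minimal sketch with a hypothetical helper name, strip_newline; it also removes a carriage return, on the assumption that the file might come from Windows.

```c
#include <string.h>

/* Hypothetical helper (not from the article): cuts the string at the first
 * '\r' or '\n'. A string without a line break is left unchanged. */
void strip_newline(char *line) {
	/* strcspn returns the length of the prefix that contains none of the
	 * listed characters, so this index is either the line break or the
	 * existing terminator. */
	line[strcspn(line, "\r\n")] = '\0';
}
```

Calling strip_newline(buf); right after fgets would make the loop print one line of output per line of input.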

Read and Process the File as a CSV File

We'll start handling the file as a real CSV file. To do so, we'll use the strtok function to split the current line at the commas. Each time we call this function, it returns a char* that points to the next token. If there are no more tokens, strtok returns NULL.

Let's first look at the code, then examine it in more detail:

#include <stdio.h>
#include <string.h>

int main(void) {
	FILE *fp = fopen("sample.csv", "r");

	if (!fp) {
		printf("Can't open file\n");
		return 1;
	}

	char buf[1024];
	int row_count = 0;
	int field_count = 0;
	while (fgets(buf, 1024, fp)) {
		field_count = 0;
		row_count++;

		if (row_count == 1) {
			continue;
		}

		char *field = strtok(buf, ",");
		while (field) {
			if (field_count == 0) {
				printf("First Name:\t");
			}
			if (field_count == 1) {
				printf("Last Name:\t");
			}
			if (field_count == 2) {
				printf("Phone Number:\t");
			}
			if (field_count == 3) {
				printf("Job Title:\t");
			}

			printf("%s\n", field);
			field = strtok(NULL, ",");

			field_count++;
		}
		printf("\n");
	}

	fclose(fp);

	return 0;
}

First of all, we need to count the rows and columns. We count the rows because we know that the first line is the header and we don't want to process it. We count the columns to know which field we are processing.
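As the number of columns grows, the chain of if statements that maps a column index to a label can be replaced by a lookup table. The sketch below is one possible variant, not the article's code; FIELD_LABELS and field_label are hypothetical names.

```c
#include <stddef.h>

/* Column labels in file order (hypothetical table, matching the sample file). */
static const char *const FIELD_LABELS[] = {
	"First Name", "Last Name", "Phone Number", "Job Title"
};

/* Returns the label for a zero-based column index, or "Unknown" if the
 * line has more columns than the table knows about. */
const char *field_label(int field_count) {
	size_t n = sizeof FIELD_LABELS / sizeof FIELD_LABELS[0];
	if (field_count < 0 || (size_t)field_count >= n) {
		return "Unknown";
	}
	return FIELD_LABELS[field_count];
}
```

With this helper, the body of the inner loop could shrink to a single call such as printf("%s:\t%s\n", field_label(field_count), field);.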

Then comes the strtok function that splits up the line. Do you spot something interesting? On the first call, the first parameter of strtok is the line that we want to split up. However, when we call it again in the while loop, the first parameter is NULL. This is how strtok is specified to work.

I remember that when I was studying strtok I didn't understand how it would know what to parse next if I passed NULL. It turned out that strtok keeps a pointer to the unparsed part of the string in internal static state, which behaves like a global variable. That state survives after strtok returns and is reused when strtok is called again. Two consequences are worth knowing: strtok modifies the buffer it parses, and it cannot be used to tokenize two strings at the same time.
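Two properties of strtok matter for CSV data: it overwrites delimiters in the buffer with null characters, and it treats a run of consecutive delimiters as a single one, so empty fields are silently skipped. The small sketch below makes both visible; count_tokens is a hypothetical helper name.

```c
#include <string.h>

/* Hypothetical helper (not from the article): counts the tokens strtok finds.
 * The line is modified in place: each delimiter that ends a token is
 * overwritten with '\0'. */
int count_tokens(char *line, const char *delims) {
	int count = 0;
	/* First call passes the buffer, later calls pass NULL to continue. */
	for (char *tok = strtok(line, delims); tok; tok = strtok(NULL, delims)) {
		count++;
	}
	return count;
}
```

For the line John,Doe,555-444,CTO this returns 4, but for John,,555-444,CTO it returns 3, because the empty last name between the two commas never shows up as a token.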

If you compile and run the program, you'll see the following output:

$ ./out
First Name:	John
Last Name:	Doe
Phone Number:	555-444
Job Title:	CTO


First Name:	Jane
Last Name:	Doe
Phone Number:	444-555
Job Title:	Director

How to Escape Commas and Parse Double Quotes

Say the employee records database has been updated since the last time we processed it. First, a new employee joined the company with the job title Support, Level2. Note the comma.

Second, an existing employee got promoted and now holds the job title VP of Department of "Advanced Technologies". Note the double quotes.

The new input file that represents these changes would look like this:

first_name,last_name,phone,job_title
John,"Doe",555-444,CTO
"Jane","Doe",444-555,"VP of Department of ""Advanced Technologies"""
John,Doe Jr,444-333,"Support, Level2"

Note that "Support, Level2" is enclosed in double quotes: if you want to have commas inside a field, you need to enclose the field in double quotes. A double quote that is part of the value is written as two double quotes inside the quoted field, as in ""Advanced Technologies"". According to the specification, you can also use double quotes even if there are no commas in the text.
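Seen from the writing side, these rules can be sketched as a small function that produces one CSV field from a plain string. This is only an illustration of the quoting rules described above, not part of the parser; csv_quote_field and its return convention are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not from the article): writes src into dst as one
 * CSV field. Returns 0 on success, -1 if dst is too small. */
int csv_quote_field(const char *src, char *dst, size_t dst_size) {
	/* Quotes are only required for commas, double quotes, and line breaks. */
	int needs_quotes = strpbrk(src, ",\"\r\n") != NULL;
	size_t pos = 0;

	if (dst_size == 0) {
		return -1;
	}
	if (needs_quotes) {
		if (pos + 1 >= dst_size) return -1;
		dst[pos++] = '"';              /* opening quote */
	}
	for (; *src; src++) {
		size_t need = (*src == '"') ? 2 : 1;
		if (pos + need >= dst_size) return -1;
		if (*src == '"') {
			dst[pos++] = '"';          /* a literal quote is written twice */
		}
		dst[pos++] = *src;
	}
	if (needs_quotes) {
		if (pos + 1 >= dst_size) return -1;
		dst[pos++] = '"';              /* closing quote */
	}
	dst[pos] = '\0';
	return 0;
}
```

Passing Support, Level2 yields "Support, Level2" with the surrounding quotes, while CTO is copied unchanged.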

So, what will happen if we try to parse the new file with the tool?

$ ./out
...
First Name:	John
Last Name:	Doe Jr
Phone Number:	444-333
Job Title:	"Support
 Level2"

Certainly, this is not what we want. The double quotes appear in the output, and the job title was split in two at the comma.

A new approach to properly parse the CSV file could be the following:

#include <stdio.h>
#include <string.h>

void process_field(int field_count, char *value) {
	if (field_count == 0) {
		printf("First Name:\t");
	}
	if (field_count == 1) {
		printf("Last Name:\t");
	}
	if (field_count == 2) {
		printf("Phone Number:\t");
	}
	if (field_count == 3) {
		printf("Job Title:\t");
	}
	printf("%s\n", value);
}

int main(void) {
	char buf[1024];
	char token[1024];

	int row_count = 0;
	int field_count = 0;
	int in_double_quotes = 0;
	int token_pos = 0;
	int i = 0;

	FILE *fp = fopen("sample.csv", "r");

	if (!fp) {
		printf("Can't open file\n");
		return 1;
	}


	while (fgets(buf, 1024, fp)) {
		row_count++;

		if (row_count == 1) {
			continue;
		}

		field_count = 0;
		i = 0;
		do {
			token[token_pos++] = buf[i];

			if (!in_double_quotes && (buf[i] == ',' || buf[i] == '\n')) {
				token[token_pos - 1] = 0;
				token_pos = 0;
				process_field(field_count++, token);
			}

			if (buf[i] == '"' && buf[i + 1] != '"') {
				token_pos--;
				in_double_quotes = !in_double_quotes;
			}

			if (buf[i] == '"' && buf[i + 1] == '"')
				i++;

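		} while (buf[++i]);

		/* Flush the last field if the line did not end with a newline
		 * (for example, the final line of a file without a trailing newline). */
		if (!in_double_quotes && token_pos > 0) {
			token[token_pos] = 0;
			token_pos = 0;
			process_field(field_count++, token);
		}

		printf("\n");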
	}

	fclose(fp);

	return 0;
}

The first thing we notice is that we extracted the logic that interprets the fields into a separate function called process_field. This is good practice: we always want to separate our business logic (process_field in this case) from the utility code.

Second, we no longer use the strtok function. The reason is that we need finer-grained parsing logic that takes the double quotes into account. In addition, it should also handle escaped double quotes inside quoted fields. For that reason, we process the line character by character.

The logic in the loop reads as follows:

  • Save the current character in a temporary token
  • If we're NOT inside quoted text and we encounter the delimiter character OR the end of the line, then
    • Terminate the token by overwriting the character just saved: token[token_pos - 1] = 0
    • Reset the token position to 0 (for the next token)
    • Pass the token to the business logic
  • If the current character is a double quote AND the next one is NOT, then a quoted field is starting or ending
    • The opening and closing double quotes should not be saved in the token. The token_pos-- instruction takes care of this
    • Flip the in_double_quotes flag
  • If the current character is a double quote AND the next one IS a double quote too, then it is an escaped double quote inside quoted text
    • Save only the first double quote character. To achieve this, just skip the next character by incrementing the loop variable i
  • If there are more characters to read, go back to the first step
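The same steps can also be written as an explicit two-state machine that splits a line into an array of fields instead of printing them. This is a sketch of an alternative structure, not the program above: split_csv_line and MAX_FIELD_LEN are hypothetical names, the field length limit is an assumption, and unlike the program above it also accepts an empty quoted field ("") and a line without a trailing newline.

```c
#include <stddef.h>

#define MAX_FIELD_LEN 256   /* assumed limit, not from the article */

/* Hypothetical helper: splits one CSV line into fields.
 * Returns the number of fields stored, or -1 if a limit is exceeded
 * or a quoted field is never closed. */
int split_csv_line(const char *line, char fields[][MAX_FIELD_LEN], int max_fields) {
	int count = 0;       /* index of the field being built */
	size_t pos = 0;      /* write position inside that field */
	int in_quotes = 0;   /* 1 while inside a quoted field */

	if (max_fields < 1) {
		return -1;
	}
	for (size_t i = 0; ; i++) {
		char c = line[i];
		int at_end = (c == '\0' || c == '\n' || c == '\r');

		if (in_quotes) {
			if (c == '\0') {
				return -1;               /* quote was never closed */
			}
			if (c == '"' && line[i + 1] == '"') {
				i++;                     /* escaped quote: keep one, skip one */
			} else if (c == '"') {
				in_quotes = 0;           /* closing quote is not stored */
				continue;
			}
		} else if (c == '"') {
			in_quotes = 1;               /* opening quote is not stored */
			continue;
		} else if (c == ',' || at_end) {
			fields[count][pos] = '\0';   /* finish the current field */
			count++;
			pos = 0;
			if (at_end) {
				return count;
			}
			if (count == max_fields) {
				return -1;               /* more fields than the caller allows */
			}
			continue;
		}
		if (pos + 1 >= MAX_FIELD_LEN) {
			return -1;                   /* field too long */
		}
		fields[count][pos++] = c;        /* ordinary character: store it */
	}
}
```

A caller would declare something like char fields[8][MAX_FIELD_LEN]; and pass each line read by fgets together with the array and 8 as the limit, then hand fields[0] through fields[n - 1] to process_field.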

As a recap, remember that strings in C must always be terminated with the null character (the value 0, also written '\0'). It marks the end of the string.

Compile and run the application:

$ ./out
First Name:	John
Last Name:	Doe
Phone Number:	555-444
Job Title:	CTO

First Name:	Jane
Last Name:	Doe
Phone Number:	444-555
Job Title:	VP of Department of "Advanced Technologies"

First Name:	John
Last Name:	Doe Jr
Phone Number:	444-333
Job Title:	Support, Level2

Looks much better! We've managed to correctly parse the CSV file.

Further Reading

The official RFC specification can be found here.