Using sed to remove a column from a CSV

lojkyelo · August 8, 2014, 10:40am

I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:

sed -i 's/[^,]*,//' file.csv

However, I have a new file where the fields are encapsulated with double quotes - general example of file:

"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"

Desired outcome:

"External ID", "Name"
"ABC", "ABC Incorporated"

Can the above sed be modified to handle the "," delimiter? If so, how? Or are there better alternatives?

Any help be appreciated.

Corona688 · August 8, 2014, 11:54am

sed isn't really the right tool for dealing with columns. awk would be the correct tool, it understands columns as columns without weird regex convolutions.

As long as your CSV is actually comma separated -- uses , to separate columns and nowhere else -- this may work:

awk -F"," -v OFS="," '{ $1="" ; $0=substr($0,2) } 1' inputfile > outputfile

If your CSV isn't actually a CSV, a recursive parser that understands quotes is required and things start getting hard.

lojkyelo · August 8, 2014, 12:08pm

I can trust that the first column will never contain a comma within the "ABC" Internal ID field but not in the following fields, Name is likely suspect.

I tried you suggested command out, looks to have worked preserving the remainder of my record.

Thanks,

Franklin52 · August 8, 2014, 1:22pm

Another approach:

awk -F", " 'sub($1 FS,x)' file

bakunin · August 8, 2014, 1:48pm

I beg to disagree. Actually sed is a very good tool for manipulating any text data, regardless of being in tables or not. If one prefers awk or sed is more a matter of taste, because both languages are Turing-complete.

lojkyelo:

I found that the following works to remove the first column in my file when my CSV is delimited with a simple comma:
sed -i 's/[^,]*,//' file.csv
However, I have a new file where the fields are encapsulated with double quotes - general example of file:
"Internal ID", "External ID", "Name"
"123", "ABC", "ABC Incorporated"
Desired outcome:
"External ID", "Name"
"ABC", "ABC Incorporated"

Let us first examine what your regexp does:

s/[^,]*,//

This searches for a string of non-commas in the length of zero (=empty field) or more ("[^,]*", so in fact this makes this string optional), followed by a comma. The resulting string will be deleted (replaced by a null-string).

Let us have a look at the composition of your fields: any single field is "a sequence of zero or more non-, followed by a comma. This is true regardless of the field being enclosed in double quotes or not. So it seems that you do not have to change your regexp at all. A double-quote is just a character like any other.

What might happen is that commata enclosed in quotes should not be treated as field-separators, like this:

"abc,def","ghi,jkl,mno","..."

As your script is now (and, btw., the awk script too) this would be interpreted as the first field ending after "c", the second field after "f", etc.. But probably such a line should be interpreted as 3 fields, ending after "f", "o" and end-of-line. To accomodate for this you need to enhance your definition of what a "field" is a little: a field is a sequence of zero or more strings enclosed in double-quotes mixed with zero or more non-commata, followed by a comma. The following regexp is based on this definition:

s/\(\("[^"]*"\)*[^,]*\)*,//

Let us peel this apart. Basically it is quite easy, but the nesting level makes it somewhat difficult to understand:

"[^"]*"

This is a quoted string: a double quote, followed by zero or more non-double quotes, followed by another double quote.

\("[^"]*"\)*[^,]*

The double-quoted string grouped in brackets, so that the following "*" means zero or more occurences of this expression. This is followed by zero or more non-commas, which would be characters outside the double-quoted strings, so that the expression even allows for mixed quoted and non-quoted field contents.

\(\("[^"]*"\)*[^,]*\)*,

This whole expression is again grouped, made optional with the "*" (so that we allow for empty fields) and followed by a final ",", which is the record separator.

You could use this as a blueprint to manipulate other fields easily, not just the first. Suppose you would want to change the fourth field to "@@@". Use another grouping to bring one whole field together, skip the first three occurences of a field and work on the fourth. To preserve the first three fields contents we have to use another grouping:

s/\(\(\(\("[^"]*"\)*[^,]*\)*,\)\{3\}\)\(\("[^"]*"\)*[^,]*\)*/\1@@@/

I hope this helps.

bakunin

Franklin52 · August 8, 2014, 2:33pm

This should work too:

sed 's/[^,]*, \(.*\)/\1/' file

Don_Cragun · August 8, 2014, 2:56pm

Which is more simply written as:

sed 's/[^,]*, //' file

Which is where we started, except the code given in the 1st message in this thread was intended to handle cases where the field separator was a comma; while the input file we're processing here has a comma followed by a space sas the field separator.

Franklin52 · August 8, 2014, 3:02pm

Indeed

bakunin · August 8, 2014, 4:07pm

Correct. I thought i made it clear in my posting that the original regexp already covered the double-quoted fields too. I wanted to explain how to expand this to the - more general - case where double-quoted field separator chars would not count as field separators. For this (and this alone) to achieve the more complex regexp is necessary.

bakunin