Filter a .CSV file based on the 5th column values

I have a .CSV file with the below format:

"column 1","column 2","column 3","column 4","column 5","column 6","column 7","column 8","column 9","column 10
"12310","42324564756","a simple string with a , comma","string with or, without commas","string 1","USD","12","70%","08/01/2013",""
"23455","12312255564","string, with, multiple, commas","string with or, without commas","string 2","USD","433","70%","07/15/2013",""
"23525","74535243123","string , with commas, and - hypens and: semicolans","string with or, without commas","string 1","CAND","744","70%","05/06/2013",""
"46476","15467534544","lengthy string, with commas, multiple: colans","string with or, without commas","string 2","CAND","388","70%","09/21/2013",""

The 5th column of the file has different strings. I need to filter the file based on the 5th column's value. Let's say I need a new file, built from the current one, containing only the records with the value "string 1" in the fifth field.

For this I tried the command below:

awk -F"," ' { if toupper($5) == "STRING 1") PRINT  }' file1.csv > file2.csv

but it threw the following error:

awk:  { if toupper($5) == "STRING 1") PRINT  }
awk:       ^ syntax error
awk:  { if toupper($5) == "STRING 1") PRINT  }
awk:                                ^ syntax error

I also tried the following, but it did not help much:

awk -F"," ' { if ($5 == "string 1") print  }' file1.csv > file2.csv

I then used the following, which gives me odd output.

awk -F"," '$5="string 1" {print}' file1.csv > file2.csv

Output:

"column 1" "column 2" "column 3" "column 4" string 1 "column 6" "column 7" "column 8" "column 9" "column 10
"12310" "42324564756" "a simple string with a   comma" string 1  without commas" "string 1" "USD" "12" "70%" "08/01/2013" ""
"23455" "12312255564" "string  with string 1  commas" "string with or  without commas" "string 2" "USD" "433" "70%" "07/15/2013" ""
"23525" "74535243123" "string   with commas string 1 "string with or  without commas" "string 1" "CAND" "744" "70%" "05/06/2013" ""
"46476" "15467534544" "lengthy string  with commas string 1 "string with or  without commas" "string 2" "CAND" "388" "70%" "09/21/2013" ""

P.S.: I used the toupper function to be on the safe side, as I am not sure whether the string will be in lower or upper case. Also, please advise whether spaces in the string matter when searching for a pattern with awk... Thanks in advance.

You have at least two problems here:

First, you are telling awk that your field separator is a comma, but some commas in your input file are not field separators.

Second, you are telling awk to look for lines where the string string 1 is the 5th field in the line; but that string never appears between commas in your input (even if we counted only the commas that are meant to be field separators). Each of your intended fields is wrapped in double quotes, and the strings "string 1" and string 1 are not the same. To look for a string containing quotes, you have to put escaped quotes in your match string. For example:

$5 == "\"string 1\""
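To see why this matters, here is a minimal sketch with a made-up one-field input: the field value, as awk sees it, includes the surrounding double quotes, so the match string must include them too.

```shell
# $1 is "string 1" INCLUDING the quotes, so a bare comparison fails:
printf '"string 1"\n' | awk -F',' '$1 == "string 1"'        # no output

# Escaping the quotes in the match string succeeds:
printf '"string 1"\n' | awk -F',' '$1 == "\"string 1\""'    # prints: "string 1"
```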

The output you got from the command:

awk -F"," '$5="string 1" {print}' file1.csv > file2.csv

may have seemed strange to you, but it is exactly what I would have expected. Note the difference between $5="string 1" and $5=="string 1". With a single equals sign, you set the 5th field to string 1 rather than testing whether the 5th field was string 1.
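A small sketch (toy input, not your file) makes the difference visible. Note that the assignment also forces awk to rebuild the record with OFS, which defaults to a single space:

```shell
# = assigns: the expression's value is "X" (truthy), so every line prints,
# and the record is rejoined with the default OFS (a space).
printf 'a,b,c,d,e\n' | awk -F',' '$5="X"'      # prints: a b c d X

# == compares: the line prints only if the test is true, and is left untouched.
printf 'a,b,c,d,e\n' | awk -F',' '$5=="e"'     # prints: a,b,c,d,e
```

That rebuilding is also why the commas disappeared from your output: the assignment made awk rejoin every line with the default output field separator.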

Try the following:

awk -F'"' 'toupper($10) == "STRING 1"' OFS='"' file1.csv

Using double quote as the field separator, $1 and $NF will be empty strings, other odd fields will be commas, and the even fields will be the data between pairs of double quotes. So $2 will be the data between the 1st pair of double quotes, $4 will be the data between the second pair of double quotes, ..., $10 will be the data between the 5th pair of double quotes, ...
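A shortened sample line (five quoted fields instead of your ten) shows that numbering:

```shell
# With FS='"', odd fields are the commas (or empty ends) and
# even fields are the data between each pair of double quotes.
printf '"12310","a, b","c, d","x","string 1"\n' |
awk -F'"' '{ printf "$2=[%s] $4=[%s] $10=[%s]\n", $2, $4, $10 }'
# prints: $2=[12310] $4=[a, b] $10=[string 1]
```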


If your awk version allows for multi-char-FS, try

awk -F'","' 'toupper($5)=="STRING 1"' OFS='","' file
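For example, against a shortened sample line (assumed from the question's format), the "," separator keeps the embedded commas inside their fields, so $5 is the bare 5th value without its surrounding quotes:

```shell
# $5 here is: string 1  (the leading and trailing " end up in $1 and $NF)
printf '"12310","a, b","x","y","string 1","USD"\n' |
awk -F'","' 'toupper($5)=="STRING 1"' OFS='","'
# prints the line unchanged, since no field is assigned to
```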