How to delete the commas in a .CSV file that are enclosed in a string with double quotes?

Okay, I would like to delete all the commas in a .CSV file (TEST.CSV) or at least substitute them with empty space, that are enclosed in double quote.

Please see the sample file as below:

column 1,column 2,column 3,column 4,column 5,column 6,column 7,column 8,column 9,column 10
12310,42324564756,"a simple string with a , comma","string with or, without commas",string 1,USD,12,70%,08/01/2013,
23455,12312255564,"string, with, multiple, commas","string with or, without commas",string 2,USD,433,70%,07/15/2013,
23525,74535243123,"string , with commas, and - hypens and: semicolans","string with or, without commas",string 1,CAND,744,70%,05/06/2013,
46476,15467534544,"lengthy string, with commas, multiple: colans","string with or, without commas",string 2,CAND,388,70%,09/21/2013,

I was trying the below code but, it is deleting all the commas in the file:

awk -F'"' '{gsub (/,/,"\001",$0)}1' OFS='"' TEST.CSV

Using either SED or AWK I am expecting the output as below:

column 1,column 2,column 3,column 4,column 5,column 6,column 7,column 8,column 9,column 10
12310,42324564756,"a simple string with a  comma","string with or without commas",string 1,USD,12,70%,08/01/2013,
23455,12312255564,"string with multiple commas","string with or without commas",string 2,USD,433,70%,07/15/2013,
23525,74535243123,"string with commas and - hypens and: semicolans","string with or without commas",string 1,CAND,744,70%,05/06/2013,
46476,15467534544,"lengthy string with commas multiple: colans","string with or without commas",string 2,CAND,388,70%,09/21/2013,
awk -F'"' -v OFS='"' '{for(i=2;i<NF;i+=2) gsub(",", "", $i)}1'
1 Like

@in2nix4life: The code works great, however, I will really appreciate if you could please help me understand the for condition? like, why was the i assigned 2 as the initial value and also, the variable was assigned at the beginning of the awk, and how is this used in the later part of the code?

By setting FS and OFS to a double quote character ( -F'"' -v OFS='"' ), in2nix4life told awk to use the double quote character as the field separator when lines are being read from standard input file and written to standard output. This causes the text before the 1st double quote to be treated as field 1, the text between the 1st and 2nd double quotes as the 2nd field, etc. So odd numbered fields contain data outside of the double quoted strings and even numbered fields correspond to data inside the double quoted strings.

The for loop:

for(i=2;i<NF;i+=2)
        gsub(",", "", $i)

calls the global substitution function ( gsub ) to change all occurrences of a comma ( "," ) to an empty string ( "" ) in field i ( $i ) for even numbered fields (start with 2 [ i=2 ], increment the field number by 2 at the end of processing each time through the loop [ i+=2 ] and continue processing as long as i is less than the number of fields on the current line [ i<NF ]).

2 Likes

Wrong solution sorry :o

 perl -pe 's/\"(.+?),(.+?)\"/\"$1$2\"/g'  filename

@Don Cragun: Great explanation, no one could have done it better than this... Thank you so much.

@rk4k: please share your thoughts. Do you mean this approach does not work in all the cases? Because, I find the above code given by in2nix4life works in my case... Please let us know...

@pravin27: thanks for providing a perl code too... however, I assume, it is only substituting the first occurrence of comma and replacing it with a blank space. Your code gives the output as:

column 1,column 2,column 3,column 4,column 5,column 6,column 7,column 8,column 9,column 10
12310,42324564756,"a simple string with a  comma","string with or without commas",string 1,USD,12,70%,08/01/2013,
23455,12312255564,"string with, multiple, commas","string with or without commas",string 2,USD,433,70%,07/15/2013,
23525,74535243123,"string  with commas, and - hypens and: semicolans","string with or without commas",string 1,CAND,744,70%,05/06/2013,
46476,15467534544,"lengthy string with commas, multiple: colans","string with or without commas",string 2,CAND,388,70%,09/21/2013,

Try :

Modified version of your old thread on csv Remove the values from a certain column without deleting the Column name in a .CSV file Akshay Hegde - Shell Programming and Scripting - Unix Linux Forums

awk '
      {
  	$0 = $0 ","                           
  	while($0)                             
  		{
    		  match($0, / *"[^"]*" *,|[^,]*,/)    
    		  temp = substr($0, RSTART, RLENGTH) 
    		  gsub(/^ *?|? *,$/, "", temp)  
    		  gsub(/,/," ",temp)
		  gsub(/([, ])$/,x,temp)
    		  s = s ? s "," temp : temp
    		  $0 = substr($0, RLENGTH + 1)  
  		}
		  $0 = s
		  s  = ""
      }1
    ' file

Resulting

column 1,column 2,column 3,column 4,column 5,column 6,column 7,column 8,column 9,column 10
12310,42324564756,"a simple string with a   comma","string with or  without commas",string 1,USD,12,70%,08/01/2013,
23455,12312255564,"string  with  multiple  commas","string with or  without commas",string 2,USD,433,70%,07/15/2013,
23525,74535243123,"string   with commas  and - hypens and: semicolans","string with or  without commas",string 1,CAND,744,70%,05/06/2013,
46476,15467534544,"lengthy string  with commas  multiple: colans","string with or  without commas",string 2,CAND,388,70%,09/21/2013,

You could also use the csv parse/create functions I posted for you here: Remove the double quotes from a .CSV file except for values enclosed in double quotes | Unix Linux Forums | Shell Programming and Scripting

Calling them like this:

{
  num_fields = csv_parse($0, csv, ",", "\"", "\"", "\n", 1)
  for(i=1; i<=num_fields; i++) gsub(/,/," ",csv)
  print csv_create(csv, num_fields)
}

Note it also removes the quotes, as they aren't needed as the fields don't contain commas any more:

$ awk -f remove_commas.awk infile3
column 1,column 2,column 3,column 4,column 5,column 6,column 7,column 8,column 9,column 10
12310,42324564756,a simple string with a   comma,string with or  without commas,string 1,USD,12,70%,08/01/2013,
23455,12312255564,string  with  multiple  commas,string with or  without commas,string 2,USD,433,70%,07/15/2013,
23525,74535243123,string   with commas  and - hypens and: semicolans,string with or  without commas,string 1,CAND,744,70%,05/06/2013,
46476,15467534544,lengthy string  with commas  multiple: colans,string with or  without commas,string 2,CAND,388,70%,09/21/2013,