AWK removing away needed double quotes.

pinnacle · May 7, 2009, 11:13am

The below code is to convert csv file to pipe delimited.
It replaces comma with pipe if it is not in double quotes; If comma is in double quotes it doesnot replace the comma with a pipe.
The code works fine except it eat away the double quotes in the output file.

BEGIN { FS=SUBSEP; OFS="|" }
{
  result = setcsv($0, ",")
  print
}
# setcsv(str, sep) - parse CSV (MS specification) input
# str, the string to be parsed. (Most likely $0.)
# sep, the separator between the values.
#
# After a call to setcsv the parsed fields are found in $1 to $NF.
# setcsv returns 1 on sucess and 0 on failure.
#
# By Peter Strvmberg aka PEZ.
# Based on setcsv by Adrian Davis. Modified to handle a separator
# of choice and embedded newlines. The basic approach is to take the
# burden off of the regular expression matching by replacing ambigious
# characters with characters unlikely to be found in the input. For
# this the characters "\035".
#
# Note 1. Prior to calling setcsv you must set FS to a character which
#         can never be found in the input. (Consider SUBSEP.)
# Note 2. If setcsv can't find the closing double quote for the string
#         in str it will consume the next line of input by calling
#         getline and call itself until it finds the closing double
#         qoute or no more input is available (considered a failiure).
# Note 3. Only the "" representation of a literal quote is supported.
# Note 4. setcsv will probably missbehave if sep used as a regular
#         expression can match anything else than a call to index()
#         would match.
#
function setcsv(str, sep, i) {
  gsub(/""/, "\035", str)
  gsub(sep, FS, str)
  while (match(str, /"[^"]*"/)) {
    middle = substr(str, RSTART+1, RLENGTH-2)
    gsub(FS, sep, middle)
    str = sprintf("%.*s%s%s", RSTART-1, str, middle,
      substr(str, RSTART+RLENGTH))
  }
  if (index(str, "\"")) {
    return ((getline) > 0) ? setcsv(str (RT != "" ? RT : RS) $0, sep) : !setcsv(str "\"", sep)
  } else {
    gsub(/\035/, "\"", str)
    $0 = str
    for (i = 1; i <= NF; i++)
      if (match($i, /^"+$/))
        $i = substr($i, 2)
    $1 = $1 ""
    return 1
  }
}

Appreciate help.

devtakh · May 7, 2009, 11:51am

Try this:

awk 'BEGIN{FS1=",";FS2="\""}$0 !~ /"/{FS=FS1;gsub(/,/,"|");print}
$0 ~ /"/{FS=FS2;gsub(/,"/,"|\"");gsub(/",/,"\"|");print}' file

cheers,
Devaraj Takhellambam

pinnacle · May 7, 2009, 1:44pm

The code works only for the file i gave earlier.
If the file has this kind of data.

it produces output

Required output

The code i pasted earlier works fine but is eating away the double quotes and is versatile for any file

Thanks

devtakh · May 7, 2009, 1:50pm

Ok. I got it.

pinnacle · May 7, 2009, 3:17pm

I got this on unix.com

nawk -F"\"" '{for(i=1;i<=NF;i++){if(i%2)gsub(",","|",$i)}}1' OFS= file1

I modified this to

nawk -F"\"" 'BEGIN{OFS="\""}{for(i=1;i<=NF;i++){if(i%2)gsub(",","|",$i)}}1'  file

this works fine.
Would appreciate someone can interpret this.

{if(i%2)gsub(",","|",$i)

Thanks

devtakh · May 7, 2009, 11:40pm

the fiels seperator is made as "
you will loop through each fields in the record.
if the field no is divisible by 2, then dont make the substitute for , with a |
print the record.

Lets take an example:
camel,Unix,12345,12345,12345,"This is a awk tesing. It allows for tesing awk, sed,cat and sort.","name,age,thing",time is,"one,two""
In the above, the BLUE color text are odd no fields, whereas the red ones are the even fields. so the if (i%2) will substitute the , to a | only in the odd no fields.

Hope this helps.

cheers,
Devaraj Takhellambam

ghostdog74 · May 7, 2009, 11:56pm

if you have Python, here's an alternative

import csv
writer = csv.writer(open("output.csv","w"), delimiter="|")
filename = "file"
reader = csv.reader(open(filename))
for row in reader:    
    writer.writerow(row)

output:

# more file
filed1,field2,field3
ABC,"hello,world",time
ABC,"hello,america",time1
XYZ,"god,bless",time2
INJ,are,youthere
# ./test.py
# more output.csv
filed1|field2|field3
ABC|hello,world|time
ABC|hello,america|time1
XYZ|god,bless|time2
INJ|are|youthere