How do I format a Date field of a .CSV file with multiple commas in a string field?

dhruuv369 · October 15, 2013, 11:53pm

I have a .CSV file (file.csv) whose data are all enclosed in double quotes. Sample format of the file is as below:

column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

The 9th field is the date field in the format "DD-MMM-YY". I have to convert it to the format YYYY/MM/DD. I am trying to use the below code, but of no use.

awk -F, '
 BEGIN {
 split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
 for (i=1; i<=12; i++) mdigit[month]=i
 }
 { m=substr($9,4,3)
 $9 = sprintf("%02d/%02d/"20"%02d",mdigit[m],substr($9,1,2),substr($9,8,20))
 print
 }' OFS="," file.csv > temp_file.csv

The out put of the file temp_file.csv after executing the above code is as shown below.

column1,column2,column3,column4,column5,column6,column7,Column8,00/00/2000,Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1,00/00/2000,"890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455",00/00/2002, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3,00/00/2000,"333","22","13-OCT-11","232"

As far as I understand, the issue is with the commas in the double quote as my code is taking them into consideration too... Please suggest on the below questions:

1) Does the double quoting all the values in all the fields make any difference? If they make any difference, how do I get rid of them from all the values except the strings with commas in them?
2) Any modifications to my code so I could format the 9th field which in the format "DD-MMM-YYYY" to YYYY/MM/DD

CarloM · October 16, 2013, 9:49am

From here and your date code.

$ cat datefile.csv
column1,column2,column3,column4,column5,column6, column7, Column8, Column9, Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","11-OCT-11","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","12-OCT-11","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","13-OCT-11","232"

$ awk -f csv2.awk datefile.csv
column1,column2,column3,column4,column5,column6, column7, Column8,"00/00/2000", Column10
"12","B000QRIGJ4","4432","string with quotes, and with a comma, and colon: in between","4432","author1, name","890","88","10/11/2011","12"
"4432","B000QRIGJ4","890","another, string with quotes, and with more than, two commas: in between","455","author2, name","12","455","10/12/2011","55"
"11","B000QRIGJ4","77","string with, commas and (paranthesis) and : colans, in between","12","author3, name","333","22","10/13/2011","232"

$ cat csv2.awk
BEGIN {
        if (NUMCOLS == "") NUMCOLS=32
        if (DELIM == "") DELIM = "\t"
        if (REPL == "") REPL = "~"
         split("JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC", month, " ")
        for (i=1; i<=12; i++) mdigit[month]=i
}
{
        gsub(DELIM, REPL)
        $0 = gensub(/([^,])\"\"/, "\\1'", "g")
        out = ""
        n = length($0)
        for (i = 1;  i <= n;  i++) {
                if ((ch = substr($0, i, 1)) == "\"") {
                        inString = (inString) ? 0 : 1
                }
                out = out ((ch == "," && ! inString) ? DELIM : ch)
        }
        nfields=split(out,outfields,DELIM);

         # Do stuff with date
        m=substr(outfields[9],5,3)
        outfields[9] = sprintf("\"%02d/%02d/"20"%02d\"",mdigit[m],substr(outfields[9],2,2),substr(outfields[9],9,20))

        for (i=1;(i<=nfields)&&(i<=NUMCOLS);i++) {
                if (i > 1) {
                        printf (",");
                }
                printf ("%s", outfields);
        }
        printf ("\n");
}

(this CSV processing isn't mine, but I forget where I got it from)

blackrageous · October 16, 2013, 1:27pm

If input is in file x.x, then....

cat x.x | sed -e 's/,"/\|/g' | cut -f9 -d\| | tr -d \" | awk -F\- 'BEGIN{MON["JAN"] = "01"; MON["FEB"] = "02"; MON["MAR"] = "03"; MON["APR"] = "04"; MON["MAY"] = "05"; MON["JUN"] = "06"; MON["JUL"] = "07"; MON["AUG"] = "08"; MON["SEP"] = "09"; MON["OCT"] = "10"; MON["NOV"] = "11"; MON["DEC"] = "12"} { printf("20%s/%s/%s\n",$3,MON[$2],$1) }'

or create file x.awk with

BEGIN{MON["JAN"] = "01"; MON["FEB"] = "02"; MON["MAR"] = "03"; MON["APR"] = "04"; MON["MAY"] = "05"; MON["JUN"] = "06"; MON["JUL"] = "07"; MON["AUG"] = "08"; MON["SEP"] = "09"; MON[
"OCT"] = "10"; MON["NOV"] = "11"; MON["DEC"] = "12"} 
{ printf("20%s/%s/%s\n",$3,MON[$2],$1) }

and execute like

cat x.x | sed -e 's/,"/\|/g' | cut -f9 -d\| | tr -d \" | awk -F\- -fx.awk

---------- Post updated at 12:27 PM ---------- Previous update was at 11:47 AM ----------

Got the order of args in printf wrong, this should work...

cat x.x | sed -e 's/,"/\|/g' | cut -f9 -d\| | tr -d \" | awk -F\-  'BEGIN{MON["JAN"] = "01"; MON["FEB"] = "02"; MON["MAR"] = "03";  MON["APR"] = "04"; MON["MAY"] = "05"; MON["JUN"] = "06"; MON["JUL"] =  "07"; MON["AUG"] = "08"; MON["SEP"] = "09"; MON["OCT"] = "10";  MON["NOV"] = "11"; MON["DEC"] = "12"} {  printf("20%s/%s/%s\n",$1,MON[$2],$3) }'

dhruuv369 · October 16, 2013, 4:50pm

@CarloM, @Blackrageous Thank you for the help...