awk to parse comma separated field and removing comma in between number and double quotes

as7951 · April 13, 2018, 3:35am

Hi Experts,

Please support
I have below data in file in comma seperated, but 4th column is containing comma in between numbers, bcz of which when i tried to parse the file the column 6th value(5049641141) is being removed from the file and value(222.82) in column 5 becoming value of column6.

Below is command used

awk -F"," -v OFS="|" '{gsub(/\,/,"",$4);gsub(/\"/,"");gsub(/^[[:space:]]+|[[:space:]]+$/,"",$4);gsub(/^[[:space:]]+|[[:space:]]+$/,"",$5);print $1, $2, $3, $4, $5, $6}' input.csv > output.csv

Input file

Bill.Doc.,Item,Exch.Rate,    Net value, Tax amount,Sales Doc.
6400392211,10,1," 1,271.19 ", 228.82 ,5049641141
6400392212,10,1, 635.59 , 114.40 ,5049641143
6400392213,10,1, 635.59 , 114.40 ,5049641145
6400392214,10,1," 1,271.19 ", 228.82 ,5049641147

output file(from command mentioned above)

Bill.Doc. |Item|Exch.Rate|Net value|Tax amount|Sales Doc.
6400392211|10|1|1|271.19|228.82
6400392212|10|1|635.59|114.40|5049641143
6400392213|10|1|635.59|114.40|5049641145
6400392214|10|1|1|271.19|228.82
6400392215|10|1|0.01|-|5049641149
6400392216|10|1|4.23|0.76|5049641151

Expected ouptut

Bill.Doc. |Item|Exch.Rate|Net value|Tax amount|Sales Doc.
6400392211|10|1|1271.19|228.82|5049641141
6400392212|10|1|635.59|114.40|5049641143
6400392213|10|1|635.59|114.40|5049641145
6400392214|10|1|1271.19|228.82|5049641147
6400392215|10|1|0.01|-|5049641149
6400392216|10|1|4.23|0.76|5049641151

RudiC · April 13, 2018, 4:27am

There are several methods to do what you need, many of them have been published in these forums.

Try

awk -F, -vOFS="|" '
match ($0, /"[^"]*"/)   {T1 = T2 = substr ($0, RSTART, RLENGTH)
                         gsub (/[ ",]/, _, T2)
                         sub (T1, T2)
                        }
                        {$1 = $1
                        }
1
' file
Bill.Doc.|Item|Exch.Rate|    Net value| Tax amount|Sales Doc.
6400392211|10|1|1271.19| 228.82 |5049641141
6400392212|10|1| 635.59 | 114.40 |5049641143
6400392213|10|1| 635.59 | 114.40 |5049641145
6400392214|10|1|1271.19| 228.82 |5049641147

RudiC · April 13, 2018, 5:40am

Or,

awk -F\" '
                {for (i=2; i<= NF; i+=2) gsub (",", _, $i)
                 gsub (" *\|*,\|* *", "|")
                }
1
' file

drl · April 14, 2018, 2:09pm

Hi.

There are a number of software collections and applications that know how to deal with CSV (sometimes TSV, and more generally DSV, Delimiter Separated Files) Many allow the delimiter to be specified in case it is not a comma or a TAB.

Here is one, csv2tsv , that simply replaces the input separator with the desired output separator, each of which can be specified to be other that the defaults, comma -> TAB. This snippet:

# Utility functions: print-as-echo, print-line-with-visual-space.
pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
pl() { pe;pe "-----" ;pe "$*"; }

pl " Input data file $FILE:"
head $FILE

pl " Expected output:"
cat $E

pl " Results:"
csv2tsv -c "," -t '|' $FILE |
sed '2,$s/[ ,]//g'

producing:

-----
 Input data file data1:
Bill.Doc.,Item,Exch.Rate,    Net value, Tax amount,Sales Doc.
6400392211,10,1," 1,271.19 ", 228.82 ,5049641141
6400392212,10,1, 635.59 , 114.40 ,5049641143
6400392213,10,1, 635.59 , 114.40 ,5049641145
6400392214,10,1," 1,271.19 ", 228.82 ,5049641147

-----
 Expected output:
Bill.Doc. |Item|Exch.Rate|Net value|Tax amount|Sales Doc.
6400392211|10|1|1271.19|228.82|5049641141
6400392212|10|1|635.59|114.40|5049641143
6400392213|10|1|635.59|114.40|5049641145
6400392214|10|1|1271.19|228.82|5049641147

-----
 Results:
Bill.Doc.|Item|Exch.Rate|    Net value| Tax amount|Sales Doc.
6400392211|10|1|1271.19|228.82|5049641141
6400392212|10|1|635.59|114.40|5049641143
6400392213|10|1|635.59|114.40|5049641145
6400392214|10|1|1271.19|228.82|5049641147

The sed just removes the extraneous commas and spaces from the numeric data. You can modify the header as desired -- I could not see how the spacing there was supposed to be handled.

Here are some details about the code:

csv2tsv Convert Comma-separated file to tab-separated file (tsv). (doc)
Path    : ~/executable/csv2tsv
Version : v1.1.14
Type    : ELF 64-bit LSB executable, x86-64, version 1 (GNU ...)
Home    : https://github.com/eBay/tsv-utils-dlang (doc)

And here are some alternate items for dealing with CSV / TSV/ DSV files:

CSV, comma separated value file (comma typical, it can be almost anything, DSV)

        0) cut -d, / awk -F, / perl -a -F,

        1) csvfix
           https://neilb.bitbucket.io/csvfix/ (checked  2017.03)

        2) csvtool
           https://github.com/Chris00/ocaml-csv (checked 2017.03)

        3) Text::CSV -- perl module, DIY

        4) crush, Google collection
           http://crush-tools.googlecode.com/files/crush-tools-2013-04.tar.gz

        5) csvkit, suite of utilities for converting to and working with CSV
           https://github.com/wireservice/csvkit

        6) agate, Python DIY
           https://github.com/wireservice/agate

        7) xlrd, extract data from Microsoft Excel (tm) spreadsheet files
           Includes runxlrd, a small utility to extract data
           Debian repository, and https://github.com/python-excel/xlrd
           Mentioned as one alternative in:
           http://www.python-excel.org/

        8) tsv, CLI tools for large, tabular data files
           filtering, statistics, sampling, joins, etc.
           https://github.com/eBay/tsv-utils-dlang

Here are a few other refrences to look over:

Comma-separated values - Wikipedia
CSV application support - Wikipedia
RFC 4180 - Common Format and MIME Type for Comma-Separated Values (CSV) Files

Best wishes ... cheers, drl