How to delete a column/columns of a CSV file which has cell values with a string enclosed in " , "?

dhruuv369 · June 20, 2013, 7:36pm

How can I delete a column from a CSV file which has comma separated value with a string enclosed in double quotes and a comma in between? I have a file 44.csv with 4 lines including the header like the below format:

column1, column2, column3, column 4, column5, column6
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22

I need to cut the 1,2,3 columns from the file, so I used the cut command as below

cut -d"," -f1,2,3 44.csv > 444.csv

I am getting the output as

column1, column2, column3
12,455,"string with quotes
4432,6787,"another string with quotes
11,22,"simple string"

But I need the output to be

column1, column2, column3
12,455,"string with quotes, and with a comma in between"
4432,6787,"another, string with quotes, and with two comma in between"
11,22,"simple string"

Any help is greatly appreciated.

Thanks

Yoda · June 20, 2013, 7:54pm

Please use code tags for posting data samples.

Here is an awk approach that might work for your data:

awk -F, 'NR==1{NF-=3}NF>1{sub(/",.*/,"\"")}1' OFS=, file

dhruuv369 · June 24, 2013, 9:20am

Thanks Yoda, I am new to UNIX and did not know about the CODE TAGS. I will make sure to use them on my next questions... However, the code worked, but can you please explain the code... How do I select other columns other than a given range... Lets say, I need to select columns which are not consecutive... I appreciate your help...

Yoda · June 24, 2013, 12:11pm

awk -F, '                               # Set comma as input field separator
        NR == 1 {                       # If NR (total records read) == 1
                NF -= 3                 # Subtract 3 from NF (number of fields in current record)
        }
        NF > 1 {                        # If NR (total records read) > 1
                sub (/",.*/, "\"")      # Substitute " followed by , followed by .* (0 or more occurrence of any char) with "
        }
        1                               # 1 == true (if true, default awk action is print)
' OFS=, file                            # Set comma as output field separator

dhruuv369 · June 24, 2013, 12:50pm

Thanks Yoda for the details in explaining the code, can you please answer the second part of my question too...

RudiC · June 24, 2013, 1:57pm

Yoda's proposal is working fine if the quoted column is the last one you want to keep. Try this for an arbitrary last column:

awk     '       {for (i=2; i<=NF; i+=2) {                               #  every second field is one inside double quotes
                         gsub (/,/, "\001", $i)                         # replace every comma in quoted field by something
                        }
                 split ($0, TMP, ",")                                   # get the comma separated fields into an temp array
                 $0=""
                 for (i=1; i<=COLS; i++) $0=$0 ($0?",":"") TMP       # rebuild $0 from this array using only COLS columns
                 gsub ("\001", ",")                                     # replace something back to commas
                }
         1                                                              # print new $0 string
        ' FS="\"" OFS="\"" COLS=4 file                                  # parameter COLS has number of columns to keep

This is not yet what you requested in your last post, to keep any arbitrary column, but you can use it as a starting point...

EDIT: try this to keep arb. columns:

awk     'FNR==1 {CNT=split(COLS, CNo, ",")}                             # get desired columns into array
                {for (i=2; i<=NF; i+=2) {                               # every second field is one inside double quotes
                         gsub (/,/, "\001", $i)                         # replace every comma in quoted field by something
                        }
                 split ($0, TMP, ",")                                   # now get the comma separated fields into an temp array
                 $0=""
                 for (i=1; i<=CNT; i++) $0=$0 ($0?",":"") TMP[CNo]   # rebuild $0 from this array using only COLS columns
                 gsub ("\001", ",")                                     # replace something back to commas
                }
         1                                                              # print new $0 string
        ' FS="\"" OFS="\"" COLS=1,3,6 file                              # parameter COLS has number of columns to keep
column1, column3, column6
455,12,890
4432,"another, string with quotes, and with two comma in between",12
11,"simple string",333

(in line 2, I moved the quoted string to field 2)

dhruuv369 · June 25, 2013, 10:36am

Thank you RudiC for explaining the code, it works perfectly for any given columns. Cheers