How to delete a column/columns of a CSV file which has cell values with a string enclosed in " , "?

How can I delete a column from a CSV file which has comma separated value with a string enclosed in double quotes and a comma in between? I have a file 44.csv with 4 lines including the header like the below format:

column1, column2, column3, column 4, column5, column6
12,455,"string with quotes, and with a comma in between",4432,6787,890,88
4432,6787,"another, string with quotes, and with two comma in between",890,88,12,455
11,22,"simple string",77,777,333,22

I need to cut the 1,2,3 columns from the file, so I used the cut command as below

cut -d"," -f1,2,3 44.csv > 444.csv

I am getting the output as

column1, column2, column3
12,455,"string with quotes
4432,6787,"another string with quotes
11,22,"simple string"

But I need the output to be

column1, column2, column3
12,455,"string with quotes, and with a comma in between"
4432,6787,"another, string with quotes, and with two comma in between"
11,22,"simple string"

Any help is greatly appreciated.

Thanks

Please use code tags for posting data samples.

Here is an awk approach that might work for your data:

awk -F, 'NR==1{NF-=3}NF>1{sub(/",.*/,"\"")}1' OFS=, file

Thanks Yoda, I am new to UNIX and did not know about the CODE TAGS. I will make sure to use them on my next questions... However, the code worked, but can you please explain the code... How do I select other columns other than a given range... Lets say, I need to select columns which are not consecutive... I appreciate your help...

awk -F, '                               # Set comma as input field separator
        NR == 1 {                       # If NR (total records read) == 1
                NF -= 3                 # Subtract 3 from NF (number of fields in current record)
        }
        NF > 1 {                        # If NR (total records read) > 1
                sub (/",.*/, "\"")      # Substitute " followed by , followed by .* (0 or more occurrence of any char) with "
        }
        1                               # 1 == true (if true, default awk action is print)
' OFS=, file                            # Set comma as output field separator
2 Likes

Thanks Yoda for the details in explaining the code, can you please answer the second part of my question too...

Yoda's proposal is working fine if the quoted column is the last one you want to keep. Try this for an arbitrary last column:

awk     '       {for (i=2; i<=NF; i+=2) {                               #  every second field is one inside double quotes
                         gsub (/,/, "\001", $i)                         # replace every comma in quoted field by something
                        }
                 split ($0, TMP, ",")                                   # get the comma separated fields into an temp array
                 $0=""
                 for (i=1; i<=COLS; i++) $0=$0 ($0?",":"") TMP       # rebuild $0 from this array using only COLS columns
                 gsub ("\001", ",")                                     # replace something back to commas
                }
         1                                                              # print new $0 string
        ' FS="\"" OFS="\"" COLS=4 file                                  # parameter COLS has number of columns to keep

This is not yet what you requested in your last post, to keep any arbitrary column, but you can use it as a starting point...

EDIT: try this to keep arb. columns:

awk     'FNR==1 {CNT=split(COLS, CNo, ",")}                             # get desired columns into array
                {for (i=2; i<=NF; i+=2) {                               # every second field is one inside double quotes
                         gsub (/,/, "\001", $i)                         # replace every comma in quoted field by something
                        }
                 split ($0, TMP, ",")                                   # now get the comma separated fields into an temp array
                 $0=""
                 for (i=1; i<=CNT; i++) $0=$0 ($0?",":"") TMP[CNo]   # rebuild $0 from this array using only COLS columns
                 gsub ("\001", ",")                                     # replace something back to commas
                }
         1                                                              # print new $0 string
        ' FS="\"" OFS="\"" COLS=1,3,6 file                              # parameter COLS has number of columns to keep
column1, column3, column6
455,12,890
4432,"another, string with quotes, and with two comma in between",12
11,"simple string",333

(in line 2, I moved the quoted string to field 2)

1 Like

Thank you RudiC for explaining the code, it works perfectly for any given columns. Cheers