Help with cut or awk command

i am trying to cut the 4th field from the file.
sample file

a1,a2,"a,3",a4,a5,a6
b1,"b,2",b3,b4,b5,b6
c1,"c,2","c,3",c4,c5,c6

i need the output in the format of

a1,a2,"a,3",a5,a6
b1,"b,2",b3,b5,b6
c1,"c,2","c,3",c5,c6

i tried using cut -d, -f1-4,6-
but i am getting the output as

a1,a2,"a,3",a5,a6
b1,"b,2",b3,b5,b6
c1,"c,2","c,c4,c5,c6

which is wrong. any other method to do this correctly?

For your input file, try:-

awk -F, '{$(NF-2)=$(NF-1) FS $NF;NF-=2}1' OFS=, file
1 Like

can you please explain how the command awk works?

Tricky part is that commas may be present within the field (that's why the quotes)
Yoda's solution will not work for files that have more than 6 columns, or files that have commas in the last two fields. E.g. file

a1,a2,"a,3",a4,a5,a6
b1,"b,2",b3,b4,b5,"b6,b999",b10
c1,"c,2","c,3",c4,c5,c6

will become

$ awk -F, '{$(NF-2)=$(NF-1) FS $NF;NF-=2}1' OFS=, file
a1,a2,"a,3",a5,a6
b1,"b,2",b3,b4,b5,b999",b10
c1,"c,2","c,3",c5,c6

Probably not what you want.

I would suggest the following approach:

  1. protect the comma within quotes by substituting it with some character
  2. get rid of the 4th field
  3. substitute the protect character with comma to get back the original

Try this solution:

sed  's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g' file  | awk -F, '{$4 = ""}1' OFS=, | sed 's/,,/,/g ; s/\x1C/,/g' 

Explanation:

sed  's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g' file  # replace commas within quotes with \x1C (non-ascii character unlikely to be in your file)
awk -F, '{$4 = ""}1' OFS=,                        # strip the fourth field -- now it is safe to split them by commas
sed 's/,,/,/g ; s/\x1C/,/g'                       # get rid of double commas and replace \x1C back to comma for quoted commas

It cuts fields from right to left.
Yet hard-coded to the 3rd field from right.
And not all awk versions empty the fields by modifying NF.
I have put a more general awk solution to your previous post that can be easily adapted.

Below code will work to remove any field > 2. You need to provide the number of fields and number of the field to be removed

awk 'BEGIN{FS = OFS = "\""}
  {for(i=1; i<=NF; i+=2) {
    gsub(/,/, "\n", $i)}}1' file | awk '{
 if(NR == RM_FLD) {next};
 if(NR == FLD_CNT)
   {t = $0; getline;
   printf "%s\n%s", t, $0;
   NR = 1; f = 1; next}
 if(NR == 2 && f == 1)
   {printf "%s%s%s", ORS, $0, ORS;
   next}}1' ORS=',' RM_FLD=4 FLD_CNT=6

need to replace comma only inside the " "(double qoutes)
sample file.

a1,a2,"a,3,12",a4,a5,a6,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7",b8,"","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,c8,"","","c,9",c10,c11,"","",c12

needed output

a1,a2,"a312",a4,a5,a6,a7,"a8","","",a9,a10,a11,"","",a12
b1,"b2",b3,b4,b5,b6,"b7",b8,"","",b9,"",b11,"","",b11
c1,"c2","c3",c4,c5,c6,c7,c8,"","","c9",c10,c11,"","",c12

i am using the command sed 's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g'
getting the output as

a1,a2,"a312",a4,a5,a6,a7,"a8","""",a9,a10,a11,"""",a12
b1,"b2",b3,b4,b5,b6,"b7",b8,"""",b9,"",b11,"""",b11
c1,"c2","c3",c4,c5,c6,c7,c8,"""","c9",c10,c11,"""",c12

To only replace commas inside double quotes, try:

awk '!(NR%2){gsub(/,/,"\x1C")}1' RS=\" ORS=\" file

Although I do not understand why "","" should be converted to """" (it would not work for that...)? Can you elaborate?

1 Like
a1,a2,"a,3,we",a4,a5,a6,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7",b8,"","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,c8,"","","c,9",c10,c11,"","",c12

i want the to cut the 8th field from the file.
so i have used the below command to replace all the , with x1C and then remove the 8 the filed and replace x1C back with comma.

sed  's/"\([a-zA-Z0-9][^,]*\),\([^,]*\)"/"\1\x1C\2"/g'  file  | awk -F, '{$8 = ""}1' OFS=, | sed 's/,,/,/g ; s/\x1C/,/g'

if i do the above command and if i have more than 1 comma inside a double quotes its not accepting

a1,a2,"a,3,we",a4,a5,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7","","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,"","","c,9",c10,c11,"","",c12

which is wrong.

Try this:

awk     '       {OFS=FS="\""; $0=$0; print NR, $COL, $(COL+1)
                 for (i=2; i<=NF; i+=2) gsub(/,/, "\001", $i)
                 OFS=FS=","; $0=$0; $COL=""; gsub("\001",",") 
                }
                1
        ' COL=8 file4

I have fixed a bug in my previous script.
Adapted for the new ranges it becomes

awk -F, '
{
  # translate to array s[1],s[2],...
  j=1
  for (i=1; i<=NF; i++) {
    s[j]=$i
    if ($i~/"[^"]+$/) {
      for (k=i+1; k<=NF && $k!~/"/; k++) {
        s[j]=s[j] FS $k
      }
      s[j]=s[j] FS $k
      i=k
    }
    j++
  }
}
{
  sep=""
  for (i=1; i in s; i++) {
  # condition for printing the array elements
    if (i<=7 || i>=9) {
      printf "%s", sep s
      sep=FS
    }
  }
  print ""
}
' file

Output is

a1,a2,"a,3,we",a4,a5,a6,a7,"","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7","","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,"","","c,9",c10,c11,"","",c12