Help with cut or awk command

ATWC · July 8, 2014, 9:36am

i am trying to cut the 4th field from the file.
sample file

a1,a2,"a,3",a4,a5,a6
b1,"b,2",b3,b4,b5,b6
c1,"c,2","c,3",c4,c5,c6

i need the output in the format of

a1,a2,"a,3",a5,a6
b1,"b,2",b3,b5,b6
c1,"c,2","c,3",c5,c6

i tried using cut -d, -f1-4,6-
but i am getting the output as

a1,a2,"a,3",a5,a6
b1,"b,2",b3,b5,b6
c1,"c,2","c,c4,c5,c6

which is wrong. any other method to do this correctly?

Yoda · July 8, 2014, 11:59am

For your input file, try:-

awk -F, '{$(NF-2)=$(NF-1) FS $NF;NF-=2}1' OFS=, file

ATWC · July 8, 2014, 1:57pm

can you please explain how the command awk works?

mirni · July 8, 2014, 2:32pm

Tricky part is that commas may be present within the field (that's why the quotes)
Yoda's solution will not work for files that have more than 6 columns, or files that have commas in the last two fields. E.g. file

a1,a2,"a,3",a4,a5,a6
b1,"b,2",b3,b4,b5,"b6,b999",b10
c1,"c,2","c,3",c4,c5,c6

will become

$ awk -F, '{$(NF-2)=$(NF-1) FS $NF;NF-=2}1' OFS=, file
a1,a2,"a,3",a5,a6
b1,"b,2",b3,b4,b5,b999",b10
c1,"c,2","c,3",c5,c6

Probably not what you want.

I would suggest the following approach:

protect the comma within quotes by substituting it with some character
get rid of the 4th field
substitute the protect character with comma to get back the original

Try this solution:

sed  's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g' file  | awk -F, '{$4 = ""}1' OFS=, | sed 's/,,/,/g ; s/\x1C/,/g'

Explanation:

sed  's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g' file  # replace commas within quotes with \x1C (non-ascii character unlikely to be in your file)
awk -F, '{$4 = ""}1' OFS=,                        # strip the fourth field -- now it is safe to split them by commas
sed 's/,,/,/g ; s/\x1C/,/g'                       # get rid of double commas and replace \x1C back to comma for quoted commas

MadeInGermany · July 8, 2014, 2:37pm

It cuts fields from right to left.
Yet hard-coded to the 3rd field from right.
And not all awk versions empty the fields by modifying NF.
I have put a more general awk solution to your previous post that can be easily adapted.

SriniShoo · July 9, 2014, 1:09am

Below code will work to remove any field > 2. You need to provide the number of fields and number of the field to be removed

awk 'BEGIN{FS = OFS = "\""}
  {for(i=1; i<=NF; i+=2) {
    gsub(/,/, "\n", $i)}}1' file | awk '{
 if(NR == RM_FLD) {next};
 if(NR == FLD_CNT)
   {t = $0; getline;
   printf "%s\n%s", t, $0;
   NR = 1; f = 1; next}
 if(NR == 2 && f == 1)
   {printf "%s%s%s", ORS, $0, ORS;
   next}}1' ORS=',' RM_FLD=4 FLD_CNT=6

ATWC · July 9, 2014, 2:16pm

need to replace comma only inside the " "(double qoutes)
sample file.

a1,a2,"a,3,12",a4,a5,a6,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7",b8,"","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,c8,"","","c,9",c10,c11,"","",c12

needed output

a1,a2,"a312",a4,a5,a6,a7,"a8","","",a9,a10,a11,"","",a12
b1,"b2",b3,b4,b5,b6,"b7",b8,"","",b9,"",b11,"","",b11
c1,"c2","c3",c4,c5,c6,c7,c8,"","","c9",c10,c11,"","",c12

i am using the command sed 's/"\([^,]*\),\([^,]*\)"/"\1\x1C\2"/g'
getting the output as

a1,a2,"a312",a4,a5,a6,a7,"a8","""",a9,a10,a11,"""",a12
b1,"b2",b3,b4,b5,b6,"b7",b8,"""",b9,"",b11,"""",b11
c1,"c2","c3",c4,c5,c6,c7,c8,"""","c9",c10,c11,"""",c12

Scrutinizer · July 9, 2014, 2:33pm

To only replace commas inside double quotes, try:

awk '!(NR%2){gsub(/,/,"\x1C")}1' RS=\" ORS=\" file

Although I do not understand why "","" should be converted to """" (it would not work for that...)? Can you elaborate?

ATWC · July 9, 2014, 2:41pm

a1,a2,"a,3,we",a4,a5,a6,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7",b8,"","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,c8,"","","c,9",c10,c11,"","",c12

i want the to cut the 8th field from the file.
so i have used the below command to replace all the , with x1C and then remove the 8 the filed and replace x1C back with comma.

sed  's/"\([a-zA-Z0-9][^,]*\),\([^,]*\)"/"\1\x1C\2"/g'  file  | awk -F, '{$8 = ""}1' OFS=, | sed 's/,,/,/g ; s/\x1C/,/g'

if i do the above command and if i have more than 1 comma inside a double quotes its not accepting

a1,a2,"a,3,we",a4,a5,a7,"a,8","","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7","","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,"","","c,9",c10,c11,"","",c12

which is wrong.

RudiC · July 9, 2014, 3:10pm

Try this:

awk     '       {OFS=FS="\""; $0=$0; print NR, $COL, $(COL+1)
                 for (i=2; i<=NF; i+=2) gsub(/,/, "\001", $i)
                 OFS=FS=","; $0=$0; $COL=""; gsub("\001",",") 
                }
                1
        ' COL=8 file4

MadeInGermany · July 9, 2014, 3:28pm

I have fixed a bug in my previous script.
Adapted for the new ranges it becomes

awk -F, '
{
  # translate to array s[1],s[2],...
  j=1
  for (i=1; i<=NF; i++) {
    s[j]=$i
    if ($i~/"[^"]+$/) {
      for (k=i+1; k<=NF && $k!~/"/; k++) {
        s[j]=s[j] FS $k
      }
      s[j]=s[j] FS $k
      i=k
    }
    j++
  }
}
{
  sep=""
  for (i=1; i in s; i++) {
  # condition for printing the array elements
    if (i<=7 || i>=9) {
      printf "%s", sep s
      sep=FS
    }
  }
  print ""
}
' file

Output is

a1,a2,"a,3,we",a4,a5,a6,a7,"","",a9,a10,a11,"","",a12
b1,"b,2",b3,b4,b5,b6,"b,7","","",b9,"",b11,"","",b11
c1,"c,2","c,3",c4,c5,c6,c7,"","","c,9",c10,c11,"","",c12