Combine Columns

Input

 
NJ090237_0263_GRP,NJ090237_0263_VIEW,NJ090237_0263_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090237_0264_GRP,NJ090237_0263_VIEW,NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0263_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6
 

Basically when column 6 is same in the input file , combine $1,$2,$3,$4,$5 using a ";"
but if the value in any of the columns 1-5 is similar , just use unique value .

gawk '
  {
    i=$6
    p=(i in A)
  } 
  NR==FNR {
    A=A (p?";":x) $1
    B=B (p?";":x) $2
    C=C (p?";":x) $3
    D=D (p?";":x) $4
    E=E (p?";":x) $5
    next
  } 
  p {
    $1=A
    $2=B
    $3=C
    $4=D
    $5=E
    delete A
    delete B
    delete C
    delete D
    delete E
    print
  }
' FS=, OFS=, input1 input1 

I am getting this output .. its combining unique values also

NJ090237_0263_GRP;NJ090237_0264_GRP,NJ090237_0263_VIEW;NJ090237_0263_VIEW,NJ090237_0263_PSGRP;NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP;NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0;06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP;NJ090233_0263_GRP,NJ090233_0263_VIEW;NJ090233_0263_VIEW,NJ090233_0263_PSGRP;NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP;NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0;06E:0_08E:0_09E:0_11E:0,0CE6

but output needed is

NJ090237_0263_GRP;NJ090237_0264_GRP,NJ090237_0263_VIEW,NJ090237_0263_PSGRP;NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0263_PSGRP;NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6

Are all lines with the same values in the last field always adjacent in your input file (as they are in your example)?

Does the output order matter?

What operating system and shell are you using?

They may or may not be adjacent .
The output order does not matter .
Linux ,bourne ,gawk ...

Thanks

Try:

gawk '
{	A[i = $6]
	for(j = 1; j <= 5; j++)
		if(!((i, j, $j) in d)) {
			d[i, j, $j]
			outd[i, j] = outd[i, j] ? outd[i, j] ";" $j : $j
		}
}
END {	for(i in A)
		print outd[i,1], outd[i,2], outd[i,3], outd[i,4], outd[i,5], i
}' FS=, OFS=, input1

If someone else wants to try this on a Solaris system and gawk isn't loaded, change gawk to /usr/xpg4/bin/awk or nawk .

1 Like

Thanks Don ! .. worked as expected ...
I have another question for u .. , not related to above though

Basically wherever there is a ";" in front of or after a field , change it to "[Field]"

 
 Example 
  
 F1;F2;F3,F4,F5,F6;F7,F8,F9;F10;F11 
 
etc 

to

[F1][F2][F3],F4,F5,[F6][F7],F8,[F9][F10][F11]

Thanks

Do you want to take the output produced by my earlier script and transform it into this new output format?

Or, do you want to change my earlier script to produce this new output format instead of the old output format?

The first one if you please .... thx

Try:

gawk '
{	for(i = 1; i <= NF; i++)
		if((n = split($i, f, ";")) > 1) {
			o = ""
			for(j = 1; j <= n; j++)
				o = o "[" f[j] "]"
			$i = o
		}
}
1' FS=, OFS=, out

where out is a file containing output in the format produced by the 1st script.

1 Like