Combine Columns

greycells · December 9, 2015, 5:29pm

Input

 
NJ090237_0263_GRP,NJ090237_0263_VIEW,NJ090237_0263_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090237_0264_GRP,NJ090237_0263_VIEW,NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0263_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6

Basically when column 6 is same in the input file , combine $1,$2,$3,$4,$5 using a ";"
but if the value in any of the columns 1-5 is similar , just use unique value .

gawk '
  {
    i=$6
    p=(i in A)
  } 
  NR==FNR {
    A=A (p?";":x) $1
    B=B (p?";":x) $2
    C=C (p?";":x) $3
    D=D (p?";":x) $4
    E=E (p?";":x) $5
    next
  } 
  p {
    $1=A
    $2=B
    $3=C
    $4=D
    $5=E
    delete A
    delete B
    delete C
    delete D
    delete E
    print
  }
' FS=, OFS=, input1 input1

I am getting this output .. its combining unique values also

NJ090237_0263_GRP;NJ090237_0264_GRP,NJ090237_0263_VIEW;NJ090237_0263_VIEW,NJ090237_0263_PSGRP;NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP;NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0;06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP;NJ090233_0263_GRP,NJ090233_0263_VIEW;NJ090233_0263_VIEW,NJ090233_0263_PSGRP;NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP;NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0;06E:0_08E:0_09E:0_11E:0,0CE6

but output needed is

NJ090237_0263_GRP;NJ090237_0264_GRP,NJ090237_0263_VIEW,NJ090237_0263_PSGRP;NJ090237_0264_PSGRP,NJ090237_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE5
NJ090233_0263_GRP,NJ090233_0263_VIEW,NJ090233_0263_PSGRP;NJ090233_0264_PSGRP,NJ090233_0263_GOLD_CSGRP,06E:0_08E:0_09E:0_11E:0,0CE6

Don_Cragun · December 9, 2015, 5:37pm

Are all lines with the same values in the last field always adjacent in your input file (as they are in your example)?

Does the output order matter?

What operating system and shell are you using?

greycells · December 9, 2015, 5:44pm

They may or may not be adjacent .
The output order does not matter .
Linux ,bourne ,gawk ...

Thanks

Don_Cragun · December 9, 2015, 6:10pm

Try:

gawk '
{	A[i = $6]
	for(j = 1; j <= 5; j++)
		if(!((i, j, $j) in d)) {
			d[i, j, $j]
			outd[i, j] = outd[i, j] ? outd[i, j] ";" $j : $j
		}
}
END {	for(i in A)
		print outd[i,1], outd[i,2], outd[i,3], outd[i,4], outd[i,5], i
}' FS=, OFS=, input1

If someone else wants to try this on a Solaris system and gawk isn't loaded, change gawk to /usr/xpg4/bin/awk or nawk .

greycells · December 9, 2015, 8:08pm

Thanks Don ! .. worked as expected ...
I have another question for u .. , not related to above though

Basically wherever there is a ";" in front of or after a field , change it to "[Field]"

 
 Example 
  
 F1;F2;F3,F4,F5,F6;F7,F8,F9;F10;F11

etc

to

[F1][F2][F3],F4,F5,[F6][F7],F8,[F9][F10][F11]

Thanks

Don_Cragun · December 9, 2015, 8:20pm

Do you want to take the output produced by my earlier script and transform it into this new output format?

Or, do you want to change my earlier script to produce this new output format instead of the old output format?

greycells · December 9, 2015, 11:25pm

The first one if you please .... thx

Don_Cragun · December 9, 2015, 11:32pm

Try:

gawk '
{	for(i = 1; i <= NF; i++)
		if((n = split($i, f, ";")) > 1) {
			o = ""
			for(j = 1; j <= n; j++)
				o = o "[" f[j] "]"
			$i = o
		}
}
1' FS=, OFS=, out

where out is a file containing output in the format produced by the 1st script.