Merging non-repeating columns of lines

menenuh · February 9, 2010, 9:31am

Hello,
I have a file to work with. It has 5 columns. The first three together constitute the position. The 4th column contains some values for downstream analysis, and the 5th column contains values that I want to append to the 4th column (only when the lines share the same position).

My file looks like this:

chr3    10163261        10163262        A>R_32_32_50_22 rs71760202
chr3    10163295        10163296        A>R_28_28_50_20 rs71757232
chr3    10163295        10163296        A>R_28_28_50_20 rs71760202
chr3    10163306        10163307        T>Y_34_34_50_20 rs71757232
chr3    10163306        10163307        T>Y_34_34_50_20 rs71760202
chr3    10163306        10163307        T>Y_34_34_50_20 rs5030624

And I am trying to make it look like this:

chr3   10163261    10163262  A>R_32_32_50_22>rs71760202
chr3   10163295    10163296  A>R_28_28_50_20>rs71757232, rs71760202
chr3   10163306    10163307  T>Y_34_34_50_20>rs71757232, rs71760202, rs5030624

Any help / recommendation / pointer would be appreciated.
Cheers

ahmad.diab · February 9, 2010, 10:37am

Code:

nawk '{k=$1" "$2" "$3" "$4; a[k]=(k in a) ? a[k]", "$5 : $5}
END{for (i in a) print i">"a[i]}' infile.txt | sort -k1,1 -k2,2n > outfile.txt

In Perl:

perl -lane 'push @{$h{"@F[0..3]"}}, $F[4];
END{ foreach $k (sort keys %h) { print "$k>", join(", ", @{$h{$k}}) } }' infile.txt

Piece of cake ;)

menenuh · February 9, 2010, 10:44am

Thanks a lot, it works like a charm.

ahmad.diab · February 9, 2010, 10:47am

Which one do you like more, the nawk or the Perl code? :)

menenuh · February 9, 2010, 11:06am

I am a newbie in bash and every new thing seems like magic to me, so I liked the nawk version better, but I could use some explanation.

Scrutinizer · February 9, 2010, 12:42pm

Or, if the lines for each position are adjacent (as in your sample):

awk '{t=$5;$5="";sub(/ +$/,"");if(p!=$0){if(p!="")print p s;p=$0;s=">"t}else s=s", "t}END{if(p!="")print p s}' infile
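Since an explanation was asked for, here is a spelled-out sketch of the array approach with comments. It is only an illustration, not a replacement for the one-liners above: the function name merge_ids and the helper arrays ids and order are made up for this example, and it assumes whitespace-separated input with exactly 5 columns. It keeps groups in the order they first appear, so no sort is needed.

```shell
#!/bin/sh
# merge_ids: hypothetical helper name, not from the posts above.
# Joins the 5th column of all rows that share the same columns 1-4.
merge_ids() {
    awk '
    {
        key = $1 " " $2 " " $3 " " $4       # columns 1-4 identify a group
        if (!(key in ids)) {
            order[++n] = key                # remember the first-seen order
            ids[key] = $5                   # first ID for this group
        } else {
            ids[key] = ids[key] ", " $5     # append later IDs with ", "
        }
    }
    END {
        for (i = 1; i <= n; i++)            # print groups in input order
            print order[i] ">" ids[order[i]]
    }' "$@"
}
```

Usage would be something like: merge_ids infile.txt > outfile.txt. Unlike the single-pass version that compares each line with the previous one, this does not need rows for the same position to be adjacent, at the cost of holding all groups in memory.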