Find duplicates in column 1 and merge their lines (awk?)

falcox · January 28, 2013, 11:58am

Hi,

I have a file (sorted with sort) with 8 tab-delimited columns. The first column contains duplicated values, and I need to merge all lines that share the same first-column value into a single line.

My input file:

comp100002	aaa	bbb	ccc	ddd	eee	fff	ggg
comp100003	aba	aba	aba	aba	aba	aba	aba
comp100003	fff	fff	fff	fff	fff	fff	fff
comp100004	xxx	xyz	xyz	xxx	xyz	xxx	xyz

My desired output file:

comp100002	aaa	bbb	ccc	ddd	eee	fff	ggg
comp100003	aba	aba	aba	aba	aba	aba	aba	fff	fff	fff	fff	fff	fff	fff
comp100004	xxx	xyz	xyz	xxx	xyz	xxx	xyz

Thanks for any advice.

rdrtx1 · January 28, 2013, 12:29pm

try:

awk '
!(a[$1]) {a[$1]=$0}
a[$1] {w=$1; $1=""; a[w]=a[w] $0}
END {for (i in a) print a[i]}
' FS="\t" OFS="\t" infile

falcox · January 28, 2013, 2:26pm

Thanks a lot, it prints the desired results. However, if there is a single-copy identifier in field 1, it appends the whole line twice. It's easy to get rid of these 8 additional columns, but since I am learning, could you please explain which part of the code is responsible for this?

Scrutinizer · January 28, 2013, 2:52pm

try:

awk 'p!=$1{if(p)print s; p=s=$1} {sub(p,x); s=s $0} END{if(p)print s}' FS='\t' file
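For readers learning awk, the one-liner above can be unpacked roughly as follows. This is a sketch rather than Scrutinizer's exact code: the function name merge_sorted is hypothetical, and substr() is used in place of sub(p,x) so that the key is never interpreted as a regular expression. Like the original, it assumes the input is already sorted on column 1 and that the keys are non-empty strings such as comp100003.

```shell
# merge_sorted FILE: hypothetical wrapper; FILE must be sorted on column 1.
merge_sorted() {
  awk -F '\t' '
    $1 != p {                 # first column differs from the previous key
      if (p != "") print s    # flush the merged line built for the old key
      p = s = $1              # start a new merged line holding only the key
    }
    { s = s substr($0, length($1) + 1) }  # append the rest of the line, leading tab included
    END { if (p != "") print s }          # flush the last group
  ' "$1"
}

# Usage with a cut-down version of the sample (only two data columns).
sample=$(mktemp)
printf 'comp100002\taaa\tbbb\ncomp100003\taba\taba\ncomp100003\tfff\tfff\n' > "$sample"
merge_sorted "$sample"
rm -f "$sample"
```

Because only the current group is held in memory, this approach keeps the input order and works on large files, but it gives wrong results if identical keys are not on adjacent lines.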

rdrtx1 · January 28, 2013, 3:00pm

Fixed, try:

awk '
!(a[$1]) {a[$1]=$0; next}
a[$1] {w=$1; $1=""; a[w]=a[w] $0}
END {for (i in a) print a[i]}
' FS="\t" OFS="\t" infile
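On the earlier question of which part was responsible: in the first version both pattern-action rules run on the same record. The first rule stores the line, which makes a[$1] non-empty, so the second rule matches straight away and appends the fields of that same line again. With next, awk moves on to the following record and the second rule is skipped for a first occurrence. A small sketch to see the difference; the function names and the three-line sample are made up for illustration, and sort is added only because for (i in a) visits keys in an unspecified order.

```shell
# Version without "next": both rules fire on the first occurrence of a key.
merge_without_next() {
  awk '
    !(a[$1]) {a[$1]=$0}
    a[$1] {w=$1; $1=""; a[w]=a[w] $0}
    END {for (i in a) print a[i]}
  ' FS="\t" OFS="\t" "$1" | sort
}

# Version with "next": a first occurrence is stored once, then the record is done.
merge_with_next() {
  awk '
    !(a[$1]) {a[$1]=$0; next}
    a[$1] {w=$1; $1=""; a[w]=a[w] $0}
    END {for (i in a) print a[i]}
  ' FS="\t" OFS="\t" "$1" | sort
}

# Illustrative data: k1 appears once, k2 appears twice.
demo=$(mktemp)
printf 'k1\tx\ty\nk2\tp\tq\nk2\tr\ts\n' > "$demo"
echo "without next:"; merge_without_next "$demo"
echo "with next:";    merge_with_next "$demo"
rm -f "$demo"
```

Note that without next the doubling affects the first line of every key, not only single-copy identifiers; it is just easiest to spot on the single-copy lines.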

falcox · January 28, 2013, 3:20pm

Thanks guys. I checked with diff, and the results of both scripts are now identical.