Duplicate values merge

Dear Gents,

Please can you help me to solve this problem.

Input file...

22057485  ,219 ,1050
22057485  ,223 ,1050
21897425  ,278 ,1050
21897425  ,279 ,1050
21897425  ,287 ,1050
20497465  ,602 ,1051
20517500  ,677 ,1051
20517500  ,681 ,1051
20577555  ,775 ,1052
20577555  ,778 ,1052
20357560  ,778 ,1052
20357560  ,780 ,1052
23717535  ,794 ,1053
23717535  ,805 ,1053
23657530  ,797 ,1053
23657530  ,798 ,1053
23657530  ,799 ,1053

I would like to get something like it:

output file

1050  22057485    219    223    
1050  21897425    278    279    287
1051  20497465    602    603    605
1051  20517500    677    681    
1052  20577555    775    778    
1052  20357560    778    780    
1053  23717535    794    805    
1053  23657530    797    798    799

Thanks in advance

If you don't mind the order of the output:

awk -F' *, *' '{c[$3 OFS $1]=c[$3 OFS $1]""?c[$3 OFS $1] OFS $2:$2}
END{for(i in c) print i,c}' OFS='\t' file
1 Like

Thanks a lot its works perfect. :b:

Try:

awk -F' *,' 'p!=$1{if(p)print s; s=$3 OFS $1; p=$1}{s=s OFS $2} END{print s}' OFS='\t' file
1 Like

@Scrutinizer
A very nice solution. I did use nearly one hour to study this simple work to find out how it works. I do admire how you guys manage to find this clever simple solution to the problems.

I just like to explain how this script work, so I have written it some more readable.

awk -F' *,' '		#1	
p!=$1{			#2
	if(p) print s;	#3
	s=$3 OFS $1;	#4
	p=$1}		#5
{s=s OFS $2} 		#6
END {print s}' \	#7
OFS='\t' file		#8

#1 Setting the Field separator to one or more spaced followed by a comma ' *,'

Run on line one 22057485 ,219 ,1050
$1=22057485 $2=219 $3=1050
#2 test if p is different form $1 , and it is since p=0 (no data)
#3 test if p contains data, no, p is blank, do not print.
#4 set s=$3 OFS $1 s="1050 22057485"
#5 p=$1=22057485
#6 s=s OFS $2 s="1050 22057485 219"
Run on line two 22057485 ,223 ,1050
$1=22057485 $2=223 $3=1050
#2 test if p is different form $1 , and it equal p=22057485 $1=22057485
Jump to #6
#6 s=s OFS $2 s="1050 22057485 219 223"
Run on line three 21897425 ,278 ,1050
$1=21897425 $2=278 $3=1050
#2 test if p is different form $1 , and it is since p=22057485 $1=21897425
#3 test if p contains data, yes print s 1050 22057485 219 223
#4 set s=$3 OFS $1 s="1050 21897425"
#5 p=$1=21897425
#6 s=s OFS $2 s="1050 21897425 278"
Run on line four
.
.
.
#7 END Last job, print the last line print s
#8 setts the Output Field Separator to tab OFS='\t'

2 Likes

Thanks to everybody for your great job.

---------- Post updated at 02:27 PM ---------- Previous update was at 02:37 AM ----------

Gents,

please other thing

1050  22057485    219    223
1050  21897425    278    279    287 
1051  20497465    602    603    605 
1051  20517500    677    681     
1052  20577555    775    778     
1052  20357560    778    780     
1053  23717535    794    805     
1053  23657530    797    798    799

How i can count the total of values only from the 4 column to the end.

In this case the total of values will be 11.

How I can get this value..?.

Thanks for your help

awk '{i+=NF-3} END {print i}' infile
11

For me this gives 11 not 15 numbers?

1050  22057485    219    223
1050  21897425    278    279    287 
1051  20497465    602    603    605 
1051  20517500    677    681     
1052  20577555    775    778     
1052  20357560    778    780     
1053  23717535    794    805     
1053  23657530    797    798    799

Or did I understand this incorrect.

1 Like

Hi Jotne

Thanks for your quick answer... Yes it is 11 not 15... Great job

Thanks:b: