[Solved] Data manipulation

Hello Team,

I need your help. I have a file that has two columns. See the sample below:

105550	0.28
105550	0.24
125550	0.28
125550	0.24
215650	0.28
215650	0.24
315550	0.28
315550	0.24
335550	0.28
335550	0.24
40555	0.21
40555	0.17
415550	0.21
415550	0.17
43555	0.21
43555	0.17
45555	0.21
45555	0.17
46555	0.21
46555	0.17
47554	0.21
47554	0.17
100650	0.22
100650	0.18
102850	0.22
102850	0.18
120650	0.22
120650	0.18
122850	0.22
122850	0.18
130650	0.22

The file is named aaaa.csv; column 1 is the Prefix and column 2 is the Rate. As you can see above, there are multiple duplicate Prefixes with different rates. What I would like is a single line per prefix, keeping only the highest of its rates. See below for the before and after of the output I would like to have.

Before:

105550	0.28
105550	0.24
125550	0.28
125550	0.24
215650	0.28
215650	0.24
315550	0.28
315550	0.24
335550	0.28
335550	0.24
40555	0.21
40555	0.17
415550	0.21
415550	0.17
43555	0.21
43555	0.17
45555	0.21
45555	0.17
46555	0.21
46555	0.17
47554	0.21
47554	0.17
100650	0.22
100650	0.18
102850	0.22
102850	0.18
120650	0.22
120650	0.18
122850	0.22
122850	0.18
130650	0.22

After

105550	0.28
215650	0.28
315550	0.28
40555	0.21
415550	0.21
43555	0.21
46555	0.21
47554	0.21
100650	0.22
120650	0.22
122850	0.22
130650	0.22
awk '{a[$1]=$2>a[$1]?$2:a[$1]}END{for(b in a){print b,a[b]}}'
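
That keeps a running maximum per prefix and prints each prefix once at the end. Run against the file, it might look like this (a sketch; the tab output separator is an assumption, since awk's default is a space, and the for-in loop prints in arbitrary order):

awk 'BEGIN { OFS = "\t" }                    # tab-separate the output
     { a[$1] = ($2 > a[$1]) ? $2 : a[$1] }   # keep the larger rate seen so far
     END { for (b in a) print b, a[b] }' aaaa.csv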

You want the highest field 2 value for each unique field 1? (If so, your output data appears to be missing a few rows, e.g. for 335550)

You could do something like:

awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates[i] } }' aaaa.csv

Note: This does not preserve the ordering or the fixed-width formatting.
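If the input ordering and the tab separation matter, a variant along these lines should get them back (a sketch, assuming you want the prefixes in first-seen input order):

awk 'BEGIN { OFS = "\t" }
     !($1 in rates) { order[++n] = $1 }    # remember first-seen order of prefixes
     $2 > rates[$1] { rates[$1] = $2 }     # keep the highest rate per prefix
     END { for (j = 1; j <= n; j++) print order[j], rates[order[j]] }' aaaa.csv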


Are some numbers missing from your output?

$ sort -u -r -n <input file>
415550 0.21
335550 0.28
315550 0.28
215650 0.28
130650 0.22
125550 0.28
122850 0.22
120650 0.22
105550 0.28
102850 0.22
100650 0.22
47554 0.21
46555 0.21
45555 0.21
43555 0.21
40555 0.21

OR

$ sort -u -n <input file>
40555 0.21
43555 0.21
45555 0.21
46555 0.21
47554 0.21
100650 0.22
102850 0.22
105550 0.28
120650 0.22
122850 0.22
125550 0.28
130650 0.22
215650 0.28
315550 0.28
335550 0.28
415550 0.21
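
With -u, which of the duplicate lines survives is nothing I would rely on; if you want to be explicit about keeping the highest rate, you could sort by prefix and then by rate descending and take the first line of each group (a sketch, assuming whitespace-separated columns):

$ sort -k1,1n -k2,2nr aaaa.csv | awk '!seen[$1]++'   # first line per prefix = highest rate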


Hello Carlo and Lucas,

I did as you advised, but I still get duplicates. Have a look below:

[paxk@util1-pkl ~]$ grep awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates } }' aaaa.csv > bbb.csv

[paxk@util1-pkl ~]$ grep 105550 bbb.csv
 105550,0.28
 105550,0.24

---------- Post updated at 03:11 PM ---------- Previous update was at 03:07 PM ----------

ni2, thank you, it worked.

Where did the commas come from?
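
Also, the command you pasted starts with grep rather than awk, so the awk program would not actually have run against aaaa.csv. The intended invocation would be along these lines (a sketch based on the awk suggestion above):

awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates[i] } }' aaaa.csv > bbb.csv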