[Solved] Data manipulation

Hello Team,

I need your help. I have a file that has two columns. See the sample below:

105550	0.28
105550	0.24
125550	0.28
125550	0.24
215650	0.28
215650	0.24
315550	0.28
315550	0.24
335550	0.28
335550	0.24
40555	0.21
40555	0.17
415550	0.21
415550	0.17
43555	0.21
43555	0.17
45555	0.21
45555	0.17
46555	0.21
46555	0.17
47554	0.21
47554	0.17
100650	0.22
100650	0.18
102850	0.22
102850	0.18
120650	0.22
120650	0.18
122850	0.22
122850	0.18
130650	0.22

The file is named aaaa.csv; column 1 is the Prefix and column 2 is the Rate. As you can see above, there are multiple duplicate Prefixes with different rates. What I would like is a single line per prefix, keeping only the highest of its rates. See below for the before and after of the output I would like to have.

Before:

105550	0.28
105550	0.24
125550	0.28
125550	0.24
215650	0.28
215650	0.24
315550	0.28
315550	0.24
335550	0.28
335550	0.24
40555	0.21
40555	0.17
415550	0.21
415550	0.17
43555	0.21
43555	0.17
45555	0.21
45555	0.17
46555	0.21
46555	0.17
47554	0.21
47554	0.17
100650	0.22
100650	0.18
102850	0.22
102850	0.18
120650	0.22
120650	0.18
122850	0.22
122850	0.18
130650	0.22

After

105550	0.28
215650	0.28
315550	0.28
40555	0.21
415550	0.21
43555	0.21
46555	0.21
47554	0.21
100650	0.22
120650	0.22
122850	0.22
130650	0.22
awk '{a[$1]=$2>a[$1]?$2:a[$1]}END{for(b in a){print b,a[b]}}'
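
That keeps a running maximum per prefix and prints each prefix once at the end. Run against the file, it might look like this (a sketch; the tab output separator is an assumption, since awk's default is a space, and the for-in loop prints in arbitrary order):

awk 'BEGIN { OFS = "\t" }                    # tab-separate the output
     { a[$1] = ($2 > a[$1]) ? $2 : a[$1] }   # keep the larger rate seen so far
     END { for (b in a) print b, a[b] }' aaaa.csv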

You want the highest field 2 value for each unique field 1? (If so, your output data appears to be missing a few rows, e.g. for 335550)

You could do something like:

awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates[i] } }' aaaa.csv

Note: This does not preserve the ordering or the fixed-width formatting.
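If the input ordering and the tab separation matter, a variant along these lines should get them back (a sketch, assuming you want the prefixes in first-seen input order):

awk 'BEGIN { OFS = "\t" }
     !($1 in rates) { order[++n] = $1 }    # remember first-seen order of prefixes
     $2 > rates[$1] { rates[$1] = $2 }     # keep the highest rate per prefix
     END { for (j = 1; j <= n; j++) print order[j], rates[order[j]] }' aaaa.csv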


Are some numbers missing from your output?

$ sort -u -r -n <input file>
415550 0.21
335550 0.28
315550 0.28
215650 0.28
130650 0.22
125550 0.28
122850 0.22
120650 0.22
105550 0.28
102850 0.22
100650 0.22
47554 0.21
46555 0.21
45555 0.21
43555 0.21
40555 0.21

OR

$ sort -u -n <input file>
40555 0.21
43555 0.21
45555 0.21
46555 0.21
47554 0.21
100650 0.22
102850 0.22
105550 0.28
120650 0.22
122850 0.22
125550 0.28
130650 0.22
215650 0.28
315550 0.28
335550 0.28
415550 0.21
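
With -u, which of the duplicate lines survives is nothing I would rely on; if you want to be explicit about keeping the highest rate, you could sort by prefix and then by rate descending and take the first line of each group (a sketch, assuming whitespace-separated columns):

$ sort -k1,1n -k2,2nr aaaa.csv | awk '!seen[$1]++'   # first line per prefix = highest rate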


Hello Carlo and Lucas,

I did as you advised, but I still get duplicates. Have a look below:

[paxk@util1-pkl ~]$ grep awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates } }' aaaa.csv > bbb.csv

[paxk@util1-pkl ~]$ grep 105550 bbb.csv
 105550,0.28
 105550,0.24

---------- Post updated at 03:11 PM ---------- Previous update was at 03:07 PM ----------

ni2, thank you, it worked.

Where did the commas come from?
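
Also, the command you pasted starts with grep rather than awk, so the awk program would not actually have run against aaaa.csv. The intended invocation would be along these lines (a sketch based on the awk suggestion above):

awk '{ if ($2 > rates[$1]) { rates[$1]=$2 } } END { for (i in rates) { print i OFS rates[i] } }' aaaa.csv > bbb.csv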