Remove duplicate values with condition

jiam912 · June 20, 2014, 4:24pm

Hi Gents,

Could you please help me get the desired output?

In the first column I have some duplicate records. The rule is to reject the duplicate record and keep the last occurrence. There is one exception: if the last occurrence has the value 14 or 98 in column 3 and >25 or <200 in column 4, I should keep the first occurrence and reject the last one.

Sometimes a record has a single entry with the value 14 or 98 in column 3, or a value >25 or <200 in column 4. If the entry occurs only once, I need to keep it and not reject it.

Here is my input file:

2265520807        1        1       13     1186
2265520807        2        1       14     1186
2265520809        1        1        9     1186
2265520809        2        1       10     1186
2265520811        1        1        9     1186
2265520833        1        1        2     1186
2265520833        2       14        2     1186
2265520835        1        1        2     1186
2265520837        1       14        4     1186
2265520837        2        1        4     1186
2265520841        1        1        2     1186
2265520849        1        1        1     1186
2265520849        2       14    85423     1186
2266320807        2        1        8     1186
2266320809        1        1        1     1186
2266320809        2        1       57     1186
2266320825        0        0        0        0
2266320825        2        1        2     1186
2266320833        1        1        1     1186
2266320841        1        1        3     1186
2266320849        1       14    85223     1186
2266520729        1        1       10     1187
2266520805        1        1        1     1187
2266520805        2        1        3     1187
2267120963        1       98        7     1187
2267120967        1        1       15     1187
2267120969        1       98    85147     1187
2267120969        2        1        1     1187
2267120969        3       98    85147     1187

Using this code, I get the first duplicate entry:

awk 'X[$1] {print X[$1]}{ X[$1]=$0}' Input.txt

2265520807        1        1       13     1186
2265520809        1        1        9     1186
2265520833        1        1        2     1186
2265520837        1       14        4     1186
2265520849        1        1        1     1186
2266320809        1        1        1     1186
2266320825        0        0        0        0
2266520805        1        1        1     1187
2267120969        1       98    85147     1187
2267120969        2        1        1     1187
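
To make it easier to see why the one-liner above prints these lines, here is the same program spread over several lines with comments, run on a short excerpt of the input. This is only an illustrative sketch: the variable names `sample_prev` and `prev_out` are made up for the example, and the excerpt is single-spaced for brevity.

```shell
# Short excerpt of the input from the post (one unique key, two duplicated keys).
sample_prev='2265520807 1 1 13 1186
2265520807 2 1 14 1186
2265520811 1 1 9 1186
2267120969 1 98 85147 1187
2267120969 2 1 1 1187
2267120969 3 98 85147 1187'

prev_out=$(printf '%s\n' "$sample_prev" | awk '
  # X[$1] holds the previous line seen for this key. A non-empty string is true,
  # so this rule fires from the second occurrence onward and prints the previous line.
  X[$1] { print X[$1] }
  # Runs for every line: remember the current line as the latest one for this key.
  { X[$1] = $0 }
')
printf '%s\n' "$prev_out"
```

Each repeated key therefore prints every occurrence except its last one, which is why the key with three entries contributes two lines.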

But as I explained at the beginning, I would like to get something like this:

2265520807        1        1       13     1186
2265520809        1        1        9     1186
2265520833        2       14        2     1186
2265520837        1       14        4     1186
2265520849        2       14    85423     1186
2266320809        2        1       57     1186
2266320825        0        0        0        0
2266520805        1        1        1     1187
2267120969        1       98    85147     1187
2267120969        3       98    85147     1187

Thanks for your support

MadeInGermany · June 23, 2014, 11:10am

If I read the column 4 part of your condition as "or column 4 is >25 and <200", I can achieve your desired output with

awk '($1 in X) {if ($3==14 || $3==98 || ($4>25 && $4<200)) {print} else {print X[$1]}} {X[$1]=$0}' Input.txt
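
For readers who find the one-liner dense, the following is the same logic laid out over several lines with comments and run on a short excerpt of the posted input. It is a sketch for illustration only: the names `sample_cond` and `cond_out` are hypothetical, and the excerpt is single-spaced for brevity.

```shell
# Excerpt of the input from the thread: three duplicated keys and one unique key.
sample_cond='2265520807 1 1 13 1186
2265520807 2 1 14 1186
2265520833 1 1 2 1186
2265520833 2 14 2 1186
2266320809 1 1 1 1186
2266320809 2 1 57 1186
2266320849 1 14 85223 1186'

cond_out=$(printf '%s\n' "$sample_cond" | awk '
  # Runs only when this key (column 1) has been seen before.
  ($1 in X) {
    if ($3 == 14 || $3 == 98 || ($4 > 25 && $4 < 200))
      print            # current line matches the special condition: print it
    else
      print X[$1]      # otherwise print the previously stored line
  }
  # Runs for every line: remember it as the latest line for this key.
  { X[$1] = $0 }
')
printf '%s\n' "$cond_out"
```

Because the first rule only fires for keys that already exist in `X`, a key that occurs once (such as 2266320849 here) never prints anything, even when its single line has 14 in column 3.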

NB: a lookup with ($1 in X) is a little more efficient than X[$1].
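
One difference between the two forms that may be behind this remark: in awk, merely referencing X[key] creates that element if it does not exist yet, while (key in X) only tests for it. The small check below counts the elements of an initially empty array after each kind of lookup. The names `in_count` and `ref_count` are invented for this sketch.

```shell
# Membership test: does not create the element, so the array stays empty.
in_count=$(awk 'BEGIN { if ("k" in X) print "found"; n = 0; for (i in X) n++; print n }')

# Plain reference: the lookup itself creates X["k"] with an empty value.
ref_count=$(awk 'BEGIN { if (X["k"]) print "found"; n = 0; for (i in X) n++; print n }')

echo "elements after (k in X): $in_count"
echo "elements after X[k]:     $ref_count"
```

The two forms also test different things: X[$1] as a pattern checks whether the stored value is true (non-empty and non-zero), whereas ($1 in X) checks only whether the key exists. In the program above every key is assigned on each line anyway, so the practical gain is small.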

jiam912 · June 23, 2014, 6:22pm

Dear MadeInGermany
Thanks for your support