rleal
January 26, 2009, 5:12pm
1
Hi all:
Let's suppose I have a file like this (but with many more records).
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
That is, a sequence of records, each composed of a header line, a list of data lines, and an end-of-record delimiter ('**').
I'd like to:
1. retain the unique data, that is, exclude duplicate records; records should be compared on fields 5, 6, 7, 9 and 10 of their header lines;
2. list ALL the duplicates (for further examination).
In the example above, it should return:
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
for the unique records, and
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
for the dupes. Is there a simple way to achieve this?
Thanks,
r.-
Ygor
January 27, 2009, 12:39am
2
Try...
gawk 'BEGIN{RS="\\*\\*\n+";ORS="**\n"}
NR==FNR{a[$5,$6,$7,$9,$10]++;next}
{print $0 > FILENAME "." (a[$5,$6,$7,$9,$10]==1?"uniq":"dupe")}' file file
Tested...
$ head -1000 file.*
==> file.dupe <==
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
==> file.uniq <==
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
$
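For readers unfamiliar with the idioms in the gawk program above (two passes over the same file via NR==FNR, and a regex record separator, which is a gawk extension), here is the same two-pass logic spelled out in Python. This is an illustration only, not part of the original answer; the function and variable names are my own.

```python
def split_records(text):
    """Split the file into '**'-terminated records, mirroring RS="\\*\\*\\n+"."""
    return [r.strip("\n") for r in text.split("**\n") if r.strip()]

def classify(text):
    """Two passes, like the gawk program: first count each key,
    then route records whose key occurs once to 'uniq', the rest to 'dupe'."""
    records = split_records(text)
    counts = {}
    for rec in records:                       # pass 1: count each key
        f = rec.split()                       # awk-style whitespace splitting;
        key = (f[4], f[5], f[6], f[8], f[9])  # $5,$6,$7,$9,$10 (0-indexed here)
        counts[key] = counts.get(key, 0) + 1
    uniq, dupe = [], []
    for rec in records:                       # pass 2: route by count
        f = rec.split()
        key = (f[4], f[5], f[6], f[8], f[9])
        (uniq if counts[key] == 1 else dupe).append(rec + "\n**")
    return uniq, dupe
```

Note that splitting the whole record on whitespace works because the header is the first line, so $1..$16 are always the header fields, exactly as in the awk version.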
rleal
January 27, 2009, 1:52pm
3
Great, that did it seamlessly! I have tried to understand how this piece of code works, but it's far beyond my skills.
Now, let me go a step further. If I have this list of duplicate records:
58 JH 0 650 1996 6 14 4b 60.000 -6.250 783 0000 28 4 7600 6
950.0 938.9 -9.9000 34.9112
972.0 960.6 -9.9000 34.9117
**
RU P5 0 94 1993 4 28 4b 60.000 -5.500 878 0000 15 6 7600 5
606.0 599.4 7.5300 35.1760 6.591 0.990
758.0 749.5 0.8000 34.9130 7.074 1.020
**
58 JH 0 650 1996 6 14 4c 60.000 -6.250 783 0000 98 4 7600 6
962.0 950.7 -9.9000 34.9108
972.0 960.6 -9.9000 34.9117
**
90 AM 264 9854 1990 4 18 3c 60.000 -7.002 483 0001 42 4 7600 7
394.0 389.9 6.8000 35.1780
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
06 AZ 290 1741 1996 7 9 3c 60.000 -6.845 489 0001 45 4 7600 6
420.0 415.6 8.7735 35.2983
430.0 425.5 8.7678 35.2970
439.0 434.4 8.7582 35.2979
**
XX UN 104 2267 1999 10 2 3u 60.420 -8.580 485 0001 5 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
XX IN 104 2286 1999 10 2 3u 60.420 -8.580 485 0001 6 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 XX 10251 9893 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
505.5 500.0 7.9600 35.2200
596.7 590.0 6.5200 35.1600
**
74 SC 1335 74 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
404.3 400.0 8.3900 35.2400
505.5 500.0 7.1800 35.1900
596.7 590.0 6.5200 35.1600
**
90 P5 12461 2819 1993 4 28 6b 60.000 -5.500 878 0000 15 6 7600 5
606.8 600.0 7.5300 35.1800 6.390 0.990
758.8 750.0 0.8000 34.9100 6.850 1.020
**
06 AZ 10389 5882 1996 7 9 6c 60.000 -6.845 489 0000 50 4 7600 6
427.6 423.0 8.7777 35.2983
436.7 432.0 8.7670 35.2970
443.8 439.0 8.7582 35.2979
**
58 GS 3233 869 1990 4 18 6c 60.000 -7.002 483 0000 42 4 7600 7
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
I want to retain only one (or more) of the dupes, sending the rest to another file. The criteria for choosing which record to retain would be:
if the second characters of $8 in the two headers differ, retain both;
else retain the one with the greater first character of $8;
else retain the one with the greater $13;
else retain the one whose $1 matches XX;
else retain the one whose $1 matches UN.
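Taken literally, these tie-break rules can be sketched as a comparison function. This is an illustration in Python, not a tested solution; the name `pick` and the 0-based indices ($8 → h[7], $13 → h[12], $1 → h[0]) are my own, and the rules may still need adjusting against the real data.

```python
def pick(h1, h2):
    """Decide between two duplicate headers (each a list of fields).
    Returns 'both', 1 or 2, applying the stated rules in order.
    A sketch of the criteria as written, not a verified implementation."""
    if h1[7][1] != h2[7][1]:                 # second characters of $8 differ
        return 'both'
    if h1[7][0] != h2[7][0]:                 # greater first character of $8
        return 1 if h1[7][0] > h2[7][0] else 2
    if int(h1[12]) != int(h2[12]):           # greater $13
        return 1 if int(h1[12]) > int(h2[12]) else 2
    if (h1[0] == 'XX') != (h2[0] == 'XX'):   # prefer $1 matching XX
        return 1 if h1[0] == 'XX' else 2
    if (h1[0] == 'UN') != (h2[0] == 'UN'):   # then $1 matching UN
        return 1 if h1[0] == 'UN' else 2
    return 1                                 # otherwise keep the first
```

A grouping pass over the same five-field key used earlier would feed pairs of headers into this function; the loser of each comparison goes to the rejects file.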
In this case the output should be something like:
58 JH 0 650 1996 6 14 4b 60.000 -6.250 783 0000 28 4 7600 6
950.0 938.9 -9.9000 34.9112
972.0 960.6 -9.9000 34.9117
**
RU P5 0 94 1993 4 28 4b 60.000 -5.500 878 0000 15 6 7600 5
606.0 599.4 7.5300 35.1760 6.591 0.990
758.0 749.5 0.8000 34.9130 7.074 1.020
**
58 JH 0 650 1996 6 14 4c 60.000 -6.250 783 0000 98 4 7600 6
962.0 950.7 -9.9000 34.9108
972.0 960.6 -9.9000 34.9117
**
90 AM 264 9854 1990 4 18 3c 60.000 -7.002 483 0001 42 4 7600 7
394.0 389.9 6.8000 35.1780
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
06 AZ 290 1741 1996 7 9 3c 60.000 -6.845 489 0001 45 4 7600 6
420.0 415.6 8.7735 35.2983
430.0 425.5 8.7678 35.2970
439.0 434.4 8.7582 35.2979
**
XX IN 104 2286 1999 10 2 3u 60.420 -8.580 485 0001 6 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 SC 1335 74 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
404.3 400.0 8.3900 35.2400
505.5 500.0 7.1800 35.1900
596.7 590.0 6.5200 35.1600
**
and the rejected:
90 P5 12461 2821 1993 4 28 6b 60.000 -6.500 458 0000 13 6 7600 6
303.2 300.0 8.0500 35.2200 6.290 0.860
404.3 400.0 7.9900 35.2100 6.280 0.890
460.0 455.0 7.5400 35.1800 6.360 0.910
**
06 AZ 10389 5882 1996 7 9 6c 60.000 -6.845 489 0000 50 4 7600 6
427.6 423.0 8.7777 35.2983
436.7 432.0 8.7670 35.2970
443.8 439.0 8.7582 35.2979
**
58 GS 3233 869 1990 4 18 6c 60.000 -7.002 483 0000 42 4 7600 7
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
XX UN 104 2267 1999 10 2 3u 60.420 -8.580 485 0001 5 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 XX 10251 9893 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
505.5 500.0 7.9600 35.2200
596.7 590.0 6.5200 35.1600
**
I hope you can help me. Thanks,
r.-
Ygor
January 27, 2009, 7:12pm
4
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
rleal
January 28, 2009, 5:30pm
5
Thanks. I understand and appreciate your help.
Regs,
r.-