rleal
January 26, 2009, 5:12pm
1
Hi all:
Let's suppose I have a file like this (but with many more records).
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
That is, a sequence of records, each composed of a header line, a list of data lines, and an end-of-record delimiter ('**').
I'd like to:
1. retain the unique data, that is, exclude duplicate records; records should be compared on fields 5, 6, 7, 9 and 10 of their header lines;
2. list ALL the duplicates (for further examination).
In the example above, it should return:
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
for the unique records, and
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
for the dupes. Is there a simple way to achieve this?
Thanks,
r.-
Ygor
January 27, 2009, 12:39am
2
Try...
gawk 'BEGIN{RS="\\*\\*\n+";ORS="**\n"}
NR==FNR{a[$5,$6,$7,$9,$10]++;next}
{print $0 > FILENAME "." (a[$5,$6,$7,$9,$10]==1?"uniq":"dupe")}' file file
Tested...
$ head -1000 file.*
==> file.dupe <==
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
969.8 958.4 3.6320 34.8630
985.5 973.9 3.6130 34.8600
998.7 986.9 3.6070 34.8610
1003.6 991.7 3.6240 34.8660
**
XX ME 342 8688 2006 7 6 3c 60.029 -38.568 2901 0001 74 4 7603 8
1.6 1.6 8.9330 34.9230
13.5 13.4 8.4880 34.9200
**
==> file.uniq <==
XX ME 342 8689 2006 7 6 3c 60.065 -38.617 2890 0001 74 4 7603 8
960.9 949.6 3.6020 34.8580
976.5 965.0 3.5870 34.8580
991.6 979.9 3.5800 34.8580
1002.8 990.9 3.5760 34.8580
1003.9 992.0 3.5760 34.8590
**
XX ME 342 9690 2006 7 7 3c 60.100 -38.669 2876 0001 74 4 7603 8
975.3 963.8 3.5820 34.8580
992.3 980.6 3.5660 34.8570
1003.3 991.4 3.5640 34.8580
1004.4 992.5 3.5630 34.8590
**
$
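For readers unfamiliar with the idioms in the gawk program above (two passes over the same file via NR==FNR, and a regex record separator, which is a gawk extension), here is the same two-pass logic spelled out in Python. This is an illustration only, not part of the original answer; the function and variable names are my own.

```python
def split_records(text):
    """Split the file into '**'-terminated records, mirroring RS="\\*\\*\\n+"."""
    return [r.strip("\n") for r in text.split("**\n") if r.strip()]

def classify(text):
    """Two passes, like the gawk program: first count each key,
    then route records whose key occurs once to 'uniq', the rest to 'dupe'."""
    records = split_records(text)
    counts = {}
    for rec in records:                       # pass 1: count each key
        f = rec.split()                       # awk-style whitespace splitting;
        key = (f[4], f[5], f[6], f[8], f[9])  # $5,$6,$7,$9,$10 (0-indexed here)
        counts[key] = counts.get(key, 0) + 1
    uniq, dupe = [], []
    for rec in records:                       # pass 2: route by count
        f = rec.split()
        key = (f[4], f[5], f[6], f[8], f[9])
        (uniq if counts[key] == 1 else dupe).append(rec + "\n**")
    return uniq, dupe
```

Note that splitting the whole record on whitespace works because the header is the first line, so $1..$16 are always the header fields, exactly as in the awk version.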
rleal
January 27, 2009, 1:52pm
3
Great, that did it seamlessly! I have tried to understand how this piece of code works, but it's far beyond my skills.
Now, let me go a step further. If I have this list of duplicate records:
58 JH 0 650 1996 6 14 4b 60.000 -6.250 783 0000 28 4 7600 6
950.0 938.9 -9.9000 34.9112
972.0 960.6 -9.9000 34.9117
**
RU P5 0 94 1993 4 28 4b 60.000 -5.500 878 0000 15 6 7600 5
606.0 599.4 7.5300 35.1760 6.591 0.990
758.0 749.5 0.8000 34.9130 7.074 1.020
**
58 JH 0 650 1996 6 14 4c 60.000 -6.250 783 0000 98 4 7600 6
962.0 950.7 -9.9000 34.9108
972.0 960.6 -9.9000 34.9117
**
90 AM 264 9854 1990 4 18 3c 60.000 -7.002 483 0001 42 4 7600 7
394.0 389.9 6.8000 35.1780
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
06 AZ 290 1741 1996 7 9 3c 60.000 -6.845 489 0001 45 4 7600 6
420.0 415.6 8.7735 35.2983
430.0 425.5 8.7678 35.2970
439.0 434.4 8.7582 35.2979
**
XX UN 104 2267 1999 10 2 3u 60.420 -8.580 485 0001 5 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
XX IN 104 2286 1999 10 2 3u 60.420 -8.580 485 0001 6 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 XX 10251 9893 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
505.5 500.0 7.9600 35.2200
596.7 590.0 6.5200 35.1600
**
74 SC 1335 74 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
404.3 400.0 8.3900 35.2400
505.5 500.0 7.1800 35.1900
596.7 590.0 6.5200 35.1600
**
90 P5 12461 2819 1993 4 28 6b 60.000 -5.500 878 0000 15 6 7600 5
606.8 600.0 7.5300 35.1800 6.390 0.990
758.8 750.0 0.8000 34.9100 6.850 1.020
**
06 AZ 10389 5882 1996 7 9 6c 60.000 -6.845 489 0000 50 4 7600 6
427.6 423.0 8.7777 35.2983
436.7 432.0 8.7670 35.2970
443.8 439.0 8.7582 35.2979
**
58 GS 3233 869 1990 4 18 6c 60.000 -7.002 483 0000 42 4 7600 7
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
I want to retain only one (or more) of the dupes, sending the rest to another file. The criteria for choosing which record to retain would be:
if the second characters of $8 in the two headers differ, retain both;
else retain the one with the greater first character of $8;
else retain the one with the greater $13;
else retain the one whose $1 matches XX;
else retain the one whose $1 matches UN.
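Taken literally, these tie-break rules can be sketched as a comparison function. This is an illustration in Python, not a tested solution; the name `pick` and the 0-based indices ($8 → h[7], $13 → h[12], $1 → h[0]) are my own, and the rules may still need adjusting against the real data.

```python
def pick(h1, h2):
    """Decide between two duplicate headers (each a list of fields).
    Returns 'both', 1 or 2, applying the stated rules in order.
    A sketch of the criteria as written, not a verified implementation."""
    if h1[7][1] != h2[7][1]:                 # second characters of $8 differ
        return 'both'
    if h1[7][0] != h2[7][0]:                 # greater first character of $8
        return 1 if h1[7][0] > h2[7][0] else 2
    if int(h1[12]) != int(h2[12]):           # greater $13
        return 1 if int(h1[12]) > int(h2[12]) else 2
    if (h1[0] == 'XX') != (h2[0] == 'XX'):   # prefer $1 matching XX
        return 1 if h1[0] == 'XX' else 2
    if (h1[0] == 'UN') != (h2[0] == 'UN'):   # then $1 matching UN
        return 1 if h1[0] == 'UN' else 2
    return 1                                 # otherwise keep the first
```

A grouping pass over the same five-field key used earlier would feed pairs of headers into this function; the loser of each comparison goes to the rejects file.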
In this case the output should be something like:
58 JH 0 650 1996 6 14 4b 60.000 -6.250 783 0000 28 4 7600 6
950.0 938.9 -9.9000 34.9112
972.0 960.6 -9.9000 34.9117
**
RU P5 0 94 1993 4 28 4b 60.000 -5.500 878 0000 15 6 7600 5
606.0 599.4 7.5300 35.1760 6.591 0.990
758.0 749.5 0.8000 34.9130 7.074 1.020
**
58 JH 0 650 1996 6 14 4c 60.000 -6.250 783 0000 98 4 7600 6
962.0 950.7 -9.9000 34.9108
972.0 960.6 -9.9000 34.9117
**
90 AM 264 9854 1990 4 18 3c 60.000 -7.002 483 0001 42 4 7600 7
394.0 389.9 6.8000 35.1780
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
06 AZ 290 1741 1996 7 9 3c 60.000 -6.845 489 0001 45 4 7600 6
420.0 415.6 8.7735 35.2983
430.0 425.5 8.7678 35.2970
439.0 434.4 8.7582 35.2979
**
XX IN 104 2286 1999 10 2 3u 60.420 -8.580 485 0001 6 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 SC 1335 74 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
404.3 400.0 8.3900 35.2400
505.5 500.0 7.1800 35.1900
596.7 590.0 6.5200 35.1600
**
and the rejected:
90 P5 12461 2821 1993 4 28 6b 60.000 -6.500 458 0000 13 6 7600 6
303.2 300.0 8.0500 35.2200 6.290 0.860
404.3 400.0 7.9900 35.2100 6.280 0.890
460.0 455.0 7.5400 35.1800 6.360 0.910
**
06 AZ 10389 5882 1996 7 9 6c 60.000 -6.845 489 0000 50 4 7600 6
427.6 423.0 8.7777 35.2983
436.7 432.0 8.7670 35.2970
443.8 439.0 8.7582 35.2979
**
58 GS 3233 869 1990 4 18 6c 60.000 -7.002 483 0000 42 4 7600 7
404.0 399.8 6.7400 35.1690
414.0 409.7 6.5600 35.1590
**
XX UN 104 2267 1999 10 2 3u 60.420 -8.580 485 0001 5 3 7600 8
74.0 73.3 10.4000
104.0 103.0 9.7000
**
74 XX 10251 9893 1949 7 30 6b 60.000 -5.420 784 0000 13 4 7600 5
505.5 500.0 7.9600 35.2200
596.7 590.0 6.5200 35.1600
**
I hope you can help me. Thanks,
r.-
Ygor
January 27, 2009, 7:12pm
4
While I'm happy to help, I don't have time to do it all for you. Try to understand the awk code provided and modify it to fit your requirements. Others may help if you get stuck.
rleal
January 28, 2009, 5:30pm
5
Thanks. I understand and appreciate your help.
Regs,
r.-