Remove near-duplicate records from a flat file

I have a flat file that contains records similar to the following two lines:

1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2

The 123456789 2 represents an account number; this is how I identify the duplicate record.

The ### signs represent blank spaces in the file. This thread keeps stripping them out.

As you can see, the second line has a second date in it. This is the line I need to KEEP; I need to REMOVE the line before it.

How can I find these situations and then remove the first record?

Thanks for any help.


So what's the key? 123456789?


It keeps stripping them out because you're not putting them in code tags.

It could be as simple as

awk 'NF>5' < infile > outfile

to exclude all records with fewer than 6 fields.
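
Before filtering on NF, it may be worth confirming how many fields awk actually sees on each kind of record. The following is only a sketch: the two sample lines are the ones from the question, and the helper name show_field_counts is made up for illustration.

```shell
# Hypothetical helper: prefix each line with its whitespace-separated field count.
show_field_counts() {
  awk '{ print NF ": " $0 }' "$@"
}

# The two sample records from the question; the first has a blank second date.
# printf reuses the format for each group of three arguments.
printf '%-10s %-10s %s\n' \
  '1984/11/08' ''           '7 700000 123456789 2' \
  '1984/11/08' '1941/05/19' '7 700000 123456789 2' |
  show_field_counts
```

With the default field splitting, the record with the blank date should report 5 fields and the full record 6, which is what makes the NF>5 test work.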

Or, if some 'short' records do NOT have duplicates, then:

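awk '{
        # Build a key that skips the optional second date.
        if (NF == 6) K = $1 $3 $4 $5 $6;
        else         K = $1 $2 $3 $4 $5;

        # Remember first-seen order; keep the longest record per key.
        if (!(K in L)) { O[N++] = K; L[K] = $0; }
        else if (length(L[K]) < length($0)) L[K] = $0;
}
END { for (M = 0; M < N; M++) print L[O[M]]; }' < data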

nawk '{a[$(NF-1),$NF]=$0}END {for (i in a) print a[i]}' myFile

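
The nawk one-liner keys each record on its last two fields (the account number), so a later line with the same account overwrites an earlier one. Two things to keep in mind: it relies on the fuller record coming after the short one, as in the question, and awk does not specify the order in which "for (i in a)" visits the keys, so the output order may differ from the input. A minimal sketch that keeps first-seen order is below; the function name dedupe_by_account is made up for illustration.

```shell
# Hypothetical helper: keep the LAST record seen for each account
# (last two fields), printed in the order the accounts first appeared.
dedupe_by_account() {
  awk '
    NF < 2 { next }                          # skip blank or malformed lines
    { key = $(NF-1) SUBSEP $NF }             # account number = last two fields
    !(key in last) { order[++n] = key }      # remember first-seen order
    { last[key] = $0 }                       # later record overwrites earlier
    END { for (i = 1; i <= n; i++) print last[order[i]] }
  ' "$@"
}

# Usage, assuming the data is in a file named myFile:
#   dedupe_by_account myFile > outfile
```

If the fuller record could come before the short one, the length comparison from the longer awk script above would be the safer choice.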
$ cat f29
1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08            8 800000 234567891 5
1999/06/08 1956/11/23 8 800000 234567891 5
$ # print only those lines that have 6 fields ($#F is the last index, so 5 means 6 fields)
$ perl -lane 'print if $#F==5' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$ # print only those lines that do have 2 dates at the beginning
$ perl -lne 'print if /^(\s*\d{4}\/\d\d\/\d\d){2}/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$ # If the file is fixed-format, then you could try the following two approaches
$ # print only those lines whose column positions 12 through 21 are not blank spaces
$ perl -lne 'print if substr($_,11,10) !~ /^\s+$/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$ # print only those lines whose column position 12 is not a blank space
$ perl -lne 'print if substr($_,11,1) ne " "' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
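
If Perl is not at hand, the same fixed-column test can be sketched in awk. Note that awk's substr counts from 1 while Perl's counts from 0, so column positions 12 through 21 become substr($0, 12, 10). This assumes the second date always sits in those columns, as in the sample; the function name keep_with_second_date is made up for illustration.

```shell
# Hypothetical helper: print only the lines whose columns 12-21 contain
# at least one non-space character, i.e. the second date is present.
keep_with_second_date() {
  awk 'substr($0, 12, 10) ~ /[^ ]/' "$@"
}

# Usage, with the sample file from the transcript above:
#   keep_with_second_date f29
```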
