Remove somewhat Duplicate records from a flat file

jolney · September 29, 2011, 10:57am

I have a flat file that contains records similar to the following two lines:

1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2

The 123456789 2 represents an account number; this is how I identify the duplicate record.

The ### signs represent blank spaces in the file. This thread keeps stripping them out.

As you can see, the second line has a second date in it. I need to KEEP this line and REMOVE the line before it.

How can I find these situations and then remove the first record?

Thanks for any help.

Jeff

ahamed101 · September 29, 2011, 11:16am

so whats the key? 123456789?

--ahamed

Corona688 · September 29, 2011, 11:18am

It keeps stripping them out because you're not putting them in code tags.

It could be as simple as

awk 'NF>5' < infile > outfile

to exclude all records with fewer than 6 fields.
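Whether that simple filter is safe depends on whether every short record has a longer twin. A quick check, sketched under the assumption that the account number is always the last two fields: list the accounts that only ever appear in short form. The temporary file and the lone third record are illustrative, not taken from the real data.

```shell
# Illustrative sample: the third record is assumed to have no two-date twin
infile=$(mktemp)
cat > "$infile" <<'EOF'
1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08            8 800000 234567891 5
EOF

# Collect every account, and separately those seen with 6 fields;
# print the accounts that never show up in a 6-field record
orphans=$(awk '
{ key = $(NF-1) " " $NF; seen[key] = 1; if (NF > 5) full[key] = 1 }
END { for (k in seen) if (!(k in full)) print k }' "$infile")

printf '%s\n' "$orphans"
rm -f "$infile"
```

If this prints nothing, no account exists only in short form and `NF>5` alone drops nothing that lacks a longer twin. Any account it does print would be lost by that filter.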

Or, if some 'short' records do NOT have duplicates, then:

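awk '{
        # Build the key from every field except the optional second date
        if (NF == 6) K = $1 FS $3 FS $4 FS $5 FS $6
        else         K = $1 FS $2 FS $3 FS $4 FS $5

        # Remember first-seen order; keep the longest line for each key
        if (!(K in L)) { O[N++] = K; L[K] = $0 }
        else if (length(L[K]) < length($0)) L[K] = $0
}
END { for (M = 0; M < N; M++) print L[O[M]] }' < data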

vgersh99 · September 29, 2011, 11:18am

nawk '{a[$(NF-1),$NF]=$0}END {for (i in a) print a[i]}' myFile
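Note that a `for (i in a)` loop visits keys in an unspecified order, and keeping the last line per key only works when the two-date record comes after the short one. A minimal sketch of a variant that preserves first-seen order and keeps whichever record has more fields, assuming the account number is always the last two fields. The temporary file and the last sample record are made up for illustration.

```shell
data=$(mktemp)
cat > "$data" <<'EOF'
1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08            8 800000 234567891 5
2001/01/15            9 900000 345678912 3
EOF

result=$(awk '
{
    key = $(NF-1) SUBSEP $NF                 # account number: last two fields
    if (!(key in best)) order[n++] = key     # remember first-seen order
    # keep the record with the most fields for this account
    if (!(key in best) || NF > nf[key]) { best[key] = $0; nf[key] = NF }
}
END { for (i = 0; i < n; i++) print best[order[i]] }' "$data")

printf '%s\n' "$result"
rm -f "$data"
```

Short records with no longer twin are kept as they are, so nothing is lost when an account appears only once.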

durden_tyler · September 29, 2011, 4:01pm

$
$ cat f29
1984/11/08            7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08            8 800000 234567891 5
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines that have 6 fields ($#F is the last index of @F, so 5 means 6 fields)
$
$ perl -lane 'print if $#F==5' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines that do have 2 dates at the beginning
$
$ perl -lne 'print if /^(\s*\d{4}\/\d\d\/\d\d){2}/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # If the file is fixed-format, then you could try the following two approaches
$ # print only those lines whose column positions 12 through 21 are not blank spaces
$
$ perl -lne 'print if substr($_,11,10) !~ /^\s+$/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines whose column position 12 is not a blank space
$
$ perl -lne 'print if substr($_,11,1) ne " "' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$

tyler_durden