jolney
I have a flat file that contains records similar to the following two lines:
1984/11/08 7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
The 123456789 2 represents an account number; this is how I identify the duplicate record.
The ### signs represent blank spaces in the file. This thread keeps stripping them out.
As you can see, the second line has a second date in it. This is the line I need to KEEP; the line before it is the one I need to REMOVE.
How can I find these situations and then remove the first record?
Thanks for any help.
Jeff
So what's the key? 123456789?
--ahamed
It keeps stripping them out because you're not putting them in code tags.
It could be as simple as
awk 'NF>5' < infile > outfile
to exclude all records with fewer than 6 fields.
Or, if some 'short' records do NOT have duplicates, then:
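awk '{ # Build a key from every field except the optional second date.
if(NF == 6) { K=$1 $3 $4 $5 $6; }
else { K=$1 $2 $3 $4 $5; }
if(!L[K]) { O[N++]=K; L[K]=$0; } # first record for this key: remember its order
else if(length(L[K]) < length($0)) L[K]=$0; } # later record: keep the longer one
END { for(M=0; M<N; M++) print L[O[M]]; }' < data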
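To check how the key-based script treats a short record that has no longer twin, here is a minimal sketch that wraps the same logic in a shell function and feeds it sample data. The function name keep_longest and the third account (345678912 9) are made up for illustration; they do not come from the original file.

```shell
# Same logic as the script above, wrapped in a (hypothetical) function.
keep_longest() {
    awk '{
        # Build a key from every field except the optional second date.
        if (NF == 6) K = $1 $3 $4 $5 $6
        else         K = $1 $2 $3 $4 $5
        # First record for this key: remember its order and the record itself.
        if (!(K in L)) { O[N++] = K; L[K] = $0 }
        # Later record for the same key: keep whichever is longer.
        else if (length(L[K]) < length($0)) L[K] = $0
    }
    END { for (M = 0; M < N; M++) print L[O[M]] }' "$@"
}

# Two short/long pairs plus one invented short record with no duplicate.
keep_longest <<'EOF'
1984/11/08 7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 8 800000 234567891 5
1999/06/08 1956/11/23 8 800000 234567891 5
2001/02/03 9 900000 345678912 9
EOF
```

On this input the two long records and the unpaired short record (345678912 9) are printed, in the order their keys first appeared. A plain NF>5 filter would have dropped the unpaired record.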
nawk '{a[$(NF-1),$NF]=$0} END {for (i in a) print a[i]}' myFile
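Keying on the last two fields, as above, keeps whichever record appears last for each account number, and for (i in a) prints in an unspecified order. If the longer record might come first, or the original order matters, a variant along these lines may help. It is only a sketch: the function name keep_by_account is made up, and it assumes the account number is always the last two fields.

```shell
# Sketch: key on the account number (last two fields), keep the record
# with the most fields per key, and print in first-seen order.
keep_by_account() {
    awk 'NF < 2 { next }                    # skip blank or one-field lines
    {
        k = $(NF-1) SUBSEP $NF              # account number = last two fields
        if (!(k in best)) order[n++] = k    # remember first-seen order
        if (!(k in best) || NF > nf[k]) {   # new key, or a record with more fields
            best[k] = $0
            nf[k] = NF
        }
    }
    END { for (i = 0; i < n; i++) print best[order[i]] }' "$@"
}

# The long record comes first in the first pair and second in the other.
keep_by_account <<'EOF'
1984/11/08 1941/05/19 7 700000 123456789 2
1984/11/08 7 700000 123456789 2
1999/06/08 8 800000 234567891 5
1999/06/08 1956/11/23 8 800000 234567891 5
EOF
```

Both two-date records survive regardless of which line of the pair came first.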
$
$ cat f29
1984/11/08 7 700000 123456789 2
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 8 800000 234567891 5
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines that have 6 fields ($#F is the last index of @F, so 5 means 6 fields)
$
$ perl -lane 'print if $#F==5' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines that do have 2 dates at the beginning
$
$ perl -lne 'print if /^(\s*\d{4}\/\d\d\/\d\d){2}/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # If the file is fixed-format, then you could try the following two approaches
$ # print only those lines whose column positions 12 through 21 are not blank spaces
$
$ perl -lne 'print if substr($_,11,10) !~ /^\s+$/' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$ # print only those lines whose column position 12 is not a blank space
$
$ perl -lne 'print if substr($_,11,1) ne " "' f29
1984/11/08 1941/05/19 7 700000 123456789 2
1999/06/08 1956/11/23 8 800000 234567891 5
$
$
tyler_durden