Hi, I've been trying to remove duplicate lines that share the same leading columns in a fixed-width file, and it's not working.
I've searched the forum but nothing comes close.
There are 3 spaces between the first set of alphanumerics and the last three-letter codes.
I want to remove lines that match only up to the 3 blanks, ignoring the 3-letter codes and whatever else is on the line after them.
Does anyone know how I can do this? I want to keep at least one instance of any duplicates...doesn't matter which.
I put asterisks where I need to keep one of any two.
All this does is create an associative array. The first time awk encounters an array element, its value is zero, so it prints the whole record. If the element is not zero, we have seen it before, so the record is not printed again. $1 is the first field in the record.
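The idea described above can be sketched like this (the sample data and filename are made up for illustration; awk's default field splitting makes $1 everything before the first run of blanks):

```shell
#!/bin/sh
# Sample fixed-width data: a key, three spaces, then a three-letter code.
cat > /tmp/dupes.txt <<'EOF'
ABC123   XYZ
DEF456   QRS
ABC123   TUV
EOF

# Print a record only the first time its first field appears.
# seen[$1]++ evaluates to 0 (false) on the first encounter, so
# !seen[$1]++ is true and awk performs its default action: print
# the whole record. Later repeats of the same key are suppressed.
awk '!seen[$1]++' /tmp/dupes.txt
```

Only the first ABC123 line survives; which duplicate is kept is always the first one seen, which satisfies the "doesn't matter which" requirement.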
Wow. Thanks guys. I tried Perderabo's solution and it worked perfectly.
I wasn't sure a code snippet that simple would work, but it does, and I'm a little unsure why...glad it does, but not sure why.
I'll test the other codes as well, out of curiosity.
Thanks.
Gianni
I've seen this technique before, but I thought I would test it on a 1-million-line data file. It finished in half the time of the sort -mu command, and awk also eliminated duplicates a million lines apart, as you would expect from the logic. The sort -mu command assumes the file is already sorted, so a duplicate a million lines apart is ignored.
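The difference shows up even on a tiny unsorted file (a sketch; the exact behavior of sort -mu on input that is not actually sorted can vary by implementation, which is why only the awk result is relied on here):

```shell
#!/bin/sh
# A duplicate far from its first occurrence, in UNSORTED input:
printf 'zz\naa\nzz\n' > /tmp/far.txt

# awk's associative array remembers every line it has printed, so the
# repeat of 'zz' is dropped no matter how far apart the copies are.
awk '!seen[$0]++' /tmp/far.txt

# sort -mu merges input it assumes is already sorted and only
# suppresses duplicates that end up adjacent, so on unsorted data a
# distant repeat can slip through.
sort -mu /tmp/far.txt
```

The awk pass prints zz then aa; the trade-off is that awk holds every distinct key in memory, while sort -mu streams with almost none, which is why it can be faster to combine them when memory is tight.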
I tried the different solutions and the one that comes closest is Perderabo's.
The only time it doesn't work is if there are any blanks in the first set of alphanumerics (which I just found out is possible).
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?
Here are some records I found that are causing me to be back at square one...
Do I have to use awk with substrings on this one? I tested Jim's solution as well and it was fast...unfortunately it found a few more duplicates than I'd hoped, due to the way the records come in; otherwise I'd use it.
It might be better to think of your lines in terms of 'fields', in case your 'fields' may vary in length.
Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.
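For the "characters 1 thru 30" case asked about above, keying the array on substr instead of $1 sidesteps the embedded-blank problem entirely, since the key is taken by character position rather than by field splitting. A minimal sketch (the 30-character key width and the sample records are assumptions taken from the question):

```shell
#!/bin/sh
# Build fixed-width records: a 30-character key zone (which may
# contain blanks), followed by trailing data. %-30s pads with spaces.
{
  printf '%-30s%s\n' 'ABC 123 XYZ' 'AAA trailing data'
  printf '%-30s%s\n' 'DEF 456'     'BBB trailing data'
  printf '%-30s%s\n' 'ABC 123 XYZ' 'CCC trailing data'
} > /tmp/fixed.txt

# Key on character positions 1-30 of the whole record ($0), so blanks
# inside the key no longer split it into separate fields. The first
# occurrence of each key is printed; later ones are dropped.
awk '!seen[substr($0,1,30)]++' /tmp/fixed.txt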