Hi, I've been trying to remove duplicate lines that share the same leading columns in a fixed-width file, and it's not working.
I've searched the forum but nothing comes close.
There are 3 spaces between the first set of alphanumerics and the last three-letter codes.
I want to remove lines that match only up to the 3 blanks, ignoring the 3-letter codes and whatever else is on the line after them.
Does anyone know how I can do this? I want to keep at least one instance of any duplicates...doesn't matter which.
I put asterisks where I need to keep one of any two.
All this does is create an associative array. The first time awk encounters an array element, its value is zero, so it prints the whole record. If the element is not zero, we have seen it before, so the record is not printed again. $1 is the first field in the record.
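The idea described above can be sketched like this (the sample data and filename are made up for illustration; awk's default field splitting makes $1 everything before the first run of blanks):

```shell
#!/bin/sh
# Sample fixed-width data: a key, three spaces, then a three-letter code.
cat > /tmp/dupes.txt <<'EOF'
ABC123   XYZ
DEF456   QRS
ABC123   TUV
EOF

# Print a record only the first time its first field appears.
# seen[$1]++ evaluates to 0 (false) on the first encounter, so
# !seen[$1]++ is true and awk performs its default action: print
# the whole record. Later repeats of the same key are suppressed.
awk '!seen[$1]++' /tmp/dupes.txt
```

Only the first ABC123 line survives; which duplicate is kept is always the first one seen, which satisfies the "doesn't matter which" requirement.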
Wow. Thanks guys. I tried Perderabo's solution and it worked perfectly.
I wasn't sure a code snippet that simple would work, but it does, and I'm a little unsure why...glad it does, but not sure why.
I'll test the other codes as well, out of curiosity.
Thanks.
Gianni
I've seen this technique before, but I thought I would test it on a 1-million-line data file. It finished in half the time of the sort -mu command, and awk also eliminated duplicates a million lines apart, as you would expect from the logic. The sort -mu command assumes the file is already sorted, so a duplicate a million lines apart is ignored.
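The difference shows up even on a tiny unsorted file (a sketch; the exact behavior of sort -mu on input that is not actually sorted can vary by implementation, which is why only the awk result is relied on here):

```shell
#!/bin/sh
# A duplicate far from its first occurrence, in UNSORTED input:
printf 'zz\naa\nzz\n' > /tmp/far.txt

# awk's associative array remembers every line it has printed, so the
# repeat of 'zz' is dropped no matter how far apart the copies are.
awk '!seen[$0]++' /tmp/far.txt

# sort -mu merges input it assumes is already sorted and only
# suppresses duplicates that end up adjacent, so on unsorted data a
# distant repeat can slip through.
sort -mu /tmp/far.txt
```

The awk pass prints zz then aa; the trade-off is that awk holds every distinct key in memory, while sort -mu streams with almost none, which is why it can be faster to combine them when memory is tight.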
I tried the different solutions and the one that comes closest is Perderabo's.
The only time it doesn't work is if there are any blanks in the first set of alphanumerics (which I just found out is possible).
How would I modify any of the above solutions to look at, say, characters 1 thru 30, out of a 100 character record for exact matches and keep first occurrence and remove the rest of the duplicates?
Here are some records I found that are causing me to be back at square one...
Do I have to use awk with substrings on this one? I tested Jim's solution as well and it was fast...unfortunately it found a few more duplicates than I'd hoped, due to the way the records come in; otherwise I'd use it.
It might be better to think of your lines in terms of 'fields', in case your 'fields' may vary in length.
Right now all your fields are the same length, and 'substr($0,1,15)' seems to be referring to the first two fields. This is what makes your line/record unique.
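For the "characters 1 thru 30" case asked about above, keying the array on substr instead of $1 sidesteps the embedded-blank problem entirely, since the key is taken by character position rather than by field splitting. A minimal sketch (the 30-character key width and the sample records are assumptions taken from the question):

```shell
#!/bin/sh
# Build fixed-width records: a 30-character key zone (which may
# contain blanks), followed by trailing data. %-30s pads with spaces.
{
  printf '%-30s%s\n' 'ABC 123 XYZ' 'AAA trailing data'
  printf '%-30s%s\n' 'DEF 456'     'BBB trailing data'
  printf '%-30s%s\n' 'ABC 123 XYZ' 'CCC trailing data'
} > /tmp/fixed.txt

# Key on character positions 1-30 of the whole record ($0), so blanks
# inside the key no longer split it into separate fields. The first
# occurrence of each key is printed; later ones are dropped.
awk '!seen[substr($0,1,30)]++' /tmp/fixed.txt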