Checking file for duplicates

Hi all,

I am due to start receiving a weekly CSV file containing around 6 million rows. I need to do some processing on this file and then send it on elsewhere.

My problem is that after week 1 the files that I will receive are likely to contain data already received in previous files and I need to strip this data out before sending on.

Initially my plan was to keep a list of every row of data already sent and then check whether each row in a new file is present in that sent list. However, it soon became clear that by week 2 I would be checking each of 6 million rows against a list of 6 million already sent, and by week 5 against 30 million rows.
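
For concreteness, that plan would look something like the rough Python sketch below (the file names sent_rows.txt, week2.csv and to_send.csv are placeholders, and a hash set stands in for a flat list so the per-row lookup itself is cheap); the real pain point is holding 30 to 60 million sent rows in memory at once.

# Rough sketch of the plan described above; file names are placeholders.
sent = set()
with open("sent_rows.txt") as f:              # every row sent in previous weeks
    for line in f:
        sent.add(line.rstrip("\n"))

with open("week2.csv") as new, open("to_send.csv", "w") as out:
    for line in new:
        row = line.rstrip("\n")
        if row not in sent:                   # skip anything already sent
            out.write(line)
            sent.add(row)

# Afterwards, the newly sent rows would be appended to sent_rows.txt for next week.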

I was hoping that someone might have a more efficient way to achieve this.

It is likely that the data will start to be purged after week 10, so the sent list should max out at around 60 million rows.

Any ideas would be appreciated

Since this is a theoretical question rather than an actual how-to, what are you already doing to process the first file? Bear in mind that what you're asking is more of a framework question than a scripting issue.

There are plenty of ways to skin this cat, but which way have you started to do it? There's no sense in us suggesting a method that doesn't fit your approach.

We have to assume you have a mainstream database engine that can handle CSV files and a computer with the capacity for this task; you really don't tell us much about the data or the computer.

At a design level, each record must contain a unique key plus whatever information parameterises the "purge". Unless you know the source's purge rules, your database of "data already processed" will just keep growing.

It would make more sense to fix the data feed design at source. A common convention is to mark each record in the source database with a unique extract-run reference to prevent repeat extracts, while still allowing a rerun.

I've had to create a differential list before for a similar task.

Records 1 to 100 would be sent, followed by 90 - 300, followed by 250 - whatever.

Each time, I would save a list of the last N lines captured (in my case 5 was sufficient; you may need more or fewer). On the next run I would search the new file for those last captured lines and process everything from that point onward. Upon completion, I'd write out a new 'last N lines' list and repeat the whole thing next time round.
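
As a rough illustration only (N, the marker file and the weekly file name below are made-up values, not anything from the actual job), that approach might be sketched in Python like this:

N = 5
MARKER_FILE = "last_lines.txt"                 # the 'last N lines' from the previous run

def load_marker():
    try:
        with open(MARKER_FILE) as f:
            return [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        return []                              # first run: no marker, process everything

def new_lines(weekly_file):
    with open(weekly_file) as f:
        lines = [line.rstrip("\n") for line in f]
    marker = load_marker()
    start = 0
    if marker:
        # Find the last occurrence of the marker block and resume just after it;
        # if it cannot be found, fall back to processing the whole file.
        for i in range(len(lines) - len(marker), -1, -1):
            if lines[i:i + len(marker)] == marker:
                start = i + len(marker)
                break
    if lines:
        with open(MARKER_FILE, "w") as f:      # remember this file's tail for next week
            f.write("\n".join(lines[-N:]) + "\n")
    return lines[start:]

for row in new_lines("weekly.csv"):
    print(row)                                 # hand off to whatever processing follows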

Ultimately, though, the real solution is to fix the distribution method so that it is consistent.

Please show us the structure of your csv file (fields, keys...)

Jean-Pierre.

A CSV is just a text file; would the diff command not be suitable for you?

A unique key is not strictly required; the record as a whole can serve as the key, with no need for any single unique column within it. Worse, after applying normalisation or some other transformation, two records may or may not still be distinct.

Fixing the problem at source is definitely out of the question here. Though I agree it makes sense to do that, most of the time it is completely out of scope. Not everything flows as designed; we need to work with what has been received, or what could potentially be received, in a real-world scenario.

Fixing this data at source is not currently an option. Although I agree it would make more sense, for now I have to work with the data I am sent.

A typical example of the data would be

9999,20-NOV-2009,XXX,YYY,0,LF,BUN,EE,L,14,07-NOV-2009,0,1,0,0,0,0,.003
9999,26-OCT-2009,XXX,YYY,0,GU,BUN,LE,L,42,15-SEP-2009,0,1,0,0,0,.131,.131
6666,24-MAR-2010,AAA,BBB,0,BO,MUB,EE,L,1,24-MAR-2010,0,1,0,0,0,.077,.077

Note that even though each row contains 18 fields, only the first ten make up the unique key.

I am currently playing with the idea of loading the total sent rows into an Oracle table and then loading each file received into a temp table, allowing me to do some set arithmetic to get what I need (TABLE2 MINUS TABLE1), but if possible I would prefer to do this just using files on the server.
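
If the pure file-based route turns out to be workable, something along these lines might do it: a Python sketch (file names are placeholders) that keys each row on its first ten fields and keeps a flat file of the keys already sent. Whether 60 million keys fit comfortably in memory depends on the server; if not, a disk-backed store such as the one suggested below would be the next step.

import csv

SENT_KEYS = "sent_keys.txt"                    # one composite key per line, grows weekly

def key_of(row):
    return ",".join(row[:10])                  # the unique key is the first ten fields

def filter_new(in_file, out_file):
    try:
        with open(SENT_KEYS) as f:
            sent = set(line.rstrip("\n") for line in f)
    except FileNotFoundError:
        sent = set()                           # week 1: nothing sent yet

    with open(in_file, newline="") as src, \
         open(out_file, "w", newline="") as dst, \
         open(SENT_KEYS, "a") as keys:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            k = key_of(row)
            if k not in sent:                  # file-level equivalent of TABLE2 MINUS TABLE1
                sent.add(k)
                writer.writerow(row)
                keys.write(k + "\n")

filter_new("week_n.csv", "week_n_to_send.csv")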

Thanks for your help so far ppl

One approach:

Use a BDB (Berkeley DB) to store the unique records, keyed on the set of key fields. The BDB does not need to be loaded into primary memory completely; tie it to a file on disk and treat it as though you were working with an in-memory hash (internally it will keep moving data between primary and secondary memory). A rough sketch follows the steps below.

Before any new record is processed:
- form a key from the record in the current file
- check whether that key is already in the BDB (a hash lookup)
- if it is there, the record has already been seen, but it could have different values, so check that
- if it is not there, it is a totally new entry; store it as a key-value pair
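
A minimal sketch of that idea, assuming Python's standard dbm module as the disk-backed hash in place of a raw BDB (the file names, and what to do when an existing key comes back with different values, are placeholders):

import csv
import dbm

def filter_with_disk_hash(in_file, out_file, store_path="sent_store"):
    # dbm keeps the key/value pairs on disk, so the full history never has to
    # sit in primary memory; lookups still behave like an ordinary hash.
    with dbm.open(store_path, "c") as store, \
         open(in_file, newline="") as src, \
         open(out_file, "w", newline="") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            key = ",".join(row[:10]).encode()       # key formed from the record
            value = ",".join(row[10:]).encode()     # the remaining, non-key fields
            if key in store:                        # hash lookup against the disk store
                if store[key] != value:
                    pass                            # already sent but values differ: decide what to do here
            else:
                store[key] = value                  # totally new entry: remember it and send it
                writer.writerow(row)

filter_with_disk_hash("week_n.csv", "week_n_to_send.csv")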