Hi all,
I am due to start receiving a weekly CSV file containing around 6 million rows. I need to do some processing on this file and then send it on elsewhere.
My problem is that after week 1, the files I receive are likely to contain data already sent in previous files, and I need to strip that data out before sending the file on.
My initial plan was to keep a list of every row of data already sent, and then check whether each row in a new file is present in that sent list. However, it soon became clear that in week 2 I would be checking each of 6 million rows against a list of 6 million already sent, and by week 5 I would be checking against 30 million rows.
I was hoping someone might have a more efficient way to achieve this.
It is likely that the data will start to be purged after week 10, so the sent list should max out at around 60 million rows.
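In case it helps, here is a rough sketch of what I had in mind (Python; the function name and the hash-the-row idea are just my sketch, assuming a "duplicate" means the entire row matches exactly). Storing a 16-byte digest per row instead of the full row keeps the memory for 60 million sent rows to roughly 1 GB, and set membership checks are constant time rather than a scan of the whole list:

```python
import csv
import hashlib

def filter_new_rows(infile, outfile, seen):
    """Copy infile to outfile, keeping only rows not already in `seen`.

    `seen` is a set of row digests, updated in place, so it can be
    carried over (or persisted) from week to week.
    """
    with open(infile, newline="") as fin, open(outfile, "w", newline="") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        for row in reader:
            # Join columns with a separator unlikely to appear in the data,
            # then hash, so the sent list stores 16-byte digests rather
            # than whole rows (60M digests is on the order of 1 GB).
            digest = hashlib.md5("\x1f".join(row).encode()).digest()
            if digest not in seen:
                seen.add(digest)
                writer.writerow(row)
```

Even so, I am not sure whether holding the whole set in memory (or reloading a persisted copy each week) is the right approach at this scale, hence the question.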
Any ideas would be appreciated.