We have 50 million records in a mainframe DB2 database. We have a requirement to capture Change Data Capture (CDC) records,
i.e. new or updated records that were added to DB2.
Unfortunately we don't have any column indicators giving the details of the changes made to the records.
So we decided to export the data into flat files on a UNIX box every day, then compare against the previous day's file and extract the changed rows using UNIX text-processing tools.
The problem is that huge data changes happen every day, approximately 40 million rows.
So can anyone suggest a better way of handling this, or the hardware requirements for a UNIX server to process such a huge data comparison faster?
We have a similar problem. Are you running diff? That would take forever.
Use something that has associative (hashed) arrays, like awk or perl. Assuming you have several files, an "old" one and a "new" one, that should take less than an hour.
You can search here for examples of both types of code on how to find file differences.
You need a lot of virtual memory; we run Solaris 9 on a SPARC V440 with 32GB of memory.
We complete comparing 1.5GB (250K lines) files in about 5 minutes. We do them 12 at a time: 6 old vs 6 new.
Thanks for the solution. We have come up with an approach for comparing the huge data.
Since we are comparing huge flat-file records, the following can be done:
A hash function may be used, as you mentioned, on each row of the flat files, making the comparison easier.
But is there a hash utility in UNIX, like ORA_HASH in Oracle, that would encode each row uniquely into a short string of characters or digits?
Then we could compare only those hash codes against the previous day's hash codes, which would make the processing faster too.
Hi methyl, we had considered that option too, but since we are dealing with a huge amount of data, there is a possibility that it would hamper the database.
This assumes that:
- your file has a unique key (some sort of account ID, order ID, etc.)
- it is sorted by that key (in ascending order)
Here is an approach that I use when working with flat files and comparing data (it's pseudocode):
open today
open yesterday
while true
{
## Check if we need to read a new record
if(!today_rec)
{
today_rec = read_next_record(today)
today_key = get_key(today_rec)
}
## Check if we need to read a new record
if(!yesterday_rec)
{
yesterday_rec = read_next_record(yesterday)
yesterday_key = get_key(yesterday_rec)
}
## If both files are done, exit the processing loop
if (today_rec == NULL and yesterday_rec == NULL)
break;
if (today_key < yesterday_key)
{
## today_key is missing from the yesterday file, its an insert
report_inserted_record(today_rec)
today_rec = NULL
continue
}
else if (today_key > yesterday_key)
{
## yesterday_key is missing from today file, its a delete
report_deleted_record(yesterday_rec)
yesterday_rec = NULL
continue
}
else
{
## Keys match: check whether the record content changed
if (compare_records(today_rec, yesterday_rec))
report_changed_record(today_rec, yesterday_rec)
today_rec = NULL
yesterday_rec = NULL
}
}
The performance boost comes from the fact that each file is traversed exactly once in this approach.
You will have to handle boundary conditions (especially end-of-file and other errors) properly in the code; for example, once one file is exhausted, treat its key as greater than any real key so the remaining records in the other file are still reported.