We have 50 million records in a mainframe DB2 database. We have a requirement to capture Change Data Capture (CDC) records,
i.e. new or updated records that were added to DB2.
Unfortunately we don't have any column indicators giving the details of the changes made to the records.
So we decided to export the data into flat files on a UNIX box every day, then compare against the previous day's file and extract the changed rows using UNIX text-processing tools.
The problem is that huge data changes happen every day, approximately 40 million rows.
So can anyone suggest a better way of handling this, or the hardware requirements for a UNIX server to process such a huge data comparison faster?
We have a similar problem. Are you running diff? That would take forever.
Use something that has associative (hashed) arrays, like awk or perl. Assuming you have several files, an "old" one and a "new" one, that should take less than an hour.
You can search here for examples of both types of code on how to find file differences.
You need a lot of virtual memory; we run Solaris 9 on a SPARC V440 with 32GB of memory.
We complete comparing 1.5GB (250K lines) files in about 5 minutes. We do them 12 at a time: 6 old vs 6 new.
Thanks for the solution. We have come up with an approach for comparing the huge data.
Since we are comparing huge flat-file records, the following can be done:
A hash function may be used, as you mentioned, on each row of the flat files, making the comparison easier.
But is there a hash utility in UNIX, like ORA_HASH in Oracle, that would encode each row uniquely into a short string of characters or digits?
Then we could compare only those hash codes against the previous day's hash codes, which would make the processing faster too.
Hi methyl, we had considered that option too, but since we are dealing with a huge amount of data, there is a possibility that it would hamper the database.
This assumes that:
- your file has a unique key (some sort of account ID, order ID, etc.)
- it is sorted by that key (in ascending order)
Here is an approach that I use when working with flat files and comparing data (it's pseudocode):
open today
open yesterday
while true
{
## Check if we need to read a new record
if(!today_rec)
{
today_rec = read_next_record(today)
today_key = get_key(today_rec)
}
## Check if we need to read a new record
if(!yesterday_rec)
{
yesterday_rec = read_next_record(yesterday)
yesterday_key = get_key(yesterday_rec)
}
## If both files are done, exit the processing loop
if (today_rec == NULL and yesterday_rec == NULL)
break;
if (today_key < yesterday_key)
{
## today_key is missing from the yesterday file, its an insert
report_inserted_record(today_rec)
today_rec = NULL
continue
}
else if (today_key > yesterday_key)
{
## yesterday_key is missing from today file, its a delete
report_deleted_record(yesterday_rec)
yesterday_rec = NULL
continue
}
else
{
## Keys match: check whether the record content changed
if (compare_records(today_rec, yesterday_rec))
report_changed_record(today_rec, yesterday_rec)
today_rec = NULL
yesterday_rec = NULL
}
}
The performance boost comes from the fact that each file is traversed exactly once in this approach.
You will have to handle boundary conditions (especially end-of-file and other errors) properly in the code; for example, once one file is exhausted, treat its key as greater than any real key so the remaining records in the other file are still reported.