Hi All,
I'm thinking about how to speed up checking each DAT file's record count against the count recorded in its CTL file.
There are about 300 to 400 directories, each containing DAT and CTL files.
The DAT file holds the flat-file records.
The CTL file is the reference file used to verify the DAT file.
It contains the DAT file's total record count, the process date, and so on.
For your information, some of the DAT files are pretty large and may contain more than 10 million records (they are full files, which is part of the application requirement).
Sometimes it takes more than 30 minutes to run wc -l over all the DAT files with the script I created, which runs serially.
One way to accelerate the aggregation would be to spawn multiple processes to perform the counts, since the CPU is mostly idle during this verification.
Can anyone share an idea?
Thanks.
---------- Post updated at 10:21 PM ---------- Previous update was at 10:10 PM ----------
cat LNS_DSLNT01/DSLNT01.CTL
DSLNT01.DAT�2013-10-21�13636�......
:: The third field in the CTL file indicates the number of records the DAT file should contain.
datdss@root:/dat3/data/UAT_TEST# wc -l < ./LNS_DSLNT01/DSLNT01.DAT
13636
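For a single pair, the check above can be scripted roughly like this. This is only a sketch: the CTL delimiter shows up garbled in the paste above, so the `DELIM='|'` below is an assumption you'd replace with the real separator byte, and the field positions are taken from the sample (field 3 = record count).

```shell
# Assumption: '|' stands in for the real CTL field separator,
# which appears as a garbled byte in the sample above.
DELIM='|'

# check_one CTLFILE DATFILE -- compare field 3 of the CTL
# against the actual line count of the DAT file.
check_one() {
    ctl=$1 dat=$2
    expected=$(awk -F"$DELIM" 'NR==1 {print $3}' "$ctl")
    # $(( )) strips any whitespace padding some wc versions emit
    actual=$(($(wc -l < "$dat")))
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $dat ($actual records)"
    else
        echo "MISMATCH: $dat expected=$expected actual=$actual"
    fi
}
```

Called as, e.g., `check_one LNS_DSLNT01/DSLNT01.CTL LNS_DSLNT01/DSLNT01.DAT`.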
You can go parallel, but running wc -l across that many DAT files might saturate the disk channels pretty quickly. Maybe running the verification when the files are created would get it started earlier and spread the load. GNU parallel can help run it at maximum speed. Extracting the counts from the CTL lines should be easy enough: find | xargs grep .... Collect the "should be" counts from the CTL files and the "was" counts from the DAT files, one line per file, into two files, sort them, and run them through comm -3 to find out what is out of whack. Process substitution, <( ), can help make that pipeline parallel.
The real trick is to poll for modified files and maintain a registry file of file names, line counts, and modification times, so the slow part is done ahead of time. Use fuser to ensure a file is fully written (not open for write) before counting its lines.
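One way to sketch that registry idea. Assumptions here: a plain-text registry of "mtime lines path" records (my own made-up format), GNU stat with a BSD fallback, and fuser available to detect files still open:

```shell
# update_registry DIR REGISTRY -- refresh "mtime lines path" records so
# each DAT file is counted only once per modification, ahead of
# verification time.
update_registry() {
    root=$1 reg=$2
    touch "$reg"
    find "$root" -name '*.DAT' | while read -r f; do
        # skip files some process still has open (not fully written yet)
        if fuser "$f" >/dev/null 2>&1; then
            continue
        fi
        mtime=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
        # already counted at this mtime? then skip the expensive wc -l
        # (paths are matched as regexes here -- fine for a sketch)
        if grep -q "^$mtime .* $f\$" "$reg"; then
            continue
        fi
        lines=$(($(wc -l < "$f")))
        # drop any stale record for this file, then append the fresh one
        grep -v " $f\$" "$reg" > "$reg.tmp" || true
        printf '%s %s %s\n' "$mtime" "$lines" "$f" >> "$reg.tmp"
        mv "$reg.tmp" "$reg"
    done
}
```

Run it from cron every few minutes; the later CTL comparison then reads counts from the registry instead of re-running wc -l.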