Hi All,
I'm thinking about how to speed up checking each DAT file's record count against the count recorded in its CTL file.
There are about 300 to 400 directories, each containing DAT and CTL files.
The DAT file holds the flat-file records.
The CTL file is the reference file used to verify the DAT file.
It contains the DAT file's total record count, the process date, and so on.
For your information, some of the DAT files are pretty large and may contain more than 10 million records (they are full files, which is part of the application requirement).
Sometimes it takes more than 30 minutes to run wc -l over all the DAT files with the script I created, which runs serially.
One way to accelerate the aggregation would be to spawn multiple processes to perform the counts, since the CPU is mostly idle during this verification.
Can anyone share an idea?
Thanks.
---------- Post updated at 10:21 PM ---------- Previous update was at 10:10 PM ----------
cat LNS_DSLNT01/DSLNT01.CTL
DSLNT01.DAT�2013-10-21�13636�......
:: The third field in the CTL file indicates the number of records the DAT file should contain.
datdss@root:/dat3/data/UAT_TEST# wc -l < ./LNS_DSLNT01/DSLNT01.DAT
13636
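For a single pair, the check above can be scripted roughly like this. This is only a sketch: the CTL delimiter shows up garbled in the paste above, so the `DELIM='|'` below is an assumption you'd replace with the real separator byte, and the field positions are taken from the sample (field 3 = record count).

```shell
# Assumption: '|' stands in for the real CTL field separator,
# which appears as a garbled byte in the sample above.
DELIM='|'

# check_one CTLFILE DATFILE -- compare field 3 of the CTL
# against the actual line count of the DAT file.
check_one() {
    ctl=$1 dat=$2
    expected=$(awk -F"$DELIM" 'NR==1 {print $3}' "$ctl")
    # $(( )) strips any whitespace padding some wc versions emit
    actual=$(($(wc -l < "$dat")))
    if [ "$expected" -eq "$actual" ]; then
        echo "OK: $dat ($actual records)"
    else
        echo "MISMATCH: $dat expected=$expected actual=$actual"
    fi
}
```

Called as, e.g., `check_one LNS_DSLNT01/DSLNT01.CTL LNS_DSLNT01/DSLNT01.DAT`.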
You can go parallel, but running wc -l across that many DAT files might saturate the disk channels pretty quickly. Maybe running the verification when the files are created would get it started earlier and spread the load. GNU parallel can help run it at maximum speed. Extracting the counts from the CTL lines should be easy enough: find | xargs grep .... Collect the "should be" counts from the CTL files and the "was" counts from the DAT files, one line per file, into two files, sort them, and run them through comm -3 to find out what is out of whack. Process substitution, <( ), can help make that pipeline parallel.
The real trick is to poll for modified files and maintain a registry file of file names, line counts, and modification times, so the slow part is done ahead of time. Use fuser to ensure a file is fully written (not open for write) before counting its lines.
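One way to sketch that registry idea. Assumptions here: a plain-text registry of "mtime lines path" records (my own made-up format), GNU stat with a BSD fallback, and fuser available to detect files still open:

```shell
# update_registry DIR REGISTRY -- refresh "mtime lines path" records so
# each DAT file is counted only once per modification, ahead of
# verification time.
update_registry() {
    root=$1 reg=$2
    touch "$reg"
    find "$root" -name '*.DAT' | while read -r f; do
        # skip files some process still has open (not fully written yet)
        if fuser "$f" >/dev/null 2>&1; then
            continue
        fi
        mtime=$(stat -c %Y "$f" 2>/dev/null || stat -f %m "$f")
        # already counted at this mtime? then skip the expensive wc -l
        # (paths are matched as regexes here -- fine for a sketch)
        if grep -q "^$mtime .* $f\$" "$reg"; then
            continue
        fi
        lines=$(($(wc -l < "$f")))
        # drop any stale record for this file, then append the fresh one
        grep -v " $f\$" "$reg" > "$reg.tmp" || true
        printf '%s %s %s\n' "$mtime" "$lines" "$f" >> "$reg.tmp"
        mv "$reg.tmp" "$reg"
    done
}
```

Run it from cron every few minutes; the later CTL comparison then reads counts from the registry instead of re-running wc -l.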