compare & split files

ss_ss · August 11, 2009, 11:02am

Hi All,

I have one big file like this:

cat nid_lec_rej_20090804_merged
10084MOCLEC         0408090061480739nid090804132259.03.148990533               
2526716790000008947850036448540401014 R030007150692000                         
2535502720000000010100036165742685000 R030007150354000                         
2554132380000000298300036428156061013 R030007150082000                         
2608117990000000145250036428153472007 R030007148586000                         
2612547640000000055750036452910607010 R030007148076000                         
2511131960000000100000036008715245008 R030007133681000                         
2587377210000000171100036182913145003 R030007131966000                         
2588157990000000190200036459337192005 R030007131975000                         
2599294600000000179600036181101445019 R030007131676000                         
2626160970000000075500036165716085005 R030007131171000                         
2939008270000001725100036182920694027 R030007106040000                         
2941677890000000068000036001629351020 R030007105976000                         
2954673550000000234200036001620285029 R030007105655000                         
2956336840000000038650036001620285029 R030007105697000                         
2956389380000000048000036001620285029 R030007105593000                         
3000150000012287605000001994675

and three small files:

cat CC29072009_CXXXCU01.rnd
10020MOCLEC         2907090061480739nid090729181916.03.147814552               
2526716790000008947850036448540401014  030007150692000                         
2535502720000000010100036165742685000  030007150354000                         
2554132380000000298300036428156061013  030007150082000                         
2608117990000000145250036428153472007  030007148586000                         
2612547640000000055750036452910607010  030007148076000                         
300005000000945725                                                             

cat CC04082009_CXXXCU04.rnd
10020MOCLEC         0408090061480739nid090804132259.06.148990533               
2511131960000000100000036008715245008  030007133681000                         
2587377210000000171100036182913145003  030007131966000                         
2588157990000000190200036459337192005  030007131975000                         
2599294600000000179600036181101445019  030007131676000                         
2626160970000000075500036165716085005  030007131171000                         
300005000000071640                                                             

cat CC25072009_CXXXCU07.rnd
10020MOCLEC         2507090061480739nid090725021957.09.146887198               
2939008270000001725100036182920694027  030007106040000                         
2941677890000000068000036001629351020  030007105976000                         
2954673550000000234200036001620285029  030007105655000                         
2956336840000000038650036001620285029  030007105697000                         
2956389380000000048000036001620285029  030007105593000                         
300005000001440155

Now I'm comparing the big file with the three small files on the basis of the id. In the big file this field is in the 2nd column, from position 39 to position 79 of each detail record (a record whose first digit is 2).

In the three small files the same field is in the 2nd column, from position 39 to position 80 of each detail record (again, a record whose first digit is 2).
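To make the positions concrete, here is a minimal sketch using one detail record from each layout shown above. It assumes the part that actually has to match is the 15-digit number starting at column 40 (after the `R` in the big file, after the two blanks in the small files). That reading is inferred from the samples, and the variable names are only for illustration.

```shell
# One detail record from the big file and the matching record from a small file.
big='2526716790000008947850036448540401014 R030007150692000'
small='2526716790000008947850036448540401014  030007150692000'

# Assumption: columns 40-54 hold the 15-digit id in both layouts.
big_id=$(printf '%s\n' "$big" | awk '{print substr($0, 40, 15)}')
small_id=$(printf '%s\n' "$small" | awk '{print substr($0, 40, 15)}')

echo "big:   $big_id"
echo "small: $small_id"
```

Both commands extract the same string, so the id can be used directly as a lookup key.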

Right now, to compare the big file with the three small files, I'm writing three while loops, but three loops scan the big file three times. I want to do it in one pass, i.e. the big file should be scanned only once.

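grep '^2' nid_lec_rej_20090804_merged | while IFS= read -r i
do
    # id = the 2nd column without its leading R
    x=`echo "$i" | awk '{print substr($2,2,40)}'`
    # look the id up in one small file (this rescans the small file for every record)
    y=`awk '/[[:digit:]]{37}[[:space:]]{2}'"$x"'/' /data/output/TEMP/toRND/CC04082009_CXXXCU04.rnd`
    [ "$y" != "" ] && echo "$i" >> rnd.out4
done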

The same loop is repeated for the other two small files.

Please suggest a more efficient way to do this.

Thanks

Franklin52 · August 11, 2009, 1:54pm

What is the desired output?

ss_ss · August 11, 2009, 10:03pm

The output should be three files split from the big file:

cat rnd.out1
2526716790000008947850036448540401014 R030007150692000                         
2535502720000000010100036165742685000 R030007150354000                         
2554132380000000298300036428156061013 R030007150082000                         
2608117990000000145250036428153472007 R030007148586000                         
2612547640000000055750036452910607010 R030007148076000                         

cat rnd.out4
2511131960000000100000036008715245008 R030007133681000                         
2587377210000000171100036182913145003 R030007131966000                         
2588157990000000190200036459337192005 R030007131975000                         
2599294600000000179600036181101445019 R030007131676000                         
2626160970000000075500036165716085005 R030007131171000                         

cat rnd.out7
2939008270000001725100036182920694027 R030007106040000                         
2941677890000000068000036001629351020 R030007105976000                         
2954673550000000234200036001620285029 R030007105655000                         
2956336840000000038650036001620285029 R030007105697000                         
2956389380000000048000036001620285029 R030007105593000

Franklin52 · August 12, 2009, 3:19am

This should work:

awk -F" |_" 'NR==FNR && /^2/{a[substr($0,40,15)]=$0;next}
FILENAME=="CC29072009_CXXXCU01.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out1"}
FILENAME=="CC04082009_CXXXCU04.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out4"}
FILENAME=="CC25072009_CXXXCU07.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out7"}
' file CC29072009_CXXXCU01.rnd CC04082009_CXXXCU04.rnd CC25072009_CXXXCU07.rnd

Use nawk or /usr/xpg4/bin/awk on Solaris if you get errors.

Regards

ss_ss · August 12, 2009, 4:17am

I'm a bit confused: do I need to put this in a while loop?

Secondly, the big file and the three small files are in different paths.

Franklin52 · August 12, 2009, 4:28am

You can copy and paste the code in a file and make it executable:

#!/bin/sh

awk -F" |_" 'NR==FNR && /^2/{a[substr($0,40,15)]=$0;next}
FILENAME=="CC29072009_CXXXCU01.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out1"}
FILENAME=="CC04082009_CXXXCU04.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out4"}
FILENAME=="CC25072009_CXXXCU07.rnd" && /^2/ && a[$3]{print a[$3] > "rnd.out7"}
' file CC29072009_CXXXCU01.rnd CC04082009_CXXXCU04.rnd CC25072009_CXXXCU07.rnd

Use full path names if the files are in different paths.

Regards
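Note: with long path names, hard-coding every name in a FILENAME test becomes awkward. One alternative, which is not what the code above does, is to count the input files as awk opens them and derive the output name from that count. The sketch below assumes the id sits at columns 40-54 in every file. The directory names, the file names and the rnd.out1/rnd.out2 numbering are made up for the demo.

```shell
cd "$(mktemp -d)" || exit 1
mkdir big small   # hypothetical directories, only for this demo

# Two detail records in the "big" file, one matching record in each small file.
printf '%s\n' \
  '2526716790000008947850036448540401014 R030007150692000' \
  '2511131960000000100000036008715245008 R030007133681000' > big/merged
printf '%s\n' '2526716790000008947850036448540401014  030007150692000' > small/a.rnd
printf '%s\n' '2511131960000000100000036008715245008  030007133681000' > small/b.rnd

awk '
FNR == 1 { n++ }                      # n = ordinal of the file being read
n == 1   { if (/^2/) big[substr($0, 40, 15)] = $0; next }
/^2/     { id = substr($0, 40, 15)
           if (id in big) print big[id] > ("rnd.out" (n - 1)) }
' big/merged small/a.rnd small/b.rnd

cat rnd.out1 rnd.out2
```

Because `n` only advances when a file yields a first line, an empty input file would shift the numbering. The FILENAME tests do not have that problem.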

ss_ss · August 12, 2009, 5:00am

Could you please tell me where, in the code you provided, the comparison with the big file happens?

Here is my complete code, with your lines added.

#!/bin/sh

awk -F" |_" 'NR==FNR && /^2/{a[substr($0,40,15)]=$0;next}
FILENAME=="/arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC29072009_CELPCU01.dnr" && /^2/ && a[$3]{print a[$3] > "rnd.out1"}
FILENAME=="/arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC04082009_CELPCU04.dnr" && /^2/ && a[$3]{print a[$3] > "rnd.out4"}
FILENAME=="/arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC25072009_CELPCU07.dnr" && /^2/ && a[$3]{print a[$3] > "rnd.out7"}
' file CC29072009_CELPCU01.dnr CC04082009_CELPCU04.dnr CC25072009_CELPCU07.dnr

total_amnt_01=`awk '{a += (substr($1,10,12))}END{printf a}' rnd.out1`
total_amnt_06=`awk '{a += (substr($1,10,12))}END{printf a}' rnd.out4`
total_amnt_09=`awk '{a += (substr($1,10,12))}END{printf a}' rnd.out7`

rec_cnt_01=`(awk 'END{print NR}' rnd.out1_CU01)`
rec_cnt_06=`(awk 'END{print NR}' rnd.out4_CU04)`
rec_cnt_09=`(awk 'END{print NR}' rnd.out7_CU07)`

sed -n '2p' /arbor/FX/data/remote/cpm/output/WORK_TEMP/CTRL/ctrl_DINER >>  /arbor/FX/data/remote/cpm/input/WORK_TEMP/frDINER/tmp.1
sed -n '4p' /arbor/FX/data/remote/cpm/output/WORK_TEMP/CTRL/ctrl_DINER >>  /arbor/FX/data/remote/cpm/input/WORK_TEMP/frDINER/tmp.4
sed -n '6p' /arbor/FX/data/remote/cpm/output/WORK_TEMP/CTRL/ctrl_DINER >>  /arbor/FX/data/remote/cpm/input/WORK_TEMP/frDINER/tmp.7

cat tmp.1 rnd.out1_CU01 >> din_cel_rej_20090804_CU01
cat tmp.4 rnd.out4_CU04 >> din_cel_rej_20090804_CU04
cat tmp.7 rnd.out7_CU07 >> din_cel_rej_20090804_CU07

rm tmp.1 tmp.4 tmp.7 rnd.out1_CU01 rnd.out4_CU04 rnd.out7_CU07

Thanks & Regards

Franklin52 · August 12, 2009, 5:11am

The last line of the awk code:

' file CC29072009_CELPCU01.dnr CC04082009_CELPCU04.dnr CC25072009_CELPCU07.dnr

should be, assuming the merged file is in the current directory (otherwise specify the full path):

' nid_lec_rej_20090804_merged /arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC29072009_CELPCU01.dnr /arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC04082009_CELPCU04.dnr /arbor/FX/data/remote/cpm/output/WORK_TEMP/toDINER/CC25072009_CELPCU07.dnr
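On the earlier question of where the comparison with the big file happens: it is the `a[$3]` lookup. The first rule stores every detail record of the first file in the array `a`, keyed by its id. The later rules look up the id of each small-file record in that array. The following is a commented sketch of the same awk code, run on trimmed copies of the sample data from the first post. The file name `merged` and the shortened inputs are only for illustration.

```shell
cd "$(mktemp -d)" || exit 1

cat > merged <<'EOF'
10084MOCLEC         0408090061480739nid090804132259.03.148990533
2526716790000008947850036448540401014 R030007150692000
2511131960000000100000036008715245008 R030007133681000
3000150000012287605000001994675
EOF
cat > CC29072009_CXXXCU01.rnd <<'EOF'
10020MOCLEC         2907090061480739nid090729181916.03.147814552
2526716790000008947850036448540401014  030007150692000
300005000000945725
EOF
cat > CC04082009_CXXXCU04.rnd <<'EOF'
10020MOCLEC         0408090061480739nid090804132259.06.148990533
2511131960000000100000036008715245008  030007133681000
300005000000071640
EOF

awk -F" |_" '
# NR==FNR is true only while the first file (the big one) is being read.
# Every detail record is stored in a[], keyed by the 15-digit id.
NR==FNR && /^2/ {a[substr($0,40,15)]=$0; next}
# In a small file $3 is the id (the two blanks leave $2 empty).
# a[$3] is non-empty only when the big file had that id: this is the comparison.
FILENAME=="CC29072009_CXXXCU01.rnd" && /^2/ && a[$3] {print a[$3] > "rnd.out1"}
FILENAME=="CC04082009_CXXXCU04.rnd" && /^2/ && a[$3] {print a[$3] > "rnd.out4"}
' merged CC29072009_CXXXCU01.rnd CC04082009_CXXXCU04.rnd

cat rnd.out1 rnd.out4
```

FILENAME holds each name exactly as it was typed on the command line, which is why the tests must use the same full paths as the arguments.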

ss_ss · August 12, 2009, 5:26am

Thanks a lot for your input; it's working absolutely fine.

My mistake, I overlooked the "file" argument.

Thanks & Regards

Franklin52 · August 12, 2009, 5:38am

You're welcome, glad to hear you got it working!

Regards

ss_ss · August 12, 2009, 6:58am

Sorry to trouble you again. I generalized the script, and now the output files are created but contain only the header record, i.e. the detail and trailer records are missing.

#! /usr/bin/ksh

program_name=$0
program_name=`echo $program_name | sed -e 's/.*\///'`

function usage
{
    echo
    echo $*

    cat << EOF

$program_name  [-options value]

Valid options:

              [-b Bank Name]      run for all banks  if not specified

EOF
exit 1
}

# get arguments from command line
while [ $# -gt 0 ]
do
    case $1 in
      -b)
           [ "$2" = "" ] && usage "no value for option $1"
           read_bank=$2
           shift 2
           ;;
       *)
           echo "no such option $1"
           usage
    esac
done

# Main CPM directories
cpm_base=/arbor/FX/data/remote/cpm
cpm_out=${cpm_base}/output
cpm_in=${cpm_base}/input
cpm_bak=${cpm_base}/BACKUP
cpm_work_in=${cpm_in}/WORK_TEMP
cpm_work_out=${cpm_out}/WORK_TEMP
ctrl_dir=${cpm_work_out}/CTRL
log=/SYSTEM/custom/data/log/CPM/cpm_merge_log.`date '+%Y%m%d%H%M%S'`
echo "logs created in $log"

# Function to check success
function check_status
{
 if [ $? -ne 0 ] ; then
    echo "Check directory permissions, files not able to copied or deleted; exiting main program ......... "
    exit 1
 fi
}

# Function to add leading zeroes to numbers
function leading_zeroes
{
  sum=$1
  ln=`echo $sum|awk '{print length}'`
  nr=$2
  zero=`expr $nr - $ln`
  i=1

  while [ $i -le $zero ]
   do
   sum="x${sum}"
   i=`expr $i + 1`
  done

  echo $sum |sed 's/x/0/g'
}

# Function to add trailing blanks to trailer
function trailing_blanks
{
  blank=$1
  i=1
  sum=""
    while [ $i -le $blank ]
   do
   sum="x${sum}"
   i=`expr $i + 1`
  done

  echo $sum |sed 's/x/ /g'
}

function split_files
{
  ch_name=$1
  exp=`echo $ch_name|tr 'A-Z' 'a-z' `
  #ctrl_file=${ctrl_dir}/ctrl_${ch_name}

  cd ${cpm_work_in}/fr${ch_name}
  file_id=`ls -trC1|grep -v .gz|awk /$exp[[:digit:]]{12}.*\.sd/|tail -1`

  cd ${cpm_work_out}/to${ch_name}
  if [ "$exp" = "cob" -o  "$exp" = "amx" ] ; then
     file_id3=`ls -trC1|awk /$exp[[:digit:]]{12}\.03\./|tail -1`
     file_id6=`ls -trC1|awk /$exp[[:digit:]]{12}\.06\./|tail -1`
     file_id9=`ls -trC1|awk /$exp[[:digit:]]{12}\.09\./|tail -1`
  else
     file_id3=`ls -trC1|awk /CC[[:digit:]]{8}_CELPCU01\..../|tail -1`
     echo $file_id3
     file_id6=`ls -trC1|awk /CC[[:digit:]]{8}_CELPCU04\..../|tail -1`
     echo $file_id6
     file_id9=`ls -trC1|awk /CC[[:digit:]]{8}_CELPCU07\..../|tail -1`
     echo $file_id9
  fi

ready_dir=${cpm_in}/fr${ch_name}/ready

awk -F" |_" 'NR==FNR && /^2/{a[substr($0,40,15)]=$0;next}
FILENAME=="${cpm_work_out}/to{$ch_name}/${file_id3}" && /^2/ && a[$3]{print a[$3] > "${ready_dir}/rnd.out1"}
FILENAME=="${cpm_work_out}/to{$ch_name}/${file_id6}" && /^2/ && a[$3]{print a[$3] > "${ready_dir}/rnd.out4"}
FILENAME=="${cpm_work_out}/to{$ch_name}/${file_id9}" && /^2/ && a[$3]{print a[$3] > "${ready_dir}/rnd.out7"}
' ${cpm_work_in}/fr${ch_name}/$file_id ${cpm_work_out}/to{$ch_name}/$file_id3 ${cpm_work_out}/to{$ch_name}/$file_id6 ${cpm_work_out}/to{$ch_name}/$file_id9

total_amnt_01=`awk '{a += (substr($1,10,12))}END{printf a}' ${ready_dir}/rnd.out1`
total_amnt_06=`awk '{a += (substr($1,10,12))}END{printf a}' ${ready_dir}/rnd.out4`
total_amnt_09=`awk '{a += (substr($1,10,12))}END{printf a}' ${ready_dir}/rnd.out7`

rec_cnt_01=`(awk 'END{print NR}' ${ready_dir}/rnd.out1)`
rec_cnt_04=`(awk 'END{print NR}' ${ready_dir}/rnd.out4)`
rec_cnt_07=`(awk 'END{print NR}' ${ready_dir}/rnd.out7)`

sed -n '2p' ${ctrl_dir}/ctrl_${ch_name} >>  ${ready_dir}/tmp.1
sed -n '4p' ${ctrl_dir}/ctrl_${ch_name} >>  ${ready_dir}/tmp.4
sed -n '6p' ${ctrl_dir}/ctrl_${ch_name} >>  ${ready_dir}/tmp.7

cat ${ready_dir}/tmp.1 ${ready_dir}/rnd.out1 >> ${ready_dir}/din_cel_rej_20090804_CU01
cat ${ready_dir}/tmp.4 ${ready_dir}/rnd.out4 >> ${ready_dir}/din_cel_rej_20090804_CU04
cat ${ready_dir}/tmp.7 ${ready_dir}/rnd.out7 >> ${ready_dir}/din_cel_rej_20090804_CU07

rm ${ready_dir}/tmp.1
rm ${ready_dir}/tmp.4
rm ${ready_dir}/tmp.7
rm ${ready_dir}/rnd.out1
rm ${ready_dir}/rnd.out4
rm ${ready_dir}/rnd.out7

count_3=`leading_zeroes $rec_cnt_01 5`
count_6=`leading_zeroes $rec_cnt_04 5`
count_9=`leading_zeroes $rec_cnt_07 5`

amount_3=`leading_zeroes $total_amnt_01 12`
amount_6=`leading_zeroes $total_amnt_06 12`
amount_9=`leading_zeroes $total_amnt_09 12`
tr=3
filler=`trailing_blanks 61`

echo "${tr}${count_3}${amount_3}${filler}" >>  ${ready_dir}/din_cel_rej_20090804_CU01
echo "${tr}${count_6}${amount_6}${filler}" >>  ${ready_dir}/din_cel_rej_20090804_CU04
echo "${tr}${count_9}${amount_9}${filler}" >>  ${ready_dir}/din_cel_rej_20090804_CU07
}

#Main Program start
{
case $read_bank in

         DINER|diner)
                 echo "Splitting files for DINER only..............."
                 split_files DINER
esac

if [ "$read_bank" = "" ] ; then
# ftp scripts to be added  here for all 4 banks
split_files DINER
fi
} >> $log

echo "logs created in $log "
exit 0

And I'm unable to understand this error:

awk: Input line � cannot be longer than 3,000 bytes.

Franklin52 · August 12, 2009, 7:22am

That's an input line length limit in some older awk implementations; use (install) gawk or mawk.

Regards
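Note: apart from the line length limit, a likely reason for the missing detail records is quoting. In the generalized script the awk program sits inside single quotes, where the shell expands nothing, so awk compares FILENAME with the literal text `${cpm_work_out}/...` and the test is never true. The script also writes `to{$ch_name}` where `to${ch_name}` was probably meant. A minimal sketch of passing the values in with `-v` follows. The directories and file names are hypothetical, and `-v` needs nawk, /usr/xpg4/bin/awk, gawk or mawk rather than the old Solaris awk.

```shell
cd "$(mktemp -d)" || exit 1
dir=$PWD/in        # hypothetical directories for the demo
out=$PWD/ready
mkdir "$dir" "$out"
printf '%s\n' '2511131960000000100000036008715245008 R030007133681000' > "$dir/merged"
printf '%s\n' '2511131960000000100000036008715245008  030007133681000' > "$dir/small.dnr"

# Inside single quotes ${dir} stays literal, so this prints nothing.
literal=$(awk 'FILENAME=="${dir}/small.dnr"' "$dir/small.dnr")

# With -v the shell values become ordinary awk variables.
awk -F" |_" -v small="$dir/small.dnr" -v outfile="$out/rnd.out4" '
NR==FNR { if (/^2/) a[substr($0,40,15)]=$0; next }
FILENAME==small && /^2/ && ($3 in a) { print a[$3] > outfile }
' "$dir/merged" "$dir/small.dnr"

cat "$out/rnd.out4"
```

The value given to `-v small=` has to be the same string as the file argument, because FILENAME is compared character by character.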

ss_ss · August 12, 2009, 8:40am

But the generalized version has fewer awk input lines than the specific one, since all the paths have been moved into variables.

---------- Post updated at 04:40 AM ---------- Previous update was at 03:50 AM ----------

I then replaced the file name and path variables with further variables, as shown below:

var="${cpm_work_in}/fr${ch_name}/$file_id"
var1="${cpm_work_out}/to${ch_name}/$file_id3"
var4="${cpm_work_out}/to${ch_name}/$file_id6"
var7="${cpm_work_out}/to${ch_name}/$file_id9"

# -v hands the shell values to awk; inside the single quotes the shell expands nothing
awk -F" |_" -v f1="$var1" -v f4="$var4" -v f7="$var7" -v out="$ready_dir" '
NR==FNR && /^2/{a[substr($0,40,15)]=$0;next}
FILENAME==f1 && /^2/ && a[$3]{print a[$3] > (out "/din.out1")}
FILENAME==f4 && /^2/ && a[$3]{print a[$3] > (out "/din.out4")}
FILENAME==f7 && /^2/ && a[$3]{print a[$3] > (out "/din.out7")}
' "$var" "$var1" "$var4" "$var7"

and that resolved the problem.

Thanks & Regards
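Note: the trailer record that the script assembles with leading_zeroes and trailing_blanks can also be built in one step with awk's printf. This is a minimal sketch, assuming the layout used in the script: record type 3, a 5-digit record count, a 12-digit amount total taken from columns 10-21, and 61 blanks. The input is the rnd.out1 sample from the thread.

```shell
cd "$(mktemp -d)" || exit 1
cat > rnd.out1 <<'EOF'
2526716790000008947850036448540401014 R030007150692000
2535502720000000010100036165742685000 R030007150354000
2554132380000000298300036428156061013 R030007150082000
2608117990000000145250036428153472007 R030007148586000
2612547640000000055750036452910607010 R030007148076000
EOF

# %05d and %012.0f pad with leading zeroes; %61s with an empty string gives 61 blanks.
# %.0f is used for the total to stay clear of integer-width limits in some awks.
trailer=$(awk '{sum += substr($0,10,12)}
               END {printf "3%05d%012.0f%61s\n", NR, sum, ""}' rnd.out1)

echo "[$trailer]"
```

For this sample the non-blank part is 300005000000945725, the same count and total as the trailer of CC29072009_CXXXCU01.rnd in the first post.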