Performance issue in UNIX while generating .dat file from large text file

Hello Gurus,

We are facing a performance issue in UNIX. If anyone has faced this kind of issue in the past, please provide your suggestions.

Problem Definition:
/*****************************************
A few of the load processes of our Finance Application are facing an issue in UNIX when they use a shell script containing the portion of code below. The code reads an input file and writes the records into a .dat file. The performance issue arises when there is a huge volume of data in the input file.
For example: a data volume of 200,000 records takes 38 minutes to be appended/written to the .dat file, which increases the complete load process timing. We need to improve the performance of this process by reducing the time it takes to append/write the records.
/*****************************************

Portion of Code from Shell Script:
/***********************************************************************************************************************************************
m_arr_ctr=1
cat ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq} |while read d92_line
do
m_brch_cd=`echo "${d92_line}" |cut -c166-168`
# This is the case when we reach the last line '*/'; cut returns nothing for it, so we just skip that line
if [ "${m_brch_cd}" = "" ]
then
continue
fi
if [ "${m_brch_cd}" = "400" ]
then
m_jv_cd=`echo "${d92_line}" |cut -c190-192`
else
m_jv_cd=${m_brch_cd}
fi
if [ ! -s tmp_d92${m_brch_cd}z${m_jv_cd} ]
then
echo "TMP" > tmp_d92${m_brch_cd}z${m_jv_cd}
m_a_d92_list[$m_arr_ctr]=tmp_d92${m_brch_cd}z${m_jv_cd}
m_a_d92_files[$m_arr_ctr]=${m_recv_dir}/gd${m_brch_cd}x${m_jv_cd}${m_glb_rate_cd}.dat
m_arr_ctr=`expr $m_arr_ctr + 1`
m_touched="N"
else
m_touched="Y"
fi
if [ "$m_touched" = "N" ]
then
echo "${d92_line}" > ${m_recv_dir}/gd${m_brch_cd}${m_jv_cd}${m_glb_rate_cd}.dat
else
echo "${d92_line}" >> ${m_recv_dir}/gd${m_brch_cd}${m_jv_cd}${m_glb_rate_cd}.dat
fi

done
for m_file_name in `echo ${m_a_d92_files[*]}`
do
if [[ `grep "*/" ${m_file_name} | wc -l` = 0 ]]
then
echo "/" >> ${m_file_name}
fi
done
for m_file_name in `echo ${m_a_d92_list[*]}`
do
rm -f $m_file_name
done
/************************************

Please provide your valuable suggestions. Also, is there any way to use the SED command to append the output in a faster way?

With a file that size, you should really be using awk.

Please put code inside [code] tags.


m_arr_ctr=1
cat ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq} |while read d92_line

That cat is an unnecessary external command, but since it is only run once, eliminating it will make very little difference.
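
If you do want to remove it, the usual form is to redirect the file into the loop (the body stays the same):

while read d92_line
do
        : # existing loop body goes here, unchanged
done < ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}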

Part of the slowness is due to calling multiple external commands (many of which are unnecessary: there's no need for expr as the shell can do its own arithmetic) for every line.
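
For example, the expr call at the bottom of the loop can be replaced with the shell's built-in arithmetic, which works in ksh88 as well:

# instead of: m_arr_ctr=`expr $m_arr_ctr + 1`
m_arr_ctr=$((m_arr_ctr + 1))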

What shell are you using? If it's bash or ksh93, you can replace the call to cut:

m_brch_cd=${d92_line:165:3}

An unnecessary subshell (here and later) can add a significant amount of time. Use:

for m_file_name in "${m_a_d92_files[@]"

You don't need wc as well as grep:

if grep "*/" ${m_file_name} > /dev/null
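
Only the exit status matters, so the whole check-and-append can even be a one-liner (untested):

grep "*/" ${m_file_name} > /dev/null || echo "*/" >> ${m_file_name}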

Hi johnson,

Thanks for your advice. I will try to implement your suggestions and will look at the performance. Also, the shell used here is ksh.

Which version of ksh?

Hi John,

The ksh version is 88f. I implemented the commands you gave, but the one removing cut (i.e. m_brch_cd=${d92_line:165:3}) did not work; as you said, it only works for ksh93. The rest of the commands did not improve the performance much (they improved it by 1-2 minutes). Can you please help me with a suggestion for using AWK? I am very new to AWK.
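
By the way, ksh88 does not have the ${var:offset:length} expansion, but typeset -L/-R can extract a fixed column range without forking cut. An untested sketch:

typeset -L168 m_left   # left-justified, width 168: keeps columns 1-168
typeset -R3 m_brch_cd  # right-justified, width 3: keeps the last 3 of those
m_left=$d92_line
m_brch_cd=$m_left      # m_brch_cd now holds columns 166-168

Note that short lines (such as the trailing '*/') get blank-padded to the field width, so the "skip empty" test would have to treat three spaces like an empty string.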

Please describe exactly what the script needs to do.

What files does it use for input? What is the format of those files?

What is the format of the output?

Hi John,

Please find the answers as below:

Please describe exactly what the script needs to do.
This script splits the data from Detail files (the input files for the shell script, in .txt format). In this case the Detail file is located at ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}, which is in the starting portion of the code I posted.

This script reads the data line by line from the text file and prepares the output .DAT file.

Once the .DAT file is created, a '*/' end-of-file character is put at the bottom of the generated output file. Once the .DAT output file is generated, another shell script loads the data from these .DAT files into work tables of the database using SQL Loader.

What files does it use for input? What is the format of those files?

The format of the input file is .txt

What is the format of the output?

The output format is .DAT

Please do let me know what other information you need so you can help me on this.

.txt and .DAT tell me nothing about the format of the files.

What does a line from the .txt file look like?

What has to be done to it to prepare it for the .DAT file?

Hi John,
Please find the answer as below:

What does a line from the .txt file look like?

A record from the input text file looks like this:

0 1509999999900000002A200811AA 0 108012121315LREPO 150
*/

The name of the input file looks like this : glbd92_1000112008_0402110932

What has to be done to it to prepare it for the .DAT file?

I am just redirecting the output to a file with a .DAT extension. There is no conversion of the data from .txt to .DAT; the script simply creates a new .DAT file or APPENDS to an existing .DAT file based on the loop conditions.

If you need more details of the script, please do let me know.

In your script you use cut to get characters at columns 166 to 168 of some line. That line is not that long. What were you trying to do?

What are those "loop conditions"?

If all you are doing is copying lines from one file to another, why do you need a complicated script?

Hi John,

This is an actual record from the input .TXT file.

002000012008AA01000 10000405 7010000
150609 Y G5 PRD 000000000000000.00
000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000
000000000000000.00000000 000000000000000.00000000-000000000000100.00000000 000000000000000.00000000 000000002104734.00000000 000000000000000.00000000
LREPO ZGIFBM_GL_108012121315

The script was not written by me; it was written by someone else, and I have to enhance it to improve the performance. The looping conditions are important because there is one Branch/site, '400', for which they want to cut the characters from the input record and keep them as the branch.

Can you please let me know whether it is possible to avoid the m_arr_ctr and to generate/append the output in a batch instead of line by line? Or how to apply awk there?

Can anyone help me on this...

Try using perl. It was designed for fast text processing.

tyler_durden

It will be easier for us to suggest solutions if you could lay down the input file structure (which you already did), tell us the logic of what you want to achieve, and then give an output sample for the said inputs.

cheers,
Devaraj Takhellambam

So was awk, and it is much easier to learn, and awk scripts are much easier to understand.

As far as I can see this section of code reads all the output data files to find out if they contain a '*/' and then appends a '*/' if there isn't one present.

Earlier in the script we apparently ignored the last line '*/' in the input stream (not proven that that bit of code works).

Providing that '*/' was properly ignored in the input stream (an area of the script which could be improved by using grep -v '^\*/' instead of the very first cat), it is impossible for a '*/' to appear in any of the output files. We can therefore halve the run time by not re-reading the output data before appending the '*/'.

for m_file_name in `echo ${m_a_d92_files[*]}`
do
                echo "*/" >> ${m_file_name}
done

Untested.

Using echo is unnecessary and will break the script if any member of m_a_d92_files[*] contains whitespace.

for m_file_name in "${m_a_d92_files[@]}"

cafjohnson.
Agreed. The script has many areas which could be improved. Apparently the script works with the files provided, but takes too long.
I looked at whether the "card dealing" method for splitting the data could be improved without using a high level language, but there is insufficient information about the data type distribution and no rules stated about the processing order of the data. As far as I can see the core script is slow because it appends to multiple output files.

The script is slow because you are using the shell on a very large file; that is exacerbated by a number of inefficient constructs and poorly written code.

If I knew exactly what you are trying to do, I could suggest an awk script.
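
That said, based on the code you posted, something along these lines might be close. It is untested, and it assumes the column positions (166-168 and 190-192) and the gd<branch>x<jv><rate>.dat naming convention from your script:

awk -v dir="${m_recv_dir}" -v rate="${m_glb_rate_cd}" '
/^\*\//  { next }                        # skip the trailing */ line
{
        brch = substr($0, 166, 3)        # columns 166-168; "" if the line is short
        if (brch == "") next
        jv = (brch == "400") ? substr($0, 190, 3) : brch
        out = dir "/gd" brch "x" jv rate ".dat"
        seen[out] = 1
        print > out                      # ">" truncates on first open, then appends
}
END {
        for (f in seen)
                print "*/" > f           # add the end-of-file marker once per file
}' ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}

A single awk process avoids forking cut and expr for every record, which is where most of the 38 minutes is going. One caveat: some older awk implementations limit the number of files open at once, so if there are many branch/JV combinations you may need nawk (or the POSIX awk on your platform), or calls to close() once a file is finished.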

Please post a few records of the input file so we can help you.