Performance issue in UNIX while generating .dat file from large text file

Hello Gurus,

We are facing a performance issue in UNIX. If anyone has faced this kind of issue in the past, please provide your suggestions.

Problem Definition:
/*****************************************
A few of the load processes of our Finance Application are facing an issue in UNIX when they use a shell script containing the portion of code below. The code reads an input file and writes the records into a .dat file. The performance issue arises when there is a huge volume of data in the input file.
For example: a data volume of 200,000 records takes 38 minutes to be appended/written to the .dat file, which increases the complete load process timing. We need to improve the performance of this process by reducing the time it takes to append/write the records.
/*****************************************

Portion of Code from Shell Script:
/***********************************************************************************************************************************************
m_arr_ctr=1
cat ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq} |while read d92_line
do
m_brch_cd=`echo "${d92_line}" |cut -c166-168`
# This is the case when we reach the last line '*/'; cut returns nothing for it, so we just skip that line
if [ "${m_brch_cd}" = "" ]
then
continue
fi
if [ "${m_brch_cd}" = "400" ]
then
m_jv_cd=`echo "${d92_line}" |cut -c190-192`
else
m_jv_cd=${m_brch_cd}
fi
if [ ! -s tmp_d92${m_brch_cd}z${m_jv_cd} ]
then
echo "TMP" > tmp_d92${m_brch_cd}z${m_jv_cd}
m_a_d92_list[$m_arr_ctr]=tmp_d92${m_brch_cd}z${m_jv_cd}
m_a_d92_files[$m_arr_ctr]=${m_recv_dir}/gd${m_brch_cd}x${m_jv_cd}${m_glb_rate_cd}.dat
m_arr_ctr=`expr $m_arr_ctr + 1`
m_touched="N"
else
m_touched="Y"
fi
if [ "$m_touched" = "N" ]
then
echo "${d92_line}" > ${m_recv_dir}/gd${m_brch_cd}${m_jv_cd}${m_glb_rate_cd}.dat
else
echo "${d92_line}" >> ${m_recv_dir}/gd${m_brch_cd}${m_jv_cd}${m_glb_rate_cd}.dat
fi

done
for m_file_name in `echo ${m_a_d92_files[*]}`
do
if [[ `grep "*/" ${m_file_name} | wc -l` = 0 ]]
then
echo "/" >> ${m_file_name}
fi
done
for m_file_name in `echo ${m_a_d92_list[*]}`
do
rm -f $m_file_name
done
/************************************

Please provide your valuable suggestions. Also, is there any way to use the SED command to append the output in a faster way?

With a file that size, you should really be using awk.

Please put code inside [code] tags.


m_arr_ctr=1
cat ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq} |while read d92_line

That cat is an unnecessary external command, but since it is only run once, eliminating it will make very little difference.
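
If you do want to remove it, the usual form is to redirect the file into the loop (the body stays the same):

while read d92_line
do
        : # existing loop body goes here, unchanged
done < ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}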

Part of the slowness is due to calling multiple external commands (many of which are unnecessary: there's no need for expr as the shell can do its own arithmetic) for every line.
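
For example, the expr call at the bottom of the loop can be replaced with the shell's built-in arithmetic, which works in ksh88 as well:

# instead of: m_arr_ctr=`expr $m_arr_ctr + 1`
m_arr_ctr=$((m_arr_ctr + 1))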

What shell are you using? If it's bash or ksh93, you can replace the call to cut:

m_brch_cd=${d92_line:165:3}

An unnecessary subshell (here and later) can add a significant amount of time. Use:

for m_file_name in "${m_a_d92_files[@]"

You don't need wc as well as grep:

if grep "*/" ${m_file_name} > /dev/null
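
Only the exit status matters, so the whole check-and-append can even be a one-liner (untested):

grep "*/" ${m_file_name} > /dev/null || echo "*/" >> ${m_file_name}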

Hi johnson,

Thanks for your advice. I will try to implement your suggestions and will look at the performance. Also, the shell used here is ksh.

Which version of ksh?

Hi John,

The ksh version is 88f. I implemented the commands you gave, but the one removing cut (i.e. m_brch_cd=${d92_line:165:3}) did not work; as you said, it only works for ksh93. The rest of the commands did not improve the performance much (they improved it by 1-2 minutes). Can you please help me with a suggestion for using AWK? I am very new to AWK.
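
By the way, ksh88 does not have the ${var:offset:length} expansion, but typeset -L/-R can extract a fixed column range without forking cut. An untested sketch:

typeset -L168 m_left   # left-justified, width 168: keeps columns 1-168
typeset -R3 m_brch_cd  # right-justified, width 3: keeps the last 3 of those
m_left=$d92_line
m_brch_cd=$m_left      # m_brch_cd now holds columns 166-168

Note that short lines (such as the trailing '*/') get blank-padded to the field width, so the "skip empty" test would have to treat three spaces like an empty string.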

Please describe exactly what the script needs to do.

What files does it use for input? What is the format of those files?

What is the format of the output?

Hi John,

Please find the answers as below:

Please describe exactly what the script needs to do.
This script splits the data from Detail files (the input files for the shell script, in .txt format). In this case the Detail file is located at ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}, which is in the starting portion of the code I posted.

This script reads the data line by line from the text file and prepares the output .DAT file.

Once the .DAT file is created, a '*/' end-of-file character is put at the bottom of the generated output file. Once the .DAT output file is generated, another shell script loads the data from these .DAT files into work tables of the database using SQL Loader.

What files does it use for input? What is the format of those files?

The format of the input file is .txt

What is the format of the output?

The output format is .DAT

Please do let me know what other information you need so you can help me on this.

.txt and .DAT tell me nothing about the format of the files.

What does a line from the .txt file look like?

What has to be done to it to prepare it for the .DAT file?

Hi John,
Please find the answer as below:

What does a line from the .txt file look like?

A record from the input text file looks like this:

0 1509999999900000002A200811AA 0 108012121315LREPO 150
*/

The name of the input file looks like this : glbd92_1000112008_0402110932

What has to be done to it to prepare it for the .DAT file?

I am just redirecting the output to a file with a .DAT extension. There is no conversion of the data from .txt to .DAT; the script simply creates a new .DAT file or APPENDS to an existing .DAT file based on the loop conditions.

If you need more details of the script, please do let me know.

In your script you use cut to get characters at columns 166 to 168 of some line. That line is not that long. What were you trying to do?

What are those "loop conditions"?

If all you are doing is copying lines from one file to another, why do you need a complicated script?

Hi John,

This is an actual record from the input .TXT file.

002000012008AA01000 10000405 7010000
150609 Y G5 PRD 000000000000000.00
000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000 000000000000000.00000000
000000000000000.00000000 000000000000000.00000000-000000000000100.00000000 000000000000000.00000000 000000002104734.00000000 000000000000000.00000000
LREPO ZGIFBM_GL_108012121315

The script was not written by me; it was written by someone else, and I have to enhance it to improve the performance. The looping conditions are important because there is one Branch/site, '400', for which they want to cut the characters from the input record and keep them as the branch.

Can you please let me know whether it is possible to avoid the m_arr_ctr and to generate/append the output in a batch instead of line by line? Or how to apply awk there?

Can anyone help me on this...

Try using perl. It was designed for fast text processing.

tyler_durden

It will be easier for us to suggest solutions if you could lay down the input file structure (which you already did), tell us the logic of what you want to achieve, and then give an output sample for the said inputs.

cheers,
Devaraj Takhellambam

So was awk, and it is much easier to learn, and awk scripts are much easier to understand.

As far as I can see this section of code reads all the output data files to find out if they contain a '*/' and then appends a '*/' if there isn't one present.

Earlier in the script we apparently ignored the last line '*/' in the input stream (not proven that that bit of code works).

Providing that '*/' was properly ignored in the input stream (an area of the script which could be improved by using grep -v '^\*/' instead of the very first cat), it is impossible for a '*/' to appear in any of the output files. We can therefore halve the run time by not re-reading the output data before appending the '*/'.

for m_file_name in `echo ${m_a_d92_files[*]}`
do
                echo "*/" >> ${m_file_name}
done

Untested.

Using echo is unnecessary and will break the script if any member of m_a_d92_files[*] contains whitespace.

for m_file_name in "${m_a_d92_files[@]}"

cafjohnson.
Agreed. The script has many areas which could be improved. Apparently the script works with the files provided, but takes too long.
I looked at whether the "card dealing" method for splitting the data could be improved without using a high level language, but there is insufficient information about the data type distribution and no rules stated about the processing order of the data. As far as I can see the core script is slow because it appends to multiple output files.

The script is slow because you are using the shell on a very large file; that is exacerbated by a number of inefficient constructs and poorly written code.

If I knew exactly what you are trying to do, I could suggest an awk script.
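
That said, based on the code you posted, something along these lines might be close. It is untested, and it assumes the column positions (166-168 and 190-192) and the gd<branch>x<jv><rate>.dat naming convention from your script:

awk -v dir="${m_recv_dir}" -v rate="${m_glb_rate_cd}" '
/^\*\//  { next }                        # skip the trailing */ line
{
        brch = substr($0, 166, 3)        # columns 166-168; "" if the line is short
        if (brch == "") next
        jv = (brch == "400") ? substr($0, 190, 3) : brch
        out = dir "/gd" brch "x" jv rate ".dat"
        seen[out] = 1
        print > out                      # ">" truncates on first open, then appends
}
END {
        for (f in seen)
                print "*/" > f           # add the end-of-file marker once per file
}' ${m_recv_dir}/${m_glb_d92_nm}${m_glb_file_seq}

A single awk process avoids forking cut and expr for every record, which is where most of the 38 minutes is going. One caveat: some older awk implementations limit the number of files open at once, so if there are many branch/JV combinations you may need nawk (or the POSIX awk on your platform), or calls to close() once a file is finished.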

Please post a few records of the input file so we can help you.