Performance issue in shell script

ureddy · July 8, 2014, 4:30am

Hi All,

I am facing a performance issue while running a Linux shell script.

I have file1 and file2. file1 is the source file and file2 is the lookup file. I need to replace text in file1 wherever a pattern from file2 matches.
The order of the lookup file is important: as soon as any entry matches, the loop should exit, with no further searching for that record, and continue with the next record.

file1
------
one|xxxx|111111NEW YORK|abcd
two|yyy|TEXAS 222222TEXASTEXAS|defg
three|zzzz|CALIFORNIA TEXAS TEXAS 3333 CALIFORNIA|defg
four|kkkk|DALLAS DALLAS|defg
 
file2
-----
NEW YORK,NY
CALIFORNIA,CA
TEXAS,TX

If the 1st field of a file2 record matches the 3rd field of a file1 record, then I need to do the following.

If the string is present only once, then don't replace the string; just add field 2 from the lookup and |2|N at the end of the line.
If the string is present more than once, then leave the first occurrence of the string, replace the remaining occurrences, and add |2|Y at the end of the line.

If there is no match, then just add a space and |2|N at the end of the line.

So output is below.

 
one|xxxx|111111NEW YORK|abcd|NY|2|N (NEW YORK matched but is present only once, so it is not replaced. Also, as a match was found, exit from the loop; no need to search and replace further.)
two|yyy|TEXAS 222222TXTX|defg|TX|2|Y (TEXAS is present more than once, so it is replaced from the 2nd occurrence onward, leaving the first occurrence.)
three|zzzz|CALIFORNIA TEXAS TEXAS 3333 CA|defg|CA|2|Y (Only the 2nd occurrence of CALIFORNIA is replaced. TEXAS is not replaced because once any match has been handled (CALIFORNIA), the remaining lookup entries are skipped and the loop exits.)
four|kkkk|DALLAS DALLAS|defg| |2|N (No match, so nothing is replaced.)

I have tested the code below and it works fine, but it is slow. It processes about 1 record per second and I have 1000000 records to process.
Can anyone help me tune this script?

CODE is below

echo "Replace the string matches only once or except FIRST occurence replace ALL." >>$LOG
tot_cnt=`wc -l < $REP_FILE_PATH/$REP_FILE`
del_tmp_files
 
while IFS='' read -r line; do    # IFS='' and read -r preserve leading and trailing spaces
    i=0
    while read rep_line; do
        field[1]=`cut -d',' -f1 <<<"$line"`
        field[2]=`cut -d',' -f2 <<<"$line"`
        cnt=`echo -n "$line" | grep -o "${field[1]}" | wc -l`
        if [[ "$cnt" -eq 1 ]] ; then
            sed -e "s/$/|${field[2]}|2|N/" <<<"$line" >> tmp.txt
            break
        fi
        if [[ "$cnt" -gt 1 ]] ; then
            sed -e "s/${field[1]}/${field[2]}/2g" -e "s/$/|${field[2]}|2|Y/" <<<"$line" >> tmp.txt
            break
        fi
        let i++
        if [[ "$cnt" -eq 0 && "$tot_cnt" -eq $i ]] ; then
            sed -e "s/$/| |2|N/" <<<"$line" >> tmp.txt
        fi
    done < file2.txt
done < file1.txt

rbatte1 · July 8, 2014, 5:41am

Welcome ureddy,

Please wrap your code & input/output in CODE tags. Highlight the text and press the CODE button or do this:-

```text
Here is my code
```

...to produce:-

Here is my code

The problem I think you are having is that you are starting many sub-processes for every line of your input file. Calls such as cut, sed, etc. each carry the cost of setting up a new process. If you are calling them in a loop, then every pass adds more calls, and over a large input file the total grows very quickly.
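
As a rough illustration of the point about process start-up cost, the lookup file can be split with the shell's own `read` builtin instead of two `cut` calls per line, and it can be read once rather than once per record. This is only a sketch, assuming bash and a two-column, comma-separated lookup file; `load_lookup`, `keys` and `vals` are illustrative names that do not appear in the original script.

```shell
#!/usr/bin/env bash
# Sketch: load a comma-separated lookup file into two parallel arrays
# using only shell builtins, so no cut/sed process is started per line.
# load_lookup, keys and vals are hypothetical names for illustration.
keys=()
vals=()

load_lookup() {
    local key val
    keys=()
    vals=()
    # IFS=, makes read split each line at the first comma:
    # text before it goes to key, everything after it to val.
    # The "|| [[ -n $key ]]" part keeps a last line that lacks a newline.
    while IFS=, read -r key val || [[ -n $key ]]; do
        [[ -n $key ]] || continue    # skip blank lines
        keys+=("$key")
        vals+=("$val")
    done < "$1"
}
```

Calling `load_lookup file2.txt` once before the main loop would leave the search strings in `keys` and their replacements in `vals`, in file order, so the inner loop can walk the arrays instead of re-reading the file.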

If you can wrap your code in CODE tags, then it will be far more readable and I will have a go at it.

Thanks, in advance,
Robin

ureddy · July 8, 2014, 6:27am

Thanks for looking into this Robin. I have added code as you specified now.

rbatte1 · July 8, 2014, 6:29am

Thanks for the update to mark the code. Can you do the same with the input and output? If there are multiple spaces, these get compressed when displayed as normal text - and that might be important.

For your inner loop reading file2.txt, where do you plan to use the value read in as rep_line? It's not used anywhere else in your script.

I'm also unclear about the if...then...break...fi section and what is actually required here. Are you simply looking to skip the remaining if...then sections? There are better ways to code that.

Can you write your logic out in words like this:-

For every line in file1.txt
  Read every line in file2.txt
  Compare them so that
    if condition A matches I take action A
    if condition B matches I take action B
    if condition C matches I take action C
    or I do action D
  I write the output built from the input lines in format F
  End loop
End loop

To get the bullet list, write your text first, then highlight the block and press the bullet list button. Having lists within lists produces the indentation to make it easier to read.

Thanks,
Robin

ureddy · July 8, 2014, 6:48am

..................

rbatte1 · July 8, 2014, 7:00am

Um, where did your post go?

If you post again, I will be happy to take a look.

Robin

ureddy · July 8, 2014, 7:02am

Actually, I was reformatting my code (removing temp files and parameters) to make it run faster, so it looks like I deleted "rep_line". Here is the corrected code.

echo "Replace the string matches only once or except FIRST occurence replace ALL."
tot_cnt=`wc -l < $REP_FILE_PATH/$REP_FILE`
del_tmp_files

while IFS='' read -r line; do    # IFS='' and read -r preserve leading and trailing spaces
    i=0
    while read line_1; do
        field[1]=`cut -d',' -f1 <<<"$line_1"`
        field[2]=`cut -d',' -f2 <<<"$line_1"`
        cnt=`echo -n "$line" | grep -o "${field[1]}" | wc -l`
        if [[ "$cnt" -eq 1 ]] ; then
            sed -e "s/$/|${field[2]}|2|N/" <<<"$line" >> tmp.txt
            break
        fi
        if [[ "$cnt" -gt 1 ]] ; then
            sed -e "s/${field[1]}/${field[2]}/2g" -e "s/$/|${field[2]}|2|Y/" <<<"$line" >> tmp.txt
            break
        fi
        let i++
        if [[ "$cnt" -eq 0 && "$tot_cnt" -eq $i ]] ; then
            sed -e "s/$/| |2|N/" <<<"$line" >> tmp.txt
        fi
    done < file2.txt
done < file1.txt

To answer your question: as soon as any entry matches, I need to stop searching for that line and continue with the next line.
Here is a description of what I'm doing.

For every line in file1.txt
 Read every line in file2.txt
 Compare them so that
  if the string is present only once, don't replace it; just add field 2 from file2 and |2|N at the end of the line from file1 and write the output to the temp file. Don't check the rest of the lines in file2.txt, as a match has already been found.
  if the string is present more than once, leave the first occurrence and replace the remaining occurrences in field 3 of file1 with field 2 of file2, add |2|Y at the end of the line from file1 and write the output to the temp file. Don't check the rest of the lines in file2.txt, as a match has already been found.
  if there is no match at all, just add a space and |2|N at the end of the line from file1 and write the output to the temp file.
 End loop
End loop
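
The steps above can be sketched using bash parameter expansion only, with no cut, grep, sed or wc per record. This is a hedged sketch rather than a tested replacement for the script: it assumes bash 4.3 or later, records with at least four |-separated fields, and literal (non-regex) search strings. The names `process_line`, `keys` and `vals` are made up for illustration, and the lookup table is hard-coded here where a real script would load it once from file2.txt.

```shell
#!/usr/bin/env bash
# Sketch only: apply the lookup rules to one record without external commands.
# process_line, keys and vals are hypothetical names; the table is hard-coded
# for illustration and would normally be loaded once from file2.txt.
keys=("NEW YORK" "CALIFORNIA" "TEXAS")
vals=("NY" "CA" "TX")

process_line() {
    local line=$1 f1 f2 f3 rest tail
    local i key val stripped count before after new3

    # Peel off fields 1-3; tail keeps everything after field 3.
    f1=${line%%|*}; rest=${line#*|}
    f2=${rest%%|*}; rest=${rest#*|}
    f3=${rest%%|*}; tail=${rest#*|}

    for i in "${!keys[@]}"; do
        key=${keys[i]} val=${vals[i]}
        [[ -n $key ]] || continue
        # Count non-overlapping occurrences: delete every copy of key
        # and divide the length lost by the length of key.
        stripped=${f3//"$key"/}
        count=$(( (${#f3} - ${#stripped}) / ${#key} ))
        if (( count == 1 )); then
            printf '%s|%s|%s|%s|%s|2|N\n' "$f1" "$f2" "$f3" "$tail" "$val"
            return
        elif (( count > 1 )); then
            before=${f3%%"$key"*}    # text before the first occurrence
            after=${f3#*"$key"}      # text after the first occurrence
            new3=$before$key${after//"$key"/"$val"}
            printf '%s|%s|%s|%s|%s|2|Y\n' "$f1" "$f2" "$new3" "$tail" "$val"
            return
        fi
    done
    # No lookup entry matched: append a space and the N flag.
    printf '%s|%s|%s|%s| |2|N\n' "$f1" "$f2" "$f3" "$tail"
}

# Possible use with the file names from the post:
#   while IFS= read -r line; do process_line "$line"; done < file1.txt > tmp.txt
```

Returning from the function at the first match plays the role of `break` in the posted script, and redirecting the whole loop to tmp.txt once avoids reopening the output file for every record.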

Sorry for the inconvenience in reading my post, as I'm a new user of this forum.

ureddy · July 10, 2014, 12:02am

Can anyone help me with this, please?

rbatte1 · July 11, 2014, 5:58am

Still working on this. I can't quite get the substitution logic right without calling external programs. I'd like to do it all in the same shell to save the processing overhead. The day job keeps getting in the way too.

Robin