Need to reduce the execution time

We are trying to execute the script below to find the number of occurrences of particular words in a log file. Need suggestions to optimize the script.

Test.log size - approx. 500 to 600 MB

$wc -l Test.log

16609852 Test.log

po_numbers - 11k to 12k POs to search

$more po_numbers

xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on 

Current Execution Time - 2.45 hrs

while IFS= read -r po
do
  check=$(grep -c "PO_NUMBER=$po" Test.log)
  echo "$po --> $check" >> list3

  if [ "$check" = "0" ]
  then
    echo "$po" >> po_to_server
  #else break
  fi
done < po_numbers

Welcome KumarPiyush7225,

I have a few questions to pose in response first:

  • What OS, shell and version are you using?
  • What logical process have you considered? (to help steer us to follow what you are trying to achieve)
  • What are the rules for causing an alert? Is it simply the record being mentioned 3 or more times?
  • What is the format of the Test.log file?

Looking at the code, for every record in po_numbers you are reading the full Test.log file. That is the target - reduce the number of times you read a huge file.
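As a rough illustration of the single-pass idea (a sketch, assuming GNU grep and that the log contains literal PO_NUMBER=&lt;po&gt; tokens): prepend the fixed prefix to each PO once, then scan Test.log one time with grep reading all patterns from a file.

```shell
# Build the list of fixed search strings once, then scan Test.log a single time.
# -F = fixed strings (no regex), -o = print each match on its own line,
# -f = read the patterns from a file.
sed 's/^/PO_NUMBER=/' po_numbers > patterns
grep -oFf patterns Test.log | sort | uniq -c
```

This prints one line per PO actually found, with its occurrence count; POs absent from the output were not found in the log.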

There are probably many ways to achieve most tasks, so giving us an idea of your style and thoughts will help us guide you to an answer most suitable to you so you can adjust it to suit your needs in future.

Kind regards,
Robin

Thanks for your response.

  • Linux simtosp81 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
  • If a PO in the po_numbers file does not exist in Test.log, then put that PO in the po_to_server file.
  • If the pattern PO_NUMBER=$po matches, it's OK; if not, print that PO to the po_to_server file.
  • Test.log is a human-readable text file; the pattern PO_NUMBER=$po can appear at any position/row of the log file.

Something like this?

awk -v s='^PO_NUMBER=' '
  NR==FNR {
    A[$1]
    next
  }
  {
    for(i=1; i<=NF; i++)
      if (sub(s,x,$i))
        if ($i in A)
          C[$i]++
  } 

  END {
    for(i in C) print i " --> " C[i]
  }
' po_numbers Test.log  > list3

or with some shell-fu:

grep -oFf <(sed 's/^/PO_NUMBER=/' po_numbers) Test.log |
sort |
uniq -c |
sed 's/^ *\([0-9]*\) PO_NUMBER=\(.*\)/\2 --> \1/' > list3

(Note: plain sort, not sort -u, since sort -u would collapse the duplicates before uniq -c can count them.)

Another approach in python:-

# Load the PO numbers to look for.
match = set(line.strip() for line in open('po_numbers'))

count = {}

# Stream the log line by line instead of reading 600 MB at once;
# tokens look like "PO_NUMBER=AB1085", so strip the prefix before the lookup.
with open('Test.log') as log:
    for line in log:
        for token in line.split():
            if token.startswith('PO_NUMBER='):
                po_num = token[len('PO_NUMBER='):]
                if po_num in match:
                    count[po_num] = count.get(po_num, 0) + 1

with open('po_to_server', 'w') as f:
    for po_number in match:
        if po_number in count:
            print(po_number, count[po_number])
        else:
            print(po_number, 0)
            f.write(po_number + '\n')

Try also

awk '
FNR == NR       {PAT = PAT "|" $1
                 next
                }
FNR == 1        {sub ("=\\|", "=(", PAT)
                 sub ("$", ")", PAT)
                }
match ($0, PAT) {TMP = substr ($0, RSTART+10, RLENGTH-10)
                 print TMP > "list3"
                 sub (TMP, "", PAT)
                 sub ("\\|\\|", "|", PAT)
                 sub ("\\(\\|", "(", PAT)
                 sub ("\\|\\)", ")", PAT)
                }

END             {gsub ("[()]", "", PAT)
                 for (n = split (PAT, T, "[=|]"); n>1; n--) print T[n] > "po_2_server"
                }

' PAT="PO_NUMBER=" po_numbers Test.log

In reply to
Scrutinizer
Post# 4

How do I get the other file, i.e. po_to_server? (It will have all the POs which were not found in Test.log.)


Hi, try this adaptation:

awk -v s='^PO_NUMBER=' '
  NR==FNR {
    A[$1]
    next
  }
  {
    for(i=1; i<=NF; i++)
      if (sub(s,x,$i))
        if ($i in A)
          C[$i]++
  } 

  END {
    for(i in A) {
      if (i in C)
        print i " --> " C[i]
      else
        print i > "po_to_server"
    }
  }
' po_numbers Test.log  > list3

Thank you, Scrutinizer :)
Much appreciated.

It worked for me, and the result is really fast (in seconds).

I wonder if I could uppercase the input data (po_numbers) before checking the log file, or the other way around. But I think tweaking the search to match both upper- and lowercase values in the log file would increase the search time.

So if the input $po itself comes as uppercase, then we'll be good in this case.

You're welcome :slight_smile:

You could try:

  NR==FNR {
    A[toupper($1)]
    next
  }

There is also tolower() in awk.

Certainly you can do the counting with array A alone (no need for a separate array C).
Of course you then need an if (A[i]) condition in the END section.
Is counting required at all?
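A sketch of that single-array variant (assuming the PO_NUMBER= format from earlier in the thread, and that the POs in the log are already uppercase): the POs are loaded as keys of A with an empty value, each hit increments A[po], and in END an empty (false) value means the PO was never seen.

```shell
awk -v s='^PO_NUMBER=' '
  NR==FNR { A[toupper($1)]; next }              # load POs, uppercased
  {
    for (i = 1; i <= NF; i++)
      if (sub(s, "", $i) && ($i in A))
        A[$i]++                                 # count directly in A
  }
  END {
    for (i in A)
      if (A[i]) print i " --> " A[i]            # found: print the count
      else      print i > "po_to_server"        # never seen: send to server
  }
' po_numbers Test.log > list3
```

If counting is not needed, A[$i] could simply be set to 1 on the first hit.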