Need to reduce the execution time

We are trying to execute the script below to find the number of occurrences of particular words in a log file. Need suggestions to optimize the script.

Test.log size - approx. 500 to 600 MB

$wc -l Test.log

16609852 Test.log

po_numbers - 11k to 12k POs to search

$more po_numbers

xxx1335
AB1085
SSS6205
UY3347
OP9111
....and so on 

Current Execution Time - 2.45 hrs

while IFS= read -r po
do
  check=$(grep -c "PO_NUMBER=$po" Test.log)
  echo "$po --> $check" >> list3

  if [ "$check" = "0" ]
  then
    echo "$po" >> po_to_server
  #else break
  fi
done < po_numbers

Welcome KumarPiyush7225,

I have a few questions to pose in response first:

  • What OS, shell and version are you using?
  • What logical process have you considered? (to help steer us to follow what you are trying to achieve)
  • What are the rules for causing an alert? Is it simply the record being mentioned 3 or more times?
  • What is the format of the Test.log file?

Looking at the code, for every record in po_numbers you are reading the full Test.log file. That is the target - reduce the number of times you read a huge file.
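As a rough illustration of the single-pass idea (a sketch, assuming GNU grep and that the log contains literal PO_NUMBER=&lt;po&gt; tokens): prepend the fixed prefix to each PO once, then scan Test.log one time with grep reading all patterns from a file.

```shell
# Build the list of fixed search strings once, then scan Test.log a single time.
# -F = fixed strings (no regex), -o = print each match on its own line,
# -f = read the patterns from a file.
sed 's/^/PO_NUMBER=/' po_numbers > patterns
grep -oFf patterns Test.log | sort | uniq -c
```

This prints one line per PO actually found, with its occurrence count; POs absent from the output were not found in the log.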

There are probably many ways to achieve most tasks, so giving us an idea of your style and thoughts will help us guide you to an answer most suitable to you so you can adjust it to suit your needs in future.

Kind regards,
Robin

Thanks for your response.

  • Linux simtosp81 2.6.18-274.el5 #1 SMP Fri Jul 8 17:36:59 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
  • If a PO in the po_numbers file does not exist in Test.log, then put that PO in the po_to_server file.
  • If the pattern PO_NUMBER=$po matches, it's OK; if not, print that PO to the po_to_server file.
  • Test.log is a human-readable text file; the pattern PO_NUMBER=$po can appear at any position/row of the log file.

Something like this?

awk -v s='^PO_NUMBER=' '
  NR==FNR {
    A[$1]
    next
  }
  {
    for(i=1; i<=NF; i++)
      if (sub(s,x,$i))
        if ($i in A)
          C[$i]++
  } 

  END {
    for(i in C) print i " --> " C[i]
  }
' po_numbers Test.log  > list3

or with some shell-fu:

grep -oFf <(sed 's/^/PO_NUMBER=/' po_numbers) Test.log |
sort |
uniq -c |
sed 's/^ *\([0-9]*\) PO_NUMBER=\(.*\)/\2 --> \1/' > list3

(Note: plain sort, not sort -u, since sort -u would collapse the duplicates before uniq -c can count them.)

Another approach in python:-

# Load the PO numbers to look for.
match = set(line.strip() for line in open('po_numbers'))

count = {}

# Stream the log line by line instead of reading 600 MB at once;
# tokens look like "PO_NUMBER=AB1085", so strip the prefix before the lookup.
with open('Test.log') as log:
    for line in log:
        for token in line.split():
            if token.startswith('PO_NUMBER='):
                po_num = token[len('PO_NUMBER='):]
                if po_num in match:
                    count[po_num] = count.get(po_num, 0) + 1

with open('po_to_server', 'w') as f:
    for po_number in match:
        if po_number in count:
            print(po_number, count[po_number])
        else:
            print(po_number, 0)
            f.write(po_number + '\n')

Try also

awk '
FNR == NR       {PAT = PAT "|" $1
                 next
                }
FNR == 1        {sub ("=\\|", "=(", PAT)
                 sub ("$", ")", PAT)
                }
match ($0, PAT) {TMP = substr ($0, RSTART+10, RLENGTH-10)
                 print TMP > "list3"
                 sub (TMP, "", PAT)
                 sub ("\\|\\|", "|", PAT)
                 sub ("\\(\\|", "(", PAT)
                 sub ("\\|\\)", ")", PAT)
                }

END             {gsub ("[()]", "", PAT)
                 for (n = split (PAT, T, "[=|]"); n>1; n--) print T[n] > "po_2_server"
                }

' PAT="PO_NUMBER=" po_numbers Test.log

In reply to
Scrutinizer
Post# 4

How do I get the other file, i.e. po_to_server? (It will have all the POs which were not found in Test.log.)


Hi, try this adaptation:

awk -v s='^PO_NUMBER=' '
  NR==FNR {
    A[$1]
    next
  }
  {
    for(i=1; i<=NF; i++)
      if (sub(s,x,$i))
        if ($i in A)
          C[$i]++
  } 

  END {
    for(i in A) {
      if (i in C)
        print i " --> " C[i]
      else
        print i > "po_to_server"
    }
  }
' po_numbers Test.log  > list3

Thank you, Scrutinizer :)
Much appreciated.

It worked for me, and the result is really fast (in seconds).

I wonder if I could uppercase the input data (po_numbers) before checking the log file, or the other way around. But I think tweaking the search to match both upper- and lowercase values in the log file would increase the search time.

So if the input $po itself comes as uppercase, then we'll be good in this case.

You're welcome :slight_smile:

You could try:

  NR==FNR {
    A[toupper($1)]
    next
  }

There is also tolower() in awk.

Certainly you can do the counting with array A alone (no need for a separate array C).
Of course you then need an if (A[i]) condition in the END section.
Is counting required at all?
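A sketch of that single-array variant (assuming the PO_NUMBER= format from earlier in the thread, and that the POs in the log are already uppercase): the POs are loaded as keys of A with an empty value, each hit increments A[po], and in END an empty (false) value means the PO was never seen.

```shell
awk -v s='^PO_NUMBER=' '
  NR==FNR { A[toupper($1)]; next }              # load POs, uppercased
  {
    for (i = 1; i <= NF; i++)
      if (sub(s, "", $i) && ($i in A))
        A[$i]++                                 # count directly in A
  }
  END {
    for (i in A)
      if (A[i]) print i " --> " A[i]            # found: print the count
      else      print i > "po_to_server"        # never seen: send to server
  }
' po_numbers Test.log > list3
```

If counting is not needed, A[$i] could simply be set to 1 on the first hit.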