Performance assessment of using single or combined pattern matching

Hi,

I want to know which pattern matching technique will give better performance and quicker results.

I will have the patterns in a file and want to read those patterns and search through a whole file of, say, 70 MB. Should I initially build a pattern-matching string while reading through the pattern file, combining the patterns into an OR condition in a string variable, and then use it in awk to search the 70 MB file, like this:

nawk -F"," '{ if ((substr($17,1,10)==1234567890 || substr($17,1,10)==2345678901 || substr($17,1,10)==3456789012 || substr($17,1,10)==4567890123)  && (substr($3,1,6)=="ABCDEF" || substr($3,1,6)=="GHIJKL" || substr($3,1,6)=="MNOPQR")) print substr($3,1,6)","$4","$5","$6","$8","$100","$101","$102",4"$103","$104","$109}' /text16.txt 

or should I read the patterns one by one and search the whole file once for each pattern, like:

 
While read line
Do
... (same nawk with a single pattern in the OR portion; the patterns after && will be the same and fixed)
Done<file 

Which approach will be faster? Kindly also give sample string-formation techniques for the case where more than one entry is available in the pattern file.

I wish to use string concatenation to form only the OR portion of the above code; the AND portion will be fixed. The file can contain one pattern or multiple patterns.
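
What I have in mind for building the OR string is something like this (an untested sketch; "patternfile" stands for my pattern file):

pat=""
while read line
do
    [ -n "$line" ] && pat="${pat:+$pat|}$line"
done < patternfile
echo "$pat"    # e.g. 1234567890|2345678901|3456789012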

The first approach should be faster.
I think it could be replaced by something like this (not tested), which should save some operations:

nawk  '
  BEGIN {
    FS=OFS=","
    # anchored alternations, each tested once per input line
    p1="^(1234567890|2345678901|3456789012|4567890123)$"
    p2="^(ABCDEF|GHIJKL|MNOPQR)$"
  }

  {
    # extract both keys once instead of once per comparison
    f1=substr($17,1,10)
    f2=substr($3,1,6)
  }

  f1~p1 && f2~p2 {
    print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
  }
' /text16.txt
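
The ^ and $ anchors make each regex test equivalent to the original chain of == comparisons, and computing f1 and f2 once per line avoids repeating substr() for every alternative. A quick way to check how the anchored alternation behaves (sample values only, not tested against your data):

echo "2345678901"  | awk '/^(1234567890|2345678901|3456789012|4567890123)$/ {print "match"}'
echo "23456789012" | awk '/^(1234567890|2345678901|3456789012|4567890123)$/ {print "match"}'   # no output: the anchors reject the extra digit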

What Scrutinizer proposed is definitely faster than what you have in your post, but it has the patterns as string constants. You'll need to build those from the file, but how will you tell where to use the "or" operator and where the "and"? Please post a sample of your pattern file.

You've caught my requirements exactly.
The AND-condition variable p2 will be constant; I need to build only the OR-condition variable p1.
Say the pattern file contains:

 
1234567890
2345678901
3456789012
4567890123
5678901234

The file can contain either an odd or an even number of entries (I mean the wc -l of the file).

Do you really have empty lines in your pattern file? Anyhow, try this adaptation of Scrutinizer's proposal:

awk  '
BEGIN           {FS=OFS=","
                 p2="^(ABCDEF|GHIJKL|MNOPQR)$"
                }

NR == FNR       {if (NF)        {TMP = TMP DL $0        # append non-empty pattern lines
                                 DL = "|"               # delimiter from the 2nd pattern on
                                }
                 next
                }
FNR == 1        {p1 = "^(" TMP ")$"                     # build the anchored OR regex once
                }
                {f1=substr($17,1,10)
                 f2=substr($3,1,6)
                }

f1~p1 && f2~p2  {print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109
                }
' patternfile file
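
NR == FNR is only true while the first named file (patternfile) is being read; the FNR == 1 rule then fires on the first line of the data file, once TMP is complete. If you want to see what p1 ends up as, you can run just the building part on its own:

awk 'NF {TMP = TMP DL $0; DL = "|"} END {print "^(" TMP ")$"}' patternfile

which, for the sample above, prints ^(1234567890|2345678901|3456789012|4567890123|5678901234)$.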

Thank you. It's working fine. If you don't mind: how can I read a second pattern file, form a variable p3 (just like p2), and match it against an f3 in the same script?

Why don't you give it a go yourself and post it here so we can discuss your approach?
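
For later reference, an untested sketch of how a second pattern file could be folded in (the field and length used for f3 are placeholders; adjust them to your data):

awk '
BEGIN     {FS=OFS=","
           p2="^(ABCDEF|GHIJKL|MNOPQR)$"
          }
FNR == 1  {fno++}                                  # which input file are we in?
fno == 1  {if (NF) {T1 = T1 D1 $0; D1 = "|"}; next}
fno == 2  {if (NF) {T3 = T3 D3 $0; D3 = "|"}; next}
FNR == 1  {p1 = "^(" T1 ")$"                       # first line of the data file:
           p3 = "^(" T3 ")$"                       # both pattern lists are complete
          }
          {f1 = substr($17,1,10)
           f2 = substr($3,1,6)
           f3 = substr($5,1,4)                     # placeholder field/length
          }
f1~p1 && f2~p2 && f3~p3 {print f2,$4,$5,$6,$8,$100,$101,$102,"4"$103,$104,$109}
' patternfile1 patternfile2 /text16.txt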

It is long-standing knowledge that such an approach (even once the syntax errors are corrected: it is NOT Do...Done but do...done, because the shell language is case-sensitive) will always be way slower than using awk (or sed or any other text filter) on the whole file.

The reason is: whenever you call an external program (external to the shell, that is) from the shell you start a new (sub-)process. Starting a process is a resource-consuming activity for the system: it has to load an executable into memory, allocate the resources (memory, etc.) necessary to run it and finally start it. This:

command

is exactly one such process, while this:

while read line ; do
     command
done < /some/file

will create such a new process for every line in the input file. When you say the file is 70 MB big, I suppose that is a lot of lines.

Of course, the opening of a single process is no big deal. It will add up, though, and "no big deal" repeated several thousand times eventually amounts to quite a big deal.
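
A rough way to feel that cost yourself (a sketch; it assumes bash or ksh so that time accepts a compound command, and that seq is available; /bin/true just stands in for any external program):

seq 1 5000 > /tmp/lines

# 5000 fork/exec cycles, one external process per line
time while read l ; do /bin/true ; done < /tmp/lines

# a single process reading the same lines
time awk '{}' /tmp/lines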

I hope this helps.

bakunin