Using forward slash in search pattern in perl script

ambarginni · June 14, 2016, 12:21pm

I have an existing pattern in a Perl script:

my $pattern = "^Line.*?:|^Errors*: [^0]|^SEVERE:.*?:|^Null pointer exception occurred";

and I want to add the keywords below to my search pattern

 "I/O exception" and "FileNotFoundException"

The problem arises when I extend my pattern like this:

my $pattern = "^Line.*?:|^Errors*: [^0]|^SEVERE:.*?:|^Null pointer exception occurred|I/O exception|FileNotFoundException";

I am not sure whether the forward slash in "I/O exception" will still be valid, as the forward slash is a special character.

Similarly, I have this existing code:

@fails = grep /fail$/, ( grep !/^\./, readdir LOGDIR );

Here too I want to include I/O exception and FileNotFoundException.

Including them like this:

@fails = grep /I/O exception\|FileNotFoundException\|fail$/, ( grep !/^\./, readdir LOGDIR );

is the right way?

I really do not understand regexes and so am having trouble; please help.

bakunin · June 14, 2016, 1:37pm

It is relatively easy: the forward slash has a special meaning when it serves as the delimiter of a regular expression written as /.../. Whenever you want to use a character with a special meaning literally you need to "escape" it. Escaping is done by prepending it with a backslash:

my $pattern = "^Line.*?:|^Errors*: [^0]|[...]|I\/O exception";

Notice that "|" separates the different patterns to search for and acts like a logical OR. The expression

"ab|cd|ef|gh"

searches for any occurrence of "ab" or "cd" or "ef" or "gh". Therefore, to add a pattern, just append a "|" followed by the pattern you intend to search for.
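
A minimal Perl sketch of this alternation, in case it helps to see it run. The sample strings are made up for illustration and do not come from any real log.

```perl
use strict;
use warnings;

# Hypothetical sample strings.
my @samples = ('xxabxx', 'efgh', 'acbd', 'hg');

# "|" is alternation: a string is kept if it contains ab OR cd OR ef OR gh.
my @hits = grep { /ab|cd|ef|gh/ } @samples;

print "$_\n" for @hits;
```

Only the first two strings contain one of the four alternatives; "acbd" and "hg" hold the right letters but never in the required order.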

I hope this helps.

bakunin

ambarginni · June 15, 2016, 3:02am

I have appended the keywords as you suggested.

Further I wanted to understand the code better.. can you please help me..
in the below code

what does that mean....

@fails = grep /FileNotFoundException\|I\/O exception\|fail$/, ( grep !/^\./, readdir LOGDIR );

bakunin · June 15, 2016, 5:59am

ambarginni:

Further I wanted to understand the code better.. can you please help me..
in the below code what does that mean....
@fails = grep /FileNotFoundException\|I\/O exception\|fail$/, ( grep !/^\./, readdir LOGDIR );

To be honest, i can only guess, because i habitually avoid perl like the plague. It seems to me that: !/^\./ is the regular expression, where "!" means: reverse the search, give every line NOT matched by the following expression.

The following regular expression is /^\./ : /.../ is only the (traditional) delimiter (like the double quotes for strings: ".."), "^" at the beginning means "beginning of line" and "\." is an escaped (we had that already) literal full stop.

So the whole expression means "every line NOT starting with a full stop as first character".
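
The reading above can be checked with a small Perl sketch. The names are hypothetical stand-ins for what readdir might return, and the !~ operator is used here as the explicit spelling of a negated match.

```perl
use strict;
use warnings;

# Hypothetical names, similar to what readdir might return.
my @names = ('.', '..', '.hidden', 'build.log');

# $name !~ /^\./ is the negated match: true when the name does NOT
# begin with a literal dot. The dot inside "build.log" is irrelevant
# because "^" anchors the test to the first character.
my %kept;
for my $name (@names) {
    $kept{$name} = ($name !~ /^\./) ? 1 : 0;
    printf "%-10s %s\n", $name, $kept{$name} ? 'kept' : 'dropped';
}
```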

To understand regexes better, here is a little introduction into the concept:

As i already said, regular expressions are delimited by "/", so we will write them like that here, regardless of whether the tool we test them with needs this delimiter or not. The simplest form of regular expression is to search for fixed strings:

/aBcde/

searches for the string "aBcde". Note that it searches for "aBcde" BUT NOT for "abcde"! Every character you put into the search string represents itself and itself only!

Now, this is of rather limited use, and to make regexes more flexible there are so-called metacharacters. Metacharacters do not match themselves but modify the way other characters are matched. Suppose we would like to search for "aBcde" or "abcde" and do not want to care about capitalisation of the "b". This can be done with a "character class" (also called a bracket expression):

/a[Bb]cde/

The [Bb] means: either "B" or "b" - but not both! The input string aBbcde would NOT be matched! Inside the brackets you can specify ranges of characters using the dash:

/a[a-z]cde/         # "a", any single lowercase letter, "cde"
/a[A-Za-z]cde/      # "a", any single uppercase or lowercase letter, "cde"
/a[a-z0-9]cde/      # "a", any single lowercase letter or digit, "cde"
/x[0-9][0-9][0-9]x/ # a three-digit number surrounded by "x"

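It is also possible to "invert" such a bracket expression by using a caret "^" as the first character inside the brackets:

/a[^0-9]cde/      # an "a" followed by any single character except a digit followed by "cde"

Note that inside the brackets metacharacters such as "." and "*" LOSE their special meaning. Only "^" as the first character, "-" between two characters and the closing "]" are treated specially there (in Perl the backslash also keeps its escaping role).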
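
The bracket expressions above behave the same way in Perl. The following sketch is only an illustration; the input strings and the @cases table are invented for the purpose.

```perl
use strict;
use warnings;

# Each entry: [ regex, input string ]. All inputs are made-up examples.
my @cases = (
    [ qr/a[Bb]cde/,   'aBcde'  ],  # matches: B is in the class
    [ qr/a[Bb]cde/,   'aBbcde' ],  # no match: a class stands for exactly one character
    [ qr/a[a-z]cde/,  'aXcde'  ],  # no match: uppercase X is outside a-z
    [ qr/a[^0-9]cde/, 'a7cde'  ],  # no match: the negated class rejects digits
    [ qr/a[^0-9]cde/, 'a-cde'  ],  # matches: "-" is not a digit
    [ qr/a[.]cde/,    'aXcde'  ],  # no match: inside brackets "." is a literal dot
);

# 1 for a match, 0 otherwise, in the same order as @cases.
my @results = map { $_->[1] =~ $_->[0] ? 1 : 0 } @cases;

for my $i (0 .. $#cases) {
    printf "%-8s %s\n", $cases[$i][1], $results[$i] ? 'match' : 'no match';
}
```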

The next metacharacter is similar to these bracket expressions but even more general: it is the full stop ".". It matches any single character:

/ab.de/    # any 5-character string starting with "ab" and ending with "de"

Metacharacters revert to their literal meaning only if they are preceded by a backslash - even the backslash itself:

/\..\./    # a dot followed by any single character followed by a dot
/\\..\./   # a backslash followed by two characters followed by a dot

In the second example the first backslash escapes the second one, so the second one counts as a simple backslash without the escaping ability. Therefore the previously escaped dot becomes a metacharacter again.
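
A short Perl sketch of the two escaping examples, using invented test strings. Single-quoted Perl strings are assumed here so that '\\' in the test data is exactly one backslash.

```perl
use strict;
use warnings;

# Made-up inputs. In a single-quoted string, '\\' is one backslash.
my $dots      = '.x.';     # dot, x, dot
my $backslash = '\\ab.';   # backslash, a, b, dot

my $r1 = $dots      =~ /\..\./  ? 1 : 0;  # literal dot, any character, literal dot
my $r2 = 'axb'      =~ /\..\./  ? 1 : 0;  # the input has no literal dots
my $r3 = $backslash =~ /\\..\./ ? 1 : 0;  # backslash, two characters, literal dot
my $r4 = $dots      =~ /\\..\./ ? 1 : 0;  # the input has no backslash

print "$r1 $r2 $r3 $r4\n";
```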

Now there are not only metacharacters that match certain characters (or groups thereof) but also ones that modify other expressions. The first you need is the asterisk "*". It means zero or more of the expression before it.

/ab*c/  # a followed by any number of b's (even none) followed by c

Notice that the last example matches "abc" and "abbbc" but also "ac"! If you want to match at least one "b", so that "ac" is not matched at all, you need to double the "b":

/abb*c/  # matches "abc" and "abbc", "abbbc", etc. but not "ac"

There is also another construct to count the number of expressions to match: "\{m,n\}" where "m" and "n" are numbers. This works similarly to the asterisk, but limits the number of allowed occurrences to be between m and n. (This is the BRE spelling; EREs and Perl write {m,n} without the backslashes.)

/ab\{1,3\}c/  # matches "abc" and "abbc" and "abbbc", but not "ac" or "abbbbc", etc.

You may have noticed that i talked about "expressions" rather than "characters" in the last part. In the simplest form an "expression" is a single character or metacharacter:

/@.\{2,3\}@/   # any two or three characters, surrounded by ats

But a bracket expression is an "expression" too:

/![0-5]\{2,3\}!/   # any two or three digits 0-5, surrounded by exclamation marks

And characters or other expressions can be further grouped by parentheses (written with backslashes in BREs):

/\([0-9]\{3\}\.\)\{2\}/   # two groups of each 3 digits followed by a literal dot

This would match "123.456." or "913.756.", etc.
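
Since the original question is about Perl, here is a sketch of the same quantifier and grouping examples in Perl notation, where the interval is {m,n} and the group is (...) with no backslashes. The test strings are invented for illustration.

```perl
use strict;
use warnings;

# 1 for a match, 0 otherwise. All inputs are made-up examples.
my @results = (
    'ac'       =~ /ab*c/             ? 1 : 0,  # zero b's are allowed
    'ac'       =~ /abb*c/            ? 1 : 0,  # at least one b is required
    'abbbc'    =~ /ab{1,3}c/         ? 1 : 0,  # three b's are within 1..3
    'abbbbc'   =~ /ab{1,3}c/         ? 1 : 0,  # four b's are too many
    '123.456.' =~ /([0-9]{3}\.){2}/  ? 1 : 0,  # two groups of 3 digits and a dot
    '123.45.'  =~ /([0-9]{3}\.){2}/  ? 1 : 0,  # second group has only 2 digits
);

print "@results\n";
```

In Perl a backslashed brace or parenthesis is a literal character, so the BRE spellings above would not count or group there.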

Greediness

A cause of endless misunderstandings is the extent of a possible match. Consider the following input:

abcdXfdgdkjXsfdsdX2387X

Now suppose we have the following regular expression. Which part of the above string would it match?

/a.*X/   # an "a" followed by any number of any characters followed by "X"

Possible answers (the matched part is shown in square brackets):
[abcdX]fdgdkjXsfdsdX2387X
[abcdXfdgdkjX]sfdsdX2387X
[abcdXfdgdkjXsfdsdX]2387X
[abcdXfdgdkjXsfdsdX2387X]

The right answer, in all Unix-like regex engines, is the last one: because "any character" includes the X, the longest possible match is used. This is called the "greediness" of regular expressions. A match of the longest possible string is called "greedy"; a match of the shortest possible string is called "non-greedy". Unix regexes are usually greedy.

That raises the question of how we would get the non-greedy match. This is usually done with inverted character classes. The first answer above can be built this way:

/a[^X]*X/   # an "a" followed by any number of non-X followed by "X"
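
A minimal Perl sketch of the greedy and non-greedy matches on the input from above. The third variant uses Perl's non-greedy quantifier "*?", which is not part of BREs but already appears in the pattern from the first post (".*?:").

```perl
use strict;
use warnings;

my $input = 'abcdXfdgdkjXsfdsdX2387X';

# Capture what each pattern actually matches.
my ($greedy)  = $input =~ /(a.*X)/;     # ".*" runs to the last X
my ($negated) = $input =~ /(a[^X]*X)/;  # "[^X]*" cannot step over an X
my ($lazy)    = $input =~ /(a.*?X)/;    # ".*?" stops at the first X

print "greedy:  $greedy\n";
print "negated: $negated\n";
print "lazy:    $lazy\n";
```

The greedy pattern captures the whole input, while the other two both capture "abcdX".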

Now this is only a short introduction. If you want to know more about regexes you might want to read Dale Dougherty's fantastic book "sed & awk", published by O'Reilly.

Further pointers: understand the (slight) differences between "extended regular expressions" (EREs) and "basic regular expressions" (BREs); btw., i have introduced BREs here. There are UNIX EREs and UNIX BREs and also GNU EREs and GNU BREs (used in the Linux counterparts of Unix utilities like sed and awk). There are also Perl REs, which use a slightly different regexp engine again. They are all quite similar, though, and the basic workings are always the same, so if you know one you know about 90% of all the others too.

I hope this helps.

bakunin

MadeInGermany · June 15, 2016, 12:51pm

The slash is not a special character in an RE, but it is a delimiter for an RE in Perl.
I am not a Perl expert, but it looks like you need to escape a literal / within the delimiting slashes

grep /I\/O exception|FileNotFoundException/, ...;

but not in a variable

my $pattern = "I/O exception|FileNotFoundException";
grep /$pattern/, ...;

Because grep takes any Perl expression, one can use the m operator, which allows other RE delimiters, for example the #

grep m#I/O exception|FileNotFoundException#, ...;
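
The three variants can be compared side by side in a small sketch. The log lines are hypothetical and the array names are chosen only for this illustration.

```perl
use strict;
use warnings;

# Hypothetical log lines.
my @lines = (
    'java.io.IOException: I/O exception while reading',
    'Caused by: FileNotFoundException',
    'all tests passed',
);

# 1. Default delimiters: a literal slash must be escaped.
my @escaped = grep /I\/O exception|FileNotFoundException/, @lines;

# 2. Pattern kept in a string: the slash is ordinary data, no escaping needed.
my $pattern  = "I/O exception|FileNotFoundException";
my @from_var = grep /$pattern/, @lines;

# 3. m with "#" as delimiter: the slash is an ordinary character again.
my @alt_delim = grep m#I/O exception|FileNotFoundException#, @lines;

print scalar(@escaped), " ", scalar(@from_var), " ", scalar(@alt_delim), "\n";
```

All three greps select the same two lines.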

Aia · June 15, 2016, 7:27pm

ambarginni:

I have appended the keywords as you suggested.

Further I wanted to understand the code better.. can you please help me..
in the below code what does that mean....

grep !/^\./
@fails = grep /FileNotFoundException\|I\/O exception\|fail$/, ( grep !/^\./, readdir LOGDIR );

Actually, grep !/^\./ should not be read the way it was described above. grep in Perl is not like the grep family of utilities in the Unix world. While it often uses a regular expression as part of the block or expression, that is the only similarity.
The proper way of thinking of grep is this:

grep BLOCK LIST
grep EXPRESSION,LIST

Notice the comma.

The first will be seen as:

my @list_result = grep {!/^\./} @given_list;

Inside that block {} you can put a lot of normal code; in this case it is a regular expression match whose result is negated. If the block evaluates to true, the content of $_ (which contains an element of the list @given_list) is appended to @list_result.

The second will be seen as:

my @list_result = grep !/^\./, @given_list;

or

my @list_result = grep (!/^\./, @given_list);

The difference between BLOCK and EXPRESSION is that the latter only accepts a single expression instead of a whole block of code. For this purpose, they are the same.
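
To make the comparison concrete, here is a sketch that runs all three spellings on the same made-up list. The last line is only an assumption about how a block with more than one statement might look.

```perl
use strict;
use warnings;

# Hypothetical list standing in for @given_list.
my @given_list = ('.', '..', '.profile', 'run1.fail', 'notes.txt');

my @by_block = grep { !/^\./ } @given_list;  # grep BLOCK LIST
my @by_expr  = grep !/^\./, @given_list;     # grep EXPRESSION, LIST
my @by_paren = grep(!/^\./, @given_list);    # same, with parentheses

# A block may hold several statements; only its last value decides.
my @fails = grep { my $name = $_; $name !~ /^\./ && $name =~ /fail$/ } @given_list;

print "@by_block\n";
print "@fails\n";
```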

Going back to this portion:

@fails = grep /FileNotFoundException\|I\/O exception\|fail$/, ( grep !/^\./, readdir LOGDIR );

First, you need to remove the backslashes in front of the | characters; otherwise you are making each | a literal pipe character rather than the regex alternation, which means OR.
The slash in I\/O needs to be escaped only because you are using the default delimiters //: when the code gets parsed, the first unescaped / after the opening one ends the pattern, so without the backslash the parser would read up to I/, think it is done, and not know what to do with the rest.
If the default delimiter is used, the m is optional, but it is not optional if another character is used as delimiter.
Can you now see what that line of code is? Can you tell if it is a grep BLOCK LIST or a grep EXPRESSION,LIST ?
It is a nested set of grep EXPRESSION,LIST.
The first one is grep !/^\./, readdir LOGDIR, which returns a LIST of the filenames in LOGDIR that do not start with a period. That list is itself the LIST argument passed to the second grep /FileNotFoundException|I\/O exception|fail$/, (...), where it is represented by the three dots.
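
The nesting can be unrolled into two steps, as in this sketch. A hypothetical list of names stands in for readdir LOGDIR, so nothing here touches a real directory.

```perl
use strict;
use warnings;

# Hypothetical directory entries in place of readdir LOGDIR.
my @entries = ('.', '..', '.cache', 'job1.fail', 'job2.ok',
               'FileNotFoundException.txt');

# Step 1: the inner grep drops names that start with a period.
my @visible = grep !/^\./, @entries;

# Step 2: the outer grep keeps names matching any of the alternatives.
my @fails = grep /FileNotFoundException|I\/O exception|fail$/, @visible;

# The same thing written as one nested expression.
my @nested = grep /FileNotFoundException|I\/O exception|fail$/,
             ( grep !/^\./, @entries );

# With \| the regex looks for literal pipe characters and finds nothing here.
my @literal = grep /FileNotFoundException\|I\/O exception\|fail$/, @visible;

print "fails:   @fails\n";
print "literal: ", scalar(@literal), " matches\n";
```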

As a note, the last grep is not reading the content of the files and searching for the strings FileNotFoundException or I/O exception or fail$ . It applies that regex to the filenames; if a filename matches, that element gets appended to @fails.
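
If the goal is to find files whose content contains those strings, the files have to be opened and read. The following is only a sketch under that assumption: files_with_match is a hypothetical helper, not something from the original script, and the directory is taken from the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of the plain files in $dir whose
# CONTENT matches $pattern, skipping names that start with a period.
sub files_with_match {
    my ($dir, $pattern) = @_;

    opendir(my $dh, $dir) or die "Cannot open directory $dir: $!";
    my @names = sort grep { !/^\./ && -f "$dir/$_" } readdir $dh;
    closedir $dh;

    my @hits;
    for my $name (@names) {
        open(my $fh, '<', "$dir/$name") or die "Cannot read $dir/$name: $!";
        while (my $line = <$fh>) {
            # One matching line is enough; move on to the next file.
            if ($line =~ /$pattern/) {
                push @hits, $name;
                last;
            }
        }
        close $fh;
    }
    return @hits;
}

# Usage: pass the log directory as the first command-line argument.
if (@ARGV) {
    print "$_\n" for files_with_match($ARGV[0],
        'I/O exception|FileNotFoundException');
}
```

Because the pattern is passed as a string, the slash in "I/O exception" needs no escaping here.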