Regex learning.

RavinderSingh13 · February 12, 2020, 7:50am

Hello All,

A colleague asked me about a complex regex, so I wrote one using grep's -P option (PCRE). Since this was new learning for me, I thought I would share it with the forum.

Let's say we have an Input_file with the following test data:

cat Input_file
PROJECT = 1.1.1.1
Project = 1.1.1.1.1.1.1.1
PROJECT = "1.1.1.1.1"
ProJEct = '1.1'

The conditions are: the keyword project is fixed but may be in any case, and the version on the right-hand side is what we need as output. In the version, every component except the first (major) one may also contain letters.

So I have come up with:

grep -ioP 'project\D+\K(\d+\.([\d,a-z,A-Z]+\.){1,}[\d,a-z,A-Z]+|\d+\.[\d,a-z,A-Z]+|\d+)' Input_file

Explanation of the above code:

-i : ignore case, which lets grep match the string project however it is capitalised.
-o : print only the matched part of each line, not the whole line.
-P : enable PCRE (Perl-compatible regular expressions) in grep, which provides a much richer set of regex features.

Now coming to the main part of the pattern:

project\D+ : matches the string project (in any case) followed by one or more non-digit characters (\D matches any non-digit).
\K : discards everything matched so far from the reported match. This is a GREAT feature of PCRE and I LOVED it.
\d+\.([\d,a-z,A-Z]+\.){1,}[\d,a-z,A-Z]+|\d+\.[\d,a-z,A-Z]+|\d+ : matches, longest alternative first, digits followed by two or more dot-separated components of digits or letters, OR digits followed by one such component, OR digits alone, to cover all kinds of cases.

Since only what follows \K is kept as the match, -o prints only the version part of each line.
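
To try this out, here is a minimal sketch, assuming GNU grep built with PCRE support (-P). It recreates the sample data in a temporary file and runs the original pattern next to a hypothetical shorter form, \d+(\.[\da-z]+)*, which is not from the post but should select the same versions on this sample. Note that inside a character class the commas are literal, so [\d,a-z,A-Z] also accepts a comma; the shorter form leaves them out.

```shell
# Recreate the sample data from the post in a temporary file.
input=$(mktemp)
cat > "$input" <<'EOF'
PROJECT = 1.1.1.1
Project = 1.1.1.1.1.1.1.1
PROJECT = "1.1.1.1.1"
ProJEct = '1.1'
EOF

# Original pattern from the post: three alternatives, longest first.
original=$(grep -ioP 'project\D+\K(\d+\.([\d,a-z,A-Z]+\.){1,}[\d,a-z,A-Z]+|\d+\.[\d,a-z,A-Z]+|\d+)' "$input")

# Hypothetical shorter form: a major number, then zero or more
# ".component" parts; -i already makes a-z case-insensitive.
compact=$(grep -ioP 'project\D+\K\d+(\.[\da-z]+)*' "$input")

printf '%s\n' "$original"
[ "$original" = "$compact" ] && echo "both patterns agree on this sample"

rm -f "$input"
```

On the sample above this should print the four versions (1.1.1.1, 1.1.1.1.1.1.1.1, 1.1.1.1.1 and 1.1), one per line.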

I am still learning PCRE regex, so any suggestions and improvements are very welcome.
Cheers and happy learning.

Thanks,
R. Singh

nezabudka · February 12, 2020, 8:27am

Hi
It's a bit redundant: the -i option applies to the whole pattern, so A-Z is not needed.

grep -ioP 'project\D+\K(\d+\.([\d,a-z]+\.){1,}[\d,a-z]+|\d+\.[\d,a-z]+|\d+)'

If you want to limit case-insensitivity to one part of the pattern, this is better:

grep -oP '(?i:project)'

RavinderSingh13 · February 12, 2020, 8:55am

Hello Nez,

Cool, thanks for sharing your views. IMHO, the reason I added those checks is so that a version in some other format does not produce a false positive in the output.

Thanks,
R. Singh

nezabudka · February 12, 2020, 9:21am

Here is the opposite of \K:

grep -ioP 'project\D+(?=\d+\.([\d,a-z]+\.){1,}[\d,a-z]+|\d+\.[\d,a-z]+|\d+)'

Maybe it means "forget everything that matches after this point"?

--- Post updated at 18:18 ---

To be exact, it means "forget THIS match" (a positive lookahead).
Sorry, I got carried away.
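
The contrast between \K and a positive lookahead can be sketched on a single line, assuming GNU grep with -P. The version sub-pattern is shortened here to \d+(\.\d+)* for readability; that simplification is an assumption for illustration, not the pattern from the posts.

```shell
line='PROJECT = 1.1.1.1'

# \K drops everything matched before it, so -o prints only the version.
kept_after=$(printf '%s\n' "$line" | grep -ioP 'project\D+\K\d+(\.\d+)*')

# (?=...) only checks that a version follows without consuming it,
# so -o prints only the prefix in front of the version.
kept_before=$(printf '%s\n' "$line" | grep -ioP 'project\D+(?=\d+(\.\d+)*)')

# Brackets make the trailing space of the prefix visible.
printf '[%s]\n[%s]\n' "$kept_after" "$kept_before"
```

The first bracket should hold 1.1.1.1 and the second the prefix "PROJECT = " including its trailing space.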

acascianelli · February 28, 2020, 9:29pm

Not sure if this would be useful for you, but I found this tool a while back and it comes in handy when having to deal with regular expressions:

Expresso Regular Expression Tool

RavinderSingh13 · March 3, 2020, 12:09am

Hello All,

I learnt an example of a lazy match in Perl regex, so I thought I would share it here.
Let's say the following is Input_file:

cat Input_file
abcdtest123^ DUMMYtestabcd12234 DUMMY bla blabla12231311313blabla bla.....,,,,,bla
test132131 ^ DUMMY blabla1213 121313_ 131y7351eg1eub wdfwfknfidh28e7ty;;;

Now we would like to match the text from the first occurrence of ^ up to DUMMY and replace it. We can use a lazy match as follows:

perl -pe 's|(\^.*?DUMMY\s+)(.*)| new_text_here.... $2|' Input_file

The output for the above sample will be:

abcdtest123 new_text_here.... bla blabla12231311313blabla bla.....,,,,,bla
test132131  new_text_here.... blabla1213 121313_ 131y7351eg1eub wdfwfknfidh28e7ty;;;

Why is a lazy match good here? Because .* is GREEDY and matches as much as possible, up to the last occurrence of whatever follows it, whereas the lazy .*?DUMMY\s+ stops at the very first DUMMY followed by whitespace after the ^.

I wrote and tested this in Perl; thoughts, views and improvements are most welcome.

Thanks,
R. Singh

nezabudka · March 3, 2020, 12:38am

My answer was off the point.

nezabudka · March 3, 2020, 2:26am

I was probably a little wrong. The non-greedy expression on the left side means the engine does not have to scan the entire string; it stops at the first match. So the non-greedy expression can be faster here, and that's good.
Remark:
I have now realized that you wrote exactly that.

nezabudka · March 3, 2020, 3:14pm

Also, PCRE has the flag (?U:), which makes quantifiers non-greedy by default. I have put a space after the first DUMMY so that the difference can be seen:

cat file
abcdtest123^ DUMMY testabcd12234 DUMMY bla blabla12231311313blabla bla.....,,,,,bla

grep -oP '(?P<p>.*)(\^.*DUMMY\s+)\K\g<p>' file
bla blabla12231311313blabla bla.....,,,,,bla

grep -oP '(?P<p>.*)(?U:\^.*DUMMY\s+)\K\g<p>' file
testabcd12234 DUMMY bla blabla12231311313blabla bla.....,,,,,bla

Thanks for an interesting example.