Help with understanding this regex in a Perl script parsing a 'complex' string

Hi,

I need some guidance understanding the Perl script below. I am not the author of the script, and the author has not left any documentation. I suppose it is meant to be 'easy' if you're a Perl or regex guru. I am having trouble understanding what regex to use :confused: The script does warn about tweaking the regex to suit the ever-changing string :mad:

This is the script

[host01]$ cat x.pl
#!/usr/bin/perl
#
# ./logparse.pl <logfile> <service_name_to_search> | sort | uniq
#

$log = $ARGV[0];
$service_name = $ARGV[1];
$found = 0;
open LOG, $log || die "cannot open logfile $!";
while ($line = <LOG>){
        if ( $line =~ /\(SERVICE_NAME=$service_name\).*\(HOST=([\d.\w]+)\)\(USER=(\w+)\)/ ) {
                print $service_name . "\t" . $1 . "\t" . $2 . "\t" . $3 . "\n";
                $found = 1;
           }
        elsif ( $line =~ /\(USER=(\w+)\).*\(SERVICE_NAME=$service_name.*\).*\(HOST=([\d.\w]+)\)*/ ) {
                print $service_name . "\t" . $1 . "\t" . $2 . "\t" . $3 . "\n";
                $found = 1;
           }
        elsif ( $line =~ /\(CONNECT_DATA=\((\w+).*\(SERVICE_NAME=$service_name.*\).*\(HOST=([\d.\w]+)\)*/ ) {
                print $service_name . "\t" . $1 . "\t" . $2 . "\t" . $3 . "\n";
                $found = 1;
           }
        }
close LOG;

if ( $found == "0" ) {
   print "\n" ;
   print "There is no nothing found for " . $service_name . "\n" ;
   print "Maybe the regex needs changing " . "\n" ;
   print "The string format has been known to change " . "\n" ;
   print "\n" ;
}

Here are some sample files to run the script against.

#==> test1.log <==
#2018-07-23 13:19:38 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickey))(SERVER=DEDICATED)(SERVICE_NAME=work_app.com.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=12.123.11.123)(PORT=53102)) * establish * work_app.com.ph * 0
#2018-07-23 09:12:12 * (CONNECT_DATA=(CID=(PROGRAM=SQL Developer)(HOST=__jdbc__)(USER=minnie))(SERVICE_NAME=work_app.com.ph)(SERVER=dedicated)(INSTANCE_NAME=testp11)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.214.14.29)(PORT=53548)) * establish * work_app.com.ph * 0
#
#==> test2.log <==
#2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec01.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62625)) * establish * fail_app.com.ph * 0
#2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec02.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62627)) * establish * fail_app.com.ph * 0
#2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec03.exe)(HOST=MNLAPP01)(USER=!sysadmin01))(INSTANCE_NAME=xxxt23)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62626)) * establish * fail_app.com.ph * 0
#2019-05-12 04:17:11 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec01.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62629)) * establish * fail_app.com.ph * 0

Sample run of the script is as below:

[host01]$ ./x.pl test1.log work_app
work_app        mickey  12.123.11.123
work_app        minnie  10.214.14.29
[host01]$ ./x.pl test2.log fail_app
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123

Using awk and paste, this is what I am hoping to get with the Perl script

   awk '{ print $4 }' test2.log | awk -F"(" '{ print $6 }' | awk -F")" '{ print $1 }' > program.tmp.99
   awk '{ print $4 }' test2.log | awk -F"(" '{ print $7 }' | awk -F")" '{ print $1 }' > host.tmp.99
   awk '{ print $4 }' test2.log | awk -F"(" '{ print $8 }' | awk -F")" '{ print $1 }' > user.tmp.99
   awk '{ print $6 }' test2.log | awk -F"(" '{ print $4 }' | awk -F")" '{ print $1 }' > host_ip.tmp.99

   paste program.tmp.99 host.tmp.99 user.tmp.99 host_ip.tmp.99 | sort | uniq

[host01]$ paste program.tmp.99 host.tmp.99 user.tmp.99 host_ip.tmp.99 | sort | uniq
PROGRAM=C:\Windows\system32\exec01.exe  HOST=MNLAPP01   USER=!sysadmin01        HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec02.exe  HOST=MNLAPP01   USER=!sysadmin01        HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec03.exe  HOST=MNLAPP01   USER=!sysadmin01        HOST=10.11.11.123

May I please ask someone to kindly explain how the regex parses the string? I've been pulling out whatever is left of my hair all day and still can't figure out how it does what it is meant to do. At the moment, I use awk to write tmp files and paste to get what I want. It is not the best solution, I know, sorry.

For the first run of x.pl, the output looks all right, but I was hoping to get the PROGRAM value as well. I was hoping it would be in $1 :frowning:

[host01]$ ./x.pl test1.log work_app
work_app        mickey  12.123.11.123
work_app        minnie  10.214.14.29

For the second run of x.pl, I was hoping to get the output from using awk+paste.

[host01]$ ./x.pl test2.log fail_app
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123
fail_app        SERVER  10.11.11.123

I believe the answer to my problem lies in figuring out how the Perl regex dissects the string into several fields. I can see that the line below does the search/match for the search string, but how does it break the string down into several fields?

$line =~ /\(SERVICE_NAME=$service_name\).*\(HOST=([\d.\w]+)\)\(USER=(\w+)\)/

The layout of the connect string also changes depending on the program, so sometimes I need information that comes before SERVICE_NAME, sometimes information that comes after it, and sometimes both :frowning:

Some regex tutorial will be much appreciated :slight_smile:
Please advise. Thanks.

My perl is non-existent, so no help possible here. But - why not a simple awk solution, like

awk '
match ($4, "SERVICE_NAME=" SRV) {if (match ($4, /PROGRAM=[^)]*/)) P  = substr ($4, RSTART, RLENGTH)
                                 if (match ($4, /USER=[^)]*/))    U  = substr ($4, RSTART, RLENGTH)
                                 if (match ($4, /HOST=[^)]*/))    H  = substr ($4, RSTART, RLENGTH)
                                 if (match ($6, /HOST=[^)]*/))    IP = substr ($6, RSTART, RLENGTH)
                                 print P, H, U, IP
                                }
' SRV="fail_app" OFS="\t" file2
PROGRAM=C:\Windows\system32\exec01.exe    HOST=MNLAPP01    USER=!sysadmin01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec02.exe    HOST=MNLAPP01    USER=!sysadmin01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec03.exe    HOST=MNLAPP01    USER=!sysadmin01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec01.exe    HOST=MNLAPP01    USER=!sysadmin01    HOST=10.11.11.123

EDIT: or

awk '
function chop(FLD, STR)          {if (match ($FLD, STR "=[^)]*")) return substr ($FLD, RSTART, RLENGTH)
                                 }
match ($4, "SERVICE_NAME=" SRV)  {print chop(4, "PROGRAM"), chop(4, "USER"), chop(4,"HOST"), chop(6, "HOST")
                                 }
' SRV="fail_app" OFS="\t" file2
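A note on how the two scripts above pull the values out, for readers not familiar with awk: match() looks for the leftmost match of a pattern in a string and, on success, sets RSTART (1-based start position) and RLENGTH (length of the match), which substr() then uses to cut the text out. The pattern KEY=[^)]* means the key, an equals sign, then everything up to (but not including) the next closing parenthesis. Below is a minimal sketch of just that step, wrapped in a shell function. The name extract_key and the shortened sample line are only for illustration, they are not part of the scripts above.

```shell
# extract_key KEY: for each line on stdin, print the first "KEY=value" found.
# (illustrative helper name, not taken from the scripts above)
extract_key() {
    awk -v key="$1" '
        # match() returns 0 when nothing matches; otherwise it sets
        # RSTART (start position) and RLENGTH (length of the match).
        match($0, key "=[^)]*") {
            print substr($0, RSTART, RLENGTH)
        }'
}

# prints: PROGRAM=SQL Developer
printf '%s\n' '(CID=(PROGRAM=SQL Developer)(HOST=__jdbc__)(USER=minnie))' |
    extract_key PROGRAM
```

Since match() only reports the first hit in the string it is given, HOST has to be looked up once per field ($4 and $6 above) to tell the client host name apart from the IP address.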

Hi RudiC

I tried both of your suggestions and they both work fine with test2.log but not with test1.log. Is there any way to get it to work for both, or do I need to use different awk code for each?

$ head -100 test*log
==> test1.log <==
2018-07-23 13:19:38 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickey))(SERVER=DEDICATED)(SERVICE_NAME=work_app.com.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=12.123.11.123)(PORT=53102)) * establish * work_app.com.ph * 0
2018-07-23 09:12:12 * (CONNECT_DATA=(CID=(PROGRAM=SQL Developer)(HOST=__jdbc__)(USER=minnie))(SERVICE_NAME=work_app.com.ph)(SERVER=dedicated)(INSTANCE_NAME=testp11)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.214.14.29)(PORT=53548)) * establish * work_app.com.ph * 0

==> test2.log <==
2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec01.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62625)) * establish * fail_app.com.ph * 0
2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec02.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62627)) * establish * fail_app.com.ph * 0
2019-05-12 04:17:10 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec03.exe)(HOST=MNLAPP01)(USER=!sysadmin01))(INSTANCE_NAME=xxxt23)) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62626)) * establish * fail_app.com.ph * 0
2019-05-12 04:17:11 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=fail_app.com.ph)(CID=(PROGRAM=C:\Windows\system32\exec01.exe)(HOST=MNLAPP01)(USER=!sysadmin01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=10.11.11.123)(PORT=62629)) * establish * fail_app.com.ph * 0

These connect strings are from the Oracle DB listener logs, which contain several versions of these connection strings. So far, these are the only two formats that I've seen; hopefully there is not another one.

What I am currently doing is to grep and redirect all of them to a file, then break that down into two files based on (CONNECT_DATA=(CID= and (CONNECT_DATA=(SERVER=DEDICATED), then run those four (4) awk commands and paste for each set, and then combine the results :(. If I find another version of how the CONNECT_DATA looks, I suppose I will create another case for it. Not sure if there is any other way around it.

Would have been nice if Oracle themselves had provided their own parser :frowning:

OK, let's use the star ( * ) as the field separator, and don't forget to adapt the SRV variable:

awk '
function chop(FLD, STR)          {if (match ($FLD, STR "=[^)]*")) return substr ($FLD, RSTART, RLENGTH)
                                 }
match ($0, "SERVICE_NAME=" SRV)  {print chop(2, "PROGRAM"), chop(2, "USER"), chop(2,"HOST"), chop(3, "HOST")
                                 }
' SRV="work_app" FS="\*" OFS="\t" file1
PROGRAM=JDBC Thin Client    USER=mickey    HOST=__jdbc__    HOST=12.123.11.123
PROGRAM=SQL Developer    USER=minnie    HOST=__jdbc__    HOST=10.214.14.29
PROGRAM=JDBC Thin Client    USER=mickey    HOST=__jdbc__    HOST=12.123.11.123
PROGRAM=SQL Developer    USER=minnie    HOST=__jdbc__    HOST=10.214.14.29
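
To see what splitting on the star does to a line, here is a small sketch that prints every field with its number. Note that it uses -F' [*] ' (space, star, space) instead of FS="\*"; that is a variation on the command above, chosen here because it also drops the blanks around each field. The helper name split_fields and the shortened sample line are made up for the illustration.

```shell
# split_fields: print every " * "-separated field of each stdin line, numbered.
# (illustrative helper, not from the thread)
split_fields() {
    awk -F' [*] ' '{ for (i = 1; i <= NF; i++) print i ": " $i }'
}

# Field 2 is the CONNECT_DATA part and field 3 the ADDRESS part.
printf '%s\n' '2018-07-23 09:12:12 * (CONNECT_DATA=(SERVICE_NAME=x)) * (ADDRESS=(HOST=10.214.14.29)) * establish * x * 0' |
    split_fields
```

This is why the calls above use chop(2, ...) for PROGRAM, USER and the client HOST, and chop(3, "HOST") for the IP address: the position of a key inside CONNECT_DATA no longer matters.
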
awk '
function chop(FLD, STR)          {if (match ($FLD, STR "=[^)]*")) return substr ($FLD, RSTART, RLENGTH)
                                 }
match ($0, "SERVICE_NAME=" SRV)  {print chop(2, "PROGRAM"), chop(2, "USER"), chop(2,"HOST"), chop(3, "HOST")
                                 }
' SRV="work_app" FS="\*" OFS="\t" file2
awk '
function chop(FLD, STR)          {if (match ($FLD, STR "=[^)]*")) return substr ($FLD, RSTART, RLENGTH)
                                 }
match ($0, "SERVICE_NAME=" SRV)  {print chop(2, "PROGRAM"), chop(2, "USER"), chop(2,"HOST"), chop(3, "HOST")
                                 }
' SRV="fail_app" FS="\*" OFS="\t" file2
PROGRAM=C:\Windows\system32\exec01.exe    USER=!sysadmin01    HOST=MNLAPP01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec02.exe    USER=!sysadmin01    HOST=MNLAPP01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec03.exe    USER=!sysadmin01    HOST=MNLAPP01    HOST=10.11.11.123
PROGRAM=C:\Windows\system32\exec01.exe    USER=!sysadmin01    HOST=MNLAPP01    HOST=10.11.11.123