Help 'speeding' up this 'parsing' script - taking 24+ hours to run

newbie_01 · March 28, 2018, 2:43pm

Hi,

I've written a ksh script that reads a file and parses/filters/formats each line. The script works as expected, but it runs for 24+ hours on a file that has 2 million lines. Sometimes the input file has 10 million lines, which means it can run for more than 2 days and still not finish. And of course, the SAs have been chasing me up, as it shows in top as running like forever.

I need some advice on whether, instead of reading one line at a time, I can run an awk one-liner. I wish I could code it in Perl but I'm not sure how to. Most say it is faster in Perl, but I'm not sure how to use the Perl equivalents of the UNIX commands besides calling system.

Anyway, hopefully I can interest someone into looking into this.

Below is the excerpt / part of the script that is taking the most time:

for LOG in *search_string_found.out
#for LOG in *xyz
do
   server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
   server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
   echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"

#while IFS="*" read TS CS HOST RESULT SERVICE RETURNCODE
oIFS=$IFS
while read line
do
   IFS="*"
   echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
   timestamp=`echo $TS | awk '{ print $2 }'`
   year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
   day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
   month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`

   case $month in
      "JAN" ) mm="01" ;;
      "FEB" ) mm="02" ;;
      "MAR" ) mm="03" ;;
      "APR" ) mm="04" ;;
      "MAY" ) mm="05" ;;
      "JUN" ) mm="06" ;;
      "JUL" ) mm="07" ;;
      "AUG" ) mm="08" ;;
      "SEP" ) mm="09" ;;
      "OCT" ) mm="10" ;;
      "NOV" ) mm="11" ;;
      "DEC" ) mm="12" ;;
   esac
   TS2="$year-$mm-$day $timestamp"

   program=`echo $CS | awk -F"(" '{ print $4 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
   user=`echo $CS | awk -F"(" '{ print $6 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
   service_name=`echo $CS | awk -F"(" '{ print $8 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`

   app_protocol=`echo $HOST | awk -F"(" '{ print $3 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
   app_host=`echo $HOST | awk -F"(" '{ print $4 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`
   app_port=`echo $HOST | awk -F"(" '{ print $5 }' | awk -F"=" '{ print $2 }' | awk -F")" '{ print $1}'`

   #echo "- line = $line"
   #echo "- timestamp = $TS"
   #echo "  TS2 = $TS2"
   #echo "- connectstring = $CS"
   #echo "  program = $program"
   #echo "  user = $user"
   #echo "  service_name = $service_name"
   #echo "- host = $HOST"
   #echo "  app_protocol = $app_protocol"
   #echo "  app_host = $app_host"
   #echo "  app_port = $app_port"
   #echo "- result = $RESULT"
   #echo "- service = $SERVICE"
   #echo "- returncode = $RETURNCODE"
   #echo "-------------------------------------------------------------"
   #echo

   RETURNCODE=`echo $RETURNCODE | sed "s/ *//g"`
   detail="$TS2^${server_db}^${server_app} = ${app_host}^$program^$user^${service_name}^$RETURNCODE^$line"
   #echo "${detail}" | tee -a ${f_report}
   echo "${detail}" >> ${f_report}
   IFS=$oIFS
done <  $LOG

Below are example entries from the input file that the script reads. It can be at least 2 million lines and go up to as many as 10 million lines. I've changed the entries as they are customer data.

04-MAR-2018 03:19:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60791)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:19:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60795)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:21:07 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=14582)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:22:25 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=15176)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:24:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60881)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:24:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60885)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:29:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60965)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:29:01 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=chrome)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60969)) * establish * test_app.abcde.xx.yy * 12514
04-MAR-2018 03:29:02 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)(CID=(PROGRAM=xyzimain)(HOST=mnl0ia9b5.abcde.xx.yy)(USER=mickeyp0))(INSTANCE_NAME=test3)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60973)) * establish * test_app.abcde.xx.yy * 0
04-MAR-2018 03:57:10 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=mickeyp0))(SERVER=DEDICATED)(SERVICE_NAME=test_app.abcde.xx.yy)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.7)(PORT=24394)) * establish * test_app.abcde.xx.yy * 12514

What I am wanting to do really in simplest term is as below:

Change the date format to YYYY-MM-DD. The main reason is that this date format is the most convenient for sorting.
Filter some information from each line, i.e host name, IP, program name, service name, return code etc.

I then re-direct these formatted line/record to a file that I can check group by return code value or simply do a sort | uniq -c so it displays and show a count of occurrence.

Any advice much appreciated. Thanks in advance.

Corona688 · March 28, 2018, 3:33pm

Running awk once and only once would be so much faster than running awk 180,000,000 times, it'd be done in under a minute, maybe even single digit seconds.

Perl is not faster. If you wrote this code the same way in Perl it'd be just as slow or slower.

Unfortunately, the program you've given doesn't seem to work, so I can't tell what output you want. Could you post the output you want?

Don_Cragun · March 28, 2018, 3:35pm

You have shown us an input file and you have shown us a script that invokes awk and sed at least 30 times for every line read from your file. It is no wonder that running this script is burning up CPU cycles to the detriment of anyone else trying to use the same system you're using.

Please describe in English exactly what output you're trying to produce and show us the exact output you hope to produce from your sample input. Saying that you want to filter the host name for each line doesn't really describe what you're trying to do especially since many of your sample input lines contain more than one (HOST=value) string.

Please also tell us what operating system you're using. (Different operating systems have different utilities and different options available for some utilities.)

newbie_01 · April 3, 2018, 1:50am

Hi,

Sorry Corona688 and Don Cragun, I should have thought about how very so difficult and unfair of me not to post in an example output. :o

You are right that it is indeed a lot, lot, lot faster if it reads the whole file at once instead of line by line. I kicked off the script on a 10-million-line file over the weekend. I didn't get an Easter miracle of any sort; it is still running at this time.

You can ignore or ideally forget the so horrible codes that I posted :o. Maybe I can explain what I've been trying to do as below.

So, here is an example raw input file, un-filtered

24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42145)) * establish * testapp1_app.somewhere.out.ph * 0
24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42149)) * establish * testapp1_app.somewhere.out.ph * 12514
24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42153)) * establish * testapp1_app.somewhere.out.ph * 0
24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42157)) * establish * testapp1_app.somewhere.out.ph * 12514
24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42161)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 10:04:38 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)(CID=(PROGRAM=sqlplus)(HOST=xxx00001.somewhere.out.ph)(USER=ogre01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.101)(PORT=12358)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11662)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11666)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11672)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11674)) * establish * testapp1_app.somewhere.out.ph * 0
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11680)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11682)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11686)) * establish * testapp1_app.somewhere.out.ph * 0
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11690)) * establish * testapp1_app.somewhere.out.ph * 12520
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11696)) * establish * testapp1_app.somewhere.out.ph * 12514

There can be millions of these lines and at the moment, the script reads one line at a time and generates formatted output like below.

2018-03-12 10:04:38  runserver01        = 66.65.60.101                testapp1_app.somewhere.out.ph       sqlplus         ogre01                    12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12520
2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514

I then use sort | uniq -c to do some sort of a count and comes up with below:

      1 2018-03-12 10:04:38  runserver01        = 66.65.60.101                testapp1_app.somewhere.out.ph       sqlplus         ogre01                    12514
      2 2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
      6 2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514
      1 2018-03-12 16:23:09  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12520
      2 2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 0
      3 2018-03-24 07:59:52  runserver01        = 66.65.60.7                  JDBC Thin Client                    ogre01          testapp1_app.somewhere.out.ph 12514

All fields of the output file are from the input file with the exception of the second field that is showing up as runserver01. This is from running hostname. It doesn't have to be on the second field. it can be anywhere or can come in later on after all the filtering, it is just basically a way for me to figure out where I run the script from.

Most of the lines are of the following format:

12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11662)) * establish * testapp1_app.somewhere.out.ph * 12514

Sometimes, it can be like below:

12-MAR-2018 10:04:38 *  (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)(CID=(PROGRAM=sqlplus)(HOST=xxx00001.somewhere.out.ph)(USER=ogre01)))  * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.101)(PORT=12358)) * establish *  testapp1_app.somewhere.out.ph * 12514

I don't know how to make awk differentiate between the two formats and filter/get the right information. Note that the information are in different order for these two strings.

And yes, running the whole file thru awk is faster instead of having to read one line at a time but I don't know how to get awk to do what I wanted so it comes up with the output format that I wanted.

I am looking at maybe do one run of awk changing the date format first and then the next awk is to filter out the CONNECT_DATA string into different parts.

But I can't figure out what to do, so for the first pass, I need to change

24-MAR-2018 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42145)) * establish * testapp1_app.somewhere.out.ph * 0
12-MAR-2018 10:04:38 *  (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)(CID=(PROGRAM=sqlplus)(HOST=xxx00001.somewhere.out.ph)(USER=ogre01)))  * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.101)(PORT=12358)) * establish *  testapp1_app.somewhere.out.ph * 12514

to

2018-03-12 10:04:38 * (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)(CID=(PROGRAM=sqlplus)(HOST=xxx00001.somewhere.out.ph)(USER=ogre01))) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.101)(PORT=12358)) * establish * testapp1_app.somewhere.out.ph * 12514
2018-03-24 07:59:52 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin  Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph))  * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=42145)) * establish *  testapp1_app.somewhere.out.ph * 0

How do I tell awk -F"*" to print $1 and the rest of the field with $1 to be further change to a YYYY-MM-DD format. The real reason behind formatting it to YYYY-MM-DD is because that works best for when doing the sort.

And then the next pass is supposed to filter it to be like

2018-03-12 10:04:38  runserver01        = 66.65.60.101                testapp1_app.somewhere.out.ph       sqlplus         ogre01                    12514
2018-03-24 07:59:52  runserver01        = 66.65.60.7                   JDBC Thin Client                    ogre01           testapp1_app.somewhere.out.ph 0

Or ideally be like

2018-03-12 16:23:09  runserver01        = 66.65.60.101                sqlplus                             ogre01          testapp1_app.somewhere.out.ph 12514
2018-03-24 07:59:52  runserver01        = 66.65.60.7                    JDBC Thin Client                    ogre01            testapp1_app.somewhere.out.ph 0

Please advise on how best to do what I am wanting to do. Apologies for not giving enough information earlier.

P.S:
That ksh script that I run processing a file that has 9890943 lines, it is still running, ps -o etime= -p 3036 says it has been running for 5-14:38:03, time to CTRL-C it :o

RudiC · April 3, 2018, 4:17am

Not sure why the service name comes in field $4 sometimes, shoving other fields right, and in field $6 other times...
How far do you get with

awk -F\* '
BEGIN   {for (n=split("JAN*FEB*MAR*APR*MAY*JUN*JUL*AUG*SEP*OCT*NOV*DEC", T); n; n--) MTH[T[n]] = n
         "hostname" | getline HN
        }

function GETSTR(SRC, STR)       {match (SRC, STR "[^)]*")
                                 LN = length(STR) - gsub (/\(/, "&", STR)
                                 return substr (SRC, RSTART+LN, RLENGTH-LN)
                                }

        {gsub (/ *\* */, "*")
         split ($1, T, "[- ]")
         if (T[2] in MTH) $1 = sprintf ("%s-%02d-%s %s", T[3], MTH[T[2]], T[1], T[4])
         PG = GETSTR($2, "CID=\\(PROGRAM=")
         US = GETSTR($2, "USER=")
         SN = GETSTR($2, "SERVICE_NAME=")
         IP = GETSTR($3, "HOST=")
         print $1, HN, "= " IP, PG, US, SN, $NF
        }
' OFS="\t" file
2018-03-24 07:59:52    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-24 07:59:52    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 10:04:38    RudisPC    = 66.65.60.101    sqlplus    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
.
.
.

bakunin · April 3, 2018, 4:26am

The following is probably not the answer you hoped for - i will tell you what you did do wrongly, why it was wrong and how you could do it better. You will still have to implement what i tell you yourself. Also, i will keep my explanation very short and introductory. You will need to research many of the pointers i will give you on your own to explore the full capabilities of the things i will explain to you.

If you want to show us the fruit of your efforts once you reimplemented the script and seek further advice - you will be welcome.

This is a good start. Whenever you write code always take the time to estimate how long it will run, depending on the amount of input you expect. You don't need exact calculations, just a rough estimation for some expected orders of magnitude will suffice. There is a whole mathematical theory about this (see "Landau symbols" or "Big O notation"), but we won't need it. A glimpse of it will suffice.

Look at the following code:

while read LINE ; do
     program -abc "$LINE" >> firstresult
     program -def "$LINE" >> secondresult
done < /some/input

How long will this run? Well, obviously that depends on how long "program" will run, yes? But even without knowing that we can already say that for every line of input we will have to run "program" twice. Now we can examine the input and if it contains, say, 1 million lines, we know that "program" will be called 2 million times. If we estimate that "program" needs 1 millisecond for a single run, the script will take 0.001 s x 2,000,000 = 2,000 s, or about 33 minutes. Add to that some overhead for reading the input file, writing the output files, loading "program" two million times into memory and starting it, etc., and we probably end up at around 1 hour of runtime.
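As a rough sketch, this kind of estimate can be done with plain shell arithmetic. The helper name estimate_seconds is made up for illustration, and the 1 ms per call is the same guess as above, not a measurement. The second call plugs in the 2 million lines and the roughly 30 awk/sed calls per line mentioned earlier in this thread.

```shell
# estimate_seconds LINES CALLS_PER_LINE MS_PER_CALL
# Hypothetical helper: total runtime in whole seconds, ignoring all overhead.
estimate_seconds() {
    echo $(( $1 * $2 * $3 / 1000 ))
}

estimate_seconds 1000000 2 1     # the example above: prints 2000
estimate_seconds 2000000 30 1    # 2 million lines, ~30 calls each: prints 60000
```

60000 seconds is between 16 and 17 hours, which is at least the same order of magnitude as the 24+ hours reported for the original script.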

Especially for large inputs it makes sense to test the finished program (script) with a short input and measure the time it takes. For this there is the time command. For instance you can take your script, save it under the name of myscript and then execute it with a test input of, say, 1000 lines, like this:

time ./myscript <maybe necessary options/arguments here>

You will get an output like the following:

time ./myscript -some options

real    0m0,41s
user    0m0,03s
sys     0m0,08s

If you are interested you may want to explore performance tuning and measuring but for a start we are only concerned with the "real" line of the output. This is how long your program has run overall. Now, that you have an estimation how long it has taken to process thousand lines it is easy to extrapolate how long it takes to process a million or ten million.
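The extrapolation can be sketched like this. extrapolate_seconds is a hypothetical helper, the file names in the comments are placeholders, and the result assumes that the runtime grows linearly with the number of lines.

```shell
# Cut a test input from the real file and time the script on it, e.g.:
#   head -n 1000 big.log > sample.log      (placeholder file names)
#   time ./myscript ...

# extrapolate_seconds SAMPLE_MS SAMPLE_LINES TOTAL_LINES
# Scales a measured "real" time linearly to the full input size.
extrapolate_seconds() {
    awk -v ms="$1" -v n="$2" -v total="$3" \
        'BEGIN { printf "%.0f\n", ms * total / n / 1000 }'
}

# 0.41 s (410 ms) measured for 1000 lines, scaled to 10 million lines
extrapolate_seconds 410 1000 10000000    # prints 4100
```

awk is used here only because the shell's own arithmetic is integer-only; it runs once, not once per line.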

The next thing i want to talk about is probably more of what you expected: how to make code faster. First, here is a part of your code which i have trimmed down a bit. Let us use our new tool to estimate the runtime:

for LOG in *search_string_found.out
#for LOG in *xyz
do
   server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
   server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
   echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"

#while IFS="*" read TS CS HOST RESULT SERVICE RETURNCODE
oIFS=$IFS
while read line
do
   IFS="*"
   echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
   timestamp=`echo $TS | awk '{ print $2 }'`
   year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
   day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
   month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
done
done

You immediately see why it pays to indent properly: as posted, you can't see at a glance how many levels of nesting you have here. Therefore, let us first reindent your code:

for LOG in *search_string_found.out ; do
     server_db=`echo $LOG | awk -F"_" '{ print $1 }'`
     server_app=`echo $LOG | awk -F"_" '{ print $2 }'`
     echo "- [ `date` ] // `wc -l $LOG | awk '{ print $1 }'` lines ==> Processing $LOG // ${server_db} from ${server_app}"

     oIFS=$IFS
     while read line ; do
          IFS="*"
          echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
          timestamp=`echo $TS | awk '{ print $2 }'`
          year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
          day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
          month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
     done < $LOG
done

Now we see immediately that the inner while-loop is executed completely every time the outer for-loop does one pass. If we estimate the for-loop to find 10 files and each file has 100 lines, the while-loop as a whole will be executed 10 times and every line within the while-loop will be executed 1000 times.

Most lines within the while-loop look like this:

variable=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print something }'`

What does the shell do to process this code? First, the shell creates an extra process, in which the echo program is started. Some output stream is generated running echo $TS . Next, the awk program is loaded and executed by starting a child process and running awk '{ print $1 }' inside it. To this process the output generated by the echo is fed as input. The awk program generates some output of its own and a third sub-process is created and started, into which another instance of the awk program is loaded. The output of the first awk program is now fed as input to the second awk program, which itself generates some output based on that input. This output is caught and put into the variable.

Sounds complicated? Yes - because it is! Calling an external program is one of the most "expensive" (in terms of needed system resources and time) system calls there are! Fast shell scripts differ mostly in this regard from slow ones: how well they avoid calling external programs.

That begs the question: if we don't filter the part we need from the rest of the output with awk , what should we use instead? Luckily, the inventors of the shell asked themselves this question and they invented: variable expansion (also called "parameter expansion").

I won't explain it completely here, but only give a short introduction: suppose we have a variable holding a date, like this (notice that i assume the ISO 8601 date format: YYYY-MM-DD):

var="2018-03-31"

Now, we want to split that into a year, month and day part.

There is a device which will cut off a part of a variable's content based on some pattern:

${variable#pattern}     # cut off from the front, shortest match
${variable%pattern}     # cut off from the rear, shortest match

${variable##pattern}    # cut off from the front, longest match
${variable%%pattern}    # cut off from the rear, longest match

In our case the pattern we look for is "-", because this separates the days, months and the year. You can also use wildcards, like "*" (any number of any characters) and "?" (any single character), just like in filenames, when you do a ls -l *.txt .

Now let us try (i absolutely suggest that you play around with this - create your own variable contents and try different patterns and what comes out):

$ mydate="2018-03-31"
$ echo "${mydate#*-}"
03-31
$ echo "${mydate##*-}"
31
$ echo "${mydate%-*}"
2018-03
$ echo "${mydate%%-*}"
2018

Notice, that the content of the variable is not changed at all - just the part which is displayed is changed! If you want to save the result you will need to assign another (or the same) variable with it:

$ mydate="2018-03-31"
$ myday="${mydate##*-}"
$ myyear="${mydate%%-*}"
$ echo "YEAR: $myyear   DAY: $myday"

Notice that i have left out the month here. we need a two-step approach to filter that out:

$ mydate="2018-03-31"
$ echo "${mydate#*-}"
03-31
$ mymonth="${mydate#*-}"
$ echo "${mymonth%-*}"
03
$ mymonth="${mymonth%-*}"

Now we have a complete solution:

$ mydate="2018-03-31"
$ myday="${mydate##*-}"
$ myyear="${mydate%%-*}"
$ mymonth="${mydate#*-}"
$ mymonth="${mymonth%-*}"
$ echo "YEAR: $myyear   MONTH: $mymonth DAY: $myday"
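The same trimming also works for the DD-MON-YYYY dates in the log from this thread. The following is only a sketch: to_iso is a made-up function name, and it assumes the month always arrives as an upper-case three-letter English abbreviation, as in the sample lines. It starts no external program at all.

```shell
# to_iso DD-MON-YYYY  ->  prints YYYY-MM-DD
# Uses only parameter expansion and case. Note that POSIX sh has no local
# variables, so d, m and y are globals here.
to_iso() {
    d=${1%%-*}          # cut "-MON-YYYY" off the rear  -> day
    y=${1##*-}          # cut "DD-MON-" off the front   -> year
    m=${1#*-}           # cut "DD-" off the front       -> MON-YYYY
    m=${m%-*}           # cut "-YYYY" off the rear      -> MON
    case $m in
        JAN) m=01 ;; FEB) m=02 ;; MAR) m=03 ;; APR) m=04 ;;
        MAY) m=05 ;; JUN) m=06 ;; JUL) m=07 ;; AUG) m=08 ;;
        SEP) m=09 ;; OCT) m=10 ;; NOV) m=11 ;; DEC) m=12 ;;
        *)   return 1 ;;    # unknown month name: print nothing
    esac
    printf '%s-%s-%s\n' "$y" "$m" "$d"
}

to_iso 04-MAR-2018      # prints 2018-03-04
```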

You probably may ask right now how much this is influencing the runtime. You are right to ask, but seeing is believing, as they say. Prepare a log file with 1000 lines and run these two scripts, each with the "time" command, i showed you above:

while read line ; do
     echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE
     timestamp=`echo $TS | awk '{ print $2 }'`
     year=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $3 }'`
     day=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $1 }'`
     month=`echo $TS | awk '{ print $1 }' | awk -F"-" '{ print $2 }'`
     echo "SCRIPT1: YEAR: $year   MONTH: $month  DAY: $day"
done < /your/file

while read TS junk ; do
     year="${TS##*-}"
     day="${TS%%-*}"
     month="${TS#*-}"
     month="${month%%-*}"
     echo "SCRIPT2: YEAR: $year   MONTH: $month  DAY: $day"
done < /your/file

And see what comes out.

I have used another device above to further speed up things: the shell has the ability to split input into fields. This is usually done along delimiters of whitespace. Consider the following command:

command -abc file1 file2

Somehow we expect the shell to interpret file1 as the name of one file and file2 as the name of another. We do NOT expect the shell to confuse this for a file called -abc file1 or file1 file2 or so. This is because of this innate splitting ability and the fact that the strings file1 and file2 are surrounded by whitespace.

We can use this ability to our advantage when we read input too. You do it already when you do:

echo "$line" | read TS CS HOST RESULT SERVICE RETURNCODE

The content of the variable "line" is split along the characters in IFS (whitespace by default; your script sets IFS to "*" first) and the first part goes into a variable named TS, the second part into a variable named CS and so on. (On a passing note: "HOST" is a bad name for a variable because it is often a - fixed - value with the name of the system you are running on. Use something else.)

But instead of doing:

while read line ; do
     echo $line | read var1 var2 var3 ...
done

You can do immediately:

while read var1 var2 var3 ... ; do
     ....
done

This is what i have done above. Notice that you may still need the line as a whole and it might make sense to retain it like you did - i just didn't need it for this part, so i left it out. You should just be aware of what is possible.
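For the log format in this thread the fields are separated by "*" rather than by whitespace. A minimal sketch, assuming a POSIX-style shell: setting IFS only for the read command splits on "*" without touching the global IFS, so there is nothing to save and restore. split_fields and the shortened sample line are made up for illustration.

```shell
# Shortened, made-up line in the same layout as the sample input.
sample='04-MAR-2018 03:19:01 * (CONNECT_DATA=(SERVICE_NAME=test_app)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.66.90.101)(PORT=60791)) * establish * test_app * 0'

# Reads log lines from stdin and prints "timestamp|returncode" for each.
split_fields() {
    # IFS='*' is in effect for this read only; -r keeps backslashes literal.
    while IFS='*' read -r ts cs addr result service rc; do
        # The blanks around each "*" stay in the fields; trim one per side.
        ts=${ts% }
        rc=${rc# }
        printf '%s|%s\n' "$ts" "$rc"
    done
}

printf '%s\n' "$sample" | split_fields    # prints 04-MAR-2018 03:19:01|0
```

The other variables (cs, addr, result, service) are filled in the same way and can be trimmed with the same expansions when they are needed.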

There are some further rules for this splitting: if you have fewer variables than fields, everything left over will be put into the last variable:

$ echo one two three four five | read var1 var2 var3
$ echo $var1
one
$ echo $var2
two
$ echo $var3
three four five
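One caveat if you try this outside ksh: the "echo ... | read" form relies on ksh running the last part of a pipeline in the current shell. Some other shells (bash with its default settings, for example) run that read in a subshell, and the variables are empty afterwards. A sketch of the same experiment with a here-document, which avoids the pipeline altogether:

```shell
# read takes its input from the here-document, so no pipeline is involved
read var1 var2 var3 <<'EOF'
one two three four five
EOF

echo "$var1"
echo "$var2"
echo "$var3"     # the last variable receives everything that is left over
```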

So, if you need only the, say, second part of a list of values:

while read junk VAR junk ; do
     echo $VAR
done < /your/input

If you have more variables than available fields, the surplus variables at the end will simply be empty.
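Both rules can be tried on made-up input. In this sketch "second" is a name invented here to keep the value after the loop, and the two input lines are arbitrary.

```shell
# "junk" is assigned twice: first field, then everything after the second field
while read junk VAR junk ; do
     second="$VAR"
     echo "$VAR"
done <<'EOF'
alpha beta gamma delta
EOF

# four variables but only two fields: c and d stay empty
read a b c d <<'EOF'
one two
EOF
echo "a=[$a] b=[$b] c=[$c] d=[$d]"
```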

Now, i suggest you first play around with what i told you and explore the possibilities. Only then try to reimplement your script in light of what i told you.

I hope this helps.

bakunin

newbie_01 · May 5, 2018, 2:53pm

rudic:

Not sure why the service name comes in field $4 sometimes, shoving other fields right, and in field $6 other times...
How far do you get with

awk -F\* '
BEGIN   {for (n=split("JAN*FEB*MAR*APR*MAY*JUN*JUL*AUG*SEP*OCT*NOV*DEC", T); n; n--) MTH[T[n]] = n
   "hostname" | getline HN
   }

function GETSTR(SRC, STR)       {match (SRC, STR "[^)]*")
   LN = length(STR) - gsub (/\(/, "&", STR)
   return substr (SRC, RSTART+LN, RLENGTH-LN)
   }

   {gsub (/ *\* */, "*")
   split ($1, T, "[- ]")
   if (T[2] in MTH) $1 = sprintf ("%s-%02d-%s %s", T[3], MTH[T[2]], T[1], T[4])
   PG = GETSTR($2, "CID=\(PROGRAM=")
   US = GETSTR($2, "USER=")
   SN = GETSTR($2, "SERVICE_NAME=")
   IP = GETSTR($3, "HOST=")
   print $1, HN, "= " IP, PG, US, SN, $NF
   }
' OFS="\t" file
2018-03-24 07:59:52    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-24 07:59:52    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 10:04:38    RudisPC    = 66.65.60.101    sqlplus    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
2018-03-12 16:23:09    RudisPC    = 66.65.60.7    JDBC Thin Client    ogre01    testapp1_app.somewhere.out.ph    12514
.
.
.

Yeah, I hate that fact too, that the service name moves from field to field. Looking at the lines, it has to do with whether the request is a JDBC connection or not. I'll give the awk bit a try. Thanks a lot.

---------- Post updated at 01:53 PM ---------- Previous update was at 01:40 PM ----------

Sorry, I've been sick for a while. Thanks a lot for all your advice. I will try all of the suggestions with a cut-down version of the file. I will have a real long read and work out how to implement your suggestions. Wish me luck. Thanks again everyone.

newbie_01 · July 24, 2018, 5:33am

Hi Rudic

Sorry it has taken me a while to test. It is giving some errors as below:

This is what I ran:

$ ./x.ksh
awk: cmd. line:14: warning: escape sequence `\(' treated as plain `('
awk: cmd. line:6: (FILENAME=x.txt FNR=1) fatal: Unmatched ( or \(: /CID=(PROGRAM=[^)]*/

Here's the script with the awk code:

$ cat x.ksh
#!/bin/ksh

awk -F\* '
BEGIN   {for (n=split("JAN*FEB*MAR*APR*MAY*JUN*JUL*AUG*SEP*OCT*NOV*DEC", T); n; n--) MTH[T[n]] = n
         "hostname" | getline HN
        }

function GETSTR(SRC, STR)       {match (SRC, STR "[^)]*")
                                 LN = length(STR) - gsub (/\(/, "&", STR)
                                 return substr (SRC, RSTART+LN, RLENGTH-LN)
                                }

        {gsub (/ *\* */, "*")
         split ($1, T, "[- ]")
         if (T[2] in MTH) $1 = sprintf ("%s-%02d-%s %s", T[3], MTH[T[2]], T[1], T[4])
         PG = GETSTR($2, "CID=\(PROGRAM=")
         US = GETSTR($2, "USER=")
         SN = GETSTR($2, "SERVICE_NAME=")
         IP = GETSTR($3, "HOST=")
         print $1, HN, "= " IP, PG, US, SN, $NF
        }
' OFS="\t" x.txt

Below is the input file to awk:

$ cat x.txt
12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11662)) * establish * testapp1_app.somewhere.out.ph * 12514
12-MAR-2018 10:04:38 *  (CONNECT_DATA=(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)(CID=(PROGRAM=sqlplus)(HOST=xxx00001.somewhere.out.ph)(USER=ogre01)))  * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.101)(PORT=12358)) * establish *  testapp1_app.somewhere.out.ph * 12514

And here's my awk version. I tried with gawk and it gives the same error. I have no nawk.

$ awk --version
GNU Awk 3.1.7
Copyright (C) 1989, 1991-2009 Free Software Foundation.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 3 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.

You should have received a copy of the GNU General Public License
along with this program. If not, see http://www.gnu.org/licenses/.

RudiC · July 24, 2018, 8:30am

I'm afraid you'll have to experiment a bit on your own. Try writing the escaped opening parenthesis in line 14 with two backslashes (\\() instead of one, or enclosing it in a pair of square brackets ([(]), so it doesn't offend the regex parser.

MadeInGermany · July 24, 2018, 10:03am

This happens with GNU awk. Other awk versions behave differently.
A bracket expression (character set) makes a literal ( immune to these differences:

  PG = GETSTR($2, "CID=[(]PROGRAM=")
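One thing to watch when testing this: GETSTR works out the prefix length as length(STR) minus the number of opening parentheses, which fits one extra character per parenthesis (the backslash). As far as I can tell, "[(]" adds two extra characters, so the result may start one character late unless that arithmetic is adjusted. A different way around the whole escaping question is to search for literal keys with index() instead of a regex. The sketch below is not RudiC's function but a hypothetical replacement named getstr; the input record is the first line of x.txt above, and the "|" separator is chosen here only for display.

```shell
line='12-MAR-2018 16:23:09 * (CONNECT_DATA=(CID=(PROGRAM=JDBC Thin Client)(HOST=__jdbc__)(USER=ogre01))(SERVER=DEDICATED)(SERVICE_NAME=testapp1_app.somewhere.out.ph)) * (ADDRESS=(PROTOCOL=tcp)(HOST=66.65.60.7)(PORT=11662)) * establish * testapp1_app.somewhere.out.ph * 12514'

# -F with a bracket expression so that "*" is taken literally by any awk
result=$(printf '%s\n' "$line" | awk -F'[*]' '
# getstr (hypothetical): find the literal key and return the text up to the next ")"
function getstr(src, key,    start, rest, stop) {
    start = index(src, key)
    if (start == 0) return ""                  # key not present in this field
    rest = substr(src, start + length(key))    # text right after the key
    stop = index(rest, ")")
    return stop ? substr(rest, 1, stop - 1) : rest
}
{
    print getstr($2, "PROGRAM=") "|" getstr($2, "USER=") "|" getstr($2, "SERVICE_NAME=") "|" getstr($3, "HOST=")
}')

echo "$result"
```

Because index() does plain string matching, no character in the key needs escaping, and the order of CID and SERVICE_NAME inside the second field does not matter.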