Parsing log with sed problem

migurus · January 5, 2011, 9:01pm

I have huge logs that have sql statements in them and I need to parse those statements out.

Log sample:

OUT1: #STMT# [SELECT X FROM P WHERE A=1] from: [SELECT X FROM P WHERE A=?]
[E1 Fine]: 2010.12.20 14:01:16.357--DatabaseSessionImpl(17489534)--Connection(11145468)--Thread(Thread[AWT-EventQueue-0,6,main])--SELECT GETDATE()
[E2 Finer]: 2010.12.20 14:01:16.357--UnitOfWork(12167196)--Thread(Thread[AWT-EventQueue-0,6,main])--begin unit of work commit
[E3 Finer]: 2010.12.20 14:01:16.357--DatabaseSessionImpl(17489534)--Connection(11145468)--Thread(Thread[AWT-EventQueue-0,6,main])--begin transaction
[E4 Fine]: 2010.12.20 14:01:16.357--DatabaseSessionImpl(17489534)--Connection(11145468)--Thread(Thread[AWT-EventQueue-0,6,main])--UPDATE S SET CNT=CNT+1 WHERE SNAME='ABC'
[E5 Fine]: 2010.12.20 14:01:16.373--DatabaseSessionImpl(17489534)--Connection(11145468)--Thread(Thread[AWT-EventQueue-0,6,main])--SELECT CNT FROM S WHERE SNAME='ABC'
[E6 Finer]: 2010.12.20 14:01:16.373--DatabaseSessionImpl(17489534)--Connection(11145468)--Thread(Thread[AWT-EventQueue-0,6,main])--commit transaction

I'd like to get

SELECT X FROM P WHERE A=1
SELECT GETDATE()
UPDATE S SET CNT=CNT+1 WHERE SNAME='ABC'
SELECT CNT FROM S WHERE SNAME='ABC'

my script:

#!/bin/ksh
sed -n  -e '/\-\-SELECT/  s/^.*\-\-//p' \
        -e '/\-\-UPDATE/  s/^.*\-\-//p' \
        -e '/\-\-DELETE/  s/^.*\-\-//p' \
        -e '/#STMT#/    s/.*# \[//;s/\].*//p'   $1

produces this

SELECT X FROM P WHERE A=1
SELECT GETDATE()
[E2 Finer
[E3 Finer
UPDATE S SET CNT=CNT+1 WHERE SNAME='ABC'
SELECT CNT FROM S WHERE SNAME='ABC'
[E6 Finer

What would be suggestions of sed gurus to get clean sql statements. Please note: sed is much preferred in my environment.

DGPickett · January 5, 2011, 9:26pm

Well, rather than focusing on the positive, variable SQL you want to keep, you might focus on the more static part just before that you want to remove. There are a lot of key words, also english words, that might appear.

You lost the from clause derived table there. Is that good?

Chubler_XL · January 5, 2011, 9:43pm

How about this:

#!/bin/sh
sed -n \
    -e '/\-\-SELECT/  s/^.*\-\-//p' \         
    -e '/\-\-UPDATE/  s/^.*\-\-//p' \        
    -e '/\-\-DELETE/  s/^.*\-\-//p' \        
    -e '/#STMT#/      s/\].*//'     \
    -e '/#STMT#/      s/.*# \[//p' $1

-- Edit --
or, I think this is the sort of thing DGPickett was suggesting:

#!/bin/sh
sed -n \
    -e '/\-\-begin/d' \
    -e '/\-\-commit/d' \
    -e 's/^.*])\-\-//p' \
    -e '/#STMT#/s/^.*# \[\([^]]*\)].*/\1/p'

Scrutinizer · January 6, 2011, 5:21am

Or awk:

awk -F'--|[][]' '/#STMT#/{print $2;next}$NF!~/[[:lower:]]/{print $NF}' infile

sed -n 's/.*#STMT# \[\([^]]*\).*/\1/p;/--[^[:lower:]]*$/s/.*--//p' infile

migurus · January 6, 2011, 2:15pm

works great, but I have difficulty understanding the regular expression in that. The way I read it is
/^.# \[ - match from the beginning to the last # sign followed by space and open square bracket
$[^]]*$]./ - match [^] and mark it as 1st -- what is [^] here?
the last part ].* says match from the closing square braket to the end and it is being excluded. So, my question is what is this $[^]]*$
I'd appreciate if you will find time to explain it.
Thanks.

Chubler_XL · January 6, 2011, 4:49pm

Yep your spot on with your analysis.

To explain $[^]]$ as you say the $ and $ are for grouping and store the matching string in \1 so we are just left with [^]] which is: zero or more non closing-square-bracket characters (The ^ symbol means any character not listed, for example [^a-zA-z] would match any single non-alpha character).

The net effect is: On any line that contains "#STMT#" replace everything up to the last "# [" with nothing, and then everything from the first "]" with nothing.

rdcwayx · January 6, 2011, 7:19pm

awk -F "--" 'NR==1{split($0,a,"[][]");print a[2];next} {print $NF}' infile |grep -v "[a-z]"

Chubler_XL · January 6, 2011, 8:05pm

Is it safe to assume the SQL statements are all in upper case?

Can an SQL statment contain lower case (perhaps in the criteria):

SELECT CNT FROM S WHERE SNAME='Jones'

DGPickett · January 7, 2011, 12:13pm

sed '
  s/.* #STMT# \(.*\) from: .*/\1/
  t
  s/.*,main\])--//
  t
  d
 '

migurus · January 7, 2011, 3:20pm

Thanks. Unfortunatelly the awk installed can not work with long lines of those logs, that is the reason for sed to be preferred. Somehow the sed solution posted has been running for some 20 minutes and that might (or might not) be due to the version of sed on the system.

---------- Post updated at 12:20 PM ---------- Previous update was at 12:19 PM ----------

Yes, I've seen some, that is why grep -v in prev. post would not work.

DGPickett · January 7, 2011, 3:30pm

Normalized uppercase, but all case is legal.

migurus · January 7, 2011, 3:40pm

Thank you, solution works, but verrrrry slow (post #4 sed solution by Scrutinizer suffers the same problem)

I'm guessing this is due to the sed version installes, I don't know how to get sed version, this one does not like the usual -v or --version.

Thank you people! So far Chubler_XL solution works the best.

Scrutinizer · January 7, 2011, 4:14pm

The slowless is probably partly due to the grouping part, the part in parentheses... Does this make difference?

sed -n '/.*#STMT#/{s/[^[]*\[//;s/\].*//;p;};/--[^[:lower:]]*$/s/.*--//p' infile

reversing the two statements is faster:

sed -n '/--[^[:lower:]]*$/s/.*--//p;/.*#STMT#/{s/[^[]*\[//;s/\].*//;p;}' infile

DGPickett · January 7, 2011, 4:47pm

You might even profit from an egrep in front, to spread the load so sed does not need to test every line.

grep -E ' #STMT# |,main\]\)--' $input_file | sed '
  s/.* #STMT# \(.*\) from: .*/\1/
  t
  s/.*,main\])--//
 '

Scrutinizer · January 7, 2011, 4:56pm

The rest beside the grouping and the reversal, the biggest gain is matching against fixed strings, versus ranges of characters. Therefore Chublers code is fastest:

sed -n '/--SELECT/s/.*--//p;/--UPDATE/s/.*--//p;/--DELETE/s/.*--//p;/#STMT#/s/\].*//;/#STMT#/s/.*# \[//p' infile

This is the fastest I found so far that works with the sample provided:

sed -n '/Fine]/s/.*--//p;/#STMT#/{s/\].*//;s/.*# \[//p;}' infile

DGPickett · January 7, 2011, 5:11pm

The first 's' can do the search once inside the replace. The second to third transition can be done without a search before with grouping:

 
sed -n 's/.*Fine].*--//p;/#STMT#/{
  s/\].*//
  s/.*# \[//p
 }
 ' infile
 
or even get the second search inside 's' and use a t to continue the trimming on hits:
 
sed -n '
  s/.*Fine].*--//p
  s/.* #STMT# //
  t x
  b
  :x
  s/\].*//p
 ' infile
 
Now, printing the pattern space does not delete the line, so to avoid considering the same line for both patterns:
 
sed -n '
  s/.*Fine].*--//
  t p
  s/.* #STMT# //
  t x
  b
  :x
  s/\].*//p
  b
  :p
  p
 ' infile

I guess it depends on whether branching is quick, but a lot of sed speed tuning is avoiding regex matching at the same data twice!

It's sad and amazing  is so slow!

Scrutinizer · January 7, 2011, 5:16pm

Ha I was thinking the same thing about the single search . Yes that grouping really has an impact, as we also found out some time ago..., but also character ranges/classes..

DGPickett · January 10, 2011, 10:07am

Sometimes I put the sed processing in many piped together sed sessions, one line per session, so the load is distributed and I get to busy more CPUs and save time. But I also moved many slow things into simple C character-at-a-time or line-at-a-time state machine filters, like trimming white space from delimited fields. This runs orders faster than sed or awk, and the C program is still general, and so a very reusable shell accessory.

I edited in some additional potential speedup tweaks above.

migurus · January 10, 2011, 3:53pm

scrutinizer:

The slowless is probably partly due to the grouping part, the part in parentheses... Does this make difference?
sed -n '/.*#STMT#/{s/[^[]*\[//;s/\].*//;p;};/--[^[:lower:]]*$/s/.*--//p' infile
reversing the two statements is faster:
sed -n '/--[^[:lower:]]*$/s/.*--//p;/.*#STMT#/{s/[^[]*\[//;s/\].*//;p;}' infile

Thank you. Yes, this runs faster, but again it would not finish 10,000 line log in 1 minute, I'm guessing usage of [:lower:] makes sed matching harder compared to a number of plain strings (not reg. expr) used in the original approach. Just guessing.

DGPickett · January 11, 2011, 2:48pm

Try combining the regex in front of the s into the one after the s, when possible, so it does not eval two regex.

What percentage of line match neither? A line-filtering sed or egrep up front can reduce the sed s load, if there are many lines entirely discarded.