Regular expression to validate sql's

Hi,

I have 2 different templates of sql's in a single file. From each sql I am trying to extract different segments of the sql's.
Please note that the number of fields and conditions would vary based on the sql. It might not adhere to a single format as below(but will adhere to allowable sql formats).

Below is the sample sql from which I have to identify the following

macroname      |REPLACE MACRO macroname
AS
(

/*  Comments  */

UPDATE tgt.table targ
FROM
(
SELECT
 tmp.col1
 ,tmp.col2
 ,tmp.col3

FROM src.table tmp
 INNER JOIN tgt.table tgt
 ON tmp.key1 = tgt.key1    AND
    tmp.key2 = tgt.key2  
	AND tmp.key3 = 
	tgt.key3    
	AND
    tgt.dt = '2899-12-31'
 
 CROSS JOIN ctltable tdc
--some comments

WHERE
 col1 =
 col1
 col2=col2
 col3=
 
 col3

) innertab

SET dt = currentdate

WHERE targ.col1 = innertab.col1 AND
      targ.col2 = innertab.col2 AND
      targ.col3 = innertab.col3 AND
      targ.dt = '2899-12-31'
;
);
macro2             |REPLACE MACRO macro2
AS
(
/*########################################################################*/
/*                                                                        */
/*########################################################################*/

INSERT INTO tgt.table tab
  (
     col1,
    col2
    ,col3
    ,col4
    ,col5
    ,col6
  )
SELECT
   col1
  ,col2
  ,col3
  ,col4
  ,col5
  ,col6
FROM
  stg.table                  stg
  LEFT OUTER JOIN
  tgt.table                      tgt
        ON   ( ( stg.col1 = tgt.col1 ) 
        AND    (stg.col2 = tgt.col2 ) )
        AND tgt.col3 =  '2899-12-31'
  CROSS JOIN
  ctltable              tdc
WHERE
  (    tab.col1 IS NULL
   AND tab.col2 IS NULL
  ); 
);

  1. whether the sql is an update or an insert query
  2. from update query I have to identify the inner query and validate if the query is not having any derivations or any additional joins from the one's given. Basically the joins should always be between 2 tables stg or temp table with the target table only. the target table is identified from the outermost insert or update statement. And stg or temp is identified from the schema names like stg.
  3. I have to extract each of the segments from the sql. like insert, update, sub queries, where, set.

can you please help me finding a regular expression to identify the sql's which does not comply to my above definition.

This task is beyond simple regular expressions. How did you go with the SQL Query Parser we discussed last week?

Could you use the same approach to help validate these queries?

Well, the sql parser I am not able to understand much with the java code or the instructions. However, I found this link but couldn't implement it as it is time consuming and I am running short of time. So, keeping it for the later tasks. However, I have to get this validation first so I was looking for a sed or awk expressions to validate it.

You may get some traction removing comments and replacing all white space with single spaces, but unless you can get a proper SQL parser you are still likely to be plagued by false positives/negatives.

Perhaps if you get together some plain English pusdo code like:

"JOIN" must always be followed with <whitespace> ( "tgt" or "ctltable") and then ("ON" or "WHERE")

Someone here could help you with a basic validator.

Well, my current plan is to flag for manual intervention if there is something that is not adhering to the standard templates. I have thousands of sql's with most of the with the standard structure. Currently it requires manual analysis to categorize them, if it can be validated then I could segregate only those which requires manual intervention to process.

Thanks for your help!:smiley: