Finding The Complete SQL statement Using PDFGREP Or Grep

metallica1973 · February 2, 2020, 10:52am

Linux Gods,

I am simply attempting to parse SQL statements from a PDF doc in creating a base SQL script at a later time but for the life of me, am having a tough time extracting this data.This exact string worked perfectly a couple of months ago and now it doesnt. Below is an example of the data structure.

show parameter os_authent_prefix

SHOW PARAMETER log_archive_dest;

Audit:

  SELECT AUD.POLICY_NAME, AUD.AUDIT_OPTION, AUD.AUDIT_OPTION_TYPE

  FROM AUDIT_UNIFIED_POLICIES AUD, AUDIT_UNIFIED_ENABLED_POLICIES ENABLED
  WHERE AUD.POLICY_NAME = ENABLED.POLICY_NAME
  AND AUD.AUDIT_OPTION = 'CREATE TRIGGER'
  AND AUD.AUDIT_OPTION_TYPE = 'STANDARD ACTION'
  AND ENABLED.SUCCESS = 'YES'
  AND ENABLED.FAILURE = 'YES'
  AND ENABLED.ENABLED_OPT = 'BY'
  AND ENABLED.USER_NAME = 'ALL USERS';

Other variations I have tried:

pdfgrep -i -PB 20 -A 20 "audit\:" ./Oracle-12.pdf | gawk '{IGNORECASE=1;} /show.*\;/ || /select.*\;/ {print "Here is the data \n\n",$0, "\n"}'

gawk: cmd. line:1: warning: regexp escape sequence `\;' is not a known regexp operator
Here is the data 

   REVOKE SELECT_ANY_DICTIONARY FROM <grantee>; 

Here is the data 

    REVOKE SELECT ANY TABLE FROM <grantee>; 

Here is the data 

   REVOKE SELECT_CATALOG_ROLE FROM <grantee>; 

Here is the data 

   AUDIT SELECT ANY DICTIONARY;

I suspect something changed in a binary or two. In attempting to get past this, I have attempted various regex variations:

pdfgrep -i -PB 20 -A 20 "audit\:" ./Oracle-12.pdf | gawk '{IGNORECASE=1;} /show.*|;/ || /select.*|;/ {print "Here is the data \n\n",$0, "\n"}'

pdftotext ./Oracle-12.pdf - | grep -i "select.*\; | show.*\;"

gawk '{IGNORECASE=1;} /show.*|;/ || /select.*|;/ {print "The Goodies \n\n",$0, "\n"}' ./Oracle-12.pdf.txt

Can someone shed some light? I am using distro Kali 2020.1 which I upgrade from 2019.4 and now the original string doesnt work. Thanks

Scrutinizer · February 2, 2020, 11:03am

Can you specify what does not work? The message presented is just a warning caused by the unnecessary backslash before the semicolon, but it should not change the outcome.

metallica1973 · February 2, 2020, 11:14am

Thank you for your response. When I run the one-liner, its doesnt apply the regex and produce the entire SQL statement like:

SELECT AUD.POLICY_NAME, AUD.AUDIT_OPTION, AUD.AUDIT_OPTION_TYPE
  FROM AUDIT_UNIFIED_POLICIES AUD, AUDIT_UNIFIED_ENABLED_POLICIES ENABLED
  WHERE AUD.POLICY_NAME = ENABLED.POLICY_NAME
  AND AUD.AUDIT_OPTION = 'CREATE TRIGGER'
  AND AUD.AUDIT_OPTION_TYPE = 'STANDARD ACTION'
  AND ENABLED.SUCCESS = 'YES'
  AND ENABLED.FAILURE = 'YES'
  AND ENABLED.ENABLED_OPT = 'BY'
  AND ENABLED.USER_NAME = 'ALL USERS';

Its not honoring the regex:

select.*\;

So when it was working, it would simply find each line that had the SQL statement as in the above sample,inspect and include each newline until it reached the semicolon. The SQL statements in the PDF doc are not in a single line but broken down into multiple newlines, ending with a semicolon.

metallica1973 · February 3, 2020, 9:46am

I tried all the tricks under the sun. No matter what I do, I cant get any of these combinations to work:

pdfgrep -i -PB 20 -A 20 "audit\:" ./Oracle-12.pdf | gawk '{IGNORECASE=1;} /select/,/\;/ {print $0}'
gawk '{IGNORECASE=1;} /select.*/,/\;/ {print $0}' ./Oracle-12.txt
gawk '{IGNORECASE=1} BEGIN { RS = ";" } /FROM/ { gsub("\n+|  +", " ");  print }' ./Oracle-12.txt
gawk '{IGNORECASE=1;} /select.*|;/ {f=1} f; !/\\$/{f=0}' ./Oracle-12.txt
gawk '{IGNORECASE=1;} BEGIN {p=0}; /select.*/ { p = ! p ; next } ; p  {print $0}' ./Oracle-12.txt

Could it be whitespace in the conversion from pdfgrep to a text?

vgersh99 · February 3, 2020, 9:55am

gawk -v IGNORECASE=1 '....'