Searching multiple patterns and construct SQL

nv186000 · March 7, 2018, 1:20am

Hello,

I have attached 2 files

1) Original_table_definition.txt => which has definition of 3 tables
2) describe_table_output.txt => which has again 3 tables definition gotten thorugh doing a show table or describe table way.

Now difference between 3 tables are that tablea has no partition tableb has 1 partition ( on column partb) and tablec has 2 partition ( partc1,partc2)

I am trying to extract data in below format from the describe_table_output.txt

tablea|NULL   --- because partitionKeys:[]  has nothing in it 
tableb|partb  --  because partitionKeys:[FieldSchema(name:partc1 means there is 1 partition for table
tablec|partc1|partc2 ---  because partitionKeys:[FieldSchema(name:partc1, type:string, comment:null), FieldSchema(name:partc2, type:string, comment:null)]  2 partiton for table

so that i can generate below SQL dynamically

ANALYZE table tableA COMPUTE STATISTICS;
ANALYZE TABLE tableb PARTITION(partb) COMPUTE STATISTICS;
ANALYZE TABLE tablec PARTITION(partc1,partc2) COMPUTE STATISTICS;

I have the below code to extract tablename from the describe_table_output.txt

grep -o 'Table(tableName:[^,]*' sample  | awk -F ':' '{ OFS="|";print $2}'

But i have a problem with how to search for the next pattern that is the partition columns

for tablea since there are no parttion column the text looks like this

for tablea since there is only 1 partition it looks like this

partitionKeys:[FieldSchema(name:partb,

for tablec there are 2 partition so it looks like

partitionKeys:[FieldSchema(name:partc1, type:string, comment:null), FieldSchema(name:partc2,

Now please note that FieldSchema: key alone is not unique as it is used against individual columns as well as partitioned columns and we want only those FieldSchema: which are preceded by partitionKeys: keyword.

Request your inputs on how to resolve this .

My environment details are

 echo $BASH_VERSION
3.2.57(1)-release

uname -a
Linux sedcahdp0390 3.0.101-80-default #1 SMP Fri Jul 15 14:30:41 UTC 2016 (eb2ba81) x86_64 x86_64 x86_64 GNU/Linux

awk --version
GNU Awk 3.1.8

sed --version
GNU sed version 4.1.5

Thanks

RudiC · March 7, 2018, 4:20am

How about

awk '
#                                       {gsub (/\r/, _)
#                                       }
match ($0, /tableName:[^,]*/)           {printf "Analyze Table %s ", substr ($0, RSTART+10, RLENGTH-10)
                                        }
match ($0, /partitionKeys:[[][^]]*]/)   {TMP = substr ($0, RSTART+15, RLENGTH-16)
                                         n   = gsub (/FieldSchema\(name:/, "\n", TMP)
                                         if (n) {printf "Partition ("
                                                 m = split (TMP, TARR, "\n")
                                                 for (i=2; i<=m; i++)   {sub (/,.*$/, _, TARR)
                                                                         printf "%s%s", TARR, i<m?",":""
                                                                        }
                                                 printf ") "
                                                }
                                         printf "Compute statistics;\n"
                                        }
' /tmp/describe_table_output.txt
Analyze Table tablea Compute statistics;
Analyze Table tableb Partition (partb) Compute statistics;
Analyze Table tablec Partition (partc1,partc2) Compute statistics;

Your files have non-*nix (DOS) line terminators (<CR> = 0x0D = \r = ^M), the commented-out gsub can take care of those should they irritate any algorithms.

nv186000 · March 7, 2018, 4:49am

Thanks RudiC . Appreciate your quick help. Non Unix line terminators might be because i copied the data from terminal to Notepad++ for uploading to the forum.