awk - Multi-line data to be stored in variable

Greetings Experts,

As part of automating the sql generation, I have the source table name, target table name, join condition stored in a file join_conditions.txt which is a delimited file (I can edit the file if for any reason). The reason I needed to store is I have built SELECT list without FROM clause according to source and target tables taken from another file and stored it in an array and now I need to build the complete sql query of select list and the from list;

Src_tbl       Tgt_tbl     join condition
Src_1        @Tgt_1       @FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN SRC_TEMP_1 ON
                          SRC_1.COL1=SRC_TEMP_1.COL1) SRC JOIN (SELECT TGT_1.* FROM TGT_1 
                          INNER JOIN TGT_TEMP_1 ON TGT_1.COL1=TGT_TEMP_1.COL1) TGT ON 
                          SRC.COL1=TGT.COL1;
Src_2        @Tgt_2        ...................

Please note that I have aligned the 3rd column LPAD with spaces for better view; the provided join condition column can span multiple lines;
Similarly there are corresponding join conditions for other source and target tables in the file which is not mentioned here.

awk -F "@" 'BEGIN{RS=";"}
{
join_cond=$3
a[$1 OFS $2]=join_cond
}
END {
for (i in a) {
print a 
}
}' < join_conditions.txt > sql_query.txt

Expected assignment:

FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN SRC_TEMP_1 ON
                          SRC_1.COL1=SRC_TEMP_1.COL1) SRC JOIN (SELECT TGT_1.* FROM TGT_1 
                          INNER JOIN TGT_TEMP_1 ON TGT_1.COL1=TGT_TEMP_1.COL1) TGT ON 
                          SRC.COL1=TGT.COL1;

Current assignment:

FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN SRC_TEMP_1 ON

Even though I specified the RS=; I think it is still considering the \n as record separator; How can I assign the multi-line content present in
the join_conditions.txt to awk variable jon_cond for frame the sql query; I am struck up with this as I have no clue on how to proceed;

I can break the file manually to contain the join_condition with source and target repeating multiple times as

Src_tbl1   Tgt_tbl1     join condition
Src_1      Tgt_1        FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN
Src_1      Tgt_1        SRC_TEMP_1 ON SRC_1.COL1=SRC_TEMP_1.COL1) SRC
Src_1      Tgt_1        JOIN (SELECT TGT_1.* FROM TGT_1 INNER JOIN TGT_TEMP_1 ON 
Src_1      Tgt_1        TGT_1.COL1=TGT_TEMP_1.COL1) TGT ON SRC.COL1=TGT.COL1;

and then carry out the required operation. But I assume that there is more elegant approach

Many thanks for your time..

---------- Post updated at 11:27 AM ---------- Previous update was at 10:52 AM ----------

Hi RavinderSingh,

I have a input delimited file whose layout is

Src_tbl@Tgt_tbl@join_condition
Src_1@Tgt_1@FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN SRC_TEMP_1 ON  SRC_1.COL1=SRC_TEMP_1.COL1) SRC JOIN (SELECT TGT_1.* FROM TGT_1  INNER JOIN TGT_TEMP_1 ON TGT_1.COL1=TGT_TEMP_1.COL1) TGT ON  SRC.COL1=TGT.COL1;

The data in join_condition can span multiple lines here.

I need to get the data in the join_condition column and assign it to an array for further operations;

awk -F "@" ' BEGIN{RS=";"}
{
a[$1 OFS $2]=$3
print a[$1 OFS $2]
}'

Expected Output:

FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN SRC_TEMP_1 ON  SRC_1.COL1=SRC_TEMP_1.COL1) SRC JOIN (SELECT TGT_1.* FROM TGT_1  INNER JOIN TGT_TEMP_1 ON TGT_1.COL1=TGT_TEMP_1.COL1) TGT ON  SRC.COL1=TGT.COL1;

Output what I am seeing:

FROM ( SELECT SRC_1.* FROM SRC_1 INNER JOIN

How to get the expected output here please.

Interesting, nawk variants treat an FS="@" in a way that line feeds are counted aw well.
Work-around:

awk -F"[@]" 'BEGIN {RS=";"} {print $3}' < join_conditions.txt

or

awk 'BEGIN {FS="\@"; RS=";"} {print $3}' < join_conditions.txt
1 Like

Thank You MadeInGermany; The first solution did work; however the second solution did produce the same output as my post (not able to recognize ; as RS) Can you please explain how

[@] 

different is to that of a simple ones;

awk -F "@"....
awk -F "[@]"...

I am not able to understand the difference as I am learning awk :slight_smile:

Edit:
Does [] ignore the conventional "\n" as RS unless specified other than this in all cases and does awk -F "@" considers the RS as "\n" when the specified RS is not specified or found in the entire line..

You are right, some nawk derivates require

awk 'BEGIN {FS="[@]"; RS=";"} {print $3}' join_conditions.txt

The RS=";" works multi-line, but the FS="@" treats line feeds as a field delimiter.
You see it with

awk 'BEGIN {FS="@"; RS=";"} {print NR,NF}' join_conditions.txt
awk 'BEGIN {FS="[@]"; RS=";"} {print NR,NF}' join_conditions.txt

(record number, number of fields)
Why is this? I am surprised myself...maybe a bug...there should be no difference between FS="@" and FS="[@]"

1 Like

Hi MadeinGermany,
The NR is same for both the FS="@" and FS="[@]" However the NF differs in the above cases; as you specified for FS="@" it is treating line feeds as a delimiter. Thanks again

Interesting. I had a look at this and it appears this behavior is with any character, not just "@", but also for "*", "a" or "b".

I only saw this with AIX and HPUX awk..

A possible explanation for the difference between [@] and @ may be that in the POSIX specification when a single character is used as FS, then this is regarded as a string and not as a regular expression. When [c] is used (where c is a character, for example @) then this is a regular expression.

In these two awks, it appears that, if an RS other than \n is used, and a single character is used as FS then a newline is still seen as a field separator, whereas this is not the case if FS is a regular expression.

I could not find anything in the POSIX specification that describes this behavior, so it is not as it should be. It does remind me a bit of how newlines are field separators irrespective of FS value when RS="" is used, so maybe it is a remnant behavior of some sort.

2 Likes