Join columns across multiple lines in a Text based on common column using BASH

nv186000 · March 6, 2018, 5:45am

Hello,

I have a file with 2 columns ( tableName , ColumnName) delimited by a Pipe like below . File is sorted by ColumnName.

Table1|Column1
Table2|Column1
Table5|Column1
Table3|Column2
Table2|Column2
Table4|Column3
Table2|Column3
Table2|Column4
Table5|Column4
Table2|Column5

From the below file i am trying to generate a Dynamic SQL Join between tablenames in column 1 when they have same value for columnname in column 2 of above table.

select * from Table1 a inner join Table2 b on a.Column1=b.column1 inner join Table5 c on a.Column1=c.column1

and

select * from Table3 a inner join Table2 b on a.column2 = b.column2

i want to do this by traversing through the file row by row till i reach end of file.

Can you please advise What is the best way to do it ?

Note: If i can just get the table names sharing same column names then also i can create the dynamic SQL outside the Bash logic.

RudiC · March 6, 2018, 5:54am

Welcome to the forum.

Any attempts / ideas / thoughts from your side? How would you "create the dynamic SQL outside the Bash logic"? Do solution proposals have to be shell or would text tools like e.g. awk be acceptable as well?

nv186000 · March 6, 2018, 7:06am

In another thread some one suggested below AWK code , unfortunately our awk version did not match and i was not able to use this .

{ a[$2][$1] };
END {
    for (col in a) {
        printf "%s", col;
        for (tab in a[col])
            printf "|%s", tab;
        print ""
    }
}

Another method i am trying now is using a self join

join -t "|" -1 2 -2 2  -o '1.1,2.1,1.2'  file  file

and i am getting below output

Table1|Table1|Column1
Table1|Table2|Column1
Table1|Table5|Column1
Table2|Table1|Column1
Table2|Table2|Column1
Table2|Table5|Column1
Table5|Table1|Column1
Table5|Table2|Column1
Table5|Table5|Column1
Table3|Table3|Column2
Table3|Table2|Column2
Table2|Table3|Column2
Table2|Table2|Column2
Table4|Table4|Column3
Table4|Table2|Column3
Table2|Table4|Column3
Table2|Table2|Column3
Table2|Table2|Column4
Table2|Table5|Column4
Table5|Table2|Column4
Table5|Table5|Column4
Table2|Table2|Column5

Now at least 2 problem with this approach

1) At max I can get 2 tables with common join columns . If there are 3 tables with same columns i will have to split it into 2 separate SQL statement with one join each.

2) I need to remove functional duplicates . that is joins between same tables , joins where just table names are reversed table1 join table2 == table2 join table1

Note:
1) I am open to using awk , or sed ..or anything that works.
2) outside Bash i meant , i can use hard coded concatenation to generate most part of the SQL statement apart from the join part.

Don_Cragun · March 6, 2018, 7:34am

nv186000:

In another thread some one suggested below AWK code , unfortunately our awk version did not match and i was not able to use this .
{ a[$2][$1] };
END {
   for (col in a) {
   printf "%s", col;
   for (tab in a[col])
   printf "|%s", tab;
   print ""
   }
}
Another method i am trying now is using a self join
join -t "|" -1 2 -2 2  -o '1.1,2.1,1.2'  file  file 
and i am getting below output
Table1|Table1|Column1
Table1|Table2|Column1
Table1|Table5|Column1
Table2|Table1|Column1
Table2|Table2|Column1
Table2|Table5|Column1
Table5|Table1|Column1
Table5|Table2|Column1
Table5|Table5|Column1
Table3|Table3|Column2
Table3|Table2|Column2
Table2|Table3|Column2
Table2|Table2|Column2
Table4|Table4|Column3
Table4|Table2|Column3
Table2|Table4|Column3
Table2|Table2|Column3
Table2|Table2|Column4
Table2|Table5|Column4
Table5|Table2|Column4
Table5|Table5|Column4
Table2|Table2|Column5
Now at least 2 problem with this approach

1) At max I can get 2 tables with common join columns . If there are 3 tables with same columns i will have to split it into 2 separate SQL statement with one join each.

2) I need to remove functional duplicates . that is joins between same tables , joins where just table names are reversed table1 join table2 == table2 join table1

Note:
1) I am open to using awk , or sed ..or anything that works.
2) outside Bash i meant , i can use hard coded concatenation to generate most part of the SQL statement apart from the join part.

Saying, "unfortunately our awk version did not match" makes it clear that the operating system and shell you're using is important information that we need to know to be able to supply you with suggestions that will work in your environment. (And, this information should be supplied in every thread started in this forum.) From what you have said above, I would assume that you're using bash , but we don't know which version. So, please tell us what operating system (including the release/version number) and which version of bash you're using so we can provide you with suggestions that will work in your environment.

nv186000 · March 6, 2018, 7:44am

Thanks Don . I will keep your points in mind before opening a new thread.

The details you asked are given below

 echo $BASH_VERSION
3.2.57(1)-release

 uname -a
Linux sedcahdp0390 3.0.101-80-default #1 SMP Fri Jul 15 14:30:41 UTC 2016 (eb2ba81) x86_64 x86_64 x86_64 GNU/Linux

 eedc_edg_s_d-itm_e@sedcahdp0390:/home/eedc_edg_s_d-itm_e/test_automation : lsb_release -a
LSB Version:    core-2.0-noarch:core-3.2-noarch:core-4.0-noarch:core-2.0-x86_64:core-3.2-x86_64:core-4.0-x86_64:desktop-4.0-amd64:desktop-4.0-noarch:graphics-2.0-amd64:graphics-2.0-noarch:graphics-3.2-amd64:graphics-3.2-noarch:graphics-4.0-amd64:graphics-4.0-noarch
Distributor ID: SUSE LINUX
Description:    SUSE Linux Enterprise Server 11 (x86_64)
Release:        11
Codename:       n/a

awk --version
GNU Awk 3.1.8

sed --version
GNU sed version 4.1.5

RudiC · March 6, 2018, 7:45am

How about (tested on linux with mawk 1.3.3)

awk -F\| '
        {TMP[$2] = TMP[$2] FA[$2] $1
         FA[$2] = FS
        }
END     {CH[1] = 97
         for (t in TMP) {n = split (TMP[t], T)
                         printf "select * from %s %c", T[1], CH[1]
                         for (i=2; i<=n; i++)   {CH = CH[i-1] + 1
                                                 printf " inner join %s %c on %c.%s=%c.%s", T, CH, CH[1], t, CH, t
                                                }
                         printf RS
                        }
        }
' file
select * from Table1 a inner join Table2 b on a.Column1=b.Column1 inner join Table5 c on a.Column1=c.Column1
select * from Table3 a inner join Table2 b on a.Column2=b.Column2
select * from Table4 a inner join Table2 b on a.Column3=b.Column3
select * from Table2 a inner join Table5 b on a.Column4=b.Column4
select * from Table2 a

EDIT: or mayhap with a simpler END section:

END     {split ("a b c d e f g", CH, " ")
         for (t in TMP) {n = split (TMP[t], T)
                         printf "select * from %s %c", T[1], CH[1]
                         for (i=2; i<=n; i++)   printf " inner join %s %c on %c.%s=%c.%s", T, CH, CH[1], t, CH, t
                         printf RS
                        }

nv186000 · March 6, 2018, 8:14am

Thank you RudiC . Both approaches are working fine. I am testing the solution with real data and will let you know in case something comes up