Awk: Append new elements to an array

Hi all,

I'm working on a bash script that merges the records from a set of files and counts how many times each record occurs. The last field is the file name.

Sample files:

head -5 *.tab

==> 3J373_P15Ac1y2_01_LS.tab <==

chr1    1956362 1956362 G       A       hom     3J373_P15Ac1y2_01_LS.tab
chr1    1957037 1957037 T       C       hom     3J373_P15Ac1y2_01_LS.tab
chr1    1960926 1960926 T       C       hom     3J373_P15Ac1y2_01_LS.tab
chr1    17359676        17359676        C       A       hom     3J373_P15Ac1y2_01_LS.tab
chr1    17371152        17371152        T       C       het     3J373_P15Ac1y2_01_LS.tab

==> 7D300_P15Ac1y2_01_GATK.tab <==

chr1    1956362 1956362 G       A       het     7D300_P15Ac1y2_01_GATK.tab
chr1    1957037 1957037 T       C       het     7D300_P15Ac1y2_01_GATK.tab
chr1    1959107 1959107 G       C       het     7D300_P15Ac1y2_01_GATK.tab
chr1    1959699 1959699 G       A       het     7D300_P15Ac1y2_01_GATK.tab
chr1    17359676        17359676        C       A       hom     7D300_P15Ac1y2_01_GATK.tab
.
.
.

Up to several dozen files...

Here is my code:

cat *.tab \
    | awk 'BEGIN {FS="\t";OFS="\t"} {s[$1":"$2"-"$3";"$4"/"$5]=$0; c[$1":"$2"-"$3";"$4"/"$5]++} END {for (i in s) print i,c[i],$7}' \
    | sort -V \
    > CommonVariants.bed

Output file:

cat CommonVariants.bed
chr1:1956362-1956362;G/A    36    7D300_P15Ac1y2_01_LS.tab
chr1:1957037-1957037;T/C    36    7D300_P15Ac1y2_01_LS.tab
chr1:1957112-1957112;C/T    2    7D300_P15Ac1y2_01_LS.tab
chr1:1959107-1959107;G/C    2    7D300_P15Ac1y2_01_LS.tab
chr1:1959138-1959138;G/C    2    7D300_P15Ac1y2_01_LS.tab
chr1:1959549-1959549;G/A    2    7D300_P15Ac1y2_01_LS.tab
chr1:1959699-1959699;G/A    4    7D300_P15Ac1y2_01_LS.tab
chr1:1959789-1959789;A/G    3    7D300_P15Ac1y2_01_LS.tab
chr1:1960674-1960674;C/T    6    7D300_P15Ac1y2_01_LS.tab
chr1:1960926-1960926;T/C    18    7D300_P15Ac1y2_01_LS.tab
chr1:1961144-1961144;C/T    2    7D300_P15Ac1y2_01_LS.tab
chr1:1961408-1961408;C/T    6    7D300_P15Ac1y2_01_LS.tab
chr1:1961466-1961466;C/T    2    7D300_P15Ac1y2_01_LS.tab
chr1:17359676-17359676;C/A    36    7D300_P15Ac1y2_01_LS.tab

I can create the index and count the lines. However I can't figure out how to append the file names into the $7 column.
I guess I have to replace "$7" with an array in the awk statement, but this is too much for me.

I really appreciate any help.

Thank you in advance

cat *.tab \
    | awk 'BEGIN {FS="\t";OFS="\t"} {s[$1":"$2"-"$3";"$4"/"$5]=$0; c[$1":"$2"-"$3";"$4"/"$5]++; a[$1":"$2"-"$3";"$4"/"$5] = $7} END {for (i in s) print i,c[i],a[i]}' \
    | sort -V \
    > CommonVariants.bed

chr1:1956362-1956362;G/A is present in both 3J373_P15Ac1y2_01_LS.tab and 7D300_P15Ac1y2_01_GATK.tab, yet the output specifies 7D300_P15Ac1y2_01_LS.tab. How does that work?

Thank you SriniShoo, but I think I didn't explain it properly. I need the names of all the files where the index was found.

Expected output:

chr1:1959138-1959138;G/C    2    7D300_P15Ac1y2_01_LS.tab, 3H682_P15Ac1y2_01_LS.tab
chr1:1959549-1959549;G/A    2    7D300_P15Ac1y2_01_LS.tab, 3H682_P15Ac1y2_01_LS.tab
chr1:1959699-1959699;G/A    4    7D300_P15Ac1y2_01_LS.tab, 3H682_P15Ac1y2_01_LS.tab, 3J188_P15Ac1y2_01_LS.tab, 3J270_P15Ac1y2_01_GATK.tab
chr1:1959789-1959789;A/G    3    7D300_P15Ac1y2_01_LS.tab, 3H682_P15Ac1y2_01_LS.tab, 3J188_P15Ac1y2_01_LS.tab

Thank you again

---------- Post updated at 02:08 PM ---------- Previous update was at 02:05 PM ----------

Scrutinizer, that's my problem. It always displays the file name of the last file where it found the index.

Try something like:

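awk '
  {
    i=$1":"$2"-"$3";"$4"/"$5     # index built from fields 1-5
    c[i]++                       # count every line with this index
  }
  !P[i,$7]++ {                   # first time file $7 is seen for index i
    F[i]=F[i] (F[i]?", ":x) $7   # append the file name to the list for i
  }
  END {
    for (i in c) print i,c[i],F[i]
  }
' FS='\t' OFS='\t' *.tab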

Single line (too long) version:

awk '{i=$1":"$2"-"$3";"$4"/"$5; c[i]++} !P[i,$7]++{F[i]=F[i] (F[i]?", ":x) $7} END{for (i in c) print i,c[i],F[i]}' FS='\t' OFS='\t' *.tab
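
For anyone who wants to try this without the real data, here is a minimal sketch on two made-up files. The names a.tab and b.tab, the coordinates, and the temporary directory are assumptions for illustration only. A plain sort stands in for sort -V, and "" is written in place of the unset variable x.

```shell
# Hypothetical demo data: a.tab and b.tab are made-up file names.
tmp=$(mktemp -d)
printf 'chr1\t100\t100\tG\tA\thom\ta.tab\n'  > "$tmp/a.tab"
printf 'chr1\t200\t200\tT\tC\thet\ta.tab\n' >> "$tmp/a.tab"
printf 'chr1\t100\t100\tG\tA\thet\tb.tab\n'  > "$tmp/b.tab"

result=$(cd "$tmp" && awk '
  {
    i = $1 ":" $2 "-" $3 ";" $4 "/" $5   # index built from fields 1-5
    c[i]++                               # one count per input line
  }
  !P[i,$7]++ {                           # true once per (index, file) pair
    F[i] = F[i] (F[i] ? ", " : "") $7    # grow the per-index file list
  }
  END {
    for (i in c) print i, c[i], F[i]
  }
' FS='\t' OFS='\t' *.tab | sort)

printf '%s\n' "$result"
rm -rf "$tmp"
```

With these two files, the index chr1:100-100;G/A should come out with a count of 2 and the list "a.tab, b.tab", while chr1:200-200;T/C gets a count of 1 and just "a.tab".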

It works!

Awesome! Brilliant! Wonderful!

Thank you so much!!!

---------- Post updated at 03:44 PM ---------- Previous update was at 03:22 PM ----------

What does this mean?

!P[i,$7]++{F[i]=F[i] (F[i]?", ":x) $7}

You are welcome...

!P[i,$7]++                   # P[i,$7] counts how often file $7 has been seen for index i.
                             # The first time, the element does not exist yet, so its value
                             # is 0 and the negation is true. The post-increment then sets
                             # it to 1, so on later occurrences the negation is 0 (false).
                             # This way each file name is added to F[i] only once.

F[i]=F[i] (F[i]?", ":x) $7   # Append $7 to F[i], putting a separator ( ", " ) in between
                             # if F[i] is already non-empty. x is an unset variable,
                             # so it evaluates to the empty string.
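
The same two pieces can be seen in isolation in a small sketch like the one below. The array name seen, the variable list, and the input lines are made up for the illustration; the pattern and the append are the same as above, just without the per-index array.

```shell
# Made-up input: five lines with three distinct values.
result=$(printf 'a.tab\nb.tab\na.tab\nc.tab\nb.tab\n' | awk '
  # seen[$0] is 0 the first time a line appears, so !seen[$0]++ is true
  # exactly once per distinct line; the ++ then marks the line as seen.
  !seen[$0]++ { list = list (list ? ", " : "") $0 }
  END { print list }
')
printf '%s\n' "$result"   # prints: a.tab, b.tab, c.tab
```

Each value is appended once, in order of first appearance, and the separator is only inserted when the list already has something in it.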

clear like water