Looping through an entire directory and counting unique values

Hello,

I'm a complete newbie to coding; please help with this problem.

I have multiple files in a directory. I need to loop through the contents of each file and extract the number of unique isoforms in that file. Each file is tab-delimited, and only the first line with 'parent' in column 3 needs to be read.

Sample file

1 Graph graph 260052556 260054526 . + . ID=HCG177897.2_FG022
1 Graph parent 260052556 260052696 . + . ID=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023
1 Graph child 260052742 260054066 . + . ID=HCG177897.2_FG022_2;Parent=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022
1 Graph child 260054161 260054239 . + . ID=HCG177897.2_FG022_3;Parent=HCG177897.2_FG022_2;Isoforms=HCG177897.2_FGT022
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT022
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT022
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT023
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT023
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT022
1 Graph child 260054323 260054526 . + . ID=HCG177897.2_FG022_4;Parent=HCG177897.2_FG022_3;Isoforms=HCG177897.2_FGT022

In the above sample, the only relevant line is the one with the word 'parent' in column 3.

1 Graph parent 260052556 260052696 . + . ID=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023

Column 9 is a composite column whose subfields are separated by ';'. The last subfield doesn't end with ';', as in the line above.

The subfield Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023 needs to be extracted and the number of isoforms counted.
In this case there are 2 isoforms, separated by a comma. In general, the count is the number of commas + 1.

I should note that sometimes there may be irrelevant subfields after Isoforms, for example:

1 Graph parent 260052556 260052696 . + . ID=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023;code=TRUE;FGTLL

but the calculations remain the same.
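
The counting rule described above (take the Isoforms subfield, ignore whatever follows the next ';', count the comma-separated entries) can be sketched for a single column-9 string as follows. This is only an illustration: count_isoforms is a hypothetical helper name, and it assumes the subfield is always spelled exactly "Isoforms=".

```shell
# count_isoforms: hypothetical helper; takes one column-9 string as its argument.
count_isoforms() {
    printf '%s\n' "$1" | awk -F';' '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^Isoforms=/) {
                sub(/^Isoforms=/, "", $i)     # keep only the comma-separated list
                print split($i, parts, ",")   # split() returns the number of pieces
            }
    }'
}

# Without and with trailing subfields; both calls print 2.
count_isoforms 'ID=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023'
count_isoforms 'ID=HCG177897.2_FG022_1;Isoforms=HCG177897.2_FGT022,HCG177897.2_FGT023;code=TRUE;FGTLL'
```

Splitting on ';' first means trailing subfields such as code=TRUE never reach the comma count.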

So I want to count the number of isoforms in each file in the directory
and write output in the form

HCG177897.2_FG022.txt 2

where HCG177897.2_FG022.txt is the filename

Desired output for sample inputs (attached)

 
HCG177897.2_FG002.txt 1
HCG177897.2_FG022.txt 2
HCG186375.3_FG001.txt 4

Try this

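$ awk '$3 == "parent" {
    gsub(/.*Isoforms=/, "");    # drop everything up to and including "Isoforms="
    gsub(/;.*/, "");            # drop any subfields after the isoform list
    print FILENAME " " split($0, dummy, ",");   # split() returns the number of pieces
    nextfile                    # only the first parent line per file (gawk and most current awks)
}' *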

HCG177897.2_FG002.txt 1
HCG177897.2_FG022.txt 2
HCG186375.3_FG001.txt 4
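
If you want something a little more defensive, here is a sketch of the same idea wrapped in a shell function. The name count_first_parent is made up for illustration. It assumes column 9 contains no spaces, runs one awk per file so that a plain exit stops at the first parent line, and looks only at the subfield that starts with "Isoforms=".

```shell
# count_first_parent: hypothetical helper; pass it the files to scan.
count_first_parent() {
    for f in "$@"; do
        [ -f "$f" ] || continue                 # skip anything that is not a regular file
        awk -v name="${f##*/}" '
            $3 == "parent" {
                n = split($9, attr, ";")        # column 9: subfields separated by ";"
                for (i = 1; i <= n; i++)
                    if (attr[i] ~ /^Isoforms=/) {
                        sub(/^Isoforms=/, "", attr[i])
                        print name, split(attr[i], iso, ",")
                    }
                exit                            # stop after the first parent line
            }' "$f"
    done
}

# Usage, from inside the data directory:
#   count_first_parent *.txt
```

The file name is printed without its directory part (${f##*/}), so the output has the same "filename count" shape as above. Files with no parent line produce no output.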