I am running the bash loop below on all files of a specific type (*base_counts.txt) in a directory. There are 4 awk commands that use the input files to search another file and look for a match. The input files range from 27 to 259 lines and are lists of names. The file that is searched is 11,137,660 lines long. The loop does run, but it takes ~20 hours to complete on a computer with 64 GB of RAM and an 8-core Xeon processor. Is this normal, and can it be made faster (more efficient)? Thank you :).
for f in /home/cmccabe/Desktop/HiQ/*base_counts.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PCD_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PCD_coverage.
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/BMF_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_BMF_coverage.bed
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PAH_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PAH_coverage.bed
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PID_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PID_coverage.bed
done
The contents of match.awk:
BEGIN {
FS="[ \t|]*"
}
# Read search terms from file1 into 's'
FNR==NR {
s[$0]
next
}
{
# Check if $5 matches one of the search terms
for(i in s) {
if($5 ~ i) {
# Store first two fields for later usage
a[$5]=$1
b[$5]=$2
# Add $8 to total of $8 per $5
t[$5]+=$8
# Increment count of occurrences of $5
c[$5]++
next
}
}
}
END {
# Calculate average and print output for all search terms
# that have been found
for(i in t) {
avg = t[i] / c[i]
printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
}
}
In your code, you are saving $1 in a[] and $2 in b[], and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship hold on every line of your file? (Saving and printing $4 in one array will be faster than saving $1 in one array, saving $2 in another array, and printing both of them.) And you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!
Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?
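As an aside, the number of decimal places printed is simply a matter of the printf conversion used; a one-line illustration:

```shell
# printf "%.2f" rounds to two decimal places; "%s" (as in the current
# script) falls back to awk's default OFMT formatting instead.
avg=$(awk 'BEGIN { printf "%.2f\n", (1+2+3)/3 }')
echo "$avg"
```

This prints 2.00.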
Are your search strings always exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> in $5? (Your script will run MUCH faster if you perform one test to determine whether a string is a subscript of an array instead of an average of 14-130 regular-expression matches.)
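To illustrate the difference, here is a minimal, self-contained sketch; the sample lines below, and the assumption that the search term is everything in $5 before the -, are made up from the samples shown in this thread:

```shell
# Hypothetical stand-ins for a terms file and a few data lines.
cat > /tmp/terms.txt <<'EOF'
AGRN
CFTR
EOF
cat > /tmp/data.txt <<'EOF'
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 115
chr7 117120016 117120201 chr7:117120016 CFTR-1|gc=40 1 0 7
chr2 44500 44600 chr2:44500 OTHER-1|gc=10 1 0 3
EOF
result=$(awk '
FNR==NR { s[$0]; next }        # load search terms as array subscripts
{
    split($5, p, "-")          # "AGRN-6|gc=75" -> p[1] == "AGRN"
    if (p[1] in s)             # one hash lookup, no regex loop
        count[p[1]]++
}
END { for (k in count) print k, count[k] }
' /tmp/terms.txt /tmp/data.txt | sort)
echo "$result"
```

The `(string in array)` test is a single hash lookup per line, regardless of how many search terms there are, whereas `$5 ~ i` inside a `for(i in s)` loop costs one regular-expression match per term per line.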
You're reading a 3/4 GB file four times - I don't know whether disk I/O buffering will easily cope with that. Why don't you read your four .bed files into four different (multidimensional?) arrays (259 is not too large an array element count), then do your four independent calculations on each line of the large file, and then write to the four different result files?
In that case, the example I gave above should put you on track. Split off the AGRN (or whatever part of $5 could possibly be *exactly* matched by the search pattern). This avoids searching for a pattern in a longer string, which is rather expensive.
Instead of reading 750 MB, you are reading 3 GB to operate on. With the four input files in arrays and an extended algorithm, the performance might be far better.
If we had some meaningful samples, we could work out a small test script...
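In the meantime, here is a rough sketch of that single-pass approach; the panel names, paths, field layout, and sample data below are all hypothetical stand-ins, not the real files:

```shell
# Load every panel's terms into one array keyed by panel name, then
# scan the big file ONCE, routing totals to per-panel output files.
mkdir -p /tmp/panels /tmp/out
printf 'AGRN\n' > /tmp/panels/PCD.bed
printf 'CFTR\n' > /tmp/panels/PAH.bed
cat > /tmp/big.txt <<'EOF'
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 1
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 2
chr7 117120016 117120201 chr7:117120016 CFTR-1|gc=40 1 0 3
EOF
awk '
# Panel files first: remember each term under its panel name.
FILENAME != "/tmp/big.txt" {
    pn = FILENAME; sub(/.*\//, "", pn); sub(/\.bed$/, "", pn)
    panels[pn]; terms[pn, $0]
    next
}
# Big file: one pass, exact hash lookups only.
{
    split($5, f, "-")
    for (pn in panels)
        if ((pn, f[1]) in terms) {
            t[pn, $5] += $8; c[pn, $5]++
            loc[pn, $5] = $4      # assuming $4 == $1":"$2
        }
}
END {
    for (k in t) {
        split(k, kk, SUBSEP)
        printf "%s\t%s\t%s\n", loc[k], kk[2], t[k] / c[k] > ("/tmp/out/" kk[1] "_coverage.bed")
    }
}
' /tmp/panels/PCD.bed /tmp/panels/PAH.bed /tmp/big.txt
```

With four real panels this still costs only four hash lookups per line of the big file, and the big file is read once per *base_counts.txt file instead of four times.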
Part of what is confusing is that you have input files with the filename extension .bed that you show as having a single field such as:
AGRN
CCDC39
CCDC40
CFTR
DNAAF1
...
and you have four output files (three of which have the same .bed filename extension but a completely different format):
chr1:955543 AGRN-6|gc=75 3
and one other output file whose name ends in a bare period, with no extension at all. Why aren't the names of your output files consistent? Why aren't all files with the filename extension .bed in the same format?
And we have an unknown number of files matching the pattern /home/cmccabe/Desktop/HiQ/*base_counts.txt and no indication of what is actually matched by the asterisk. Please give us some actual sample pathnames that this pattern might match.
You have said your input files have more than 11 million lines each and have shown us the 3-line sample:
and your code accumulates totals based on the string AGRN-6 and prints results assuming that AGRN-6 and AGRN-6|gc=75 select the same set of lines from your huge input files. Please give us a few more lines (some with strings that will be selected for output from the .bed input file and some that won't), and show us the exact output you hope to get in your four output files for that sample input. (Note that this means we need to see four sample .bed input files and four corresponding output files in your sample.)
From your description I am assuming that there could be multiple AGRN-x values in the input, but that for a given AGRN-x the string following the | will be a constant. I.e., for $5 in your code having the value AGRN-6, the only value for $6 will be gc=75; but there could be an AGRN-otherstring, and all AGRN-otherstring entries would have a string something like xyz=somenumber where xyz and somenumber would always be the same for any given AGRN-otherstring. Is this assumption correct?
Adding | as a field-separator character seems to be creating unneeded work for you. It would seem that using - as a field separator instead of | would help. Will there ever be more than one - in an input line?
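A tiny sketch of that idea, assuming the search term is always the part of $5 before the first - (an assumption based only on the samples in this thread):

```shell
# Split field 5 at the "-" to get an exact key for an array lookup,
# instead of matching a regex against the whole field.
line='chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 115'
key=$(echo "$line" | awk '{ split($5, p, "-"); print p[1] }')
echo "$key"
```

This prints AGRN, which can then be tested with `key in array`.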
Well, I obviously missed that: there are n *base_counts.txt files, and for each of them you scan the huge file four times, so 4 * n * 11,137,660 lines are read.
As has been offered before, with some meaningful samples we perhaps could give some decent help.
Please post at least two (partial) ???_unix_corrected.bed files, two (partial) *base_counts.txt files, and a representative part of the huge file so we can build a meaningful test scenario.