I am running the bash loop below on all files of a specific type (*base_counts.txt) in a directory. There are 4 awk commands that use the input files to search another file and look for a match. The input files range from 27 to 259 lines and are lists of names. The file that is searched is 11,137,660 lines long. The loop does run, but it takes ~20 hours to complete on a computer with 64 GB of RAM and an 8-core Xeon processor. Is this normal, and can it be made faster (more efficient)? Thank you :).
for f in /home/cmccabe/Desktop/HiQ/*base_counts.txt ; do
bname=`basename $f`
pref=${bname%%.txt}
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PCD_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PCD_coverage.
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/BMF_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_BMF_coverage.bed
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PAH_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PAH_coverage.bed
awk -f /home/cmccabe/Desktop/match.awk /home/cmccabe/Desktop/panels/PID_unix_corrected.bed $f > /home/cmccabe/Desktop/HiQ/${pref}_PID_coverage.bed
done
The contents of match.awk:
BEGIN {
FS="[ \t|]*"
}
# Read search terms from file1 into 's'
FNR==NR {
s[$0]
next
}
{
# Check if $5 matches one of the search terms
for(i in s) {
if($5 ~ i) {
# Store first two fields for later usage
a[$5]=$1
b[$5]=$2
# Add $8 to total of $8 per $5
t[$5]+=$8
# Increment count of occurrences of $5
c[$5]++
next
}
}
}
END {
# Calculate average and print output for all search terms
# that have been found
for(i in t) {
avg = t[i] / c[i]
printf "%s:%s\t%s\t%s\n", a[i], b[i], i, avg | "sort -k3,3n"
}
}
In your code, you are saving $1 in a[] and $2 in b[], and at the end you are printing them with a colon between them. In your sample data above, $4 is always the same as $1:$2. Does that same relationship hold on every line of your file? (Saving and printing $4 in one array will be faster than saving $1 in one array, saving $2 in another array, and printing both of them.) And you say above that you want the output to be $4, $5, and the average, but you show the output being $4, $5, a "|", $6, and the average??? Please clarify!
Your sample output above shows that the average of 1, 2, and 3 is 3. Why not 2 (i.e., (1+2+3)/3)? How many decimal places do you want printed in the average?
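As an aside, the number of decimal places printed is simply a matter of the printf conversion used; a one-line illustration:

```shell
# printf "%.2f" rounds to two decimal places; "%s" (as in the current
# script) falls back to awk's default OFMT formatting instead.
avg=$(awk 'BEGIN { printf "%.2f\n", (1+2+3)/3 }')
echo "$avg"
```

This prints 2.00.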
Are your search strings always exactly matched by the string starting with the 1st character of $5 and ending with the character before the <minus-sign> in $5? (Your script will run MUCH faster if you perform one test to determine whether a string is a subscript of an array instead of an average of 14-130 regular-expression matches.)
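To illustrate the difference, here is a minimal, self-contained sketch; the sample lines below, and the assumption that the search term is everything in $5 before the -, are made up from the samples shown in this thread:

```shell
# Hypothetical stand-ins for a terms file and a few data lines.
cat > /tmp/terms.txt <<'EOF'
AGRN
CFTR
EOF
cat > /tmp/data.txt <<'EOF'
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 115
chr7 117120016 117120201 chr7:117120016 CFTR-1|gc=40 1 0 7
chr2 44500 44600 chr2:44500 OTHER-1|gc=10 1 0 3
EOF
result=$(awk '
FNR==NR { s[$0]; next }        # load search terms as array subscripts
{
    split($5, p, "-")          # "AGRN-6|gc=75" -> p[1] == "AGRN"
    if (p[1] in s)             # one hash lookup, no regex loop
        count[p[1]]++
}
END { for (k in count) print k, count[k] }
' /tmp/terms.txt /tmp/data.txt | sort)
echo "$result"
```

The `(string in array)` test is a single hash lookup per line, regardless of how many search terms there are, whereas `$5 ~ i` inside a `for(i in s)` loop costs one regular-expression match per term per line.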
You're reading a 3/4 GB file four times - I don't know whether disk I/O buffering will easily cope with that. Why don't you read your four .bed files into four different (multidimensional?) arrays (259 is not too large an array element count), then do your four independent calculations on each line of the large file, and then write to the four different result files?
In that case, the example I gave above should put you on track. Split off the AGRN (or whatever part of $5 could possibly be *exactly* matched by the search pattern). This avoids searching for a pattern in a longer string, which is rather expensive.
Instead of reading 750 MB, you are reading 3 GB to operate on. With the four input files in arrays and an extended algorithm, the performance might be far better.
If we had some meaningful samples, we could work out a small test script...
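In the meantime, here is a rough sketch of that single-pass approach; the panel names, paths, field layout, and sample data below are all hypothetical stand-ins, not the real files:

```shell
# Load every panel's terms into one array keyed by panel name, then
# scan the big file ONCE, routing totals to per-panel output files.
mkdir -p /tmp/panels /tmp/out
printf 'AGRN\n' > /tmp/panels/PCD.bed
printf 'CFTR\n' > /tmp/panels/PAH.bed
cat > /tmp/big.txt <<'EOF'
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 1
chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 2
chr7 117120016 117120201 chr7:117120016 CFTR-1|gc=40 1 0 3
EOF
awk '
# Panel files first: remember each term under its panel name.
FILENAME != "/tmp/big.txt" {
    pn = FILENAME; sub(/.*\//, "", pn); sub(/\.bed$/, "", pn)
    panels[pn]; terms[pn, $0]
    next
}
# Big file: one pass, exact hash lookups only.
{
    split($5, f, "-")
    for (pn in panels)
        if ((pn, f[1]) in terms) {
            t[pn, $5] += $8; c[pn, $5]++
            loc[pn, $5] = $4      # assuming $4 == $1":"$2
        }
}
END {
    for (k in t) {
        split(k, kk, SUBSEP)
        printf "%s\t%s\t%s\n", loc[k], kk[2], t[k] / c[k] > ("/tmp/out/" kk[1] "_coverage.bed")
    }
}
' /tmp/panels/PCD.bed /tmp/panels/PAH.bed /tmp/big.txt
```

With four real panels this still costs only four hash lookups per line of the big file, and the big file is read once per *base_counts.txt file instead of four times.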
Part of what is confusing is that you have input files with the filename extension .bed that you show as having a single field such as:
AGRN
CCDC39
CCDC40
CFTR
DNAAF1
...
and you have four output files (three of which have the same .bed filename extension but a completely different format):
chr1:955543 AGRN-6|gc=75 3
and one other output file whose name ends in a bare period, with no extension at all. Why aren't the names of your output files consistent? Why aren't all files with the filename extension .bed in the same format?
And we have an unknown number of files matching the pattern /home/cmccabe/Desktop/HiQ/*base_counts.txt and no indication of what is actually matched by the asterisk. Please give us some actual sample pathnames that this pattern might match.
You have said your input files have more than 11 million lines each and have shown us the 3-line sample:
and your code accumulates totals based on the string AGRN-6 and prints results assuming that AGRN-6 and AGRN-6|gc=75 select the same set of lines from your huge input files. Please give us a few more lines (some with strings that will be selected for output from the .bed input file and some that won't), and show us the exact output you hope to get in your four output files for that sample input. (Note that this means we need to see four sample .bed input files and four corresponding output files in your sample.)
From your description I am assuming that there could be multiple AGRN-x values in the input, but that for a given AGRN-x the string following the | will be a constant. I.e., for $5 in your code having the value AGRN-6, the only value for $6 will be gc=75; but there could be an AGRN-otherstring, and all AGRN-otherstring entries would have a string something like xyz=somenumber where xyz and somenumber would always be the same for any given AGRN-otherstring. Is this assumption correct?
Adding | as a field-separator character seems to be creating unneeded work for you. It would seem that using - as a field separator instead of | would help. Will there ever be more than one - in an input line?
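A tiny sketch of that idea, assuming the search term is always the part of $5 before the first - (an assumption based only on the samples in this thread):

```shell
# Split field 5 at the "-" to get an exact key for an array lookup,
# instead of matching a regex against the whole field.
line='chr1 955543 955763 chr1:955543 AGRN-6|gc=75 1 0 115'
key=$(echo "$line" | awk '{ split($5, p, "-"); print p[1] }')
echo "$key"
```

This prints AGRN, which can then be tested with `key in array`.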
Well, I obviously missed that: there are n *base_counts.txt files, and for each of them you scan the huge file four times, so 4 * n * 11,137,660 lines are read.
As has been offered before, with some meaningful samples we perhaps could give some decent help.
Please post at least two (partial) ???_unix_corrected.bed files, two (partial) *base_counts.txt files, and a representative part of the huge file so we can build a meaningful test scenario.