The code below, kindly provided by @Don Cragun, works great; the lines in bold show the current output. Because some of the printed fields can be blank, the fields that follow them are shifted. I cannot seem to add a . to the blank fields as in the desired output. Basically, if there is nothing in a field then print . and otherwise print what the script matches. Thank you :).
script
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/} # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
        awk '
        BEGIN {FS = OFS = "\t"
        }
        NR == 1 {
            outfile = FILENAME
        }
        FNR == NR {
            o[i[++ic] = $1 OFS $2 OFS $3] = $0
        }
        {if($2 OFS $4 OFS $5 in o)
            o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19 }
        END {for(j = 1; j <= ic; j++)
            print o[i[j]] > outfile
        }' $file $file1
    fi
done
current output
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94 Not low found
12 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 c.[5070A>C]+[=] 20.97 Not low Not found
13 77570076 - A exonic CLN5 2762 26.6 2060;702 0;0 0;0 0;0 2050;696 0;0 c.526_527insA 99.42 TP Not low Not found
7 148106478 - GT intronic CNTNAP2 4051 28.5 0;1 0;0 0;0 2220;1829 1085;887 0;1 rs60451214 c.3716-5_3716-4insGT 48.68 Not low Not found
9 138678036 TGCCC - intronic KCNT1 834 23.1 0;0 0;0 0;31 0;1 0;0 0;802 rs141359570 c.3178-7_3178-3delTGCCC 96.16 Not low Not found
7 148106476 - TT intronic CNTNAP2 4052 28.8 0;0 5;0 0;0 2221;1826 1081;884 0;0 rs61232377 c.3716-7_3716-6insTT 48.49 Not low Not found
2 166245425 C T exonic SCN2A 49 12.6 0;0 13;9 0;0 18;9 0;0 0;0 c.[5109C>T]+[=] 55.1 Not low found
desired output
Missing in IDP but found in Reference:
CHR POS REF ALT FUNC GENE COVERAGE PHRED A[#F,#R] C[#F,#R] G[#F,#R] T[#F,#R] INS[#F,#R] DEL[#F,#R] SNP MUT FREQ SANGER REGION TVC
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 . c.[5139C>T]+[=] 52.94 . Not low found
12 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 . c.[5070A>C]+[=] 20.97 . Not low Not found
13 77570076 - A exonic CLN5 2762 26.6 2060;702 0;0 0;0 0;0 2050;696 0;0 . c.526_527insA 99.42 TP Not low Not found
7 148106478 - GT intronic CNTNAP2 4051 28.5 0;1 0;0 0;0 2220;1829 1085;887 0;1 rs60451214 c.3716-5_3716-4insGT 48.68 . Not low Not found
9 138678036 TGCCC - intronic KCNT1 834 23.1 0;0 0;0 0;31 0;1 0;0 0;802 rs141359570 c.3178-7_3178-3delTGCCC 96.16 . Not low Not found
7 148106476 - TT intronic CNTNAP2 4052 28.8 0;0 5;0 0;0 2221;1826 1081;884 0;0 rs61232377 c.3716-7_3716-6insTT 48.49 . Not low Not found
2 166245425 C T exonic SCN2A 49 12.6 0;0 13;9 0;0 18;9 0;0 0;0 . c.[5109C>T]+[=] 55.1 . Not low found
Hi Ravinder,
I don't remember which thread the code shown in post #1 was addressing so I don't have any sample input either and I haven't tested your code. Note, however, that the code:
if(!$i){$i="."}
will not only change field # i to a <period> if the field is empty, it will also change it to a period if the field contains a numeric string that evaluates to zero (e.g., 0 , 0.000 , and 0e+10 ). For cases like this, an explicit comparison with the empty string, if($i == "") $i = "." , would be safer.
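A minimal sketch of such a stricter test, assuming tab-separated input. The helper name fill_empty and the sample line are made up for illustration and are not part of the original script. Comparing the field with the empty string replaces only genuinely empty fields and leaves numeric zeros alone:

```shell
# fill_empty is a hypothetical helper: replace empty tab-separated fields with "."
fill_empty() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (n = 1; n <= NF; n++)
            if ($n == "")      # true only for an empty field, not for 0 or 0.000
                $n = "."
        print
    }'
}

# Made-up sample line: field 2 is empty, field 3 is a numeric zero.
printf 'a\t\t0\n' | fill_empty
```

With this input the empty second field becomes a period while the 0 in the third field is kept as it is.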
Sorry about that, I was leaving for a weekend trip. Anyway, here are the files:
file that is updated:
Missing in IDP but found in Reference:
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94 Not low
12 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 c.[5070A>C]+[=] 20.97 Not low
13 77570076 - A exonic CLN5 2762 26.6 2060;702 0;0 0;0 0;0 2050;696 0;0 c.526_527insA 99.42 TP Not low
7 148106478 - GT intronic CNTNAP2 4051 28.5 0;1 0;0 0;0 2220;1829 1085;887 0;1 rs60451214 c.3716-5_3716-4insGT 48.68 Not low
9 138678036 TGCCC - intronic KCNT1 834 23.1 0;0 0;0 0;31 0;1 0;0 0;802 rs141359570 c.3178-7_3178-3delTGCCC 96.16 Not low
7 148106476 - TT intronic CNTNAP2 4052 28.8 0;0 5;0 0;0 2221;1826 1081;884 0;0 rs61232377 c.3716-7_3716-6insTT 48.49 Not low
2 166245425 C T exonic SCN2A 49 12.6 0;0 13;9 0;0 18;9 0;0 0;0 c.[5109C>T]+[=] 55.1 Not low
current output:
Missing in IDP but found in Reference: has no . so fields shift when blank
CHR POS REF ALT FUNC GENE COVERAGE PHRED A[#F,#R] C[#F,#R] G[#F,#R] T[#F,#R] INS[#F,#R] DEL[#F,#R] SNP MUT FREQ SANGER REGION
TVC
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 c.[5139C>T]+[=] 52.94 Not low found
12 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 c.[5070A>C]+[=] 20.97 Not low Not found
13 77570076 - A exonic CLN5 2762 26.6 2060;702 0;0 0;0 0;0 2050;696 0;0 c.526_527insA 99.42 TP Not low Not found
7 148106478 - GT intronic CNTNAP2 4051 28.5 0;1 0;0 0;0 2220;1829 1085;887 0;1 rs60451214 c.3716-5_3716-4insGT 48.68 Not low Not found
9 138678036 TGCCC - intronic KCNT1 834 23.1 0;0 0;0 0;31 0;1 0;0 0;802 rs141359570 c.3178-7_3178-3delTGCCC 96.16 Not low Not found
7 148106476 - TT intronic CNTNAP2 4052 28.8 0;0 5;0 0;0 2221;1826 1081;884 0;0 rs61232377 c.3716-7_3716-6insTT 48.49 Not low Not found
2 166245425 C T exonic SCN2A 49 12.6 0;0 13;9 0;0 18;9 0;0 0;0 c.[5109C>T]+[=] 55.1 Not low found
desired output: tab-delimited with . if the field is blank
Missing in IDP but found in Reference:
CHR POS REF ALT FUNC GENE COVERAGE PHRED "A[#F,#R]" "C[#F,#R]" "G[#F,#R]" "T[#F,#R]" "INS[#F,#R]" "DEL[#F,#R]" SNP MUT FREQ SANGER REGION TVC
2 166848646 G A exonic SCN1A 68 13 16;20 0;0 17;15 0;0 0;0 0;0 . c.[5139C>T]+[=] 52.94 . Not low found
12 52200340 A C exonic SCN8A 4129 28.3 1560;1672 413;453 0;0 0;0 0;2 31;0 . c.[5070A>C]+[=] 20.97 . Not low Not found
13 77570076 - A exonic CLN5 2762 26.6 2060;702 0;0 0;0 0;0 2050;696 0;0 . c.526_527insA 99.42 TP Not low Not found
7 148106478 - GT intronic CNTNAP2 4051 28.5 0;1 0;0 0;0 2220;1829 1085;887 0;1 rs60451214 c.3716-5_3716-4insGT 48.68 . Not low Not found
9 138678036 TGCCC - intronic KCNT1 834 23.1 0;0 0;0 0;31 0;1 0;0 0;802 rs141359570 c.3178-7_3178-3delTGCCC 96.16 . Not low Not found
7 148106476 - TT intronic CNTNAP2 4052 28.8 0;0 5;0 0;0 2221;1826 1081;884 0;0 rs61232377 c.3716-7_3716-6insTT 48.49 . Not low Not found
2 166245425 C T exonic SCN2A 49 12.6 0;0 13;9 0;0 18;9 0;0 0;0 . c.[5109C>T]+[=] 55.1 . Not low found
I updated the code with:
{for(i=1;i<=19;i++)
{if($i == "")$i = "."}
}
but that seemed to remove the last 6 fields from the output. Thank you :).
Could you please confirm that your Input_file is TAB-delimited? The example you posted doesn't seem to have TABs as delimiters.
If I am not wrong, the code was written for a TAB-delimited Input_file only: as you can see, FS is set to TAB in the BEGIN section, so the fields cannot be read correctly if there are no TABs in the Input_file. To see why it will not work with spaces, take the example line 1 2 3 4 5 6 and read it with TAB as the field separator.
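A minimal sketch of that comparison (show_fields is a hypothetical helper name, not part of the original script). It prints every field of a line as field number, arrow, field value, with FS set to a TAB as in the script:

```shell
# show_fields is a hypothetical helper: print each field as "number --> value".
show_fields() {
    awk 'BEGIN { FS = "\t" }
    { for (n = 1; n <= NF; n++) print n " --> " $n }'
}

# TAB-separated line: awk finds six separate fields.
printf '1\t2\t3\t4\t5\t6\n' | show_fields

# Space-separated line: with FS set to TAB the whole line is one single field.
printf '1 2 3 4 5 6\n' | show_fields
```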
In the output, the number to the left of --> is the field number and the text after the arrow is that field's value. One thing you could try, if none of the field values in your Input_file contain spaces, is to replace each run of spaces with a TAB and then run the code above. As a hint, awk's gsub(/ +/,"\t",$0) function can do this.
Kindly try it and let us know how it goes.
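The gsub hint could be sketched like this, on a made-up input line (spaces_to_tabs is a hypothetical helper name). Because gsub changes $0, awk splits the record again using the TAB field separator:

```shell
# spaces_to_tabs is a hypothetical helper: turn each run of spaces into one TAB.
spaces_to_tabs() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        gsub(/ +/, "\t")   # modifies $0, which makes awk re-split the fields on FS
        print NF, $0       # show the resulting field count in front of the line
    }'
}

printf '1 2  3 4 5 6\n' | spaces_to_tabs
```

Note that this only helps when no field value contains a space; values such as "Not low" and "Not found" in the posted data would each be split into two fields.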
Here is the output I get using the F113.txt attached. Thank you :).
for file in /home/cmccabe/Desktop/concordance/comparison/update/*.txt ; do
    file1=${file##*/} # Strip off directory
    getprefix=${file1%%_*.txt}
    file1=$(printf '%s\n' "/home/cmccabe/Desktop/concordance/reference/files/${file1%%_*.txt}_"*.txt) # look for matching file
    if [[ -f "$file1" ]]
    then
        awk '
        BEGIN {FS = OFS = "\t"
        }
        NR == 1 {
            outfile = FILENAME
        }
        FNR == NR {
            o[i[++ic] = $1 OFS $2 OFS $3] = $0
        }
        {for(i=1;i<=19;i++)
            {if($i == "")$i = "."}
        }
        {if($2 OFS $4 OFS $5 in o)
            o[$2 OFS $4 OFS $5] = $1 OFS $2 OFS $4 OFS $5 OFS $6 OFS $7 OFS $8 OFS $9 OFS $10 OFS $11 OFS $12 OFS $13 OFS $14 OFS $15 OFS $16 OFS $17 OFS $18 OFS $19 }
        END {for(j = 1; j <= ic; j++)
            print o[i[j]] > outfile
        }' $file $file1
    fi
done
awk: cmd. line:10: (FILENAME=/home/cmccabe/Desktop/concordance/comparison/update/F113.txt FNR=1) fatal: attempt to use array `i' in a scalar context
output
Missing in IDP but found in Reference:
CHR POS REF ALT FUNC GENE COVERAGE PHRED A[#F,#R] C[#F,#R] G[#F,#R] T[#F,#R] INS[#F,#R] DEL[#F,#R] SNP MUT FREQ SANGER REGION
74992800 A G Not low Not found
100794363 C T Not low Not found
189931518 A - Not low Not found
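There are 2 points here.
1st: You are getting the error: attempt to use array `i' in a scalar context
This is because the script already uses i as an array (in i[++ic] and later in the line print o[i[j]] > outfile ), and the loop you added, for(i=1;i<=19;i++) , then tries to use the same name as an ordinary variable. Using a different name for the loop counter avoids the conflict.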
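One way around the first point, sketched on made-up two-line input: give the loop counter a name of its own (n here) so that i is only ever used as an array. fill_and_store is a hypothetical helper name, and keying the array on $1 alone is a simplification of the $1 OFS $2 OFS $3 key used in the script.

```shell
# fill_and_store is a hypothetical, cut-down version of the script logic.
fill_and_store() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        for (n = 1; n <= NF; n++)     # n, not i, so the array i is left alone
            if ($n == "") $n = "."
        o[i[++ic] = $1] = $0          # i is used only as an array
    }
    END { for (j = 1; j <= ic; j++) print o[i[j]] }'
}

# Made-up input: line 1 has an empty 2nd field, line 2 an empty 3rd field.
printf 'k1\t\tx\nk2\ty\t\n' | fill_and_store
```

Each line is printed in its original order with its empty field replaced by a period, and awk no longer reports the array/scalar error.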
2nd: As I explained in my previous post, if your Input_file is not TAB-delimited and has only spaces as delimiters, it is quite difficult to find out which fields are missing in a line/record. With the default field separator, awk treats any run of spaces as a single separator, so we can tell whether a line/record has more or fewer fields than expected but not which fields are missing, unless there is a rule such as the 1st field always being a string, the 2nd field always being a number, and so on.
Just an observation: the reason you gave for inserting a dot (".") instead of a blank field was to keep the output from shifting when a field is empty. Wouldn't it be easier in this case to use printf with fixed field widths instead of print?
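A minimal sketch of that idea, assuming TAB-delimited input so that empty fields can still be detected. The helper name print_aligned, the four-field sample line, and the column widths are all illustrative guesses, not taken from the original data:

```shell
# print_aligned is a hypothetical helper: pad each field to a fixed width.
print_aligned() {
    awk 'BEGIN { FS = "\t" }
    { printf "%-4s%-12s%-8s%s\n", $1, $2, $3, $4 }'
}

# Made-up sample line with an empty 3rd field.
printf '2\t166848646\t\tfound\n' | print_aligned
```

The empty third field is padded with spaces, so the last column starts at the same position whether or not the field has a value. In real use each width would need to be at least as large as the longest value in that column.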