With the awk below, if I use the attached file as the input, I get no results for TCF4 . However, if I copy just that line from the attached file and use it as the input, I do get results for TCF4 .
Basically the gene file is a 1 column list that is used to filter $8 of the attached file. When there is a match that entire line is printed. I am not sure why the awk works on the smaller input but not the attached file, which is the real input. Thank you :).
The tab-delimited file is ~8,500 lines.
contents of gene
SCN1A
SCN2A
TCF4
TCF4 line as input
7722 chr18 53303101 53303101 C G intergenic TCF4;ST8SIA3 dist=47241;dist=1716620 . . . rs611326 1. 1. 0.99 1. 1. 1. 1. 1. 0.99 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.99 . . 1 T . B . B . . 1.000 P . . . . . . . GOOD 80 hom 23 . .
result
7722 chr18 53303101 53303101 C G intergenic TCF4;ST8SIA3 dist=47241;dist=1716620 . . . rs611326 1. 1. 0.99 1. 1. 1. 1. 1. 0.99 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.99 . . 1 T . B . B . . 1.000 P . . . . . . . GOOD 80 hom 23 . .
awk
awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} $8 in a{$1=++c; print}' gene file
I don't see why you would think that $8 ( TCF4;ST8SIA3 ) in that line in that file would be found in the array a[] when the only values you put into that array are SCN1A , SCN2A , and TCF4 .
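A minimal sketch of that point, assuming a shortened 8-column stand-in for the posted line (the temporary file names and the trimmed line are only for illustration). It runs the original command against a one-line file. The line does come out, but only because FNR==1{print} prints the first line of file unconditionally, which may be why the single-line test looked like a match. Note that $1 is still 7722 rather than being renumbered, as in the result posted above.

```shell
# Recreate the gene list and a one-line data file in a scratch directory.
tmp=$(mktemp -d)
printf 'SCN1A\nSCN2A\nTCF4\n' > "$tmp/gene"
# Hypothetical 8-column stand-in for the real line; only $1 and $8 matter here.
printf '7722\tchr18\t53303101\t53303101\tC\tG\tintergenic\tTCF4;ST8SIA3\n' > "$tmp/file"

# The command exactly as posted. "TCF4;ST8SIA3" is not a key of a[],
# so the only print that fires is the unconditional FNR==1 one.
result=$(awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} $8 in a{$1=++c; print}' "$tmp/gene" "$tmp/file")
printf '%s\n' "$result"
rm -rf "$tmp"
```

If the same line sat anywhere but line 1 of file, nothing would be printed for it, which is consistent with what happens on the full ~8,500-line input.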
What would you recommend? The awk seems to work as expected with a limited data set. Many lines are similar in that $8 holds several names separated by ; , but the name I am looking for will be one of them.
I'm not sure this is the output you need, since I only have your attempt to go on. Could you please try the following and let me know if it helps?
awk -F"\t" 'FNR==NR{A[$0];next} {split($8, B,";");P=B[1]} (P in A){$1=++c;print}' gene file
Output will be as follows.
1 chr18 53303101 53303101 C G intergenic TCF4;ST8SIA3 dist=47241;dist=1716620 . . . rs611326 1. 1. 0.99 1. 1. 1. 1. 1. 0.99 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.99 . . 1 T . B . B . . 1.000 P . . . . . . . GOOD 80 hom 23 . .
You could set the output field separator to TAB in case you need it.
NOTE: You haven't split the 8th field of the input file named file, so it can't be found in the array that is created while reading the first file.
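On the output field separator point, here is a small sketch using a made-up three-column line (the sample data is an assumption, not taken from the attached file). Assigning to $1 makes awk rebuild the record with OFS, so without setting OFS the tabs in the output turn into single spaces.

```shell
# A tiny tab-delimited line standing in for one row of the real file.
line=$(printf '7722\tchr18\tTCF4')

# Default OFS: assigning to $1 rebuilds the record with single spaces.
spaces=$(printf '%s\n' "$line" | awk -F'\t' '{$1=1; print}')

# OFS set to a tab: the rebuilt record stays tab-delimited.
tabs=$(printf '%s\n' "$line" | awk -F'\t' -v OFS='\t' '{$1=1; print}')

printf '%s\n%s\n' "$spaces" "$tabs"
```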
I apologize, I am on my cell and it's hard to post, but in gene , TCF4 is the name. In file it may appear as TCF4 or TCF4;xxx . I will try the code. Thank you :).
---------- Post updated 12-03-16 at 09:49 AM ---------- Previous update was 12-02-16 at 10:32 PM ----------
I cannot seem to adjust the awk to capture all conditions of KCNMA1 , the line in the attached gene.txt . I have also attached data.txt , which is tab-delimited.
So in the example below, both NONE;KCNMA1 and KCNMA1 would be captured in the output. The only other possibility would be KCNMA1;NONE ; it is not in the file, but it could occur.
There could also be multiple ; separators, but the name, in this case KCNMA1 , will always be one of the pieces. Thank you :).
awk
awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene.txt data.txt > out
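A quick check of what sub(/;.*/,"",x) leaves behind for the three forms mentioned, fed in as bare strings rather than as $8 of the real file (an assumption made to keep the sketch short). It deletes everything from the first ; onward, so only the first name is ever compared against the gene list.

```shell
# Show each candidate value next to what remains after the sub().
result=$(printf 'KCNMA1\nNONE;KCNMA1\nKCNMA1;NONE\n' | awk '
{ x = $0; sub(/;.*/, "", x); print $0 " -> " x }')
printf '%s\n' "$result"
```

NONE;KCNMA1 is reduced to NONE, which is not in gene.txt, so that line is skipped.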
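awk -F'\t' -v OFS='\t' '
# First file (gene.txt): store each gene name as an array key.
NR == FNR {
    a[$0]
    next
}
# Second file (data.txt): print the header line unchanged.
FNR == 1
# Split $8 on ";" and print the line if any piece is a listed gene.
{   n = split($8, x, /;/)
    for(i = 1; i <= n; i++)
        if(x[i] in a) {
            print
            next
        }
}' gene.txt data.txt > out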
which produces the output you said you wanted with those two input files (as long as we change each occurrence of four adjacent <space> characters in the output you said you wanted to a single <tab> character).
As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .