awk to filter file using another working on smaller subset

cmccabe · December 2, 2016, 5:27pm

In the below awk if I use the attached file as the input, I get no results for TCF4 . However, if I just copy that line from the attached file and use that as input I get results for TCF4 .

Basically the gene file is a 1 column list that is used to filter $8 of the attached file. When there is a match that entire line is printed. I am not sure why the awk works on the smaller input but not the attached file, which is the real input. Thank you :).

The tab-delimited file is ~8,500 lines.

contents of gene

SCN1A
SCN2A
TCF4

TCF4 line as input

7722    chr18    53303101    53303101    C    G    intergenic    TCF4;ST8SIA3    dist=47241;dist=1716620    .    .    .    rs611326    1.    1.    0.99    1.    1.    1.    1.    1.    0.99    1.    1.    1.    1.    1.    1.    1.    1.    1.    0.99    .    .    1    T    .    B    .    B    .    .    1.000    P    .    .    .    .    .    .    .    GOOD    80    hom    23    .    .

result

7722    chr18    53303101    53303101    C    G    intergenic    TCF4;ST8SIA3    dist=47241;dist=1716620    .    .    .    rs611326    1.    1.    0.99    1.    1.    1.    1.    1.    0.99    1.    1.    1.    1.    1.    1.    1.    1.    1.    0.99    .    .    1    T    .    B    .    B    .    .    1.000    P    .    .    .    .    .    .    .    GOOD    80    hom    23    .    .

awk

awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} $8 in a{$1=++c; print}' gene file

Don_Cragun · December 2, 2016, 9:32pm

What "below awk "???

cmccabe · December 2, 2016, 9:50pm

Sorry, I added it to the post. Thank you :).

Don_Cragun · December 2, 2016, 10:31pm

I don't see why you would think that $8 ( TCF4;ST8SIA3 ) in that line in that file would be found in the array a[] when the only values you put into that array are SCN1A , SCN2A , and TCF4 .

cmccabe · December 2, 2016, 10:51pm

What would you recommend? The awk seems to work as expected with a limited data set. There are many lines that are similar in that they have ; separating but the name will be in there.

Thank you

Don_Cragun · December 2, 2016, 11:12pm

"the name will be in there" is quite vague, but you might try (totally untested):

awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene file

RavinderSingh13 · December 2, 2016, 11:21pm

Hello cmccabe,

Not sure if this is the required output you need by seeing your try only I have made it, could you please try following and let me know if this helps you.

awk -F"\t" 'FNR==NR{A[$0];next} {split($8, B,";");P=B[1]} (P in A){$1=++c;print}' gene file

Output will be as follows.

1 chr18 53303101 53303101 C G intergenic TCF4;ST8SIA3 dist=47241;dist=1716620 . . . rs611326 1. 1. 0.99 1. 1. 1. 1. 1. 0.99 1. 1. 1. 1. 1. 1. 1. 1. 1. 0.99 . . 1 T . B . B . . 1.000 P . . . . . . . GOOD 80 hom 23 . .

You could set output field seprator as TAB in case you need it.

NOTE: You haven't splited 8th field in Input_file named file so only it can't find it in the array which is being created during first file reading.

Thanks,
R. Singh

cmccabe · December 3, 2016, 10:49am

I apologize, I am on my cell and its hard to post but in gene TCF4 is the name. In file it may exist or be in there as TCF4 or TCF4;xxx . I will try the code. Thank you :).

---------- Post updated 12-03-16 at 09:49 AM ---------- Previous update was 12-02-16 at 10:32 PM ----------

Thank you both, they both work great :).

cmccabe · December 6, 2016, 3:27pm

I can not seem to adjust the awk] to capture all conditions of KCNMA1 , the line in gene.txt attached. I have also attached data.txt , which is tab-delimeted

So in the below example both NONE;KCNMA1 and KCNMA1 would be captured in the output. The only other possibility would be KCNMA1;NONE , though that is not in the file it is a possibility.

There could also be multiple ; , however the name, in this case KCNMA1 will be included. Thank you :).

awk

awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene.txt data.txt  > out

desired out

R_Index    Chr    Start    End    Ref    Alt    Func.IDP.refGene    Gene.IDP.refGene    GeneDetail.IDP.refGene    Inheritence    ExonicFunc.IDP.refGene    AAChange.IDP.refGene    avsnp147    PopFreqMax    1000G_ALL    1000G_AFR    1000G_AMR    1000G_EAS    1000G_EUR    1000G_SAS    ExAC_ALL    ExAC_AFR    ExAC_AMR    ExAC_EAS    ExAC_FIN    ExAC_NFE    ExAC_OTH    ExAC_SAS    ESP6500siv2_ALL    ESP6500siv2_AA    ESP6500siv2_EA    CG46    dpsi_max_tissue    dpsi_zscore    SIFT_score    SIFT_pred    Polyphen2_HDIV_score    Polyphen2_HDIV_pred    Polyphen2_HVAR_score    Polyphen2_HVAR_pred    LRT_score    LRT_pred    MutationTaster_score    MutationTaster_pred    MutationAssessor_score    MutationAssessor_pred    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    Quality    Reads    Zygosity    Phred    Classification    HGMD    Sanger
4629    chr10    78944590    78944590    G    A    intergenic    NONE;KCNMA1    dist=NONE;dist=451371    .    .    .    rs1131824    0.7    0.41    0.7    0.27    0.25    0.34    0.33    0.36    0.64    0.19    0.27    0.38    0.37    0.35    0.32    0.45    0.62    0.36    0.47    -1.6276    -1.768    .    .    .    .    .    .    .    .    .    P    .    .    other    not_specified    RCV000117331.6    MedGen    CN169374    GOOD    117    het    6    .    .    .
4630    chr10    79396463    79396463    C    T    intronic    KCNMA1    .    .    .    .    rs12217221    0.21    0.14    0.02    0.16    0.18    0.21    0.17    .    .    .    .    .    .    .    .    .    .    .    0.14    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    .    GOOD    160    hom    43    .    .    .

Maybe:

awk -F'\t' -v OFS='\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",""/;.*,x)} x in a{$1=++c; print}' gene.txt data.txt  > out

Don_Cragun · December 6, 2016, 10:06pm

You might want to try something more like:

awk -F'\t' -v OFS='\t' '
NR == FNR {
	a[$0]
	next
}
FNR == 1
{	n = split($8, x, /;/)
	for(i = 1; i <= n; i++)
		if(x in a) {
			print
			next
		}
}' gene.txt data.txt > out

which produces the output you said you wanted with those two input files (as long as we change each occurrence of four adjacent <space> characters in the output you said you wanted to a single <tab> character).

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .