awk to ignore whitespace in field

The awk below runs and updates the desired field ($10) in file1. However, the whitespace in
nonsynonymous SNV in $11 causes that value to be split into two tab-separated fields, and my attempt to correct this does not update the field
unless the space is removed. I am not sure what I am doing wrong. Thank you :).

file1

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147	PopFreqMax	1000G_ALL	1000G_AFR	1000G_AMR	1000G_EAS	1000G_EUR	1000G_SAS	ExAC_ALL	ExAC_AFR	ExAC_AMR	ExAC_EAS	ExAC_FIN	ExAC_NFE	ExAC_OTH	ExAC_SAS	ESP6500siv2_ALL	ESP6500siv2_AA	ESP6500siv2_EA	CG46	SIFT_score	SIFT_pred	Polyphen2_HDIV_score	Polyphen2_HDIV_pred	Polyphen2_HVAR_score	Polyphen2_HVAR_pred	LRT_score	LRT_pred	MutationTaster_score	MutationTaster_pred	MutationAssessor_score	MutationAssessor_pred	dpsi_max_tissue	dpsi_zscore	CLINSIG	CLNDBN	CLNACC	CLNDSDB	CLNDSDBID	Quality	Reads	Zygosity	Score	Classification	Rank	HGMD	Sanger
11	chr2	220494118	220494118	A	C	exonic	SLC4A3	.	.	nonsynonymous SNV	SLC4A3:NM_001326559:exon4:c.470A>C:p.H157P,SLC4A3:NM_005070:exon4:c.470A>C:p.H157P,SLC4A3:NM_201574:exon4:c.470A>C:p.H157P	rs597306	1.	0.95	0.84	0.98	1.	1.	1.	0.98	0.84	0.99	1.	1.	1.	0.99	1.	0.95	0.85	1.	0.84	1.0	T	0.0	B	0.0	B	0.013	N	1	P	-1.545	N	-0.0806	-0.387	.	.	.	.	.	GOOD	78	hom	22

file2

SLC4A3 unknown

current output

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147	PopFreqMax	1000G_ALL	1000G_AFR	1000G_AMR	1000G_EAS	1000G_EUR	1000G_SAS	ExAC_ALL	ExAC_AFR	ExAC_AMR	ExAC_EAS	ExAC_FIN	ExAC_NFE	ExAC_OTH	ExAC_SAS	ESP6500siv2_ALL	ESP6500siv2_AA	ESP6500siv2_EA	CG46	SIFT_score	SIFT_pred	Polyphen2_HDIV_score	Polyphen2_HDIV_pred	Polyphen2_HVAR_score	Polyphen2_HVAR_pred	LRT_score	LRT_pred	MutationTaster_score	MutationTaster_pred	MutationAssessor_score	MutationAssessor_pred	dpsi_max_tissue	dpsi_zscore	CLINSIG	CLNDBN	CLNACC	CLNDSDB	CLNDSDBID	Quality	Reads	Zygosity	Score	Classification	Rank	HGMD	Sanger
11	chr2	220494118	220494118	A	C	exonic	SLC4A3	.	unknown	nonsynonymous	SNV	SLC4A3:NM_001326559:exon4:c.470A>C:p.H157P,SLC4A3:NM_005070:exon4:c.470A>C:p.H157P,SLC4A3:NM_201574:exon4:c.470A>C:p.H157P	rs597306	1.	0.95	0.84	0.98	1.	1.	1.	0.98	0.84	0.99	1.	1.	1.	0.99	1.	0.95	0.85	1.	0.84	1.0	T	0.0	B	0.0	B	0.013	N	1	P	-1.545	N	-0.0806	-0.387	.	.	.	.	.	GOOD	78	hom	22

desired output ($10 updated to unknown, and nonsynonymous SNV not split)

R_Index	Chr	Start	End	Ref	Alt	Func.refGene	Gene.refGene	GeneDetail.refGene	Inheritence	ExonicFunc.refGene	AAChange.refGene	avsnp147	PopFreqMax	1000G_ALL	1000G_AFR	1000G_AMR	1000G_EAS	1000G_EUR	1000G_SAS	ExAC_ALL	ExAC_AFR	ExAC_AMR	ExAC_EAS	ExAC_FIN	ExAC_NFE	ExAC_OTH	ExAC_SAS	ESP6500siv2_ALL	ESP6500siv2_AA	ESP6500siv2_EA	CG46	SIFT_score	SIFT_pred	Polyphen2_HDIV_score	Polyphen2_HDIV_pred	Polyphen2_HVAR_score	Polyphen2_HVAR_pred	LRT_score	LRT_pred	MutationTaster_score	MutationTaster_pred	MutationAssessor_score	MutationAssessor_pred	dpsi_max_tissue	dpsi_zscore	CLINSIG	CLNDBN	CLNACC	CLNDSDB	CLNDSDBID	Quality	Reads	Zygosity	Score	Classification	Rank	HGMD	Sanger
11	chr2	220494118	220494118	A	C	exonic	SLC4A3	.	unknown	nonsynonymous SNV	SLC4A3:NM_001326559:exon4:c.470A>C:p.H157P,SLC4A3:NM_005070:exon4:c.470A>C:p.H157P,SLC4A3:NM_201574:exon4:c.470A>C:p.H157P	rs597306	1.	0.95	0.84	0.98	1.	1.	1.	0.98	0.84	0.99	1.	1.	1.	0.99	1.	0.95	0.85	1.	0.84	1.0	T	0.0	B	0.0	B	0.013	N	1	P	-1.545	N	-0.0806	-0.387	.	.	.	.	.	GOOD	78	hom	22

awk

awk 'FNR==NR {a[$1]=$2; next} a[$8]{$10=a[$8]}1' OFS="\t" file2 file1 > output

To ignore the whitespace I tried:

awk -F '' 'FNR==NR {a[$1]=$2; next} a[$8]{$10=a[$8]}1' OFS="\t" file2 file1 > output

Try:

awk -F '\t' ...
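
To see why the separator matters, here is a minimal sketch on a made-up four-column row (not the real file1) that counts fields under the default separator and under a tab-only separator. The variable names line, default_nf and tab_nf are only for illustration.

```shell
# Hypothetical tab-separated row with a space inside the third field.
line=$(printf 'SLC4A3\t.\tnonsynonymous SNV\t22')

# Default FS: any run of spaces and/or tabs separates fields.
default_nf=$(printf '%s\n' "$line" | awk '{print NF}')

# -F '\t': only a tab separates fields, so the embedded space survives.
tab_nf=$(printf '%s\n' "$line" | awk -F '\t' '{print NF}')

echo "default FS: $default_nf fields; tab FS: $tab_nf fields"
```

With the default separator the row has 5 fields because nonsynonymous and SNV are counted separately; with a tab separator it has 4.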

I didn't know that -F'\t' could be used for setting the delimiter within each field as well (I thought it was set only between fields). Thank you very much :).

---------- Post updated at 02:11 PM ---------- Previous update was at 01:22 PM ----------

I spoke too soon: $10 does not update, I think because file2 is space-delimited. If I replace the space in file2 with a tab I get the desired output. If file2 is left as it is, is there a way to ignore the whitespace in $11 of file1? The space seems to be causing the issue, so maybe just removing it before processing would be best. Thank you :).

If you want different input field separators for your two input files, you need to change the value of FS when you switch input files. Try:

awk 'FNR==NR {a[$1]=$2; next} a[$8]{$10=a[$8]}1' OFS="\t" file2 FS='\t' file1 > output

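which uses the default (strings of one or more <space> and/or <tab> characters) as the field separator when reading from file2 and uses a single <tab> character as the field separator when reading from file1. A variable assignment given among the file operands takes effect only when awk reaches it, that is, after file2 has been read and before file1 is opened.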
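
As a rough check of that behaviour, the sketch below builds cut-down, hypothetical versions of the two files (four columns instead of the full header, so the gene is $1 and the field to update is $2) in a temporary directory and runs the same kind of command. It tests membership with ($1 in a) rather than a[$1], which is an assumption on my part about what is wanted: it also matches keys whose value is empty and does not create new array entries.

```shell
# Hypothetical miniature file1 (tab-separated) and file2 (space-separated).
tmpdir=$(mktemp -d)
printf 'Gene\tInheritance\tExonicFunc\tScore\n' > "$tmpdir/file1"
printf 'SLC4A3\t.\tnonsynonymous SNV\t22\n' >> "$tmpdir/file1"
printf 'SLC4A3 unknown\n' > "$tmpdir/file2"

# file2 is split on the default FS (whitespace). FS='\t' is applied only
# when awk reaches that operand, so it affects file1 alone.
result=$(awk 'FNR==NR {a[$1]=$2; next} ($1 in a) {$2=a[$1]} 1' \
    OFS='\t' "$tmpdir/file2" FS='\t' "$tmpdir/file1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The header line is printed unchanged, and the data row comes out with its second column set to unknown while nonsynonymous SNV stays together as a single field.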


Thank you very much :).