Hello,
I have two text tab delimited files
File 1 has 30 columns. I am pasting only first 9
Chr Position Ref Alt Score Gene HGVS_C HGVS_P Coding_Consequence dbSNP
1 17312743 C T 1 Gene1 - ATP13A2:NM_001141974.2:exon27:c.3214G>A:p.A1072T;ATP13A2:NM_001141973.2:exon29:c.3501G>A:p.P1167P,ATP13A2:NM_022089.3:exon29:c.3516G>A:p.P1172P SNV 1
1 17313343 G A 1 Gene2 - ATP13A2:NM_001141973.2:exon27:c.3177C>T:p.A1059A,ATP13A2:NM_001141974.2:exon26:c.3060C>T:p.A1020A,ATP13A2:NM_022089.3:exon27:c.3192C>T:p.A1064A SNV 2
1 17313654 C T 1 Gene3 - ATP13A2:NM_001141973.2:exon26:c.2955G>A:p.V985V,ATP13A2:NM_001141974.2:exon25:c.2838G>A:p.V946V,ATP13A2:NM_022089.3:exon26:c.2970G>A:p.V990Vsynonymous SNV 3
1 17314942 G A 1 Gene4 - ATP13A2:NM_001141973.2:exon24:c.2622C>T:p.G874G,ATP13A2:NM_001141974.2:exon23:c.2505C>T:p.G835G,ATP13A2:NM_022089.3:exon24:c.2637C>T:p.G879Gsynonymous SNV 4
1 17319011 G A 1 Gene5 - ATP13A2:NM_001141973.2:exon17:c.1800C>T:p.P600P,ATP13A2:NM_001141974.2:exon17:c.1800C>T:p.P600P,ATP13A2:NM_022089.3:exon17:c.1815C>T:p.P605Psynonymous SNV 5
1 20960230 C T 1 Gene6 - PINK1:NM_032409.2:exon1:c.189C>T:p.L63L SNV 6
1 20964328 A G 1 Gene7 NM_032409.2:exon2:c.388-7A>G - - 7
1 20972048 G A 1 Gene8 NM_032409.2:exon5:c.960-5G>A;NR_046507.1:exon2:c.3981+30C>T - - 8
1 43395635 C T 1 Gene9 - SLC2A1:NM_006516.2:exon5:c.588G>A:p.P196P synonymous 9
File 2 has 4 columns
CHROM POS ID REF ALT ANN
[*].FEATUREID:ANN
[*].HGVS_C ANN
[*].HGVS_P
1 17312743 rs3170740 C T NM_001141974.2:c.3214G>A,NM_022089.3:c.3516G>A,NM_001141973.2:c.3501G>A,NM_001135247.1:c.-7975G>A,NM_017459.2:c.-7975G>A p.Ala1072Thr,p.Pro1172Pro,p.Pro1167Pro,.,.
1 17313343 rs9435659 G A NM_022089.3:c.3192C>T,NM_001141973.2:c.3177C>T,NM_001141974.2:c.3060C>T p.Ala1064Ala,p.Ala1059Ala,p.Ala1020Ala
1 17313654 rs761421 C T NM_022089.3:c.2970G>A,NM_001141973.2:c.2955G>A,NM_001141974.2:c.2838G>A p.Val990Val,p.Val985Val,p.Val946Val
1 17314942 rs9435662 G A NM_022089.3:c.2637C>T,NM_001141973.2:c.2622C>T,NM_001141974.2:c.2505C>T p.Gly879Gly,p.Gly874Gly,p.Gly835Gly
1 17319011 rs2076603 G A NM_022089.3:c.1815C>T,NM_001141973.2:c.1800C>T,NM_001141974.2:c.1800C>T p.Pro605Pro,p.Pro600Pro,p.Pro600Pro
1 20960230 rs45530340 C T NM_032409.2:c.189C>T,NR_106732.1:n.59C>T p.Leu63Leu,.
1 20964328 rs2298298 A G NM_032409.2:c.388-7A>G,NR_106732.1:n.*4047A>G,NR_046507.1:n.*4822T>C .,.,.
1 20972048 rs3131713 G A NM_032409.2:c.960-5G>A,NR_046507.1:n.3981+30C>T .,.
1 43395635 rs2229682 C T NM_006516.2:c.588G>A p.Pro196Pro
I would like to
1) compare Chr:Position from File 1 with CHROM:POS from file 2 and then
2) if values match replace column 7 (HGVS_Noncoding) and 8 (HGVS_Coding) from file 1 with values from file 2, column 6 and 7 respectively.
3) The original header from file 1 remains
4) Column 4 from file 2 has many ".," of various length which needs to be removed and changed to "-" in the final file
Desired output
Chr Position Ref Alt Score Gene HGVS_C HGVS_P Coding_Consequence dbSNP
1 17312743 C T 1 Gene1 NM_001141974.2:c.3214G>A,NM_022089.3:c.3516G>A,NM_001141973.2:c.3501G>A,NM_001135247.1:c.-7975G>A,NM_017459.2:c.-7975G>A p.Ala1072Thr,p.Pro1172Pro,p.Pro1167Pro,.,. SNV 1
1 17313343 G A 1 Gene2 NM_022089.3:c.3192C>T,NM_001141973.2:c.3177C>T,NM_001141974.2:c.3060C>T p.Ala1064Ala,p.Ala1059Ala,p.Ala1020Ala SNV 2
1 17313654 C T 1 Gene3 NM_022089.3:c.2970G>A,NM_001141973.2:c.2955G>A,NM_001141974.2:c.2838G>A p.Val990Val,p.Val985Val,p.Val946Val SNV 3
1 17314942 G A 1 Gene4 NM_022089.3:c.2637C>T,NM_001141973.2:c.2622C>T,NM_001141974.2:c.2505C>T p.Gly879Gly,p.Gly874Gly,p.Gly835Gly SNV 4
1 17319011 G A 1 Gene5 NM_022089.3:c.1815C>T,NM_001141973.2:c.1800C>T,NM_001141974.2:c.1800C>T p.Pro605Pro,p.Pro600Pro,p.Pro600Pro SNV 5
1 20960230 C T 1 Gene6 NM_032409.2:c.189C>T,NR_106732.1:n.59C>T p.Leu63Leu SNV 6
1 20964328 A G 1 Gene7 NM_032409.2:c.388-7A>G,NR_106732.1:n.*4047A>G,NR_046507.1:n.*4822T>C - - 7
1 20972048 G A 1 Gene8 NM_032409.2:c.960-5G>A,NR_046507.1:n.3981+30C>T - - 8
1 43395635 C T 1 Gene9 NM_006516.2:c.588G>A p.Pro196Pro synonymous 9
I am not able to write a command that fulfils all conditions, I am only able to replace columns using two different commands
#replace column 6 from file 1 with column 7 from file 2
awk 'FNR==NR{a[NR]=$6; next}{$7=a[FNR]}1' FS='\t' OFS='\t' file2 file1 > file3
#replace column 7 from file 1 with column 8 from file 2
awk 'FNR==NR{a[NR]=$7; next}{$8=a[FNR]}1' FS='\t' OFS='\t' file2 file3 > file4
#remove pattern from column 7 from file
awk '{gsub(/\.,.*/,"-");}1' file4 > final.txt #patterns not changed with this command
Suggestions are appreciated. Many thanks.