Compare and replace two columns from two files

nans · July 3, 2018, 5:10am

Hello,

I have two text tab delimited files

File 1 has 30 columns. I am pasting only first 9

Chr    Position    Ref    Alt    Score    Gene    HGVS_C   HGVS_P    Coding_Consequence    dbSNP
1    17312743    C    T    1    Gene1    -    ATP13A2:NM_001141974.2:exon27:c.3214G>A:p.A1072T;ATP13A2:NM_001141973.2:exon29:c.3501G>A:p.P1167P,ATP13A2:NM_022089.3:exon29:c.3516G>A:p.P1172P    SNV    1
1    17313343    G    A    1    Gene2    -    ATP13A2:NM_001141973.2:exon27:c.3177C>T:p.A1059A,ATP13A2:NM_001141974.2:exon26:c.3060C>T:p.A1020A,ATP13A2:NM_022089.3:exon27:c.3192C>T:p.A1064A    SNV    2
1    17313654    C    T    1    Gene3    -    ATP13A2:NM_001141973.2:exon26:c.2955G>A:p.V985V,ATP13A2:NM_001141974.2:exon25:c.2838G>A:p.V946V,ATP13A2:NM_022089.3:exon26:c.2970G>A:p.V990Vsynonymous    SNV    3
1    17314942    G    A    1    Gene4    -    ATP13A2:NM_001141973.2:exon24:c.2622C>T:p.G874G,ATP13A2:NM_001141974.2:exon23:c.2505C>T:p.G835G,ATP13A2:NM_022089.3:exon24:c.2637C>T:p.G879Gsynonymous    SNV    4
1    17319011    G    A    1    Gene5    -    ATP13A2:NM_001141973.2:exon17:c.1800C>T:p.P600P,ATP13A2:NM_001141974.2:exon17:c.1800C>T:p.P600P,ATP13A2:NM_022089.3:exon17:c.1815C>T:p.P605Psynonymous    SNV    5
1    20960230    C    T    1    Gene6    -    PINK1:NM_032409.2:exon1:c.189C>T:p.L63L    SNV    6
1    20964328    A    G    1    Gene7    NM_032409.2:exon2:c.388-7A>G    -    -    7
1    20972048    G    A    1    Gene8    NM_032409.2:exon5:c.960-5G>A;NR_046507.1:exon2:c.3981+30C>T    -    -    8
 1    43395635    C    T    1    Gene9    -    SLC2A1:NM_006516.2:exon5:c.588G>A:p.P196P    synonymous    9

File 2 has 4 columns

CHROM    POS    ID    REF    ALT    ANN
[*].FEATUREID:ANN
[*].HGVS_C    ANN
[*].HGVS_P
1    17312743    rs3170740    C    T    NM_001141974.2:c.3214G>A,NM_022089.3:c.3516G>A,NM_001141973.2:c.3501G>A,NM_001135247.1:c.-7975G>A,NM_017459.2:c.-7975G>A    p.Ala1072Thr,p.Pro1172Pro,p.Pro1167Pro,.,.
1    17313343    rs9435659    G    A    NM_022089.3:c.3192C>T,NM_001141973.2:c.3177C>T,NM_001141974.2:c.3060C>T    p.Ala1064Ala,p.Ala1059Ala,p.Ala1020Ala
1    17313654    rs761421    C    T    NM_022089.3:c.2970G>A,NM_001141973.2:c.2955G>A,NM_001141974.2:c.2838G>A    p.Val990Val,p.Val985Val,p.Val946Val
1    17314942    rs9435662    G    A    NM_022089.3:c.2637C>T,NM_001141973.2:c.2622C>T,NM_001141974.2:c.2505C>T    p.Gly879Gly,p.Gly874Gly,p.Gly835Gly
1    17319011    rs2076603    G    A    NM_022089.3:c.1815C>T,NM_001141973.2:c.1800C>T,NM_001141974.2:c.1800C>T    p.Pro605Pro,p.Pro600Pro,p.Pro600Pro
1    20960230    rs45530340    C    T    NM_032409.2:c.189C>T,NR_106732.1:n.59C>T    p.Leu63Leu,.
1    20964328    rs2298298    A    G    NM_032409.2:c.388-7A>G,NR_106732.1:n.*4047A>G,NR_046507.1:n.*4822T>C    .,.,.
1    20972048    rs3131713    G    A    NM_032409.2:c.960-5G>A,NR_046507.1:n.3981+30C>T    .,.
1    43395635    rs2229682    C    T    NM_006516.2:c.588G>A    p.Pro196Pro

I would like to

1) compare Chr:Position from File 1 with CHROM:POS from file 2 and then

2) if values match replace column 7 (HGVS_Noncoding) and 8 (HGVS_Coding) from file 1 with values from file 2, column 6 and 7 respectively.

3) The original header from file 1 remains

4) Column 4 from file 2 has many ".," of various length which needs to be removed and changed to "-" in the final file

Desired output

Chr    Position    Ref    Alt    Score    Gene    HGVS_C    HGVS_P    Coding_Consequence    dbSNP
1    17312743    C    T    1    Gene1    NM_001141974.2:c.3214G>A,NM_022089.3:c.3516G>A,NM_001141973.2:c.3501G>A,NM_001135247.1:c.-7975G>A,NM_017459.2:c.-7975G>A    p.Ala1072Thr,p.Pro1172Pro,p.Pro1167Pro,.,.    SNV    1
1    17313343    G    A    1    Gene2    NM_022089.3:c.3192C>T,NM_001141973.2:c.3177C>T,NM_001141974.2:c.3060C>T    p.Ala1064Ala,p.Ala1059Ala,p.Ala1020Ala    SNV    2
1    17313654    C    T    1    Gene3    NM_022089.3:c.2970G>A,NM_001141973.2:c.2955G>A,NM_001141974.2:c.2838G>A    p.Val990Val,p.Val985Val,p.Val946Val    SNV    3
1    17314942    G    A    1    Gene4    NM_022089.3:c.2637C>T,NM_001141973.2:c.2622C>T,NM_001141974.2:c.2505C>T    p.Gly879Gly,p.Gly874Gly,p.Gly835Gly    SNV    4
1    17319011    G    A    1    Gene5    NM_022089.3:c.1815C>T,NM_001141973.2:c.1800C>T,NM_001141974.2:c.1800C>T    p.Pro605Pro,p.Pro600Pro,p.Pro600Pro    SNV    5
1    20960230    C    T    1    Gene6    NM_032409.2:c.189C>T,NR_106732.1:n.59C>T    p.Leu63Leu    SNV    6
1    20964328    A    G    1    Gene7    NM_032409.2:c.388-7A>G,NR_106732.1:n.*4047A>G,NR_046507.1:n.*4822T>C    -    -    7
1    20972048    G    A    1    Gene8    NM_032409.2:c.960-5G>A,NR_046507.1:n.3981+30C>T    -    -    8
 1    43395635    C    T    1    Gene9    NM_006516.2:c.588G>A    p.Pro196Pro    synonymous    9

I am not able to write a command that fulfils all conditions, I am only able to replace columns using two different commands


#replace column 6 from file 1 with column 7 from file 2

 awk 'FNR==NR{a[NR]=$6; next}{$7=a[FNR]}1' FS='\t' OFS='\t' file2 file1 > file3


#replace column 7 from file 1 with column 8 from file 2

 awk 'FNR==NR{a[NR]=$7; next}{$8=a[FNR]}1' FS='\t' OFS='\t' file2 file3 > file4


#remove pattern from column 7 from file
awk '{gsub(/\.,.*/,"-");}1' file4 > final.txt #patterns not changed with this command

Suggestions are appreciated. Many thanks.

Chubler_XL · July 3, 2018, 10:42pm

Here is my attempt, I've only replace fields that start with two or more ., characters as you seem to have fields ending with these characters in your desired output:

awk '
FNR==1 {
   if (NR>1) print
   next
}
{ key = $1 FS $2 }
FNR==NR{
    hgvs_c[key]=$6
    hgvs_p[key]=$7
    next
}
key in hgvs_c {
   $7 = hgvs_c[key]
   $8 = hgvs_p[key]
}
{ gsub(/\t[\.,]{2,}[^\t]*/, "\t-") }
1' FS='\t' OFS='\t' file2 file1

nans · July 4, 2018, 6:08am

Thank you very much, this works well.