awk to remove field and match strings to add text

cmccabe · May 28, 2016, 12:08pm

In file1 field $18 is removed.... column header is "Otherinfo", then each line in file1 is used to search file2 for a match. When a match is found the last four strings in file2 are copied to file1 .

Maybe:
cut -f1-17 file1 and then match each line to file2

file1

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common    Otherinfo
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .    1    3825.28    624    chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads
chr1    977330    977330    T    C    intronic    AGRN    .    .    .    1.    Benign    not_specified    RCV000116254.1    MedGen    CN169374    Common    1    1555.72    163    chr1    977330    .    T    C    1555.72    PASS    AF=1;AO=163;DP=163;FAO=163;FDP=163;FR=.;FRO=0;FSAF=65;FSAR=98;FSRF=0;FSRR=0;FWDB=0.0181869;FXX=0;HRUN=1;LEN=1;MLLD=117.006;OALT=C;OID=.;OMAPALT=C;OPOS=977330;OREF=T;PB=0.5;PBP=1;QD=38.1771;RBI=0.0262515;REFB=0;REVB=-0.0189308;RO=0;SAF=65;SAR=98;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=6.62803e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-1.32905e-06;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:74:163:163:0:0:163:163:1:98:65:0:0:98:65:0:0 GOOD 163 reads
chr1    981931    981931    A    G    exonic    AGRN    .    synonymous SNV    AGRN:NM_198576:exon18:c.3066A>G:p.S1022S    0.96    Benign    not_specified    RCV000116259.1    MedGen    CN169374    Common    1    915.076    96    chr1    981931    .    A    G    915.076    PASS    AF=1;AO=96;DP=96;FAO=96;FDP=96;FR=.;FRO=0;FSAF=34;FSAR=62;FSRF=0;FSRR=0;FWDB=0.0308666;FXX=0;HRUN=1;LEN=1;MLLD=62.5559;OALT=G;OID=.;OMAPALT=G;OPOS=981931;OREF=A;PB=0.5;PBP=1;QD=38.1282;RBI=0.0605651;REFB=0;REVB=-0.0521094;RO=0;SAF=34;SAR=62;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-4.95992e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000155929;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:43:96:96:0:0:96:96:1:62:34:0:0:62:34:0:0 GOOD 96 reads

file2

##.....
##.....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NS12911_BC1
chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads hom
chr1    977330    .    T    C    1555.72    PASS    AF=1;AO=163;DP=163;FAO=163;FDP=163;FR=.;FRO=0;FSAF=65;FSAR=98;FSRF=0;FSRR=0;FWDB=0.0181869;FXX=0;HRUN=1;LEN=1;MLLD=117.006;OALT=C;OID=.;OMAPALT=C;OPOS=977330;OREF=T;PB=0.5;PBP=1;QD=38.1771;RBI=0.0262515;REFB=0;REVB=-0.0189308;RO=0;SAF=65;SAR=98;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=6.62803e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-1.32905e-06;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:74:163:163:0:0:163:163:1:98:65:0:0:98:65:0:0 GOOD 163 reads hom
chr1    981931    .    A    G    915.076    PASS    AF=1;AO=96;DP=96;FAO=96;FDP=96;FR=.;FRO=0;FSAF=34;FSAR=62;FSRF=0;FSRR=0;FWDB=0.0308666;FXX=0;HRUN=1;LEN=1;MLLD=62.5559;OALT=G;OID=.;OMAPALT=G;OPOS=981931;OREF=A;PB=0.5;PBP=1;QD=38.1282;RBI=0.0605651;REFB=0;REVB=-0.0521094;RO=0;SAF=34;SAR=62;SRF=0;SRR=0;SSEN=0;SSEP=0;SSSB=-4.95992e-08;STB=0.5;STBP=1;TYPE=snp;VARB=-0.000155929;ANN=AGRN    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:43:96:96:0:0:96:96:1:62:34:0:0:62:34:0:0 GOOD 96 reads hom

desired output

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common    Otherinfo
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .     GOOD 399 reads hom
chr1    977330    977330    T    C    intronic    AGRN    .    .    .    1.    Benign    not_specified    RCV000116254.1    MedGen    CN169374    Common    GOOD 163 reads hom
chr1    981931    981931    A    G    exonic    AGRN    .    synonymous SNV    AGRN:NM_198576:exon18:c.3066A>G:p.S1022S    0.96    Benign    not_specified    RCV000116259.1    MedGen    CN169374    Common    GOOD 96 reads hom

Don_Cragun · May 28, 2016, 4:19pm

While it is true that there are 18 fields in the heading line in file1, there are a lot more fields on subsequent lines in that file. The cut command you suggested would copy the input file to its output and add 16 trailing <tab> characters to each input line. Using awk to delete field $18 (assuming we set the awk field delimiter to four <space> characters would seem to be a request to delete the header line completely ( $otherinfo expanding to $0 unless otherinfo is defined to be some numeric value) and to remove the first field on your other sample lines (since field 18 on the other lines is in file1 is 1 ).

And you haven't said how you "match" 17 fields from file1 (after removing the last 13 fields) with lines from file2 containing 10 fields. Do you match lines from file1 (after skipping line1) with lines from file2 (after skipping lines that start with a "#") sequentially? Do you match lines with identical field 2 values in both files? Or, is there some other criteria used to determine a match?

From the sample input and output you provided, I would guess that you are trying to:

Keep the header from file1 unchanged.
Keep the 1st 17 fields from other lines in file1 unchanged; delete all other fields at the end of those lines; and if and only if there is a line in file2 with a field 2 that matches field 2 in this line, add the last four single <space> separated fields from the matched line in file2 .

Is that an accurate statement of your requirements?

cmccabe · May 28, 2016, 5:03pm

Yes, what you have stated above is correct. Thank you :).

Don_Cragun · May 28, 2016, 5:41pm

The following should work as long as there aren't any spaces in fields 1 and 2 in file2 :

awk '
BEGIN {	OFS = "    "
}
FNR == NR {
	if($0 !~ /^#/ )
		d[$2] = $(NF-3) " " $(NF-2) " " $(NF-1) " " $NF
	next
}
FNR > 1 {
	NF = 17
	$18 = d[$2]
}
1' file2 FS='    ' file1

If there could be spaces in those two fields, extracting the desired strings from file2 will be a little more complicated.

As always, if you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk .

cmccabe · May 28, 2016, 7:58pm

The awk is close but hopefully the below explanation helps. Both file1 and file2 are tab-delineated, since $1 and $2 of file1 match $1 and $2 in file2 , the last 4 strings in file2 are copied to the end of the matching file1 line. In thiscase the last 4 strings are GOOD 399 reads hom and are in bold in file2 . Thank you very much :).

file1

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene    GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene    PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID    common
chr1    949654    949654    A    G    exonic    ISG15    .    synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96    .    .    .    .    .    .

file2

##.....
##.....
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NS12911_BC1
chr1    949654    .    A    G    3825.28    PASS    AF=1;AO=621;DP=624;FAO=399;FDP=399;FR=.;FRO=0;FSAF=225;FSAR=174;FSRF=0;FSRR=0;FWDB=0.00425236;FXX=0.00249994;HRUN=1;LEN=1;MLLD=97.922;OALT=G;OID=.;OMAPALT=G;OPOS=949654;OREF=A;PB=0.5;PBP=1;QD=38.3487;RBI=0.0367904;REFB=0.0353003;REVB=-0.0365438;RO=2;SAF=335;SAR=286;SRF=0;SRR=2;SSEN=0;SSEP=0;SSSB=0.00332809;STB=0.5;STBP=1;TYPE=snp;VARB=-3.42335e-05;ANN=ISG15    GT:GQ:DP:FDP:RO:FRO:AO:FAO:AF:SAR:SAF:SRF:SRR:FSAR:FSAF:FSRF:FSRR    1/1:171:624:399:2:0:621:399:1:286:335:0:2:174:225:0:0 GOOD 399 reads hom

desired output

Chr    Start    End    Ref    Alt    Func.refGene    Gene.refGene     GeneDetail.refGene    ExonicFunc.refGene    AAChange.refGene     PopFreqMax    CLINSIG    CLNDBN    CLNACC    CLNDSDB    CLNDSDBID     common
chr1    949654    949654    A    G    exonic    ISG15    .     synonymous SNV    ISG15:NM_005101:exon2:c.294A>G:p.V98V    0.96     .    .    .    .    .    . GOOD 399 reads hom

Don_Cragun · May 28, 2016, 9:06pm

You say fields are <tab> delimited, but there are absolutely no <tab> characters in any of your sample inputs and outputs. In your sample inputs and outputs, the field delimiter is four adjacent <space> characters and some fields in addition contain some single space characters. Using the exact contents of what you provided in post #1 in this thread for file1 and file2 contents, the script I suggested produced the exact output that you said you wanted for the 1st, 2nd, and 4th output lines. The output my script put out exactly four <space> characters before the four single-<space> separated strings added to the end of the 3rd line where the output you said you wanted had five spaces instead of four at that location.

If your real data is <tab> separated instead of 4 adjacent <space> character separated as in the data you showed us, change the following two lines in the script I suggested:

BEGIN {	OFS = "    "
	and
1' file2 FS='    ' file1

to:

BEGIN {	OFS = "\t"
	and
1' file2 FS='\t' file1

respectively.
Note that this will put out a <tab> (not a <space>) before the four <space> separated strings are used to replace the "OtherInfo" data. And, as requested in post #1 in this thread, the "OtherInfo" heading field will be retained; not discarded.

If this is not what you want, please post sample inputs and outputs that match the description of the real data you are processing.

cmccabe · May 31, 2016, 8:57am

That fixed it... thank you very much :).