awk split after second underscore in field

cmccabe · November 19, 2016, 11:46am

I am trying to split a tab-delimited file using awk after the second _ in the fourth field. The awk below is close but splits on the first _, and I am not sure how to use the second _. Thank you :).

file

chr1    92145889    92149424    NM_001195684_exon_0_10_chr1_92145900_r    0    -
chr1    92161218    92161346    NM_001195684_exon_1_10_chr1_92161229_r    0    -

desired output (tab-delimited)

chr1    92145889    92149424    NM_001195684
chr1    92161218    92161346    NM_001195684

awk

awk -F'\t' -v OFS='\t' '{split($4,a,"_"); print $1,$2,$3,a[1]}' file

Don_Cragun · November 19, 2016, 4:18pm

Oh, come on. :rolleyes: You know how to do this...

awk -F'\t' -v OFS='\t' '{split($4,a,"_"); print $1,$2,$3,a[1]"_"a[2]}' file
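
For anyone who wants to check the one-liner above without creating a file, here is a minimal sketch that feeds it one sample row from the question through printf. The variable name result is only illustrative. split() breaks field 4 on every underscore, so the first two pieces have to be joined back together with a literal "_".

```shell
# One sample row from the question; printf expands \t and \n in the format string.
result=$(printf 'chr1\t92145889\t92149424\tNM_001195684_exon_0_10_chr1_92145900_r\t0\t-\n' |
  awk -F'\t' -v OFS='\t' '{
    split($4, a, "_")                  # a[1]="NM", a[2]="001195684", a[3]="exon", ...
    print $1, $2, $3, a[1] "_" a[2]    # rejoin the first two pieces
  }')
printf '%s\n' "$result"
```

Note that a row whose fourth field has no underscore at all would come out with a trailing "_", since a[2] is empty in that case.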

cmccabe · November 19, 2016, 5:38pm

I know it's an easy thing, but I couldn't figure it out. I am reading Effective AWK Programming, 4th edition, but still have a lot to learn. Thank you very much :).

RavinderSingh13 · November 19, 2016, 10:23pm

Hello cmccabe,

If the string _exon occurs only in the 4th field of your Input_file and is not repeated in any other field, then you could also try the following.

awk '{sub(/_exon.*/,X,$0);print}'  Input_file

If several different strings can start at the 2nd underscore of the 4th field, and none of them occurs in any other field, you could extend the code above with a minor change:

awk '{sub(/_exon.*|_etc.*|_another_string.*/,X,$0);print}'   Input_file

Thanks,
R. Singh
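
If the text after the second underscore is not known in advance, one possible variation (a sketch, not something posted in this thread) is to let match() find the leading "text_text" part of field 4 and keep only that. match(), RLENGTH, and substr() are standard awk; the names id and result are only illustrative.

```shell
# One sample row from the question; printf expands \t and \n in the format string.
result=$(printf 'chr1\t92161218\t92161346\tNM_001195684_exon_1_10_chr1_92161229_r\t0\t-\n' |
  awk -F'\t' -v OFS='\t' '{
    id = $4                            # fall back to the whole field if nothing matches
    # Prefix up to, but not including, the second underscore; match() sets RLENGTH.
    if (match($4, /^[^_]*_[^_]*/)) id = substr($4, 1, RLENGTH)
    print $1, $2, $3, id
  }')
printf '%s\n' "$result"
```

Because this only looks at field 4, it does not depend on _exon (or any other string) being absent from the remaining columns.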

cmccabe · November 23, 2016, 8:35am

Thank you all, I appreciate the help :).