Delete part of a character string in one given column of a tab-delimited file

matlavmac · March 2, 2009, 3:39pm

I would like to remove characters from column 7 so that from an input file looking like this:

>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD

I get something like that in an output file:

>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 chr8 6527777 F DD

where in column 7, "ref_chr8.fa" becomes "chr8" only.

Note: some lines of the file may have a letter instead of a number after chr, or two digits between chr and the dot: e.g. "ref_chrY.fa" should become "chrY", and "ref_chr10.fa" should become "chr10"

Thanks in advance for your help!!!!

vgersh99 · March 2, 2009, 4:10pm

echo 'HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD' | nawk '{n=split($7, a, "[_.]"); $7=a[2]}1'

cfajohnson · March 2, 2009, 4:14pm

awk '{ sub(/.*_/,"",$7); sub(/\..*/,"",$7); print }' FILE

ShawnMilo · March 2, 2009, 4:40pm

 echo 'HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD' | perl -pe 's/ref_(chr\w+)\.fa/$1/'

matlavmac · March 2, 2009, 4:50pm

Thanks for all your suggestions.
vgersh99 and ShawnMilo, I forgot to mention that the rest of the line is different for every line in my file.

cfajohnson, your suggestion is good, but I am losing the tab delimiters on the lines that have been modified, and I need them for the rest of my process.
Applying your script:

My file looks like this originally:
>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD
>HWI-EAS422_12:4:1:1296:114 GAGATTGATCTTAAGCCTTTGGCACAGTTAAC U0 1 0 0 ref_chr12.fa 4777762 R DD
>HWI-EAS422_12:4:1:223:1514 GAATGATGTTGTTTGCTTAGACATGATTTTGT NM 0 0 0
>HWI-EAS422_12:4:1:1150:122 GAGCTTACATTGGACTATGAAAGAGGACAATT U0 1 0 0 ref_chr16.fa 30593383 F DD
>HWI-EAS422_12:4:1:190:83 GGTTTATCAAATACTCTGAAAATAAAATGGGC R0 19 2 0
>HWI-EAS422_12:4:1:151:1463 GATCTGGGACCCTTAATTTTTGGGAATCTGTT U1 0 1 0 ref_chr17.fa 52460364 R DD 16T
>HWI-EAS422_12:4:1:567:228 GATTTAACCGAAGATGATTTCGATTTTCTGAC NM 0 0 0
>HWI-EAS422_12:4:1:954:124 GATATGTATACCAGTGGAAGACAATGGAGAAT U0 1 0 0 ref_chr10.fa 57535899 F DD
>HWI-EAS422_12:4:1:193:486 GCACAGAGAGAGACAAAGGTGCCAACCTTGCT U0 1 0 0 ref_chr22.fa 32814752 R DD
>HWI-EAS422_12:4:1:621:157 GTCGAGCTTCTGGCCATCGGCATCGGCCATGA NM 0 0 0

and it becomes

>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 chr8 6527777 F DD
>HWI-EAS422_12:4:1:1296:114 GAGATTGATCTTAAGCCTTTGGCACAGTTAAC U0 1 0 0 chr12 4777762 R DD
>HWI-EAS422_12:4:1:223:1514 GAATGATGTTGTTTGCTTAGACATGATTTTGT NM 0 0 0
>HWI-EAS422_12:4:1:1150:122 GAGCTTACATTGGACTATGAAAGAGGACAATT U0 1 0 0 chr16 30593383 F DD
>HWI-EAS422_12:4:1:190:83 GGTTTATCAAATACTCTGAAAATAAAATGGGC R0 19 2 0
>HWI-EAS422_12:4:1:151:1463 GATCTGGGACCCTTAATTTTTGGGAATCTGTT U1 0 1 0 chr17 52460364 R DD 16T
>HWI-EAS422_12:4:1:567:228 GATTTAACCGAAGATGATTTCGATTTTCTGAC NM 0 0 0
>HWI-EAS422_12:4:1:954:124 GATATGTATACCAGTGGAAGACAATGGAGAAT U0 1 0 0 chr10 57535899 F DD
>HWI-EAS422_12:4:1:193:486 GCACAGAGAGAGACAAAGGTGCCAACCTTGCT U0 1 0 0 chr22 32814752 R DD
>HWI-EAS422_12:4:1:621:157 GTCGAGCTTCTGGCCATCGGCATCGGCCATGA NM 0 0 0

How can I resolve this issue?

ShawnMilo · March 2, 2009, 4:53pm

The Perl one-liner I posted ignores the rest of the line, so it shouldn't make a difference. Did you try it? Maybe I'm misunderstanding your requirements.

matlavmac · March 2, 2009, 4:57pm

Yes, I need this transformation applied to every line, though. I will try it.

matlavmac · March 2, 2009, 5:09pm

OK, I tried again with: less INPUT_FILE | perl -pe 's/ref_(chr\w+)\.fa/$1/' > OUTPUT_FILE and it works great!!!

I would also like to replace the lowercase c at the beginning of chr with a capital C, so that my line looks like this:

HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 Chr8 6527777 F DD

is that possible within the same script? or with an add-on?

thanks a lot!!!

cfajohnson · March 2, 2009, 5:12pm

Add:

BEGIN { OFS = "\t" }
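For context: awk rebuilds the whole line with OFS (a single space by default) whenever a field is modified, which is why the tabs disappeared only on the lines that were changed. Below is a minimal sketch of the complete command with the separator set, assuming the input really is tab-separated. The function name strip_ref and the shortened sample line are made up for illustration.

```shell
# Hypothetical helper: strip "ref_" and ".fa" from column 7 and keep the tabs.
# FS and OFS are both set to a tab so fields are split and rejoined the same way.
strip_ref() {
  awk 'BEGIN { FS = OFS = "\t" }
       { sub(/.*_/, "", $7); sub(/\..*/, "", $7); print }' "$@"
}

# Shortened sample line, built with printf so the separators are real tabs.
printf '>HWI-EAS422_12:4:1:69:89\tGGTT\tU0\t1\t0\t0\tref_chr8.fa\t6527777\tF\tDD\n' | strip_ref
```

With a file, the call would be something like strip_ref INPUT_FILE > OUTPUT_FILE (placeholder names).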

matlavmac · March 2, 2009, 5:53pm

I would also like to replace the lowercase c at the beginning of chr with a capital C, so that my line looks like this:

HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 Chr8 6527777 F DD

is that possible within the same script? or with an add-on?

thanks a lot!!!

vgersh99 · March 2, 2009, 6:00pm

nawk -v OFS='\t' '{n=split($7, a, "[_.]"); $7=toupper(substr(a[2],1,1)) substr(a[2],2)}1' myFile

cfajohnson · March 2, 2009, 6:02pm

awk 'BEGIN { FS = OFS = "\t" }
{ sub(/.*_/,"",$7)
  sub(/\..*/,"",$7)
  $7 = toupper(substr($7,1,1))substr($7,2)
  print }' FILE
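One caveat, offered as a sketch and not as a correction of the script above: lines with fewer than seven fields (the NM and R0 lines in the sample) have no column 7, and assigning to $7 makes awk pad them with a trailing tab. Guarding the edit with NF >= 7 should leave those lines untouched. The function name fix_chr and the shortened sample lines are made up for illustration.

```shell
# Hypothetical helper: edit column 7 only on lines that actually have one.
fix_chr() {
  awk 'BEGIN { FS = OFS = "\t" }
       NF >= 7 {
         sub(/.*_/, "", $7)                             # drop "ref_"
         sub(/\..*/, "", $7)                            # drop ".fa"
         $7 = toupper(substr($7, 1, 1)) substr($7, 2)   # chr12 -> Chr12
       }
       { print }' "$@"
}

# One mapped read (10 fields) and one unmapped read (6 fields).
printf 'r1\tSEQ\tU0\t1\t0\t0\tref_chr12.fa\t4777762\tR\tDD\nr2\tSEQ\tNM\t0\t0\t0\n' | fix_chr
```

The short line is printed exactly as it came in, because it is never reassembled from its fields.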

ShawnMilo · March 2, 2009, 6:31pm

To apply it to all the lines, just cat the file and pipe it to the Perl one-liner I posted.

summer_cherry · March 2, 2009, 10:02pm

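#!/usr/bin/perl
use strict;
use warnings;
my $str="HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chrY.fa 6527777 F DD";
# split(" ") splits on any run of whitespace, so tab-separated input works too
my @arr=split(" ",$str);
# column 7 is index 6: drop everything up to the underscore and from the dot onward
$arr[6]=~s/(?:^.*_)([^.]*)(?:\..*$)/$1/;
# rejoin with tabs so the output stays tab-delimited
print join("\t",@arr), "\n";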