I would like to remove characters from column 7 so that from an input file looking like this:
>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD
I get this in the output file:
>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 chr8 6527777 F DD
where in column 7, "ref_chr8.fa" becomes "chr8" only.
Note: some lines of the file may have a letter instead of a number after chr, or a two-digit number between chr and the dot: e.g. "ref_chrY.fa" should become "chrY", and "ref_chr10.fa" should become "chr10".
Thanks in advance for your help!!!!
echo 'HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD' | nawk '{n=split($7, a, "[_.]"); $7=a[2]}1'
awk '{ sub(/.*_/,"",$7); sub(/\..*/,"",$7); print }' FILE
echo 'HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD' | perl -pe 's/ref_(chr\w+)\.fa/$1/'
Thanks for all your suggestions.
vgersh99 and ShawnMilo, I forgot to mention that the rest of the line is different for every line in my file.
cfajohnson, your suggestion is good, but I am losing the tab delimiters on the lines that have been modified, and I need them for the rest of my process.
applying your script:
My file looks like this originally:
>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chr8.fa 6527777 F DD
>HWI-EAS422_12:4:1:1296:114 GAGATTGATCTTAAGCCTTTGGCACAGTTAAC U0 1 0 0 ref_chr12.fa 4777762 R DD
>HWI-EAS422_12:4:1:223:1514 GAATGATGTTGTTTGCTTAGACATGATTTTGT NM 0 0 0
>HWI-EAS422_12:4:1:1150:122 GAGCTTACATTGGACTATGAAAGAGGACAATT U0 1 0 0 ref_chr16.fa 30593383 F DD
>HWI-EAS422_12:4:1:190:83 GGTTTATCAAATACTCTGAAAATAAAATGGGC R0 19 2 0
>HWI-EAS422_12:4:1:151:1463 GATCTGGGACCCTTAATTTTTGGGAATCTGTT U1 0 1 0 ref_chr17.fa 52460364 R DD 16T
>HWI-EAS422_12:4:1:567:228 GATTTAACCGAAGATGATTTCGATTTTCTGAC NM 0 0 0
>HWI-EAS422_12:4:1:954:124 GATATGTATACCAGTGGAAGACAATGGAGAAT U0 1 0 0 ref_chr10.fa 57535899 F DD
>HWI-EAS422_12:4:1:193:486 GCACAGAGAGAGACAAAGGTGCCAACCTTGCT U0 1 0 0 ref_chr22.fa 32814752 R DD
>HWI-EAS422_12:4:1:621:157 GTCGAGCTTCTGGCCATCGGCATCGGCCATGA NM 0 0 0
and it becomes
>HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 chr8 6527777 F DD
>HWI-EAS422_12:4:1:1296:114 GAGATTGATCTTAAGCCTTTGGCACAGTTAAC U0 1 0 0 chr12 4777762 R DD
>HWI-EAS422_12:4:1:223:1514 GAATGATGTTGTTTGCTTAGACATGATTTTGT NM 0 0 0
>HWI-EAS422_12:4:1:1150:122 GAGCTTACATTGGACTATGAAAGAGGACAATT U0 1 0 0 chr16 30593383 F DD
>HWI-EAS422_12:4:1:190:83 GGTTTATCAAATACTCTGAAAATAAAATGGGC R0 19 2 0
>HWI-EAS422_12:4:1:151:1463 GATCTGGGACCCTTAATTTTTGGGAATCTGTT U1 0 1 0 chr17 52460364 R DD 16T
>HWI-EAS422_12:4:1:567:228 GATTTAACCGAAGATGATTTCGATTTTCTGAC NM 0 0 0
>HWI-EAS422_12:4:1:954:124 GATATGTATACCAGTGGAAGACAATGGAGAAT U0 1 0 0 chr10 57535899 F DD
>HWI-EAS422_12:4:1:193:486 GCACAGAGAGAGACAAAGGTGCCAACCTTGCT U0 1 0 0 chr22 32814752 R DD
>HWI-EAS422_12:4:1:621:157 GTCGAGCTTCTGGCCATCGGCATCGGCCATGA NM 0 0 0
How can I resolve this issue?
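The tabs disappear because awk rebuilds the whole record with the output field separator (OFS, a single space by default) as soon as any field is assigned. Below is a minimal sketch of one way around it, assuming the file is strictly tab-delimited. The function name strip_ref is only illustrative and does not come from the thread.

```shell
# strip_ref: hypothetical helper; reads tab-delimited lines on stdin.
strip_ref() {
    awk 'BEGIN { FS = OFS = "\t" }      # split and rejoin on tabs only
         NF >= 7 {                      # skip lines with fewer than 7 fields
             sub(/^ref_/, "", $7)       # drop the leading "ref_"
             sub(/\.fa$/, "", $7)       # drop the trailing ".fa"
         }
         { print }'
}

# Sample line in the same layout as the file above (fields joined by tabs).
printf '>r1\tGGTT\tU0\t1\t0\t0\tref_chr8.fa\t6527777\tF\tDD\n' | strip_ref
```

Because both FS and OFS are set to a tab, the modified lines come back with the same delimiters they went in with, and the short NM/R0 lines pass through unchanged.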
The Perl one-liner I posted ignores the rest of the line, so it shouldn't make a difference. Did you try it? Maybe I'm misunderstanding your requirements.
Yes, I need this transformation applied to all the lines, though. I will try it.
OK, I tried again with: less INPUT_FILE | perl -pe 's/ref_(chr\w+)\.fa/$1/' > OUTPUT_FILE and it works great!
I would also like to replace the lowercase c at the beginning of chr with a capital C, so that my line looks like this:
HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 Chr8 6527777 F DD
Is that possible within the same script, or with an add-on?
Thanks a lot!
nawk -v OFS='\t' 'NF>=7{split($7, a, "[_.]"); $7=toupper(substr(a[2],1,1)) substr(a[2],2)}1' myFile
awk 'BEGIN { FS = OFS = "\t" }
NF >= 7 {   # skip short lines so no empty 7th field is appended
  sub(/.*_/,"",$7)
  sub(/\..*/,"",$7)
  $7 = toupper(substr($7,1,1)) substr($7,2)
}
{ print }' FILE
To apply it to all the lines, just cat the file and pipe it to the Perl one-liner I posted.
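#!/usr/bin/perl
use strict;
use warnings;
my $str="HWI-EAS422_12:4:1:69:89 GGTTTAAATATTGCACAAAAGGTATAGAGCGT U0 1 0 0 ref_chrY.fa 6527777 F DD";
# split on whitespace; column 7 is index 6
my @arr=split(" ",$str);
# keep only the text between the last "_" and the first "." after it
$arr[6]=~s/(?:^.*_)([^.]*)(?:\..*$)/$1/;
print join(" ",@arr), "\n";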