raj_k
December 30, 2013, 3:42am
1
Hi
I have a file that contains information about genes, exons and introns, each written as a single string. I want to extract the gene name alone, without any extra information.
intergenic_Nedd4_exon_0_F
Gapvd1_intron_24_R
Gapvd1_exon_25_R
My output file should be:
intergenic_Nedd4
Gapvd1
Gapvd1
So I want to replace _intron* and _exon* with nothing.
kurumi
December 30, 2013, 3:49am
2
If you have Ruby:
# ruby -ne 'puts $_.sub(/_(exon|intron).*/,"")' file
intergenic_Nedd4
Gapvd1
Gapvd1
You may try awk or sed:
$ awk 'gsub(/_(exon|intron).*/,x)' file
$ sed 's/_intron\|_exon.*//g' file
kurumi
December 30, 2013, 4:00am
4
that's not the correct output
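A likely reason for the objection: in the sed command above, `s/_intron\|_exon.*//g`, alternation binds more loosely than concatenation, so `.*` belongs only to the `_exon` branch. With GNU sed (where `\|` is an extension to basic regular expressions) `Gapvd1_intron_24_R` would therefore come out as `Gapvd1_24_R`. A minimal sketch of a grouped version follows, assuming a sed that accepts `-E` for extended regular expressions; the function name `strip_suffix` is made up for illustration.

```shell
# strip_suffix (hypothetical helper): read lines on stdin and delete
# everything from "_exon" or "_intron" to the end of the line.
# The parentheses make ".*" apply to both alternatives.
strip_suffix() {
    sed -E 's/_(exon|intron).*//'
}

printf '%s\n' intergenic_Nedd4_exon_0_F Gapvd1_intron_24_R Gapvd1_exon_25_R | strip_suffix
```

On the three sample lines from the question this should print the three gene names the poster asked for.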
raj_k
December 30, 2013, 4:05am
5
I have tried
awk '{sub("_intron_\.[0-9]+\.[A-Z]", "")}1' file |awk '{sub("_exon_\.[0-9]+\.[A-Z]", "")}1'
but unfortunately it does not do anything for the intergenic line.
I usually use character classes; try the following:
$ cat <<test | awk 'sub(/_(intron|exon)_([[:digit:]]+)_([[:upper:]])/, x) + 1'
intergenic_Nedd4_exon_0_F
Gapvd1_intron_24_R
Gapvd1_exon_25_R
test
intergenic_Nedd4
Gapvd1
Gapvd1
Hello,
Maybe this will be helpful.
$ awk -F"_" 'NR==1 {print $1 OFS $2} NR==2 || NR==3 {print $1}' OFS=_ file_name
Output will be as follows:
intergenic_Nedd4
Gapvd1
Gapvd1
Thanks,
R. Singh
Ravinder:
Assume the file contains 100 lines:
intergenic_Nedd4_exon_0_F ---> will be line no. 1
Gapvd1_intron_24_R ---> will be line no. 2 and 3
intergenic_Nedd4_exon_0_F ---> will be somewhere in between, within the 100
What do you do?
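One way to avoid depending on line numbers, sketched here under the assumption that every line holds exactly one name: use the `_exon_` / `_intron_` marker itself as the field separator and print whatever precedes it. The function name `gene_name` is hypothetical.

```shell
# gene_name (hypothetical helper): print the part of each line that
# comes before "_exon_" or "_intron_". The field separator is a regular
# expression, so the position of a line in the file does not matter.
# Lines without a marker are printed unchanged.
gene_name() {
    awk -F'_(exon|intron)_' '{ print $1 }'
}

# Same sample names as in the question, deliberately reordered.
printf '%s\n' Gapvd1_intron_24_R intergenic_Nedd4_exon_0_F Gapvd1_exon_25_R | gene_name
```

This only looks at the first marker on a line, so it is not suitable for input with several such names per line.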
raj_k
December 30, 2013, 4:47am
9
For entries like this:
field1 field2 gene_intron_24_R ...field10 field11 gene_intron_24_
it replaces only the occurrence after field2, not the one after field11. I should have been clearer in my question.
It's not clear to me; please show your real input and expected output.
raj_k
December 30, 2013, 5:14am
11
input
Id1 chr9 fox12_exon_0_F chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4_exon_0_F
output
Id1 chr9 fox12 chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4
Hello Akshay,
I only gave code for the lines posted by the user. I will try to give code that applies to all cases.
Thanks,
R. Singh
We should replace sub with gsub; here is the code:
$ awk 'gsub(/_(intron|exon)_([[:digit:]]+)_([[:upper:]])/, x) + 1' file
Id1 chr9 fox12 chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4
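To make the difference concrete, here is a small sketch that runs both functions on the sample line from the thread. The variable names are made up, and the pattern is the same as above with `[0-9]` and `[A-Z]` written out instead of the POSIX classes.

```shell
line='Id1 chr9 fox12_exon_0_F chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4_exon_0_F'
re='_(intron|exon)_[0-9]+_[A-Z]'

# sub() replaces only the first match in the record ...
first_only=$(printf '%s\n' "$line" | awk -v re="$re" '{ sub(re, ""); print }')
# ... while gsub() replaces every match in the record.
all=$(printf '%s\n' "$line" | awk -v re="$re" '{ gsub(re, ""); print }')

printf '%s\n%s\n' "$first_only" "$all"
```

The first printed line should still end in `intergenic_Nedd4_exon_0_F`, while the second should end in `intergenic_Nedd4`.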
Hello Raj/Akshay,
Here are some more solutions.
1st: When the input file has one name per line (newline as the record separator):
sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' check_lines12
2nd: When the names are separated by spaces, the code is as follows:
awk -F" " '{for(i=1;i<=NF;i++) print $i}' check_lines12_2 | sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' | tr '\n' ' '
OR
cat file_name | tr ' ' '\n' | sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' | tr '\n' ' '
For the space-separated input, the output will be as follows:
Id1 chr9 fox12 chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4
NOTE: file_name, check_lines12 and check_lines12_2 are the input files.
Thanks,
R. Singh
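The `tr` pipelines above join every input line into one long line. If the file has many rows and the line structure should survive, one possible alternative is to loop over the fields inside awk. This is only a sketch: the function name `strip_fields` and the second sample row (`Id2 ...`) are invented for illustration.

```shell
# strip_fields (hypothetical helper): for every whitespace-separated
# field, drop a trailing "_exon..." or "_intron..." part, then print
# the record on its own line. Assigning to a field makes awk rebuild
# the record with single spaces between fields.
strip_fields() {
    awk '{
        for (i = 1; i <= NF; i++)
            sub(/_(exon|intron).*/, "", $i)
        print
    }'
}

# First row is the sample from the thread; the second row is made up.
printf '%s\n' \
    'Id1 chr9 fox12_exon_0_F chr9 56 72 72660772 H3K4ME2_E 726 72 promoter intergenic_Nedd4_exon_0_F' \
    'Id2 chr2 Gapvd1_intron_24_R' | strip_fields
```

Because the record is rebuilt with the default output separator, tab-separated input would come out space-separated; setting `OFS` would be needed to keep tabs.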