Replace specific words with nothing

raj_k · December 30, 2013, 3:42am

Hi,
I have a file that contains information about genes, exons and introns, each written as a single string. I want to extract just the gene name, without any of the extra information.

intergenic_Nedd4_exon_0_F
Gapvd1_intron_24_R
Gapvd1_exon_25_R

my output file should be

intergenic_Nedd4 
Gapvd1
Gapvd1

So I want to replace

_intron*

and

_exon*

with nothing

kurumi · December 30, 2013, 3:49am

If you have Ruby:

# ruby -ne 'puts $_.sub(/_(exon|intron).*/,"")' file
intergenic_Nedd4
Gapvd1
Gapvd1

Akshay_Hegde · December 30, 2013, 3:59am

You may try Awk

$ awk 'gsub(/_(exon|intron).*/,x)' file

$ sed 's/_intron\|_exon.*//g' file

kurumi · December 30, 2013, 4:00am

that's not the correct output
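
For anyone reading later: the sed attempt above most likely goes wrong because of alternation precedence. In `_intron\|_exon.*` the `.*` belongs only to the `_exon` branch, so a line such as Gapvd1_intron_24_R loses just `_intron` and comes out as Gapvd1_24_R. A minimal sketch of a grouped version follows, assuming a sed that accepts -E (GNU and BSD sed do). The function name strip_suffix is made up for illustration.

```shell
# strip_suffix is a hypothetical helper name, not something from the thread.
# Grouping with ( ... ) makes ".*" apply to both alternatives.
strip_suffix() {
    sed -E 's/_(intron|exon).*//' "$@"
}

printf '%s\n' intergenic_Nedd4_exon_0_F Gapvd1_intron_24_R Gapvd1_exon_25_R | strip_suffix
# intergenic_Nedd4
# Gapvd1
# Gapvd1
```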

raj_k · December 30, 2013, 4:05am

I have tried

awk '{sub("_intron_\.[0-9]+\.[A-Z]", "")}1' file |awk '{sub("_exon_\.[0-9]+\.[A-Z]", "")}1'

but unfortunately it does nothing for the intergenic line
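
A likely explanation, assuming gawk: inside a string constant `\.` ends up as a plain `.`, which matches any single character. The pattern then needs one arbitrary character followed by at least one digit, so a one-digit number such as the 0 in _exon_0_F can never match, while _24_ and _25_ can. The sketch below swaps in a regex constant with literal underscores and handles both words in one call. The name strip_fixed is hypothetical.

```shell
# strip_fixed is a hypothetical helper name.
# /.../ is a regex constant, so no string-escape processing applies,
# and "_" is matched literally on both sides of the number.
strip_fixed() {
    awk '{ sub(/_(intron|exon)_[0-9]+_[A-Z]/, "") } 1' "$@"
}

printf '%s\n' intergenic_Nedd4_exon_0_F Gapvd1_intron_24_R | strip_fixed
# intergenic_Nedd4
# Gapvd1
```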

Akshay_Hegde · December 30, 2013, 4:25am

I usually use character classes; try the following:

$ cat <<test | awk 'sub(/_(intron|exon)_([[:digit:]]+)_([[:upper:]])/, x) + 1' 
intergenic_Nedd4_exon_0_F
Gapvd1_intron_24_R
Gapvd1_exon_25_R
test

intergenic_Nedd4
Gapvd1
Gapvd1
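
A note on the `+ 1` in that one-liner, as far as I can tell: sub() returns the number of replacements it made (0 or 1), and awk prints a line whenever the pattern expression is nonzero. Adding 1 keeps the expression nonzero, so lines with nothing to strip are still printed, and `x` is simply an unset variable that evaluates to the empty string. A small sketch of the difference, using [0-9] and [A-Z] in place of the character classes and a made-up line "Actb" that has no suffix:

```shell
# only_changed / all_lines are hypothetical helper names.
# "Actb" is an invented input line with nothing to strip.
only_changed() {
    awk 'sub(/_(intron|exon)_[0-9]+_[A-Z]/, x)' "$@"
}
all_lines() {
    awk 'sub(/_(intron|exon)_[0-9]+_[A-Z]/, x) + 1' "$@"
}

input=$(printf '%s\n' Gapvd1_exon_25_R Actb)

printf '%s\n' "$input" | only_changed
# Gapvd1

printf '%s\n' "$input" | all_lines
# Gapvd1
# Actb
```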

RavinderSingh13 · December 30, 2013, 4:34am

Hello,

Maybe this will be helpful.

$ awk -F"_" 'NR==1 {print $1 OFS $2} NR==2 || NR==3 {print $1}' OFS=_ file_name
 
Output will be as follows:
 
intergenic_Nedd4
Gapvd1
Gapvd1

Thanks,
R. Singh

Akshay_Hegde · December 30, 2013, 4:39am

Ravinder:
Suppose the file contains 100 lines:

intergenic_Nedd4_exon_0_F ---> is line 1

Gapvd1_intron_24_R ---> is lines 2 and 3

intergenic_Nedd4_exon_0_F ---> appears again somewhere among the remaining lines

What do you do then? Hard-coded NR values will no longer match.

raj_k · December 30, 2013, 4:47am

For entries like this

field1 field2 gene_intron_24_R ...field10 field11 gene_intron_24_

it replaces only the occurrence after field2, not the one after field11. I should have been clearer in my question.

Akshay_Hegde · December 30, 2013, 5:06am

It's not clear to me; please show your real input and expected output.

raj_k · December 30, 2013, 5:14am

input

Id1   chr9   fox12_exon_0_F    chr9    56    72        72660772    H3K4ME2_E    726  72  promoter       intergenic_Nedd4_exon_0_F

output

Id1   chr9   fox12   chr9    56    72         72660772    H3K4ME2_E    726  72  promoter        intergenic_Nedd4

RavinderSingh13 · December 30, 2013, 5:17am

Hello Akshay,

I only gave code for the lines the user posted. I will try to provide code that applies to the general case.

Thanks,
R. Singh

Akshay_Hegde · December 30, 2013, 5:20am

In that case, replace sub with gsub. Here is the code:

$ awk 'gsub(/_(intron|exon)_([[:digit:]]+)_([[:upper:]])/, x) + 1' file
Id1   chr9   fox12    chr9    56    72        72660772    H3K4ME2_E    726  72  promoter       intergenic_Nedd4
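
To spell out why this helps: sub() replaces only the first (leftmost) match in the record, while gsub() replaces every match. A short sketch on a shortened version of the row above (columns dropped for brevity, helper names made up):

```shell
# first_only / every_match are hypothetical helper names.
first_only() {
    awk '{ sub(/_(intron|exon)_[0-9]+_[A-Z]/, "") } 1' "$@"
}
every_match() {
    awk '{ gsub(/_(intron|exon)_[0-9]+_[A-Z]/, "") } 1' "$@"
}

# Shortened, illustrative row: two annotated gene names on one line.
line='Id1 chr9 fox12_exon_0_F chr9 promoter intergenic_Nedd4_exon_0_F'

printf '%s\n' "$line" | first_only
# Id1 chr9 fox12 chr9 promoter intergenic_Nedd4_exon_0_F

printf '%s\n' "$line" | every_match
# Id1 chr9 fox12 chr9 promoter intergenic_Nedd4
```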

RavinderSingh13 · December 30, 2013, 6:14am

Hello Raj/Akshay,

Here are some more solutions.

1st: When the input file uses a newline as the record separator (one entry per line):

sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' check_lines12

2nd: When the entries are separated by spaces, the code is as follows:

awk -F" " '{for(i=1;i<=NF;i++) print $i"\n"}' check_lines12_2 | sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' | tr '\n' ' '
 
OR
 
cat file_name | tr ' ' '\n' | sed 's/\(.*\)\(_exon.*\)/\1/g;s/\(.*\)\(_intron.*\)/\1/g' | tr '\n' ' '

The output of all these commands will be as follows:

Id1  chr9  fox12  chr9  56  72  72660772  H3K4ME2_E  726  72  promoter  intergenic_Nedd4

NOTE: file_name and check_lines12_2 are the input files.
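
One caveat with the `tr '\n' ' '` step in the pipelines above: it joins every record into a single output line, which is fine for one row but probably not for a file with many. A possible alternative, sketched under the assumption that fields are whitespace-separated, is to strip the suffix field by field inside awk so each input line stays on its own line. The name strip_fields is hypothetical, and note that assigning to a field makes awk rebuild the line with single spaces between fields.

```shell
# strip_fields is a hypothetical helper name.
# Loop over every field and cut everything from _intron or _exon onward.
strip_fields() {
    awk '{
        for (i = 1; i <= NF; i++)
            sub(/_(intron|exon).*/, "", $i)
        print
    }' "$@"
}

printf '%s\n' 'Id1 chr9 fox12_exon_0_F' 'Id2 chr9 Gapvd1_intron_24_R' | strip_fields
# Id1 chr9 fox12
# Id2 chr9 Gapvd1
```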

Thanks,
R. Singh