Removing a character at specific position in a column

Syeda_Sumayya · October 13, 2015, 12:55am

Hi,

I have a file like this (about 8 columns in total, this being the 2nd column)

gi_49482297_ref_YP_039521.1_
gi_49482297_ref_YP_039521.1_
gi_49482315_ref_YP_039539.1_
gi_49482315_ref_YP_039539.1_

I want to remove the _ at the end of each value in this column.
At a later stage I may also want to replace the _ with another character.

How can I do this using awk or sed?

Any help would be highly appreciated.

RavinderSingh13 · October 13, 2015, 1:16am

Hello Syeda,

The following may help. You haven't shown us the complete input or told us the field separator, so I am assuming a test file with 7 columns where the field separator is a space.
Input_file:

cat Input_file
Ravinder gi_49482297_ref_YP_039521.1_ TESTing test123 sixth_column_ seventh eight_column_test
TEST121 gi_49482297_ref_YP_039521.1_ TESTing test123 sixth_column_ seventh eight_column_test
TEST1211 gi_49482315_ref_YP_039539.1_ TESTing test123 sixth_column_ seventh eight_column_test
TEST12134 gi_49482315_ref_YP_039539.1_ TESTing test123 sixth_column_ seventh eight_column_test

Now the following code may help.

awk '{for(i=1;i<=NF;i++){if(i==2){sub(/_$/,"",$i)} else {sub(/_$/,"_new character",$i)}}} 1' Input_file

The output will be as follows.

Ravinder gi_49482297_ref_YP_039521.1 TESTing test123 sixth_column_new character seventh eight_column_test
TEST121 gi_49482297_ref_YP_039521.1 TESTing test123 sixth_column_new character seventh eight_column_test
TEST1211 gi_49482315_ref_YP_039539.1 TESTing test123 sixth_column_new character seventh eight_column_test
TEST12134 gi_49482315_ref_YP_039539.1 TESTing test123 sixth_column_new character seventh eight_column_test

Here I am replacing the trailing _ of the 2nd column with an empty string, and the trailing _ of any other column (only the 5th column in my example file) with the string _new character, which you can change in the code to suit your requirement. Let us know if this helps you.
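
Since the field separator was not stated, here is a minimal sketch of the same idea for a tab-separated file, assuming that is what the real data uses. The function name strip_col2 and the sample line are made up for illustration. Setting OFS together with FS keeps the tabs in the output when awk rebuilds the modified line.

```shell
# Hypothetical helper: strip one trailing "_" from column 2 of tab-separated input.
strip_col2() {
    awk 'BEGIN { FS = OFS = "\t" }   # read and write tab-separated fields
         { sub(/_$/, "", $2) }       # remove a trailing "_" from field 2 only
         1'                          # print every (possibly modified) line
}

# Made-up sample line; column 3 keeps its trailing "_".
printf 'id1\tgi_49482297_ref_YP_039521.1_\tcol3_\n' | strip_col2
```

For a different separator, such as a comma, only the value assigned to FS and OFS should need to change.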

Thanks,
R. Singh

Syeda_Sumayya · October 13, 2015, 1:37am

Thanks R. Singh, but I am not really getting it, possibly because I have very limited knowledge of awk commands.
What do I have to do if I only want to remove the _ from the 2nd column? I have tried using the first part of your code but it's not working.

awk '{for(i=1;i<=NF;i++){if(i==2){sub(/\_$/,X,$i)}

What am I doing wrong?

RavinderSingh13 · October 13, 2015, 1:45am

Hello Syeda,

If you only want to remove the _ at the end of $2, then the following may help you. You mentioned in your first post that you would also want to replace the _ later, which is why my example in post #2 handled the other columns too. Please try the following and let me know if this helps you.
Input_file:

cat Input_file
Ravinder gi_49482297_ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST121 gi_49482297_ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST1211 gi_49482315_ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST12134 gi_49482315_ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test

awk '{sub(/_$/,"",$2);print}' Input_file

Output will be as follows.

Ravinder gi_49482297_ref_YP_039521.1 TESTing test123 sizth_column_ seventh eight_column_test
TEST121 gi_49482297_ref_YP_039521.1 TESTing test123 sizth_column_ seventh eight_column_test
TEST1211 gi_49482315_ref_YP_039539.1 TESTing test123 sizth_column_ seventh eight_column_test
TEST12134 gi_49482315_ref_YP_039539.1 TESTing test123 sizth_column_ seventh eight_column_test
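
The original question also asked about sed, so here is a rough sed equivalent, offered as a sketch rather than a tested drop-in. It assumes the columns are separated by single spaces and that the line does not start with a space. The function name strip_col2_sed is hypothetical, and -E (extended regular expressions) is assumed to be supported by the local sed, as it is in GNU and BSD sed.

```shell
# Hypothetical helper: same edit with sed, assuming space-separated columns.
strip_col2_sed() {
    # \1 = column 1, the space(s), and column 2 without its final "_";
    # \2 = the space (or end of line) that followed column 2.
    sed -E 's/^([^ ]+ +[^ ]*)_( |$)/\1\2/'
}

printf 'TEST121 gi_49482297_ref_YP_039521.1_ TESTing sizth_column_\n' | strip_col2_sed
# prints: TEST121 gi_49482297_ref_YP_039521.1 TESTing sizth_column_
```

Because sed has no notion of fields, the column has to be located with the pattern itself, which is why the awk version is usually easier to adapt to other columns.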

Thanks,
R. Singh

Syeda_Sumayya · October 13, 2015, 2:10am

Oh yes, I got it. Thanks.
Now I can change the code to

awk '{sub(/_$/,"anything",$2);print}' Input_file

to print anything I want at the end of column 2.
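
One caveat on "anything": in the replacement text of sub(), an unescaped & stands for the matched text, so a literal ampersand has to be written as "\\&". The sketch below illustrates this. The function name demo_ampersand and the sample line are made up for the example.

```shell
# Hypothetical helper: shows how "&" behaves in the replacement text of sub().
demo_ampersand() {
    awk '{ a = $2; b = $2
           sub(/_$/, "<&>", a)     # unescaped &: stands for the matched "_"
           sub(/_$/, "\\&", b)     # escaped &: a literal ampersand
           print a, b }'
}

printf 'x gi_1_ y\n' | demo_ampersand
# prints: gi_1<_> gi_1&
```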

Thanks a lot

One more thing: how can I specify the position at which I want to make the change? I mean, if I want to change something that is not at the end of the column.

RavinderSingh13 · October 13, 2015, 2:34am

Hello Syeda,

Here is an example: suppose you want to remove the 2nd occurrence of _ in $2, then the following may help you.
Input_file:

Ravinder gi_49482297_ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST121 gi_49482297_ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST1211 gi_49482315_ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST12134 gi_49482315_ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test

The code is as follows.

awk -v var=2 '{n=split($2,A,"_"); q=A[1]; for(i=2;i<=n;i++){k=(i-1==var)?"":"_"; q=q k A[i]} $2=q} 1' Input_file

Output will be as follows.

Ravinder gi_49482297ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST121 gi_49482297ref_YP_039521.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST1211 gi_49482315ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test
TEST12134 gi_49482315ref_YP_039539.1_ TESTing test123 sizth_column_ seventh eight_column_test

Here I have set a variable var=2 in my code because I wanted to change only the second occurrence of _ in $2.
You can change it as per your requirement. Hope this helps.
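
If "position" means a character position within the column rather than the Nth occurrence of _, substr() can be used instead. This is only a sketch under that assumption. The function name replace_at and its two arguments (a 1-based position and the replacement text) are hypothetical.

```shell
# Hypothetical helper: replace the character at position pos (1-based)
# of column 2 with rep; columns are assumed to be space-separated.
replace_at() {
    awk -v pos="$1" -v rep="$2" '
        length($2) >= pos {   # skip lines where column 2 is too short
            $2 = substr($2, 1, pos - 1) rep substr($2, pos + 1)
        }
        1'
}

printf 'TEST121 gi_49482297_ref_YP_039521.1_ TESTing\n' | replace_at 3 -
# prints: TEST121 gi-49482297_ref_YP_039521.1_ TESTing
```

Passing an empty string as the second argument removes the character instead of replacing it.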

Thanks,
R. Singh