Insert underscore in place of column

Hi

I tried to put an underscore in place of each column separator in a big file with lots of columns, using the command

  sed 's/[\t]/_/g' 

It's showing an error:

bash-3.2$ sed 's/[\t]/_/g'saradrugbankgenedrugnewlist.txt >saradrugbankgenedrugnewlist3.txt
sed: -e expression #1, char 11: unknown option to `s'

If the input has many columns like this

AST3  GUD  GDY

JHF    HGA   HAY

I want the output to be

AST3_GUD_GDY

JHF_HGA_HAY

Kindly guide

Try (GNU sed):

$ sed 's/[ \t]\+/_/g' infile
AST3_GUD_GDY

JHF_HGA_HAY

I think you're just missing the space before the filename...
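
To illustrate the missing-space diagnosis: without the space, the shell glues the filename onto the sed script, so sed receives s/[\t]/_/gsaradrug... as one expression. The 11th character of that expression is the first letter of the filename, which would match the "char 11: unknown option" in the error. Below is a minimal sketch of the corrected call, assuming the columns really are separated by single tabs. The scratch directory, file names and sample data are made up for illustration.

```shell
# Hypothetical scratch directory and sample file (not from the original post).
workdir=$(mktemp -d)
printf 'AST3\tGUD\tGDY\nJHF\tHGA\tHAY\n' > "$workdir/in.txt"

# A literal tab character, so the script does not rely on sed understanding \t.
tab=$(printf '\t')

# Note the space between the closing quote of the sed script and the filename.
sed "s/${tab}/_/g" "$workdir/in.txt" > "$workdir/out.txt"

cat "$workdir/out.txt"
```

Building the tab with printf is only a portability precaution; with GNU sed, writing \t inside the expression should behave the same way.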

It's placing an underscore even where there is a space within a column, which I don't want.

For example,

if there are 3 columns

AST not available  sht

it shows

AST_not_available_sht

but I want it to show

AST_not available_sht

so there is an error.

That's a new aspect. Please show a representative input example and a related output example. Also describe the rules for when and where something should be substituted, so that people don't have to keep guessing and the requirements don't change between posts. Thanks.

Hi

As I mentioned above, only the column separators have to be replaced with underscores; no change has to be made within a column. I made that clear with the sample above, and no other change has to be made. Kindly guide if possible.

manigrover, in your 1st example you want all spaces replaced by an underscore. In the second example it seems you have this:

AST not available  sht

This is a very limited excerpt. From awk's point of view this has 4 columns, not 3, since blanks and tabs both act as field separators and your input has nothing but blanks between the words.
One could special-case lines containing "not available" so that this string is protected from substitution, but I bet you have more than this line, else you would not need a script.

So, to help describe your input: are there usually just 3 columns, with a 4th appearing only when the string "not available" is in there, or are there even more kinds of lines?
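
The field-counting point can be checked directly. The sketch below assumes a hypothetical line in which tabs separate the columns (the excerpt above only shows blanks, so this is an assumption) and compares awk's default splitting with tab-only splitting.

```shell
# Hypothetical 3-column line; the tabs between columns are an assumption.
line=$(printf 'AST\tnot available\tsht')

# Default splitting: any run of blanks or tabs separates fields.
default_fields=$(printf '%s\n' "$line" | awk '{ print NF }')

# Tab-only splitting: spaces inside a column are left alone.
tab_fields=$(printf '%s\n' "$line" | awk -F'\t' '{ print NF }')

echo "default: $default_fields, tab-only: $tab_fields"
```

With this sample the default split reports 4 fields and the tab-only split reports 3, which is the difference being described.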

Actually there are many other columns in the file, with words separated by spaces. The only thing I have to do is replace the column separator, which I think is a tab, with an underscore, and not touch the spaces between words within a column. So the input can have 5 columns:

AST  aluminum tungstate   Al    available     confirm

GST   guanine tungstate     Gn     not availble    not confirmed

And I have to place 4 underscores, because there are only 4 places where column boundaries are present.

Ok, so it looks like a blank, spaces or tabs should be replaced by 1 underscore, but the only exception is "not available" and "not confirmed". Correct?
Is this a database dump? How is the file created?

No, there might be other words; it's not just "not confirmed" or "not available".

I came to know that a column is nothing but tab separation, so I feel that if each tab is replaced by an underscore it may work. I have to replace the column separator, not a double space or a single blank; just the column separator, which seems to be a tab as far as I have read so far.

You can replace one or more tabs by an underscore, but as long as you just "feel" or "think" they are tab separated, this will not be exact. If that is not the case, you will get wrong output.

You did not answer how the file is being produced. I will not ask again, even though knowing that could get you a better field separator at this stage. Anyway:

$ od -c infile
0000000   A   S   T  \t   a   l   u   m   i   n   u   m  \t   t   u   n
0000020   g   s   t   a   t   e  \t  \t   A   l  \t   a   v   a   i   l
0000040   a   b   l   e  \t  \t   c   o   n   f   i   r   m  \n  \n   G
0000060   S   T  \t  \t   g   u   a   n   i   n   e  \t   t   u   n   g
0000100   s   t   a   t   e  \t  \t  \t   G   n  \t   n   o   t       a
0000120   v   a   i   l   a   b   l   e  \t  \t   n   o   t       c   o
0000140   n   f   i   r   m   e   d  \n
0000150
$ cat infile
AST     aluminum        tungstate               Al      available               confirm

GST             guanine tungstate                       Gn      not available           not confirmed
$ sed 's/\t\+/_/g' infile
AST_aluminum_tungstate_Al_available_confirm

GST_guanine_tungstate_Gn_not available_not confirmed
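
Besides reading the od output by eye, one way to test the "columns are tabs" assumption is to count tab-separated fields on every line and see whether the count is consistent. The sketch below uses a hypothetical two-row file in which each column boundary is a single tab; the file name and contents are illustrative only.

```shell
# Hypothetical sample: 5 columns per row, one tab between columns.
workdir=$(mktemp -d)
printf 'AST\taluminum tungstate\tAl\tavailable\tconfirm\n'            >  "$workdir/infile"
printf 'GST\tguanine tungstate\tGn\tnot available\tnot confirmed\n' >> "$workdir/infile"

# Distinct tab-separated field counts across the file.
# A single value suggests the tab really is the column separator.
field_counts=$(awk -F'\t' '{ print NF }' "$workdir/infile" | sort -u)
echo "field counts seen: $field_counts"

# Only then replace each tab by an underscore.
converted=$(tr '\t' '_' < "$workdir/infile")
echo "$converted"
```

If several different counts show up on a real file, that is a hint that some rows contain stray or doubled tabs and the substitution needs more care.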

Hi

The file is from a database. Actually, I once learned that a column is nothing but tab separation, and in the code mentioned above there is an underscore between aluminum and tungstate, which I don't want as both words are in the same column.

Ok, I just placed some tabs between the input columns where I thought it was correct. You can change that in the input on your own. Anyway, as it is just an example, the code still applies.
If it is from a database, you can usually give the database export/dump a delimiter (which is strongly recommended), let's say a semicolon or some other unique character. That would make handling such files much easier. We would not have to guess, heh.

Hi

My input file is like this:

000000   F   H   I   T      \t   A   d   e   n   o   s   i   n   e    
0000020   M   o   n   o   t   u   n   g   s   t   a   t   e  \t   N   o
0000040   t       A   v   a   i   l   a   b   l   e       C   A   D   ,
0000060   H   T   ,   H   T   ,   T   2   D  \t   A   d   o   -   P   -
0000100   C   h   2   -   P   -   P   s   -   A   d   o  \t   N   o   t
0000120       A   v   a   i   l   a   b   l   e       C   A   D   ,   H
0000140   T   ,   H   T   ,   T   2   D  \n   C   S      \t   T   r   i
0000160   f   l   u   o   r   o   a   c   e   t   o   n   y   l       C
0000200   o   e   n   z   y   m   e       A  \t   N   o   t       A   v
0000220   a   i   l   a   b   l   e       C   A   D   ,   C   A   D   ,
0000240   T   1   D  \t   A   l   p   h   a   -   F   l   u   o   r   o
0000260   -   C   a   r   b   o   x   y   m   e   t   h   y   l   d   e
0000300   t   h   i   a       C   o   e   n   z   y   m   e       a    
0000320   C   o   m   p   l   e   x  \t   N   o   t       A   v   a   i
0000340   l   a   b   l   e       C   A   D   ,   C   A   D   ,   T   1
0000360   D  \t   N   i   t   r   o   m   e   t   h   y   l   d   e   t
0000400   h   i   a       C   o   e   n   z   y   m   e       A  \t   N
0000420   o   t       A   v   a   i   l   a   b   l   e       C   A   D
0000440   ,   C   A   D   ,   T   1   D  \n   C   H   R   M   1      \t
0000460   T   r   o   s   p   i   u   m  \t   S   a   n   c   t   u   r
0000500   a       T   2   D  \t   O   x   y   p   h   e   n   o   n   i
0000520   u   m  \t   A   n   t   r   e   n   y   l       T   2   D  \t
0000540   P   i   r   e   n   z   e   p   i   n   e  \t   B   i   s   v
0000560   a   n   i   l       T   2   D  \t   C   l   i   d   i   n   i
0000600   u   m  \t   Q   u   a   r   z   a   n       T   2   D  \t   P
0000620   r   o   p   a   n   t   h   e   l   i   n   e  \t   P   r   o
0000640   -   B   a   n   t   h   i   n   e       T   2   D  \t   C   y
0000660   c   r   i   m   i   n   e  \t   P   a   g   i   t   a   n   e
0000700       T   2   D  \t   C   y   c   l   o   p   e   n   t   o   l
0000720   a   t   e  \t   A   K   -   P   e   n   t   o   l   a   t   e
0000740       T   2   D  \t   G   l   y   c   o   p   y   r   r   o   l
0000760   a   t   e  \t   A   s   e   c   r   y   l       T   2   D  \t
0001000   A   r   e   c   o   l   i   n   e  \t   N   o   t       A   v
0001020   a   i   l   a   b   l   e       T   2   D  \n   P   D   E   3
0001040   A      \t   M   i   l   r   i   n   o   n   e  \t   C   o   r
0001060   o   t   r   o   p       H   T  \t   A   n   a   g   r   e   l
0001100   i   d   e  \t   A   g   r   y   l   i   n       H   T  \t   C
0001120   i   l   o   s   t   a   z   o   l  \t   P   l   e   t   a   a
0001140   l       H   T  \t   E   n   o   x   i   m   o   n   e  \t   P
0001160   e   r   f   a   n       H   T  \n   P   D   E   3   B      \t
0001200 001   5   r 001   -   6   - 001   4   - 001 001   2   - 001   3
0001220   -   I   o   d   o   b   e   n   z   y   l 001   -   3   -   O
0001240   x   o   c   y   c   l   o   h   e   x   -   1   -   E   n   -

I have to put an underscore in place of each column separator, but the file is so big that I cannot place a tab by hand everywhere there is a column boundary and then change it to an underscore. Kindly guide if possible.

And, I am not aware of an export/dump delimiter function for the whole file of a database.

I am not sure if you just ignore my tips or don't understand.
I posted the output of od in the shorter example to show you, where tabs are placed. Having now an example about 5 times the size as od output is really not helpful.
You have been presented some ideas but it's still a lot of guessing and ignoring of tips without explanation, so I personally quit here and wish you good luck.

awk 'BEGIN{OFS="_"}{$1=$1;print}'  filename

Hi Raj

Thanks for the reply

But I tried this before, and it inserts an underscore even where there is no column boundary; that is, it inserts an underscore wherever there is a space. For example,

if input is

FHIT     Adenosine Monotungstate    Not Available CAD,HT,HT,T2D    Ado-P-Ch2-P-Ps-Ado    Not Available CAD,HT,HT,T2D
CS     Trifluoroacetonyl Coenzyme A    Not Available CAD,CAD,T1D    Alpha-Fluoro-Carboxymethyldethia Coenzyme a Complex    Not Available CAD,CAD,T1D    Nitromethyldethia Coenzyme A    Not Available CAD,CAD,T1D

the output is

FHIT_Adenosine_Monotungstate_Not_Available_CAD,HT,HT,T2D_Ado-P-Ch2-P-Ps-Ado_Not_Available_CAD,HT,HT,T2D
CS_Trifluoroacetonyl_Coenzyme_A_Not_Available_CAD,CAD,T1D_Alpha-Fluoro-Carboxymethyldethia_Coenzyme_a_Complex_Not_Available_CAD,CAD,T1D_Nitromethyldethia_Coenzyme_A_Not_Available_CAD,CAD,T1D



The expected output is to put an underscore where there is a column boundary, meaning a tab, not at all places where there is a space.

FHIT_Adenosine Monotungstate_Not Available CAD,HT,HT,T2D_Ado-P-Ch2-P-Ps-Ado_Not Available CAD,HT,HT,T2D
CS_Trifluoroacetonyl Coenzyme A _Not Available CAD,CAD,T1D_Alpha-Fluoro-Carboxymethyldethia Coenzyme a Complex_Not Available CAD,CAD,T1D_Nitromethyldethia Coenzyme A_Not Available CAD,CAD,T1D

sed 's/  /_/g' file | sed 's/  /_/g' |  sed 's/[_\t]\+/_/g' | sed 's/_ /_/g'

awk '{gsub(/\t/,"_")}1' file

Or

tr '\t' '_' < file

Or

sed 'y/[Tab]/_/' file

Use the Tab key in place of [Tab].
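
The od output posted earlier appears to show a space just before some of the tabs (for example after FHIT), so a plain tab-to-underscore replacement could leave results like "FHIT _Adenosine". A small sketch that also swallows spaces directly next to a tab, using a hypothetical row modelled on that dump:

```shell
# A literal tab, so the expression does not rely on sed understanding \t.
tab=$(printf '\t')

# Hypothetical row: note the stray space before the first tab.
row=$(printf 'FHIT \tAdenosine Monotungstate\tNot Available CAD,HT,HT,T2D')

# Replace each tab, plus any spaces touching it, by one underscore.
# Spaces between words inside a column are not adjacent to a tab, so they stay.
trimmed=$(printf '%s\n' "$row" | sed "s/ *${tab} */_/g")
echo "$trimmed"
```

For this sample row the result is FHIT_Adenosine Monotungstate_Not Available CAD,HT,HT,T2D. Whether trimming is wanted on the real file is an assumption; if the spaces next to tabs are meaningful, the simpler tr or sed y forms above are the safer choice.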

try this

awk -F"  " 'BEGIN{OFS="_"}{$1=$1;print}'  filename
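
Note that a two-space separator only helps if the columns really are separated by pairs of spaces. If they are tab-separated, as the od output in this thread suggests, the same $1=$1 idiom can be pointed at tabs instead. A sketch with a hypothetical sample row:

```shell
# Hypothetical tab-separated row with spaces inside the columns.
row=$(printf 'CS\tTrifluoroacetonyl Coenzyme A\tNot Available CAD,CAD,T1D')

# FS = tab splits only on tabs; assigning $1 = $1 makes awk rebuild the
# record with OFS ("_") between the fields.
joined=$(printf '%s\n' "$row" |
  awk 'BEGIN { FS = "\t"; OFS = "_" } { $1 = $1; print }')
echo "$joined"
```

For this row the output is CS_Trifluoroacetonyl Coenzyme A_Not Available CAD,CAD,T1D, with the spaces inside each column left untouched.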