Split certain strings in a line for a specific column.

Hi,

I need help extracting certain strings/words from lines of different lengths. I have 3 columns separated by a tab delimiter, like below:

Probable arabinan endo-1,5-alpha-L-arabinosidase A	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase B) (ABN B) abnB	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase C) (ABN C) abnC	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	(EC 4.1.3.1) icl1 icl	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	(EC 4.2.1.-) acoB	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	(EC 4.2.1.-) acoC	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	(EC 4.2.1.3) (Citrate hydro-lyase) (Homocitrate dehydratase)	Ady6g12930
Adenine deaminase (ADE)	(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	(EC 3.4.24.-) ADM-B	Ady4G11150
Probable alpha-galactosidase D	(EC 3.2.1.22) (Melibiase D) aglD	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	(EC 2.3.1.35) (Ornithine acetyltransferase) (OATase) (Ornithine transacetylase); Amino-acid acetyltransferase	Ady5G08120

I want to split $2 to keep only the "EC x.x.x.x" part and ignore the rest of the words in $2, then print $1, $2 (EC x.x.x.x only) and $3. I also want to remove its brackets. The output should look like this:

Probable arabinan endo-1,5-alpha-L-arabinosidase A	EC 3.2.1.99	Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B	EC 3.2.1.99	Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C	EC 3.2.1.99	Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase)	EC 4.1.3.1	Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2)	EC 4.2.1.-	Ady1g06810
Putative aconitate hydratase (Aconitase 3)	EC 4.2.1.-	Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase)	EC 4.2.1.3	Ady6g12930
Adenine deaminase (ADE)	EC 3.5.4.2	Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B)	EC 3.4.24.-	Ady4G11150
Probable alpha-galactosidase D	EC 3.2.1.22	Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT)	EC 2.3.1.35	Ady5G08120

I tried the following code, but I still could not remove the words following the "EC x.x.x.x" in $2, and the sed script removes all brackets; I only need to remove the brackets around EC x.x.x.x. I am sure it should not be that complicated, but I just couldn't figure it out.

awk -F. '{print $1"."$2"."$3"."$4,$4}' inputfile | sed 's/(\|)//g'

Any help would be appreciated.

"awk | sed" is, whatever the arguments to the commands might be, eo ipso wrong: awk can do the substitution itself, so the extra sed process is unnecessary.

First, read the lines and split them into fields. You say they are separated by tabs. In the following, "<t>" means a literal tab and "<b>" a blank character.

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     print - "field1: ${field1}"
     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

First, check the output! You may have to adjust the definitions. If you are satisfied, add the next step: extract the content from field2. We use shell variable expansion for this; see the man page of ksh for details.

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     print - "field1: ${field1}"

     field2="${field2#?}"         # split off first character "("
     field2="${field2%%\)*}"      # split off everything after first ")"

     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data

Test again. If you are still satisfied, "print" the final version:

field1=""
field2=""
field3=""

while IFS='<t>' read field1 field2 field3 ; do
     field2="${field2#?}"         # split off first character "("
     field2="${field2%%\)*}"      # split off everything after first ")"

     print - "${field1}\t${field2}\t${field3}"
done < /path/to/your/data > /path/to/your/output
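
The two expansions can also be tried on their own before running the whole loop. The following is only a minimal sketch in plain POSIX sh; the function name extract_ec is made up for this example, and it assumes (as in all the sample lines) that field2 starts with the "(EC ...)" part.

```shell
# extract_ec is a hypothetical helper: reduce one field2 value to the bare EC number.
# Assumption: the value starts with "(EC ...)", as in every sample line.
extract_ec() {
    ec=${1#?}        # drop the first character, the opening "("
    ec=${ec%%\)*}    # drop the first ")" and everything after it
    printf '%s\n' "$ec"
}

extract_ec '(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA'
extract_ec '(EC 4.2.1.-) acoB'
```

This should print "EC 3.2.1.99" and "EC 4.2.1.-". If a line ever had something else in front of the EC part, the first expansion would cut the wrong character, so it is worth checking the intermediate output as described above.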

I hope this helps.

bakunin

With awk you can use split() to further select the subfields that you need:

awk '{split($2,F,/[()]/); print $1, F[2], $3}' FS='\t' OFS='\t' file

/[()]/ means that parentheses are used as separators to split the second field further into the array "F". The second array element then contains the first text in parentheses.
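
To try this without the data file, two of the sample lines can be fed in through printf. This is just a sketch for testing: the wrapper name ec_only is made up, and it passes the separators with -F and -v instead of as trailing assignments, which should behave the same way here.

```shell
# ec_only is a hypothetical wrapper around the split() one-liner; it reads stdin.
ec_only() {
    awk -F'\t' -v OFS='\t' '{split($2,F,/[()]/); print $1, F[2], $3}'
}

# printf reuses the format for each group of three arguments, giving two tab-separated lines.
printf '%s\t%s\t%s\n' \
    'Adenine deaminase (ADE)' '(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1' 'Ady2G09150' \
    'Putative aconitate hydratase (Aconitase 3)' '(EC 4.2.1.-) acoC' 'Ady8g07140' |
ec_only
```

Note that the parentheses in the first column are left alone, because split() is applied to $2 only.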

Hi Scrutinizer,

It worked awesomely!!! Thanks so much for your explanation. I might sound stupid, but I don't understand why the second array element (F[2]) is the first text in parentheses. What is the first element, then? Is it the parenthesis itself? Thanks.

Hi bakunin,

Thanks so much for your great response. I normally use awk and sed in my work and I am still learning, so your code is quite new to me. I will look into it, work on it and give feedback ASAP. This is great, as I got a chance to learn new stuff :).

Good to hear... The first element of the array contains the empty string before the first opening parenthesis.

If we take the second field of the first line as an example:

(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA
F[1] contains ""
F[2] contains "EC 3.2.1.99"
F[3] contains " "
F[4] contains "Endo-1,5-alpha-L-arabinanase A"
F[5] contains " "
F[6] contains "ABN A"
F[7] contains " abnA"
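
Because of this, F[2] is only the EC number when the second field begins with the EC parenthetical, which is the case for every sample line. If that were not guaranteed, one possible variation is to search for the number with match() instead. This is only a sketch: the name ec_match is made up, and the pattern assumes the number always looks like "EC" followed by four dot-separated parts made of digits or "-".

```shell
# ec_match is a hypothetical alternative; it reads tab-separated lines from stdin.
ec_match() {
    awk -F'\t' -v OFS='\t' '
        # match() sets RSTART and RLENGTH when the pattern is found in $2
        match($2, /EC [0-9]+\.[0-9-]+\.[0-9-]+\.[0-9-]+/) {
            print $1, substr($2, RSTART, RLENGTH), $3
            next
        }
        { print }   # no EC number found: pass the line through unchanged
    '
}

printf '%s\t%s\t%s\n' \
    'Probable alpha-galactosidase D' '(EC 3.2.1.22) (Melibiase D) aglD' 'Ady4G03585' |
ec_match
```

Lines without a recognizable EC number are printed unchanged here; whether that or skipping them is preferable depends on the data.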

cool... thanks a bunch... now i get it :wink: