I need help extracting certain strings/words from lines of different lengths. I have 3 columns separated by a tab delimiter, like below:
Probable arabinan endo-1,5-alpha-L-arabinosidase A (EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B (EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase B) (ABN B) abnB Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C (EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase C) (ABN C) abnC Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase) (EC 4.1.3.1) icl1 icl Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2) (EC 4.2.1.-) acoB Ady1g06810
Putative aconitate hydratase (Aconitase 3) (EC 4.2.1.-) acoC Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase) (EC 4.2.1.3) (Citrate hydro-lyase) (Homocitrate dehydratase) Ady6g12930
Adenine deaminase (ADE) (EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1 Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B) (EC 3.4.24.-) ADM-B Ady4G11150
Probable alpha-galactosidase D (EC 3.2.1.22) (Melibiase D) aglD Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase (GAT) (EC 2.3.1.35) (Ornithine acetyltransferase) (OATase) (Ornithine transacetylase); Amino-acid acetyltransferase Ady5G08120
I want to split $2 to keep only the "EC x.x.x.x" part, ignore the rest of the words in $2, and print $1, $2 (EC x.x.x.x only) and $3. I also want to remove its brackets. The output should look like this:
Probable arabinan endo-1,5-alpha-L-arabinosidase A EC 3.2.1.99 Ady3G14620
Probable arabinan endo-1,5-alpha-L-arabinosidase B EC 3.2.1.99 Ady2G14150
Probable arabinan endo-1,5-alpha-L-arabinosidase C EC 3.2.1.99 Ady6G00770
Isocitrate lyase (ICL) (Isocitrase) (Isocitratase) EC 4.1.3.1 Ady4G13510
Putative aconitate hydratase, mitochondrial (Aconitase 2) EC 4.2.1.- Ady1g06810
Putative aconitate hydratase (Aconitase 3) EC 4.2.1.- Ady8g07140
Aconitate hydratase, mitochondrial (Aconitase) EC 4.2.1.3 Ady6g12930
Adenine deaminase (ADE) EC 3.5.4.2 Ady2G09150
Disintegrin and metalloproteinase domain-containing protein B (ADAM B) EC 3.4.24.- Ady4G11150
Probable alpha-galactosidase D EC 3.2.1.22 (Melibiase D) Ady4G03585
Arginine biosynthesis bifunctional protein ArgJ, mitochondrial [Cleaved into: Arginine biosynthesis bifunctional protein ArgJ alpha chain; Arginine biosynthesis bifunctional protein ArgJ beta chain] [Includes: Glutamate N-acetyltransferase GAT EC 2.3.1.35 Ady5G08120
I tried the following code, but I still could not remove the words following the "EC x.x.x.x" in $2. Also, the sed script removes all brackets, while I only need to remove the brackets around EC x.x.x.x. I am sure it should not be that complicated, but I just couldn't figure it out.
awk -F. '{print $1"."$2"."$3"."$4,$4}' inputfile | sed 's/(\|)//g'
First, check the output! You may have to adjust your definitions. If you are satisfied, add the next step: extracting the content from field2. We use shell variable expansion for this; see the ksh man page for details.
field1=""
field2=""
field3=""
# IFS is set to a literal tab; the $'\t' notation needs ksh93 or bash
while IFS=$'\t' read -r field1 field2 field3 ; do
     print - "field1: ${field1}"
     field2="${field2#?}"          # split off first character "("
     field2="${field2%%\)*}"       # split off everything from the first ")" onwards
     print - "field2: ${field2}"
     print - "field3: ${field3}"
     print - "----------"
done < /path/to/your/data
Test again. If you are still satisfied, "print" the final version:
field1=""
field2=""
field3=""
# IFS is set to a literal tab; the $'\t' notation needs ksh93 or bash
while IFS=$'\t' read -r field1 field2 field3 ; do
     field2="${field2#?}"          # split off first character "("
     field2="${field2%%\)*}"       # split off everything from the first ")" onwards
     print - "${field1}\t${field2}\t${field3}"
done < /path/to/your/data > /path/to/your/output
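To see what the two expansions in the loop do, it may help to run them on a single value outside the loop. This is only a sketch: the sample value is an assumption about what field2 looks like after the read (taken from the first data line), the names step1 and step2 are made up for illustration, and printf is used instead of print so that it also runs in shells other than ksh.

```shell
# Assumed sample: field2 as it might look right after the read
field2='(EC 3.2.1.99) (Endo-1,5-alpha-L-arabinanase A) (ABN A) abnA'

# "#?" removes the shortest leading match of "?", i.e. exactly one character
step1="${field2#?}"

# "%%\)*" removes the longest trailing match of ")*",
# i.e. everything from the first ")" to the end of the string
step2="${step1%%\)*}"

printf '%s\n' "$step1"
printf '%s\n' "$step2"
```

The second line printed is EC 3.2.1.99. Note that this only works if field2 really starts with the "(EC ...)" group; if another parenthesised text comes first, that text is what you get.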
/[()]/ means that the parentheses are used as separators to split the second field further into the array "F". The second array element then contains the first text in parentheses.
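The awk command this explanation belongs to is not shown in this thread, so the following is only a sketch of the idea, not the original code. It assumes the input is tab-separated and that field 2 begins with the "(EC ...)" group; the function name extract_ec and the sample line are made up for illustration.

```shell
# Hypothetical helper: reads tab-separated lines from files or stdin
extract_ec() {
    awk -F'\t' -v OFS='\t' '{
        split($2, F, /[()]/)   # "(" and ")" act as separators inside field 2
        $2 = F[2]              # text between the first "(" and the first ")"
        print                  # fields are re-joined with OFS (a tab)
    }' "$@"
}

# Assumed sample line with real tabs between the three columns
printf 'Adenine deaminase (ADE)\t(EC 3.5.4.2) (Adenine aminohydrolase) (AAH) aah1\tAdy2G09150\n' | extract_ec
```

Like the shell loop, this takes the first parenthesised text in field 2, whatever it is, so it depends on the EC number coming first in that field.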
It worked great, and thanks so much for your explanation! This may be a silly question, but I don't understand why the second array element (F[2]) is the first text in parentheses. What is the first element then? Is it the parenthesis itself? Thanks.
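One way to check what each element holds is to print the whole array. A small sketch, where the function name dump_split and the sample string are made up for illustration and the string is assumed to start with "(" as field 2 does in the data:

```shell
# Hypothetical helper: split one string on "(" and ")" and list every element
dump_split() {
    printf '%s\n' "$1" | awk '{
        n = split($0, F, /[()]/)
        for (i = 1; i <= n; i++)
            printf "F[%d]=\"%s\"\n", i, F[i]
    }'
}

dump_split '(EC 3.2.1.22) (Melibiase D) aglD'
```

The separators themselves are never stored. F[1] is whatever comes before the first "(", which here is the empty string, so F[2] is the first text inside parentheses. For this sample F[3] is the single blank between the two groups, F[4] is Melibiase D, and F[5] is " aglD".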
---------- Post updated at 10:44 AM ---------- Previous update was at 10:41 AM ----------
Hi bakunin,
Thanks so much for your great response. I normally use awk and sed in my work and I am still learning, so your code is quite new to me. I will look into it, work on it, and give feedback asap. This is great, as I get a chance to learn new stuff :).