I need to grep for the text inside the square brackets that is marked in red, but not the text marked in green. My current code extracts both, which I don't want.
Input file
ref|XP_002371341.1| oxoacyl-ACP reductase, putative [Toxoplasma gondii ME49] gb|EPT24759.1| 3-ketoacyl-(acyl-carrier-protein) reductase [Toxoplasma gondii ME49] gb|ESS34081.1| 3-ketoacyl-(acyl-carrier-protein) reductase [Toxoplasma gondii VEG](376) - 243 134 61.4617940199336 1 230 2e-71 80.7308970099668
gb|EPR63881.1| 3-ketoacyl-(acyl-carrier-protein) reductase [Toxoplasma gondii GT1](376) - 243 134 61.4617940199336 1 230 2e-71 80.7308970099668
ref|XP_003885852.1| 3-ketoacyl-(Acyl-carrier-protein) reductase, related [Neospora caninum Liverpool] emb|CBZ55826.1| 3-ketoacyl-(Acyl-carrier-protein) reductase, related [Neospora caninum Liverpool](376) - 242 137 61.7940199335548 1 229 8e-71 80.3986710963455
emb|CDJ42835.1| oxoacyl-ACP reductase, putative [Eimeria tenella](347) - 240 141 61.7940199335548 1 211 3e-64 79.734219269103
emb|CDJ64722.1| oxoacyl-ACP reductase, putative [Eimeria necatrix](347)
My current code
while read line
do
echo $line | awk 'NR>1{print $1}' RS=[ FS=] >> $OUTPUTFILE
done <$list
Any help or suggestions, please?
Hint: the one helpful thing is that each pattern in red is immediately followed by a number in parentheses, e.g. (347), which can be used as a marker.
You do not need the shell loop, since awk already loops over its input records implicitly and runs the main (middle) section for each one:
awk 'NR>1{print $1}' RS=[ FS=] "$list" >> "$OUTPUTFILE"
will accomplish the same.
It does not print the part in parentheses which you also indicated in red. So it is unclear whether you want that printed or not.
If not, try this modification:
awk 'NR>1 && $2~/^\(/{print $1}' RS=[ FS=] "$list" >> "$OUTPUTFILE"
If so, try:
awk 'NR>1 && $2~/^\(/{sub(/\).*/,")",$2); print $1 $2}' RS=[ FS=] "$list" >> "$OUTPUTFILE"
or, if your grep has the -o option, try:
grep -o '\[[^]]*\]([^)]*)' "$list" >> "$OUTPUTFILE"
But that will include the square brackets in the output.
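If the brackets and the trailing number are not wanted, one possible follow-up is to pipe the grep output through sed. This is only a sketch: the function name extract_marked and the shortened sample lines are made up for illustration, and it assumes grep supports -o and that mktemp is available.

```shell
#!/bin/sh
# Hypothetical helper: print only the names that are directly followed
# by a parenthesised number, without the brackets or the number.
extract_marked() {
    # grep -o keeps just the "[name](number)" matches, one per line;
    # sed then drops the leading "[" and everything from "](" onward.
    grep -o '\[[^]]*\]([^)]*)' "$1" | sed 's/^\[//; s/](.*$//'
}

# Shortened sample lines modelled on the input file in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ref|XP_002371341.1| reductase [Toxoplasma gondii ME49] gb|ESS34081.1| reductase [Toxoplasma gondii VEG](376) - 243 134
emb|CDJ42835.1| oxoacyl-ACP reductase, putative [Eimeria tenella](347) - 240 141
EOF

extract_marked "$sample"
# prints:
# Toxoplasma gondii VEG
# Eimeria tenella

rm -f "$sample"
```

The first bracketed name on the first sample line is skipped because it is followed by a space rather than an opening parenthesis.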
RudiC, February 24, 2015, 3:56am
Try (making use of your footnote hint):
sed 's/\[[^][]*\]([0-9]\{1,3\})//' file3
ref|XP_002371341.1| oxoacyl-ACP reductase, putative [Toxoplasma gondii ME49] gb|EPT24759.1| 3-ketoacyl-(acyl-carrier-protein) reductase [Toxoplasma gondii ME49] gb|ESS34081.1| 3-ketoacyl-(acyl-carrier-protein) reductase - 243 134 61.4617940199336 1 230 2e-71 80.7308970099668
gb|EPR63881.1| 3-ketoacyl-(acyl-carrier-protein) reductase - 243 134 61.4617940199336 1 230 2e-71 80.7308970099668
ref|XP_003885852.1| 3-ketoacyl-(Acyl-carrier-protein) reductase, related [Neospora caninum Liverpool] emb|CBZ55826.1| 3-ketoacyl-(Acyl-carrier-protein) reductase, related - 242 137 61.7940199335548 1 229 8e-71 80.3986710963455
emb|CDJ42835.1| oxoacyl-ACP reductase, putative - 240 141 61.7940199335548 1 211 3e-64 79.734219269103
emb|CDJ64722.1| oxoacyl-ACP reductase, putative
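Note that the sed command above deletes the marked patterns from each line. If the goal is to list them instead, the same regular expression can be turned around with a capture group and sed -n ... p. This is a sketch under assumptions: the function name extract_with_sed is hypothetical, the sample lines are shortened, and it assumes at most one marked pattern per line, as in the sample input (with several per line, only the last would be printed).

```shell
#!/bin/sh
# Hypothetical helper: print the bracketed name that is directly
# followed by a 1-3 digit number in parentheses; lines without such
# a pattern produce no output.
extract_with_sed() {
    sed -n 's/.*\[\([^][]*\)]([0-9]\{1,3\}).*/\1/p' "$1"
}

# Shortened sample lines modelled on the input file in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
gb|EPR63881.1| 3-ketoacyl-(acyl-carrier-protein) reductase [Toxoplasma gondii GT1](376) - 243 134
emb|CDJ64722.1| oxoacyl-ACP reductase, putative [Eimeria necatrix](347)
EOF

extract_with_sed "$sample"
# prints:
# Toxoplasma gondii GT1
# Eimeria necatrix

rm -f "$sample"
```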