Extract all content that exactly matches a specific word

Input:

21      templeta        parent  35718   36554   .       -       .       ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    35718   36554   .       -       .       ID=_52; Parent=parent_cluster_5085.21.11;
21      templeta        location        35840   36073   .       -       .       ID=_5285.location4; Parent=_5285
21      templeta        pattern 35840   36073   .       -       0       ID=_52.cds4; Parent=_5285
21      templeta        location        35718   35778   .       -       .       ID=_5285.location5; Parent=_5285
21      templeta        pattern 35758   35778   .       -       0       ID=_52.cds5; Parent=_5285
21      templeta        length  35718   35757   .       -       .       ID=_52.utr3p1; Parent=_5285

21      templeta        parent  43191   43851   .       +       .       ID=parent_cluster_5086.21.12; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    43191   43851   .       +       .       ID=_5286; Parent=parent_cluster_5086.21.12;
21      templeta        length  43191   43192   .       +       .       ID=_5286.utr5p1; Parent=_5286
21      templeta        location        43191   43851   .       +       .       ID=_5286.location1; Parent=_5286
21      templeta        pattern 43193   43819   .       +       0       ID=_5286.cds1; Parent=_5286; 5_prime_partial=true
21      templeta        length  43820   43851   .       +       .       ID=_5286.utr3p1; Parent=_5286

22      templeta        parent  4204    4962    .       -       .       ID=parent_cluster_5087.22.1; Name=Partial%20parent%20for%20training%20set;
22      templeta        kids    4204    4962    .       -       .       ID=_5287; Parent=parent_cluster_5087.22.1;
22      templeta        length  4876    4962    .       -       .       ID=_5287.utr5p1; Parent=_5287
22      templeta        location        4204    4962    .       -       .       ID=_5287.location1; Parent=_5287
22      templeta        pattern 4204    4875    .       -       0       ID=_5287.cds1; Parent=_5287; 3_prime_partial=true

Desired output:

21      templeta        parent  35718   36554   .       -       .       ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    35718   36554   .       -       .       ID=_52; Parent=parent_cluster_5085.21.11;
21      templeta        location        35840   36073   .       -       .       ID=_5285.location4; Parent=_5285
21      templeta        pattern 35840   36073   .       -       0       ID=_52.cds4; Parent=_5285
21      templeta        location        35718   35778   .       -       .       ID=_5285.location5; Parent=_5285
21      templeta        pattern 35758   35778   .       -       0       ID=_52.cds5; Parent=_5285
21      templeta        length  35718   35757   .       -       .       ID=_52.utr3p1; Parent=_5285

Awk code that I have tried:

awk 'BEGIN {RS=""; FS="\n"}  {for (i=1;i<=NF;i++) {if ($i~/ID=_52/) {print $_}}}' input_file

Output I get:

21      templeta        parent  35718   36554   .       -       .       ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    35718   36554   .       -       .       ID=_52; Parent=parent_cluster_5085.21.11;
21      templeta        location        35840   36073   .       -       .       ID=_5285.location4; Parent=_5285
21      templeta        pattern 35840   36073   .       -       0       ID=_52.cds4; Parent=_5285
21      templeta        location        35718   35778   .       -       .       ID=_5285.location5; Parent=_5285
21      templeta        pattern 35758   35778   .       -       0       ID=_52.cds5; Parent=_5285
21      templeta        length  35718   35757   .       -       .       ID=_52.utr3p1; Parent=_5285

21      templeta        parent  43191   43851   .       +       .       ID=parent_cluster_5086.21.12; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    43191   43851   .       +       .       ID=_5286; Parent=parent_cluster_5086.21.12;
21      templeta        length  43191   43192   .       +       .       ID=_5286.utr5p1; Parent=_5286
21      templeta        location        43191   43851   .       +       .       ID=_5286.location1; Parent=_5286
21      templeta        pattern 43193   43819   .       +       0       ID=_5286.cds1; Parent=_5286; 5_prime_partial=true
21      templeta        length  43820   43851   .       +       .       ID=_5286.utr3p1; Parent=_5286

22      templeta        parent  4204    4962    .       -       .       ID=parent_cluster_5087.22.1; Name=Partial%20parent%20for%20training%20set;
22      templeta        kids    4204    4962    .       -       .       ID=_5287; Parent=parent_cluster_5087.22.1;
22      templeta        length  4876    4962    .       -       .       ID=_5287.utr5p1; Parent=_5287
22      templeta        location        4204    4962    .       -       .       ID=_5287.location1; Parent=_5287
22      templeta        pattern 4204    4875    .       -       0       ID=_5287.cds1; Parent=_5287; 3_prime_partial=true

My goal is to use awk or any other programming language to extract only the content that matches the word "ID=_52" exactly, instead of everything that partially matches "ID=_52", such as "ID=_5286" or "ID=_5287".
Thanks for any advice.

Did you try using the ^ and $ anchors?

/^ID=_52$/
egrep -w ID=_52 file

Hi,
Thanks for your reply.
It does not seem to work in my case :frowning:

Hi,
Thanks for your suggestion.
But the grep command does not print the first line of my desired output.

egrep -w ID=_52 file
21      templeta        kids    35718   36554   .       -       .       ID=_52; Parent=parent_cluster_5085.21.11;
21      templeta        location        35840   36073   .       -       .       ID=_5285.location4; Parent=_5285
21      templeta        pattern 35840   36073   .       -       0       ID=_52.cds4; Parent=_5285
21      templeta        location        35718   35778   .       -       .       ID=_5285.location5; Parent=_5285
21      templeta        pattern 35758   35778   .       -       0       ID=_52.cds5; Parent=_5285
21      templeta        length  35718   35757   .       -       .       ID=_52.utr3p1; Parent=_5285

Desired output:

21      templeta        parent  35718   36554   .       -       .       ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21      templeta        kids    35718   36554   .       -       .       ID=_52; Parent=parent_cluster_5085.21.11;
21      templeta        location        35840   36073   .       -       .       ID=_5285.location4; Parent=_5285
21      templeta        pattern 35840   36073   .       -       0       ID=_52.cds4; Parent=_5285
21      templeta        location        35718   35778   .       -       .       ID=_5285.location5; Parent=_5285
21      templeta        pattern 35758   35778   .       -       0       ID=_52.cds5; Parent=_5285
21      templeta        length  35718   35757   .       -       .       ID=_52.utr3p1; Parent=_5285

Do you have any other suggestions?
Thanks.

Something like this?

awk '$9~"^ID=_52[^0-9]"' infile

Your first line does not contain ID=_52. Also, ID=_5285.location4 and ID=_5285.location5 do not match this pattern, so it is not clear what you are trying to achieve.
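
For illustration, here is a minimal sketch of what that pattern selects, run on a few lines copied from the input above (column spacing collapsed to single spaces, which awk's default field splitting treats the same way). The shell variable names `data` and `matched` are just placeholders for this example.

```shell
# A few lines from the sample input, kept in a shell variable so no file is needed.
data='21 templeta parent 35718 36554 . - . ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21 templeta kids 35718 36554 . - . ID=_52; Parent=parent_cluster_5085.21.11;
21 templeta location 35840 36073 . - . ID=_5285.location4; Parent=_5285
21 templeta pattern 35840 36073 . - 0 ID=_52.cds4; Parent=_5285
21 templeta kids 43191 43851 . + . ID=_5286; Parent=parent_cluster_5086.21.12;'

# Keep lines whose 9th field starts with ID=_52 followed by a non-digit.
matched=$(printf '%s\n' "$data" | awk '$9 ~ /^ID=_52[^0-9]/')
printf '%s\n' "$matched"
```

Only the `ID=_52;` and `ID=_52.cds4;` lines come out, because `[^0-9]` requires a non-digit right after `_52`. The parent line and the `_5285` and `_5286` lines are skipped.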

Yes, my first line does not contain "ID=_52".
So I plan to extract the content based on two conditions:

  1. Use the newline as the field separator.
  2. Once a line matches the word "ID=_52", extract the whole block it belongs to, including the first line.

Sorry if I misunderstood you :frowning:
Thanks for any advice on improving my awk code to achieve my desired goal :slight_smile:
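
The two conditions above could be sketched roughly like this. It is only a sketch, and it assumes the blocks are always separated by a blank line, as in the sample input. Setting RS to the empty string puts awk in paragraph mode, so each block is one record, and the pattern asks for a non-digit after "ID=_52" so that "ID=_5286" does not count. The shell variables `data` and `result` are placeholders for this example, and the sample lines are copied from the input with single spaces.

```shell
# Two blank-line-separated blocks taken from the sample input.
data='21 templeta parent 35718 36554 . - . ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21 templeta kids 35718 36554 . - . ID=_52; Parent=parent_cluster_5085.21.11;
21 templeta pattern 35840 36073 . - 0 ID=_52.cds4; Parent=_5285

21 templeta parent 43191 43851 . + . ID=parent_cluster_5086.21.12; Name=Partial%20parent%20for%20training%20set;
21 templeta kids 43191 43851 . + . ID=_5286; Parent=parent_cluster_5086.21.12;
21 templeta length 43191 43192 . + . ID=_5286.utr5p1; Parent=_5286'

# RS="" : each blank-line-separated block becomes one record.
# The pattern is tested against the whole block; a matching block is printed in full.
result=$(printf '%s\n' "$data" | awk 'BEGIN { RS = "" } /ID=_52[^0-9]/')
printf '%s\n' "$result"
```

With this input only the first block (three lines, starting with its parent line) should be printed.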

You mean something like this:

 awk '$9~"ID=parent_cluster"{h=$0;p=0} $9~"ID=_52;"{print h;p=1} p&&NF' infile
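
In case it helps to read it, here is the same one-liner spelled out with comments and run on a shortened copy of the sample input. The longer names `header` and `printing` (instead of `h` and `p`) and the shell variables `data` and `result` are only for this illustration. It assumes every block starts with an ID=parent_cluster line, as in the sample.

```shell
# Shortened copy of the sample input: the wanted block, a blank line, an unwanted block.
data='21 templeta parent 35718 36554 . - . ID=parent_cluster_50.21.11; Name=Partial%20parent%20for%20training%20set;
21 templeta kids 35718 36554 . - . ID=_52; Parent=parent_cluster_5085.21.11;
21 templeta pattern 35840 36073 . - 0 ID=_52.cds4; Parent=_5285

21 templeta parent 43191 43851 . + . ID=parent_cluster_5086.21.12; Name=Partial%20parent%20for%20training%20set;
21 templeta kids 43191 43851 . + . ID=_5286; Parent=parent_cluster_5086.21.12;
21 templeta length 43191 43192 . + . ID=_5286.utr5p1; Parent=_5286'

result=$(printf '%s\n' "$data" | awk '
  # A new block starts: remember its first line and stop printing.
  $9 ~ /ID=parent_cluster/ { header = $0; printing = 0 }
  # Field 9 holds ID=_52 followed by a semicolon: emit the saved first line, start printing.
  $9 ~ /ID=_52;/           { print header; printing = 1 }
  # While printing is on, print every non-empty line (NF is 0 on blank lines).
  printing && NF
')
printf '%s\n' "$result"
```

This should print the parent line, the `ID=_52;` line and the remaining line of that block, and nothing from the 5086 block.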

Thanks, Scrutinizer :slight_smile:
You are right.
Your awk code works perfectly and runs fast in my case ^^
Thanks.