Find key pattern and print selected lines for each record

Hi,

I need help with a complicated file that I am working on. I want to extract the important information from a very large, space-delimited file that contains hundreds of thousands of records. An example of the input file's content is below:

##
ID    Ser402             Old;         23 mins .
ACC   P669GM;
DAT   MAY-2014, the old episode.
TOS   Japanes Anime. one piece
TMA   Pirates; animation; cartoon.
POT   DownloadID=5445;
HEW   StreamID=792; watchop (eu).
HEW   AnotherOnlineID=823; narutowire (same).
COM   -@- Simple Comment: Ace died and Luffy is miserable. 
COM      None of his nakama was with him {SOV:000250}.
COM   -@- Full Comment: Host channel {SOV:000305}; Multi-chanel
COM      streaming {SOV:000305}.
COM   -@- Another Comment: Belongs to the same server.
COM      {SOV:000305}.
COM  -----------------------------------------------------------------------
COM   Can be watched online, see http://www.watchop.eu
DOR   Data; packet; -; Unknown; Anime.
DOR   TDP; TDP:0034; PPQ:host for sub channel; ASA:Subchannel.
DOR   TDP; TDP:0021; PPQ:internal channel; ASA:Unknown.
PPE   Torrent unapplicable;
KAW   Complete episode; Early release; Host channel;
KAW   Repeat; subchannel; subchannel host.
FEA   link          1    20         unavailable
FEA                                /F3184.
FEA   TOP_CHAN      1      1       unavailable (will be determined).
FEA   SUBCHAN       2      18      at 9 (confirmed!).
FEA   TOP_CHAN      19     117     unavailable (No info).
FEA   SUBCHAN       118    138     at 10 (confirmed!).
FEA   TOP_CHAN      139    145     unavailable (will be determined).
FEA   SUBCHAN       146    166     at 12 (confirmed!).
FEA   TOP_CHAN      167    269     unavailable (the source is unknown).
FEA   REP           1      146     A.
FEA   CAD           75     75      by host.
FEA                                {undetermined}.
SYN   synopsis for this episode is unavailable.
##
ID    MOV10               NewMov;         90 mins.
ACC   PPDFB1;
TOS   Japanes Anime. Naruto shippuden
TMA   Ninja; shinobi, konoha; hokage; Pain.
CC    Distributed under the Creative License
CC   -----------------------------------------------------------------------
DOR   Data; packet; -; Unknown; Anime movie.
DOR   movie; new movie; 90 mins only
DOR   MOVID; 299; -.
DOR   MOV3D; -; 1.
PPE   10; torrent
KAW   new movie; Complete movie.
FEA   Null         1    683        Unknown
FEA                                /F82.
FEA   mov       62    124       (SOV:005).
FEA   mov      155    259       (SOV:005).
FEA   mov      346    376       (SOV:025).
SYN   In this episode, Dresrossa has been surrounded by a cage known as birdcage by doflamingo.
      Luffy is moving towards the palace to defeat Doflamingo. 
##

All the records in this file are separated by "##". What I need is an output that shows only the needed info for records whose KAW line matches the pattern "subchannel" or "subchannel host". In the example input, only the first record has this pattern, so the output should look like this:

##
ID       Ser402
ACC	  P669GM
TOS     Japanes Anime. one piece
TMA     Pirates; animation; cartoon.
COM    -@- Full Comment: Host channel {SOV:000305}; Multi-chanel
COM       streaming {SOV:000305}.
DOR     TDP; TDP:0034; PPQ:host for sub channel; ASA:Subchannel.
DOR     TDP; TDP:0021; PPQ:internal channel; ASA:Unknown.
KAW     Complete episode; Early release; Host channel;
KAW     Repeat; subchannel; subchannel host.
FEA      link          1    20         unavailable
FEA                                /F3184.
FEA      TOP_CHAN     1      1       unavailable (will be determined).
FEA      SUBCHAN       2      18      at 9 (confirmed!).
FEA      TOP_CHAN     19     117     unavailable (No info).
FEA      SUBCHAN       118    138     at 10 (confirmed!).
FEA      TOP_CHAN      139    145     unavailable (will be determined).
FEA      SUBCHAN        146    166     at 12 (confirmed!).
FEA      TOP_CHAN      167    269     unavailable (the source is unknown).
FEA      REP                    1      146     A.
FEA      CAD                   75     75      by host.
FEA                                                   {undetermined}.
TT        3
##

As shown above, for lines starting with COM, I only want the one with "-@- Full Comment" and the COM line following it, if any. I also need to print only those DOR lines that are followed by TDP. Finally, the last line of each record should be a new line named "TT", whose value is the total number of occurrences of the pattern "FEA SUBCHAN" in that record.

I don't have any idea how to print only the selected lines. I used the code below to find the key pattern, but it prints all the lines of every matched record. I just need the selected lines shown in the sample output above.

awk '/##/{if(l)print s;l=0;s=$0;next}/subchannel/{l=1}{s=s RS $0}END{if(l)print s}' inputfile

I would appreciate your kind help. Thanks.

How about this:

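awk '
BEGIN{
   # build a lookup table of the tags whose lines are always kept
   for(i=split("ID ACC TOS TMA KAW FEA TT", k);i;i--) keep[k[i]];
}
$1 in keep     { s=s "\n" $0 }
/^DOR[ \t]+TDP/{ s=s "\n" $0 }
# keep the Full Comment line plus the line that follows it
$1=="COM" && /-@- Full Comment/ {
   s=s "\n" $0; getline
   s=s "\n" $0
}
/^FEA.*SUBCHAN/ { tt++ }
# end of record: print it only if a KAW line matched, then reset
$1=="##"&&s {
   if(prn) print "##" s "\nTT   " tt
   s=""
   tt=prn=0
}
/^KAW.*subchannel/ {prn++}
END { print "##" } ' infile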
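
One thing to be aware of: the getline in the script above always takes the line after "-@- Full Comment", whatever it is. That is fine for the sample, where a continuation line follows, but a one-line Full Comment would pull in the next comment (or the next tag) as well. Below is a rough variant that tracks the current comment with a flag instead. The flag name full, the cut-down sample data, and the assumptions that every comment starts with "-@-" and that a dashed COM line ends a comment are mine, not something confirmed by the real data.

```shell
# Cut-down sample: one record that matches, one that does not.
sample='##
ID    Ser402             Old;         23 mins .
COM   -@- Simple Comment: Ace died and Luffy is miserable.
COM      None of his nakama was with him {SOV:000250}.
COM   -@- Full Comment: Host channel {SOV:000305}; Multi-chanel
COM      streaming {SOV:000305}.
COM   -@- Another Comment: Belongs to the same server.
DOR   Data; packet; -; Unknown; Anime.
DOR   TDP; TDP:0034; PPQ:host for sub channel; ASA:Subchannel.
KAW   Repeat; subchannel; subchannel host.
FEA   SUBCHAN       2      18      at 9 (confirmed!).
FEA   TOP_CHAN      19     117     unavailable (No info).
FEA   SUBCHAN       118    138     at 10 (confirmed!).
##
ID    MOV10               NewMov;         90 mins.
KAW   new movie; Complete movie.
FEA   mov       62    124       (SOV:005).
##'

out=$(printf '%s\n' "$sample" | awk '
BEGIN { for (i = split("ID ACC TOS TMA KAW FEA", k); i; i--) keep[k[i]] }
$1 == "##" {                          # record boundary: flush, then reset
    if (prn) print "##" s "\nTT   " tt
    s = ""; tt = prn = full = 0
    next
}
$1 == "COM" {
    if ($2 == "-@-")        full = ($0 ~ /-@- Full Comment/)  # new comment starts
    else if ($2 ~ /^-+$/)   full = 0                          # dashed separator
    if (full) s = s "\n" $0           # Full Comment header or its continuation
    next
}
{ full = 0 }                          # any non-COM line ends the comment
$1 in keep          { s = s "\n" $0 }
/^DOR[ \t]+TDP/     { s = s "\n" $0 }
/^FEA[ \t]+SUBCHAN/ { tt++ }
/^KAW.*subchannel/  { prn = 1 }
END { print "##" }')

printf '%s\n' "$out"
```

On this sample it prints only the first record, with the two Full Comment lines, the DOR TDP line and "TT   2" at the end. It also keeps every continuation line of a Full Comment, not just the first one.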

Hi Chubler_XL,

The code worked perfectly on my real data! So the split function can be used to get selected lines. Thank you very much!
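
A small note on that last point, in case it helps later readers: split only turns the tag list into an array; the actual selection is done by the "$1 in keep" test against an array whose keys are the tags. A minimal sketch of just that part, with made-up input lines and only two tags kept:

```shell
out=$(printf 'ID one\nDAT two\nACC three\nPOT four\n' | awk '
BEGIN {
    n = split("ID ACC", k)                 # k[1]="ID", k[2]="ACC", n=2
    for (i = 1; i <= n; i++) keep[k[i]]    # referencing an element creates the key
}
$1 in keep                                 # pattern with no action: print the line
')
printf '%s\n' "$out"
```

This prints the "ID one" and "ACC three" lines and drops the other two. Adding or removing a tag is then just a change to the string passed to split.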