Parsing syslog from Linux

arm · July 20, 2019, 4:58am

Hello,
I'm having trouble extracting fields from the syslog line below:

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205189 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=12.0.1.1 dstport=443 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6 action="server-rst" policyid=10 policytype="policy" service="HTTPS" dstcountry="Germany" srccountry="Reserved" trandisp="snat" transip=11.1.1.1 transport=5092 duration=71 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"

I want a script that keeps only the required fields (for instance srcip=11.3.3.17) and writes their values to a text file, as below:

Required:

eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid

output :

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817

Thanks for your help.

RudiC · July 20, 2019, 8:19am

Any attempts / ideas / thoughts from your side?

arm · July 20, 2019, 1:48pm

I did try, but my script took a long time to run: the syslog is about 2 GB and processing it took about 6 minutes, which is too long.

RudiC · July 20, 2019, 2:01pm

Show your script.

arm · July 20, 2019, 2:11pm

#!/bin/bash
cat  syslog.log | awk '{for(i=1;i<=NF;i++){if($i~/eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid/) printf " %s", $i};printf "\n" }' | sed 's/"//g;;s/[a-z]*.=//g;s/ /|/g;s/^|//g'

RudiC · July 20, 2019, 3:20pm

Try

awk '
BEGIN   {HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         print HDLN
        }
        {DL = ""
         for (i=1; i<=MX; i++)  if (match ($0, HD[i] "=[^ ]*")) {L = length(HD[i]) + 1
                                                                 printf "%s%s", DL, substr ($0, RSTART + L, RLENGTH - L)
                                                                 DL = "|"
                                                                }
         printf "\n"
        }
' file
eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid
1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|"server-rst"|20639817

EDIT: If you can't be sure that all the requested fields exist in every line, remove the if construct so that a missing field still produces an empty column:

awk '
BEGIN   {HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         print HDLN
        }
        {DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 L = length(HD[i]) + 1
                                 printf "%s%s", DL, substr ($0, RSTART + L, RLENGTH - L)
                                 DL = "|"
                                }
         printf "\n"
        }
' file
eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid
1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|"server-rst"|20639817
1563205189|11.3.3.17|12.0.1.1||443|11.1.1.1|5092|"server-rst"|20639817
||||||||

In the second data line, the srcport is missing, and the third is empty entirely.

arm · July 21, 2019, 2:46am

It took 27 minutes to process 2.8 GB!

RudiC · July 21, 2019, 3:33am

The code has to match NF fields against 9 items for every line; this will take its time, esp. on large files. I compared (timed) your code to mine on a medium-sized sample data file and found that yours is roughly two to three times slower, so I don't understand the 27 min for my code vs. 6 min for yours. Still, going through my proposal again and trying to tease out a few percent, I came up with

awk '
BEGIN   {print HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         for (i=1; i<=MX; i++) L[i] = length (HD[i]) + 1        # precompute key lengths once
        }
        {OUT = DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 OUT = OUT DL  substr ($0, RSTART + L[i], RLENGTH - L[i])
                                 DL = "|"
                                }
         print OUT
        }
' file

Pls try and report back, esp. in comparison to your code in post #5 (don't forget you'll need to match the fields' sequence to the header's).

Scrutinizer · July 21, 2019, 5:06am

Hi, also try something like this:

awk '
  NR==FNR {
    split($0,Label,"|")
    next
  }
  {
    split("",Key)               # forget the values of the previous line
    for(i=1; i<=NF; i++) {
      split($i,F,"=")
      gsub(/"/,x,F[2])
      Key[F[1]]=F[2]
    }
    $0=x
    for(i in Label)
      $i=Key[Label[i]]
    print
  }
' OFS=\| file2 file1

Where file2 contains:

eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid

Output:

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817

Scrutinizer · July 21, 2019, 9:04am

I did some testing and RudiC's second method turned out to be fastest.

I would suggest trying mawk.

In the tests I conducted with RudiC's approach, mawk was several orders faster than regular awk or gawk.
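
One way to act on that suggestion, as a rough sketch: pick whichever awk variant is installed, trying mawk first. The function name pick_awk is made up for illustration, and the order mawk, gawk, awk is only an assumption based on the timings reported in this thread, not a general rule.

```shell
# pick_awk: hypothetical helper, not from the thread.
# Prints the name of the first awk variant found on PATH.
pick_awk() {
  for candidate in mawk gawk awk; do
    if command -v "$candidate" >/dev/null 2>&1; then
      printf '%s\n' "$candidate"
      return 0
    fi
  done
  return 1   # no awk implementation found
}
```

It could then be used as "$(pick_awk)" -f extract.awk syslog.log, where extract.awk is a placeholder for any of the awk programs in this thread saved to a file.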

arm · July 21, 2019, 2:28pm

thanks but it does not work for me

Scrutinizer · July 21, 2019, 3:18pm

What does not work?

Chubler_XL · July 22, 2019, 6:40pm

You could try using index and substr instead of match to avoid the regex overhead. This takes about 2 minutes for a 2 GB file on my system:

awk '
BEGIN   {
    HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|" \
           "action|sessionid"
    MX = split (HDLN, HD, "|")
    print HDLN
}
{
  DL = ""
  for (i=1; i<=MX; i++)  {
      s=index($0, HD[i] "=")
      if(s) {
          s += length(HD[i]) + 1
          e=index(substr($0,s)," ")-1
          if (e < 0) e = length($0)    # last field on the line: take the rest
          printf "%s%s", DL, substr($0, s, e)
      } else printf "%s", DL
      DL = "|"
  }
  printf "\n"
}' infile

--- Post updated at 09:40 AM ---

As a further test I used the above logic in C, and it finished in 1min 20sec on my system. This has to be close to the fastest you could expect:

#include <stdio.h>
#include <string.h>

int main(void)
{
   char line_buff[1024];   /* lines longer than this are processed in pieces */
   int i;
   char *s;
   char dl[2] = "";
   const char *match[] = {
     "eventtime=",
     "srcip=",
     "dstip=",
     "srcport=",
     "dstport=",
     "transip=",
     "transport=",
     "action=",
     "sessionid=",
      NULL };

   /* header: every key without its trailing '=' */
   printf("%.*s", (int)strlen(match[0])-1, match[0]);
   for(i=1;match[i];i++) printf("|%.*s", (int)strlen(match[i])-1, match[i]);
   printf("\n");

   while (fgets(line_buff, sizeof line_buff, stdin)) {
       dl[0]='\0';
       for(i=0;match[i];i++) {
          s=strstr(line_buff, match[i]);
          printf("%s", dl);
          if(s) {
            s+=strlen(match[i]);
            /* copy the value up to the next space or the end of the line */
            while(*s && *s!=' ' && *s!='\n') putchar(*(s++));
          }
          strcpy(dl, "|");
        }
       printf("\n");
   }
   return 0;
}

arm · July 26, 2019, 4:22am

Thanks guys, but the log has grown since then: some fields have been added and others removed. What I need now is help turning lines like these:

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205189 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=12.0.1.1 dstport=443 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6 action="server-rst" policyid=10 policytype="policy" service="HTTPS" dstcountry="Germany" srccountry="Reserved" trandisp="snat" transip=11.1.1.1 transport=5092 duration=71 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205190 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=13.0.1.1 dstport=80 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639824 proto=6 action="close" policyid=34 policytype="policy" service="UDP" dstcountry="United State" srccountry="Reserved" trandisp="snat" transip=14.1.1.1 transport=5092 duration=50 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"


logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205100 srcip=11.3.3.17 srcport=50590 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=1.0.1.1 dstport=80 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6  policyid=34 policytype="policy" service="UDP/10"  srccountry="Reserved" trandisp="snat"  duration=60 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"



eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|service|policyid|dstcountry|duration

The header does not need to appear in the output, and any missing field should be left empty between the pipe delimiters (||), as in line 3 below:

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817|HTTPS|10|Germany|71
1563205190|11.3.3.17|13.0.1.1|50544|80|14.1.1.1|5092|close|20639824|UDP|34|United State|50
1563205100|11.3.3.17|1.0.1.1|50590|80||||20639817|UDP/10|34||60

arm · July 26, 2019, 2:39pm

I cannot run this; I'm getting an "unknown type" error!

arm · July 27, 2019, 5:59am

Can you add dstcountry to the BEGIN action? Keep in mind that the value could be "United State" or "Germany" or "South Africa". The embedded space does not work with awk; it only shows me "United

RudiC · July 27, 2019, 8:16am

Having the field separator character included in the target data complicates things, as does the fact that it may occur an unpredictable number of times within a data field, as in "United States of America". So additional text processing needs to be done. There are, as always, several approaches, of which this one seems to be the fastest, although it needs around 10% more computing time:

awk '
BEGIN   {print HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|dstcountry"
         MX = split (HDLN, HD, "|")
         for (i=1; i<=MX; i++) L[i] = length (HD[i]) + 1
        }
        {OUT = DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 TMP =  substr ($0, RSTART + L[i], RLENGTH - L[i])
                                 # odd number of quotes: the value was cut at a space, so append up to the closing quote
                                 if (gsub (/\"/, "&", TMP) %2)  {TMP2 = substr ($0, RSTART + RLENGTH)
                                                                 TMP  = TMP substr (TMP2, 1, index (TMP2, "\""))
                                                                }
                                 OUT = OUT DL TMP
                                 DL = "|"
                                }
         print OUT
        }
' file

Please check and report back.

Chubler_XL · July 27, 2019, 5:16pm

Here is a modification of my solution at post #13 for the dstcountry requirement:

awk '
BEGIN   {
    HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|" \
           "action|sessionid|dstcountry"
    MX = split (HDLN, HD, "|")
    print HDLN
}
{
  DL = ""
  for (i=1; i<=MX; i++)  {
      s=index($0, HD[i] "=")
      if(s) {
          s += length(HD[i]) + 1
          if (substr($0,s,1) == "\"")
              e=index(substr($0,s+1),"\"")+1     # quoted value: up to and including the closing quote
          else {
              e=index(substr($0,s)," ")-1
              if (e < 0) e = length($0)          # last field on the line: take the rest
          }
          printf "%s%s", DL, substr($0, s, e)
      } else printf "%s", DL
      DL = "|"
  }
  printf "\n"
}' file

Or use

if (substr($0,s,1) == "\"")
    e=index(substr($0,++s),"\"")-1
else
    ...

in place of above, if you don't want the quotes in the output.
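
For the 13-field request of July 26 (no header line, empty columns for missing fields, quotes stripped, spaces allowed inside quoted values), the index/substr approach can be combined into one sketch. The function name extract_fields is made up for illustration, and the sketch assumes that keys are always preceded by a space or start the line and that quoted values contain no embedded double quotes. Searching for " key=" rather than "key=" is a precaution so that a key cannot match the tail of a longer key name. Timing on a multi-GB file has not been measured.

```shell
# extract_fields: hypothetical helper name, not from the thread.
# Reads key=value log lines on stdin, prints the selected values pipe-delimited.
extract_fields() {
  awk '
  BEGIN {
      MX = split("eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|service|policyid|dstcountry|duration", HD, "|")
  }
  NF {                                   # skip blank lines
      LINE = " " $0                      # leading space so that every key is preceded by one
      OUT = ""
      for (i = 1; i <= MX; i++) {
          VAL = ""
          s = index(LINE, " " HD[i] "=")
          if (s) {
              REST = substr(LINE, s + length(HD[i]) + 2)   # text right after key=
              if (substr(REST, 1, 1) == "\"") {            # quoted value: ends at the next quote
                  REST = substr(REST, 2)
                  e = index(REST, "\"")
              } else {                                     # bare value: ends at the next space
                  e = index(REST, " ")
              }
              VAL = e ? substr(REST, 1, e - 1) : REST      # no terminator: take the rest of the line
          }
          OUT = OUT (i > 1 ? "|" : "") VAL
      }
      print OUT
  }'
}
```

A possible invocation is extract_fields < syslog.log > out.txt. To change the selection, edit the key list in the BEGIN block; the output columns follow the order of that list.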