Parsing syslog from Linux

Hello,
I'm having trouble extracting fields from the syslog line below:

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205189 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=12.0.1.1 dstport=443 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6 action="server-rst" policyid=10 policytype="policy" service="HTTPS" dstcountry="Germany" srccountry="Reserved" trandisp="snat" transip=11.1.1.1 transport=5092 duration=71 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"

I want a script that extracts only the required fields (for instance srcip=11.3.3.17) and writes their values to a text file, as below:

Required fields:

eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid

Output:

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817

Thanks for your help.

Any attempts / ideas / thoughts from your side?

In fact, I did try, but my script took a long time to run: the syslog is about 2 GB and processing took about 6 minutes, which is too long.

Show your script.

#!/bin/bash
cat  syslog.log | awk '{for(i=1;i<=NF;i++){if($i~/eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid/) printf " %s", $i};printf "\n" }' | sed 's/"//g;;s/[a-z]*.=//g;s/ /|/g;s/^|//g'

Try

awk '
BEGIN   {HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         print HDLN
        }
        {DL = ""
         for (i=1; i<=MX; i++)
            if (match ($0, HD[i] "=[^ ]*")) {
               L = length(HD[i]) + 1
               printf "%s%s", DL, substr ($0, RSTART + L, RLENGTH - L)
               DL = "|"
            }
         printf "\n" 
        }
' file
eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid
1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|"server-rst"|20639817
 

EDIT: If you can't be sure that all the requested fields exist in every line, remove the if construct:

awk '
BEGIN   {HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         print HDLN
        }
        {DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 L = length(HD[i]) + 1
                                 printf "%s%s", DL, substr ($0, RSTART + L, RLENGTH - L)
                                 DL = "|"
                                }
         printf "\n" 
        }
' file
eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid
1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|"server-rst"|20639817
1563205189|11.3.3.17|12.0.1.1||443|11.1.1.1|5092|"server-rst"|20639817
||||||||

In the second data line, the srcport is missing, and the third is empty entirely.

It took 27 minutes to process 2.8 GB!

The code has to match NF fields against 9 items for every line; this takes time, especially on large files. I timed your code against mine on a medium-sized sample file and found that yours is roughly two to three times slower, so I don't understand the 27 min for my code vs. 6 min for yours. Still, going through my proposal again and trying to squeeze out a few percent, I came up with

awk '
BEGIN   {print HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid"
         MX = split (HDLN, HD, "|")
         for (i=1; i<=MX; i++) L[i] = length (HD[i]) + 1
        }
        {OUT = DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 OUT = OUT DL  substr ($0, RSTART + L[i], RLENGTH - L[i])
                                 DL = "|"
                                }
         print OUT 
        }
' file

Pls try and report back, esp. in comparison to your code in post #5 (don't forget you'll need to match the fields' sequence to the header's).

Hi, also try something like this:

awk '
  NR==FNR {
    split($0,Label,"|")
    next
  }
  {
    split("", Key)           # clear values left over from the previous line
    for(i=1; i<=NF; i++) {
      split($i,F,"=")
      gsub(/"/,x,F[2])
      Key[F[1]]=F[2]
    }
    $0=x
    for(i in Label) 
      $i=Key[Label[i]]
    print
  }
' OFS=\| file2 file1

Where file2 contains:

eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid

Output:

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817

I did some testing and RudiC's second method turned out to be fastest.

I would suggest trying mawk.

In tests I ran with RudiC's approach, mawk was several orders of magnitude faster than regular awk or gawk.
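
To check a claim like this on your own data, a small timing helper can run the same awk program under each interpreter that happens to be installed. This is only a sketch: the function name bench_awk is made up for illustration, it assumes the awk program has been saved to a file, and it relies on `date +%s`, which is widely available but only has one-second resolution (good enough for multi-gigabyte logs).

```shell
# bench_awk PROGFILE DATAFILE
# Hypothetical helper: run the awk program in PROGFILE over DATAFILE with
# every interpreter found on PATH and print the elapsed wall-clock seconds.
bench_awk() {
    prog=$1
    data=$2
    for interp in awk mawk gawk; do
        # Skip interpreters that are not installed.
        command -v "$interp" >/dev/null 2>&1 || continue
        start=$(date +%s)
        "$interp" -f "$prog" "$data" >/dev/null
        end=$(date +%s)
        printf '%s\t%ss\n' "$interp" "$((end - start))"
    done
}
```

Usage would be something like `bench_awk extract.awk syslog.log`, where extract.awk holds one of the programs from this thread without the surrounding quotes. Run it twice so that the file cache does not favour whichever interpreter comes second.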


thanks but it does not work for me

What does not work?

You could try using index and substr instead of match to avoid the regex overhead. This takes about 2 minutes for a 2 GB file on my system:

awk '
BEGIN   {
    HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|" \
           "action|sessionid"
    MX = split (HDLN, HD, "|")
    print HDLN
}
{
  DL = ""
  for (i=1; i<=MX; i++)  {
      s=index($0, HD[i] "=")
      if(s) {
          s += length(HD[i]) + 1
          e=index(substr($0,s)," ")-1
          if (e < 0) e = length($0)     # last field on the line: no trailing space
          printf "%s%s", DL, substr($0, s, e)
      } else printf "%s", DL
      DL = "|"
  }
  printf "\n"
}' infile

--- Post updated at 09:40 AM ---

As a further test I used the above logic in C, and it finished in 1min 20sec on my system. This has to be close to the fastest you could expect:

#include <stdio.h>
#include <string.h>

int main()
{
   char line_buff[1024];
   int i;
   char *s;
   char dl[2] = "";
   char *match[] = {
     "eventtime=",
     "srcip=",
     "dstip=",
     "srcport=",
     "dstport=",
     "transip=",
     "transport=",
     "action=",
     "sessionid=",
      NULL };


   printf("%.*s", (int)strlen(match[0])-1, match[0]);
   for(i=1;match[i];i++) printf("|%.*s", (int)strlen(match[i])-1, match[i]);
   printf("\n");

   while (!feof(stdin)) {
       if (fgets(line_buff, 1024, stdin)) {
           dl[0]='\0';
           for(i=0;match[i];i++) {
              s=strstr(line_buff, match[i]);
              if(s) {
                printf("%s", dl);
                s+=strlen(match[i]);
                /* copy the value up to the next space or the end of the line */
                while(*s && *s!=' ' && *s!='\n') printf("%c", *(s++));
              } else printf("%s", dl);
              strcpy(dl, "|");
            }
           printf("\n");
       }
   }
   return 0;
}

Thanks guys, but the log has grown since then: some fields have been added and others removed. What I need now is help with input like this:

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205189 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=12.0.1.1 dstport=443 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6 action="server-rst" policyid=10 policytype="policy" service="HTTPS" dstcountry="Germany" srccountry="Reserved" trandisp="snat" transip=11.1.1.1 transport=5092 duration=71 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"

logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205190 srcip=11.3.3.17 srcport=50544 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=13.0.1.1 dstport=80 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639824 proto=6 action="close" policyid=34 policytype="policy" service="UDP" dstcountry="United State" srccountry="Reserved" trandisp="snat" transip=14.1.1.1 transport=5092 duration=50 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"


logver=56 idseq=63256900099118326 itime=1563205190 devid=FG-5KDTB18800138 devname=LAL-C1-FGT-03 vd=USER date=2019-07-15 time=18:39:49 logid="0000000013" type="traffic" subtype="forward" level="notice" eventtime=1563205100 srcip=11.3.3.17 srcport=50590 srcintf="SGI-CORE.123" srcintfrole="undefined" dstip=1.0.1.1 dstport=80 dstintf="FA-SPI.100" dstintfrole="undefined" poluuid="230d4d26-AAAA-51e9-b9d1-7bf4c828f000" sessionid=20639817 proto=6  policyid=34 policytype="policy" service="UDP/10"  srccountry="Reserved" trandisp="snat"  duration=60 sentbyte=093 rcvdbyte=213 sentpkt=11 rcvdpkt=16 appcat="unscanned"



eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|service|policyid|dstcountry|duration

The header does not need to appear in the output. If a field is missing, it should be left empty between the pipe delimiters, as in line 3 (||).

1563205189|11.3.3.17|12.0.1.1|50544|443|11.1.1.1|5092|server-rst|20639817|HTTPS|10|Germany|71
1563205190|11.3.3.17|13.0.1.1|50544|80|14.1.1.1|5092|close|20639824|UDP|34|United State|50
1563205100|11.3.3.17|1.0.1.1|50590|80||||20639817|UDP/10|34||60

I cannot run this; I'm getting an "unknown type" error!

Can you add dstcountry to the BEGIN action? Keep in mind that the value could be "United State", "Germany" or "South Africa". Here the tab is not working with awk; it only shows me "United".

Having the field separator character inside the target data makes things complicated, all the more so because it may occur an unpredictable number of times in a value, as in "United States of America". So additional text processing is needed. There are, as always, several approaches, of which this one seems to be the fastest, although it adds around 10% to the run time:

awk '
BEGIN   {print HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|dstcountry"
         MX = split (HDLN, HD, "|")
         for (i=1; i<=MX; i++) L[i] = length (HD[i]) + 1
        }
        {OUT = DL = ""
         for (i=1; i<=MX; i++)  {match ($0, HD[i] "=[^ ]*")
                                 TMP =  substr ($0, RSTART + L[i], RLENGTH - L[i])
                                if (gsub (/\"/, "&", TMP) %2)  {TMP2 = substr ($0, RSTART + RLENGTH)
                                                                TMP  = TMP substr (TMP2, 1, index (TMP2, "\""))
                                                                }
                                 OUT = OUT DL TMP
                                 DL = "|"
                                }
         print OUT 
        }
' file

Please check and report back.

Here is a modification of my solution at post #13 for the dstcountry requirement:

awk '
BEGIN   {
    HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|" \
           "action|sessionid|dstcountry"
    MX = split (HDLN, HD, "|")
    print HDLN
}
{
  DL = ""
  for (i=1; i<=MX; i++)  {
      s=index($0, HD[i] "=")
      if(s) {
          s += length(HD[i]) + 1
          if (substr($0,s,1) == "\"")
            e=index(substr($0,s+1),"\"")+1
          else {
              e=index(substr($0,s)," ")-1
              if (e < 0) e = length($0)     # last field on the line: no trailing space
          }
          printf "%s%s", DL, substr($0, s, e)
      } else printf "%s", DL
      DL = "|"
  }
  printf "\n"
}' file

Or use

if (substr($0,s,1) == "\"")
    e=index(substr($0,++s),"\"")-1
else
    ...

in place of above, if you don't want the quotes in the output.
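
For the later 13-field request (no header line, quotes stripped, values such as "United State" kept whole, missing fields left empty), the index/substr approach and the quote handling above can be combined. What follows is a minimal sketch, not something posted in the thread. The function name extract_fields is made up; searching for " key=" in a space-padded copy of the line and skipping blank lines are assumptions of the sketch.

```shell
# extract_fields [FILE...]
# Print the selected key=value fields of each log line as pipe-separated
# values, without a header line and without the surrounding double quotes.
extract_fields() {
    awk '
    BEGIN {
        HDLN = "eventtime|srcip|dstip|srcport|dstport|transip|transport|action|sessionid|service|policyid|dstcountry|duration"
        MX = split(HDLN, HD, "|")
        # Search for " key=" so that a short key cannot match inside a longer
        # one (for example "time=" inside "itime=").
        for (i = 1; i <= MX; i++) {
            KEY[i] = " " HD[i] "="
            LEN[i] = length(KEY[i])
        }
    }
    NF {                                  # assumption: blank lines are skipped
        line = " " $0 " "                 # pad so every field has a space on both sides
        OUT = DL = ""
        for (i = 1; i <= MX; i++) {
            val = ""
            s = index(line, KEY[i])
            if (s) {
                rest = substr(line, s + LEN[i])
                if (substr(rest, 1, 1) == "\"") {
                    # Quoted value: take everything up to the closing quote.
                    # An unterminated quote leaves the field empty.
                    rest = substr(rest, 2)
                    e = index(rest, "\"")
                    if (e) val = substr(rest, 1, e - 1)
                } else {
                    # Bare value: take everything up to the next space.
                    val = substr(rest, 1, index(rest, " ") - 1)
                }
            }
            OUT = OUT DL val
            DL = "|"
        }
        print OUT
    }' "$@"
}
```

Called as `extract_fields syslog.log > out.txt`, this should produce one pipe-separated row per non-empty log line. It does one index call per requested field rather than a regex match, so its run time should be in the same range as the index-based version above, but that has not been measured here.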