What is the most effective way to process a large logfile?

I am dealing with a very large firewall logfile (more than 10 GB).
The logfile looks like this:

*snip*
Nov 9 10:12:01 testfirewall root: [ID 702911 local5.info]

Nov 9 10:12:01 testfirewall root: [ID 702911 local5.info] 0:00:11 accept testfw01-hme0 >hme0 proto: icmp;
src: test001.example.net; dst: abc.dst.net; rule: 1; icmp-type: 8; icmp-code: 0; product: VPN-1 & Fire
*snip*

I don't need any line containing "icmp" or "snmp". Since there are many lines with no real content (like the first line in the example, with nothing after local5.info), I grep for "src" first. Then I keep only the lines whose 16th field does not start with 192.12 or 192.34 and does not contain "test", print several fields separated by a tab (\t) instead of a space, and finally delete every ";" character from the output.

My command is as follows:

egrep -vi "icmp|snmp" /logs/logfile | egrep -i "src" | awk '$16!~/(^192.(12|34)|.*test.*)/' | awk 'BEGIN {OFS="\t"} {print $1$2, $11,$10,$14,$16,$18,$20," ",$26}' | sed 's/;//g' > /tmp/logfile2

I don't think my way is efficient, so can anyone here give me some suggestions on how to better organize my command? Thank you!

You can do all the work within a single awk program:

awk '
     BEGIN {
        OFS = "\t"
     }
     {
        # lower-case copy of the record for the case-insensitive tests
        l0 = tolower($0)
     }
     # skip icmp/snmp lines, lines without "src", and unwanted sources in field 16
     l0 ~ /icmp|snmp/ || l0 !~ /src/ || $16 ~ /^192\.(12|34)|test/ {
        next
     }
     {
        gsub(/;/, "")        # drop every ";" before printing
        print $1$2, $11, $10, $14, $16, $18, $20, " ", $26
     }
   ' /logs/logfile > /tmp/logfile2
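
Before the full 10 GB run, you may want to sanity-check the script on a small sample (here filter.awk is just a placeholder name for the program above saved to a file):

     head -n 100000 /logs/logfile > /tmp/sample
     awk -f filter.awk /tmp/sample | head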

Jean-Pierre.

Thank you! aigles

I am just wondering: is this the most efficient way to do the job?

I will compare the time difference.
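
For example, something along these lines (filter.awk is a placeholder name for the awk program above; the numbers will of course depend on the system):

     time ( egrep -vi "icmp|snmp" /logs/logfile | egrep -i "src" | \
            awk '$16!~/(^192.(12|34)|.*test.*)/' | \
            awk 'BEGIN {OFS="\t"} {print $1$2,$11,$10,$14,$16,$18,$20," ",$26}' | \
            sed 's/;//g' > /tmp/pipeline.out )

     time awk -f filter.awk /logs/logfile > /tmp/awk.out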

You want a single process, and this does that. A Perl or ksh solution might beat it by a little, provided it carefully uses only built-in commands and never invokes anything external. Perl and ksh compile the script, while awk does not. And a custom C program can beat anything else.
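
As a rough, untested sketch, the same filtering in Perl might look like this (field numbers and output layout copied from the awk version above):

     perl -lane '
         next if /icmp|snmp/i || !/src/i;                        # drop icmp/snmp lines and lines without "src"
         next if $F[15] =~ /^192\.(12|34)/ || $F[15] =~ /test/;  # 16th field filter
         s/;//g for @F;                                          # strip semicolons
         print join("\t", $F[0].$F[1], @F[10,9,13,15,17,19], " ", $F[25]);
     ' /logs/logfile > /tmp/logfile2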

Your 5-stage pipeline will not be even close to a single process. Even if you have 5 CPUs available that can be dedicated to the pipeline, all of that reading and writing to pipes is expensive. (Anything is expensive when you do it many millions of times.) And you probably do not have 5 CPUs available for the entire run. Without 5 dedicated CPUs you will need to context switch several million times as well.

Why don't you split the file into smaller files of about 1 GB each? Then use Pederarbo's awk script to go through each of the split files. Since awk works on a data stream, there is little that is faster than feeding it one continuous stream.
After you are done cleansing the files, you could append the results into a single file.

Splitting the files is just an idea; it might save time because you would be handling small sets of data flowing in one continuous stream rather than one large 10 GB file.
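
One way to try that idea (the chunk size and file names are only illustrative, and filter.awk is again a placeholder for the awk program above):

     mkdir -p /tmp/chunks
     split -l 5000000 /logs/logfile /tmp/chunks/log.    # split on line boundaries, very roughly 1 GB per piece
     for f in /tmp/chunks/log.*; do
         awk -f filter.awk "$f" >> /tmp/logfile2
     done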

I really appreciate your advice, guys, have a great day :smiley: :smiley: :cool:

Splitting the file won't help. Now you have added reading and rewriting 10 GB of data to your list of things to do.

Process the awk script through a2p to convert it to a Perl program.
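
For example (assuming the awk program has been saved as filter.awk, a name used here only for illustration):

     a2p filter.awk > filter.pl
     perl filter.pl /logs/logfile > /tmp/logfile2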

a2p is great!

It's good to go through the forum and get to know more things.

cheers