Hi,
I'm trying to list the IP addresses that call our Apache server, together with the requests each one made. The standard format of the input is:
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
I'm using the gawk code below to do this (that is, accumulating all requests for all IPs in a given input file):
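gawk --re-interval -F\" '
# With the double quote as field separator, $1 is the text before the first quote and $2 is the request.
/./ { split($1,IP," "); IPPP[IP[2]]++; LINE[IP[2]]=LINE[IP[2]]"<br>"$2; }
# Print LINE[i], the string accumulated for IP i, not the bare array name LINE.
END { for(i in LINE){ printf("\n\n%s\t%s",i,LINE[i]) } }' other_vhosts_access.log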
The problem:
The input file is actually around 47 GiB in size. Because the LINE array is only printed in the END block of gawk, every request has to be held in memory until then; the process consumes all the available memory and the system starts running out of memory for all other processes.
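For comparison, here is a minimal sketch that keeps only the per-IP counter and drops the request text entirely, so memory grows with the number of distinct IPs rather than with the size of the log. The function name count_ips is made up for illustration, and the sketch assumes the IP is always the second whitespace-separated field, as in the format above. It uses plain awk, which should behave the same under gawk.

```shell
# Hypothetical helper: count requests per IP without storing any request text.
# Assumes the IP is the second whitespace-separated field of every line.
count_ips() {
    awk '
        NF >= 2 { count[$2]++ }    # one small counter per distinct IP
        END     { for (ip in count) printf("%s\t%d\n", ip, count[ip]) }
    ' "$1"
}
```

It would be called as `count_ips other_vhosts_access.log`. The order of `for (ip in count)` is unspecified, so pipe the result through `sort` if a stable order matters. This only gives the counts, not the list of requests per IP.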
Question:
Can I print from the main rule (the one that runs for every input line) rather than from the END block, so that every matched IP is output as soon as it is read, instead of being added to an array and displayed at the end?
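A minimal sketch of what I have in mind, assuming the same field layout as my code above: each line is turned into an "IP, tab, request" record and printed immediately, and nothing is kept in an awk array. The function name stream_ip_requests is made up for illustration.

```shell
# Hypothetical helper: emit one IP<TAB>request record per input line.
# Nothing is stored in awk arrays, so awk itself holds only the current line.
stream_ip_requests() {
    awk -F'"' '
        /./ {
            split($1, part, " ")     # $1 is the text before the first double quote
            print part[2] "\t" $2    # part[2] is the IP, $2 is the request string
        }
    ' "$1"
}
```

Records for the same IP are not adjacent in this output. Something like `stream_ip_requests other_vhosts_access.log | sort -k1,1` should bring them together; as far as I know, GNU sort spills to temporary files on disk for large inputs instead of keeping everything in RAM, but I have not tried it on a file of this size.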
------ Post updated at 02:16 PM ------
BTW, this code works fine for a smaller file (for example when I split the file into smaller chunks), but that doesn't satisfy the requirement: the whole file must be scanned in one pass so that I get the complete list of IPs.