Gawk --- produce the output in pattern space instead of END space

busyboy · September 24, 2018, 4:16am

hi,

I'm trying to count the requests each IP address makes to our Apache server. The standard format of the input is

HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"
HOST IP DATE/TIME - - "GET/POST request" "User Agent"

I'm using the gawk code below to do this (that is, accumulating all requests for every IP in a given input file).

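gawk --re-interval -F\"  '
 # $1 is the text before the first double quote; its second word is the IP
 /./  { split($1,IP," "); IPPP[IP[2]]++ }                    # count requests per IP
 /./  { split($1,IP," "); LINE[IP[2]]=LINE[IP[2]]"<br>"$2 }  # append the request to the list for that IP
END  { for(i in LINE) printf("\n\n%s\t%s",i,LINE[i]) }' other_vhosts_access.log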

the problem:

The input file is actually around 47 GiB in size. Because the LINE array is only printed in the END block of gawk, it has to hold every request until the whole file has been read. The process consumes all the available memory, and the system starts running out of memory for all other processes.

Question:

Can I print the LINE data from the main pattern-action block rather than the END block, so that each matched IP is output as it is read instead of being added to an array and displayed at the end?

------ Post updated at 02:16 PM ------

BTW, this code works fine for smaller files (for example when I split the file into smaller chunks). That doesn't satisfy the requirement, though, as the whole file must be scanned in one pass so that I get the complete list of IPs.

RudiC · September 24, 2018, 3:50pm

You could try that if the input is sorted by IP: whenever the IP changes, print out the results for the IP just finished. Sorting a file that big may be a challenge too, but sort offers some options for dealing with large files.
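
A minimal sketch of that idea, assuming the IP is always the second space-separated word on each line, as in the sample format above. The function name per_ip_sorted is made up for illustration, and the -s (stable) option assumes GNU sort. The awk program uses only POSIX features, so gawk can be substituted for awk. Each request is printed as soon as it is read, so awk keeps only the previous IP in memory.

```shell
#!/bin/sh
# Sketch: sort by IP first, then stream the requests group by group.

per_ip_sorted() {
    # $1: log file (hypothetical argument)
    # -k2,2 sorts on the IP column only; -s keeps the original order within one IP
    LC_ALL=C sort -s -k2,2 "$1" |
    awk -F\" '
        {
            split($1, w, " ")                 # w[2] is the IP
            if (NR == 1 || w[2] != prev) {    # first line of a new IP group
                printf "\n\n%s\t", w[2]
                prev = w[2]
            }
            printf "<br>%s", $2               # emit the request right away
        }
        END { if (NR) print "" }              # end the last group with a newline
    '
}
```

Called as per_ip_sorted other_vhosts_access.log > report.txt, this produces the same IP, tab, request-list layout as the original END loop. For a 47 GiB input, GNU sort's -T (temporary directory) and -S (buffer size) options are probably the ones worth tuning.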

shamrock · September 27, 2018, 11:41am

Apart from what RudiC suggested, you can print each record from the main block to an individual file named after its IP, and after gawk finishes you can concatenate them all into a single output file...
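
A rough sketch of how that could look, again assuming the IP is the second space-separated word. The function names, the .txt suffix and the output directory are all hypothetical. Files are opened with ">>" and closed whenever the IP changes, so only one file is open at a time and a reopened file is appended to rather than truncated; the output directory should therefore start out empty.

```shell
#!/bin/sh
# Sketch of the one-file-per-IP approach; names and paths are hypothetical.

split_by_ip() {
    # $1: log file, $2: output directory (should start out empty)
    mkdir -p "$2" || return 1
    awk -F\" -v dir="$2" '
        {
            split($1, w, " ")                  # w[2] is the IP
            out = dir "/" w[2] ".txt"
            if (out != prev) {                 # IP changed: release the old file
                if (prev != "") close(prev)
                prev = out
            }
            printf "<br>%s", $2 >> out         # ">>" so a reopened file is appended to
        }' "$1"
}

join_ip_files() {
    # $1: directory written by split_by_ip; prints the combined report
    for f in "$1"/*.txt; do
        [ -e "$f" ] || continue                # nothing matched the glob
        ip=${f##*/}; ip=${ip%.txt}             # recover the IP from the file name
        printf '\n\n%s\t' "$ip"
        cat "$f"
    done
}
```

Usage would be along the lines of split_by_ip other_vhosts_access.log ./by_ip followed by join_ip_files ./by_ip > report.txt. Memory stays flat, at the cost of one small file per distinct IP on disk.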