I have a file that is 20 - 80+ MB in size that is a certain type of log file.
It logs one of our processes and this process is multi-threaded. Therefore the log file is kind of a mess. Here's an example:
The logfile looks like: "DATE TIME - THREAD ID - Details", and a new file is created for each day
Now, here's where it gets to be a pain... I need to pull out the lines from "Starting Session" to "Ending Session" for each Thread ID, and dump these to separate files. HOWEVER, the Thread ID CAN be duplicated over the course of a day -- but usually not for many hours.
A session can last from 30 seconds to 4 minutes or so (~1200 lines) in the logfile, and there can be up to 20 concurrent sessions.
Now, I have something that works -- although quite slowly. I end up grepping and sedding the file over and over. When the file gets large, it takes a MASSIVE amount of time. I am hoping that someone here can help me optimize this. If possible, I'd like to use bash.
Thanks,
Eric
Here is the code I have that works, but is _slow_
if [[ -e "$log_file" ]]
then
echo "parsing: "$log_file
grep "starting session" $log_file | while read line
do
thread=`echo $line | cut -d' ' -f4`
sessiontype=`echo $line | cut -d' ' -f6`
sessionnumber=`echo $line | cut -d' ' -f7`
echo " first line of session: "${line:0:25}"..."
line2=`echo - $thread - $sessiontype $sessionnumber shutting down`
echo " last line of session: "${line2:0:25}"..."
sed -n "/$line/,/$line2/p" $log_file | grep " - $thread - ">session.$thread.$sessiontype.$sessionnumber
done
....
This gives me a number of files, that using the example log above would be created as shown below:
Sorry, I should have been more specific. The starting session lines all end with something like:
20090409 000122 - BD0 - Order 123 starting session with client 12 port 34
20090409 000123 - EF0 - Order 234 starting session with client 347 port 38
...
And both the client and port are dynamic values.
Yeah, I'm getting errors -- I'm running this under cygwin, so I don't have easy access to nawk.