Parse log file to insert into database

I have a log file that's created daily by this command:

sar -u 300 288 >> /var/log/usage/$(date "+%Y-%m-%d")_$(hostname)_cpu.log

It contains data like this:

Linux 3.16.0-4-amd64 (myhostname)       08/15/2015      _x86_64_        (1 CPU)

11:34:17 PM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:39:17 PM     all      0.09      0.00      0.07      0.09      0.00     99.75
11:44:17 PM     all      0.04      0.00      0.03      0.00      0.00     99.92
11:49:17 PM     all      1.49      0.00      0.49      0.06      0.00     97.96
11:54:17 PM     all     23.27      0.00      0.51      0.05      0.03     76.14


11:56:12 PM     all      0.17      0.00      0.13      0.01      0.01     99.69
Average:        all      5.69      0.00      0.26      0.05      0.01     93.99

I'm not sure if there's a way to strip out the data I don't need, such as %nice, %iowait, and %steal, as well as the header and footer information, before it is logged to the file. That would make the rest of this post irrelevant.

I need to ignore the first few rows and the last row. I also need to ignore any blank lines.

The only data I need to insert into the database is the time, %user, %system, and %idle data.

The table looks like this:

row_id (int) | date (date) | time (time) | hostname (varchar) | user (decimal) | system (decimal) | idle (decimal)
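
For reference, a table matching that layout might be created with something like the following; the column sizes and DECIMAL precision here are assumptions, not taken from an actual schema:

```sql
-- Hypothetical DDL matching the layout above; adjust names,
-- sizes, and precision to suit the real data.
CREATE TABLE cpuusage (
    row_id   INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    date     DATE NOT NULL,
    time     TIME NOT NULL,
    hostname VARCHAR(64) NOT NULL,
    user     DECIMAL(5,2) NOT NULL,
    system   DECIMAL(5,2) NOT NULL,
    idle     DECIMAL(5,2) NOT NULL
);
```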

The log file name is in YYYY-MM-DD_hostname_cpu.log format. I need to extract the date from the filename and insert it into the table as well.
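
If this ends up being done in a shell wrapper rather than inside awk, the date and hostname can be pulled out of such a filename with parameter expansion; a sketch using a made-up filename:

```shell
# Hypothetical filename in the YYYY-MM-DD_hostname_cpu.log format
f=2015-08-15_myhostname_cpu.log

d=${f%%_*}          # everything before the first "_"  -> the date
h=${f#*_}           # drop the leading "date_"
h=${h%_cpu.log}     # drop the trailing "_cpu.log"     -> the hostname

echo "$d $h"
```

Note that trimming the fixed _cpu.log suffix (rather than splitting on every underscore) would also survive a hostname that itself contains an underscore.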

So far I have

cat YYYY-MM-DD_hostname_cpu.log | awk 'NR printf "%s,%s,%s\n", $4,$6,$9' 

I then need it inserted into a MySQL database. I'm assuming some type of while loop would be needed here.

To create a pipe symbol ( | ) separated values file containing data from all of the files in a directory with names ending in .log, you could use the following:

awk '
BEGIN {	OFS = "|"
}
FNR == 1 {
	d = substr(FILENAME, 1, 10)
}
$2 ~ /^[AP]M$/ && $3 != "CPU" {
	print FNR, d, $1 " " $2, $4, $6, $9
}' *.log

If you want to try this on a Solaris/SunOS system, change awk to /usr/xpg4/bin/awk or nawk.

If there are files named 2015-08-14_hostname_cpu.log and 2015-08-15_hostname_cpu.log in a directory where you run the above script and each of those files contained the sample data shown in your 1st post in this thread, it produces the output:

4|2015-08-14|11:39:17 PM|0.09|0.07|99.75
5|2015-08-14|11:44:17 PM|0.04|0.03|99.92
6|2015-08-14|11:49:17 PM|1.49|0.49|97.96
7|2015-08-14|11:54:17 PM|23.27|0.51|76.14
10|2015-08-14|11:56:12 PM|0.17|0.13|99.69
4|2015-08-15|11:39:17 PM|0.09|0.07|99.75
5|2015-08-15|11:44:17 PM|0.04|0.03|99.92
6|2015-08-15|11:49:17 PM|1.49|0.49|97.96
7|2015-08-15|11:54:17 PM|23.27|0.51|76.14
10|2015-08-15|11:56:12 PM|0.17|0.13|99.69

I will leave it to you to redirect the output from the above script into a file you can use to load your database or pipe the output directly into a mysql statement to load your database.


IMO, if you are only going to store %user, %system, and %idle, you cannot simply discard the other columns; you need to fold them into the relevant columns by adding them.

For example, %iowait is idle time while I/O is happening, but it is still idle time, so you need to add it to idle time; otherwise the numbers do not add up to 100%.
In your sample the percentages are low, but there are situations where they may be significant.

I think you should use this:

%total_user = %user + %nice 
%total_idle = %iowait + %steal + %idle
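
As a quick sanity check, feeding the 11:54:17 sample line from this thread through the regrouping shows the three combined columns still account for all CPU time:

```shell
# (%user+%nice) + %system + (%iowait+%steal+%idle) should total 100
echo '11:54:17 PM all 23.27 0.00 0.51 0.05 0.03 76.14' |
awk '{ printf "%.2f\n", ($4+$5) + $6 + ($7+$8+$9) }'
```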

So, adjusting Don's suggestion that would mean:

	print FNR, d, $1 " " $2, $4+$5, $6, $7+$8+$9 

Which produces the output:

4|2015-08-14|11:39:17 PM|0.09|0.07|99.84
5|2015-08-14|11:44:17 PM|0.04|0.03|99.92
6|2015-08-14|11:49:17 PM|1.49|0.49|98.02
7|2015-08-14|11:54:17 PM|23.27|0.51|76.22
10|2015-08-14|11:56:12 PM|0.17|0.13|99.71
4|2015-08-15|11:39:17 PM|0.09|0.07|99.84
5|2015-08-15|11:44:17 PM|0.04|0.03|99.92
6|2015-08-15|11:49:17 PM|1.49|0.49|98.02
7|2015-08-15|11:54:17 PM|23.27|0.51|76.22
10|2015-08-15|11:56:12 PM|0.17|0.13|99.71

Thank you both for your replies. The OS I'm using is Debian 8 x64.

Thank you again for this information. I was not aware that the other columns were part of the whole equation of CPU utilization.

---------- Post updated at 11:15 AM ---------- Previous update was at 11:09 AM ----------

There will be other log files in this directory that are for another purpose; could I use *_cpu.log instead? Is there any way to convert the time to a 24-hour clock that MySQL understands? I'm assuming I would need to trim the AM or PM off and, if it was PM, somehow add 12 hours?

I have created a script that processes the output and modified it a little bit. I just need to fix the time and I should be good to go. Here's the current output.

2015-08-15,11:39:17PM,0.09,0.07,99.84
2015-08-15,11:44:17PM,0.04,0.03,99.92
2015-08-15,11:49:17PM,1.49,0.49,98.02
2015-08-15,11:54:17PM,23.27,0.51,76.22
2015-08-15,11:56:12PM,0.17,0.13,99.71

Edit: I forgot to add the hostname column. How can I add that into the awk? The hostname can be of various lengths, but will always sit between the leading date_ and the rest of the filename.

Hi, modifying Don's suggestion, you could try including the hostname with something like this:

awk '
  BEGIN {
    OFS = "|"
  }
  FNR == 1 {
    split(FILENAME,F,/_/)
    d=F[1]
    h=F[2]
  }
  $2 ~ /^[AP]M$/ && $3 != "CPU" {
    print FNR, d, h, $1 " " $2, $4+$5, $6, $7+$8+$9
  }
' *_cpu.log

Awesome! That worked.

---------- Post updated at 12:38 PM ---------- Previous update was at 12:06 PM ----------

What about time conversion from 12 to 24hr?

So, using comma instead of the pipe symbol, and getting rid of the space between the time stamp and the "AM" or "PM", we have:

awk '
BEGIN {	OFS = ","
}
FNR == 1 {
	split(FILENAME, F, /_/)
	d=F[1]
	h=F[2]
}
$2 ~ /^[AP]M$/ && $3 != "CPU" {
	print FNR, d, h, $1 $2, $4+$5, $6, $7+$8+$9
}' *_cpu.log

and, if you want a 24 hour clock time instead of AM/PM notation, try:

awk '
BEGIN {	OFS = ","
}
FNR == 1 {
	split(FILENAME, F, /_/)
	d=F[1]
	h=F[2]
}
$2 ~ /^[AP]M$/ && $3 != "CPU" {
	split($1, HMS, /:/)
	if(HMS[1] == 12)HMS[1] = 0
	if($2 == "PM")	HMS[1] += 12
	t = sprintf("%02d:%s:%s", HMS[1], HMS[2], HMS[3])
	print FNR, d, h, t, $4+$5, $6, $7+$8+$9
}' *_cpu.log

Here's what I have so far for the entire process.

crontab -e

0 0 * * * /scripts/collect_cpu.sh
0 * * * * /scripts/process_cpu.sh

collect_cpu.sh

#!/bin/bash

sar -u 300 288 >> /var/log/mylogs/$(date "+%Y-%m-%d")_$(hostname)_cpuusage.log

process_cpu.sh

#!/bin/bash

# Change directory to HNP logs
cd /var/log/mylogs/

# Loop over each file
for logFile in *_cpuusage.log
  do
    awk '
      BEGIN { OFS = "," } FNR == 1 {
        split(FILENAME,F,/_/)
        d=F[1]
        h=F[2]
      }
      $2 ~ /^[AP]M$/ && $3 != "CPU" {
        split($1, HMS, /:/)
	if(HMS[1] == 12)HMS[1] = 0
	if($2 == "PM")	HMS[1] += 12
	t = sprintf("%02d:%s:%s", HMS[1], HMS[2], HMS[3])
        print d, t, h, $4+$5, $6, $7+$8+$9
      }
    ' $logFile > /var/log/mylogs/$logFile.processed

  # load into mysql
  mysql -e "load data local infile '"$logFile.processed"' into table cpuusage fields TERMINATED BY ',' (hidden)

  # rename file done
  mv $logFile.processed $logFile.processed.done

  # mark .done for previous days
  if d < $(date "+%Y-%m-%d")
    mv $logFile $logFile.done
  endif

done

I can manage this process if I run the cpu process job for any day before the current day. However, I want to process this every 60 minutes.

Is there any scripting I can do to better manage this? I'm thinking it may be easier to just rename the current log file as .done and let the existing cron job recreate the cpu log file when it runs again in the next 5 minutes; however, what if my process runs at the same time as the cron job? It seems I'll run into a file lock issue.

Why don't you do that in one go, i.e. combine the two scripts into one? That would entirely decouple two runs, and the DB's locking mechanisms would certainly ensure two processes don't interfere with each other.
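
A combined script might look roughly like this. The processing logic is the awk program from earlier in the thread; since a live sar run takes an hour to produce its samples, a canned sample line is piped through below just to show the flow, and the paths, table name, and mysql credentials are assumptions that would still need filling in:

```shell
#!/bin/sh
# Sketch of a single collect-and-process script. In the real cron job the
# input would come from:  sar -u 300 12
# One sample line is fed in below so the pipeline can be shown end to end.

# Same processing logic as the awk script earlier in the thread
prog='
BEGIN { OFS = "," }
$2 ~ /^[AP]M$/ && $3 != "CPU" {
	split($1, HMS, /:/)
	if (HMS[1] == 12) HMS[1] = 0
	if ($2 == "PM")   HMS[1] += 12
	t = sprintf("%02d:%s:%s", HMS[1], HMS[2], HMS[3])
	print d, t, h, $4+$5, $6, $7+$8+$9
}'

# Real job:  sar -u 300 12 | awk ... | (redirect to CSV, then mysql LOAD DATA)
printf '11:54:17 PM\tall\t23.27\t0.00\t0.51\t0.05\t0.03\t76.14\n' |
awk -v d="$(date +%Y-%m-%d)" -v h="$(hostname)" "$prog"
```

Because collection and processing happen in one pipeline, there is no intermediate daily log file to rename or race against, which sidesteps the file lock concern entirely.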

What are the execution times of collect_cpu.sh and process_cpu.sh?

The first script kicks off once per day and samples every 5 minutes, 288 times. How can I combine both?

From what I read, if I just run sar it gives me the average CPU usage since the last reboot, not since the last time it ran.

By logging every 5 minutes, I can get an idea of what the CPU was at a point in time and correlate it to the same 5 minute window my other web apps are executing. I'm also tracking CPU load, but that's a separate script.

Please be more specific in what you want to change from what you are currently doing.

Does this code in process_cpu.sh :

  # mark .done for previous days
  if d < $(date "+%Y-%m-%d")
    mv $logFile $logFile.done
  endif

really work for you? In addition to missing a then , what does the d utility on your system do and what is in the file named 2015-08-16 when you run this command today?

If you want to update the database hourly, why are you creating one log file per day instead of one log file per hour? Why not run:

sar -u 300 12 > $(date '+%H').log

in your hourly script and run the awk script on the resulting log file instead of reprocessing data being accumulated continuously during the day into a single file.

Again, why don't you process the data immediately after creating/logging them?

Good point. I'll run both hourly. Is there a way to merge the hourly files into one per day?