Solaris 10 - script creating huge spike in Kernel CPU

I'm running on Solaris 10, and I have a script that's running on several machines. Basically, what it's doing is:

  • tail -f one or more log files, piped through grep, into a temp file
  • Every minute or so, copy that temp file to a second temp and zero the first
  • Sed through the 2nd temp to pull out a user ID
  • grep through that file for occurrences of that ID together with some other text (the other text comes from a local file that I read in a loop)
  • Output a text record to a log file consisting of the user ID, that text, and a count

Another server then does a tail -f of the output log file.
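A stripped-down sketch of that logic looks something like this (the file names, sed pattern, and grep text here are placeholders, not the real ones):

#!/usr/bin/ksh
# sketch only: tail the log through grep into a raw temp file
tail -f /var/log/app.log | grep "PATTERN" > /tmp/sscp.raw &

while true
do
    sleep 60
    cp /tmp/sscp.raw /tmp/sscp.work
    > /tmp/sscp.raw                     # zero the first temp

    # pull a user ID out of the copied data
    user=$(sed -n 's/.*user=\([^ ]*\).*/\1/p' /tmp/sscp.work | head -1)

    # count occurrences of that ID together with each line of "other text"
    while read text
    do
        count=$(grep -c "${user}.*${text}" /tmp/sscp.work)
        echo "${user} ${text} ${count}" >> /var/log/sscp.out
    done < /usr/local/etc/other_text.lst
done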

The user CPU for this script is small... on the order of 2-4%, depending on which box I'm running it on. But the crazy thing is that on my test box, with one input data stream, as soon as I start running the script, my kernel % jumps from about 1% to 60%.

As far as I can see, the issue is NOT disk I/O wait or memory (memory is still half free, with very small paging values).

# sar -g 5 5

SunOS ssdev01 5.10 138889-03 i86pc    03/03/2010

13:09:01  pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
13:09:06     0.20     0.80     0.60     0.00     0.00
13:09:11     0.40     1.00     0.80     0.00     0.00
13:09:16     0.20     0.80     0.80     0.00     0.00
13:09:21     0.20     0.80     0.60     0.00     0.00
13:09:26     0.60     1.80     1.40     0.00     0.00

Average      0.32     1.04     0.84     0.00     0.00

Pre-run top:

load averages:  0.05,  0.27,  0.66                                                      13:25:39
69 processes:  68 sleeping, 1 on cpu
CPU states: 96.7% idle,  2.5% user,  0.8% kernel,  0.0% iowait,  0.0% swap
Memory: 2048M real, 1126M free, 486M swap in use, 2652M swap free

During run:

load averages:  0.68,  0.39,  0.68                                                      13:26:19
78 processes:  75 sleeping, 1 running, 1 zombie, 1 on cpu
CPU states: 20.5% idle, 18.3% user, 61.2% kernel,  0.0% iowait,  0.0% swap
Memory: 2048M real, 1123M free, 491M swap in use, 2647M swap free

Here's the script in ps:

# /usr/ucb/ps -aux |more     
USER       PID %CPU %MEM   SZ  RSS TT       S    START  TIME COMMAND
ops      15436  4.7  0.1 1524 1028 ?        S 13:25:51  0:06 /usr/bin/ksh /sscp

Any ideas what I can look at to see what's chewing up all that kernel time?

That's a good job for DTrace.
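For example, a couple of stock DTrace one-liners (run as root; the syscall and profile providers are standard) would show which system calls the script generates and where the kernel is spending its cycles:

dtrace -n 'syscall:::entry { @[execname, probefunc] = count(); }'
dtrace -n 'profile-1001 /arg0/ { @[stack()] = count(); } tick-30s { exit(0); }'

The first counts system calls by process name and call; the second samples kernel stacks at 1001 Hz for 30 seconds.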

Any difference in configuration between that machine and the others? Like, say, ZFS compression enabled?

Does the kernel percentage go down when you run the script a second time?
(The question is about caching.)

Does your test system have fewer CPUs than your production system? Does it have more than one CPU?

You could always post the script, and let us know how big these files are.

Hmm. With only 69 processes running, this computer is tiny. Fewer processes than a Windows PC has when it's doing nothing.

In general, 100% utilisation of a CPU is only a problem if other processes are waiting for the CPU.

What is your output from:

sar -u
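For example, sampled the same way as the sar -g run above:

sar -u 5 5

The columns are %usr, %sys, %wio and %idle.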