top and nice

CBorgia · January 22, 2008, 6:39am

Hi,

I have two identical 12-CPU HPUX machines, and I run the same processes on each, loading both boxes fully.

top on one reports activity under the NICE (19%) and SYS (18%) columns, while top on the other reports 0% NICE and 16% SYS. What would cause NICE to be zero on one machine and not the other? The machine with NICE activity seems to underperform compared to the other and I'm trying to find the difference.

vbe · January 29, 2008, 9:59am

And so, what was the difference?

CBorgia · January 30, 2008, 2:06pm

Hi VBE,

The difference is simply that one machine shows a lot of activity under the NICE column of top, and the other always shows zero. The machines are supposed to be identically configured and each handles the same very heavy load.

So my question is: can HPUX be set so that NICE is not used/recorded (in which case there is some difference in the way the machines are set up), or is it impossible to influence the numbers in the NICE column (in which case there is something I don't understand about the load on the machines)?

Perderabo · January 30, 2008, 9:22pm

The only way to affect the nice value is to set it explicitly via the nice() or setpriority() system calls, which are usually invoked by the nice and renice commands. Other things can affect a process's priority, but they can't affect the nice value. The nice value is under the explicit control of the user.
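To illustrate the point, here is a minimal sketch in Python, whose os.nice() is a thin wrapper around the nice() system call. The helper names current_nice and child_nice are made up for this example, and the numeric range of nice values differs between systems, so treat the numbers as illustrative only. It shows that the nice value changes only when a process asks for it, and that a child starts with whatever value its parent had.

```python
import os
import subprocess
import sys


def current_nice() -> int:
    # os.nice(0) adds 0 to the nice value and returns the result,
    # so it reads the current value without changing it.
    return os.nice(0)


def child_nice(increment: int) -> int:
    # Hypothetical helper: start a child interpreter that raises its own
    # nice value by `increment` and prints the value it ends up with.
    code = f"import os; print(os.nice({increment}))"
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())


if __name__ == "__main__":
    print("parent nice:", current_nice())
    # The child inherits the parent's value when it makes no change.
    print("child, no change:", child_nice(0))
    # The child only becomes "nicer" because it explicitly asks to.
    print("child after nice(+5):", child_nice(5))
    # The parent is unaffected by what the child did.
    print("parent nice afterwards:", current_nice())
```

Because the value is inherited, a whole batch job ends up niced if whatever launches it (a wrapper script, a scheduler entry, a shell setting) was niced first, even when the job itself never calls nice.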

The fact that one box has niced processes while the other does not pretty much means that they are indeed different. But it is not too surprising that the niced processes underperform the processes running at standard priority. That is what is supposed to happen.

But it sounds like the solution is simple. Since you think they are supposed to be the same, why not simply copy the good one to the bad one?

CBorgia · February 2, 2008, 10:09am

Thanks Perderabo,

I'm not explicitly using nice or renice. The bulk of the processes running on the two boxes are identical. These processes saturate the CPU; it's a batch machine, so if the batch job is not running the machine is idle. The set of processes on a machine makes up a single job, so if one of the processes is niced out to let another run, the overall job should still run in the same time (minus some time for swapping tasks around). However, with the same processes and the same load, one machine displays a lot of nice activity in top while the other never shows nice above 0%, and these runs last many hours or days at 100% CPU. The machine without nice activity finishes its 50% share of the load well before the other does. I'm not sure what is meant by "copy the good one to the bad one".

vbe · February 4, 2008, 1:16pm

I agree with Perderabo. If both boxes have 50% of the tasks each, I would swap the two halves of the workload between them and see if the result is the same. If the same box is still the slow one, the box is to blame; otherwise the code is to blame...
On top of nice, you could have some code that explicitly forces all subsequent processes to be executed by a given processor (I do this to mad developers: they share the last proc among themselves while all other users work comfortably with the rest...)