Maxuproc parameter and number of processes

Hi there,

I am having a problem on an AIX server running a WebSphere MQ instance. The problem is that it sometimes seems to reach a process limit, but I cannot find the processes themselves.

What I see: I can log in (as root from the console, or as a nonprivileged user via ssh), but trying to run almost any command, even a simple "ls", results in the message "Killed." However, the "ps -ef" command is able to run. The MQ monitoring scripts get killed. vmstat cannot run, but lsps shows no paging activity, so there should be enough memory. topas is also able to run, and shows very little CPU activity.
In my experience, this "Killed." message used to be the result of reaching the maxuproc limit, but maxuproc is set to 4096 and ps -ef shows only ~90 processes. However, when I raise the maxuproc parameter, everything works fine again.

Well, my question is: how can I monitor whether I am reaching the maxuproc limit? Or: where are the processes that are not listed by the ps command?

Best regards,
--Trifo

If you are hitting a limit on the number of processes you're running, ps may be exempt from the limit because it usually runs set-UID root.

If you're running ps -ef | grep "<username>" or some other pipeline, even though ps might be exempt from the limit, the pipeline is not exempt and the output you're seeing could be truncated if the grep is killed due to the process limit.

Are you seeing this problem consistently? Or does it vary with time of day, or at times when cron or at jobs might be expected to be running? You say you're seeing about 90 processes running. Are they all things that you expect to be running? Are any of them things that hang around running for a while and then kick off a bunch of other processes to perform certain tasks when certain conditions arise?

Could network traffic be kicking off jobs that are being run by processes running under your account?

Do you have a bunch of MQ monitoring scripts running in the background? What are they doing? How many of them are there?

Obviously, with no access to your system, we can only make wild guesses. I agree that it sounds like you're running enough processes that AIX isn't letting you start any more until one or more of the jobs that are running terminate, but that doesn't help much if we don't know what is running and why it is running.

Is process accounting enabled on your system? Can your sysadmin help you track down what jobs you're running during times when your processes are being killed?

I can't tell you where your processes are, but i can tell you how to find out all user properties (including, but not limited to, maxuproc):

root@system # lsuser <username>

The output is in "attribute=value" format, separated by blanks. You can also use the -f switch to get stanza format or -c to get colon-separated format. You need to do it as root to get all attributes, if you do it as user you only get a small subset.
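
As a rough sketch of working with that "attribute=value" output in a script: the function below pulls one attribute per user out of lines shaped like "name attr=value attr=value". The name attr_value and the sample lines are made up for illustration, and it assumes values contain no blanks (which will not hold for every attribute).

```shell
# attr_value KEY: read lsuser-style lines ("name attr=value ...") on stdin
# and print "name value" for the requested attribute.
# Hypothetical helper; assumes attribute values contain no blanks.
attr_value() {
    awk -v key="$1" '
    {
        for (i = 2; i <= NF; i++) {
            eq = index($i, "=")                 # split at the first "="
            if (eq > 0 && substr($i, 1, eq - 1) == key)
                print $1, substr($i, eq + 1)
        }
    }'
}

# Sample input only; on AIX the real input would come from something like
#   lsuser -a id ALL | attr_value id
printf '%s\n' 'mqm id=205 nofiles=2000' 'root id=0 nofiles=-1' | attr_value nofiles
```

Reading from stdin keeps the parsing separate from the AIX-specific command, so it can be tried out on any box with awk.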

I hope this helps.

bakunin

Thanks for the replies. Well, trying to be more specific.

There is an MQ server running on the host, running ps -ef at any time shows about 90 lines of output. This is quite normal, including the processes belonging to AIX itself, the MQ server and the monitoring scripts (5 maximum at any given moment).

This morning I found that the output of ps -ef shows just the same number of processes as it usually does. Most of them remain alive for an extended period, so every app that managed to connect earlier is able to use the service. New connections cannot be created; in this configuration, a new connection implies a new process to handle the client.

Also I am unable to run any command that is not setuid root.

Now, raising the maxuproc value from 4096 to 5000 seems to solve the problem. Well, there is not a single user in the system trying to run 4000 processes, as I see 90 processes altogether. Why?

A couple of hours later the problem showed up again in the same way. Raising maxuproc again solved the problem. Well, seemed to solve it. Something is accumulating in the background and I do not see what that might be. So, when I run into this maxuproc problem and maxuproc is set to 4096, I would like to see that something really has reached 4096. What kind of objects are counted? Entries in the process table? Threads? Or something else?

Well, I know how to list user parameters.

The relevant parameters of the relevant user are:

        fsize=2097151
        cpu=-1
        data=262144
        stack=65536
        core=2097151
        rss=65536
        nofiles=2000

Well, yes, maybe I was on the wrong track and the limit was not the number of processes, but some other limit. In that case my question is: why did raising maxuproc suppress the problem?

--Trifo

On AIX, I would expect the count to just be the number of processes in the process table. (On a Linux system, it could easily be the number of threads.)

Note that if one of your processes forks and execs other processes and doesn't reap them when they die you could easily get a condition like this, but you should see zombie processes in the process table in this case. (Note that a zombie process is a process that was running and has died. The process table slot is still consumed by the process even though all of its other resources have been freed because the process slot can't be released until its parent reaps its exit status with a call like wait(), waitid(), or waitpid().) But, zombies should show up in ps -ef output.
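
A minimal sketch of that zombie check, assuming ps -ef marks dead-but-unreaped processes with the string "<defunct>" and prints the parent PID in the third column. The function names and the sample listing are made up for illustration.

```shell
# count_defunct: read "ps -ef"-style output on stdin and print how many
# lines carry the <defunct> marker.
count_defunct() {
    awk '/<defunct>/ { n++ } END { print n + 0 }'
}

# defunct_by_parent: same input, but tally zombies per parent PID
# (column 3), to point at the process that is not reaping its children.
defunct_by_parent() {
    awk '/<defunct>/ { z[$3]++ } END { for (p in z) print p, z[p] }'
}

# Sample listing only; a live check would be:  ps -ef | count_defunct
sample='UID PID PPID C STIME TTY TIME CMD
mqm 101 1 0 10:00 - 0:00 some_daemon
mqm 102 101 0 0:00 <defunct>
mqm 103 101 0 0:00 <defunct>'
printf '%s\n' "$sample" | count_defunct
printf '%s\n' "$sample" | defunct_by_parent
```

If the count stays at zero while commands are being killed, unreaped children are probably not what is filling the table.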

I suppose it is possible that you have a process that is creating threads and not waiting for them to finish (i.e., not calling pthread_join() to free up the thread ID). I don't know if AIX would kill processes that can't get a new thread ID due to unreaped threads, but it seems plausible. On AIX, threads would not show up in ps -ef output.

Maybe bakunin can suggest a way to determine thread limits on AIX and a way to look for zombie threads?

Well, zombie processes - if there would be any - would show up in ps -ef output as "defunct". This time there were none.

Threads in AIX can be listed using ps -efo THREAD but counting all the threads resulted in ~500 entries, which is far less than the value of maxuproc.

Let's see the problem from another aspect:

  • monitoring shows that the monitoring scripts are unable to finish
  • logging in as a nonprivileged user succeeds, but running most commands results in a "Killed." message
  • the host seems to have plenty of free memory and CPU resources
  • no messages in errpt

Well, what would you do as problem determination?

--Trifo

Before your last post, I thought you were saying that one (non-root) user was having problems. Do you mean that all non-root users are having jobs killed by the system?

Does AIX have a fixed process table size? If so, what fixed process table size is currently configured and how many processes does ps -ef show running for all users?

Well, I am the one with the root privileges. However, the app (WebSphere MQ) is running under its own nonprivileged user account. When I log in remotely, I also use my own nonprivileged user and then use sudo to change privileges if needed.

The problem was visible to all nonprivileged users, and even to the root user, when running most commands.

ps -ef shows all processes for all users. Almost. ps aux is listing all processes for all users, even those not attached to a terminal.

Right now I do not have the opportunity to reproduce the situation, but my guess is that ps aux would be able to show all the entries I was looking for.

--Trifo

I'm no AIX expert, but I did use MQ on Linux platforms.

First, when installing MQ, there are kernel parameters which should be set.

There is also a utility (mentioned in the docs) which will check if everything is set properly system wise for MQ software on various platforms.
It is called mqconfig .

Set everything as written and you should have no issues, unless you are having extreme loads or running other software on box along with MQ.

Regards
Peasant.

Also check LWPs! Usually it is

ps -eLf

To be honest, i am out of (simple) ideas that could easily be conveyed over the internet. I am pretty sure there is something deeply amiss with your system and you should contact the systems administrator immediately. It might well be that you - as non-administrator - could not even see the problem that is causing this and he is in a much better position than i am to determine what exactly that is (in fact he can actually see the system - that helps along the determination a lot).

As Don said, the number of processes running are what they seem to be and you did follow the correct procedure to get the number. I doubt that this is your problem but i am equally at a loss when it comes to offer alternative explanations. In any case, PLEASE TELL US the solution once you got it, because this is something i dearly want to add to my knowledge. I'll be indebted to you.

Sorry for not being more of a help, but if you have questions (regarding this or anything else) that i can answer, you are welcome to ask.

bakunin

The "maxuproc" parameter is limiting the size of process table in AIX.

--Trifo

maxuproc limits the number of processes per user. In other words, each user can run up to maxuproc processes in parallel.
And 10 users can run a total of 10*maxuproc processes.
The limit does not apply for uid=0.
See also this article.
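
Since the limit is per user, a check has to tally processes per user rather than look at the total. A rough sketch, assuming the input is one user name per line (one line per process): it prints every non-root user within a margin of the limit. The function name is made up, and skipping root by name is my stand-in for "uid=0".

```shell
# users_near_limit LIMIT MARGIN: read one user name per line on stdin
# (one line per process) and print "user count" for every non-root user
# whose process count is above LIMIT - MARGIN.
# Hypothetical helper; "root" by name stands in for uid=0.
users_near_limit() {
    awk -v limit="$1" -v margin="$2" '
    { count[$1]++ }
    END {
        for (u in count)
            if (u != "root" && count[u] > limit - margin)
                print u, count[u]
    }'
}

# Sample input only. A live check could look like this, assuming ps
# accepts "-o user=" on the system in question:
#   ps -e -o user= | users_near_limit 4096 50
printf '%s\n' mqm mqm mqm root root root root trifo | users_near_limit 4 2
```

With the sample data only mqm is reported: root is skipped even though it has the most processes, which matches the rule that the limit does not apply to uid=0.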

Without digging into the system's header files, you should be able to retrieve your current system's allowed number of processes per user with the command:

getconf CHILD_MAX

which is defined to return the system's current value for the maximum number of simultaneous processes per real user ID. Note that this says nothing about the size of the kernel's process table which must contain one slot for each process that is currently active. Note that in this case, active means has been started and its exit status has not yet been collected by its parent (or if its parent has died, collected by the system's garbage collector [a process named init on some systems]).
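
A small sketch of how that value might be used in a script: compare a process count against the reported limit and print the remaining headroom. The function name headroom is made up, and the guard for non-numeric output is there because getconf may print something like "undefined" on some systems.

```shell
# headroom USED LIMIT: print how many more processes fit under LIMIT,
# or "unknown" when LIMIT is empty or not a plain non-negative number.
# Hypothetical helper for illustration.
headroom() {
    case $2 in
        ''|*[!0-9]*) echo unknown ;;
        *) echo $(( $2 - $1 )) ;;
    esac
}

# Numbers from this thread: about 90 processes against a limit of 4096.
headroom 90 4096

# Against the live value; prints "unknown" if getconf is missing or
# reports a non-numeric value.
headroom 90 "$(getconf CHILD_MAX 2>/dev/null)"
```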

In the old days, the size of the process table was fixed when the kernel was built. Most of today's systems attempt to grow the process table as needed rather than failing fork() calls when the process table fills up. But, if the kernel runs out of memory, a normal user's fork() will fail and a super-user's fork() may kill off a normal user's running process to allow the super-user to create a new process. What actually happens in these cases varies considerably from system to system.

A Unix-compatible process table is still fixed. Only LWPs are not limited on some OS.
In AIX, according to my link, it is 262144. Not tunable.
Solaris and HP-UX have 30000, tunable.
Linux has 32768, shared with LWPs, tunable.

To monitor against maxuproc, I find the user's id with:

lsuser -a id user

then use:

ps -fu $id | wc -l

to count their processes (the count includes the ps header line, so subtract one).

If you're interested, here's a section of a script I use to monitor every user's current process count against maxuproc.

# Per-user process limit (maxuproc is an attribute of sys0)
numprocs=$(lsattr -El sys0 -a maxuproc | awk '{print $2}')
# Warn when a user gets within 50 processes of the limit
threshprocs=$(( numprocs - 50 ))
# One line per user, in the form "name id=N"
lsuser -a id ALL > /tmp/maxuproc.tmp
while read user unum
do
    # Count the user's processes; subtract 1 for the ps header line
    kount=$(( $(ps -fu "$user" | wc -l) - 1 ))
    if (( kount > threshprocs )); then
      echo " High Process Count for User: " $user
      echo " Exceeds Threshold Value Set in this SCRIPT: " $threshprocs
      echo " Current Process Count: " $kount
      echo " Max Process Count: "     $numprocs
      echo "                    "
      ps -fu "$user"
      errors_found="TRUE"
    fi
done  < /tmp/maxuproc.tmp

And this is a simplified version of my Nagios check script, that checks user's threads (LWPs) and processes.
Does it run on AIX?

#!/bin/sh
set -f
PATH=/bin:/usr/bin:/usr/sbin:/sbin

limit=${1:-2850}
plimit=${2:-850}

case $limit$plimit in
*[!0-9]*)
  echo "
Usage: $0 [WARN threads] [WARN processes]"
  exit 3
  ;;
esac

check=`ps -e -L -o user= -o pid= | awk '
{++n[$1]}
p[$2]++==0 {++o[$1]}
END {
 for (u in n) {
  if (n[u]>max) max=n[uid=u]
  if (o[u]>pmax) pmax=o[puid=u]
 }
 if (max>'$limit') {print "uid="uid,"max="max}
 else if (pmax>'$plimit') {print "uid="puid,"pmax="pmax}
}
'`
if [ -z "$check" ]
then
  echo "OK: all users below $limit threads"
  exit 0
else
  eval $check
  if [ -n "$max" ]; then
    echo "WARNING: user $uid runs $max threads"
  elif [ -n "$pmax" ]; then
    echo "WARNING: user $uid runs $pmax processes"
  fi
  exit 1
fi

It can only report one user. If two or more users are above thresholds it reports the top user.
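
If reporting every offending user is wanted, one possible variation (a sketch, not the original check) is to print a line per user and per exceeded threshold instead of keeping only the maximum. It expects the same "user pid" lines, one per thread, that the ps command above produces; the function name report_all is made up.

```shell
# report_all TLIMIT PLIMIT: read "user pid" lines on stdin, one line per
# thread, and print a WARNING line for every user above either threshold.
# Hypothetical variation of the check above.
report_all() {
    awk -v tlimit="$1" -v plimit="$2" '
    { threads[$1]++ }                 # one input line per thread
    !seen[$2]++ { procs[$1]++ }       # count each pid only once
    END {
        for (u in threads) {
            if (threads[u] > tlimit)
                print "WARNING: user " u " runs " threads[u] " threads"
            if (procs[u] > plimit)
                print "WARNING: user " u " runs " procs[u] " processes"
        }
    }' | sort                         # stable order for the console
}

# Sample input only; live input would come from the ps call above:
#   ps -e -L -o user= -o pid= | report_all 2850 850
printf '%s\n' 'a 1' 'a 1' 'a 1' 'a 2' 'b 3' 'b 4' 'b 5' 'c 6' | report_all 2 2
```

Empty output then means "OK", and the exit code for Nagios can be derived from whether any line was printed.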

I don't have an AIX system available to see if your code produces the desired results there, but you might want to consider the following.

Since this code doesn't allow a user to override the default 850 for number of processes unless a first operand is given to specify the number of threads, one might consider changing:

  echo "
Usage: $0 [WARN threads] [WARN processes]"

to:

  echo "
Usage: $0 [WARN_threads_count [WARN_processes_count]]" >&2

I added the redirection because this usage statement is a diagnostic that should be written to stderr instead of stdout. I added the underscores and the "count"s so a naive user running your code and seeing your Usage printout would be less likely to think your script was expecting "WARN threads" and "WARN processes" to be literal strings.

Of course your usage text is better.
The original script is named "check_user_threads.sh" and that's what it did from the beginning. The procs measurement was added later, in a hurry, as Solaris servers with a proc limit (in /etc/system) showed up.
The usage message is sent to stdout on purpose, because stdout *must* go to the Nagios console. Not so with stderr. But maybe stderr works by now?

Hi MadeInGermany,
Thanks for the information. I'm just used to writing utilities that work directly on BSD, Linux, and UNIX platforms where we all know what is supposed to happen and users know how to separate diagnostics from normal output. I hate using things like Nagios that think that diagnostic messages should be hidden from users (making it hard or impossible for those users to find out what went wrong when underlying utilities report problems).

Oh, well.

Cheers,
Don