Short-lived Processes Don't Appear in 'top' or 'ps'

We are running a field-specific middle-tier application server on HP-UX. We've recently been experiencing performance problems with it and with the database back end (Oracle on a separate HP-UX box). We resolved a few issues on the DB server (tuning some kernel parameters to free up RAM that was heavily overcommitted to the vxfs buffer cache) and it seems to be able to handle the load again. But as soon as that was resolved, the problems that we saw on the middle tier came back.

Currently we're involved in a finger-pointing battle with the company that makes the application server, HP and Oracle. Personally I believe the fault lies with the middle tier. We had someone from HP come in on a time-and-materials basis to analyze our DB and middle-tier systems, and he said things look good in terms of the OS. Further investigation of the performance data indicated that the third-heaviest consumer of CPU and RAM was a short script that the application server launches hundreds to thousands of times per minute. It seems that this process is intended to set some environment variables for its child processes and nothing more. This looks grossly inefficient to us, but first we need to figure out which process(es) spawn this script.

I found 'UNIX95=1 ps -Hef', which shows a rough process tree. (There isn't a port of 'pstree' from Linux, is there?) But we've discovered that the script processes never show up in our 'ps' or 'top' output. However, the performance data gathered by HP's scripts (and Glance, I think) seemed to keep track of those processes. My supervisor believes the problem is that 'ps' and 'top' only take a snapshot of current activity and the script process is too quick to be captured. I'm not sure whether that's true, but it seems unlikely.

So my questions are:

  1. Is there a way to control how short-lived a process 'ps' can see?
  2. Is it possible that 'ps' and 'top' can't display processes that are "too short"?

Is it possible to modify the shell script instead? You could add a small piece of instrumentation to it, such as writing $PPID to a log file, so you can track what launches it.

Look into doing something with pstat if you need a "fast" ps command. You'll have to run it with slightly elevated priority.

If the shell script is editable, isn't it possible to log the pid, the ppid and, if needed, how long it lives?
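A minimal sketch of the instrumentation suggested above, assuming the wrapper is a POSIX shell script that you are allowed to edit. The function name `log_invocation`, the variable `GET_TRACE_LOG` and the default log path are made up for illustration; none of them come from the application.

```shell
#!/bin/sh
# Hypothetical instrumentation for the top of the wrapper script.
# GET_TRACE_LOG and log_invocation are illustrative names, not part of the application.
GET_TRACE_LOG="${GET_TRACE_LOG:-/tmp/get_trace.log}"

log_invocation() {
    # $$ is this shell's PID; $PPID is the PID of whatever launched it.
    # A single appended line per call keeps the log easy to post-process.
    printf '%s pid=%s ppid=%s cmd=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$$" "$PPID" "$0" >> "$GET_TRACE_LOG"
}

log_invocation
# ... the rest of the original wrapper script would continue here ...
```

With that in place, something like `awk '{print $4}' /tmp/get_trace.log | sort | uniq -c | sort -rn` would give a count of invocations per parent PID. Note that the `date` call is itself an extra process per invocation, so on a script launched thousands of times per minute you may want to drop the timestamp or remove the logging once you have your answer.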

Upon further inspection it appears that we're dealing with a PA-RISC executable and not a script. There is a wrapper script that calls the executable, and it looks like the executable is what we want to find in the 'ps' output. Both have names that start with 'get', but 'ps -ef | grep get' never finds any matching processes.

Is 'pstat' a command on its own? All I found in the man pages were the HP-UX library functions. Or are you saying we might need to build our own 'ps' command?

I have to leave now (kids waiting...)
So I'll drop in a note I made in 2003 for you to read; see if it helps your understanding:
Bill Hassell:
Unix memory usage is a very complex process. As you mentioned, shared memory is difficult to assign to a given process, and considering the number of different ways a process may be started (cron, rpc, network client/server tasks, threads), accurately assigning memory to a single user is virtually impossible. For a given user ID, you can get a rough idea (which is likely all that you need) by using ps:

UNIX95= ps -u joan -o vsz,ruser,pid,args |sort -rn

This shows all processes owned by the real user joan, showing the virtual size in Kbytes in descending order.

Mike Stroyan:
The pstat_procvm function can give you all the information you need to do that. The attached program uses a reference count of the number of processes that map each region. It recognizes unique regions by a combination of their vm.pst_space and vm.pst_vaddr.

It divides the credited size of a region by the reference count. If three processes share a memory segment then they each get billed for one third of its size. You can run the program with either user ids or user names to look for.

The ps command is really naive about process size. It reports only the total size of text, data, and stack. It completely misses mmap, shared memory and shared libraries.

His program: (pstat_64.c)

#define _PSTAT64
#include <sys/param.h>
#include <sys/pstat.h>
#include <sys/unistd.h>
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <pwd.h>

typedef struct shared_segment_struct {
    long long pst_space;
    long long pst_vaddr;
    int refs;
    struct shared_segment_struct *next;
} segment;

static segment *shared_segs = NULL;

void pstatvm(uid_t uid, int pid)
{
    struct pst_vm_status pst;
    int idx, count;
    long long shared_vm = 0;
    long long shared_ram = 0;
    long long private_vm = 0;
    long long private_ram = 0;

    idx=0;
    count = pstat_getprocvm(&pst, sizeof(pst), (size_t)pid, idx);
    while (count > 0) {
        switch ((long)pst.pst_type) {
            case PS_IO: break;
                /* Don't count IO space.  It really is not RAM or swap. */
            default: 
                if (pst.pst_flags & PS_SHARED) {
                    segment *s;
                    int refs = 1;
                    for (s=shared_segs; s; s=s->next) {
                        if (s->pst_space == pst.pst_space
                        && s->pst_vaddr == pst.pst_vaddr) {
                            refs = s->refs;
                            break;
                        }
                    }
                    shared_vm += (long long) pst.pst_length;
                    shared_ram += (long long)pst.pst_phys_pages/refs;
                } else {
                    private_vm += (long long) pst.pst_length;
                    private_ram += (long long)pst.pst_phys_pages;
                }
                break;
        }

        idx++;
        count = pstat_getprocvm(&pst, sizeof(pst), (size_t)pid, idx);
    }
    printf("%6d ", uid);
    printf("%6d ", pid);
    printf("%11lldK ", shared_vm*4);
    printf("%11lldK ", shared_ram*4);
    printf("%11lldK ", private_vm*4);
    printf("%11lldK ", private_ram*4);
    printf("%11.1fM\n", (shared_ram+private_ram)/256.0);
}

void pstatvm_uid(uid_t uid)
{
#define BURST ((size_t)10)

    struct pst_status pst[BURST];
    int i, count;
    int idx = 0; /* index within the context */

    /* loop until count == 0, will occur when all have been returned */
    while ((count = pstat_getproc(pst, sizeof(pst[0]), BURST, idx)) > 0)
    {
        /* got count (max of BURST) this time.  process them */
        for (i = 0; i < count; i++) {
            if (pst[i].pst_pid==0) continue; /* Can't getprocvm real pid 0 */
            if (pst[i].pst_uid==uid) {
                pstatvm(uid, pst[i].pst_pid);
            }
        }

        /*
         * now go back and do it again, using the next index after
         * the current 'burst'
         */
        idx = pst[count-1].pst_idx + 1;
    }

    if (count == -1)
        perror("pstat_getproc()");

#undef BURST
}

void pstat_refcount_all(void)
{
#define BURST ((size_t)10)

    struct pst_status pst[BURST];
    int i, count;
    int idx = 0; /* index within the context */

    /* loop until count == 0, will occur when all have been returned */
    while ((count = pstat_getproc(pst, sizeof(pst[0]), BURST, idx)) > 0)
    {
        /* got count (max of BURST) this time.  process them */
        for (i = 0; i < count; i++) {
            struct pst_vm_status vm;
            int reg;
            if (pst[i].pst_pid==0) continue; /* Can't getprocvm real pid 0 */
            reg=0;
            /* walk every region of this process; reg advances once per pass */
            while (pstat_getprocvm(&vm, sizeof(vm), (size_t)pst[i].pst_pid, reg) > 0) {
                segment *s;
                /* bump the count if this region has been seen before */
                for (s=shared_segs; s; s=s->next) {
                    if (s->pst_space == vm.pst_space
                    && s->pst_vaddr == vm.pst_vaddr) {
                        s->refs += 1;
                        break;
                    }
                }
                /* otherwise record it as a new region with one reference */
                if (!s) {
                    s = (segment *) malloc(sizeof(segment));
                    s->pst_space = vm.pst_space;
                    s->pst_vaddr = vm.pst_vaddr;
                    s->refs = 1;
                    s->next = shared_segs;
                    shared_segs = s;
                }
                reg++;
            }
        }
        idx = pst[count-1].pst_idx + 1;
    }

    if (count == -1)
        perror("pstat_getproc()");

#undef BURST
}

int main(int argc, char *argv[])
{
    int i;
    pstat_refcount_all();
    printf("   uid    pid    shared_vm   shared_ram   private_vm  private_ram      Res_Mem\n");
    if (argc > 1) {
        for (i=1; i<argc; i++) {
            struct passwd *p = getpwnam(argv[i]);
            if (p) {
                pstatvm_uid(p->pw_uid);
            } else {
                pstatvm_uid(atoi(argv[i]));
            }
        }
    } else {
        pstatvm_uid(getuid());
    }
    return 0;
}
 

To answer your questions: yes, it is possible for short-lived processes never to show up. In fact, it is very unlikely that a short-lived process will show up in ps or top. Those programs read the process table with techniques that are almost as fast as a memory-to-memory data move. Once they have this snapshot, they prepare a report. top repeats this every n seconds, but top presents you with another problem: it does not try to show all processes, just the "top" ones. You have a better shot with ps. A short-lived process can be gone in well under a tenth of a second. Let's say that yours lasts exactly one tenth of a second. That means ps must capture its process table snapshot at some point during that tenth of a second, which is very hard to arrange reliably. A very clever wrapper program that runs both ps and your program nearly simultaneously might be able to do it.
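To illustrate the snapshot problem, here is a rough sketch of a polling loop that takes back-to-back ps snapshots and keeps every distinct line matching a pattern. The function name `sample_ps`, the pattern and the iteration count are assumptions for illustration, not an existing HP-UX tool. Each ps run is still a single snapshot, so a process that starts and exits between two runs can slip through even this loop.

```shell
#!/bin/sh
# sample_ps PATTERN COUNT
# Take COUNT back-to-back ps snapshots and print each distinct line that
# matches PATTERN (an awk regular expression). Hypothetical helper.
sample_ps() {
    pattern=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        # On HP-UX the -o option needs UNIX95 set; it is set for this one
        # command only, so the rest of the shell is unaffected.
        UNIX95= ps -e -o pid,ppid,args
        i=$((i + 1))
    done | awk -v pat="$pattern" '$0 ~ pat && !seen[$0]++'
}

# Example (not run here): sample 200 snapshots for commands containing "get".
# Writing the pattern as [g]et stops the awk filter from matching itself.
#   sample_ps '[g]et' 200
```

Because the output includes the ppid column, any hit also tells you which process launched it, as long as the parent is still alive when you look it up.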

But if the program shows up in glance but not in ps/top, I tend to suspect something else. It could be that the program is destroying its command line. You might try 'ps -el' and see if it shows up.