Problem with nmon, actual CPU usage per process

zaxxon · March 8, 2013, 11:43am

Hi all,

I am currently having trouble getting nmon to print the actual CPU usage of a process over an interval.
According to the manual, something like

# time nmon -t -C cron -s 5 -c 2 -F outfile

real    0m0.98s
user    0m0.03s
sys     0m0.04s

should print at least the process information for cron over an interval of 2 x 5 seconds.
I also tried it without naming a process (without -C) and with other parameters, but no luck. The output contains very general information about everything else, but nothing about any processes.
What I also do not understand is why it always finishes in much less time than I specified with -s and -c.

I am currently on AIX 6100-06-05-1115, and I am not root. However, when I start nmon in its online mode and press "t", I do get the top view as a non-root user.

Any help is welcome. Alternatives for getting the current CPU usage of a process over a specified interval are welcome too.
I also tried pprof, but like ps it seems to show values (ACCT_TIME) that do not help me at all, since they appear to be the usage accumulated since the process was started, which is not what I am looking for. I also checked tprof, but as far as I can tell it only works for processes that are started with it, not for processes that are already running.

In the IBM developerWorks Wiki I found Nigel Griffiths' entry for a C program to get the process information (IBM Developer).
He states that you have to take at least two measurements and calculate the difference (I guess you also have to relate this to the other processes etc., since the values I got did not tell me much).
I am looking for an easier way if any.
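The "two measurements and a difference" idea can be sketched in plain shell with ps, as a minimal illustration and not the C program from the wiki. The function names below are hypothetical. The sketch assumes `ps -o time= -p PID` prints the cumulative CPU time as [[dd-]hh:]mm:ss, which only has one-second resolution, so short intervals give coarse results, and a multi-threaded process can exceed 100%.

```shell
# Convert a ps TIME value ([[dd-]hh:]mm:ss) into whole seconds.
cputime_to_seconds() {
    echo "$1" | awk '{
        n = split($0, f, /[-:]/)
        mult[0] = 1; mult[1] = 60; mult[2] = 3600; mult[3] = 86400
        s = 0
        # Walk the fields right to left: seconds, minutes, hours, days
        for (i = n; i >= 1; i--) s += f[i] * mult[n - i]
        print s
    }'
}

# CPU percentage used between two cumulative readings (in seconds)
# taken $3 seconds apart, rounded to the nearest integer.
interval_cpu_pct() {
    awk -v a="$1" -v b="$2" -v t="$3" 'BEGIN { printf "%d\n", (b - a) / t * 100 + 0.5 }'
}

# Hypothetical wrapper: sample one PID twice, $2 seconds apart (default 10).
sample_cpu_pct() {
    pid=$1 secs=${2:-10}
    before=$(ps -o time= -p "$pid") || return 1
    sleep "$secs"
    after=$(ps -o time= -p "$pid") || return 1
    interval_cpu_pct "$(cputime_to_seconds "$before")" \
                     "$(cputime_to_seconds "$after")" "$secs"
}
```

Something like `sample_cpu_pct 12345 10` (the PID is made up) would then report the usage of that process during those 10 seconds only.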

Thanks in advance!

DGPickett · March 8, 2013, 3:12pm

AIX nmon is not open source nmon -- where did you get examples?
See bottom of page first url: nmon for Linux | Main / HomePage

developerWorks: Wikis - Systems - nmon

developerWorks: Wikis - Systems - nmon Manual

zaxxon · March 9, 2013, 3:40am

I was only looking at the IBM pages. There are several examples there that use the switches I used.

DGPickett · March 11, 2013, 10:23am

Well, for starters, 'time' is just a distraction, timing nmon and producing that printout.

After that:

the -t is for top processes,
-C is CPU utilization for many CPUs where graph does not fit,
cron seems like an orphan value.
-s is not a listed option,
-c does not take a value and
-F is not a listed option.

_XrAy · March 11, 2013, 11:10am

Hi zaxxon,

NMON does not seem to work properly with the process option "-C" and recording mode "-f". It only shows the TOP processes.

If you specify a recording option "-f", the nmon process goes to background (init) and your command "time nmon -t -C cron -s 5 -c 2 -F outfile" returns immediately
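If the recording really detaches like this, a script has to wait for the background nmon to finish before reading the output file. A minimal sketch of such a wait loop follows. `wait_for_pid` is a hypothetical helper, and the nmon call in the comment is an untested assumption based on the help text quoted later in this thread, which says that -p prints the background PID and that the f/F option should be the first argument.

```shell
# Poll until the given PID has gone away, or give up after $2 seconds
# (default 60). Returns 0 when the process ended, 1 on timeout.
wait_for_pid() {
    pid=$1
    timeout=${2:-60}
    waited=0
    # kill -0 sends no signal, it only checks whether the PID still exists
    while kill -0 "$pid" 2>/dev/null; do
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical usage (not verified on AIX):
#   pid=$(nmon -F outfile -t -s 5 -c 2 -p)
#   wait_for_pid "$pid" 30 && grep TOP outfile
```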

The tprof command should work:

tprof -x sleep 15
grep "cron" sleep.prof

/usr/sbin/cron                            3   0.29   0.29   0.00   0.00   0.00
/usr/sbin/cron        7209028 12386355   0.12   0.12   0.00   0.00   0.00
/usr/sbin/cron        5111970 12845245   0.12   0.12   0.00   0.00   0.00
/usr/sbin/cron       12714162 17825889   0.06   0.06   0.00   0.00   0.00

Configuration information
=========================
System: AIX 7.1 Node: sradvu002 Machine: 00F6C66C4C00
Tprof command was:
    tprof -x sleep 15
Trace command was:
    /usr/bin/trace -ad -M -L 1073741312 -T 500000 -j 00A,001,002,003,38F,005,006,134,210,139,5A2,5A5,465,234,5D8, -o -
Total Samples = 1704
Traced Time = 15.01s (out of a total execution time of 15.01s)
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Process                                Freq  Total Kernel   User Shared  Other
=======                                ====  ===== ======   ==== ======  =====
wait                                     16  98.00  98.00   0.00   0.00   0.00
/usr/bin/perl                             3   0.76   0.23   0.00   0.53   0.00
/usr/sbin/cron                            3   0.29   0.29   0.00   0.00   0.00
/usr/sbin/syncd                           1   0.18   0.18   0.00   0.00   0.00
/usr/bin/ksh                              1   0.12   0.12   0.00   0.00   0.00
/usr/bin/sh                               2   0.12   0.06   0.06   0.00   0.00
/usr/bin/grep                             2   0.12   0.12   0.00   0.00   0.00
/usr/bin/sleep                            1   0.06   0.06   0.00   0.00   0.00
/usr/bin/tprof                            1   0.06   0.06   0.00   0.00   0.00
/usr/bin/hostname                         1   0.06   0.06   0.00   0.00   0.00
/usr/bin/trcstop                          1   0.06   0.00   0.00   0.06   0.00
/usr/bin/basename                         1   0.06   0.06   0.00   0.00   0.00
/usr/bin/printf                           1   0.06   0.06   0.00   0.00   0.00
/usr/bin/cut                              1   0.06   0.06   0.00   0.00   0.00
=======                                ====  ===== ======   ==== ======  =====
Total                                    35 100.00  99.35   0.06   0.59   0.00

.....SNIP....

zaxxon · March 12, 2013, 6:19am

@DGPickett

Thanks, but the options I used are valid switches: they are displayed when I issue nmon -h, and they are listed in the man page as well as in the documentation in the IBM Wiki.

@-=Xray=-
Thanks, but tprof is for monitoring a program that you start, like sleep. I have processes that are already running, and I have to check them over an interval of, let's say, 10 seconds and get their actual CPU usage for exactly this interval.
I don't want the usage since the process came to life (that would be regular ps), nor do I have the option to start a program.
As for the explanation of -f together with -c and -s for nmon: it seems a pity that this combination is accepted, because it looks to me as if the output file is written once and does not get any additions 5 or 10 seconds later.
But I got your point, and I think I have tried that already. I will have another look into it, but iirc the values I got back were not what I was looking for.

The background of all this is that I am about to write a plugin for Nagios that simply captures the actual CPU usage of a process over the specified interval.
The plugins I have found usually issue a ps and work with values that are the average since the process came to life.
When that is several days and you currently have a workload peak, you get back, for example, just 12% CPU usage while it is actually 77%.
So this approach is in my eyes "useless".
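To make the dilution effect concrete, here is a small worked example. The figures are invented to reproduce the 12% and 77% mentioned above, and `cpu_pct` is a hypothetical helper, not part of any plugin.

```shell
# Percentage of one CPU used: CPU seconds consumed / wall-clock seconds elapsed.
cpu_pct() {
    awk -v cpu="$1" -v wall="$2" 'BEGIN { printf "%.0f\n", cpu / wall * 100 }'
}

# Hypothetical process: alive for 5 days (432000 s), 51840 CPU seconds in total.
cpu_pct 51840 432000    # lifetime average, what a ps-based check reports

# The same process burning 7.7 CPU seconds during the last 10 seconds.
cpu_pct 7.7 10          # usage within the interval
```

The first call prints 12 and the second prints 77: the longer the process has been alive, the less a current peak moves the lifetime average.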

Maybe I explained it better now. If you have any other ideas, let me know. Thanks a lot to both of you for your efforts so far.

_XrAy · March 12, 2013, 6:36am

hmm,

I ran tprof with my personal user ID and saw all currently running processes in the output.

zaxxon · March 12, 2013, 10:55am

Yes, that's correct.

What I am still unsure about: I have no process that has been running for a long time and is peaking right now, so I cannot verify that the values tprof shows reflect the interval and not, again, some kind of average.

When I have an infinite loop running, it stays at a certain level for its whole life span. So I can't tell whether the output reflects the actual CPU usage in the interval or just the average, since it was always at this level.
I checked boxes in my environment that have some amount of traffic, but sadly the processes involved have very jumpy CPU usage and are therefore not suitable for a test to prove the output of tprof.

If you have such processes that have already been running for some hours or days, you could check that for me, if you like.

DGPickett · March 12, 2013, 12:00pm

Can you give us the breakout of what the options do in your version of nmon?

zaxxon · March 12, 2013, 12:31pm

Sure, there you go (these days nmon ships as part of AIX, alongside topas or as its replacement):

notroot@somehost:/home/notroot# nmon -v
nmon version TOPAS-NMON build AIX
notroot@somehost:/home/notroot# oslevel -s
6100-06-05-1115
notroot@somehost:/home/notroot# which nmon
/usr/bin/nmon
notroot@somehost:/home/notroot# lslpp -w /usr/bin/nmon
  File                                        Fileset               Type
  ----------------------------------------------------------------------------
  /usr/bin/nmon                               bos.perf.tools        File
notroot@somehost:/home/notroot# nmon -h

Hint: topas_nmon [-h] [-s <seconds>] [-c <count>] [-f -d -t -r <name>] [-x]
 Command: TOPAS_NMON
        -h            FULL help information - much more than here
        Interactive-Mode:
        read startup banner and type: "h" once it is running
        For Data-Collect-Mode (-f)
        -f            spreadsheet output format [note: default -s300 -c288]
        optional
        -s <seconds>  between refreshing the screen [default 2]
        -c <number>   of refreshes [default millions]
        -t            spreadsheet includes top processes
        -x            capacity planning (15 min for 1 day = -fdt -s 900 -c 96)

For Interactive-Mode
        -s <seconds>  between refreshing the screen [default 2]
        -c <number>   of refreshes [default millions]
        -g <filename> User decided Disk Groups
                      - file = on each line: group_name <hdisk_list> space separated
                      - like: rootvg hdisk0 hdisk1 hdisk2
                      - upto 32 groups hdisks can appear more than once
        -b            black and white [default is colour]
        -B            no boxes [default is show boxes]
        example: topas_nmon -s 1 -c 100

For Data-Collect-Mode = spreadsheet format (comma separated values)
        Note: use only one of f,F,z,x or X and make it the first argument
        -f            spreadsheet output format [note: default -s300 -c288]
                         output file is <hostname>_YYYYMMDD_HHMM.nmon
        -F <filename> same as -f but user supplied filename
        -r <runname>  goes into spreadsheet file [default hostname]
        -t            include top processes in the output
        -T            as -t plus saves command line arguments in UARG section
        -Y            like -t but all commands with the same name are added up and reported
                      Note: you can have only one of -t, -T or -Y (last on the cmd line wins)
        -s <seconds>  between snap shots
        -c <number>   of refreshes
        -w <number>   Timestamp size (Tnnnn), values4 to 16, for analyser use 4 or 8
        -l <dpl>      disks/line default 150 to avoid spreadsheet issues. For EMC use 64
        -g <filename> User decided Disk Groups (see above -g)
        -d            Include Disk Service time sections
        -k <disklist> Only report these disks also works online  (Example -k hdisk3,hdisk23,hdisk44)
        -G            Use UTC/GMT standard time (not local time)
        -K            Include RAW Kernel & LPAR sections (RAWLPAR & RAWCPUTOTAL)
        -D            Skip disk configuration sections
        -E            Skip ESS  configuration sections
        -J            Skip JFS sections
        -V            Include disk Volume Group section
        -P            Include Paging Space section
        -M            Include MEMPAGES section = detailed memory stats per page size
        -N            Include NFS section, use -NN for NFSv4 stats.
        -W            Include WLM sections
        -S            Include WLM sections with SubClasses
        -^            Include Fibre Channel (FC) sections
        -O            Include Shared Ethernet Adpater (SEA) VIOS only sections
        -L            Include LARGE page section
        -I <percent>  Ignore process percent threshold (default 0.1)
                      don't save TOP stats if proc using less CPU than this %
        -A            Include Async I/O Section
        -m <dir>      nmon changes to this directory before saving data to a file
        -Z <priority> set nice priority -20=important to 20=unimportant (negative only for root user)
        example: collect for 1 hour at 30 second intervals with top procs
                 topas_nmon -f -t -r Test1 -s30 -c120

        To load into a spreadsheet like Lotus 1-2-3:
        sort -A *nmon >stats.csv
        transfer the stats.csv file to your PC
        Start 1-2-3 and then Open <char-separated-value ASCII file>

Capacity planning mode - use cron to run each day
        -x            sensible spreadsheet output for CP =  one day
                      every 15 mins for 1 day ( i.e. -ft -s 900 -c 96)
        -X            sensible spreadsheet output for CP = busy hour
                      every 30 secs for 1 hour ( i.e. -ft -s 30 -c 120)

Set-up and installation
        To enable disk stats as root: chdev -l sys0 -a iostat=true
        - this adds the disk % busy numbers (otherwise they are zero)
        If you have hundreds of disk this can take 1% to 2% CPU

Interactive Mode Commands
        key --- Toggles to control what is displayed ---
        h   = Online help information
        r   = Resources pSeries type, machine name, cache details and AIX version + LPAR
        p   = Partitions stats
        c   = CPU by processor stats with bar graphs
                 #=toggle PURR based stats (POWER5/6 shared CPUs only)
        C   = CPU by processor stats for high numbers of CPU
        l   = long term CPU (over 75 snapshots) with bar graphs
        m   = Memory and Paging stats
        M   = Multiple Page Size stats in pages - 2nd time in MB's
        k   = Kernel Internal stats
        n   = Network stats
        =   = For Network & Disk Toggle KB/s to MB/s
        O   = Shared Ethernet Adapter VIOS only
        N   = NFS Network File System stats (2nd N gets you NFSv4)
        d   = Disk I/O Graphs (see -k to limit to specific disks)
        D   = Disk I/O Stats - multiple times gets you more stats
        o   = Disk I/O Map (one character per disk showing how busy it is)
        g   = Disk Group I/O Stats (have to use -g commandline option)
        a   = Adapter I/O Stats
        ^   = Fibre Channel Adapter via fcstat command
        [   = Start an On demand nmon recording
        ]   = Stop an On demand nmon recording
        e   = ESS vpath Logical Disk I/O Stats
        V   = Disk Volume Group stats
        P   = Paging Space stats
        j   = JFS Stats
        t   = Top Process Stats  1=Basic-Details 2=Accumulated-CPU
               Performance sorted by 3=CPU 4=Size 5=I/O
        u   = Top but with command arguments shown (used with 3,4 & 5)
               to refresh arguments (for new processes) hit u twice
        U   =  as u plus Workload Classes/WPAR
        W   =  Workload Management (WLM) Stats
        S  =  WLM with SubClasses
        w   = use with top to show AIX wait processes (good for SMP)
        A   = Summarise Async I/O (aioserver) processes
        v   = Verbose this highlights problems on the machine and
              categorises them as either danger, warnings or OK
        b   = black and white mode (or use -b option)
        .   = minimum mode i.e. only busy disks and processes
        ~   = switch to topas screen

        key --- Other Controls ---
        +   = double the screen refresh time
        -   = halves the screen refresh time
        q   = quit (also x)
        0   = reset peak counts to zero (peak = ">")
        space = refresh screen now

Startup Control
        If you find you always type the same toggles every time you start
        then place them in the NMON shell variable. For example:
         export NMON=cmdrvtan

Others:
        a)   Do you want to stop nmon - kill -USR2 <nmon-pid>
        b) Use -p and nmon outputs the background process pid
        c) To limit the processes nmon lists (online and to a file)
           Either set NMONCMD0 to NMONCMD63 to the program names
           or use -C cmd:cmd:cmd etc. example: -C ksh:vi:syncd
        d) To limit the disks nmon lists up to 64 (online only)
           Use -k diskname,diskname,diskname (Example -k hdisk2,hdisk0,hdisk3)

As said, other combinations of switches, e.g. without -C, did not produce the output I expected either.

DGPickett · March 12, 2013, 1:00pm

That's one funky help, with -C appname mentioned only parenthetically! A 'B' effort. I guess ignorant riffraff need not apply.

Well, now you are awash in tools. Do you have one that fits your needs?

zaxxon · March 13, 2013, 8:24am

I will try the tprof way again like -=xray=- suggested. Thanks all^^

zaxxon · March 14, 2013, 9:04am

If anybody is interested in the function/code I ended up with, here it is:

...
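get_proccpu()
{
        # $1 = process name exactly as listed in the tprof report
        # $2 = sampling interval in seconds (default: 10)
        PATTERN=$1
        INTERVAL=${2:-10}

        # tprof -x sleep writes its report to sleep.prof in the current directory
        T_FILE=sleep.prof

        # Profile the whole system for $INTERVAL seconds
        tprof -x sleep "$INTERVAL" > /dev/null 2>&1

        # Take the rounded Total % from the first "Process" summary table
        RESULT=$( awk -v p="$PATTERN" '/^Process/ {c++; next} c == 1 && $1 == p {print int($3+0.5)}' "$T_FILE" )

        # Delete the tprof report
        rm -f "./${T_FILE}" 2> /dev/null
}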
...

$PATTERN is a string that must equal the process name in the first column of tprof's output file, here "sleep.prof".
$RESULT is later compared against thresholds.
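The awk step can be tried on a saved report without running tprof again. A minimal sketch, assuming the report layout shown earlier in this thread (a "Process Freq Total ..." summary table first, per-PID tables after it). `tprof_pct` is a hypothetical name, and the non-zero exit status for a missing process is an addition that the function above does not have.

```shell
# Print the rounded Total % of one process from a tprof report.
# $1 = process name exactly as shown in column 1
# $2 = report file (default: sleep.prof)
# Exits with status 1 if the name is not in the first summary table.
tprof_pct() {
    awk -v p="$1" '
        /^Process/ { section++; next }     # each table starts with a "Process" header
        section == 1 && $1 == p {          # only look at the first (summary) table
            print int($3 + 0.5)            # column 3 is Total, rounded to an integer
            found = 1
            exit
        }
        END { if (!found) exit 1 }
    ' "${2:-sleep.prof}"
}

# Hypothetical usage after "tprof -x sleep 10":
#   RESULT=$(tprof_pct /usr/sbin/cron sleep.prof) || echo "process not found"
```

With the sample report above, the "wait" row (Total 98.00) would give 98, and the cron row (Total 0.29) would round down to 0.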

DGPickett · March 14, 2013, 2:07pm

Does using a variable T_FILE add any value?

Can you create a permanent named pipe sleep.prof (sbin/mknod NAME p) and read it real time? No need to remove.
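A rough sketch of the named-pipe idea, using the POSIX mkfifo command in place of mknod. `fifo_capture` is a hypothetical helper. Whether tprof writes into an existing FIFO called sleep.prof, or replaces it with a regular file, is an untested assumption; if the tool never opens the FIFO, the reader blocks forever.

```shell
# Stream a report that a tool writes to a fixed path, without storing it on disk.
# $1 = path the tool writes to, remaining arguments = the tool's command line.
# The report contents are copied to stdout; the FIFO is left in place for reuse.
fifo_capture() {
    fifo=$1
    shift
    [ -p "$fifo" ] || mkfifo "$fifo" || return 1
    cat "$fifo" &            # reader: blocks until a writer opens the FIFO
    reader=$!
    "$@"                     # writer: the tool must open $fifo for writing itself
    rc=$?
    wait "$reader"           # let the reader drain the pipe before returning
    return $rc
}

# Hypothetical usage (not verified on AIX):
#   fifo_capture sleep.prof tprof -x sleep 10 | grep cron
```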

zaxxon · March 15, 2013, 10:11am

The Nagios guidelines say that you should avoid temporary files if possible, so I wanted to keep it as tight as possible. Sure, there is not much value in using a variable to substitute a filename in two places. I remove the file because I don't want to leave any "rubbish" behind.
On the other hand, the plugin handling of GroundWork (which sits on top of Nagios) is just a download from an httpd and doesn't clean up anything on the clients where the plugins run anyway.
I read about named pipes on the IBM help page, but I think it is OK this way, thanks.

DGPickett · March 15, 2013, 10:38am

Writing a pipe means never having name, space, security or permission problems. A named pipe can have name and permission problems, I guess! Is there a way to direct the logging to an arbitrary file name? Does AIX have >(...) in ksh? The bash always has it, but if there is no /proc/##/dev/fd/# or the like in the O/S, bash uses a mknod named pipe in /var/tmp/ (and they pile up - no purge). The */fd/# pipes are the best, named but private, local and ephemeral.