"too big" and "not enough memory" errors in shell script

Hi,

This is odd, but here goes. We have several shell scripts that run in our production environment, an AIX 595 LPAR with sufficient resources: 14GB of physical memory and 5 CPUs. From time to time these scripts fail with the errors shown below. When this happens there is not much activity on the box, and the errors eventually go away on their own; the scripts are scheduled from TWS (Maestro) and run all day long, and they run just fine during the peak hours of activity. Can someone tell me what to check or do for these types of errors?

 vmstat
System Configuration: lcpu=5 mem=14336MB
 oslevel
5.2.0.0
AAA.sh_20090203.log:/entH/bin/AAA.sh[277]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[278]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[279]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[287]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.

XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[106]: /usr/bin/grep: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[124]: /usr/bin/grep: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[124]: /usr/bin/wc: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[129]: /usr/bin/ls: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[130]: /usr/bin/wc: too big

YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[107]: /usr/bin/grep: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[125]: /usr/bin/wc: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[125]: /usr/bin/grep: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[130]: /usr/bin/ls: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[131]: /usr/bin/cat: too big

ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[107]: /usr/bin/grep: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[125]: /usr/bin/wc: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[125]: /usr/bin/grep: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[130]: /usr/bin/ls: too big

Please advise.
Thanks
Jerardfjay

The problem is not the system memory. The limitation is the memory available to the commands you are running (grep, for example) when they read large files.

These must be very large files.

Can you post the size of the files?

For instance, here are the typical sizes of the files that got the error on the "rm" command. These are not big by any means.

-rw-rw-rw-   1 userid   groupid          266 Feb 05 12:12 AAA.sh_158098.tmp
-rw-rw-rw-   1 userid   groupid          327 Feb 05 12:12 AAA.sh_158098.sql.0
-rw-rw-rw-   1 userid   groupid            6 Feb 05 12:12 AAA.sh_158098.0.AK5File
-rw-rw-rw-   1 userid   groupid           18 Feb 05 12:12 AAA.sh_158098.0.AK2File
-rw-rw-rw-   1 userid   groupid           78 Feb 05 12:12 AAA.sh_158098.0

You are right. That's odd. I'll think about it and post back.

In the meantime, if anyone has a suggestion, please post.

Neo,

It is very strange. The memory errors occur more frequently than the "too big" errors. One of the things IBM support asked us to do for the "too big" error was to increase the value of ncargs from 512 to 1024 (initially this value was 6). However this did not have a positive impact.
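For reference, ncargs (the maximum combined size of the arguments and environment passed to a command, in 4KB blocks) is an attribute of sys0 and can be checked and changed like this - a minimal sketch, the value shown is only an example:

 lsattr -El sys0 -a ncargs
 chdev -l sys0 -a ncargs=1024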

Here are the individual limits for the userid under which the scripts execute.

 ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         unlimited
stack(kbytes)        4194304
memory(kbytes)       unlimited
coredump(blocks)     unlimited
nofiles(descriptors) 8000

If anyone has any ideas we are willing to give it a shot. Thx.

Hmm, when a command as small as "/usr/bin/rm" no longer fits into memory (as per your first post here), something is seriously wrong.

If the problem goes away temporarily with a reboot, chances are you have a memory leak. This sort of problem is sometimes hard to track down because of its ephemeral nature. You could wait until the problem shows up and then use "ps -Alo vsz<,other options>" to get the virtual memory footprint of every process running. See the man page of "ps" for more information about possible options.
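Something along these lines - just a sketch, the log path is arbitrary - could be run when the errors appear, to record the biggest memory consumers:

 # top 20 processes by virtual size (vsz, in KB), with a timestamp
 date >> /tmp/memhogs.log
 ps -Alo vsz,pid,ppid,user,comm | sort -rn | head -20 >> /tmp/memhogs.log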

Another common source of problems is java processes, which are generally known to be memory hogs. Find out whether any java processes are running ("ps -fe | grep java"); if there are, you have a likely cause for your problem.

Further, let's examine the overall memory situation of your machine. Please post the output of the following commands:

"svmon -G" (only as root)

"vmstat -v"

"lsps -a"

Also examine the crontabs of all users on the machine. Maybe some memory hog is started regularly and that is what's causing your problems. The loop sketched below will dump them all.
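A quick way to look at them all at once (as root; this assumes the default crontab directory /var/spool/cron/crontabs):

 for f in /var/spool/cron/crontabs/*
 do
     echo "==== crontab for $(basename $f) ===="
     cat "$f"
 done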

I hope this helps.

bakunin

Bakunin,

I concur. I am not sure what is causing these errors; they seem to last for a couple of minutes and then do not appear again for the rest of the day. Then they will not occur for a couple of days before cropping up again. It seems very strange. We do have java processes that are event-driven by MQ; in fact, our MQ itself has a java wrapper around it. Since I am not a root user, I will post everything except the svmon -G output.
Here are the rest of the outputs that you asked for.

ps -ef | grep java
     xxx 140596 140870   0 09:12:51      -  0:01 /usr/java131/jre/sh/java -mx256m -ms64m abcde.xx.yy.gpc.gco.cm.mqft.dispatcher.MqftDispatcherProgram /appl01/GPC/conf/mqft.disp.cfg
 userid 156590 135086   0 09:13:01  pts/5  0:00 grep java
 vmstat -v
              3670016 memory pages
              3570524 lruable pages
               252061 free pages
                    1 memory pools
               222137 pinned pages
                 80.1 maxpin percentage
                  3.0 minperm percentage
                 80.0 maxperm percentage
                 79.9 numperm percentage
              2856298 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 79.9 numclient percentage
                 80.0 maxclient percentage
              2856298 client pages
                    0 remote pageouts scheduled
                10520 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 3026 filesystem I/Os blocked with no fsbuf
                    0 client filesystem I/Os blocked with no fsbuf
                39249 external pager filesystem I/Os blocked with no fsbuf
 lsps -a
Page Space      Physical Volume   Volume Group    Size %Used Active  Auto  Type
hd6             hdisk45           rootvg        6144MB     1     yes   yes    lv

As soon as I can get one of the UNIX admins to work with me on the svmon -G command, I will post the results of that command as well. Thank you for your comments.

Jerardfjay

When these errors occur, can you test a grep or wc on one of these files to check whether it also happens when run manually from the shell?

Also, maybe you should set up some kind of monitoring: either something self-written with vmstat etc., or set up nmon. A rough example of the self-written kind is sketched below.
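Just a sketch - the log path and interval are arbitrary:

 #!/usr/bin/ksh
 LOG=/tmp/memwatch.log
 while true
 do
     echo "==== $(date) ====" >> $LOG
     vmstat 1 3 >> $LOG        # three one-second samples of memory/cpu
     lsps -a    >> $LOG        # paging space usage
     sleep 300                 # repeat every five minutes
 done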

Here is the output from the svmon command on our box.

svmon -G
               size      inuse       free        pin    virtual
memory      3670016    3411288     258728     222409     554969
pg space    1572864       1402

               work       pers       clnt      lpage
pin          222409          0          0          0
in use       554989          0    2856299          0

Thx
Jerardfjay

Let's see: the output of "svmon" is in memory pages, which are 4k in AIX. The "size" and "inuse" values give the physical memory and how much of it is in use. The machine has ~14GB of memory installed (3670016 pages x 4KB = 14336MB) and uses nearly all of it constantly. That the machine uses all of the physically installed memory is OK and to be expected.

The "virtual" column is the overall memory used by applications. The number is small compared to the number of installed memory and this means that the machine has enough memory for its day-to-day-operation. These figures are statistical in nature and this shows that your memory problems are short peaks of dramatically increased memory demand in a otherwise relatively idle machine.

The one java process you found is IMHO not the problem. If I interpret the -mx256m flag correctly, it is configured to use at most 256MB of heap, and that should be no big problem.

The output of "vmstat" shows nothing exceptional and the "lpstat" shows you have only 6GB of swap configured. This is a bit on the light side for 14GB of real memory, but otherwise only 1% of the swap is in use - it doesn't seem that you need more right now.

This leaves the question of what goes wrong on your machine. You said you experience the problems only in very short timeframes. Start by searching the crontabs of all users; you might find one or several troublemakers that are called only rarely. (I had such a situation once, when a machine was experiencing a severe memory shortage with heavy paging activity every three days. We analyzed the situation and found out that a "mksysb" run was responsible. We moved this mksysb run to another time with less activity and the problem never happened again.)

I hope this helps.

bakunin

Another idea being thrown around is a possible overrun of heap memory. Since we have a 32-bit OS and 32-bit applications on this system, are there any known limits for this type of memory? Please advise.
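For reference, my (possibly incomplete) understanding is that a 32-bit AIX process gets a single 256MB segment for its heap by default, and that this limit can be raised per process with the maxdata link option or the LDR_CNTRL environment variable, e.g.:

 # example only: allow up to 512MB of heap for one program invocation
 LDR_CNTRL=MAXDATA=0x20000000 /path/to/program

Is that the kind of limit we could be hitting here?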

Thx
Jerardfjay

I have seen the "too big" error come out when accessing a large number of files, say over 2000. how many files are you trying to process? Also do you have a sample of the scripts that you are trying to run? What else is going on on the server at the time of errors?