This is odd, but here goes. There are several shell scripts that run on our production AIX 595 LPAR machine, which has sufficient memory (14GB physical) and horsepower (5 CPUs). However, from time to time we get the following errors in these shell scripts. There is not much activity going on when they happen, and the errors seem to go away eventually, since the scripts that get them are scheduled from TWS (Maestro) and run all day long. They run just fine during the peak hours of activity on the box. Can someone tell me what to check or do for these types of errors?
vmstat
System Configuration: lcpu=5 mem=14336MB
oslevel
5.2.0.0
AAA.sh_20090203.log:/entH/bin/AAA.sh[277]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[278]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[279]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
AAA.sh_20090203.log:/entH/bin/AAA.sh[287]: /usr/bin/rm: 0403-013 There is not enough memory available to run the command.
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[106]: /usr/bin/grep: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[124]: /usr/bin/grep: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[124]: /usr/bin/wc: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[129]: /usr/bin/ls: too big
XXXXX.sh_20090203.log:/entH/bin/XXXXX.sh[130]: /usr/bin/wc: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[107]: /usr/bin/grep: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[125]: /usr/bin/wc: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[125]: /usr/bin/grep: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[130]: /usr/bin/ls: too big
YYYYY.sh_20090203.log:/entH/bin/YYYYY.sh[131]: /usr/bin/cat: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[107]: /usr/bin/grep: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[125]: /usr/bin/wc: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[125]: /usr/bin/grep: too big
ZZZZZ.sh_20090203.log:/entH/bin/ZZZZZ.sh[130]: /usr/bin/ls: too big
The problem is not system memory. The limitation is the memory available to the commands you are running (grep, for example) when they read the large files.
It is very strange. The memory errors occur more frequently than the "too big" errors. One of the things IBM support asked us to do for the "too big" error was to increase the value of ncargs from 512 to 1024 (initially this value was 6). However, this did not have a positive impact.
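For reference, this is how the ncargs setting (the size of the argument/environment area passed to exec(), in 4KB blocks) is typically checked and changed on AIX. These are AIX-specific configuration commands shown as a sketch; the value 1024 matches what IBM support suggested above:

```shell
# Show the current ncargs setting on AIX:
lsattr -El sys0 -a ncargs

# Raise it to 1024 4KB blocks (4MB of argument space). On recent AIX
# levels this attribute can be changed dynamically, without a reboot:
chdev -l sys0 -a ncargs=1024
```

If the "too big" errors really came from oversized argument lists, raising ncargs would help; since it did not here, the arg-list size may not be the actual cause.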
Here are the individual limits for the user ID under which the scripts execute.
Hmm, when a command as small as "/usr/bin/rm" no longer fits into memory (as per your first post here), something is seriously wrong.
If the problem goes away temporarily with a reboot, chances are you have a memory leak. This sort of problem is sometimes hard to track down because of its ephemeral nature. You could wait until the problem shows up and then use "ps -Alo vsz,<other options>" to get the virtual memory footprint of every running process. See the man page of "ps" for more information about possible options.
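A minimal sketch of such a snapshot, suitable for running repeatedly while waiting for the problem window ("-Alo vsz,pid,comm" is the AIX flavor of the option; "-eo" is accepted on most Unixes; the log path is just an example):

```shell
#!/bin/sh
# Append a timestamped list of the ten largest processes by virtual
# memory size (first column, in KB) to a log file, so short-lived
# memory peaks can be correlated with the error timestamps later.
LOG=${LOG:-/tmp/vsz_snapshot.log}
{
    date
    ps -eo vsz=,pid=,comm= | sort -rn | head -10
} >> "$LOG"
```

Run it every minute from cron (or in a sleep loop) and compare the log against the times the "not enough memory" errors appear.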
Another common source of problems is Java processes, which are generally known to be memory hogs. Find out if there are Java processes running ("ps -fe | grep java"); if there are, you have a likely cause for your problem.
Further, let's examine the overall memory situation of your machine. Please post the output of the following commands:
"svmon -G" (only as root)
"vmstat -v"
"lsps -a"
Also examine the crontabs of all users on the machine. Maybe some memory hog is started regularly, and that is what's causing your problems.
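One way to scan all user crontabs in a single pass. On AIX they live under /var/spool/cron/crontabs (reading them requires root); the directory is taken as a parameter here so the sketch can be pointed anywhere:

```shell
#!/bin/sh
# scan_crontabs DIR: print every active (non-comment, non-blank) line
# of each crontab file in DIR, prefixed with the owning user's name,
# to make rarely-run jobs easy to spot in one listing.
scan_crontabs() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue
        user=$(basename "$f")
        grep -v '^#' "$f" | grep -v '^[[:space:]]*$' | sed "s/^/$user: /"
    done
}

# On AIX (as root):
# scan_crontabs /var/spool/cron/crontabs
```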
I concur. I am not sure what is causing these errors; they seem to last for a couple of minutes and then not appear for the rest of the day. Then they don't occur for a couple of days, and then they crop up again. Seems very strange. We do have Java processes that are event-driven by MQ; in fact, our MQ itself has a Java wrapper around it. Since I am not a root user, I will post everything except the svmon -G output.
Here are the rest of the outputs that you requested.
vmstat -v
3670016 memory pages
3570524 lruable pages
252061 free pages
1 memory pools
222137 pinned pages
80.1 maxpin percentage
3.0 minperm percentage
80.0 maxperm percentage
79.9 numperm percentage
2856298 file pages
0.0 compressed percentage
0 compressed pages
79.9 numclient percentage
80.0 maxclient percentage
2856298 client pages
0 remote pageouts scheduled
10520 pending disk I/Os blocked with no pbuf
0 paging space I/Os blocked with no psbuf
3026 filesystem I/Os blocked with no fsbuf
0 client filesystem I/Os blocked with no fsbuf
39249 external pager filesystem I/Os blocked with no fsbuf
lsps -a
Page Space Physical Volume Volume Group Size %Used Active Auto Type
hd6 hdisk45 rootvg 6144MB 1 yes yes lv
As soon as I can get one of the UNIX admins to work with me on the svmon -G command I will post the results of that command as well. Thank you for your comments.
Let's see: the output of "svmon" is in memory pages, which are 4k in AIX. The "size" and "inuse" values show the physical memory and how much of it is in use. The machine has ~14GB of memory installed (about 3.7 million 4k pages) and uses nearly all of it constantly. That the machine uses all of the physically installed memory is OK and to be expected.
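The page arithmetic can be checked directly: the 3670016 pages from "vmstat -v", at 4KB each, is exactly the 14336MB that vmstat reported at the top of the thread.

```shell
#!/bin/sh
# Convert the page count from "vmstat -v" to MB (4KB per AIX page,
# 1024 KB per MB):
pages=3670016
echo $((pages * 4 / 1024))   # prints 14336 (MB), i.e. 14GB
```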
The "virtual" column is the overall memory used by applications. The number is small compared to the amount of installed memory, which means the machine has enough memory for its day-to-day operation. These figures are statistical in nature, and this shows that your memory problems are short peaks of dramatically increased memory demand in an otherwise relatively idle machine.
The one Java process you found is IMHO not the problem. If I interpret it correctly, it is configured to use 256MB, and this should be no big problem.
The output of "vmstat" shows nothing exceptional, and the "lsps -a" output shows you have only 6GB of swap configured. This is a bit on the light side for 14GB of real memory, but only 1% of the swap is in use, so it doesn't seem that you need more right now.
This leaves the question of what goes wrong on your machine. You said you experience the problems only in very short timeframes. Start by searching the crontabs of all users; you might find one (or several) troublemakers that are called only rarely. (I had such a situation once, when a machine experienced a severe memory shortage with heavy paging activity every three days. We analyzed the situation and found out that a "mksysb" run was responsible for the problem. We moved this mksysb run to another time with less activity, and the problem never happened again.)
Another idea that is being thrown around is a possible overrun of the heap memory. Since we have a 32-bit OS and applications on this system, are there any known limits for this type of memory? Please advise.
I have seen the "too big" error come out when accessing a large number of files, say over 2000. How many files are you trying to process? Also, do you have a sample of the scripts that you are trying to run? What else is going on on the server at the time of the errors?
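If the errors do coincide with globs over thousands of files, one workaround worth trying is to stream the file names through xargs instead of expanding them all onto one command line, so each grep/wc/ls invocation stays well under the exec() argument limit that ncargs controls. A sketch (the directory and the ERROR pattern are hypothetical stand-ins for whatever the real scripts scan):

```shell
#!/bin/sh
# Instead of:   grep -l ERROR /some/dir/*.log
# (one command line holding every file name), stream the names from
# find and let xargs split them into batches of a safe size:
scan_logs() {
    find "$1" -name '*.log' -print | xargs -n 200 grep -l "$2"
}

# usage: scan_logs /some/dir ERROR
```

Note that plain -print breaks on file names containing whitespace; the log files here appear to use simple names, so it should be safe.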