Fork: Resource temporarily unavailable

Hi friends,

Working on a Linux x86-64 system, I suddenly started getting this error (the one in the subject) from various scripts.

I googled and found that there are a couple of reasons that can cause this issue.

  • Low memory
    I am pretty sure this is not the cause; memory seems to be stable on my system, and at the time of facing the issue I still have a good amount of free RAM.
 free -m
             total       used       free     shared    buffers     cached
Mem:         64445      25898      38546          0        192       3389
-/+ buffers/cache:      22316      42128
Swap:        66431      24824      41607
  • Crossed the maximum number of processes limit
    ps gives me only 600 processes.
ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 532480
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 532480
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
  • Crossed the maximum PID limit.
cat /proc/sys/kernel/pid_max
32768

For the last point, I have observed the PID reaching 32762, but definitely not going beyond that.

I am unable to reach a conclusion.

Moreover, I know that even if the PID counter reaches the maximum, the kernel wraps it back around to 300 or so!

I am clueless. I need your help.

If the issue is reproducible, try to strace the process to get more details.
Some limits might have been altered; check /proc/<pid>/limits for the ones currently in effect.
Also check for additional limits in /etc/security/limits.conf.
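For example, something along these lines should give a useful trace (the script name is just a placeholder):

# hypothetical example: trace one of the failing scripts, following child processes
strace -f -o /tmp/failing_script.trace ./your_failing_script.sh
grep EAGAIN /tmp/failing_script.trace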

1 Like

Thanks.

It doesn't always occur, though.
I have a bunch of scripts scheduled in cron (running very frequently, say every 1 or 2 minutes).
I tried strace with just "echo" and saw a huge amount of text.

Would it be useful if I schedule strace with one of my real scripts and then look for the failing execution afterwards?

limits.conf contains only some Oracle-specific limits. If you want, I will post them.
I didn't understand the <pid> in "/proc/<pid>/limits". What is meant by that? The actual PID?

Just try running ulimit -a from inside the scripts (as opposed to the command line),
to see if you get the expected values.
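For instance, you could temporarily add a line like this near the top of one of the cron scripts (the log path is just an example):

# temporary debugging line: record the limits this cron job actually runs with
ulimit -a > /tmp/cron_limits.$(date +%s).log 2>&1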

If sar is scheduled, check the memory usage history, try:

sar -f /var/log/sa/saN -r

where N is the day of the month; for yesterday (2013/05/06 here, GMT+1),
the command would be:

sar -f /var/log/sa/sa06 -r 

You will have /proc/<pid>/limits available while the process is executing; <pid> is the PID of the running process.
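For example (the pgrep pattern is only illustrative):

# limits of the current shell
cat /proc/$$/limits
# limits of one of your running scripts, looked up by name
cat /proc/$(pgrep -f your_script.sh | head -1)/limits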

1 Like

Ok. I will try that.

Regarding sar, the memory usage seems to be fine.

02:00:01 PM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
.
.
.
06:20:01 PM  39475812  26516060     40.18    198364   3490292  42605940  25420420     37.37         0
06:30:01 PM  39464752  26527120     40.20    200396   3534608  42605940  25420420     37.37         0
06:40:01 PM  39290188  26701684     40.46    202580   3570116  42605940  25420420     37.37         0
06:50:01 PM  39200236  26791636     40.60    206812   3622196  42605940  25420420     37.37         0
07:00:01 PM  39382548  26609324     40.32    208804   3661100  42605940  25420420     37.37         0
07:10:01 PM  39209884  26781988     40.58    211940   3746460  42605940  25420420     37.37         0
07:20:01 PM  39080412  26911460     40.78    215128   3785572  42605940  25420420     37.37         0
07:30:01 PM  39097160  26894712     40.75    216760   3825096  42605940  25420420     37.37         0
07:40:01 PM  38982708  27009164     40.93    218364   3860988  42605940  25420420     37.37         0
07:50:01 PM  38970000  27021872     40.95    220144   3899924  42605940  25420420     37.37         0
Average:     17866377  48125495     72.93   1949824  22634065  42605940  25420420     37.37         0

I will try the ulimit suggestions and post updates.

Thanks.

Anything in /var/log/messages ?
Also look for threads with

ps -efL
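For a quick total count (rather than scanning the full listing), something like this should do:

# total number of tasks (processes plus threads) currently on the system
ps -eL --no-headers | wc -l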
1 Like

Hi,

I was able to trace at least one execution that hit the error.
Here is the strace snippet.

rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGINT, {0x436f40, [], SA_RESTORER, 0x38f80302d0}, {SIG_DFL, [], SA_RESTORER, 0x38f80302d0}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigaction(SIGINT, {SIG_DFL, [], SA_RESTORER, 0x38f80302d0}, {0x436f40, [], SA_RESTORER, 0x38f80302d0}, 8) = 0
pipe([3, 4])                            = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, [INT CHLD], [], 8) = 0
clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x2b013b0fafe0) = -1 EAGAIN (Resource temporarily unavailable)
fstat(2, {st_mode=S_IFREG|0644, st_size=8116374, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b013e6d2000
open("/usr/share/locale/locale.alias", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=2528, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2b013e6d3000
read(5, "# Locale name alias data base.\n#"..., 4096) = 2528
read(5, "", 4096)                       = 0
close(5)                                = 0
munmap(0x2b013e6d3000, 4096)            = 0
open("/usr/share/locale/en_US.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en_US/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.UTF-8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en.utf8/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/usr/share/locale/en/LC_MESSAGES/libc.mo", O_RDONLY) = -1 ENOENT (No such file or directory)

I ran man fork and found the following.

       EAGAIN fork() cannot allocate sufficient memory to copy the parent's page tables and allocate a task structure for the child.

       EAGAIN It was not possible to create a new process because the caller's RLIMIT_NPROC resource limit was encountered.  To exceed this limit,  the
              process must have either the CAP_SYS_ADMIN or the CAP_SYS_RESOURCE capability.
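If it is the second case (RLIMIT_NPROC), I suppose I would need to compare the per-user task count (which on Linux includes threads) against that limit at the moment of a failure, something like:

# per-user task (process + thread) counts, highest first
ps -eLo user= | sort | uniq -c | sort -rn | head
# the per-user limit currently in effect
ulimit -u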

Any clue?

Try adding the following at the beginning of your scripts:

ulimit -n 65535
ulimit -u 532480

The values are from the ulimit -a output that you reported.
This is just to make sure that your scripts run with the correct settings when invoked by crond.
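A minimal sketch of such a script header (the values are just the ones from your ulimit -a output) might be:

#!/bin/bash
# raise the limits explicitly, in case crond starts the script with lower ones
ulimit -n 65535 || echo "could not set open-files limit" >&2
ulimit -u 532480 || echo "could not set max-user-processes limit" >&2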

OK, but I think that is not the issue (since I am even getting this from the command line).

The frequency of the issue seems to be increasing.
I am now getting the error even from the command line (ls, grep, etc.).
On average, 1 out of 10 commands fails on the command line. If I retry immediately, it succeeds within 2-3 attempts.

Additionally, I am getting this issue for all users (not just root).

You're hitting a resource limit. How many processes are currently running?
Please post the output from:

cat /proc/loadavg
cat /proc/sys/kernel/threads-max
cat /proc/sys/vm/max_map_count
1 Like

Here is the output.

# cat /proc/loadavg
14.84 64.55 195.84 6/9452 13798

# cat /proc/sys/kernel/threads-max
1064960

#  cat /proc/sys/vm/max_map_count
65536

#

I'm out of ideas ...
I suppose that when you get those errors, you're hitting the max number of running processes (entities) limit,
even though the output above doesn't show that,
or at least not at the moment you executed the cat commands.

It would be easier to check this by setting the resource limits in your scripts
explicitly (as shown above with ulimit -n / -u).
If you never exceed those values, the scripts should run fine.

You said that sometimes you're getting the same errors while executing commands
in the shell: make sure that the same limits are currently set.

1 Like

This sounds like you might have a process that is starting children, but not reaping their status when they die. When they die, they will release the memory they were using, but you won't be able to start more processes running with the same user ID until the number of processes being run by that user falls below the system limit for that user. And the number of processes being run includes those unwaited for zombies.
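A quick way to check for that is to list zombie (Z state) entries together with their parents, e.g.:

# zombie processes with their parent PIDs and owners
ps -eo stat,pid,ppid,user,comm | awk '$1 ~ /^Z/'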

1 Like

Again, you must run

ps -eL

or

ps -efL

to see the threads!
Linux uses clone, not fork, and the ulimit -u value is effectively a thread limit, even if the man page and other descriptions say processes.
I have seen bad Java applications cloning thousands of threads in a loop.
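Counting threads per PID usually shows the offenders straight away; a sketch:

# number of threads per process, biggest offenders first (count, then PID)
ps -eLo pid= | sort | uniq -c | sort -rn | head

The PIDs at the top with thousands of entries are the ones to look at.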

1 Like

I tried ps -efL when you suggested it previously. I believe threads and processes are different from each other?
I saw a huge count when I did ps -efL. Most of them had the same PID (true by definition for threads). I am confused about whether the system process limit applies to threads as well?

Yes, we have lots of Java applications running on the systems. When we restarted those applications, the system became stable for some time (we got comparatively fewer fork errors).

Does that mean those Java applications are the culprit?

I did see some zombie/defunct processes, but there were not many of them.

Rule of thumb: if there are more than 100 threads per process, then the Java process is badly programmed - talk to the application owner/vendor.
If there are more than 1000 threads per process, then the Java process is definitely wrong.
I suggest putting a "ulimit -S -u 3000" soft limit into each Java start script.
This is a per-user limit. Therefore, it makes sense to start your Java apps as different users.
Sometimes it also helps to limit the file handles with "ulimit -S -n 1024", the default on most Linux systems. (Some application vendors suggest tuning this up, with a negative effect.)
Last but not least, I think (though I have no evidence yet) that kernel.pid_max not only limits processes but also application threads.
Tune it up with "sysctl kernel.pid_max=99999", and also add this to /etc/sysctl.conf.
99999, i.e. 5 digits, is safe with processes like xfs that have buggy "pidfile" handling. BTW, an immediate tune-up is safe; a tune-down should happen only in /etc/sysctl.conf, to be applied at the next system reboot.
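As a sketch, the start-script change and the sysctl change could look like this (paths and values as discussed above):

# at the top of each Java start script: per-user soft limits
ulimit -S -u 3000
ulimit -S -n 1024

# raise the PID space immediately ...
sysctl -w kernel.pid_max=99999
# ... and make it persistent across reboots
echo "kernel.pid_max = 99999" >> /etc/sysctl.conf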

1 Like