cron/logrotate chicken-and-egg problem

I have run into a problem on about a dozen machines, all running the same x86_64 2.6.12 GNU/Linux. For some reason these machines fill up their /var partition (10G) because their logs never get rotated... Unfortunately, there is no error message from logrotate (it would be in /var/log/messages), and the last time logrotate ran (according to /var/log/logrotate.log) was August 11, 2008.
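
For what it's worth, logrotate can be driven by hand to rule out a broken config; a sketch using logrotate's standard flags:

# Dry-run the config with verbose output, then force an actual rotation
logrotate -dv /etc/logrotate.conf   # -d: debug/dry-run, makes no changes
logrotate -fv /etc/logrotate.conf   # -f: force rotation even if not yet due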

This is a puzzling problem, and it increasingly feels (however unlikely) as though it is actually a problem with cron, not logrotate; i.e., cron failed first, which is why logrotate never ran. I had originally thought that logrotate simply couldn't do its job because there was no space available for the rotation to occur. But even when I freed up enough space, cron never ran the logrotate script. On top of that, I added my own "append timestamp to file" script to cron.hourly, and it never got run either. As a test, on one of the servers, I restarted cron... BINGO, that fixed the whole thing.
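
For reference, a minimal sketch of that kind of canary script (the actual script isn't quoted in this thread; per the follow-up below, it wrote to /tmp, i.e. off the /var filesystem):

#!/bin/sh
# Hypothetical cron.hourly canary: append a timestamp to a file on a
# filesystem other than /var, so a full /var can't mask the result.
# run-parts only executes files that are executable, so chmod +x it.
date >> /tmp/cron-canary.log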

Being able to fix it by hand isn't the point. I want to prevent this from ever happening in the field. In my lab it's one thing, but in the field, there go all my logs for debugging.

Here is a sanity check I ran (just proving that cron was functioning and then stopped functioning, all within the same uptime):

[root@babylon5 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda2             9.5G  3.7G  5.3G  42% /
/dev/sda1             190M   18M  163M  10% /boot
/dev/shm              2.0G  4.0K  2.0G   1% /dev/shm
/dev/sda6              44G  8.4G   33G  21% /stuff
/dev/sda5             9.5G  9.5G     0 100% /var

[root@babylon5 ~]# ls -lut /etc/init.d/crond
-rwxr-xr-x  1 root root 1904 Aug 11 15:14 /etc/init.d/crond

[root@babylon5 ~]# ls -lut /etc/crontab
-rw-r--r--  1 root root 255 Jul 15 10:19 /etc/crontab

[root@babylon5 ~]# grep "Aug 11 15" /var/log/cron
Aug 11 15:01:02 babylon5 crond[9429]: (root) CMD (run-parts /etc/cron.hourly)

[root@babylon5 ~]# ps auxwww|grep cr[o]n
root      3014  0.0  0.0  60404  1296 ?        Ss   Jul15   0:00 crond

[root@babylon5 ~]# uptime
 13:33:34 up 57 days,  3:15,  1 user,  load average: 15.63, 15.37, 15.29

[root@babylon5 ~]# grep "STARTUP" /var/log/cron*
/var/log/cron.4:Jul 14 14:16:44 babylon5 crond[3156]: (CRON) STARTUP (V5.0)
/var/log/cron.4:Jul 15 08:53:45 babylon5 crond[3008]: (CRON) STARTUP (V5.0)
/var/log/cron.4:Jul 15 10:19:35 babylon5 crond[3014]: (CRON) STARTUP (V5.0)

[root@babylon5 ~]# tail -n 1 /var/log/cron.4
Jul 20 00:01:01 babylon5 crond[15555]: (root) CMD (run-parts /etc/cron.hourly)

[root@babylon5 ~]# tail -n 1 /var/log/cron.3
Jul 27 01:01:02 babylon5 crond[27941]: (root) CMD (run-parts /etc/cron.hourly)

[root@babylon5 ~]# tail -n 1 /var/log/cron.2
Aug  3 01:01:01 babylon5 crond[17755]: (root) CMD (run-parts /etc/cron.hourly)

[root@babylon5 ~]# tail -n 1 /var/log/cron.1
Aug 10 01:01:02 babylon5 crond[29697]: (root) CMD (run-parts /etc/cron.hourly)

[root@babylon5 ~]# tail -n 1 /var/log/cron
Sep 10 13:01:01 babylon5 crond[22186]: (root) MAIL (mailed 146 bytes of output but got status 0x007f )

[root@babylon5 ~]# ll /var/log/cron*
-rw-r--r--  1 root root 138502 Sep 10 13:01 /var/log/cron
-rw-------  1 root root  14407 Aug 10 01:01 /var/log/cron.1
-rw-------  1 root root  14581 Aug  3 01:01 /var/log/cron.2
-rw-------  1 root root  14483 Jul 27 01:01 /var/log/cron.3
-rw-------  1 root root  14513 Jul 20 00:01 /var/log/cron.4

Any help?

Can you be sure that the jobs aren't being run if you are relying on logs stored in /var to check? If /var is 100% full, it's quite possible that the jobs *are* running, but simply can't add a log entry to say so. Did your "append timestamp to file" test script use a file on another filesystem?

Having said that, I think it's quite likely that cron would stop working once /var becomes 100% full... there must be some pretty big logs being generated somewhere to fill a 9.5GB /var?
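
If it helps, something like this rough sketch will show what is actually eating the space:

# Per-directory usage in MB, largest last; -x stays on /var's filesystem
du -xm /var | sort -n | tail -n 20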

The "append timestamp to file" test used /tmp... which is on /dev/sda2.

If you look at the output from "ll /var/log/cron*" you will see that messages continued to be written to /var/log/cron after August 10. The last line of the cron.1 file confirms that August 10 was when the last rotation happened. Given that the cron log rotates on a consistent weekly schedule, one would expect the next rotation to occur on August 17... but it never did. Also, since a week's worth of cron log is only ~14k, the rotation on the 17th cannot have failed because the filesystem was full... logrotate simply never ran.

According to /var/log/logrotate.log, the last time that logrotate ran was:
1218492062, which translates to August 11th, 2008, 10:01:02 PM.
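
(GNU date can do that conversion; the output below assumes UTC:)

# Convert logrotate's epoch timestamp to a human-readable date
date -u -d @1218492062
# -> Mon Aug 11 22:01:02 UTC 2008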

As for the logs being generated, yeah, they are big, but it's normal for my app with debug on max...

Have you established whether cron stops running all jobs when it's in this state? Just the cron.hourly? Or just jobs for a specific user (i.e. root in this case)? Also, are there any other cron jobs stuck at the time?

Have you tried strace-ing a crond that's stuck like this? Also, does a pkill -HUP crond "wake" it up again?

It might also be worth running cron with some debugging options, although they don't seem to be very well documented. The best reference (for Vixie cron) I've found so far is:

Mac OS X Manual Page For cron(8)
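
Per that man page, Vixie cron's debug tracing is enabled with -x, though only if the binary was compiled with debugging support, so treat this as a sketch:

# Trace the scheduler ('sch') and job execution ('proc'); other -x keywords
# include ext, pars, load, misc, test and bit
/usr/sbin/crond -x sch,proc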

I tried to use GDB but it would appear that the symbols are stripped from the crond binary, as the backtrace only provided the following:

(gdb) bt
#0  0x00002aaaaae6d5a0 in ?? ()
#1  0x00002aaaaae6d3f4 in ?? ()
#2  0x0000555555557772 in ?? () from /usr/sbin/crond
#3  0x0000000000000000 in ?? ()
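
For reference, attaching to the running daemon for a one-shot backtrace looks something like this (a sketch; pgrep -o picks the oldest matching process, i.e. the daemon rather than any child):

# Attach gdb non-interactively, print a backtrace, then detach on exit
# (with a stripped binary, expect "??" frames like the ones above)
gdb -batch -ex bt -p "$(pgrep -o crond)"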

I HUPed the cron daemon and waited to see if the test script ran, but unfortunately no... even though it theoretically writes somewhere other than the /var partition (namely, /tmp).

As for strace, I was unaware that I could attach strace to a process that is already running... if that is so, can you reference an example? I will gladly kick it off as a test.

Just a silly question:
Are you sure you have inodes available on /var?
('df -i', depending on your O/S)

If your inode table is full, you may have space on the disk but you can't make a new file to rotate the data into.
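
For example, on Linux:

# A full inode table shows IUse% at 100% even when 'df -h' reports free space
df -i /var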

strace -f -p <pid> should do the trick.
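
For example, a sketch using PID 3014 (crond's PID from the ps output earlier in the thread):

# Attach to the already-running crond; -f follows the children it forks
# (the jobs themselves), -o saves the trace somewhere outside /var
strace -f -o /tmp/crond.trace -p 3014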

The results are in...
I did a comparative strace between a fully functioning system and one exhibiting the cron failure. The basics are the same...

A polling loop for config changes (repeated every minute):

stat("cron", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
stat("/etc/cron.d", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/etc/crontab", {st_mode=S_IFREG|0644, st_size=255, ...}) = 0
rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
rt_sigaction(SIGCHLD, NULL, {0x555555557772, [], SA_RESTORER|SA_RESTART, 0x2aaaaae0a460}, 8) = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
nanosleep({60, 0}, {60, 0})             = 0

After some reads of /etc/localtime (for logging) and a message send, the first /var interaction... connecting to a socket:

[pid 24735] socket(PF_FILE, SOCK_STREAM, 0) = 3
[pid 24735] fcntl(3, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 24735] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 24735] connect(3, {sa_family=AF_FILE, path="/var/run/nscd/socket"}, 110) = -1 ENOENT (No such file or directory)
[pid 24735] close(3)                    = 0

Blah, blah, blah, some different libraries, a look at resolv.conf, and finally, the moment we have all been waiting for:

[pid 24735] socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
[pid 24735] connect(3, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.20.30.40")}, 28) = 0
[pid 24735] fcntl(3, F_GETFL)           = 0x2 (flags O_RDWR)
[pid 24735] fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
[pid 24735] poll([{fd=3, events=POLLOUT, revents=POLLOUT}], 1, 0) = 1
[pid 24735] sendto(3, "\377\276\1\0\0\1\0\0\0\0\0\0\7babylon5\7skynet\5loc"..., 39, MSG_NOSIGNAL, NULL, 0) = 39
[pid 24735] poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, 5000) = 1
[pid 24735] ioctl(3, FIONREAD, [111])   = 0
[pid 24735] recvfrom(3, "\377\276\205\203\0\1\0\0\0\1\0\0\7babylon5\7skynet\5lo"..., 1024, 0, {sa_family=AF_INET, \
sin_port=htons(53), sin_addr=inet_addr("10.20.30.40")}, [16]) = 111
[pid 24735] writev(2, [{"crond", 5}, {": ", 2}, {"relocation error", 16}, {": ", 2}, {"/lib64/libresolv.so.2", 21}, {": ", 2}, \
{"symbol __res_iclose, version GLI"..., 97}, {"", 0}, {"", 0}, {"\n", 1}], 10 <unfinished ...>
[pid 24734] <... read resumed> "crond: relocation error: /lib64/"..., 4096) = 146
[pid 24735] <... writev resumed> )      = 146
[pid 24734] uname( <unfinished ...>
[pid 24735] exit_group(127)             = ?
Process 24735 detached

It does this once more, and then exits (on an Interrupted System Call) to resume the nanosleep loop...

Now, almost everything (except the error) is identical between the good and bad straces. The return value of the last recvfrom() is 91 on the good server and 111 on the bad; likewise, sendto() returns 43 on the good server and 39 on the bad...

I presume PID 24735 is one of the jobs being run out of cron and 24734 is cron itself? Is 10.20.30.40 an IP address you recognise? If you search back through the strace output you should see an exec...() = 24735 which will contain the command line used to execute that process.
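
If the trace was saved to a file (as in the earlier strace sketch, which wrote to a hypothetical /tmp/crond.trace), grep will find where the child was spawned and what it executed:

# Find the fork/clone that created the failing child, and its execve
grep -n 24735 /tmp/crond.trace | head
grep execve /tmp/crond.trace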

Anyway, the values returned by recvfrom() are just the numbers of bytes received on that socket, so they are likely to vary.

More interesting is the writev(2, ...), which looks like an error message being sent to stderr. Relocation errors sound like incompatible or missing library issues; can you figure out where those messages are going? I refer back to my prior suggestion to run cron manually and/or with some debugging options, where available, to obtain more information.
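
One quick way to answer that (a sketch; substitute crond's real PID for 3014) is to look at where the daemon's stderr file descriptor points:

# fd 2 of a process is its stderr; the /proc symlink shows its target
ls -l /proc/3014/fd/2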