What is the cause of file truncation?

Hi,

I have a program that is called from the front end of my application. It creates some temporary files, uses them, and deletes them at the end. But sometimes, say once in six runs, some of these temporary files end up truncated in the middle, and because of this my program behaves erratically. My application runs on AIX.

I am not sure -
1) whether some other process is truncating the files, or
2) whether my program itself is writing the files incompletely.

If I restart the same operation, it proceeds correctly. This truncation of files happens only some of the time, say once in six runs.

I want to monitor these temporary files from creation to deletion - which processes are writing to them, using them, truncating them, and so on.

Can you please tell me if there is a way to do this? Or is there a better way to approach this problem?

            Thanks,
            Venkat.

I am not sure if this would help.

Did you try strace? Read the man pages. It logs all the system calls made by a process. strace usually traces a whole application; in your case that would just be the one program.

vino

Make sure you call fflush() after every write to your temp files.

This sounds like a program design issue more than a problem with the filesystem.

No, I don't think it can be. It works fine in other environments; the problem shows up only on my system/environment. Moreover, it already calls fflush() after every write.

Are you checking return codes on ALL your file calls?
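For illustration, here is a minimal sketch of what "checking every return code" means with stdio (the path and message below are made up, not from the original program). fopen, fwrite, fflush and fclose can each fail, and fclose is often the call that finally reports a full filesystem:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
        const char msg[] = "some temp data\n";
        FILE *fp = fopen("/tmp/example_tmpfile", "w");   /* hypothetical path */

        if (fp == NULL) {
                perror("fopen");
                exit(1);
        }
        /* fwrite reports how many items it actually wrote */
        if (fwrite(msg, 1, strlen(msg), fp) != strlen(msg)) {
                perror("fwrite");
                exit(1);
        }
        /* fflush pushes the stdio buffer to the kernel and can fail here */
        if (fflush(fp) == EOF) {
                perror("fflush");
                exit(1);
        }
        /* fclose can still fail, e.g. on a full or NFS-mounted filesystem */
        if (fclose(fp) == EOF) {
                perror("fclose");
                exit(1);
        }
        return 0;
}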

If you are working on a busy disk where apps create a lot of temp files (like /var/tmp), it is possible for write() not to complete successfully because of transient disk-full errors. Since this only happens once in a while, that may well be what is going on here.

Also consider defining TMPDIR to point to a filesystem with lots of free space or with low disk contention.

If you don't check return codes, the program runs merrily on regardless of disk free space. I've seen your problem, exactly as you describe it, under those circumstances.
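If the code can be touched at all, one way to make the TMPDIR suggestion work is to honour it when creating the temp files. A minimal sketch under that assumption, using the standard getenv() and mkstemp() calls (the "myapp_XXXXXX" name is just an example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
        /* Fall back to /tmp when TMPDIR is not set */
        const char *dir = getenv("TMPDIR");
        char path[1024];
        int fd;

        if (dir == NULL || *dir == '\0')
                dir = "/tmp";

        /* mkstemp replaces the XXXXXX with a unique suffix and opens the file */
        snprintf(path, sizeof(path), "%s/myapp_XXXXXX", dir);
        fd = mkstemp(path);
        if (fd == -1) {
                perror("mkstemp");
                exit(1);
        }
        printf("temp file: %s\n", path);

        /* ... use fd, checking the return code of every write() ... */

        close(fd);
        unlink(path);
        return 0;
}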

I have observed this on one of our systems too. I tried to simulate this using the following programs:

fop.c - uses fopen and fwrite

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

int main() {
        FILE *fp;
        char str[] = "test";
        int ret;

        fp = fopen("/mount_pt/testfile", "w");
        if (fp == NULL) {
                fprintf(stdout, "errno: %d\n", errno);
                exit(-1);
        }
        ret = fwrite(str, 1, strlen(str), fp);
        fprintf(stdout, "ret of write: %d\n", ret);
        if (ret == 0) {
                fprintf(stdout, "could not write! errno: %d\n", errno);
                exit(-1);
        }
        fclose(fp);
        return 0;
}

op.c - uses open and write

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <errno.h>

int main() {
        int fd;
        char str[] = "test";
        int ret;

        fd = open("/mount_pt/testfile", O_CREAT | O_RDWR, 0664);
        if (fd == -1) {
                fprintf(stdout, "errno: %d\n", errno);
                exit(-1);
        }
        ret = write(fd, str, strlen(str));
        fprintf(stdout, "ret of write: %d\n", ret);
        if (ret == -1) {
                fprintf(stdout, "could not write! errno: %d\n", errno);
                exit(-1);
        }
        close(fd);
        return 0;
}

I simulated a full filesystem by creating a 4MB filesystem and filling it up, then ran the op.c and fop.c programs against it. op.c gives an error when trying to write. However, fop.c goes through "successfully" - fwrite even returns the expected value, but all that gets created is a 0-byte file.

This may have something to do with the buffering done by fwrite - it causes fwrite to return success even though the underlying write fails.
But this does not really sound right... could anyone shed light on this?

blowtorch, your problem is due to buffering, as you suspect. You could use setvbuf() to make the stream unbuffered. Or you could check the return code from fclose(), which will detect the problem. Ideally, you check the return code from close() as well... although no one ever does. With an NFS-mounted filesystem, close() could be the syscall that detects a full filesystem.
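For what it's worth, a minimal sketch of the setvbuf() approach, reusing the same test path as the examples above. With buffering disabled, each fwrite() goes straight to the kernel, so on a full filesystem the error should show up at the fwrite rather than being deferred to fflush()/fclose():

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
        const char str[] = "test";
        FILE *fp = fopen("/mount_pt/testfile", "w");

        if (fp == NULL) {
                perror("fopen");
                exit(1);
        }
        /* _IONBF = no stdio buffering; must be set before the first I/O on fp */
        setvbuf(fp, NULL, _IONBF, 0);

        if (fwrite(str, 1, strlen(str), fp) != strlen(str)) {
                perror("fwrite");       /* should now fail right here when the disk is full */
                exit(1);
        }
        if (fclose(fp) == EOF) {        /* still worth checking */
                perror("fclose");
                exit(1);
        }
        return 0;
}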

Thanks Perderabo! Though the code is not mine to modify - and it is next to impossible to get the application team to change their code. :rolleyes: So we gave them a different tmpdir and tightened up filesystem monitoring on our end.

You need to understand something about the f* (stdio) calls. They make kernel read/write requests only at certain intervals, when their internal buffer fills. The calls that force the buffered data out are fflush() and fclose(). And when a failure does occur, they don't do a thing except return an error value (-1/EOF) - it is up to you to check for it.

You can also call write() without checking the return code and have your code continue as though everything went okay, when in fact it did not. Similarly, read() will bite you on a busy system (short reads, interrupted calls) unless you wrap it like this:

#include <unistd.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>

/* Keep calling read() until nbyte bytes have arrived, EOF is hit,
   or a real error (other than EINTR) occurs. */
ssize_t readall(int fd, void *buf, size_t nbyte)
{
        ssize_t nread = 0, n = 0;

        do {
                n = read(fd, &((char *)buf)[nread], nbyte - nread);
                if (n == -1) {
                        if (errno == EINTR)     /* interrupted by a signal - retry */
                                continue;
                        else
                                return -1;
                }
                if (n == 0)                     /* EOF - return what we have */
                        return nread;
                nread += n;
        } while (nread < (ssize_t)nbyte);
        return nread;
}

It's a known deal: you must stress-test a production app that does a lot of I/O. Most apps fail in high-I/O or low-free-disk-space situations, or when a lot of preemption is going on (process context switching) - i.e., a super-busy CPU.

I know all this because 25 years ago, all systems were overloaded by definition. :smiley:
One user was too many.

PS: you can construct a similar writeall() function for the same reason: checking for EINTR.
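A minimal sketch of such a writeall(), along the same lines as readall() above - it retries on EINTR and keeps going after short writes, while a genuine error (for example ENOSPC on a full filesystem) is still reported as -1:

#include <unistd.h>
#include <errno.h>

/* Keep calling write() until nbyte bytes have gone out or a real
   error (other than EINTR) occurs. */
ssize_t writeall(int fd, const void *buf, size_t nbyte)
{
        ssize_t nwritten = 0, n = 0;

        while ((size_t)nwritten < nbyte) {
                n = write(fd, &((const char *)buf)[nwritten], nbyte - nwritten);
                if (n == -1) {
                        if (errno == EINTR)     /* interrupted by a signal - retry */
                                continue;
                        return -1;              /* real error, e.g. ENOSPC */
                }
                nwritten += n;                  /* short write - keep going */
        }
        return nwritten;
}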