readdir and dynamic array memory corruption

Hi everyone,

I am developing a utility. In one part of it I read directory entries into a dynamic array, struct list.
It stores pointers to items in list.entries;
each item is a struct entry.
If the number of files in a directory is greater than the number of elements the array was initially allocated for,
I reallocate memory for the array:
if (c > list_size - 1)

At this point something strange happens.
The pointers are correct, and I can successfully allocate memory for new items:
list.entries[c] = malloc(sizeof(struct entry));
but list.entries[0]->pde->d_name gets corrupted at some iteration, and always after list.entries has been reallocated.

See the test code provided.

If I do not realloc list.entries, everything goes fine.
I played with list_size values:
on Mac OS X (10.4.0 Darwin Kernel Version 10.4.0), memory is corrupted while list.entries[124] is processed;
on Ubuntu Linux (2.6.24-23-xen #1 SMP Wed Apr 1 23:47:10 UTC 2009 x86_64 GNU/Linux), while list.entries[196] is processed.
And if, instead of storing the pointer readdir returns, I explicitly allocate
list.entries[c]->pde = malloc(sizeof(struct dirent));
memory is not corrupted.

What is wrong?
Where is the problem?
What is the best way to read directory entries into a dynamic array?

#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <math.h>
#include <pwd.h>
#include <unistd.h>

int
main (void)
{
    int c, ac;
    unsigned int list_size = 10;
    
    DIR *pdir;
    struct dirent *pde;
    
    
    struct entry {
        struct stat st;
        struct dirent *pde;
    };
    
    struct list {
        int count;
        struct entry **entries;
    } list;
    
    if ((pdir = opendir("/usr/bin")) == NULL) {
        perror("unable to open directory");
        return 1;
    }
    
    if ((list.entries = malloc(list_size * sizeof(struct entry *))) == NULL)
        return 1;
    
    c = 0;
    ac = 0; /* allocation counter */
    while ((pde = readdir(pdir)) != NULL) {
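        /* grow the pointer array before writing past its end */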
        if (c > list_size - 1) {
            list_size <<= 1;
            if ((list.entries = realloc(list.entries,
                                    list_size * sizeof(struct entry*))) == NULL) {
                perror("unable to realloc");
                return 1;
            }
            ac++;
        }

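        /* store the new item and the pointer readdir() handed back */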
        if ((list.entries[c] = malloc(sizeof(struct entry))) == NULL)
            return 1;
        list.entries[c]->pde = pde;

        if (strcmp(list.entries[0]->pde->d_name, ".") != 0) {
            printf("memory corruption. size of array %d items\n", c);
            printf("number of reallocations %d\n", ac);
            return 1;
        }
        
        printf("base address: %p pointer address: %p entry pointer value: %p name: %s\n",
               list.entries, &list.entries[c], list.entries[c], list.entries[c]->pde->d_name);
        c++;
    }
    
    closedir(pdir);
    return 0;
}

readdir() returns the same pointer every time, so every one of your entries ends up pointing at whatever your last readdir() call filled in.

Store the structure's data, not a pointer to data.

struct entry {
    struct stat st;
    struct dirent de;
};

...

list.entries[c]->de = *pde;

...
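Put together, the fixed loop would look something like this (a minimal sketch, keeping your names, the /usr/bin path, and the doubling growth, with error checks added):

#include <sys/stat.h>
#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>

struct entry {
    struct stat st;
    struct dirent de;   /* a copy, so it survives the next readdir() */
};

int
main (void)
{
    unsigned int c, i, list_size = 10;
    struct entry **entries, **tmp;
    struct dirent *pde;
    DIR *pdir;

    if ((pdir = opendir("/usr/bin")) == NULL)
        return 1;
    if ((entries = malloc(list_size * sizeof(struct entry *))) == NULL)
        return 1;

    c = 0;
    while ((pde = readdir(pdir)) != NULL) {
        if (c >= list_size) {
            list_size *= 2;
            if ((tmp = realloc(entries, list_size * sizeof(struct entry *))) == NULL)
                return 1;
            entries = tmp;
        }
        if ((entries[c] = malloc(sizeof(struct entry))) == NULL)
            return 1;
        entries[c]->de = *pde;  /* copy the data readdir points at */
        c++;
    }
    closedir(pdir);

    for (i = 0; i < c; i++)
        printf("%s\n", entries[i]->de.d_name);

    return 0;
}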

---------- Post updated at 04:12 PM ---------- Previous update was at 03:59 PM ----------

if (c > list_size - 1) might not be doing what you expect. Since list_size is unsigned, the subtraction and the comparison are both done as unsigned. Try if (c > (list_size - 1)) to rule out a precedence issue.

Hm,
if readdir returned the SAME pointer every time,
every list.entries[c]->pde would have the SAME value.
BUT they don't.

Of course, what you propose solves the problem, thank you.

BUT WHY, and by WHOM, is heap memory corrupted after list.entries has been reallocated?

If you read the man pages for readdir you will see that the pointer returned is not guaranteed to be valid beyond the next call. It could be to a location in memory that gets reused, or to a location that is freed, or whatever the implementors feel like.

Essentially, since you are storing the result of readdir in a location that you will use after the next call, you are breaking that rule. So either use readdir_r, or copy the result of readdir into a "struct dirent" before making the next call (and hope there's no other thread in the application using readdir concurrently with yours...).

If you want to continue using "readdir", you'd change:

struct entry {
    struct stat st;
    struct dirent pde;
};

Notice pde is no longer a "struct dirent *".

Then change this line "list.entries[c]->pde = pde;" to this "list.entries[c]->pde = *pde;".

If you want to use readdir_r then you'll need to restructure the while loop a little. I'll leave that to you. If you know this isn't a multi-threaded program, then IMO there's no reason to use readdir_r other than being safer for the future if your code gets thrown into a multi-threaded environment. Since it's not that much harder to use, I suppose you could just do it right now... but I'll leave it to you to decide.
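For reference, the readdir_r() loop would come out roughly like this (a minimal sketch; list_dir is a hypothetical helper name, and strictly speaking POSIX wants the entry buffer sized via fpathconf(), which I've skipped for brevity):

#include <dirent.h>
#include <stdio.h>

int list_dir(const char *path)
{
    DIR *pdir;
    struct dirent de, *result;

    if ((pdir = opendir(path)) == NULL)
        return 1;

    /* readdir_r() fills the caller-supplied buffer 'de'; it sets
       'result' to &de on success and to NULL at end of directory. */
    while (readdir_r(pdir, &de, &result) == 0 && result != NULL)
        printf("%s\n", de.d_name);

    closedir(pdir);
    return 0;
}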

if (c > (list_size - 1))
does not help. The result is the same.

UGH...way beaten to the punch, lol...that's what happens when I start typing a response, go into a meeting and come back to finish it up...haha...at least we're all telling you the same thing.

Ok. Thanks to everyone for your comments. :slight_smile:
I understand now, and I've gotten several good ideas.

It likely does return the SAME pointer every time (and if it doesn't, it's likely freeing the pointer it returned the previous time). What changes is the data within the region to which the pointer points, the data "inside" the pointer, if you will.

Because you're now complying with the contract imposed by readdir.

Even if you didn't realloc, you'd have the same problem. Nothing is "corrupting" anything after you realloc. It's the fact that, as we both said above, the next time you call readdir (which just happens to coincide with the realloc call) the data from the previous call is overwritten.
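You can see that for yourself with a tiny test (a sketch; /usr/bin is just a convenient directory with several entries):

#include <dirent.h>
#include <stdio.h>

int main(void)
{
    DIR *pdir = opendir("/usr/bin");
    struct dirent *a, *b;

    if (pdir == NULL)
        return 1;

    if ((a = readdir(pdir)) == NULL)
        return 1;
    printf("first call : %p  %s\n", (void *)a, a->d_name);
    if ((b = readdir(pdir)) == NULL)
        return 1;
    printf("second call: %p  %s\n", (void *)b, b->d_name);
    printf("a again    : %p  %s\n", (void *)a, a->d_name);
    /* On most implementations both calls print the same address,
       and 'a' now shows the SECOND entry's name. */

    closedir(pdir);
    return 0;
}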

Also... your incrementing the size with <<= 1 may be a little more... extreme than you'd like. I hope you know exactly why you're doing that: the array size you'll be allocating grows quite quickly. If there just so happen to be 257 entries in the directory, you'll jump your array to 512 slots to hold them... you may prefer to grow in additive steps rather than exponential ones, but that's just my humble opinion. Even better still, use a linked list and there will be no wastage.

Never more than twice as much memory as is currently used. That'll be sane enough unless there are frankly insane numbers of directory entries. Reducing the number of times you call malloc() for tiny things cuts down on memory fragmentation, and once the readdir() loop finishes he's free to shrink it, too.
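As a concrete illustration of the doubling strategy (a sketch; push_entry and its parameters are my names, not anything from the code above):

#include <stdlib.h>

struct entry;   /* opaque here; defined elsewhere */

/* Append an item, doubling the pointer array when it is full.
   Returns 0 on success, -1 on allocation failure. */
static int push_entry(struct entry ***entries, unsigned int *count,
                      unsigned int *cap, struct entry *e)
{
    if (*count == *cap) {
        unsigned int new_cap = *cap ? *cap * 2 : 8;
        struct entry **tmp = realloc(*entries, new_cap * sizeof **entries);
        if (tmp == NULL)
            return -1;  /* the old array is still intact */
        *entries = tmp;
        *cap = new_cap;
    }
    (*entries)[(*count)++] = e;
    return 0;
}

Because the capacity doubles, whatever copying realloc does is amortized over all the appends: n pushes cost O(n) total, where fixed additive steps cost O(n^2) in the worst case.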

Ever try calling qsort() on a linked list? :wink: I swear, the amount of time some programs take to sort things is simply obscene, and could've been avoided if they'd kept it in a numerically-addressable form.
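Since the entries sit in a plain array of pointers, sorting them by name really is a one-liner (a sketch; cmp_entries is a hypothetical comparator, assuming the struct entry with an embedded struct dirent from above):

#include <stdlib.h>
#include <string.h>
#include <dirent.h>
#include <sys/stat.h>

struct entry {
    struct stat st;
    struct dirent de;
};

/* qsort() hands us pointers to the array elements, i.e. to
   (struct entry *) values; compare the two names they hold. */
static int cmp_entries(const void *a, const void *b)
{
    const struct entry *const *ea = a;
    const struct entry *const *eb = b;
    return strcmp((*ea)->de.d_name, (*eb)->de.d_name);
}

/* usage: qsort(list.entries, c, sizeof(struct entry *), cmp_entries); */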

:))) I am totally with Corona688.
There is a good video on the STL, in my opinion, at channel9.msdn.com;
starting at the 17th minute he explains why multiplying is better than adding when growing an allocation.
Sorting with qsort is the next thing I am going to do with this array.

---------- Post updated at 02:51 AM ---------- Previous update was at 02:49 AM ----------

C9 Lectures: Stephan T. Lavavej - Standard Template Library (STL), 1 of n | Going Deep | Channel 9

I'm not convinced about the fragmentation "issue". It may speed things up because you call malloc fewer times, but you'd need to know how your malloc works to know whether fragmentation will be an issue. Since most malloc implementations use pools for various sizes anyway, I don't see calling realloc with smaller jumps as bad. Once you call it with a large enough amount to matter, you're in the large pool and either filling in gaps or pushing out the break point.

Though... I suppose on your side, calling malloc with large sizes typically does nothing anyway until the pages are touched... so calling malloc and asking for a gig (on most operating systems, anyway) isn't going to hurt unless you start actually putting data into the pages "given" to your process when it (inevitably) calls brk.

Either way... saying one is better than the other is a blanket statement. If you had a multi-threaded process that read ten million items in with ten threads, would you grow exponentially in each thread? If you did, you would probably regret it, because you'd likely hit the memory limits of a 32-bit process, your mallocs would start failing, and you'd have to back down to an additive method at some point.

Anyway, maybe I need to "think on it" some more, but for now I remain unconvinced.

P.S. I'd watch that video...but I don't have Silverlight installed on my Linux machine.

A bit of fragmentation doesn't hurt malloc() much. A small percentage of wasted space? Boo hoo. Unless it's truly horrific you'd never notice or care.

The cost of fragmentation for repeated realloc() is exponentially-growing wasted CPU time as your existing data leapfrogs through every available heap hole in order of size. I think that's worth minimizing.

If you knew the sizes of your heap pools, you could aim large enough to probably be at the end of a pool but small enough to not spill into the next one. We don't, but exponential growth is a decent approximation.

To a point, anyway. For ludicrous sizes I'll admit a cap might be good.

I think that kind of situation needs careful tuning no matter what allocation method you pick.

I might use large fixed-size blocks of memory anonymously mmap()-ed in, one to a thread. That'd let you parcel out fractions of address space pretty precisely without exhausting your memory map or wasting memory. If you knew a little about your address space you could pick an entire contiguous region for them and MAP_FIXED things into place to prevent your address space from growing motheaten.

You could even extend that to accommodate more items than you can hold in your address space by making the mappings file-backed... if you fill a mapping, just truncate() the file one chunk larger and move your map further into the file. This'd also prevent you from exhausting system memory and swap. To sort it, you could qsort in chunks then merge the chunks.
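For what it's worth, the anonymous-mmap version of that idea might start out like this (a sketch; the 64 MiB chunk size is arbitrary, and the flag is MAP_ANON rather than MAP_ANONYMOUS on some BSD-derived systems):

#include <sys/mman.h>
#include <stdio.h>

#define CHUNK_SIZE (64UL * 1024 * 1024)  /* arbitrary per-thread block */

int main(void)
{
    /* One large private anonymous mapping; the pages cost nothing
       until they are actually touched. */
    void *block = mmap(NULL, CHUNK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (block == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* ... parcel entries out of 'block' here ... */

    munmap(block, CHUNK_SIZE);
    return 0;
}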