Memory release latency issue

I have an application that routinely malloc()s and realloc()s gigabyte-sized blocks of memory for image processing; specifically, rotating huge images and creating/deleting huge image buffers that hold multiple images. Immediately upon completion of an operation I call free() to release the memory.

I've noticed dramatic performance disparities depending on the order in which operations are performed. The first call to a function completes quickly, but subsequent calls can take up to 5X as long as the first; exact same code. All terminate normally; the issue is performance, or the lack of it.

It appears that after I free() a block of memory, the system, for reasons unknown to me, does not make that resource available again for an indeterminate period. I free the memory, but the system behaves as if the memory were still in use. There is no logic error in the freeing; the only path to a return is through the free() call.

I'm a coder, not a systems expert. Any ideas out there? What is going on? Language is C/C++.

Many thanks in advance.

Imagtek

---------- Post updated at 12:34 PM ---------- Previous update was at 12:32 PM ----------

The system is CentOS, 64-bit, kernel release 2.6.32-358.14.1.el6.x86_64.

Likely the allocator and kernel are not optimized for free()ing blocks this large.
Releasing gigabytes is a complex task, and it may block new allocations.
You can try an OS upgrade to a higher kernel version.
Or try to optimize your code: call free() less often.

In short, malloc() is the wrong tool for throwing around entire gigabytes of memory at once. You should cut out the middleman and use mmap().

The first time you request an entire gigabyte of memory, malloc() probably has to call brk() to extend the heap segment (brk() is the system call that grows the heap; for very large requests malloc() may use mmap() instead). This adds a vast new region of unused memory to the heap -- memory that's guaranteed to hold nothing but zero bytes, because the kernel hands it to you already cleaned.

Then you free() it and malloc() it again. With glibc, blocks this large are typically serviced by mmap() behind the scenes and handed straight back to the kernel on free(). The next allocation then gets fresh pages, and the kernel must zero-fill each page as you first touch it, so you effectively pay to "clean" the entire gigabyte again on every cycle.

By using mmap() yourself, you control when that happens, and the zeroing is done lazily, page by page as you touch the memory, instead of being forced on every alloc/free cycle. mmap() also has other useful features like file backing -- if all you're doing is dumping 5 gigs of file into memory, mmap() can save you a ton of trouble, time, and RAM.

Or, if you went the other direction, you could just keep reusing the same block of memory the whole time without free()ing it.

That's because free() usually does nothing more than return the freed memory to the allocator's pool, making it available for your next malloc() call; it does not necessarily give it back to the OS.

Why are you using malloc() and free() over and over, anyway? Just malloc() (or mmap()) a few chunks that you know will be big enough and use the same ones over and over.

Perhaps I misunderstood you before. So the problem isn't the speed of the free(), but the memory use?

It's as achenle says: it is in use. malloc() assumes that if you've allocated a block once, you're going to allocate one again, and keeps it in the pool for later. If you want control over exactly when memory is released to the OS, you need mmap().

Plus, if you repeatedly call malloc()/free() for very large chunks of varying sizes, malloc() will gladly fragment the heap to the point where it becomes less efficient. This is due in part to the fact that some OS flavors may reclaim memory after a free() call, especially if there are other processes asking for memory. NUMA also plays into big-chunk operations.

Several years ago we ran a test on a non-prod Solaris 10 box with 64 GB of memory. We malloc()ed one single giant chunk and never called malloc() again, reusing the chunk over and over with varying-sized buffers. When we added the malloc()/free() calls back in between every operation, each on a new chunk, the same test code ran about 15% slower and spent most of that extra time in kernel mode.

NUMA really slows down access to large memory allocations because of locality issues: the system cannot relocate gigantic memory chunks to more convenient locations. Since you have a commodity CPU (multicore x86), NUMA is a concern.
You should look into CPU affinity for threads.

If you are reading from and then writing to vastly distant memory locations, you need to be aware of access order: prefer working through neighboring memory rather than, say, copying the contents of arr[0] to arr[2000000] and then reading arr[1000000]. Each of those jumps can mean reloading the L2 cache, as one example. These days, main memory is an order of magnitude or more slower than your CPUs.

Edit: You really should consider this article:

It is somewhat old, but still completely applicable.


Thanks all for the very informative replies. Memory allocation at the system level is more complex than I thought. I'll dig into the mmap() possibility. Part of my design-for-performance strategy when working with huge images is to code low-level, as close to the system as possible, so it looks like there's more work to do there. As I said, the first time through, these algorithms fly; then it's like they get stuck in the mud. Sometimes simply painting the screen hangs for seconds at a time, always immediately after using/freeing massive blocks of memory.

I'll play around with some of these ideas and let you know what I find. I'm pushing my old 8 GB machine to its limits, maybe a bit past them, but that's what it's for.

Thanks again for the valuable information.
imagtek

As pointed out earlier, why not just keep them around?

8GB is nothing to sneeze at.

Well, as a matter of fact, my application is already a memory hog. Each RGBA pixel is 4 bytes, and the display footprint is barely visible on some of these images. I keep image buffers in memory so the user can 'pan' them, using a compressed view to navigate. No problems there, because the display footprint is never that much memory and I just discard motion events that occur while the screen is updating. I also redefine the XImage as the new mapping to the display with every update, because X was crashing when trying to map an entire 4+ GB buffer. So, since it's already a memory hog and can't assume it's the only app running, just grabbing a massive scratch buffer to keep handy seems like a bad idea. But maybe I'm missing something about how virtual memory works with that attitude?

But you are using massive scratch buffers, and the memory has to come from somewhere. Give it away and you might have a hard time getting it back.

How about a mid-way compromise? Let the OS decide which bits to cache. Let the OS decide how much memory can be spared against the competition of other programs. Let the OS do all the heavy lifting. You can do this all while still using what's, to your program, still a huge, contiguous chunk of memory.

Make and keep a huge scratch file. You can map that into memory with mmap() -- memory accesses effectively become file accesses. This takes advantage of the file cache, so the OS will cache any bits of it you're using. The OS will cache recently accessed chunks, and turf infrequently accessed chunks back to disk. This is how many large database applications do I/O. If you have enough memory, you'll end up dealing almost exclusively with memory, not waiting for disk.

Your application will have to wait for bits to load, but it will wait for page-sized chunks, not an entire 8-gigabyte allocation. And if you have enough memory, the OS can cache everything without you having to ask. You can also do fun things like "copy-on-write", meaning the mapping works just as described above for reads, but when you write to any of the memory, it won't bother writing the changes to disk; it just keeps them in RAM.

You can also optimize things further by warning the OS so it doesn't have to guess: "Could you cache these nearby parts for me, since the user's probably going to scroll over them soon? More important than whatever faraway parts you have cached, thanks." That's what posix_madvise() can do for mmap()ed segments.
