How to handle large amounts of data without running out of memory

Hi,

I'm trying to learn how to manage memory when I have to deal with lots of data.

Basically I'm indexing a huge file (5GB, but it can be bigger) by creating tables that
hold offset <-> startOfSomeData information. Currently I'm mapping the whole file at
once (yep!), but of course the application quickly runs out of memory and malloc'ing
a new table fails after a while.

My first question, which is more a request for confirmation, is the following:
does a mapped file count toward memory usage? I'd say yes, 99.9% sure, but I'd like
to be certain.

Second, I'd like to know which syscalls are available to retrieve memory
information about the calling process (how much memory is used, how much is left, etc.).

I'm of course going to map a few pages at a time, although that will make the file
trickier to parse. Anyway, I'd like to know how I should deal with the tables I create and
keep in memory. If I dump a few tables to a temporary file and mmap it for quick access,
that would be the same thing if the answer to my first question is yes. I definitely
would not want the kernel to start swapping memory; I'd rather have a thread
that concurrently writes those tables to a file for later retrieval.

Anyhow, my main concern is not to run out of memory (i.e. malloc must not fail).

Any suggestions from people with more expertise are very welcome.
I'm eager to learn.

Thanks,
S.

Yes, a memory mapping counts as part of your process's address space. It is possible to map a file shared and have another process do the actual processing, but that does not necessarily solve the memory usage problem. If you are running out of memory, check how ulimit is set.

ulimit example -

/home/jmcnama> ulimit -a
time(seconds)        unlimited
file(blocks)         unlimited
data(kbytes)         2015464
stack(kbytes)        256000
memory(kbytes)       unlimited
coredump(blocks)     4194303

Two of these lines apply to your question (they were highlighted in red in the original post), presumably data(kbytes) and memory(kbytes).

getrusage() returns resource usage for the calling process or for its children.
getrlimit() reads the limits that ulimit reports, and setrlimit() lets you change the soft values. You cannot go beyond the hard limits unless the sysadmin reconfigures your account/kernel.
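
As a rough illustration of these calls, here is a minimal C sketch. It assumes a POSIX system with getrlimit(), setrlimit() and getrusage(); the helper names (get_data_limit, raise_soft_limit, peak_rss) are made up for this example. Note that the unit of ru_maxrss differs between systems (kilobytes on Linux), so treat the number as indicative only.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Read the soft and hard data-segment (heap) limits of this process.
 * Returns 0 on success, -1 on error. Hypothetical helper name. */
int get_data_limit(struct rlimit *out)
{
    return getrlimit(RLIMIT_DATA, out);
}

/* Raise the soft limit of a resource up to its hard limit.
 * An unprivileged process may do this, but may not exceed the hard limit.
 * Returns 0 on success, -1 on error. */
int raise_soft_limit(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(resource, &rl);
}

/* Peak resident set size of the calling process, or -1 on error.
 * The unit is system dependent (kilobytes on Linux). */
long peak_rss(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_maxrss;
}

/* Print a limit value, showing RLIM_INFINITY as "unlimited". */
void print_limit(const char *name, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", name);
    else
        printf("%s: %llu bytes\n", name, (unsigned long long)value);
}
```

A caller would typically call get_data_limit() once at startup, print both fields with print_limit(), and only then decide whether raise_soft_limit(RLIMIT_DATA) is worth trying.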

Also, you can increase virtual memory simply by adding swap space. If the ulimit for memory is unlimited, the amount of virtual memory is the actual limit on process memory space.

Check out vmstat for more information.
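
On the plan to map only part of the file at a time: a minimal sketch of that idea in C might look like the following. The window size and the function name are assumptions chosen for illustration, and counting newlines merely stands in for whatever parsing builds the index. A real parser also has to handle records that straddle two windows, which this sketch ignores.

```c
#define _FILE_OFFSET_BITS 64   /* 64-bit off_t so files over 2GB work on 32-bit systems */

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical window size (64 MiB). Every window offset is a multiple
 * of this value, so offsets stay page aligned as mmap() requires. */
#define WINDOW_SIZE ((off_t)64 * 1024 * 1024)

/* Walk a file one mapped window at a time, so at most WINDOW_SIZE bytes
 * of it are mapped at any moment. Counts newline bytes as a placeholder
 * for real parsing. Returns the count, or -1 on error. */
long long count_newlines_windowed(const char *path)
{
    struct stat st;
    long long count = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return -1;
    }

    for (off_t off = 0; off < st.st_size; off += WINDOW_SIZE) {
        off_t remaining = st.st_size - off;
        size_t len = (size_t)(remaining < WINDOW_SIZE ? remaining : WINDOW_SIZE);
        unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);

        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        for (size_t i = 0; i < len; i++)
            if (p[i] == '\n')
                count++;
        munmap(p, len);   /* release this window before mapping the next */
    }

    close(fd);
    return count;
}
```

Because each window is unmapped before the next one is mapped, the address space taken by the file stays bounded no matter how large the file grows.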