Linux: Access time of mapped data

rusttree · June 2, 2009, 10:44pm

Before I forget, I'm running on a RedHat 5 box with the following uname -a output:
Linux gnc141c 2.6.18-53.el5 #1 SMP Wed Oct 10 16:34:19 EDT 2007 x86_64 x86_64 x86_64 GNU/Linux

Now on to my question.

I'm using a tool that maps a Matlab .mat file using the Linux mmap functionality and then provides access to the saved variables (this is all done at the Matlab command prompt). I've found that using a variable in a calculation, such as

a = foo*1;

(where foo was a variable saved in the .mat file) forces the variable into active memory. I'm finding, however, that the time to execute the line of code above in Matlab depends on where the variable was saved in the .mat file.

If "foo" was the 1st variable saved in the .mat file, I can execute that line in microseconds. If "foo" was the 20th variable in the .mat file, the time to execute jumps by a factor of roughly 1000, into the millisecond range. And then, to make it more confusing, the last handful of variables at the end of the .mat file drop back down to the microsecond range. For the sake of example, all of the variables are the same size. In general, the first couple of variables in the .mat file are accessible in microseconds, the middle 80% or so of the variables take milliseconds, and the last couple go back to microseconds.

I can understand it taking a little longer to find a variable further down in the file, but by a factor of 1000? And what about the variables at the end that are accessible quickly again? From my very limited understanding of what goes on behind the scenes, I think this has something to do with Linux's memory management.

Another little tidbit I noticed is that once I've accessed one of the variables (and taken the millisecond time penalty), many of the other millisecond variables now take microseconds. This stays true until I reboot (a cache thing, I think).
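The "slow the first time, fast afterwards" pattern can be reproduced outside Matlab with a small timing sketch. This is not the mapping tool I'm using; `time_two_reads` and `make_demo_file` are made-up helper names, and the file is a scratch file of random bytes rather than a real .mat file. Note that a freshly written file is probably still in the kernel's page cache, so the gap here would likely be much smaller than what I see after a reboot.

```python
import mmap
import os
import tempfile
import time


def make_demo_file(size=4 * 1024 * 1024):
    """Create a scratch file of random bytes (a stand-in, not a real .mat file)."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(os.urandom(size))
    return path


def time_two_reads(path, offset, nbytes=8):
    """Map `path` read-only and time two consecutive reads of the same bytes."""
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        t0 = time.perf_counter()
        first = m[offset:offset + nbytes]   # first touch: may trigger a page fault
        t1 = time.perf_counter()
        second = m[offset:offset + nbytes]  # same page again: already mapped in
        t2 = time.perf_counter()
    return first, second, t1 - t0, t2 - t1


if __name__ == "__main__":
    demo_path = make_demo_file()
    _, _, first_s, second_s = time_two_reads(demo_path, 2 * 1024 * 1024)
    print(f"first touch:  {first_s * 1e6:.1f} us")
    print(f"second touch: {second_s * 1e6:.1f} us")
    os.remove(demo_path)
```

To see anything close to the millisecond penalty, the file would have to be cold on disk (for example, right after a reboot), which this sketch does not arrange.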

In case you're wondering about a Matlab .mat file, all the variables are stored as double-precision arrays, one at a time, top to bottom. So the first variable is listed at the top of the file, followed by all of its data, followed by the next variable and all of its data, etc.
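Given that layout, it is easy to estimate which pages of the mapping a given variable lands on. The sketch below is a simplification: it assumes every variable has the same number of doubles (8 bytes each) and lumps any file header or per-variable metadata into a single `header_bytes` guess, which is my own placeholder and not something taken from the .mat format. `pages_for_variable` is a made-up name.

```python
import mmap


def pages_for_variable(index, n_elements, header_bytes=0, page_size=mmap.PAGESIZE):
    """Return (first_page, last_page) touched by the index-th variable (0-based).

    Assumes equal-sized double-precision arrays stored back to back;
    header_bytes is a hypothetical allowance for anything before the data.
    """
    nbytes = n_elements * 8                  # double precision = 8 bytes per value
    start = header_bytes + index * nbytes    # byte offset of the first value
    end = start + nbytes - 1                 # byte offset of the last byte
    return start // page_size, end // page_size


if __name__ == "__main__":
    # With 1000 doubles per variable, compare the 1st and the 20th variable.
    for i in (0, 19):
        print(i, pages_for_variable(i, 1000))
```

With 4096-byte pages and 1000 doubles per variable, the 1st variable starts on page 0 while the 20th starts on page 37, so only the early variables sit on the very first pages of the file.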

Any thoughts? This problem is significant to me because I'm working with thousands of .mat files. The difference between a microsecond and a millisecond per access can add up to hours of processing time.

Thanks,
Dan

rusttree · June 3, 2009, 1:53pm

I just got some advice from a Linux guru. He suggested that all of the variables take milliseconds to access, even the ones at the beginning of the file that appear to take microseconds. The difference is that the mapper, by default, loads the first page of data into memory when it is first called. So all those variables at the beginning of the .mat file are already in memory right off the bat.

Still doesn't explain why the last couple of variables also appear to be loaded into memory by default... but that could just be an oddity of the mapper.
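If the explanation really is that only some pages are resident up front, one thing that might help is hinting the kernel to read a variable's pages in before they are needed. Here is a rough sketch using Python's `mmap.madvise` with `MADV_WILLNEED` (Python 3.8+ on platforms that provide it). `prefetch_range` is a made-up helper, the hint is only advisory, and I have no idea whether the Matlab mapping tool exposes anything similar.

```python
import mmap
import os
import tempfile


def prefetch_range(m, offset, nbytes):
    """Ask the kernel to read in the pages covering [offset, offset + nbytes).

    Returns False if this Python/platform has no madvise support.
    The kernel is free to ignore the hint.
    """
    if not hasattr(m, "madvise") or not hasattr(mmap, "MADV_WILLNEED"):
        return False
    # madvise wants a page-aligned start, so round the offset down.
    start = (offset // mmap.PAGESIZE) * mmap.PAGESIZE
    length = (offset + nbytes) - start
    m.madvise(mmap.MADV_WILLNEED, start, length)
    return True


if __name__ == "__main__":
    # Scratch file standing in for a .mat file.
    fd, demo_path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(b"\x00" * (64 * mmap.PAGESIZE))
    with open(demo_path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        print("hint sent:", prefetch_range(m, 20 * mmap.PAGESIZE, 8000))
    os.remove(demo_path)
```

The hint would need to be issued early enough for the read to finish before the variable is actually used; otherwise the first access still waits on the disk.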