Help with __builtin_prefetch function and its timing

Hello there, I need to know how to get the timing right when using the gcc __builtin_prefetch() function, that is, how many instructions before the data is actually used I should make the prefetch call.
I will be measuring the L1 cache hit rate with valgrind's cachegrind, simulating a 1 KB L1 data cache.

In case you are wondering what the point of all this is, it's a university project.

AFAIK, it depends on your processor and the 'data' you are processing. Have you looked at the prefetch test cases in the GCC testsuite (.../testsuite/gcc.misc-tests/i386-prefetch.exp)?

The data consists of two-dimensional arrays that are much bigger than the L1D cache. Our professors expect us to "guess" the prefetch distance by introducing timers in the program, but I have yet to see an example of that.

The prefetch instruction should be issued ahead of the first use of the data by at least the latency of an L1 cache miss. You should be able to find out what that latency is for the machine you are using. I imagine that your code will have some sort of loop over the array, so take into account the time taken per loop iteration (excluding cache latency), the amount of data read per iteration, and the size of the L1 data cache. These figures should give you an idea of the memory bandwidth you need to sustain, and hence how often you need to prefetch and how far ahead of the current element each prefetch should reach. This gives you somewhere to start, but your professors are correct: guessing and timing is the only way to be sure.
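
To make that concrete, here is a rough sketch in C of what "estimate, then time" could look like. It is only an illustration under assumptions of mine: the 2-D array is treated as one flat row-major block of doubles, the loop just sums the elements, and the names initial_distance, sum_with_prefetch and time_sum are made up for this example. The first guess is simply the miss latency divided by the time per iteration, rounded up.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* First guess at the prefetch distance, in loop iterations.
 * miss_latency and iter_time must be in the same unit (cycles or ns);
 * both values are assumptions you measure or look up for your machine. */
static size_t initial_distance(double miss_latency, double iter_time)
{
    assert(iter_time > 0.0);
    size_t d = (size_t)(miss_latency / iter_time);
    if ((double)d * iter_time < miss_latency)
        d++;                      /* round up so the latency is fully covered */
    return d ? d : 1;
}

/* Sum n doubles, prefetching the element `dist` iterations ahead.
 * dist == 0 disables prefetching and serves as the baseline. */
static double sum_with_prefetch(const double *a, size_t n, size_t dist)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (dist != 0 && i + dist < n)
            __builtin_prefetch(&a[i + dist], 0, 3);  /* 0 = read, 3 = high locality */
        sum += a[i];
    }
    return sum;
}

/* Time one pass in seconds of CPU time; the sum is returned through
 * *result so the compiler cannot discard the loop. */
static double time_sum(const double *a, size_t n, size_t dist, double *result)
{
    clock_t t0 = clock();
    *result = sum_with_prefetch(a, n, dist);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

You would start from initial_distance(), then call time_sum() for a range of distances around that value (plus 0 as the no-prefetch baseline), repeat each measurement a few times, and keep the fastest. The distance here is counted in elements; since a whole cache line is fetched at once, prefetching once per line instead of once per element may be enough, but that depends on the line size of your (simulated) cache.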