Tool to simulate non-sequential disk I/O (simulate db file sequential read) in C POSIX

I have come across the same issue a couple of times over the years: read speed on a SAN is absolutely atrocious when doing non-sequential I/O to the disks. The problem, of course, is that most databases do non-sequential I/O. The most common database read operation is the db file sequential read, which, despite the name, does not cause a sequential read of the actual blocks on the device.

My second issue is that it is normally tricky to isolate the different processes well enough to clearly test, or even demonstrate, the exact problem with the non-sequential reads and writes. So I end up in very lengthy discussions about the theoretical possibilities of changing the application rather than actually changing the SAN to handle these types of requests, and while that allows for very creative use of similes, it is not a very efficient use of my time, and I really have little need for more overtime.

So my thought was: how would one go about writing a utility that opens a large file and reads random blocks of data throughout it, simulating the same effect in a controlled environment?

The general layout I was thinking of is:

Input for the program:

name [file to read] [block size] [number of reads]

set block size
set number of reads

get file size

open file

for i < number of reads
    set random block address
    read random block address from file

close file

My problem is: how would I go about reading a random block address from a file?
And is there any way to get the time in milliseconds that the operation took?

And the POSIX bit: the systems I need to run this code on are locked down pretty heavily, and installing a new compiler is a couple of months' worth of work, so I want a tool that can be compiled with almost any old compiler.

On Linux:

#!/bin/bash
# read ten random 1 MB blocks from a ~1 GB file;
# note that skip is counted in bs-sized (1 MB) units
for ((N=0; N<10; N++))
do
        dd if=gigabytefile of=/dev/null skip=$((RANDOM % 1024)) bs=$((1024*1024)) count=1
done
$ ./disktest.sh
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0916211 s, 11.4 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0867173 s, 12.1 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00175788 s, 597 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0166869 s, 62.8 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.00172908 s, 606 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.018392 s, 57.0 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0261557 s, 40.1 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0181711 s, 57.7 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0152302 s, 68.8 MB/s
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.0144154 s, 72.7 MB/s
$

Corona688: Absolutely brilliantly simple solution. Will try to see what would actually happen if I use that.

Thank you very much.

No problem.

Unless you've got SSDs, disks are atrocious for random reads in general. Reading sequential disk blocks (512 bytes), you can get 100+ MB per second from a modern disk. Reading random blocks from that same modern disk, assuming a 15 ms seek time, you get worst-case transfer rates in the double-digit kilobytes per second: one 512-byte block every 15 ms works out to roughly 34 KB/s. Until SSDs started becoming practical, the usual way to overcome this was gigantic amounts of cache.

That's not really random IO from a single process. That's 1 MB of sequential IO from multiple processes, in series.

You really need something lower-level, something maybe like this:

#define _GNU_SOURCE     // needed for O_DIRECT on Linux

#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>

int main( int argc, char **argv )
{
    int fd;
    int ii;
    off_t *offsets;
    struct stat sb;
    int numReads = 1024;

    // get page-sized buffer (Linux direct IO
    // fails unless IO requests are exact
    // multiples of page size)
    size_t readSize = sysconf( _SC_PAGESIZE );

    // get page-aligned buffer (Linux can't handle
    // direct IO unless it's page-aligned)
    char *buffer = valloc( readSize );

    // must actually touch the memory to create
    // the physical page mapping in the process
    // address space
    memset( buffer, 0, readSize );

    // get an array for offsets to read from, since
    // calling lrand48() during the read loop can
    // be slow enough to impact the results, especially
    // on fast devices
    offsets = ( off_t * ) calloc( numReads, sizeof( offsets[ 0 ] ) );

    // need to do direct IO to avoid page cache
#ifdef __linux
    fd = open( argv[ 1 ], O_RDONLY | O_DIRECT );
#else
    fd = open( argv[ 1 ], O_RDONLY );
#endif

#ifdef __sun
    directio( fd, DIRECTIO_ON );
#endif

    fstat( fd, &sb );

    for ( ii = 0; ii < numReads; ii++ )
    {
        // get a random offset that's no larger
        // than the file/device we're reading
        offsets[ ii ] = ( off_t ) lrand48();
        offsets[ ii ] <<= 32;
        offsets[ ii ] += ( off_t ) lrand48();
        offsets[ ii ] %= sb.st_size;
        // mask to get 512-byte offsets
        offsets[ ii ] &= ( off_t ) 0xFFFFFFFFFFFFFE00;
    }

    // do the reads
    // add code to get start time here
    for ( ii = 0; ii < numReads; ii++ )
    {
        pread( fd, buffer, readSize, offsets[ ii ] );
    }
    // add code to get finish time here, then
    // print out results

    close( fd );
    return( 0 );
}

Compile that with "-m64" to get a 64-bit binary that can easily handle devices > 2GB, and run like this on Linux:

./RandomIOTest /dev/sda1

or this on Solaris:

./RandomIOtest /dev/rdsk/c1t2d4s2

Also of note, if you're reading small chunks (8K or so, most common page size), and your storage device has a large read-ahead setting, you'll get much slower performance than you'd otherwise expect as each 8K read can cause the disk controllers to read a whole lot more than 8K per read.

You can get some strange effects with high-speed disk systems. Try to malloc() a 1 GB buffer, and read into it from a high-speed storage system without actually setting the memory you malloc()'d to zero. The data can come in from disk faster than your system's virtual memory system can create the pages to put it into.

And then you can turn around and do something pathologically bad and get only a few KB/sec from that same storage.

achenle: thank you very much. I hope you will not mind if I butcher that code a bit to suit my exact needs; I will post the result once I've completed it.

Thanks to everyone who has responded. I'll be going with reading completely random blocks (OK, somewhat random blocks, if we are being exact) in a fast loop, as that will let me separate the reads as much as possible from any other activity (shell, file open, ...) to simulate what happens when a database reads blocks of data from all over the disk.

/Ben

Have at it. Don't forget: that's meant to be 64-bit code. It might work if you compile it with large-file compile flags/defines, but I'm not sure, as I pretty much do nothing but 64-bit code any more.

I did forget to call srand48() to seed the random number generator. FWIW, something like this would work:

srand48( time( NULL ) );

On Solaris I usually prefer this, since running something like the above C code over and over in a script can give sequential runs the same "random" sequence when time() is used as the seed, because the seed value is the same:

srand48( gethrtime() );

I don't recall offhand if Linux has gethrtime() or an equivalent.

I also didn't try to test or even compile that code. There's a good chance I missed include files and put in something utterly brain dead.

And the location of the read isn't technically uniform: using the % operator means the beginning of the file has a slightly larger chance of being selected, because of the way it wraps the full 64-bit random offset. But unless the file/device is HUGE it won't really matter; even multiple petabytes won't make the difference measurable.

Happy with it not being completely random, as long as it jumps around enough not to get caught by the read-ahead routines when running on the SAN. And as simulations go, a database would more than likely read specific files, or parts of files, more than others anyway.

Here comes my next exciting problem: a database only spends a certain amount of its time reading; the rest goes to sorting, writing and whatever else the database gets up to. So I am thinking of adding a number of cycles between reads, because just doing random reads and nothing else gives rather silly results. In its purest form I seem to be able to get 100% I/O wait and read about 6-12 KB/s, which is pretty impressive in its own right.

And today's exciting question: anyone feel like guessing how many cycles in a million a rather streamlined database spends reading disk data? :)
Anyone who guesses the right answer wins an ice cream, collectable in Camden, London (UK).

/Ben