DD command using block device as input

Hi,

I am trying to measure the speed of reading a given block size using the dd command. However, depending on which input I use, a regular file (on the same device) or /dev/sdb1 directly, I get some really different results.

$sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; dd if=pirate of=/dev/null bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 0.0190755 s, 26.8 kB/s

$sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'; sudo dd if=/dev/sdb1 of=/dev/null bs=512 count=1
1+0 records in
1+0 records out
512 bytes (512 B) copied, 1.9649e-05 s, 26.1 MB/s

Note:

  • Freshly formatted ext4
    ATA device, with non-removable media
    Transport: Serial, SATA Rev 3.0

  • I did the same experiment with varying block sizes, and up to 128K the phenomenon is the same.

  • The file pirate is 100G and filefrag reports: 57 extents found

Questions:

  • Is this normal?
  • I am wondering whether repositioning the disk head to the beginning of a device is fast compared to seeking to an arbitrary file offset, because reading 512 B should be elementary ...

Thanks a lot

Just because it's a raw disk device doesn't mean you're actually reading the disk raw. Disk cache and read-ahead get used for it just like everything else.

Try hdparm for some more direct tests.
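
For instance (my device names are just placeholders for yours):

$sudo hdparm -T /dev/sdb   # timed reads straight out of the cache
$sudo hdparm -t /dev/sdb   # timed sequential reads from the disk itself

If you must use dd, iflag=direct at least opens the device with O_DIRECT so the page cache stays out of the way (the read size has to be a multiple of the logical sector size):

$sudo dd if=/dev/sdb1 of=/dev/null bs=512 count=1 iflag=direct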

Why is read-ahead a problem, since the above is only a 512-byte read? All it should take is positioning the drive head. And if the cache is the reason, it should affect the file as well, but I am getting the same results across all repetitions?

More importantly, my experiment is really about time vs. block size, and I am trying to make it more accurate:

  • Shall I only use regular files?
  • position the disk head somewhere else after each test?
  • is it possible to flush all caches programmatically? (something like the sketch below?)
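
Beyond drop_caches, something like this is what I had in mind (a rough sketch; /dev/sdb is a placeholder, and I'm not sure it reaches the drive's own cache at all):

$sync                                               # write out dirty pages first
$sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'     # drop page cache, dentries and inodes
$sudo blockdev --flushbufs /dev/sdb                 # flush the kernel's buffers for the device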

Sorry for all those questions .. :smiley:

Because, I repeat, you are not getting raw access here.

You are not getting raw access.

You are not telling the drive 'move to sector x, read'.

You are telling the operating system 'give me data from position x'.

The operating system goes 'Hmmm, someone asked for that a little while ago', pulls it from cache, and gives it to you without touching the disk.

Dropping caches all the time is a bad idea. If you really want to get disk access speeds, use hdparm.

Your results aren't accurate. You're running huge programs to do tiny things and most of what you're measuring is going to be error and bias of some sort. If you want to do real benchmarks, hdparm.

There isn't a significant difference between file and raw disk unless your file is very badly fragmented.
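
You can watch this happen yourself: read the same sector twice, then once more bypassing the page cache (device name is a placeholder):

$sudo dd if=/dev/sdb1 of=/dev/null bs=512 count=1                # first read: may touch the disk
$sudo dd if=/dev/sdb1 of=/dev/null bs=512 count=1                # second read: served from the page cache
$sudo dd if=/dev/sdb1 of=/dev/null bs=512 count=1 iflag=direct   # O_DIRECT: page cache bypassed, drive cache still in play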

Alright, I get your point. I thought that using /dev/sdb1 moves to position 0 :wall:
AFAIK hdparm only gives you the speed of the disk and not the time of reading a given amount of data.

So far with my tests I find that reading 16 KB takes about the same time as reading 512 B. I need this information to set the optimal page size for my system (Oracle and MySQL suggest these sizes).

Can you elaborate on why dropping the caches is a bad idea?
How about this sudo procedure (sketched as a script after the list):

  • drop_caches
  • flush disk-cache with hdparm
  • make sure dd is in memory
  • position the read head randomly on disk
  • with dd: read X bytes from the beginning of my file (this is the timed step)
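
Roughly this, as a script run with sudo (device names and the file are placeholders; I'm assuming hdparm -F is the right way to flush the drive's cache, and the "random position" step is just a throwaway read):

#!/bin/bash
sync
sh -c 'echo 3 > /proc/sys/vm/drop_caches'                              # 1. drop the page cache
hdparm -F /dev/sdb                                                     # 2. flush the on-drive write cache (needs a newer hdparm)
dd --version > /dev/null                                               # 3. make sure dd is loaded into memory
dd if=/dev/sdb1 of=/dev/null bs=512 count=1 skip=$RANDOM 2>/dev/null   # 4. park the head somewhere else
dd if=pirate of=/dev/null bs=512 count=1                               # 5. the timed read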

Do a little math. x megabytes per second is 1/x seconds per megabyte.
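
For example, 100 MB/s is 0.01 seconds per megabyte, which puts a 512-byte transfer at roughly 5 microseconds once the head is in position.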

Disks do read-ahead for you. Disks transfer to the host in larger bundles than 512 bytes anyway. Disks even do their own caching which the OS has no control over, which is going to throw off all your results supremely.

Too bad there isn't a tool which can tell you more about what your disk's doing, test uncached reads, or even configure hardware read-ahead to your preference... something like hdparm...

Tell me exactly what they're asking you. I suspect you've gotten it a bit mixed up.

Because it's not realistic. Your system needs cache to work. Disk speeds are going to be awful without it.

Same problem as before: you're running huge programs to do tiny things and your results are going to be meaningless.

At the application level page size is an important concept: it's the unit of interaction with the disk. Imagine a database system requesting a record A; it would be completely inefficient to read only this record (a couple of bytes), so instead we co-locate a bunch of records in a page and we bet on spatial/temporal locality ...
It's a trade-off: set the page size too big and you risk over-reading stuff you don't need; too small and you risk multiple requests for otherwise contiguous records. A bit like what the OS does with 4096-byte pages.
I am trying to study that, using buffered reads of different sizes and calculating the throughput ... I know only the basics about disks, but I am certain that the metric I am studying is not uniform, so doing the math is just meaningless.
I should cope with all the parameters (cache, VM, read-ahead, etc.) and end up with something like:
bs=512B  0.01 sec     50 KB/s
bs=16KB  0.0004 sec   164 MB/s
bs=2MB   0.01 sec     200 MB/s
Then I will decide that 16 KB is the best time vs. throughput trade-off and this will be my page size.
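
What I'm actually running is roughly this sweep (pirate and the block sizes are just my test case, and yes, I know the caches and read-ahead still interfere, which is exactly my problem):

for bs in 512 4K 16K 64K 256K 1M 2M; do
    sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    dd if=pirate of=/dev/null bs=$bs count=1 2>&1 | tail -1   # keep only dd's time/throughput line
done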

Agreed.

However, you're not actually interacting with the raw disk in your tests. You're telling the OS to do so for you, which does as it pleases. It will turn a tiny read into a much larger read for you -- ruining your results.

Even the OS isn't dealing with the disk raw, here. The OS asks the disk and the disk does what it pleases, pulling things from its own cache -- ruining your results.

Not to mention, you're running huge programs to do tiny things, which drowns your numbers in meaningless noise -- ruining your results.

Too bad there's not a program that actually deals with disks the way you want already... Something which can tell you transfer rates, bus modes, and bus speeds. Something which can configure software and hardware read-ahead, flush hardware caches at will, and all that jazz, letting you compare results for different configurations. Something which you can actually tell 'read raw sector x', and it will do so.

They really ought to make a program like that.

I bet they'd call it hdparm.
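
To spell it out (device is a placeholder; the raw-sector read needs a reasonably recent hdparm):

$sudo hdparm -tT /dev/sdb                   # transfer rates: buffered disk reads and cached reads
$sudo hdparm -I /dev/sdb                    # bus mode, speed, drive cache and look-ahead capabilities
$sudo hdparm -a 0 /dev/sdb                  # turn off the kernel's read-ahead for the device
$sudo hdparm -A 0 /dev/sdb                  # turn off the drive's own look-ahead
$sudo hdparm -F /dev/sdb                    # flush the drive's write cache
$sudo hdparm --read-sector 12345 /dev/sdb   # read one raw sector directly from the drive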

Coming back to your comment on the raw device: I think if I replace the read operations with writes and do an fsync() call right after, then I can be sure I am getting raw access there. Also, with hdparm one can disable the write cache.
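
Something like this is what I mean (the scratch file name is a placeholder; writing to the raw device would of course destroy the filesystem):

$sudo hdparm -W 0 /dev/sdb                                             # disable the drive's write cache
$dd if=/dev/zero of=scratch bs=512 count=1 oflag=direct conv=fsync     # O_DIRECT write, fsync before dd exits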

The days of an Operating System being aware of how a disc works are long gone.

A modern disc will use variable block sizes to read/write to the disc and have its own buffers. The Operating System will see the disc as if it were an early design, but the inner workings are invisible to the Operating System. As are the inner workings of disc arrays.

Get hold of the detailed Installation Guide for your Oracle version and your Operating System version.
Just use the kernel parameters recommended by Oracle, the Database Block Size recommended by Oracle, and the Oracle startup parameters recommended by Oracle, scaled for your application and within any memory constraints. The big performance gains within Oracle Performance Tuning are in areas like Sort Buffer Size and tuning the SGA.