RAID0 array stresses only 1 disk out of 3

Hi there,

I've set up a RAID0 array of 3 identical disks using:

mdadm --create --verbose /dev/md0 --level=stripe --raid-devices=3 /dev/sdb1 /dev/sdc1 /dev/sdd1

I'm using dstat to monitor the disk activity:

dstat --epoch -D sdb,sdc,sdd --disk-util 30

The results show that the stress is not evenly split (striped) across the 3 disks:

2016-04-11 09:35:30 |   26%   28%   27%
[...]
2016-04-11 10:15:00 |    0%  100%    0%
2016-04-11 10:15:30 |    0%    3%   97%
2016-04-11 10:16:00 |    0%    0%   81%
2016-04-11 10:16:30 |    0%    0%  100%
2016-04-11 10:17:00 |    0%    0%   30%
[...]
2016-04-11 11:28:30 |    0%    0%   55%
2016-04-11 11:29:00 |    0%    0%   49%
2016-04-11 11:29:30 |    0%    0%   31%
2016-04-11 11:30:00 |    0%    0%   73%
2016-04-11 11:30:30 |    0%    0%    4%
2016-04-11 11:31:00 |    0%    0%   99%
[...]
2016-04-11 11:32:00 |    0%    0%   81%
2016-04-11 11:32:30 |    0%    0%   43%
[...]
2016-04-11 11:43:30 |    0%   93%    0%
2016-04-11 11:44:00 |    0%  100%    0%
2016-04-11 11:44:30 |    0%   97%    0%
2016-04-11 11:45:00 |    0%  100%    0%
2016-04-11 11:45:30 |    0%   10%    0%
[...]
2016-04-11 11:51:30 |    0%   79%    0%
2016-04-11 11:52:00 |    0%  100%    0%
2016-04-11 11:52:30 |    1%    9%    1%
2016-04-11 11:53:00 |    0%  100%    0%
2016-04-11 11:53:30 |    0%   98%    0%
2016-04-11 11:54:00 |    0%   30%    0%
2016-04-11 11:54:30 |    1%    1%    1%
2016-04-11 11:55:00 |    2%    3%    2%
[...]
2016-04-11 12:07:30 |    0%   68%    1%
2016-04-11 12:08:00 |    0%  100%    0%
2016-04-11 12:08:30 |    0%  100%    0%
2016-04-11 12:09:00 |    0%   38%    0%
[...]
2016-04-11 12:23:00 |    0%   84%    1%
2016-04-11 12:23:30 |    0%   58%    0%
[...]
2016-04-11 14:17:00 |    0%   43%    0%
2016-04-11 14:17:30 |    0%   99%    0%
2016-04-11 14:18:00 |    0%  100%    0%
2016-04-11 14:18:30 |    1%    6%    1%
[...]
2016-04-11 14:46:30 |    2%    2%    1%
[...]
2016-04-11 14:48:00 |    1%    9%    1%
2016-04-11 14:48:30 |    0%  100%    0%
2016-04-11 14:49:00 |    0%   96%    0%
2016-04-11 14:49:30 |    0%  100%    0%
2016-04-11 14:50:00 |    0%   99%    0%
2016-04-11 14:50:30 |    0%  100%    0%
2016-04-11 14:51:00 |    0%   41%    0%
2016-04-11 14:51:30 |    0%  100%    0%
2016-04-11 14:52:00 |    2%   18%    2%
[...]
2016-04-11 15:23:30 |    3%    5%    3%
[...]
2016-04-12 09:25:30 |    4%    3%    3%

Do you have an explanation?
Thanks for your help.

Santiago

OS: Debian Wheezy 7.4
Disks: ATA Hitachi HUA72302, 2000GB

Hi,

This could be a number of things, but it will most likely revolve around the stripe size.

Regards

Gull04

Hi Gull04,

Thank you for your answer.
Is "stripe size" the same as "chunk size"?

Apparently, mine is 512k:

cat /proc/mdstat

returns

Personalities : [raid0]
md0 : active raid0 sdd1[2] sdc1[1] sdb1[0]
      5860543488 blocks super 1.2 512k chunks

unused devices: <none>
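For what it's worth, mdadm itself reports the same value (assuming the array is /dev/md0 as above):

mdadm --detail /dev/md0 | grep -i chunk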

How can I identify if this is the source of the problem?

Regards
Santiago

RAID0 "stripes" the data across the three actuators you have and the stripe size (that's official RAID speak) is the minimum allocation. So if the stripe is 2k then the first 2k bytes of a file is written to the first drive, the next 2k to the second drive, and the third 2k to the third drive. It then goes back to the first drive, and so on.

So it's not difficult to see that writing lots of small files will give unpredictable results, especially if they're less than 2k each. Also, read requests can only be satisfied by reading the drive(s) where the files were written.

So your results are misleading.

If you have a desire to test this then you need to do something like this:
Create a 4GB file on (ideally) an internal drive that is not part of this RAID0 array. Kick all the users off if you can, then copy this 4GB file to the RAID filesystem and take your measurements whilst that's going on. It won't be precise but it should give you a better set of figures.
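Just as a sketch (the paths are placeholders; the dstat line is the one you already use):

dd if=/dev/zero of=/home/testfile bs=1M count=4096   # 4GB source file on a non-RAID internal drive
cp /home/testfile /path/on/the/raid/testcopy &       # copy it onto the RAID filesystem
dstat --epoch -D sdb,sdc,sdd --disk-util 30          # watch utilisation while the copy runs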

Wouldn't it be sufficient to fire 4GB worth of any data (for instance some brand-new hexadecimal zeroes fresh out of /dev/zero) at it with dd? Like

dd if=/dev/zero of=/the/raid/somefile bs=1G count=4

True, this will be off by the overhead of reading /dev/zero, but wouldn't that be negligible given the bandwidth of the disks and of the memory interface (which are some orders of magnitude apart)?

I hope this helps.

bakunin


@bakunin......point taken.....good idea.


Hi guys,

Thank you very much for your contributions.

First of all, my problem does not happen any more. I created the RAID with sdb, sdc and sdd on April 11 at 09:35.
Until 11:32, sdd was very busy, then until 14:51, sdc was very busy.
Since then (3 days), the 3 disks have all been under the same moderate load (0-20%). The server is used by 5 graphic designers manipulating quite large files (100M-2G).

I ran some tests and the results leave me quite puzzled. I created 10 files simultaneously, 1GB each, but all the load went on sda, leaving sdb, sdc and sdd with a moderate 20% load.

The command:

for i in {1..10}; do
  file=$(mktemp /galaxy/XXXXXXX)
  echo $file >> /galaxy/dd.files
  dd if=/dev/zero of=$file bs=1G count=1 &
  echo $!    >> /galaxy/dd.pids
done

The output of dstat:

----system---- sda--sdb--sdc--sdd-
     time     |util:util:util:util
14-04 15:56:30|  21:   0:   0:   0
14-04 15:57:00| 100:   0:   0:   0
14-04 15:57:30| 101:   0:   0:   0
14-04 15:58:00| 100:   2:   2:   1
14-04 15:58:30| 101:   3:   4:   2
14-04 15:59:00| 102:   4:   5:   4
14-04 15:59:30|  98:   2:   3:   2
14-04 16:00:00| 100:   4:   4:   2
14-04 16:00:30| 103:  16:  16:  15
14-04 16:01:00|  98:  16:  17:  15
14-04 16:01:30| 101:  15:  15:  15
14-04 16:02:00|  99:   9:   8:   8
14-04 16:02:30| 100:   3:   4:   3
14-04 16:03:00| 100:   2:   4:   3
14-04 16:03:30| 104:   4:   4:   3
14-04 16:04:00|  95:   4:   4:   3
14-04 16:04:30| 100:   3:   4:   2
14-04 16:05:00| 101:   3:   4:   3
14-04 16:05:30|  99:  12:  13:  12
14-04 16:06:00| 102:  20:  22:  18
14-04 16:06:30|  98:  17:  19:  18
14-04 16:07:00| 101:   7:   9:   8
14-04 16:07:30|  99:   4:   5:   3
14-04 16:08:00| 102:   4:   5:   3
14-04 16:08:30|  98:   3:   5:   3
14-04 16:09:00| 100:   5:   7:   5
14-04 16:09:30| 101:   5:   5:   4
14-04 16:10:00| 100:   4:   4:   2
14-04 16:10:30| 100:  17:  18:  16
14-04 16:11:01| 105:  16:  20:  16
14-04 16:11:30|  95:  15:  17:  17
14-04 16:12:00| 100:  12:  11:  10
14-04 16:12:30|  34:  15:  16:  14

Is /dev/zero an actual file on sda?
How do you interpret the results?

Regards
Santiago

You are producing stats for the whole disks, but the stripe is made of partitions on those disks. Are there other partitions on those disks? Also, what filesystem was created on the striped md device, and where is it mounted?

/dev/zero is a character (c) device on the root filesystem in the directory /dev
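You can see that for yourself:

ls -l /dev/zero   # the leading "c" in the permissions column marks a character device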

What output does df produce?

Hi Scrutinizer,

Thanks for your contribution.

sda contains /, /tmp, /usr, /var and swap.
sdb, sdc and sdd contain only 1 partition each: sdb1, sdc1 and sdd1.
md0 is made of sdb1, sdc1 and sdd1.

Output of df :

Filesystem                                              1K-blocks       Used  Available Use% Mounted on
rootfs                                                    1922416     337512    1487248  19% /
udev                                                        10240          0      10240   0% /dev
tmpfs                                                      403248       5800     397448   2% /run
/dev/disk/by-uuid/314e823f-91b2-42f9-9a4b-d66b5e202e27    1922416     337512    1487248  19% /
tmpfs                                                        5120          0       5120   0% /run/lock
tmpfs                                                     2368980          0    2368980   0% /run/shm
/dev/sda2                                                 1922416      35744    1789016   2% /tmp
/dev/sda3                                                 4806140     683960    3878040  15% /usr
/dev/sda6                                                 9612252    1023896    8100076  12% /var
/dev/md0                                               5814368784 1562158184 3959183428  29% /galaxy

Output of parted -l (condensed):

Disk /dev/sda: 2000GB
Number  Start   End     Size    File system     Name  Flags
 1      17.4kB  2000MB  2000MB  ext4                  boot
 2      2000MB  4000MB  2000MB  ext4
 3      4000MB  9000MB  5000MB  ext4
 4      9000MB  14.0GB  5000MB  ext4
 5      14.0GB  22.0GB  8000MB  linux-swap(v1)
 6      22.0GB  32.0GB  10.0GB  ext4
 7      32.0GB  2000GB  1968GB  ext4            lvm

Disk /dev/sdb: 2000GB
Number  Start  End     Size    Type     File system  Flags
 1      512B   2000GB  2000GB  primary               raid

Disk /dev/sdc: 2000GB
Number  Start  End     Size    Type     File system  Flags
 1      512B   2000GB  2000GB  primary               raid

Disk /dev/sdd: 2000GB
Number  Start  End     Size    Type     File system  Flags
 1      512B   2000GB  2000GB  primary               raid

Disk /dev/md0: 6001GB
Number  Start  End     Size    File system  Flags
 1      0.00B  6001GB  6001GB  ext4


I found the problem.

The reason why sda is under heavy stress when using dd on md0 (sdb, sdc, sdd) is the swap.

for i in {1..10}; do
  dd if=/dev/zero of=$(mktemp /galaxy/XXXXXXX) bs=1G count=1 &
done

Running dd 10 times with bs=1G requires 10G of memory, which I don't have. So the system uses the swap on sda while md0 quietly waits, doing nothing.
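An easy way to confirm this kind of thing (just a suggestion, I didn't capture it at the time) is to watch the swap columns while the test runs:

vmstat 5   # non-zero values in the si/so (swap-in/swap-out) columns mean the box is swapping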

for i in {1..10}; do
  dd if=/dev/zero of=$(mktemp /galaxy/XXXXXXX) bs=256M count=4 &
done

Running dd 10 times with bs=256M requires 2.5G of memory, which I do have. So all the stress goes to md0.
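For what it's worth, an even safer variant keeps dd's buffer small so memory never gets in the way (still 1GB per file, just written in 1M blocks):

for i in {1..10}; do
  dd if=/dev/zero of=$(mktemp /galaxy/XXXXXXX) bs=1M count=1024 &
done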