Software RAID on Linux

Hey,

I have worked with Linux for some time, but have not gotten into the specifics of hard drive tuning or software RAID. This is about to change. I have a Dell PowerEdge T105 at home and I am purchasing the following:

1GBx4 DDR2 ECC PC6400 RAM
Rosewill RSV-5 eSATA 5-bay disk enclosure with a 1x PCI-e Silicon Image eSATA card

I will soon purchase:
1TBx4 Western Digital Caviar Green drives, with the firmware's error-recovery setting changed to enable TLER (the desktop default leaves it off)
The server already has an 80GB boot drive, which I plan to keep standalone for now, but may mirror in the future when I buy another drive.

I am running OpenSUSE 11.0 and may upgrade to 11.1 if necessary. I plan on running software RAID-5 (md) on the 4 drives and later adding a fifth drive as a hot spare for the set.
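
For concreteness, this is the sort of mdadm sequence I expect it to boil down to (the /dev/sdb through /dev/sdf device names are just placeholders for the array drives):

  # create the 4-drive RAID-5 array; --chunk is in KB, somewhere in the 64-256 range to test
  mdadm --create /dev/md0 --level=5 --raid-devices=4 --chunk=128 /dev/sdb /dev/sdc /dev/sdd /dev/sde

  # later, add the fifth drive as a hot spare
  mdadm --add /dev/md0 /dev/sdf

  # watch the initial sync progress
  cat /proc/mdstat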

I have read that there should be a significant amount of tuning to the file system and the drives themselves. For the bulk of my storage I plan on running JFS; even though it has limited support in OpenSUSE, it is still included, and it has some features I really like, such as dynamic inode allocation and resilience.

I have found advice to set the chunk size somewhere in the 64-256K range and benchmark, as well as to mount with noatime. What other information can you give me to help make this reliable and fast?
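
For testing, I was thinking of something along these lines once the array exists (the mount point and test file path are arbitrary examples):

  # make the JFS filesystem and mount it without access-time updates
  mkfs.jfs -q /dev/md0
  mount -o noatime /dev/md0 /srv/array

  # matching /etc/fstab line
  /dev/md0  /srv/array  jfs  noatime  0 2

  # crude sequential write/read check; repeat after rebuilding the array with a different chunk size
  dd if=/dev/zero of=/srv/array/testfile bs=1M count=4096 oflag=direct
  dd if=/srv/array/testfile of=/dev/null bs=1M iflag=direct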

Speed is not as big an issue, since I will be pulling files over a 100Mbit network, though it may be upgraded to gigabit in time.

The applications I plan to run on this machine are:

HypericHQ
OpenVPN
SAMBA
MondoRescue/Mindi
Maybe VMware (testing only)
Maybe DIMDIM
Perhaps a small local Apache or Lighttpd server
Maybe a test ORACLE instance, kept small

It seems to me that putting all the drives on one bus/SATA adapter is a sure way to degrade performance. I can't answer about JFS, but another technique is to use the Logical Volume Manager (LVM), which can handle dynamically growing volumes. You can either build the RAID 5 with LVM itself, or create single partitions and use "md" software RAID to build a RAID 5 on top of them, with LVM layered over that. Either way, you mount JFS on top of the resulting RAID device.
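
Roughly, that layering would look like this (the device and volume names are invented for the example):

  # md RAID-5 underneath, LVM on top so volumes can be grown later
  pvcreate /dev/md0
  vgcreate vg_array /dev/md0
  lvcreate -L 500G -n lv_storage vg_array

  # JFS goes on the logical volume
  mkfs.jfs -q /dev/vg_array/lv_storage

  # growing later: extend the LV, then let JFS fill it via a remount
  # (assuming the filesystem is mounted at /srv/array)
  lvextend -L +100G /dev/vg_array/lv_storage
  mount -o remount,resize /srv/array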

For IDE and slow PATA drives, yes. But for SATA? I have not heard this.

As far as fs tuning, choosing an appropriate block size helps, and so does periodic defragmenting. For that reason, it makes sense to partition into 3 filesystems: one for infrequent writes, such as program images and libraries; one for write-mostly activity, such as logfiles and /var traffic; and one for read/write-often data, such as configuration files and dynamic data. The first two you will rarely need to defragment; the third you should defragment somewhat frequently. Also, you can disable atime on the first two, but keep it on the third (IMHO).
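
In /etc/fstab terms, that split might look something like this (the volume names and mount points are only an illustration):

  # mostly-static program images and libraries: atime off, rarely defragmented
  /dev/vg_array/lv_apps  /opt  jfs  noatime   0 2
  # write-mostly logs and /var traffic: atime off
  /dev/vg_array/lv_var   /var  jfs  noatime   0 2
  # read/write-often config and dynamic data: leave atime on, defragment now and then
  /dev/vg_array/lv_data  /srv  jfs  defaults  0 2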

I don't see how putting them on the same channel will degrade performance to a significant degree. The theoretical limit on SATA is 300MB/s, and I have not seen any drives that reliably push more than 90MB/s for more than a few minutes at a time. The drives I am going for are lower-power drives with a variable rotational speed in the 5400-7200rpm range.

As far as the drives themselves, I was going to use a tool WD provides to enable RAID-optimized TLER, which prevents deep recovery on a failed read/write, since deep recovery can cause a drop in RAID performance (this is the main difference in their RE series drives).
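
If the WD DOS utility turns out to be a pain, I have also read that newer smartmontools can query or set the same error-recovery timer over SATA, provided the drive firmware exposes it (not all Green firmware does); the device name below is just a placeholder:

  # show the current SCT Error Recovery Control (TLER) setting
  smartctl -l scterc /dev/sdb

  # limit read and write recovery to 7.0 seconds (values are in tenths of a second)
  smartctl -l scterc,70,70 /dev/sdb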

As for the file systems, I usually break into more than 3, giving /boot, /, /usr, /srv, /home, /var and /tmp their own, and will probably go with a swap partition on the boot drive plus a swap file on the array, at least until I get the boot drives mirrored. Then I will drop the swap file and only use the partition on the boot mirror.
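
The interim swap file would be something simple like this (the path and size are arbitrary examples):

  # create and enable a 2GB swap file on the array
  dd if=/dev/zero of=/srv/array/swapfile bs=1M count=2048
  chmod 600 /srv/array/swapfile
  mkswap /srv/array/swapfile
  swapon /srv/array/swapfile

  # once the boot mirror (and its swap partition) exists, retire it
  swapoff /srv/array/swapfile
  rm /srv/array/swapfile
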
I had always planned on LVM for the devices to aid in management. If not LVM, I had thought about using EVMS, but its reduced support makes me hesitant to try it.

That's a good point about SATA throughput and drive throughput. What you must consider, however, is that RAID tries to write 5 blocks to five different drives, one right after another. Each controller gets the data, sends a write request, waits for it to finish, sends the next, etc. Each drive gets the SATA command, positions the head to the correct place, then writes the data. When done serially, this process can reduce your throughput from 90 MB/s to 18 MB/s (90/5).

Since the PCI bus is much faster than the SATA bus, the OS can send the data to multiple controllers by the time the first disk is about to write, so it can take advantage of multiple SATA controllers. (It could be that some SATA PCI controllers have multiple buses and multiple drive controllers, which would be nice.) But if you're fine with roughly 20 MB/s, then a single controller will do.

As far as partitions, leave /boot, /, and /usr on your 80GB drive. I don't know why you would need to separate /tmp from / unless you have a large need for /tmp space (which is very uncommon these days; most programs use the current working directory for scratch space). If you do need /tmp but it mostly holds lots of small files, use a memory-based solution such as tmpfs.
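
The memory-based option would be a tmpfs mount for /tmp, e.g. a line like this in /etc/fstab (the size cap is an arbitrary example):

  # RAM-backed /tmp, capped at 512MB; contents are gone after a reboot
  tmpfs  /tmp  tmpfs  size=512m,mode=1777  0 0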

You will hardly need swap at all with 4+ GB of RAM. If you have one, keep it on the array rather than on the slower 80GB drive.

I plan on putting none of the system partitions on the RAID 5. Everything system-related will be on the 80GB drive, including /tmp and swap, though there won't be much swap, maybe 1GB. I don't like the idea of swap on RAID 5, so I plan to keep it on the 80GB drive, and then on the mirror once I get a second drive.

I keep /tmp separate because I don't want a full /tmp to interfere with the root volume. It is how I have always done it, and I see no need to change to something that may be less resilient, even if it is slightly more convenient.

Also, I believe the OS can cache the write requests until they can be written out, and then do an fsync.
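
From what I have read, that write-behind behaviour is the kernel page cache doing its job, and how much dirty data it holds before flushing can be adjusted through the vm.dirty_* sysctls if it ever needs tuning (the numbers here are only illustrative):

  # start background writeback at 5% dirty memory, block writers at 10%
  sysctl -w vm.dirty_background_ratio=5
  sysctl -w vm.dirty_ratio=10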

If the host controller supports it, the single channel should not have a problem:
FIS(Frame Information Structure)-based switching

FIS-based switching is similar to a USB hub. In this method of switching the host controller can issue commands to send and receive data from any drive at any time. A balancing algorithm ensures a fair allocation of available bandwidth to each drive. FIS-based switching allows the aggregated saturation of the host link and does not interfere with NCQ.

from: Port multiplier - Wikipedia, the free encyclopedia

It turns out the controller that ships with my enclosure, the Sil3132, supports FIS-based switching.

The FIS stuff sounds decent, then. Additional cards would get you some benefit, but for the cost, I suppose that benefit would be rather small.

One reason you might have put /tmp on its own partition is that in the older days of UNIX, lots of programs used /tmp for critical things, like mail programs' temporary files, editors' swap files, and such. These days it's hardly used, except for very small files created by ssh, X, and the like. Even if /tmp does get full, it's unlikely to interfere with system operation, whereas before, when it got full, things like editing a system file became dangerous.

Pretty much the only writes to / should be going to /tmp, so putting /var and /tmp onto the same volume isn't a bad idea; then / can be practically read-only (unless you need to update a config file). The only hitch is that mtab, the linker cache, and adjtime need to be symlinks to somewhere in /var or /tmp.
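
The symlink part would look roughly like this (standard paths shown; check how your distribution and ldconfig handle these before relying on it):

  # mtab can simply point at the kernel's own mount table
  ln -sf /proc/self/mounts /etc/mtab

  # adjtime relocated to a writable volume
  mv /etc/adjtime /var/lib/adjtime
  ln -s /var/lib/adjtime /etc/adjtime

  # the linker cache (/etc/ld.so.cache) would need similar treatment,
  # but check whether ldconfig rewrites the file in place first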

I got the enclosure. In case anybody was wondering, it is the Rosewill RSV-S5 and it is $199 till the end of the day today at the egg.

As for FIS with multiple cards, the only issue I can see is that they will be limited by the 250MB/s on my PCI-E 1x slot.

As far as what you said about /tmp, that makes sense. I came from an HP-UX world before I went to Linux (well, I played with Linux a little in college, but did not really use it as much as HP-UX and some AIX until we started using it at that company).
On HP-UX, if /tmp was full you could not even install a patch or software, since packages were written out to /tmp before installation.

I still feel more comfortable with it on its own partition, but the things you said about / being nearly read-only piqued my interest. I doubt I will give it a try except maybe in a virtual instance, but it is something to consider.