Prediction of failures

incredible · August 5, 2009, 12:05pm

Is there a diagnostic tool that can run a predictive check on all the Sun hard disks before they fail, as a preventive measure? In other words, is there a tool that can really identify HDDs that are failing or "will fail soon" on Sun servers?

Sun_Fire · August 5, 2009, 2:20pm

SunVTS is the tool that's supposed to do this. It should do a stress test on the machine.

jlliagre · August 5, 2009, 5:34pm

SunVTS is a tool designed to validate hardware components against Solaris. It can be used to stress components, but that wouldn't be good practice.

A better-suited tool would be the Solaris Fault Manager (a.k.a. Predictive Self-Healing), which is designed precisely to check components before they fail.

Have a look at this blog for an example related to disks.

Bob Netherton's Weblog

incredible · August 6, 2009, 11:55pm

It's not a new machine. It's a production server, so we probably can't run VTS, as it would put too much stress on the server. Any better options?

Sun_Fire · August 7, 2009, 2:33am

@incredible

I think stress tests are the only way to predict hardware failures.

Other than that, you need to monitor your error messages and hope that the hardware itself (preferably with up-to-date firmware) reports predictive failures early.

jlliagre · August 7, 2009, 3:54am

It looks like both of you overlooked the second part of my previous reply. The tools you are looking for already exist and are included with Solaris.

Some more links:

Solaris Fault Manager (Solaris 10 What's New) - Sun Microsystems
Getting notified when hardware breaks
SCSI DISK FMA Project Part 1: SCSI Device Drivers as FMA Telemetry Detectors

sbk1972 · August 7, 2009, 4:50am

Commands: fmstat, fmadm
Logs: /var/fm/fmd

Solaris 10 now has a ton of background health monitoring, which reports to the above.
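
As a rough illustration of how the commands above could feed a scheduled check, here is a minimal Python sketch that wraps `fmadm faulty`. The helper names, the `/usr/sbin/fmadm` path, and the rule that an empty report means "no diagnosed faults" are assumptions made for this example, and the script assumes Python 3.7 or later is available on the host (it is not part of a stock Solaris 10 install).

```python
import subprocess


def faulty_report(fmadm_path="/usr/sbin/fmadm"):
    """Run `fmadm faulty` and return whatever it prints on standard output.

    The path is an assumption; adjust it to match the system.
    Usually requires root (or equivalent) privileges.
    """
    result = subprocess.run(
        [fmadm_path, "faulty"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if fmadm exits non-zero
    )
    return result.stdout


def has_faults(report):
    """Treat any non-blank report as 'at least one fault was diagnosed'.

    Assumption for this sketch: fmadm prints nothing when no resource
    is currently considered faulty.
    """
    return bool(report.strip())
```

A cron job could call `has_faults(faulty_report())` and send a mail when it returns True; the notification side is left out here because it depends on the site.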

SBK

Sun_Fire · August 7, 2009, 5:50am

Interesting... I'll be looking at these tools. Thanks, guys!

incredible · August 7, 2009, 9:25am

You still don't get my point. I want prevention rather than reactive action after things happen.

jlliagre · August 7, 2009, 10:59am

You are still missing mine. Unless you expect a crystal ball to predict what will happen in the future with currently healthy components, the only reasonable way to prevent their future faults is by monitoring events coming from them. This is what the Fault Manager (FMA) is designed to do.

Alternatively, if your goal is really to react to something that hasn't happened yet, you can proactively replace each disk after a period of use significantly shorter than its MTBF.
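
To put rough numbers on the MTBF idea, the sketch below assumes a constant failure rate (the exponential model), under which the probability of a disk failing within t hours is 1 - exp(-t / MTBF). That model and both function names are assumptions for illustration; real disks also wear out, so treat the results as optimistic for old drives.

```python
import math


def failure_probability(hours_in_service, mtbf_hours):
    """P(disk has failed by `hours_in_service`) under a constant failure rate."""
    return 1.0 - math.exp(-hours_in_service / mtbf_hours)


def replacement_interval(max_risk, mtbf_hours):
    """Hours of use after which the failure probability reaches `max_risk`.

    Inverse of failure_probability: t = -MTBF * ln(1 - max_risk).
    """
    if not 0.0 <= max_risk < 1.0:
        raise ValueError("max_risk must be in [0, 1)")
    return -mtbf_hours * math.log(1.0 - max_risk)
```

For example, `replacement_interval(0.05, mtbf_hours)` gives the service time at which a single disk would have a 5% chance of having failed under this model, given whatever MTBF figure the vendor quotes.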

If you just care about your data, use something like RAIDZ2 with hot spares. Your system will happily survive two disks failing at the same time and will automatically replace them with the spares.
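
As a back-of-the-envelope check on what "survives two disks" buys you, this sketch computes the chance that more than two disks in a group fail in the same window. It assumes disk failures are independent and equally likely, and it ignores hot spares and rebuild time; the function name and parameters are hypothetical.

```python
from math import comb


def prob_more_than_k_failures(n_disks, p_fail, k=2):
    """P(more than k of n_disks fail), each failing independently with p_fail.

    With k=2 this is the chance of exceeding the double-parity tolerance
    of a single group of n_disks disks.
    """
    p_at_most_k = sum(
        comb(n_disks, i) * p_fail**i * (1 - p_fail) ** (n_disks - i)
        for i in range(k + 1)
    )
    return 1.0 - p_at_most_k
```

Correlated failures (same batch, same enclosure, same power event) break the independence assumption, so the real risk can be higher than this estimate.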

Sun_Fire · August 7, 2009, 1:14pm

Yes, I agree; there's no magic way to predict each and every hardware failure.

If the data is that critical, then you should invest more in redundancy, HA, and RAS.

incredible · August 9, 2009, 12:47pm

Thanks for your valuable feedback.

Sun_Fire · August 10, 2009, 4:57am

One more question:

After finishing the installation, the customer asked me to do a "network stress test"...

Any ideas?