I have a script that monitors my drives (/dev/sda and /dev/sdb) using smartctl (from smartmontools). I'm by no means an expert in scripting; this was my attempt at emailing myself whenever one of the values in the smartctl output goes above a set threshold.
My question: I want to change the "Subject" line of the email it sends so that I can tell which 'test' is failing (Temperature_Celsius, Reallocated_Sector_Ct, etc.). Is there a way to name each check and then use a variable like $failedtest in the subject line?
I'm at a loss as to how to write this.
#!/bin/sh
export PATH=/ffp/bin:/ffp/sbin:$PATH
/ffp/bin/touch /ffp/tmp/smartmessage1
/ffp/bin/touch /ffp/tmp/smartmessage2
SMARTMESSAGE1=/ffp/tmp/smartmessage1
SMARTMESSAGE2=/ffp/tmp/smartmessage2
MAILMESSAGE=/ffp/tmp/mailmessage
FROMADDR=fromemail@gmail.com
SUBJECT="[ALERT!] SMART Monitoring as of [`date`]"
TO_EMAIL_ADDR=toemail@gmail.com
DEVA=/dev/sda
DEVB=/dev/sdb
LOG=/mnt/HD_a2/logs/smartctl_mail.log
echo [`date`] SMART Monitoring script started, variables set... >> $LOG
/ffp/sbin/smartctl -d marvell -a $DEVA > $SMARTMESSAGE1
echo [`date`] /dev/sda scanned. >> $LOG
/ffp/sbin/smartctl -d marvell -a $DEVB > $SMARTMESSAGE2
echo [`date`] /dev/sdb scanned. >> $LOG
cat $SMARTMESSAGE1 > $MAILMESSAGE
cat $SMARTMESSAGE2 >> $MAILMESSAGE
if [ `cat $SMARTMESSAGE1 | grep Raw_Read_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Reallocated_Sector_Ct | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Seek_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Spin_Retry_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Calibration_Retry_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Temperature_Celsius | /ffp/bin/awk '{print $10}'` -gt 40 \
-o `cat $SMARTMESSAGE1 | grep Reallocated_Event_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Current_Pending_Sector | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Offline_Uncorrectable | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep UDMA_CRC_Error_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE1 | grep Multi_Zone_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Raw_Read_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Reallocated_Sector_Ct | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Seek_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Spin_Retry_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Calibration_Retry_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Temperature_Celsius | /ffp/bin/awk '{print $10}'` -gt 40 \
-o `cat $SMARTMESSAGE2 | grep Reallocated_Event_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Current_Pending_Sector | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Offline_Uncorrectable | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep UDMA_CRC_Error_Count | /ffp/bin/awk '{print $10}'` -gt 0 \
-o `cat $SMARTMESSAGE2 | grep Multi_Zone_Error_Rate | /ffp/bin/awk '{print $10}'` -gt 0 ] ;
then
    echo [`date`] Problems found...sending mail. >> $LOG
    /ffp/bin/mailx -s "$SUBJECT" \
        -S smtp-use-starttls \
        -S ssl-verify=ignore \
        -S smtp-auth=login \
        -S smtp=smtp://smtp.gmail.com:587 \
        -S from="$FROMADDR" \
        -S smtp-auth-user=useremail@gmail.com \
        -S smtp-auth-password=pw \
        "$TO_EMAIL_ADDR" < "$MAILMESSAGE"
    echo [`date`] Message Sent! >> $LOG
else
    echo [`date`] No problems found...not sending mail. >> $LOG
fi
rm $SMARTMESSAGE1 $SMARTMESSAGE2 $MAILMESSAGE
exit 0
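One possible direction, as a minimal sketch rather than a tested replacement for the script above: run each check through a small function that records the name of every attribute over its limit, then build the subject from that list. The names check_attr and FAILEDTEST are made up for illustration, the sample rows are invented values in the smartctl attribute-table layout, and mktemp is assumed to be available (a fixed path under /ffp/tmp would work as well).

```shell
#!/bin/sh
# Sketch only: check_attr and FAILEDTEST are hypothetical names,
# not part of smartmontools.

FAILEDTEST=""

# check_attr FILE DEVICE ATTRIBUTE LIMIT
# Appends " DEVICE:ATTRIBUTE=VALUE" to FAILEDTEST when the attribute's
# RAW_VALUE (column 10) is greater than LIMIT.
check_attr() {
    # Match the attribute name exactly in column 2 and print column 10.
    value=$(awk -v name="$3" '$2 == name { print $10; exit }' "$1")
    # Skip attributes that are missing or not plain integers,
    # so the numeric comparison below never sees an empty string.
    case "$value" in
        ''|*[!0-9]*) return 0 ;;
    esac
    if [ "$value" -gt "$4" ]; then
        FAILEDTEST="$FAILEDTEST $2:$3=$value"
    fi
}

# Demo input standing in for a saved smartctl report (invented values).
SAMPLE=$(mktemp)
cat > "$SAMPLE" <<'EOF'
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 3
194 Temperature_Celsius 0x0022 109 100 000 Old_age Always - 45
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
EOF

# Each entry is ATTRIBUTE:LIMIT; split it with POSIX parameter expansion.
for spec in Reallocated_Sector_Ct:0 Temperature_Celsius:40 UDMA_CRC_Error_Count:0; do
    check_attr "$SAMPLE" sda "${spec%%:*}" "${spec##*:}"
done
rm -f "$SAMPLE"

# Only build an alert subject when at least one check tripped.
if [ -n "$FAILEDTEST" ]; then
    SUBJECT="[ALERT!] SMART failed:$FAILEDTEST"
    echo "$SUBJECT"
fi
```

In the real script, each call would take $SMARTMESSAGE1 or $SMARTMESSAGE2 as the file argument, and $SUBJECT would be passed to mailx -s as before.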
In case it matters, here is the output of `smartctl -d marvell -a /dev/sda`:
smartctl 5.39.1 2010-01-28 r3054 [arm-unknown-linux-uclibc] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA family
Device Model: WDC WD6400AAKS-00H2B0
Serial Number: WD-WMASY7478202
Firmware Version: 07.04C07
User Capacity: 640,135,028,736 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Fri Jan 21 09:55:36 2011 GMT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (12360) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 145) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   166   165   021    Pre-fail  Always       -       4675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       848
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2082
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       27
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       26
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       848
194 Temperature_Celsius     0x0022   109   100   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 1322 -
# 2 Short offline Completed without error 00% 1321 -
# 3 Short offline Completed without error 00% 1319 -
# 4 Short offline Aborted by host 10% 1318 -
# 5 Short offline Aborted by host 10% 1318 -
# 6 Conveyance offline Completed without error 00% 328 -
# 7 Short offline Completed without error 00% 328 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
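For reference, the script reads the attribute name from column 2 and RAW_VALUE from column 10 of the attribute table in this output. Below is a short sketch for listing just those two columns, which may be handy when deciding on thresholds. list_attrs is a hypothetical helper name, and the input is a trimmed copy of the output above.

```shell
#!/bin/sh
# Sketch only: list_attrs is a hypothetical helper name.
# Prints "NAME RAW_VALUE" for each attribute row read from stdin.
list_attrs() {
    # Attribute rows start with a numeric ID, carry a 0x... flag in
    # column 3, and have at least 10 fields; everything else is skipped.
    awk '$1 ~ /^[0-9]+$/ && $3 ~ /^0x/ && NF >= 10 { print $2, $10 }'
}

list_attrs <<'EOF'
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
194 Temperature_Celsius 0x0022 109 100 000 Old_age Always - 38
SMART Error Log Version: 1
1 0 0 Not_testing
EOF
```

With that input it prints "Raw_Read_Error_Rate 0" and "Temperature_Celsius 38", one per line. Against the real report it could be fed with `list_attrs < $SMARTMESSAGE1`.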
Any suggestions for cleaning this up or shortening the commands are welcome too; this was just the best I could come up with.
Thanks.