Hi,
One of our customers has 8 physical HW devices (Oracle X4-2).
We had a failed HD in the RAID1 array on one of the servers (not a HW failure, just mdadm taking the disk out of the array because it was corrupt).
They wish to know if there is a way of monitoring and/or alerting on the health of the RAID array in future, to avoid this situation.
The only reason this was noticed is that I happened to see it on the console while connected during some physical work in the data centre. I then confirmed it as below:
> cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1]
      1048512 blocks super 1.0 [2/2] [UU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

md1 : active raid1 sdb2[1](F) sda2[0]
      291787584 blocks super 1.1 [2/1] [U_]
      bitmap: 2/3 pages [8KB], 65536KB chunk
and:
> mdadm --detail /dev/md1
/dev/md1:
Version : 1.1
Creation Time : Fri Dec 19 19:22:52 2014
Raid Level : raid1
Array Size : 291787584 (278.27 GiB 298.79 GB)
Used Dev Size : 291787584 (278.27 GiB 298.79 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Mon Jan 28 14:46:31 2019
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Name : localhost.localdomain:1
UUID : 9b7d309a:24840105:66650b93:cbb35cc9
Events : 43398733
Number   Major   Minor   RaidDevice   State
   0       8       2         0        active sync   /dev/sda2
   2       0       0         2        removed
   1       8      18         -        faulty        /dev/sdb2
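
The degraded state above shows up in /proc/mdstat as an underscore in the member status field ([U_] instead of [UU]), so a periodic check only needs to look for that. Below is a minimal sketch, assuming the mdstat layout shown in this post and arrays named mdN. The function name degraded_arrays is made up for illustration and is not part of mdadm.

```shell
#!/bin/sh
# Sketch only: list md arrays whose status field (e.g. [U_]) shows a missing member.
# Assumes the /proc/mdstat layout shown above and arrays named md<number>.

degraded_arrays() {
    # $1 = path to an mdstat-formatted file (normally /proc/mdstat)
    awk '
        /^md[0-9]+ :/ { name = $1 }            # remember the current array name
        /blocks/ && match($0, /\[[U_]+\]/) {   # status field such as [UU] or [U_]
            if (index(substr($0, RSTART, RLENGTH), "_"))
                print name                     # an underscore marks a missing member
        }
    ' "$1"
}

# When run as a script, take the mdstat path as the first argument,
# e.g.  sh check_md.sh /proc/mdstat
if [ "$#" -gt 0 ]; then
    degraded_arrays "$1"
fi
```

Run against the output above, this would print md1 and nothing for md0. The result could be fed to cron mail or, assuming Net-SNMP is the agent in use, exposed through an snmpd.conf extend entry so the existing SNMP polling picks it up.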
They have SNMP enabled on the servers, but as far as I can see none of the installed MIBs expose any counters for RAID health.
I have asked CA Support, but they can offer no solutions.
Looking at the documentation, mdadm itself COULD be used to monitor the arrays:
e.g. mdadm --monitor --daemonise --mail=root@localhost --delay=1800 /dev/md0 /dev/md1
and configure the servers to relay mail to a local email server,
or even use --program to call wget against an API on the GW for alerting, an SNMP trap, etc.
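
A minimal sketch of what such a --program handler might look like. mdadm runs the program with the event name, the md device, and (for some events) the affected component device as arguments. Everything else here is an assumption for illustration: the function names, the syslog tag, and ALERT_URL, which stands in for whatever endpoint would actually receive the alert.

```shell
#!/bin/sh
# Sketch only: handler for "mdadm --monitor --program=/path/to/this/script".
# mdadm invokes it as: SCRIPT EVENT MD_DEVICE [COMPONENT_DEVICE]

# Build a one-line alert message from the arguments mdadm passes.
format_alert() {
    event=$1
    array=$2
    component=${3:-none}
    printf 'mdadm event=%s array=%s component=%s host=%s\n' \
        "$event" "$array" "$component" "$(uname -n)"
}

# Return 0 for events that mean redundancy has been lost.
is_critical() {
    case $1 in
        Fail|FailSpare|DegradedArray|DeviceDisappeared) return 0 ;;
        *) return 1 ;;
    esac
}

# Log every event to syslog; forward only the critical ones.
handle_event() {
    msg=$(format_alert "$@")
    logger -t mdadm-alert "$msg"
    if is_critical "$1" && [ -n "${ALERT_URL:-}" ]; then
        # ALERT_URL is hypothetical: set it to the real alerting endpoint.
        wget -q -O /dev/null --post-data="$msg" "$ALERT_URL"
    fi
}

# Only act when mdadm (or a manual test) supplies arguments.
if [ "$#" -gt 0 ]; then
    handle_event "$@"
fi
```

It can be exercised by hand before wiring it into mdadm, e.g. by running the script with the arguments Fail /dev/md1 /dev/sdb2 and checking syslog.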
Perhaps HW appliances are simply rare enough out there that CA Support can offer no support around this whatsoever.
I'd be interested to hear what solutions, if any, others have used before I go further down the route above of using mdadm to monitor.
stu