Replacing a Disk
This procedure describes how to replace a failed disk in a StoredSafe appliance. A warning will appear in the Web GUI when a disk has failed.
To replace the disk, follow these steps:
Log on to the StoredSafe server console as storedsafe and spawn a root shell.
Use the megacli or storcli tool located in /opt/storedsafe/bin to identify the failed disk and mark it as offline.
Physically replace the failed disk with a new one of the same or larger capacity.
Use the megacli or storcli tool to verify that the new disk is recognized by the system.
Use the megacli or storcli tool to mark the new disk as online.
Wait for the RAID array to rebuild. This may take several hours depending on the size of the disks and the load on the system.
Verify that the RAID array is healthy after the rebuild process is complete.
Exit the root shell and log out from the StoredSafe server console.
Verify in the Web GUI that the disk replacement was successful and that there are no warnings.
Here is an example of how to perform these steps using the megacli tool:
┌────────────────────────────────────────────────────────────────────────────┐
│ System Settings on node1 (Version X.X.X build XXXX) │
└────────────────────────────────────────────────────────────────────────────┘
┌─┬──────────────────────────────────────────────────────────────────────────┐
│1│Network Settings │
│2│GUI Settings │
│3│Backup Management │
│4│Storage Management │
│5│Service Management │
│6│Firmware Management │
│7│Database Maintenance │
│8│Appliance Management │
└─┴──────────────────────────────────────────────────────────────────────────┘
Move the cursor or enter a it's corresponding number (Q to Quit)
Main> Unsupported, proceed at own risk. Warranty may be voided.
Shell timeout set to 3600 seconds.
root@safe:~# /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL
Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name :
RAID Level : Primary-5, Secondary-0, RAID Level Qualifier-3
Size : 2.727 TB
Sector Size : 512
Is VD emulated : No
Parity Size : 931.0 GB
State : Degraded
Strip Size : 256 KB
Number Of Drives : 4
Span Depth : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy : Disk's Default
Encryption Type : None
Bad Blocks Exist: No
PI type: No PI
Is VD Cached: No
In the example above, the array reports ‘State : Degraded’, which means that one or more physical disks have failed or are missing. To find the affected disk, list the physical drives. The full output is lengthy; the abbreviated example below shows four physical disks with only the lines needed to identify them:
root@safe:~# /opt/storedsafe/bin/megacli -pdlist -aALL
Enclosure Device ID: 252
Slot Number: 0
...
Firmware state: Online, Spun Up
Enclosure Device ID: 252
Slot Number: 1
...
Firmware state: Online, Spun Up
Enclosure Device ID: 252
Slot Number: 2
...
Firmware state: Online, Spun Up
Enclosure Device ID: 252
Slot Number: 3
...
Firmware state: Offline
In this example, the failed disk appears as “Enclosure Device ID: 252” and “Slot Number: 3.” For the MegaCli syntax, this drive is therefore referenced as [252:3] in the commands below.
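On a controller with many drives, scanning the full listing by eye is error-prone. The following is a minimal sketch, assuming the line labels shown in the listing above (Enclosure Device ID, Slot Number, Firmware state). The function name list_problem_drives is hypothetical and is not part of StoredSafe or MegaCli.

```shell
# Hypothetical helper: reads "megacli -pdlist -aALL" output on stdin and
# prints one line per drive whose firmware state does not start with
# "Online", in the form "[E:S] state".
list_problem_drives() {
    awk -F': *' '
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^Firmware state:/ {
            if ($2 !~ /^Online/) printf "[%s:%s] %s\n", enc, slot, $2
        }
    '
}

# Usage on the appliance (tool path as used in this procedure):
#   /opt/storedsafe/bin/megacli -pdlist -aALL | list_problem_drives
```

The output uses the same [E:S] notation as the commands in this procedure, so the result can be compared directly against the drive you intend to remove.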
Once we’ve identified the enclosure device ID (EID) and slot number of the failed drive, we can prepare it for removal.
First, manually set the original disk to offline (if the controller hasn’t already done so due to the error):
root@safe:~# /opt/storedsafe/bin/megacli -PDOffline -PhysDrv [252:3] -a0
Adapter: 0: EnclId-252 SlotId-3 state changed to OffLine.
Exit Code: 0x00
Next, mark the failed disk as missing:
root@safe:~# /opt/storedsafe/bin/megacli -PDMarkMissing -PhysDrv [252:3] -a0
EnclId-252 SlotId-3 is marked Missing.
Exit Code: 0x00
Now, mark the failed disk as prepared for removal:
root@safe:~# /opt/storedsafe/bin/megacli -PDPrpRmv -PhysDrv [252:3] -a0
EnclId-252 SlotId-3 is marked for removal.
Exit Code: 0x00
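Because these three commands act on whichever drive is named, it can help to print them for review before running anything. The sketch below only prints the commands shown above and executes nothing. The function name print_removal_commands and the MEGACLI variable are hypothetical, introduced here for illustration.

```shell
# Hypothetical dry-run helper: prints, without executing, the three MegaCli
# commands used above for a drive given as enclosure, slot and adapter.
MEGACLI=/opt/storedsafe/bin/megacli   # tool path as used in this procedure

print_removal_commands() {
    enc=$1; slot=$2; adapter=${3:-0}   # adapter defaults to 0
    for action in -PDOffline -PDMarkMissing -PDPrpRmv; do
        printf '%s %s -PhysDrv [%s:%s] -a%s\n' \
            "$MEGACLI" "$action" "$enc" "$slot" "$adapter"
    done
}

# Example: print_removal_commands 252 3 0
```

Check the printed enclosure and slot against the drive reported as failed, then run the commands one at a time and confirm that each returns Exit Code: 0x00 before continuing.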
It might be helpful to use the locate command, which asks the controller to blink the drive’s locator LED, to find the disk physically:
root@safe:~# /opt/storedsafe/bin/megacli -PDLocate -start -PhysDrv [252:3] -a0
Adapter: 0: Device at EnclId-252 SlotId-3 -- PD Locate Start Command was successfully sent to Firmware
Exit Code: 0x00
Next, physically replace the failed disk with a new one of the same or larger capacity.
After replacing the disk, bring the new disk online. This should also start the rebuild automatically:
root@safe:~# /opt/storedsafe/bin/megacli -PDOnline -PhysDrv [252:3] -a0
...
Command completed successfully.
Finally, confirm that the rebuild has started:
root@safe:~# /opt/storedsafe/bin/megacli -PDList -aALL
...
Rebuild Status: Rebuild in Progress
...
Optional: You can follow the rebuild progress for the replaced drive. Depending on the size of the array, the rebuild may take a significant amount of time. The RAID array remains usable during the rebuild, but expect reduced performance until it is complete.
root@safe:~# /opt/storedsafe/bin/megacli -PDRbld -ShowProg -PhysDrv [252:3] -a0
...
Rebuild Progress: 45%
...
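If you want to log or script the progress value, the percentage can be pulled out of the command output. This is a rough sketch: the exact wording of the progress line may differ between MegaCli versions, so it matches only the first "NN%" token. The function name rebuild_percent is hypothetical.

```shell
# Hypothetical helper: reads rebuild-progress output on stdin and prints the
# first percentage found as a bare integer (nothing if no "NN%" is present).
rebuild_percent() {
    awk 'match($0, /[0-9]+%/) {
        # RLENGTH includes the "%" sign, so drop the last character.
        print substr($0, RSTART, RLENGTH - 1)
        exit
    }'
}

# Usage on the appliance (command as shown in this procedure):
#   /opt/storedsafe/bin/megacli -PDRbld -ShowProg -PhysDrv [252:3] -a0 | rebuild_percent
```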
Once the rebuild is complete, the RAID array should report as ‘State : Optimal’, indicating that the disk replacement was successful.
root@safe:~# /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL
...
State : Optimal
...
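The same check can be turned into an exit status, for example for use in a monitoring script. The sketch below assumes the "State : ..." line format shown in the virtual drive listings in this procedure. The function name all_vds_optimal is hypothetical.

```shell
# Hypothetical helper: reads "megacli -ldinfo -lALL -aALL" output on stdin.
# Succeeds (exit status 0) only if at least one "State" line is present and
# every one of them reports Optimal.
all_vds_optimal() {
    awk -F': *' '
        /^State *:/ { seen++; if ($2 !~ /^Optimal/) bad++ }
        END { if (seen > 0 && bad == 0) exit 0; exit 1 }
    '
}

# Usage on the appliance (tool path as used in this procedure):
#   /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL | all_vds_optimal \
#       && echo "RAID healthy" || echo "RAID needs attention"
```

Treating missing State lines as a failure is a deliberate choice: if the tool prints nothing useful, the array should not be reported as healthy.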
Note
If the rebuild doesn’t start automatically after bringing the new disk online, you may need to initiate it manually:
root@safe:~# /opt/storedsafe/bin/megacli -PDRbld -Start -PhysDrv [252:3] -a0
Rebuild for EnclId-252 SlotId-3 started.
Exit Code: 0x00
When done, leave the root shell to return to the storedsafe user prompt.
root@safe:~# exit
exit