Replacing a Disk

This procedure describes how to replace a failed disk in a StoredSafe appliance. When a disk malfunctions or fails, a red banner appears in the Web GUI with a link to the status page. On the physical appliance, the affected disk is also marked by a red indicator light. These indicators may not always appear at the same moment, but both refer to the same underlying disk failure.

To replace the disk, follow these steps:

1. Log on to the StoredSafe server console as storedsafe and spawn a root shell.
2. Use the megacli or storcli tool located in /opt/storedsafe/bin to identify the failed disk, then mark it as offline, missing, and ready for removal.
3. Physically replace the failed disk with a new one of the same or larger capacity.
4. Use the megacli or storcli tool to verify that the new disk is recognized by the system.
5. Use the megacli or storcli tool to mark the new disk as online.
6. Wait for the RAID array to rebuild. This may take several hours depending on the size of the disks and the load on the system. The RAID array remains usable during the rebuild, but expect reduced performance until the process is complete.
7. Verify that the RAID array is healthy after the rebuild is complete.
8. Exit the root shell and log out from the StoredSafe server console.
9. Verify in the Web GUI that the disk replacement was successful and that there are no warnings.
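
The command sequence behind these steps can be sketched as a small dry-run helper. It only prints the MegaCli commands used in the example in this section and does not execute anything. The function name print_replace_plan and its argument order are illustrative assumptions, and 252 and 3 are the enclosure ID and slot number from the example, not values to copy blindly.

```shell
# Dry run: print (do not execute) the MegaCli commands for replacing one drive.
# Arguments: enclosure ID, slot number, adapter number (defaults to 0).
# The function name and argument order are hypothetical, chosen for illustration.
MEGACLI=/opt/storedsafe/bin/megacli

print_replace_plan() {
    drive="[$1:$2]"
    adapter="-a${3:-0}"
    echo "$MEGACLI -PDOffline -PhysDrv $drive $adapter"
    echo "$MEGACLI -PDMarkMissing -PhysDrv $drive $adapter"
    echo "$MEGACLI -PDPrpRmv -PhysDrv $drive $adapter"
    echo "# physically swap the disk, then:"
    echo "$MEGACLI -PDOnline -PhysDrv $drive $adapter"
    echo "$MEGACLI -PDRbld -ShowProg -PhysDrv $drive $adapter"
}

print_replace_plan 252 3
```

Printing the plan first makes it easy to double-check the enclosure and slot before any command that changes the array is typed at the root prompt.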

Here is an example of how to perform these steps using the MegaCli tool:

┌────────────────────────────────────────────────────────────────────────────┐
│              System Settings on node1 (Version X.X.X build XXXX)           │
└────────────────────────────────────────────────────────────────────────────┘

┌─┬──────────────────────────────────────────────────────────────────────────┐
│1│Network Settings                                                          │
│2│GUI Settings                                                              │
│3│Backup Management                                                         │
│4│Storage Management                                                        │
│5│Service Management                                                        │
│6│Firmware Management                                                       │
│7│Database Maintenance                                                      │
│8│Appliance Management                                                      │
└─┴──────────────────────────────────────────────────────────────────────────┘

Move the cursor or enter a it's corresponding number (Q to Quit)

Main> Unsupported, proceed at own risk. Warranty may be voided.
Shell timeout set to 3600 seconds.
root@safe:~# /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL

Adapter 0 -- Virtual Drive Information:
Virtual Drive: 0 (Target Id: 0)
Name                :
RAID Level          : Primary-5, Secondary-0, RAID Level Qualifier-3
Size                : 2.727 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 931.0 GB
State               : Degraded
Strip Size          : 256 KB
Number Of Drives    : 4
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
Bad Blocks Exist: No
PI type: No PI
Is VD Cached: No
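
Because the -ldinfo output is long, it can help to reduce it to the one field that matters here. The following is a minimal sketch that reads the output on standard input and prints the state of each virtual drive. The function name vd_states is hypothetical, and the sketch assumes the "Virtual Drive:" and "State" line layout shown above.

```shell
# Read `megacli -ldinfo -lALL -aALL` output on stdin and print one line per
# virtual drive, in the form "VD <number>: <state>".
# Assumes the line layout shown in the example output; vd_states is a
# hypothetical helper name.
vd_states() {
    awk '
        /^Virtual Drive:/ { vd = $3 }
        /^State[ \t]*:/   { sub(/^State[ \t]*:[ \t]*/, ""); print "VD " vd ": " $0 }
    '
}

# Usage on the appliance (not run here):
#   /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL | vd_states
```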

In the example above, the array is reported as ‘State : Degraded’, which indicates that one or more physical disks have failed or are missing. The next command, -pdlist, lists every physical disk and produces lengthy output; the abbreviated example below shows four physical disks and their key details:

root@safe:~# /opt/storedsafe/bin/megacli -pdlist -aALL

Enclosure Device ID: 252
Slot Number: 0
...
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 1
...
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 2
...
Firmware state: Online, Spun Up

Enclosure Device ID: 252
Slot Number: 3
...
Firmware state: Offline

In this example, the failed disk appears as “Enclosure Device ID: 252” and “Slot Number: 3.” For the MegaCli syntax, this drive is therefore referenced as [252:3] in the commands below.
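
Picking the failed drive out of a long -pdlist listing by eye is error-prone, so a filter can help. The sketch below reads the listing on standard input and prints every drive whose firmware state does not contain "Online", already formatted as [enclosure:slot]. The function name failed_drives is hypothetical. Note that this simple test also reports drives in other non-online states, so treat the result as a list of candidates to inspect rather than a verdict.

```shell
# Read `megacli -pdlist -aALL` output on stdin and print "[enclosure:slot]"
# for every drive whose "Firmware state" line does not contain "Online".
# failed_drives is a hypothetical helper name; the field positions assume the
# line layout shown in the example output.
failed_drives() {
    awk '
        /^Enclosure Device ID:/ { enc = $4 }
        /^Slot Number:/         { slot = $3 }
        /^Firmware state:/ && $0 !~ /Online/ { print "[" enc ":" slot "]" }
    '
}

# Usage on the appliance (not run here):
#   /opt/storedsafe/bin/megacli -pdlist -aALL | failed_drives
```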

Once we’ve identified the EID and slot number of the failed drive, we can proceed to remove it.

First, manually set the original disk to offline (if the controller hasn’t already done so due to the error):

root@safe:~# /opt/storedsafe/bin/megacli -PDOffline -PhysDrv [252:3] -a0
Adapter: 0: EnclId-252 SlotId-3 state changed to OffLine.
Exit Code: 0x00

Next, mark the failed disk as missing:

root@safe:~# /opt/storedsafe/bin/megacli -PDMarkMissing -PhysDrv [252:3] -a0
EnclId-252 SlotId-3 is marked Missing.
Exit Code: 0x00

Now, mark the failed disk as prepared for removal:

root@safe:~# /opt/storedsafe/bin/megacli -PDPrpRmv -PhysDrv [252:3] -a0
EnclId-252 SlotId-3 is marked for removal.
Exit Code: 0x00
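
Each of the commands above ends with "Exit Code: 0x00" when it succeeds. If the steps are scripted, that line can be checked before moving on. The following is a minimal sketch under that assumption; the function name run_step and the MEGACLI variable are illustrative and not part of StoredSafe or MegaCli.

```shell
# Path to the tool; can be overridden through the environment.
MEGACLI=${MEGACLI:-/opt/storedsafe/bin/megacli}

# Run one MegaCli command, show its output, and return success only if the
# output contains a line starting with "Exit Code: 0x00".
# run_step is a hypothetical helper name.
run_step() {
    step_output=$("$MEGACLI" "$@")
    printf '%s\n' "$step_output"
    printf '%s\n' "$step_output" | grep -q '^Exit Code: 0x00'
}

# Usage on the appliance (not run here); the second command is only attempted
# if the first one reported success:
#   run_step -PDMarkMissing -PhysDrv "[252:3]" -a0 &&
#   run_step -PDPrpRmv      -PhysDrv "[252:3]" -a0
```

The -PDOffline step is left out of the usage example on purpose: the controller may already have taken the drive offline, and how the command reports in that case is not covered here.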

To find the disk in the chassis, you can use the -PDLocate command, which asks the controller to flash the locate indicator for that drive slot:

root@safe:~# /opt/storedsafe/bin/megacli -PDLocate -start -PhysDrv [252:3] -a0
Adapter: 0: Device at EnclId-252 SlotId-3 -- PD Locate Start Command was successfully sent to Firmware
Exit Code: 0x00
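
The locate indicator can be switched off again with the -stop form of the same command, which the example above does not show. The wrapper below is a small sketch that accepts either action. The function name locate_drive and the MEGACLI variable are assumptions made for illustration, and adapter 0 is taken from the example.

```shell
# Path to the tool; can be overridden through the environment.
MEGACLI=${MEGACLI:-/opt/storedsafe/bin/megacli}

# Turn the locate indicator for one drive on or off.
# Arguments: start|stop, enclosure ID, slot number.
# locate_drive is a hypothetical helper name; adapter 0 is assumed.
locate_drive() {
    case "$1" in
        start|stop)
            "$MEGACLI" -PDLocate "-$1" -PhysDrv "[$2:$3]" -a0
            ;;
        *)
            echo "usage: locate_drive start|stop ENCLOSURE SLOT" >&2
            return 2
            ;;
    esac
}

# Usage on the appliance (not run here):
#   locate_drive start 252 3    # before pulling the disk
#   locate_drive stop 252 3     # after the new disk is in place
```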

Next, physically replace the failed disk with a new one of the same or larger capacity.

After replacing the disk, bring the new disk online. This should also start the rebuild automatically:

root@safe:~# /opt/storedsafe/bin/megacli -PDOnline -PhysDrv [252:3] -a0
...
Command completed successfully.

Finally, confirm that the rebuild has started:

root@safe:~# /opt/storedsafe/bin/megacli -PDList -aALL
...
Rebuild Status: Rebuild in Progress
...

Optionally, check the rebuild progress of the new disk. Depending on the size of the array, the rebuild may take a significant amount of time. The RAID array remains usable during the rebuild, but expect reduced performance until it completes.

root@safe:~# /opt/storedsafe/bin/megacli -PDRbld -ShowProg -PhysDrv [252:3] -a0
...
Rebuild Progress: 45%
...
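
Rather than rerunning the progress command by hand, the percentage can be polled. The sketch below extracts the first percentage from the command output and repeats the check at a fixed interval until no percentage is reported. The function names, the MEGACLI and POLL_SECONDS variables, and the assumption that a missing percentage means the rebuild is no longer running are all illustrative; confirm the final array state with -ldinfo in any case.

```shell
# Path to the tool and polling interval; both can be overridden.
MEGACLI=${MEGACLI:-/opt/storedsafe/bin/megacli}
POLL_SECONDS=${POLL_SECONDS:-300}

# Read progress output on stdin and print the first percentage found,
# without the "%" sign. Prints nothing if no percentage is present.
rebuild_percent() {
    sed -n 's/.*[^0-9]\([0-9][0-9]*\)%.*/\1/p' | head -n 1
}

# Poll the rebuild progress of one drive (enclosure ID, slot number) until the
# tool no longer reports a percentage. Hypothetical helper; adapter 0 assumed.
wait_for_rebuild() {
    while :; do
        pct=$("$MEGACLI" -PDRbld -ShowProg -PhysDrv "[$1:$2]" -a0 | rebuild_percent)
        if [ -z "$pct" ]; then
            break
        fi
        echo "rebuild at ${pct}%"
        sleep "$POLL_SECONDS"
    done
}

# Usage on the appliance (not run here):
#   wait_for_rebuild 252 3
```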

Once the rebuild is complete, the RAID array should report as ‘State : Optimal’, indicating that the disk replacement was successful.

root@safe:~# /opt/storedsafe/bin/megacli -ldinfo -lALL -aALL
...
State               : Optimal
...

Note

If the rebuild doesn’t start automatically after bringing the new disk online, you may need to initiate it manually:

root@safe:~# /opt/storedsafe/bin/megacli -PDRbld -Start -PhysDrv [252:3] -a0
Rebuild for EnclId-252 SlotId-3 started.
Exit Code: 0x00

When done, leave the root shell to return to the storedsafe user prompt.

root@safe:~# exit
exit