Fixing a degraded virtual disk in a storage pool with a drive configured as a hot spare
July 2, 2015
Cameron Fuller

I love the concept of Storage Spaces, especially the ability to combine SSDs with hard drives for high-speed disk performance. I did run into some issues with one of my storage spaces, so in this blog post I will discuss the issue I ran into and the steps I took to resolve it. Major topics for this blog post are:

  • An introduction to storage spaces, virtual disks and physical disks
  • Identifying issues on a virtual disk
  • Identifying physical disks with errors
  • Finding and removing the physical disk which is going bad
  • Addressing challenges with the Hot Spare drive
  • Fixing the virtual disk and storage pool
  • Addressing a side issue with virtuals on the drive
  • Summary & reference links

Introducing storage spaces, virtual disks and physical disks

In my lab I have a storage space (Storage Pool Z) which I built to provide highly available storage shown below:

This storage space has a single virtual disk (Software Library and Virtual Storage) which has a capacity of just under 2 TB.

The virtual disk is configured as a mirror which spans four physical drives shown below:

To help mitigate the loss of a drive, I also added a Hot Spare to this virtual disk (note the usage column for PhysicalDisk0, which shows its usage as “Hot Spare”).
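The same layout can also be listed from PowerShell. This is a minimal sketch, assuming the pool name from this post; run it from an elevated prompt and substitute your own pool name.

```powershell
# 'Storage Pool Z' is the pool name used in this post; replace it with yours.
$pool = Get-StoragePool -FriendlyName 'Storage Pool Z'

# Virtual disks carved out of the pool, with their resiliency type (e.g. Mirror).
$pool | Get-VirtualDisk |
    Select-Object FriendlyName, ResiliencySettingName, Size, HealthStatus

# Physical disks in the pool; the Usage column shows which one is the hot spare.
$pool | Get-PhysicalDisk |
    Select-Object FriendlyName, MediaType, Usage, Size, HealthStatus
```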

Issues on the virtual disk

I started seeing some strange issues with my lab environment, specifically an inability to launch the virtual machines stored on this virtual disk (the screenshot below shows the inability to start one of the virtuals).

The files specified did exist but would not mount properly.

Investigating the storage pool showed the virtual disk as degraded (see the yellow exclamation point on the left side), but on the right all of the drives were showing as OK.

For transparency, I tried a chkdsk on the Z drive but that did not provide a benefit. I also tried right-clicking on the virtual disk and doing a rebuild but that also provided no benefit.
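The health information shown in the UI is also available from PowerShell, which makes it easy to compare the virtual disk with its member drives side by side. A minimal sketch, assuming an elevated prompt on the server that hosts the pool:

```powershell
# Health of every virtual disk; a degraded disk reports a non-healthy status here.
Get-VirtualDisk |
    Select-Object FriendlyName, OperationalStatus, HealthStatus

# Health of every physical disk. As in this post, these can all report OK
# even while the virtual disk is degraded, so this check alone is not conclusive.
Get-PhysicalDisk |
    Select-Object FriendlyName, Usage, OperationalStatus, HealthStatus
```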

Identifying physical disks with errors

Using PowerShell (Get-PhysicalDisk -FriendlyName <DiskName> | Get-StorageReliabilityCounter) on each of the drives in the virtual disk, I was able to identify one drive which appeared to be having issues. In this case it was “PhysicalDisk4”, which was having read errors that were being corrected.

The other drives in this virtual disk did not have any errors associated with them as shown below.
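Rather than checking each drive one at a time, the counters can be collected for every disk in a single pass. This is a sketch, not the exact commands I ran; it assumes an elevated prompt, and some drives or controllers do not report every counter, so blank values are possible.

```powershell
# Collect the error counters for every physical disk into one table.
Get-PhysicalDisk | ForEach-Object {
    $counters = $_ | Get-StorageReliabilityCounter
    [pscustomobject]@{
        Disk                  = $_.FriendlyName
        SerialNumber          = $_.SerialNumber
        ReadErrorsTotal       = $counters.ReadErrorsTotal
        ReadErrorsCorrected   = $counters.ReadErrorsCorrected
        ReadErrorsUncorrected = $counters.ReadErrorsUncorrected
        WriteErrorsTotal      = $counters.WriteErrorsTotal
    }
} | Format-Table -AutoSize
```

A drive whose read error counts are non-zero while the others sit at zero is the one to look at more closely.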

At this point I was working from the assumption that a single drive in this virtual disk was going bad and causing the issues. As is best practice, the next step was to back up all of the information on the drive. Seriously, don’t skip this, as a failure could result in the loss of all of the data on the drive.

Based on what I found online, my plan was to back up the key information, remove the drive which was showing the issue and then have the hot spare take over for the missing drive.

Attempts to remove the drive through the UI were unsuccessful as shown below.

I also attempted to detach the virtual disk and then remove the bad drive through the UI but that also was not successful.

The next step was to retire the disk which was going bad with PowerShell (Set-PhysicalDisk -FriendlyName <DriveName> -Usage Retired). The results are shown below, where PhysicalDisk4 is now listed in the usage column as “Retired”.
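Written out in full, the retire step looks roughly like the following. The disk name is the one from my environment; substitute the friendly name of your failing drive.

```powershell
# Mark the failing disk as Retired so that no new data is allocated to it.
# 'PhysicalDisk4' is the failing drive in this post.
Set-PhysicalDisk -FriendlyName 'PhysicalDisk4' -Usage Retired

# Confirm the change: the Usage column should now show the disk as retired.
Get-PhysicalDisk | Select-Object FriendlyName, Usage, HealthStatus
```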

Finding and removing the physical disk which is going bad

Since I was unable to remove the disk from the UI, the next step was to physically remove the drive from the system. To identify the correct drive to pull, the key is to find the drive’s serial number (the model is highlighted in the screenshot below; the serial number is directly under that). For details on this see: https://www.catapultsystems.com/cfuller/archive/2013/12/23/debugging-which-drive-is-dying-in-a-storage-pool-and-monitoring-for-it-in-operations-manager-sysctr-scom-windowsserver/
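The serial number can also be read from PowerShell and matched against the label on the physical drive. A minimal sketch, again using the disk name from my environment:

```powershell
# Show the model and serial number of the failing disk so it can be
# matched against the label printed on the drive itself.
Get-PhysicalDisk -FriendlyName 'PhysicalDisk4' |
    Select-Object FriendlyName, Model, SerialNumber, Size
```

Some controllers and enclosures do not pass the serial number through, in which case this field may come back empty.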

My drive which was having issues is shown below:

After removing the drive and rebooting the system, the drive in question was still listed as “Retired”, now under the name “PhysicalDisk-1” (I had also added another drive when I removed the dying one, which may have caused some confusion and the difference in the name).

Attempts to remove the disk which was now listed as “Retired” were unsuccessful.

Addressing challenges with the Hot Spare drive

I had assumed that the loss of a drive would cause the hot spare which was part of the virtual disk to take over for the drive which was removed. That assumption proved to be incorrect and cost me significant time to debug and determine how to proceed. The key was to convert the drive from being a hot spare to being automatic via PowerShell (Set-PhysicalDisk -FriendlyName "<DiskName>" -Usage AutoSelect).

Once this disk was set to Automatic (highlighted below) I could now take the steps required to fix the virtual disk.
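As a concrete sketch of that change: in my pool the hot spare was PhysicalDisk0, so the commands look like the following. Adjust the name to match whichever disk is your hot spare.

```powershell
# Switch the hot spare to automatic allocation so the pool can use it
# when the virtual disk is repaired. 'PhysicalDisk0' is the hot spare in this post.
Set-PhysicalDisk -FriendlyName 'PhysicalDisk0' -Usage AutoSelect

# Verify that the Usage column no longer lists the disk as a hot spare.
Get-PhysicalDisk -FriendlyName 'PhysicalDisk0' |
    Select-Object FriendlyName, Usage, HealthStatus
```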

Fixing the virtual disk and storage pool

After the drives in the virtual disk were all set to automatic, the next step was to use PowerShell to rebuild the virtual disk (Repair-VirtualDisk).

The rebuild took approximately 9 hours in my environment but running it through PowerShell made it easy to see that it was being rebuilt and that progress was taking place. Samples from this rebuild are shown below:

At 11:25 am:

At 1:00 pm:
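For reference, starting the repair and checking on it looks roughly like this. It is a sketch under a few assumptions: the virtual disk name is the one from my pool, and -AsJob is used only so that the prompt returns immediately instead of blocking until the repair finishes.

```powershell
# Start the repair of the degraded virtual disk in the background.
# 'Software Library and Virtual Storage' is the virtual disk in this post.
Repair-VirtualDisk -FriendlyName 'Software Library and Virtual Storage' -AsJob

# Re-run this periodically to watch the rebuild progress.
Get-StorageJob |
    Select-Object Name, JobState, PercentComplete, BytesProcessed, BytesTotal
```

Expect the percentage to move slowly on large mirrors; my rebuild took about 9 hours.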

After the rebuild finished I re-checked the storage space. While the storage pool was still yellow (due to the drive which had been removed), there was no longer a yellow exclamation point on the virtual disk!

The next step was to remove the bad disk which worked as expected within the UI.
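I did this step in the UI, but the removal can also be done from PowerShell. The following is a sketch of that alternative rather than what I actually ran; the disk and pool names are the ones from my environment, and the cmdlet asks for confirmation before removing anything.

```powershell
# Look up the retired disk; after my reboot it was listed as 'PhysicalDisk-1'.
$retired = Get-PhysicalDisk -FriendlyName 'PhysicalDisk-1'

# Remove it from the pool. Only do this after the repair has completed,
# so that no data still depends on the retired disk.
Remove-PhysicalDisk -StoragePoolFriendlyName 'Storage Pool Z' -PhysicalDisks $retired
```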

The result was now a healthy state across the board (the Storage Space, Virtual Disks and Physical Disks).

Addressing a side issue with virtuals on the drive

As I stated early on in this blog post, the initial issue which led me to the problem with the virtual disk was an inability to launch the virtual machines which I keep on this storage pool. Even after the repair, I initially found that I was still unable to mount the virtuals (the error message is shown below).

However, after I rebooted the server which was hosting the drives, the virtuals re-activated correctly and I was able to bring all of the impacted systems back online.

Summary & reference links

Key lessons for me from this issue were the following:

  • A drive which is degrading may impact the functionality of a virtual disk and storage pool in what would appear to be strange ways such as inability to mount VHDX files on the drive.
  • A hot spare in a storage pool will not activate in the case of read failures, and may not activate at all unless you change its usage from Hot Spare to Automatic.
  • PowerShell is your friend, especially when dealing with issues with physical disks or virtual disks which are part of a storage space.

I owe a huge thank you to John Savill who provided a great sounding board for this issue and helped me to not give up and just destroy and recreate this virtual disk!

The reference links which I used in this article are as follows: (thank you to everyone who shared their experiences!)