Replace a Failed vSAN Diskgroup on a VCF Managed Node

September 9, 2021

Replace a Failed vSAN Diskgroup on a VCF Managed Node

VCF Upgrade Workflow Description – Upgrade – ESX_HOST

The Issue? SDDC Manager Upgrade Workflow Fails During the Host Upgrade Task If There Is a Failed vSAN Disk Group

7. Issue the command esxcli vsan storage remove -u 52ccb67b-eb3a-5145-6c3b-fe455f7091f9, there should be no output from this command.

The VCF upgrade workflow made it all the way to upgrading the hosts and failed. The error “null Manual intervention needed as upgrade failed during install stage. Check for errors in the lcm log files located on SDDC Manager under /var/log/vmware/vcf/lcm. Check for VSAN Health warnings/errors in the vSphere Client. Resolve the issues and make sure that VSAN health is green in vSphere Client.

2. Click on configure, then navigate to vSAN, Disk Management. From this image you can see vcfesxi3 is in an unhealthy state.

11. In this case, once the disks were claimed and the host was brought out of maintenance mode, the disk group was recreated on the host. If this does not occur for you, you may have to manually create the disk group, by selecting create disk group.

10. The drive must now be replaced in the host, and the disk group must be recreated. First claim the unused disks on the host by selecting the host in Disk Management, and selecting claim unused disks. Ensure the drives are marked correctly (cache or capacity). Select create when done.

sddc manager upgrade workflow fail — Workflow Failure

Troubleshooting the Issue

12. Once the cluster is healthy again, you should be able to re-run the upgrade workflow and it should now be successful.

Reference Token: DMP3M3

Log into vCenter and navigate to the cluster to identify any potential issues. The image below shows the cluster health and we are immediately prompted with a vSAN physical disk alert.

vsan cluster health issue — vSAN Cluster Status

5. Identify the vSAN disks and disk groups using the command esxcli vsan storage list

vSAN Disk dead or error — vSAN Disk Dead or Error

Resolving The Issue

The previous section identified the fault, which is a failed cache drive on vcfesxi3.shank.com. The next step is to remediate the disk failure. To do this, the disk group must be removed from the host, you cannot replace just the cache disk. The process to achieve this can be found here. However, as the disk group has already failed, you will need to select no data migration.

Click on the disk group under the host with the failed disk in the previous image, then select the three dots (highlighted in the image).

Cause:”

Sometimes the error messages you get in SDDC Manager are a red herring, and you simply retry the task and it passes. However, no amount of retrying the workflow resolved this one!

vsan disk group removal failure — Cannot remove the disk group

The error “null Manual intervention needed as upgrade failed during install stage. Check for errors in the lcm log files located on SDDC Manager under /var/log/vmware/vcf/lcm. Check for VSAN Health warnings/errors in the vSphere Client. Resolve the issues and make sure that VSAN health is green in vSphere Client“, generally indicates a health issue in your vSAN cluster. Check vCenter and ensure the cluster is green before attempting restart the workflow, in this case a failed physical caching disk needed replacement, before the workflow could be restarted.

It is sometimes hard to keep track of all nodes, disks and disk groups in an environment. The fact of the matter is.. drives fail. If this occurs on one of your nodes and you haven’t realised, depending on your vSAN storage policy, you could be in trouble very soon! This article shows you how to replace a failed vSAN diskgroup on a VCF managed node.

The error message is pretty clear, this is a vSAN related issue. Now you will need to identify which host has the issue, and what the issue is.

[root@vcfesxi3:~] esxcli vsan storage remove -u 52ccb67b-eb3a-5145-6c3b-fe455f7091f9 [root@vcfesxi3:~]

4. The next step is to try via command line. First ensure the host is in maintenance mode, then SSH onto the host.

manual removal of vsan disk group — Deleted Disk Group in vSAN Disk Management

3. Click on the host that is marked as unhealthy, in the bottom pane, each disk and their state should now be shown. Notice the cache disk state is Dead or Error. As the disk has most likely failed, the disk group on this host will not be used and this host will not be used as a storage node. However, VMs will continue to run and read data from the other nodes. vSAN Failure Scenarios can be viewed here.

claim vSAN disks — vSAN Claim Unused Disks

2. Select remove from the drop down.

[root@vcfesxi3:~] esxcli vsan storage list naa.5002538e9016f285 Device: naa.5002538e9016f285 Display Name: naa.5002538e9016f285 Is SSD: true VSAN UUID: 52514556-3f7b-6455-e924-d7b188d8c475 VSAN Disk Group UUID: 52ccb67b-eb3a-5145-6c3b-fe455f7091f9 VSAN Disk Group Name: Used by this host: true In CMMDS: false On-disk format version: 13 Deduplication: false Compression: false Checksum: 1956740779600151227 Checksum OK: true Is Capacity Tier: true Encryption Metadata Checksum OK: true Encryption: false DiskKeyLoaded: false Is Mounted: true Creation Time: Tue Jun 15 05:31:26 2021

t10.NVMe____CT1000P1SSD8____________________________5404782B0175A000 Device: t10.NVMe____CT1000P1SSD8____________________________5404782B0175A000 Display Name: t10.NVMe____CT1000P1SSD8____________________________5404782B0175A000 Is SSD: true VSAN UUID: 52de1788-674a-277b-b464-737e3073d858 VSAN Disk Group UUID: 52ccb67b-eb3a-5145-6c3b-fe455f7091f9 VSAN Disk Group Name: Used by this host: true In CMMDS: false On-disk format version: 13 Deduplication: false Compression: false Checksum: 3433439902698835217 Checksum OK: true Is Capacity Tier: true Encryption Metadata Checksum OK: true Encryption: false DiskKeyLoaded: false Is Mounted: true Creation Time: Tue Jun 15 05:31:26 2021

3. Select the type of data migration suitable for your environment, in this case, no data migration has been selected. The option to remove the disk group is greyed out. If you are in the same situation, continue to step 4.

The other issue arises when attempting to update a VCF environment that has a non-healthy vSAN cluster. You will likely encounter a similar situation to what is detailed below.

Replace a Failed vSAN Diskgroup on a VCF Managed Node

Replace a Failed vSAN Diskgroup on a VCF Managed Node

VCF Upgrade Workflow Description – Upgrade – ESX_HOST

The Issue? SDDC Manager Upgrade Workflow Fails During the Host Upgrade Task If There Is a Failed vSAN Disk Group

Troubleshooting the Issue

Related

Resolving The Issue

Firefox 133 Released: New Features and Installation Guide for Linux

Building a Multi-NIC VCF 4 Workload Domain using API | LAB2PROD

CrowdStrike Signal Transforms AI-Powered Threat Detection

CrowdStrike Launches Agentic AI Innovations to Fortify the AI-Native SOC

January 2025 Patch Tuesday: 10 Critical Vulnerabilities and Eight Zero-Days Among 159 CVEs

Replace a Failed vSAN Diskgroup on a VCF Managed Node

VCF Upgrade Workflow Description – Upgrade – ESX_HOST

The Issue? SDDC Manager Upgrade Workflow Fails During the Host Upgrade Task If There Is a Failed vSAN Disk Group

Troubleshooting the Issue

Related

Resolving The Issue

Similar Posts