As with any failed deployment, SDDC Manager showed a failed task that could be retried, but it failed again on every retry. The domain-manager.log and vmware_vrlcm.log files weren’t highlighting any specific issues, the network was reachable, DNS records existed, and no firewall was blocking ports between fleet / ops / NSX. Curl commands between all appliances, as well as telnet, were working perfectly.
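For anyone wanting to repeat the DNS and port checks described above in a scripted way, here is a minimal Python sketch. It is not what was run in the lab (that was curl and telnet); the function names `check_host` and `report` are made up for illustration, and port 443 is only an assumed default.

```python
import socket


def check_host(hostname, port=443, timeout=3.0):
    """Resolve a hostname, then try a TCP connection to the given port.

    Returns (ip, port_open). ip is None when name resolution fails.
    """
    try:
        ip = socket.gethostbyname(hostname)
    except socket.gaierror:
        # DNS record missing or resolver unreachable
        return None, False
    try:
        # Opens and immediately closes a TCP connection, much like telnet would
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        # Refused, timed out, or filtered by a firewall
        return ip, False


def report(hostnames, port=443):
    """Build one human-readable result line per hostname."""
    lines = []
    for name in hostnames:
        ip, is_open = check_host(name, port)
        dns = ip if ip else "DNS lookup failed"
        state = "open" if is_open else "unreachable"
        lines.append(f"{name}: {dns}, port {port} {state}")
    return lines
```

Something like `print("\n".join(report(["ops.example.lab", "fleet.example.lab"])))` would print one line per appliance; those hostnames are placeholders, so substitute your own FQDNs. A clean result only rules out basic DNS and TCP problems, which is exactly where I ended up: everything reachable, and the deployment still failing.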
Error: “Deployment of VCF Management Components is not allowed as an existing deployment is already present”.
ALWAYS back up and/or snapshot SDDC Manager before any database change. Realistically, you should avoid modifying the database altogether, especially without support. This process may leave you unsupported, so make sure you are working with support if you do encounter this issue.
From what I could tell, the only place there was any mention of the deployment status was in the vcf_management_component postgres database on SDDC Manager. The image below is what a healthy deployment looks like.
Note that my deployment failed early, after ops was deployed and while it was registering the proxy. If you use this at a later stage, ensure you have taken a snapshot / backup first!
The deployment of VCF 9 fleet components fails part way through, and the task cannot be restarted successfully. You may need to forcefully delete the VMs and update the database, which will allow you to redeploy successfully.
Investigating and Fixing
Unfortunately I didn’t capture the error message in SDDC Manager to paste here, but the message was quite generic.
Note: I have only had this occur once and it was in my lab, so it is likely to be lab related and IS NOT a widespread issue. Consider it an FYI in case you run into the issue.
As it was a greenfield deployment, there wasn’t too much else to check. Being a lab, I decided I would just redeploy the fleet appliances. I deleted the VMs from vCenter, but when I attempted to deploy them again using the exact same API and payload, I hit the below error.
The Fix
PuTTY / SSH onto SDDC Manager, then run:
In my failed deployment, VCF Operations and fleet management deployed, and then it errored while deploying the proxy. Rather than seeing “SUCCEEDED”, I had “FAILED” for the cloud proxy and “NOT STARTED” for the rest.
TLDR: I always deploy my fleet appliances on VCF Networking (VMware NSX) overlay networks. During the process of building a management domain in VCF 9, there is an option to skip deploying the fleet appliances, which gives you the flexibility to deploy them afterwards using the SDDC Manager API and NSX networks. When doing this, the deployment in SDDC Manager failed after deploying VCF Operations and fleet management, while deploying the Ops proxy.
Summary
After this you should be able to revalidate and redeploy the appliances using the API. Ignore the hostname; this was a fresh SDDC Manager deploy, and it shows how things look before anything is deployed.
The code block below is my payload to deploy VCF Operations, VCF Operations Proxy, and VCF Automation. As you can see, two networks are specified at the bottom: local-region and xRegion. Both are NSX overlay networks in my environment.