VCF 9.0 GA Mental Model Part 6: Topology and Identity Boundaries for Single Site, Dual Site, and Multi-Region

TL;DR

Scope: VMware Cloud Foundation 9.0.0.0 GA (primary platform build 24703748) and the associated 9.0 GA BOM levels for key components:
- SDDC Manager: 9.0.0.0 build 24703751
- vCenter: 9.0.0.0 build 24755230
- ESXi: 9.0.0.0 build 24755229
- NSX: 9.0.0.0 build 24752083
- VCF Operations: 9.0.0.0 build 24705084
- VCF Operations Fleet Management: 9.0.0.0 build 24704881
- VCF Automation: 9.0.0.0 build 24786202
- VCF Identity Broker: 9.0.0.0 build 24786209
Your topology decision is really about failure domains:
- Single site -> simplest operations.
- Two sites in one region -> availability engineering (stretched networking and usually stretched storage).
- Multi-region -> disaster recovery engineering (asynchronous replication + runbooks).
Your identity decision is a blast radius decision:
- Fleet-wide Single Sign-On (SSO) maximizes convenience, but centralizes login impact.
- Instance-level SSO shrinks blast radius, but increases operational overhead.
Operational punchline: Choose topology and SSO model as day-0 decisions, because your day-2 posture (change windows, incident scope, and who gets paged) is set by those boundaries.

Architecture Diagram

Table of Contents

Scope and terminology guardrails
Assumptions
Decision criteria
Challenge
Solutions
Identity boundaries
Who owns what
Version compatibility matrix
Architecture tradeoff matrix
Failure domain analysis
Day-0, day-1, day-2 action map
Operational runbook snapshot
Validation
Troubleshooting workflow
Anti-patterns
Summary and takeaways
Conclusion

Scope and terminology guardrails

You need a shared language for “what breaks what”:

Fleet is your centralized governance and lifecycle scope for fleet-level services (for example, VCF Operations and VCF Automation).
Instance is a discrete VCF deployment unit with its own instance-level management components.
Domains (management domain and VI workload domains) are lifecycle and isolation boundaries inside an instance.
Clusters are the scaling unit inside a domain.
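One way to keep this vocabulary honest is to write it down as a data structure. The sketch below models the fleet > instance > domain > cluster nesting described above. All class names, field names, and sample values are illustrative assumptions, not objects from any VCF API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cluster:
    name: str
    host_count: int


@dataclass
class Domain:
    name: str
    kind: str  # "management" or "workload" (hypothetical labels)
    clusters: List[Cluster] = field(default_factory=list)


@dataclass
class Instance:
    name: str
    domains: List[Domain] = field(default_factory=list)


@dataclass
class Fleet:
    name: str
    instances: List[Instance] = field(default_factory=list)

    def host_count(self) -> int:
        """Total hosts across every instance, domain, and cluster."""
        return sum(
            cluster.host_count
            for instance in self.instances
            for domain in instance.domains
            for cluster in domain.clusters
        )


# Hypothetical inventory: one fleet, one instance, two domains.
example_fleet = Fleet(
    name="fleet-01",
    instances=[
        Instance(
            name="instance-a",
            domains=[
                Domain("mgmt", "management", [Cluster("mgmt-cl01", 4)]),
                Domain(
                    "wld-01",
                    "workload",
                    [Cluster("wld-cl01", 8), Cluster("wld-cl02", 8)],
                ),
            ],
        )
    ],
)
```

Walking the structure top-down mirrors how scale is added: clusters to domains, domains to instances, instances to the fleet.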

For topology conversations, you also need consistent physical vocabulary:

Region is one or more physical sites in a single metro area, typically aligned to synchronous replication latencies.
Single site is a single fault domain at some layer (power, HVAC, core network, etc.), even if you have multiple racks.
Multiple sites in a single region is an availability pattern, usually implemented with stretched clusters.
Multiple sites across multiple regions is a disaster recovery pattern. Treat it as DR engineering, not “metro HA, but farther away”.

Assumptions

You are designing for VCF 9.0.0.0 GA (not 9.0.x maintenance releases).
You are greenfield for VCF bring-up.
You plan to deploy both VCF Operations and VCF Automation from day-1.
You want to support three topology postures:
- Single site
- Two sites in one region
- Multi-region
You need to support two identity postures:
- Shared identity and shared SSO boundary where appropriate
- Separate SSO boundaries for regulated isolation where required

Decision criteria

Use these criteria to keep topology and identity debates grounded in operational outcomes:

Availability objective
- Are you trying to survive host/rack failures, or a full site loss?
- Do you need to “continue running” through a failure, or to “recover quickly” after one?
Latency reality
- Two sites in one region implies tight latency constraints and resilient inter-site networking.
- Multi-region implies you are in DR territory, not synchronous HA territory.
Isolation and compliance
- Do you need separate admin planes and authentication boundaries for regulated workloads?
Operational model
- Can your teams support stretched designs (storage, networking, failure testing)?
- Do you have the maturity to run parallel instances and DR runbooks?
Scale and growth
- Will you scale by adding clusters, adding domains, or adding instances?
- Are you trying to cap blast radius for lifecycle events?
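The availability and operational-model questions above can be reduced to a first-pass decision helper. This is a deliberately small sketch, not a sizing tool: the function name, the three boolean inputs, and the returned labels are assumptions made for illustration, and real designs weigh compliance and scale as well.

```python
def suggest_topology(
    survive_site_loss: bool,
    continue_running: bool,
    can_operate_stretched: bool,
) -> str:
    """Map three yes/no answers to a starting topology posture.

    survive_site_loss: must the platform outlive the loss of a whole site?
    continue_running: is the goal to keep running (HA) rather than recover (DR)?
    can_operate_stretched: can the team meet stretched network/storage needs?
    """
    if not survive_site_loss:
        # Host and rack failures only: keep operations simple.
        return "single site"
    if continue_running:
        if can_operate_stretched:
            return "two sites in one region"
        # The requirement and the capability disagree; flag it, do not guess.
        return "unresolved: stretched prerequisites not met"
    # Site loss is handled by recovery, which is DR engineering.
    return "multi-region"
```

The "unresolved" branch is the useful one in a design review: it surfaces a requirement the team cannot yet operate.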

Challenge

You need a topology that:

Matches real failure domains (host, rack, site, region)
Keeps lifecycle operations predictable (patching, certificates, identity changes)
Makes ownership clear (platform team vs VI admins vs app/platform teams)
Avoids accidental coupling (shared services that turn into shared outages)

Solutions

Solution A: Single site

When it fits

You want the fastest path to a stable VCF 9.0 platform.
Your highest-probability failures are host and rack, not full-site loss.
You want to minimize “distributed systems” complexity in your management components.

Solution B: Two sites in one region

When it fits

You need resilience across two facilities in the same metro area.
You can meet the networking and storage requirements to operate stretched designs reliably.
You accept more complex failure testing and more disciplined change management.

Solution C: Multi-region

When it fits

You need regional survivability and a credible DR story.
You accept asynchronous replication and DR orchestration as first-class requirements.
You can operationalize regular failover testing.

What it looks like operationally

One fleet with multiple instances, typically aligning instances to regions.
Each region runs its own instance-level management components for that instance.
You add replication and failover solutions on top (data replication is not “free” just because you have two regions).

Failure posture

Region loss becomes a recovery process, not an HA event.
Your RPO/RTO is determined by:
- Replication technology and mode (async, periodic)
- Runbook execution time (automation maturity)
- DNS, identity, and access dependencies
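A back-of-the-envelope calculation makes these three inputs concrete. The sketch below assumes periodic replication and a strictly serial runbook, which is a simplification. The function names, step names, and minute values are hypothetical placeholders, not measured figures.

```python
from typing import Dict


def worst_case_rpo_minutes(
    replication_interval_min: float, transfer_time_min: float
) -> float:
    """Worst-case data loss window for periodic replication.

    If the region fails just before a replication cycle completes, the newest
    usable copy is one full interval old plus the time that copy took to land.
    """
    return replication_interval_min + transfer_time_min


def estimated_rto_minutes(step_minutes: Dict[str, float]) -> float:
    """Sum of serial recovery steps (assumes no steps run in parallel)."""
    return sum(step_minutes.values())


# Hypothetical runbook timings, in minutes.
example_steps = {
    "declare_disaster": 30,
    "runbook_execution": 120,
    "dns_cutover": 15,
    "identity_and_access_validation": 30,
}

example_rpo = worst_case_rpo_minutes(replication_interval_min=15, transfer_time_min=5)
example_rto = estimated_rto_minutes(example_steps)
```

Writing the steps down as a dictionary also shows where automation pays off: the largest value is the first candidate for investment.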

Day-2 implications

More upgrade surface area:
- More instance-level stacks to patch and validate
- More compatibility and sequencing to track
More change management work:
- Cross-region DR testing, runbook maintenance, replication monitoring

Identity boundaries

VCF 9.0 gives you flexibility in how far you extend SSO convenience. Your decision should be explicit, because it determines operational coupling.

Identity design-time decisions that matter

Do you want fleet-wide login convenience or per-instance blast radius control?
Do you need one identity provider across the fleet, or separate identity sources for isolation?
Do you need a highly available Identity Broker deployment model for scale and resilience?

Challenge

You want a clean login experience for operators and consumers, without turning identity into a single point of operational failure.

Solutions

Solution A: Fleet-wide Single Sign-On

Best for

A single platform operations team supporting multiple instances
Environments where cross-instance operations are common
Organizations optimizing for ease of use and consistent access patterns

Solution B: Instance-level SSO

Best for

Regulated isolation
Multi-tenant environments where identity boundaries must map to tenant boundaries
Organizations optimizing for smaller incident scope

Solution C: Segmented cross-instance SSO

Best for

A practical middle ground
You want to group instances by risk domain (for example, production vs regulated)

What it looks like operationally

Multiple Identity Broker instances serve defined subsets of instances in the same fleet.
You reduce login blast radius versus a single shared Identity Broker, but still gain some cross-instance convenience.
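The blast radius difference between the models can be made explicit by mapping each Identity Broker to the instances it serves. The sketch below is illustrative only: the broker and instance names are hypothetical, and the mapping is a plain dictionary, not something read from a VCF API.

```python
from typing import Dict, Set


def login_blast_radius(
    broker_scope: Dict[str, Set[str]], failed_broker: str
) -> Set[str]:
    """Instances whose SSO login path depends on the failed broker."""
    return set(broker_scope.get(failed_broker, set()))


def blast_radius_fraction(
    broker_scope: Dict[str, Set[str]], failed_broker: str
) -> float:
    """Share of all instances in the fleet that lose SSO login."""
    all_instances = set().union(*broker_scope.values())
    if not all_instances:
        return 0.0
    return len(login_blast_radius(broker_scope, failed_broker)) / len(all_instances)


# Hypothetical fleet-wide model: one broker serves every instance.
fleet_wide = {"broker-shared": {"prod-a", "prod-b", "regulated-a"}}

# Hypothetical segmented model: brokers grouped by risk domain.
segmented = {
    "broker-prod": {"prod-a", "prod-b"},
    "broker-regulated": {"regulated-a"},
}
```

With these sample mappings, losing the shared broker affects every instance, while losing `broker-regulated` in the segmented model affects one instance out of three.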

Rollback and safety notes for identity

Identity changes are rarely “undo-able” in a clean way.

Operational behaviors to plan for:

Resetting or deregistering SSO can remove provisioned users and groups and may be irreversible for the removed identities.
Even after configuring SSO centrally, you often still need to log in to individual components and assign roles and permissions for users and groups.

Treat every identity change as:

A change window item with defined blast radius
A runbook with an explicit backout plan (often “restore from backup” rather than “click undo”)

Who owns what

Use this chart to stop ownership drift before it becomes incident fuel.

Capability / Task Area	Platform team (fleet)	VI admin (instance + domains)	App/platform teams (consumers)
Fleet topology decisions (fleet count, instance strategy)	Own	Consult	Inform
VCF Operations + Fleet Management lifecycle	Own	Consult	Inform
VCF Automation lifecycle and platform guardrails	Own	Consult	Consult
Identity Broker and SSO model selection	Own	Consult	Inform
Identity provider integration and federation policy	Own	Consult	Inform
Instance bring-up, SDDC Manager health	Consult	Own	Inform
Management domain operations (vCenter/NSX for mgmt)	Consult	Own	Inform
Workload domain lifecycle (create/expand/delete)	Consult	Own	Inform
Network services consumption (projects, VPCs, templates)	Guardrails	Provide capacity	Own
Workload placement, sizing, app RTO/RPO	Guardrails	Provide platform SLAs	Own
DR runbooks for workloads	Provide platform primitives	Support infra failover	Own (execute + validate)

Version compatibility matrix

This model is pinned to the following 9.0 GA versions and builds:

Component	Role in the model	9.0 GA version	9.0 GA build
VMware Cloud Foundation	Platform level	9.0.0.0	24703748
SDDC Manager	Instance mgmt	9.0.0.0	24703751
vCenter	Domain mgmt	9.0.0.0	24755230
ESXi	Host layer	9.0.0.0	24755229
NSX	Network virtualization	9.0.0.0	24752083
VCF Operations	Fleet-level ops	9.0.0.0	24705084
VCF Operations Fleet Management	Fleet lifecycle plane	9.0.0.0	24704881
VCF Automation	Fleet-level consumption	9.0.0.0	24786202
VCF Identity Broker	Identity plane	9.0.0.0	24786209
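A pinned bill of materials is only useful if you compare it against what is actually running. The sketch below encodes the build numbers from the table and reports mismatches. How you collect the reported builds is out of scope here, so the `reported` dictionary is a hypothetical input you would populate from your own inventory tooling.

```python
from typing import Dict, List, Tuple

# Build numbers copied from the 9.0 GA table above.
EXPECTED_BUILDS: Dict[str, str] = {
    "VMware Cloud Foundation": "24703748",
    "SDDC Manager": "24703751",
    "vCenter": "24755230",
    "ESXi": "24755229",
    "NSX": "24752083",
    "VCF Operations": "24705084",
    "VCF Operations Fleet Management": "24704881",
    "VCF Automation": "24786202",
    "VCF Identity Broker": "24786209",
}


def find_build_drift(reported: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (component, expected_build, reported_build) for each mismatch.

    Components absent from `reported` are listed with the value "missing".
    """
    drift = []
    for component, expected in EXPECTED_BUILDS.items():
        actual = reported.get(component, "missing")
        if actual != expected:
            drift.append((component, expected, actual))
    return drift
```

An empty result means every component matches the GA baseline; anything else is a prompt to check whether a maintenance release has moved you outside the scope of this article.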

Architecture tradeoff matrix

Use this table in design boards to turn opinions into tradeoffs.

Attribute	Single site	Two sites in one region	Multi-region
Primary goal	Operational simplicity	Site resilience (metro)	Regional survivability (DR)
Typical instance count	1	1	2+
Data protection posture	Local HA + backups	Often synchronous within region	Asynchronous replication + DR
Network demands	Standard DC	Stretched, resilient inter-site	L3 between regions + DR routing/DNS
Change risk	Lowest	Medium to high	High (more components)
Upgrade impact	Smallest	Broader (shared stretched deps)	Broadest (multiple instances)
Identity blast radius	Depends on SSO model	Depends on SSO model	Higher if identity is centralized
Best for	Getting started, most orgs	Metro availability	Regulated DR, geo resilience

Failure domain analysis

Fleet service incident (Operations/Automation/Identity Broker)
Impacts governance, provisioning workflows, centralized observability, and potentially login flows (depending on your SSO model).
It does not automatically mean instance-level vCenter or NSX is down.
Instance incident (SDDC Manager, management domain services)
Impacts domain lifecycle operations and management workflows for that instance. Workloads may keep running, but lifecycle and orchestration stop being safe.
Domain incident (a workload domain vCenter/NSX, or cluster issues)
Impacts workloads in that domain. Other domains and instances can remain healthy.

How topology shifts the failure modes:

Single site: Failure domains are clean, but “site loss” is still a hard stop unless you add DR.
Two sites in one region: Link failure and split-brain conditions become first-class failure modes.
Multi-region: DR orchestration and identity dependencies become the most common hidden risk.

Day-0, day-1, day-2 action map

Day-0 decisions

These are the “you will regret not deciding early” items:

Topology pattern: single site vs dual site vs multi-region
Scaling strategy: add clusters vs add domains vs add instances
SSO model: fleet-wide vs instance-level vs segmented cross-instance
Where fleet services live, and how you protect them
Certificate authority strategy and renewal model
Backup and restore posture for:
- VCF Operations + Fleet Management
- VCF Automation
- Identity Broker
- SDDC Manager and management vCenter/NSX

Day-1 actions

Use this as a starting point and adjust to your org’s risk model.

Deploy and configure VCF Installer (new in VCF 9.0; it replaces the Cloud Builder workflow used in earlier releases).
Bring up the first instance and management domain.
Deploy fleet services (Operations and Automation) to match your desired HA footprint.
Configure Identity Broker and SSO model.
Create initial workload domains and attach them to the consumption model you plan to support.
For anything beyond baseline wizard-driven deployment (for example, specific network constructs), plan on JSON spec-driven deployment where required.

Day-2 operations

Day-2 is where topology decisions become either leverage or pain:

Lifecycle management:
- Fleet services lifecycle
- Instance and domain lifecycle
Governance and drift:
- Out-of-band changes are the fastest way to break day-2 workflows
Capacity and scale:
- Add clusters to domains
- Add domains to instances
- Add instances to fleets (most often for geographic dispersal and isolation)
Identity and certificates:
- Role mapping validation after identity changes
- Certificate renewal to avoid service disruption
DR and resilience:
- Regular restoration testing for fleet services
- Runbook execution practice for multi-region
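Certificate renewal is the easiest of these items to automate a warning for. The sketch below uses only the Python standard library to compute the days left on a certificate. The `fetch_not_after` helper assumes the endpoint's issuing CA is trusted by the machine running the check, which may not hold for a private CA until you load it into the context. The function names are illustrative.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now_epoch: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    `not_after` uses the format returned by getpeercert(), for example
    "Jan  5 09:34:43 2018 GMT". A negative result means already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now_epoch is None else now_epoch
    return (expires - now) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read notAfter from a live TLS endpoint (requires a trusted chain)."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after(host))` for each management endpoint and alert when the value drops below a threshold that matches your change-window cadence.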

Operational runbook snapshot

Minimum viable backup posture

Back up fleet services and identity:
- VCF Operations + Fleet Management
- VCF Automation
- Identity Broker
Back up instance-level management:
- SDDC Manager and management vCenter/NSX

You will move faster as an organization if you treat these as non-negotiable guardrails:

Fleet services (Operations/Automation/Identity):
- RPO: 24 hours (starter), 4 hours (mature), 1 hour (high-critical)
- RTO: 4-8 hours (starter), 2-4 hours (mature), under 2 hours (high-critical)
Workload domains (apps):
- RPO and RTO should be app-tier driven, not “platform averages”
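These tiers are easy to turn into a check you can run after every backup job or restore test. The sketch below encodes the fleet-service targets listed above, using the upper bound of each RTO range. The tier keys and function name are assumptions, and "under 2 hours" is treated as "at most 2 hours" for simplicity.

```python
from typing import Dict

# Fleet-service targets in hours, taken from the tiers above.
# RTO uses the upper bound of each stated range.
FLEET_SERVICE_TARGETS_HOURS: Dict[str, Dict[str, float]] = {
    "starter": {"rpo": 24, "rto": 8},
    "mature": {"rpo": 4, "rto": 4},
    "high-critical": {"rpo": 1, "rto": 2},
}


def meets_targets(
    tier: str, newest_backup_age_hours: float, last_restore_test_hours: float
) -> Dict[str, bool]:
    """Compare observed backup age and restore duration to a tier's targets."""
    target = FLEET_SERVICE_TARGETS_HOURS[tier]
    return {
        "rpo_met": newest_backup_age_hours <= target["rpo"],
        "rto_met": last_restore_test_hours <= target["rto"],
    }
```

For example, a 6-hour-old backup and a 3-hour restore test satisfy the starter tier, but miss the RPO for the mature tier.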

Identity provider change runbook

Pre-change:
- Confirm break-glass access to each component
- Export role mappings and admin group membership
- Confirm backups exist for identity components
Change:
- Implement identity provider change in the selected SSO model scope
- Re-validate role mapping per component (vCenter, NSX, Operations, Automation)
Post-change:
- Validate login across:
  - Fleet UI
  - Instance components
  - Automation portals
- Update documentation and on-call procedures
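The "export role mappings" and "re-validate role mapping" steps pair naturally: export before the change, export again after, and diff. The sketch below assumes you have already exported mappings into nested dictionaries of component -> group -> roles. The export mechanism, component keys, group names, and role names are hypothetical.

```python
from typing import Dict, Set

RoleMap = Dict[str, Dict[str, Set[str]]]  # component -> group -> roles


def lost_role_mappings(before: RoleMap, after: RoleMap) -> RoleMap:
    """Return every role a group held before the change but not after."""
    lost: RoleMap = {}
    for component, groups in before.items():
        for group, roles in groups.items():
            missing = roles - after.get(component, {}).get(group, set())
            if missing:
                lost.setdefault(component, {})[group] = missing
    return lost


# Hypothetical snapshots taken around an identity provider change.
pre_change: RoleMap = {
    "vcenter": {"vcf-admins": {"Administrator"}},
    "nsx": {"vcf-admins": {"Enterprise Admin"}, "net-ops": {"Network Admin"}},
}
post_change: RoleMap = {
    "vcenter": {"vcf-admins": {"Administrator"}},
    "nsx": {"vcf-admins": {"Enterprise Admin"}},
}
```

An empty result is the pass condition for the post-change step. Here the diff would flag that `net-ops` lost its NSX role, which is the kind of gap that otherwise shows up as "SSO works in one UI, fails in another".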

Validation

Use validation as your “trust but verify” step after topology or identity work.

Before you declare success, validate:

DNS resolution for all management endpoints
NTP sync consistency across fleet services and instance services
Login paths:
- Fleet services
- vCenter and NSX
- Automation portals
Health and connectivity:
- VCF Operations cluster health
- Automation cluster health
- Identity Broker health
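The DNS and connectivity items in this checklist can be scripted from a jump host with the standard library alone. This is a minimal sketch under stated assumptions: the hostnames you pass in are your own, port 443 is assumed for the management endpoints, and a successful TCP connection only proves reachability, not service health.

```python
import socket
from typing import Callable, Dict, Iterable, List


def resolve(hostname: str) -> List[str]:
    """Return the sorted unique addresses for a hostname, or [] on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def tcp_reachable(hostname: str, port: int = 443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to hostname:port succeeds within the timeout."""
    try:
        with socket.create_connection((hostname, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_endpoints(
    hostnames: Iterable[str],
    resolver: Callable[[str], List[str]] = resolve,
    prober: Callable[[str], bool] = tcp_reachable,
) -> Dict[str, Dict[str, bool]]:
    """Run the DNS check, then the TCP check only for names that resolve."""
    results = {}
    for name in hostnames:
        addresses = resolver(name)
        results[name] = {
            "resolves": bool(addresses),
            "reachable": bool(addresses) and prober(name),
        }
    return results
```

The resolver and prober are injectable so the same function can be exercised offline with stubs before you point it at real management endpoints.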

Troubleshooting workflow

Step-by-step triage

Step 1: Is this a login issue or a lifecycle issue?
- Login failures often point to identity scope or identity broker health.
- Lifecycle failures often point to fleet management services or instance manager state.
Step 2: Is impact fleet-wide, instance-wide, or domain-only?
- Fleet-wide symptoms: multiple instances show the same governance or login issues.
- Instance-wide symptoms: one instance fails lifecycle tasks across its domains.
- Domain-only symptoms: a workload domain is isolated while other domains operate normally.
Step 3: Validate time and certificates
- Time drift and certificate issues are repeat offenders in management plane failures.
- Fixing time and trust chains often restores otherwise “mysterious” behavior.
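Step 2 is mechanical enough to encode, which helps when several people are reporting symptoms at once. The sketch below classifies impact from a mapping of instance name to the set of affected domains. The function name, labels, and sample names are illustrative assumptions that follow the three symptom definitions above.

```python
from typing import Dict, Set


def classify_scope(affected: Dict[str, Set[str]]) -> str:
    """Classify incident scope from instance -> affected domain names.

    More than one impacted instance  -> fleet-wide
    One instance, several domains    -> instance-wide
    One instance, one domain         -> domain-only
    """
    impacted = {
        instance: domains for instance, domains in affected.items() if domains
    }
    if not impacted:
        return "no impact reported"
    if len(impacted) > 1:
        return "fleet-wide"
    domains = next(iter(impacted.values()))
    return "instance-wide" if len(domains) > 1 else "domain-only"
```

The classification decides who gets paged first: fleet-wide points at fleet services or identity, instance-wide at the instance management components, and domain-only at that domain's vCenter, NSX, or clusters.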

Common issues

SSO works in one UI, fails in another
- Usually role mappings are incomplete in individual components even though SSO is configured centrally.
Automation provisioning failures after identity changes
- Often stale user/group bindings or missing project/organization role bindings.
Stretched design instability
- Often inter-site routing and gateway failover behavior, not “vSphere problems”.

Anti-patterns

Avoid these and you avoid most self-inflicted outages.

Treating dual-site in one region like “simple HA”
- It is not simple. It is distributed systems engineering.
Treating multi-region as “active-active by default”
- Multi-region is a DR posture unless you intentionally architect otherwise.
Choosing fleet-wide SSO without an identity resilience plan
- Convenience without resilience becomes a fleet-wide login incident.
Mixing regulated tenants into a shared identity boundary “for simplicity”
- That is an audit finding waiting to happen.
Out-of-band changes without drift detection and an operational reconciliation practice
- This creates silent divergence between actual state and expected state.

Summary and takeaways

Topology is a failure domain decision. Identity is a blast radius decision.
Single site is the fastest path to a stable platform and clean day-2 operations.
Two sites in one region is an availability posture that requires disciplined engineering and testing.
Multi-region is a DR posture that requires replication, orchestration, and practiced runbooks.
Fleet-wide SSO is about user experience. Instance-level SSO is about containment.
Put the ownership model on paper early, or your incident bridge will do it for you.

Conclusion

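Sources

VMware Cloud Foundation 9.0 Documentation (TechDocs landing page): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html
VMware Cloud Foundation 9.0 Release Notes – Bill of Materials: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/release-notes/vmware-cloud-foundation-90-release-notes/vmware-cloud-foundation-bill-of-materials.html
Design Blueprints for VMware Cloud Foundation 9.0: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/blueprints.html
VCF Fleet-Wide Single Sign-On Model: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/design-library/single-sign-on-models/-fleet.html
VCF Single Sign-On Models (Design Library index): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/design/design-library/single-sign-on-models.html
VCF Installer Product Support Notes (VCF 9.0 Release Notes): https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0/release-notes/vmware-cloud-foundation-90-release-notes/platform-product-support-notes/product-support-notes-installer.html
VMware Cloud Foundation Installer API Reference Guide: https://developer.broadcom.com/xapis/vcf-installer-api/latest
VMware Cloud Foundation API Reference Guide: https://developer.broadcom.com/xapis/vmware-cloud-foundation-api/latest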
