VCF 9.0 GA Mental Model Part 5: Topology Patterns for Single Site, Two Sites, and Multi-Region

TL;DR

Identity is not a “later” decision. It is a boundary decision.

The hierarchy you should standardize on is Fleet -> Instance -> Domain -> Cluster.
Your topology decision is mostly about:
- How many instances you deploy and how they map to sites and regions.
- How many fleets you operate as governance, identity, and operational boundary lines.
Three practical deployment postures:
- Single site: fastest path, smallest blast radius, simplest networking.
- Two sites in one region: stretched clusters, stronger site resilience, tighter latency constraints.
- Multi-region: multiple instances, DR-oriented operating model, more dependencies and more change control.
VCF 9.0 GA code levels referenced in this post (component set and build numbers):
- SDDC Manager 9.0.0.0 build 24703748
- vCenter 9.0.0.0 build 24755230
- ESX 9.0.0.0 build 24755229
- NSX 9.0.0.0 build 24733065
- VCF Operations 9.0.0.0 build 24695812
- VCF Automation 9.0.0.0 build 24701403
- VCF Identity Broker 9.0.0.0 build 24695128
- Note: the BOM for this release also calls out VCF Installer 9.0.1.0 build 24962180 as required to deploy all VCF 9.0.0.0 components.

Architecture Diagram

Scenario
Scope and version alignment
Core concepts: mapping physical topology to fleets and instances
Decision criteria you should agree on up front
Challenge: choose your deployment posture
Architecture tradeoff matrix
One private cloud vs multiple fleets
Identity and SSO boundary patterns
Failure domain analysis
Day-0, day-1, day-2 map by topology
Who owns what
Operational runbook snapshot
Troubleshooting workflow
Anti-patterns
Best practices
Summary and takeaways
Conclusion

Scenario

These are real-world starting points, not vendor commitments:

What VCF is actually managing.
Where you draw governance boundaries vs infrastructure boundaries.
What changes when you move from a single site to stretched sites to multiple regions.
How identity scope and fleet count become day-0 decisions with long tail operational consequences.

Scope and version alignment

Use this in design boards to stop circular debates.

VCF 9.0.0.0 GA terminology and workflows.
Greenfield deployment using VCF Installer.
You deploy both VCF Operations and VCF Automation from day-1, even if you phase consumption later.

Version compatibility matrix

What it looks like

Component	Version	Build
SDDC Manager	9.0.0.0	24703748
vCenter	9.0.0.0	24755230
ESX	9.0.0.0	24755229
NSX	9.0.0.0	24733065
VCF Operations	9.0.0.0	24695812
VCF Automation	9.0.0.0	24701403
VCF Identity Broker	9.0.0.0	24695128
VCF Installer	9.0.1.0	24962180

Core concepts: mapping physical topology to fleets and instances

The physical words that matter

You need a model that matches your org and compliance posture.

Site: your contained fault domain boundary. Power, cooling, ToR switches, upstream routing, and physical security usually correlate here.
Region: one or more sites within synchronous replication latencies. Crossing regions is typically a disaster recovery process, not an HA event.

The VCF objects you should use in every design discussion

Fleet: your shared governance and shared platform services boundary. This is where you centralize fleet services like operations and automation.
Instance: a discrete VCF deployment footprint. Each instance contains its own management domain and workload domains.
Domain: the lifecycle and isolation boundary. You patch and evolve domains independently.
- Management domain: hosts instance management components.
- VI workload domain(s): run consumer workloads.

Avoid these early and you remove a lot of future toil.

If someone says “we need another vCenter,” you force the conversation back to domain and instance.
If someone says “we need separation,” you ask whether they mean governance separation (fleet) or workload isolation (domain).

Decision criteria you should agree on up front

Practical rule:

Design-time decision criteria

Availability objective
- Host failure only
- Rack failure
- Site or availability zone failure
- Region failure
Latency and network fabric capability
- ESX host to ESX host latency within clusters
- Stretched VLAN and L2 adjacency requirements
- Fleet-wide connectivity constraints between instances
Isolation objective
- Logical isolation only
- Physical isolation by cluster or domain
- Regulated tenant isolation that requires separate identity and change control
Operating model maturity
- Do you have a platform team that can own fleet services and identity lifecycle?
- Do you have standardized change windows and patch discipline?

Day-2 reality check questions

Can you support fleet services as shared dependencies?
Can you operationalize backup schedules, certificate lifecycle, and password rotation consistently?
Can you troubleshoot cross-site failures without escalating everything to vendors?

Challenge: choose your deployment posture

Treat “private cloud” as your organizational wrapper. VCF objects start at fleet.

Solution A: Single site

Use this to prevent “everyone owns it, so no one owns it.”

Operational implications

One fleet
One instance
One management domain
One or more workload domains

Operational implications

One fleet
One instance stretched across two sites in the same region
A management domain designed for high availability across the two sites
Workload domains separated from management

Operational implications

One fleet
Multiple instances
Each instance has its own management domain and workload domains
Regions are connected with a cross-region network for centralized management and visibility

This is the “site resilience” posture. It usually assumes stretched constructs and tighter network constraints.

Separate failure domains by region.
Enable recovery workflows that survive region-level events, given adequate capacity and replication strategy.

This is where topology turns into real operational outcomes.

You introduce an explicit dependency chain:
- Fleet services live somewhere (commonly the first instance), and other instances rely on cross-region connectivity to reach them.
Change control becomes multi-region by default:
- Certificates, identity, and patching must be coordinated across locations.

Architecture tradeoff matrix

Design intent

Attribute	Single site	Two sites in one region	Multi-region
Primary goal	Simplicity	Site resilience	Region separation and DR posture
Typical instance count	1	1	2+
Network complexity	Low	High	Medium to high
Latency sensitivity	Moderate	High	Medium
Fleet service dependency	Local	Local but stretched	Cross-region dependency
Operational overhead	Low	High	High
Cost drivers	Host count, storage	Stretched fabric, witness, failover capacity	Duplicate capacity, replication, bandwidth

Cost model snapshot

Use UI and workflow validation before you declare success:

Single site
- Cheapest fleet service hosting footprint.
- Lowest network engineering cost.
Two sites in one region
- You pay for:
  - Higher-quality inter-site links
  - Stretched VLAN support
  - Additional failover capacity (because you are engineering for a site loss)
Multi-region
- You pay for:
  - Duplicate management footprints per region
  - Data replication and orchestration tooling
  - Higher operational toil unless you automate day-2 heavily

One private cloud vs multiple fleets

Design intent

When one fleet is enough

Design intent

You want centralized observability and automation.
You can accept shared governance services across multiple instances.
You want a standard operating model across locations.

When you should operate multiple fleets

Operational reality:

Separate identity providers or separate SSO boundaries for regulated isolation.
Independent change windows and patch schedules.
Hard blast radius separation for fleet services.

Use this when:

Fleet separation is about governance, identity scope, and shared service blast radius.
Domain separation is about workload isolation and lifecycle independence.

Identity and SSO boundary patterns

Choose one fleet when:

Challenge: unify access or isolate tenants

Use this when:

Solution A: Fleet-wide SSO

You will hear these terms used casually. Align them to your constraints:

You want one set of credentials and SSO across all components in the fleet.
You can tolerate that an identity broker impact affects the fleet.

You will hear these terms used casually. Align them to your constraints:

You want shared identity across a subset of instances, not necessarily all.
You want more control over blast radius than a single fleet-wide configuration.

Solution C: Single instance SSO boundaries

You will hear these terms used casually. Align them to your constraints:

You need regulated or tenant isolation.
You need different identity providers or different authentication policies per instance.
You want to localize identity outages.

Embedded vs appliance identity broker

VMware Cloud Foundation 9.0 and later Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html

Embedded identity broker is simpler but inherits dependency on instance components.
Appliance identity broker adds overhead but improves availability and scale.
Design constraint worth calling out: there is a maximum number of instances that can connect to a single identity broker deployment.

Failure domain analysis

Use this as your “what are we talking about” anchor in architecture reviews and CAB meetings.

Failure domains you should model

Failure	What breaks	What keeps running	Typical owner response
Fleet services unavailable	Central observability, centralized automation, fleet management workflows, and optionally SSO experience	Existing workloads and instance-level management planes continue to run	Platform team restores fleet services, validates integrations
Instance management domain impaired	Domain lifecycle actions, some instance operations	Workloads may still run, but you lose safe lifecycle control and may lose some vCenter or NSX management functions depending on the failure	VI admin + platform team coordinate recovery
Workload domain impaired	Workloads in that domain	Other domains and instances continue	VI admin + app teams execute workload recovery runbooks

Practical RTO and RPO examples you can use as starting targets

If you cannot state these targets, you should at least agree on the priority order:

Fleet services (VCF Operations, VCF Automation, identity broker)
- RTO: 2 to 8 hours depending on how automated your restore is
- RPO: 24 hours is a common baseline when backups are daily
Instance management domain components
- RTO: 1 to 4 hours if you have clean backups and documented restore procedures
- RPO: 24 hours baseline, tighter if you replicate critical config
Workload domains and applications
- RTO and RPO are application-specific and often require orchestration tooling

Operational implications

Identity and authentication
SDDC Manager and lifecycle
vCenter and NSX management
Workload recovery

Day-0, day-1, day-2 map by topology

Day-0: decisions you should lock

You are about to deploy VCF 9.0 GA greenfield and you need a shared language for:

Fleet count and naming standard.
Instance to site or region mapping.
Domain strategy:
- Management domain is not where you run business workloads.
- Workload domains align to lifecycle and isolation needs.
Network and IP plan:
- Treat subnet sizing as irreversible planning, not something you “fix later”.
- Allocate address space with room for expansion.
Identity model:
- Fleet-wide vs instance-level boundaries.
- Corporate IdP integration and MFA policy alignment.
Certificate authority and certificate lifecycle plan.
Backup targets and backup schedule owners.

Day-1: bring-up sequence that fits the object model

What it looks like

Deploy VCF Installer appliance.
Start new fleet deployment and create the first instance.
License and stand up fleet services:
- VCF Operations
- VCF Automation
Deploy identity broker and configure VCF SSO with your directory.
Create workload domain(s) and establish network connectivity patterns.
Stand up VCF Automation constructs for consumption.

Day-2: operations you should operationalize early

Patch and lifecycle:
- Domain-based upgrades and maintenance windows.
- Explicit rollback plans when upgrading shared fleet services.
Backup and restore:
- SFTP backup targets for management components.
- Backup schedules for fleet services and for instance components.
Security lifecycle:
- Password rotation and account management.
- Certificate replacement and renewal.
Expansion:
- Add workload domains, clusters, and potentially additional instances.

Who owns what

These are design-time decisions that are expensive to reverse later.

Capability	Platform team	VI admin	App and platform teams
Fleet services lifecycle	Own	Consult	Informed
VCF Operations configuration and alerts	Own	Consult	Informed
VCF Automation provider setup	Own	Consult	Informed
Identity broker and SSO model	Own	Consult	Informed
Instance bring-up and health	Own	Own	Informed
SDDC Manager operations	Consult	Own	Informed
vCenter and NSX in management domain	Consult	Own	Informed
Workload domain creation and lifecycle	Consult	Own	Informed
Workload provisioning via automation	Own the platform	Consult	Own consumption
Application deployment and runtime	Informed	Consult	Own

Operational runbook snapshot

When something breaks, troubleshoot by boundary.

Weekly

Review fleet service health and integrations.
Validate that all instances are reporting metrics and logs.
Confirm certificate expiration windows and rotation queue.

Monthly

Execute backup restore tests for:
- Fleet services
- Instance management components
Review capacity and “failover capacity” assumptions for your topology.

Quarterly

Patch at domain boundaries, not by ad hoc component upgrades.
Re-validate cross-site network latency and packet loss.
Run a tabletop exercise:
- Fleet services outage
- Instance outage
- Site outage

Validation checklist

Hard constraints you must respect

In VCF Operations, confirm each VCF instance is visible and healthy.
Confirm your automation provider and tenant access paths work with the chosen identity model.
Confirm backups are running and stored off the platform.

Troubleshooting workflow

This is not pricing. It is what actually moves your bill of materials.

Provisioning failures
- Check VCF Automation health and its integration to VCF Operations.
- Validate identity provider connectivity and token issuance.
- Validate network connectivity between fleet services and target instance.
Instance lifecycle failures
- Inspect SDDC Manager alarms and recent change history.
- Validate domain health and vCenter availability.
- Check for drift from out-of-band changes.
Cross-site weirdness
- Start with latency and MTU validation.
- Validate gateway HA behavior for stretched segments.
- Confirm site affinity rules for critical components.

Anti-patterns

This is your default starting posture unless you have a clear availability driver.

Treating a two-site stretched design like “just two data centers”.
Using a single fleet across regulated tenants when you actually need separate identity and change boundaries.
Running meaningful workloads in the management domain because it was “available”.
Designing IP space too tightly and assuming you can resize later.
Assuming multi-region means “active-active” without defining replication, orchestration, and capacity for failover.

Best practices

Standardize vocabulary in writing:
- Fleet -> Instance -> Domain -> Cluster
Keep fleet services highly available and backed up like any other Tier 0 platform.
Make identity a design board item, not an implementation checkbox.
Use domains as your lifecycle boundary:
- Patch domains, validate domains, roll back at domain boundaries.
Write failure-mode runbooks for:
- Fleet services down
- Instance down
- Site down
- Region down

Summary and takeaways

Your topology posture is an operating model decision, not just an architecture diagram.
Two sites in one region usually increases availability but also increases network and day-2 complexity.
Multi-region usually improves fault domain separation, but it introduces cross-region dependencies for fleet services unless you deliberately isolate with multiple fleets.
Decide identity scope and fleet count at day-0. The cost of changing later is always higher than the cost of deciding carefully now.

Conclusion

These apply to all topologies:

Sources

If you want architects, operators, and leadership aligned, you need a topology mental model that starts with VCF objects and only then maps to your physical sites.

VCF 9.0 GA Mental Model Part 6: Topology and Identity Boundaries for Single Site, Dual Site, and Multi-Region

TL;DR Scope: VMware Cloud Foundation 9.0.0.0 GA (primary platform build 24703748) and the associated 9.0 GA BOM levels for key components: SDDC Manager: 9.0.0.0 build 24703751 vCenter: 9.0.0.0 build 24755230…

VCF 9.0 GA Mental Model Part 5: Topology Patterns for Single Site, Two Sites, and Multi-Region

How to Start, Stop, and Restart Services in Linux

November 2024 Patch Tuesday: Four Critical and Three Zero-Days Among 158 Vulnerabilities Patched

How to Disable Package Updates in Ubuntu, Debian and Mint

Architecting the Foundation — LLM Function Calling and Toolchains

24 Best Command Line Tools to Monitor Linux Performance

How to Verify Debian and Ubuntu Packages Using MD5 Checksums

TL;DR

Architecture Diagram

Table of Contents

Scenario

Scope and version alignment

Version compatibility matrix

Core concepts: mapping physical topology to fleets and instances

The physical words that matter

The VCF objects you should use in every design discussion

Decision criteria you should agree on up front

Design-time decision criteria

Day-2 reality check questions

Challenge: choose your deployment posture

Solution A: Single site

Architecture tradeoff matrix

Cost model snapshot

One private cloud vs multiple fleets

When one fleet is enough

When you should operate multiple fleets

Identity and SSO boundary patterns

Challenge: unify access or isolate tenants

Solution A: Fleet-wide SSO

Solution C: Single instance SSO boundaries

Embedded vs appliance identity broker

Failure domain analysis

Failure domains you should model

Practical RTO and RPO examples you can use as starting targets

Day-0, day-1, day-2 map by topology

Day-0: decisions you should lock

Day-1: bring-up sequence that fits the object model

Day-2: operations you should operationalize early

Who owns what

Operational runbook snapshot

Weekly

Monthly

Quarterly

Validation checklist

Troubleshooting workflow

Anti-patterns

Best practices

Summary and takeaways

Conclusion

Sources

Next Post

Share this:

Like this:

Similar Posts