
Dynamically Rebalance or Evacuate VKS Control Plane / Worker Nodes across vSphere Zones in VCF 9.0

12.16.2025 by William Lam

vSphere Zones in VMware Cloud Foundation (VCF) 9.0 have been enhanced to offer greater flexibility in resource consumption and isolation for vSphere Supervisor Control Plane VMs (Management), vSphere Kubernetes Service (VKS) Clusters (Workloads), or a combination of the two.


Depending on your required level of management availability and workload isolation, administrators have several vSphere Supervisor Zone deployment options to select from:

  • Single Management Zone with Combined Workload Zones Model
  • Single Management Zone with Isolated Workload Zones Model
  • Three Management Zones with Combined Workload Zones Model
  • Three Management Zones with Isolated Workload Zones Model

Note: The management zone selection (single vs. multi) is only configurable during the initial enablement of vSphere Supervisor. It is currently NOT possible to reconfigure vSphere Supervisor to switch from single-zone to multi-zone management without re-deploying the vSphere Supervisor.

However, for workloads running in a vSphere Namespace, such as VKS Clusters, users can start with a single workload zone and expand to additional zones as required, up to a maximum of three vSphere Zone assignments.

To demonstrate the flexibility of vSphere Zones in VCF 9.0, I have an environment configured with the following vSphere Zones:

  • sfo-m01-c01 corresponds to vz-01
  • sfo-m01-c02 corresponds to vz-w-01
  • sfo-m01-c03 corresponds to vz-w-02
  • sfo-m01-c04 corresponds to vz-w-03


I am using a Single Management Zone (vz-01) for running my vSphere Supervisor Control Plane VMs, and the remaining vSphere Zones will be used for my workloads.


I have a vSphere Namespace (ns-01) that is initially configured with a single vSphere Zone (vz-w-01), in which I have deployed my VKS Cluster (3 x Control Plane and 3 x Worker Nodes). Using the failureDomain property, I can specify my desired placement for my VKS Worker Nodes, but since I only have a single vSphere Zone, this is not required, as the nodes will automatically be deployed across all available vSphere Zones.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: vks01
  namespace: ns-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.156.0/20
    services:
      cidrBlocks:
        - 10.96.0.0/12
    serviceDomain: cluster.local
  topology:
    class: builtin-generic-v3.4.0
    classNamespace: vmware-system-vks-public
    version: v1.33.3---vmware.1-fips-vkr.1
    variables:
      - name: vsphereOptions
        value:
          persistentVolumes:
            defaultStorageClass: sfo-m01-cl01
      - name: kubernetes
        value:
          certificateRotation:
            enabled: true
            renewalDaysBeforeExpiry: 90
      - name: vmClass
        value: best-effort-small
      - name: storageClass
        value: sfo-m01-cl01
    controlPlane:
      replicas: 3
      metadata:
        annotations:
          run.tanzu.vmware.com/resolve-os-image: os-name=photon
    workers:
      machineDeployments:
        - class: node-pool
          name: np-1
          failureDomain: vz-w-01
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-2
          failureDomain: vz-w-01
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-3
          failureDomain: vz-w-01
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
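
A quick example of deploying the manifest above, assuming it is saved as vks01.yaml and that you are logged in to the Supervisor context for the vSphere Namespace:

kubectl apply -f vks01.yaml
kubectl -n ns-01 get cluster vks01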

As you can see from the screenshot below, I have my VKS Cluster running only in the vSphere Cluster that is mapped to vSphere Zone vz-w-01.

Rebalance Workloads across vSphere Zones

When you add additional vSphere Zones to your vSphere Namespace, net new workloads deployed after the operation will automatically take advantage of all available vSphere Zones. Existing workloads, however, will continue to run within the vSphere Zones they were initially deployed in. Fortunately, we can easily rebalance our VKS Cluster workload to gain additional levels of availability.

Step 1 - Add the additional vSphere Zones to your desired vSphere Namespace.

Step 2 - Update the machineDeployments section of your VKS Cluster YAML manifest and specify the desired vSphere Zone placement using the failureDomain property, as shown in the example below:

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: vks01
  namespace: ns-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.156.0/20
    services:
      cidrBlocks:
        - 10.96.0.0/12
    serviceDomain: cluster.local
  topology:
    class: builtin-generic-v3.4.0
    classNamespace: vmware-system-vks-public
    version: v1.33.3---vmware.1-fips-vkr.1
    variables:
      - name: vsphereOptions
        value:
          persistentVolumes:
            defaultStorageClass: sfo-m01-cl01
      - name: kubernetes
        value:
          certificateRotation:
            enabled: true
            renewalDaysBeforeExpiry: 90
      - name: vmClass
        value: best-effort-small
      - name: storageClass
        value: sfo-m01-cl01
    controlPlane:
      replicas: 3
      metadata:
        annotations:
          run.tanzu.vmware.com/resolve-os-image: os-name=photon
    workers:
      machineDeployments:
        - class: node-pool
          name: np-1
          failureDomain: vz-w-01
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-2
          failureDomain: vz-w-02
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-3
          failureDomain: vz-w-03
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon

Apply the YAML manifest using kubectl and you will see that the applicable VKS Worker Nodes will be re-deployed into the newly available vSphere Zones.
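
One way to monitor the rollout, assuming your logged-in context can read the Cluster API machine objects in the vSphere Namespace, is to watch each machine's vSphere Zone assignment:

kubectl -n ns-01 get machine -o custom-columns=NAME:.metadata.name,ZONE:.spec.failureDomain -w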


Note: The failureDomain property only applies to the VKS Worker Nodes; as we can see from the screenshot above, the VKS Control Plane VMs are still running in the initial vSphere Zone.

To change the VKS Control Plane VMs, we need to apply an additional change that requires access to vSphere Supervisor.

Step 3 - SSH to the vCenter Server Appliance (VCSA) using the root credentials and then run the following command, which will provide you with the IP Address and decrypted root password for the vSphere Supervisor Control Plane VM.

/usr/lib/vmware-wcp/decryptK8Pwd.py

SSH to the IP Address shown in the output using the root credentials to log in to the vSphere Supervisor Control Plane VM.
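
For example, substituting the IP Address returned by decryptK8Pwd.py:

ssh root@<SUPERVISOR-CP-IP>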

Step 4 - Run the following command to identify the VKS Control Plane VM machine IDs and their currently associated vSphere Zones (k here is simply a shorthand alias for kubectl):

k -n ns-01 get machine -l cluster.x-k8s.io/control-plane -o custom-columns=NAME:.metadata.name,ZONE:.spec.failureDomain
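
In this walkthrough, the output looks something like the following (illustrative; the machine names are the ones used in the next steps, and all three VKS Control Plane VMs are still in the initial vSphere Zone):

NAME                ZONE
vks01-8skvd-6n7j8   vz-w-01
vks01-8skvd-tpf2b   vz-w-01
vks01-8skvd-vflbb   vz-w-01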

Step 5 - Run the following command and specify the VKS Control Plane VM machine IDs that need to be re-deployed to take advantage of the available vSphere Zones:

k -n ns-01 annotate machine vks01-8skvd-tpf2b 'cluster.x-k8s.io/remediate-machine=""'
k -n ns-01 annotate machine vks01-8skvd-vflbb 'cluster.x-k8s.io/remediate-machine=""'
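
The cluster.x-k8s.io/remediate-machine annotation instructs Cluster API to remediate (delete and re-create) the annotated machines, and the replacement machines are placed based on the vSphere Zones that are now available. You can watch the rollover by re-running the Step 4 query with a watch flag:

k -n ns-01 get machine -l cluster.x-k8s.io/control-plane -o custom-columns=NAME:.metadata.name,ZONE:.spec.failureDomain -w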

As you can see from the screenshot below, we now have our existing VKS Cluster (3 x Control Plane and 3 x Worker Nodes) distributed across all three of the available vSphere Zones.

Evacuate Workloads from vSphere Zones

We can also remove a vSphere Zone from an existing vSphere Namespace for decommissioning or troubleshooting purposes.

Step 1 - Mark the desired vSphere Zone (vz-w-01) for removal; this ensures no additional workloads will be placed in that vSphere Zone.


Step 2 - Update the failureDomain property in your VKS Cluster YAML manifest to ensure the VKS Worker Nodes are not using the vSphere Zone that will be removed.

apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: vks01
  namespace: ns-01
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
        - 192.168.156.0/20
    services:
      cidrBlocks:
        - 10.96.0.0/12
    serviceDomain: cluster.local
  topology:
    class: builtin-generic-v3.4.0
    classNamespace: vmware-system-vks-public
    version: v1.33.3---vmware.1-fips-vkr.1
    variables:
      - name: vsphereOptions
        value:
          persistentVolumes:
            defaultStorageClass: sfo-m01-cl01
      - name: kubernetes
        value:
          certificateRotation:
            enabled: true
            renewalDaysBeforeExpiry: 90
      - name: vmClass
        value: best-effort-small
      - name: storageClass
        value: sfo-m01-cl01
    controlPlane:
      replicas: 3
      metadata:
        annotations:
          run.tanzu.vmware.com/resolve-os-image: os-name=photon
    workers:
      machineDeployments:
        - class: node-pool
          name: np-1
          failureDomain: vz-w-02
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-2
          failureDomain: vz-w-03
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
        - class: node-pool
          name: np-3
          failureDomain: vz-w-03
          replicas: 1
          metadata:
            annotations:
              run.tanzu.vmware.com/resolve-os-image: os-name=photon
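
As with the rebalancing workflow, apply the updated manifest (filename assumed) so that the np-1 Worker Node is re-deployed out of vz-w-01:

kubectl apply -f vks01.yaml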

Step 3 - Similar to the workload rebalancing workflow, we need to log in to the vSphere Supervisor, identify the machine IDs of the VKS Control Plane VMs running in the vSphere Zone that will be removed, and have them re-deployed:

k -n ns-01 get machine -l cluster.x-k8s.io/control-plane -o custom-columns=NAME:.metadata.name,ZONE:.spec.failureDomain
k -n ns-01 annotate machine vks01-8skvd-6n7j8 'cluster.x-k8s.io/remediate-machine=""'
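
Once remediation completes, a quick way to confirm nothing is left in vz-w-01 is to list every machine (Control Plane and Worker) along with its zone, which is the same query as before minus the control-plane label selector:

k -n ns-01 get machine -o custom-columns=NAME:.metadata.name,ZONE:.spec.failureDomain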


We have now successfully evacuated the applicable VKS Control Plane and Worker Nodes to the remaining vSphere Zones.
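
As a final sanity check from inside the VKS Cluster itself, the node topology labels should now only show the remaining vSphere Zones (assuming the standard topology.kubernetes.io/zone label is populated on the VKS nodes):

kubectl get nodes -L topology.kubernetes.io/zone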
