With more and more folks trying out the new vSphere with Kubernetes capability, I have seen an uptick in questions, both internally and externally, not only around the initial setup of the infrastructure required for vSphere with Kubernetes but also around the configuration of a vSphere Cluster for Workload Management.
One of the most common questions is why there are no vSphere Clusters listed, or why a specific vSphere Cluster is showing up as Incompatible. There are a number of reasons this can occur, including vCenter Server not being able to communicate with NSX-T Manager to retrieve the list of NSX pre-checks, which would cause the list to either be empty or show the cluster as incompatible. Not having proper time sync between vCenter Server and NSX-T can also manifest in similar behavior, among other infrastructure issues.
Having run into some of these issues myself while developing my automation script, I figured it might be useful to share some of the troubleshooting tips I have used when trying to figure out what is going on, whether that is during the initial setup or while actually deploying workloads using vSphere with Kubernetes.
As an aside, if you are just getting started and want to quickly explore what vSphere with Kubernetes has to offer, one of the easiest ways is to leverage my vSphere 7 with Kubernetes Automation Lab Deployment Script. The script supports a number of customizations and can also be adjusted to deploy a minimal vSphere with Kubernetes environment that requires the least amount of physical resources, as explained in this blog post. I know this may not be for everyone as it uses Nested ESXi, but it certainly is the fastest and most consistent way to deploy a completely functional environment in less than 40 minutes!
Compatibility Checks
During the configuration of vSphere with Kubernetes, there are a number of compatibility checks that are performed to ensure the vSphere Cluster on which you wish to enable Workload Management is going to work. Today, the vSphere UI does not provide much detail around these incompatibilities, but the underlying vSphere with Kubernetes Management API does, and this can be used to understand what the issues are.
Luckily, we do not have to write any code; we can simply use DCLI (Datacenter CLI), which is available directly on the VCSA. The vCenter REST API namespace we are interested in is called Namespace Management (namespacemanagement), and below are the various "compatibility" checks which you can run. If you are interested in automating various aspects of vSphere with Kubernetes, be sure to check out this blog post by Vikas Shitole on how to get started.
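For example, after SSH'ing into the VCSA and launching the shell, you can either drop into DCLI's interactive mode (which provides tab completion for the namespaces used below; the flag assumes a recent DCLI version) or simply run the commands directly as shown in the sections that follow:

dcli +interactive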
vSphere Cluster Compatibility
This command will give you cluster-level checks to see whether a particular vSphere Cluster is compatible or not, along with a list of incompatibility reasons, which can also be useful for automation purposes.
dcli com vmware vcenter namespacemanagement clustercompatibility list
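If you plan to consume the output from a script, DCLI can also emit it in other formats. Assuming the +formatter option is available in your DCLI version, something like the following returns JSON, which is easier to parse programmatically:

dcli +formatter json com vmware vcenter namespacemanagement clustercompatibility list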
vSphere Distributed Switch Compatibility
This command will give you compatibility checks for the underlying network switch. It expects the ID of a specific vSphere Cluster, which you can retrieve from the previous command.
dcli com vmware vcenter namespacemanagement distributedswitchcompatibility list --cluster domain-c8
NSX-T Edge Compatibility
This last command will give you compatibility checks for the NSX-T Edge Cluster that you are expecting to use with your vSphere with Kubernetes Cluster. It expects both the ID of a specific vSphere Cluster as well as the UUID of the Distributed Virtual Switch, which you can find in the output of the previous command.
dcli com vmware vcenter namespacemanagement edgeclustercompatibility list --cluster domain-c8 --distributed-switch "50 1c 91 d0 d0 11 e5 b8-c5 a8 fa e1 0d f4 d2 6c"
Once the enablement of Workload Management has begun on a vSphere Cluster, the next most common question is whether the various errors and warnings that show up in the vSphere UI are something to be concerned about.
The simple answer is no, this is expected, and I know this will be addressed in a future update of vSphere with Kubernetes. You will see various warnings and errors like "HTTP communication could not be completed with status 404", but these can simply be ignored. The overall process can take anywhere from 30 minutes to an hour to complete depending on the size of your environment. You can confirm that everything was configured correctly when you refresh the vSphere UI and see the Cluster Config Status show "Running", as shown in the screenshot below.
If it has been more than an hour, then you most likely have a configuration issue and will need to debug further by looking at the relevant logs.
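If enablement appears to be completely stuck, a quick sanity check (assuming you have shell access to the VCSA) is to confirm that the Workload Management (wcp) service itself is still running before digging into the logs:

service-control --status wcp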
Logging
If you prefer to get a better sense of what is happening during the enablement, you can log in to the VCSA and take a look at some of the logs, which is especially useful for debugging when enablement has failed. The following two logs will give you information about general Workload Management enablement as well as anything related to NSX-T, which makes up a large portion of the initial configuration since various NSX-T components will be deployed (see the example after the list for a quick way to follow them).
- Workload Management Logs on VCSA: /var/log/vmware/wcp/wcpsvc.log
- NSX-T Logs on VCSA: /var/log/vmware/wcp/nsxd.log
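To follow the enablement in real time, you can simply tail the Workload Management log and/or filter it for errors, for example:

tail -f /var/log/vmware/wcp/wcpsvc.log
grep -i error /var/log/vmware/wcp/wcpsvc.log | tail -n 20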
In addition to the logs above, I have also personally found the NSX-T Manager API log to be useful, as you can see the exact error being returned from NSX-T Manager when a particular vSphere with Kubernetes operation has failed, such as requesting an IP Address from the Ingress IP Pool (a quick way to search it is shown after the path below). This stemmed from my experience using the NSX-T API, where sometimes the actual response from the API is not very clear on what the issue is, whereas the NSX-T API logs give greater detail and will usually pinpoint the issue. This was something I had found useful, and hopefully the team will consider it as another source of information which can be turned into an actionable task for users trying to self-troubleshoot.
- NSX-T Manager API Logs: /var/log/proton/nsxapi.log
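For example, if a request for an IP Address from the Ingress IP Pool has failed, a simple search of the NSX-T Manager API log for recent failures (the exact message text will vary by release) will usually surface the underlying cause:

grep -iE "error|fail" /var/log/proton/nsxapi.log | tail -n 20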
Troubleshooting Kubernetes
Once a vSphere Cluster has been enabled with Workload Management, you can start deploying workloads, whether that is a vSphere Pod VM running in the Supervisor Cluster or a Tanzu Kubernetes Grid (TKG) Cluster. At this point, you are interacting with and using Kubernetes, and if you are new to it, it can certainly be daunting when something is not working as expected or when you see a partial deployment of VMs but things are still not working.
The declarative nature of Kubernetes certainly makes this challenging, as the platform will attempt to deliver the desired state, but if there are not enough resources or there are underlying configuration issues, it will simply keep trying and/or wait until the issue is resolved. This definitely took some time to get used to, as the errors may not always be apparent and you need to look at the Kubernetes-specific events. Luckily, as part of the vSphere with Kubernetes integration, you can quickly see all relevant events under the specific vSphere Namespace which you created to start deploying workloads.
To do so, select your vSphere Namespace and then navigate to Monitor->Kubernetes to view the Kubernetes events. To simulate a resource issue, I created a very constrained vSphere Cluster with insufficient resources and attempted to provision a TKG Cluster. Looking at the screenshot below, we can quickly see why the provisioning of the TKG Cluster has not progressed further.
One thing to note is that the Kubernetes events UI is not automatically refreshed, which means you must explicitly refresh to see updated events; this can be difficult when troubleshooting in real time. You can always use the filter to look for warning or error messages.
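If you prefer the command line, the same events can also be retrieved with kubectl once you are logged into the Supervisor Cluster; the namespace name below is just a placeholder, so substitute your own vSphere Namespace:

kubectl get events -n your-namespace --sort-by=.lastTimestamp
kubectl get events -n your-namespace --field-selector type=Warning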
Another tool which I have been using quite a bit when working with Kubernetes is Octant. I wrote about it in my Useful Interactive Terminal and Graphical UI tools for Kubernetes blog post, which I highly recommend for learning about other useful tools. Octant not only provides a graphical UI to easily explore and interact with a Kubernetes Cluster, including a vSphere with Kubernetes Cluster, but it also offers a real-time refresh of events and logs, which I have found to be extremely useful. You simply run the Octant binary and it will automatically launch your web browser; if you are in the context of your vSphere Namespace, you simply scroll down to the very bottom to quickly see what events are happening, and this is the exact same data which the vSphere UI is providing.
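To point Octant at a Supervisor Cluster, you can first authenticate using the vSphere plugin for kubectl and then launch the binary; the server address and username below are placeholders for your own environment:

kubectl vsphere login --server=<supervisor-cluster-ip> --vsphere-username administrator@vsphere.local --insecure-skip-tls-verify
octant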
Hopefully the above tips will be useful as you start to explore the capabilities of vSphere with Kubernetes. If you have any other tips/tricks, feel free to leave a comment as it may help others.
Eric says
Thanks for posting this. Have been digging through logs for issues above so good to know they are normal. Off to look at more logs to figure out why it's failing, though.
Chiang Yen Lee says
Thanks for the octant tip, I will be sure to try it out. And all thanks to your deployment script, I was able to set up my homelab to test out vSphere with Kubernetes.
However, I have encountered an annoying problem. My vSphere Pods always lose network connectivity after a full vSphere cluster reboot. The status shows that they are running, but there is no network. It was working perfectly fine before vSphere was shut down.
The only way to solve this is to use kubectl delete pod --force to remove the pod and redeploy.
But the Harbor registry, which is a bunch of vSphere Pods, does not allow me to manage it directly using kubectl. When I try to disable the registry because it fails to run correctly after the reboot, it throws an error because it timed out. Do you have any idea how I can resolve the issue?
Cheers
Roberto Covarrubias says
William,
Is there a way to automatically clean up a failed WCP deployment? I am mostly interested in the NSX-T cleanup part.
Thank you!
Mamata Desai says
William,
Is vSAN necessary for deploying the VCF management cluster? If we use another policy-based storage (vVol), would it be possible to proceed with the deployment, even though it is unsupported?
Thanks!
--m
kastro says
I have a problem where only the first VM (ControlPlane) gets deployed; the workers are never deployed, so I am stuck.
Using Nested ESXi, vSAN, vDS, HAProxy (3 NICs) mode. haproxy.cfg gets updated with the ControlPlane IP, but that's all.
Any ideas?
ReconcileFailure wcpcluster/tkg-cluster unexpected error while reconciling control plane endpoint for tkg-cluster: failed to reconcile loadbalanced endpoint for WCPCluster tanzu-namespace-01/tkg-cluster: failed to get control plane endpoint for Cluster tanzu-namespace-01/tkg-cluster: VirtualMachineService LB does not yet have VIP assigned: VirtualMachineService LoadBalancer does not have any Ingresses
cytroon says
The same issue (warning) here. ESXi (NUC), vDS, HAproxy (3 NICs).
kastro says
haproxy.cfg gets populated with the VIP address and also the backend, but kubectl still doesn't recognize it.
Waiting for control plane to pass control plane health check to continue reconciliation: tanzu-ns-01/tkc-01: Get https://172.16.97.65:6443/api?timeout=30s: dial tcp 172.16.97.65:6443: connect: connection refused
Waiting for control plane to pass control plane health check to continue reconciliation: tanzu-ns-01/tkc-01: Get https://172.16.97.65:6443/api?timeout=30s: net/http: TLS handshake timeout
And also if I try this VIP:
curl --insecure -X GET https://172.16.96.13:6443
{
"kind": "Status",
"apiVersion": "v1",
"metadata": {
},
"status": "Failure",
"message": "forbidden: User \"system:anonymous\" cannot get path \"/\"",
"reason": "Forbidden",
"details": {
},
"code": 403
cytroon says
Problem solved. 😉 Frontend IP 10.10.40.6/24 is NOT outside of the Load Balancer IP Range 10.10.40.128/25. IP 10.10.40.6/25 is. 🙂 #BasicNetworkKnowledge #CIDRCalculatoRullezzzz https://www.ipaddressguide.com/cidr
kastro says
Nice to have it resolved.
I have also tried the "Default" 2-NIC setup, without a Frontend IP, only LB/Virtual Servers and everything in one subnet (Workload), with the same errors.
Could you show your IP ranges and CIDRs as an example?
Thanx
Rajesh says
I initiated the WCP cluster enablement and it failed. Now I am retrying the operation and getting the error below. Any idea how I can fix this?
Username: *protected email*
Password: ********************
Do you want to save credentials in the credstore? (y or n) [y]:y
|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|cluster  |compatible|incompatibility_reasons                                                                                                                                            |
|---------|----------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|domain-c8|False     |Cluster domain-c8 does not have at least 1 host(s) required to enable Workload Management.                                                                        |
|         |          |All hosts in cluster domain-c8 must be configured with at least 2 physical CPU threads to run supervisor cluster of size TINY. Consider choosing a smaller supervisor cluster size.