load balancer

Running sk8s (Simple Kubernetes) on VMC with an AWS Elastic Load Balancer

02.27.2019 by William Lam // Leave a Comment

Last week I wrote about a really nifty Virtual Appliance called sk8s which can be used to quickly set up a Kubernetes (k8s) cluster for development and testing purposes. If you have not checked out that article, be sure to give it a read first to get the full context. As mentioned in the previous article, sk8s runs great on any vSphere deployment, but it can also run on VMware Cloud on AWS (VMC), which adds an additional capability: an AWS Elastic Load Balancer (ELB) can automatically be provisioned and configured to front-end the k8s control plane for external access as part of the deployment.

The nice benefit of this is that you only need to configure access to the ELB and not directly to the underlying VMs running within the SDDC, which both simplifies the setup and reduces the need to expose the VMs directly to the internet. The write-up below is similar to that of the previous article, but it goes into greater detail on deploying to VMC, covering the required configuration changes within the VPC using the AWS Console and the Network and Security changes using the VMC Console.

Note: If you decide to use the integrated AWS ELB support, please be aware that you will be charged for the consumption. For pricing, please see the AWS documentation here.

Prerequisites:

Access to the VMC Console and VMC SDDC
NSX-T Logical Network with DHCP enabled
AWS Access & Secret Key for automatically creating ELB (Optional)
govc

Step 1 - Install govc on your local desktop, which must have access to your VMC vSphere environment. If you have not installed govc, the quickest way is to simply download the latest binary. Below is an example of installing the latest macOS version:

curl -L https://github.com/vmware/govmomi/releases/download/v0.20.0/govc_darwin_amd64.gz | gunzip > /usr/local/bin/govc
chmod +x /usr/local/bin/govc
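
Once the binary is in place, govc needs to know which vCenter Server to talk to. Here is a minimal sketch of how I would point it at the SDDC using govc's standard GOVC_* environment variables. The hostname, username and password below are placeholders that I made up for illustration, so substitute the values for your own environment.

```shell
# Placeholder values; substitute the details of your own SDDC.
export GOVC_URL='vcenter.example.com'        # hypothetical vCenter hostname
export GOVC_USERNAME='cloudadmin@vmc.local'  # assumed VMC admin account
export GOVC_PASSWORD='changeme'              # placeholder, not a real password

# Print basic vCenter details if govc is installed and the connection works.
if command -v govc >/dev/null 2>&1; then
    govc about || echo "govc could not reach $GOVC_URL; check the GOVC_* values" >&2
fi
```

If the connection works, govc about prints basic details about the vCenter Server, which is a quick way to confirm that your desktop can reach the environment before continuing.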

Step 2 - We need to verify a few settings in the AWS Console to ensure that the VPC that is connected to your SDDC is properly configured so that the provisioning of the ELB will be successful.

[Read more...]

How to automatically repoint & failover VCSA to another replicated Platform Services Controller (PSC)?

12.18.2015 by William Lam // 30 Comments

For those of you who read my previous article (if you have not read it, please do so before proceeding), at the very end I showed off a screenshot of a script that I had created for the vCenter Server Appliance (VCSA) which automatically monitors the health of the primary Platform Services Controller (PSC) it is connected to and, in the event of a failure, automatically repoints and fails over to another healthy PSC. The way it accomplishes this is by first deploying two externally replicated PSCs and then associating the VCSA with just the first PSC, which we will call our primary/preferred PSC node. Both PSCs are in an Active/Active configuration using multi-master replication, and any changes made in SSO on psc-01 (as shown in the diagram) will automatically be replicated to psc-02.

From a vCenter Server's point of view, it only gets its requests serviced by a single PSC, which is psc-01 as shown in the diagram above. Within the VCSA, there is a script, run as a cron job, that will periodically check psc-01's connectivity by performing a simple GET operation on the /websso endpoint. If it is unable to connect, the script will retry a certain number of times before declaring that the primary/preferred PSC node is no longer available. At this point, the script will automatically repoint the VCSA to the secondary PSC, and in a couple of minutes, any users who might have tried to log in to the vSphere Web Client will be able to log in. This happens transparently behind the scenes without any manual interaction. For users that have already logged in to vCenter Server, those sessions will continue to work unless they have timed out, in which case you would need to log back in.
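
To make the check-and-retry flow a bit more concrete, here is a rough sketch of how such logic could be written in shell. To be clear, this is not the actual checkPSCHealth.sh; the function names probe_psc and psc_is_down are hypothetical, and the use of curl with --insecure (to tolerate self-signed PSC certificates) is my assumption for the example.

```shell
#!/bin/bash
# Illustrative sketch only; this is not the actual checkPSCHealth.sh.

# probe_psc HOST
# Succeeds when an HTTPS GET on /websso gets any HTTP response back.
# --insecure is an assumption to tolerate self-signed PSC certificates.
probe_psc() {
    curl --insecure --silent --output /dev/null --max-time 10 "https://$1/websso/"
}

# psc_is_down PROBE HOST CHECKS SLEEP
# Runs PROBE against HOST up to CHECKS times, sleeping SLEEP seconds between
# attempts. Returns 0 (down) only if every attempt fails, 1 otherwise.
psc_is_down() {
    local probe=$1 host=$2 checks=$3 sleep_time=$4
    local attempt=1
    while [ "$attempt" -le "$checks" ]; do
        if "$probe" "$host"; then
            return 1    # the PSC answered, so it is not down
        fi
        if [ "$attempt" -lt "$checks" ]; then
            sleep "$sleep_time"    # wait before the next attempt
        fi
        attempt=$((attempt + 1))
    done
    return 0
}
```

With these two functions, something like psc_is_down probe_psc psc-01 3 30 would only report the node as down after three failed attempts spaced 30 seconds apart, at which point the repoint to the secondary PSC could be triggered. Passing the probe in as an argument keeps the retry loop easy to try out without a real PSC.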

The script is configurable in terms of the number of times to check the PSC for connectivity as well as the amount of time to wait in between each check. In addition, if you have an SMTP server configured on the vCenter Server, you can also specify an email address to which the script will send a notification after the failover, alerting administrators to the failed PSC node. Although this example is specific to the VCSA, a similar script could be developed on a Windows platform using the same core foundation.

Disclaimer: This script is not officially supported by VMware; it is intended as an example of what can be done with the cmsso-util utility. Use at your own risk.

To set up a similar configuration, you will need to perform the following:

Step 1 - Deploy two External PSCs that are replicated with each other. Ensure you select the "Join an SSO domain in an existing vCenter 6.0 platform services controller" option to set up replication and ensure you are joining the same SSO Site.

Step 2 - Deploy your VCSA and when asked to specify the PSC to connect to, specify the primary/preferred PSC node you had deployed earlier. In my example, this would be psc-01 as seen in the screenshot below.

Step 3 - Download the checkPSCHealth.sh script which can be found here.

Step 4 - SCP the script to your VCSA and store it under /root and then set it to be executable by running the following command:

chmod +x /root/checkPSCHealth.sh

Step 5 - Next, you will need to edit the script and adjust the variables listed below. They should all be self-explanatory, and if you do not have an SMTP server set up, you can leave the EMAIL_ADDRESS variable blank.

PRIMARY_PSC - IP/Hostname of Primary PSC
SECONDARY_PSC - IP/Hostname of Secondary PSC (must already be replicating with Primary PSC)
NUMBER_CHECKS - Number of times to check PSC connectivity before failing over (default: 3)
SLEEP_TIME - Number of seconds to wait in between checks (default: 30)
EMAIL_ADDRESS - Email when failover occurs
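
As a rough illustration, the edited variables might end up looking something like the following. The variable names and defaults are the ones listed above, but the hostnames are placeholders, and the is_positive_int helper is a hypothetical sanity check that I added for this example; it is not something the script necessarily contains.

```shell
# Variable names come from the list above; the hostnames are placeholders.
PRIMARY_PSC="psc-01.example.com"
SECONDARY_PSC="psc-02.example.com"
NUMBER_CHECKS=3
SLEEP_TIME=30
EMAIL_ADDRESS=""    # blank means no notification is sent

# is_positive_int VALUE: succeeds only for whole numbers greater than zero.
is_positive_int() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -gt 0 ]
}

# Warn early about values that would break the retry loop.
is_positive_int "$NUMBER_CHECKS" || echo "NUMBER_CHECKS must be a positive integer" >&2
is_positive_int "$SLEEP_TIME"    || echo "SLEEP_TIME must be a positive integer" >&2
```

With the defaults, the primary PSC is checked three times with 30 seconds in between, so a failover starts roughly a minute after the first failed check, plus however long each connection attempt takes to fail.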

Step 6 - Lastly, we need to set up a scheduled job using cron. To do so, you can run the following command:

crontab -e

Copy the snippet shown below into the crontab of the root user account. The first half just covers all the default paths and the libraries needed to perform the operation. I found that without these paths, you will run into issues calling cmsso-util, and I figured it was easiest to take all the VMware paths from running the env command and make them available. The very last line is what actually sets up the scheduling, and in the example below, it will automatically run the script every 5 minutes. You can set up even more complex rules on how to run the script; for more info, take a look here.

PATH=/sbin:/usr/sbin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/java/jre-vmware/bin:/opt/vmware/bin

SHELL=/bin/bash
VMWARE_VAPI_HOME=/usr/lib/vmware-vapi
VMWARE_RUN_FIRSTBOOTS=/bin/run-firstboot-scripts
VMWARE_DATA_DIR=/storage
VMWARE_INSTALL_PARAMETER=/bin/install-parameter
VMWARE_LOG_DIR=/var/log
VMWARE_OPENSSL_BIN=/usr/bin/openssl
VMWARE_TOMCAT=/opt/vmware/vfabric-tc-server-standard/tomcat-7.0.55.A.RELEASE
VMWARE_RUNTIME_DATA_DIR=/var
VMWARE_PYTHON_PATH=/usr/lib/vmware/site-packages
VMWARE_TMP_DIR=/var/tmp/vmware
VMWARE_PERFCHARTS_COMPONENT=perfcharts
VMWARE_PYTHON_MODULES_HOME=/usr/lib/vmware/site-packages/cis
VMWARE_JAVA_WRAPPER=/bin/heapsize_wrapper.sh
VMWARE_COMMON_JARS=/usr/lib/vmware/common-jars
VMWARE_TCROOT=/opt/vmware/vfabric-tc-server-standard
VMWARE_PYTHON_BIN=/opt/vmware/bin/python
VMWARE_CLOUDVM_RAM_SIZE=/usr/sbin/cloudvm-ram-size
VMWARE_VAPI_CFG_DIR=/etc/vmware/vmware-vapi
VMWARE_CFG_DIR=/etc/vmware
VMWARE_JAVA_HOME=/usr/java/jre-vmware

*/5 * * * * /root/checkPSCHealth.sh
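
Since the environment block above was taken from the output of the env command, here is a small sketch of how you could generate the equivalent lines on your own appliance rather than copying mine, which may differ between builds. The cron_env_header function name is hypothetical, and I am assuming that each variable of interest sits on a single line of env output.

```shell
# cron_env_header: reads `env` output on stdin and keeps only PATH, SHELL
# and the VMWARE_* variables, sorted for a stable result.
cron_env_header() {
    grep -E '^(PATH|SHELL|VMWARE_[A-Za-z0-9_]+)=' | LC_ALL=C sort
}

# Typical use on the appliance (review the result before pasting it in):
#   env | cron_env_header
```

The output can then be pasted at the top of the root crontab, above the schedule line.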

Step 7 - Finally, you will probably want to test the script to ensure it is doing what you expect. The easiest way to do this is by disconnecting the vNIC on psc-01; depending on how you have configured the script, it should automatically start the failover within a short amount of time. All operations are automatically logged to the system logs, which you can find under /var/log/messages.log, and I have also tagged the log entries with a prefix of vGhetto-PSC-HEALTH-CHECK so you can easily filter for those messages in Syslog, as seen in the screenshot below.
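
If you prefer the command line over a Syslog viewer, a simple grep on the tag does the job. Here is a minimal sketch; the health_check_entries function name is hypothetical, while the tag and the log path are the ones mentioned above.

```shell
LOG_TAG="vGhetto-PSC-HEALTH-CHECK"

# health_check_entries FILE: prints only the lines carrying the script's tag.
health_check_entries() {
    grep -F -- "$LOG_TAG" "$1"
}

# Example on the appliance, using the log path mentioned above:
#   health_check_entries /var/log/messages.log | tail -n 20
```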

If a failover occurs, the script will also log additional output to /root/psc-failover.log, which can be used to troubleshoot in case a failover was attempted but failed. To ensure that the script does not try to fail over again, it creates an empty file at /root/ran-psc-failover, which the script checks for at the beginning before proceeding. Once you have verified the script is doing what you expect, you will probably want to manually fail back the VCSA to the original PSC node and then remove the /root/ran-psc-failover file; otherwise the script will not run when it is scheduled to.
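
The marker file is a simple run-once guard. Here is a sketch of how that pattern can look in shell. The function names are hypothetical and this is not lifted from the script itself; only the /root/ran-psc-failover path comes from the description above.

```shell
# Path taken from the text above; overridable here only to ease testing.
FAILOVER_MARKER=${FAILOVER_MARKER:-/root/ran-psc-failover}

# failover_already_ran: succeeds if a previous failover left the marker behind.
failover_already_ran() {
    [ -e "$FAILOVER_MARKER" ]
}

# mark_failover_done: creates the empty marker file.
mark_failover_done() {
    : > "$FAILOVER_MARKER"
}

# clear_failover_marker: re-arms the check after a manual failback.
clear_failover_marker() {
    rm -f -- "$FAILOVER_MARKER"
}
```

A script using this guard would call failover_already_ran first and exit early when it succeeds, call mark_failover_done right after repointing, and leave clear_failover_marker to the administrator once the VCSA has been failed back to the original PSC.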

As mentioned earlier, though this is specifically for the VCSA, you can build a similar solution on a Windows system using Windows Task Scheduler and the scripting language of your choice. I, of course, highly recommend that customers take a look at the VCSA for its simplicity in management and deployment, but perhaps that's just my bias 🙂

What does load balancing the Platform Services Controller really give you?

12.16.2015 by William Lam // 22 Comments

The Platform Services Controller (PSC) is a new infrastructure component, first introduced in vSphere 6.0, that provides common services such as Single Sign-On, Licensing and Certificate Management capabilities for vCenter Server and other VMware-based products. A PSC can be deployed on the same system as the vCenter Server, referred to as an Embedded deployment, or outside of the vCenter Server, which is known as an External PSC deployment. The primary use case for having an External PSC is to be able to take advantage of the new Enhanced Linked Mode (ELM) feature which provides customers with a single pane of glass for managing all of their vCenter Servers from within the vSphere Web Client.

When customers start to plan and design their vSphere 6.0 architecture, a topic that is usually brought up for discussion is whether or not they should be load balancing a pair (up to four) of their PSCs. The idea behind using a load balancer is to provide higher levels of availability for their PSC infrastructure; however, it does come at an additional cost from both an Opex and a Capex standpoint. More importantly, given the added complexity, does it really provide you with what you think it does?

A couple of things stood out to me when I looked at the process (VMware KB 2113315) of setting up a load balancer (VMware NSX, F5 BIG-IP, & Citrix NetScaler) for your PSC:

The load balancer is not actually "load balancing" the incoming requests and spreading the load across the different backend PSC nodes
Although all PSCs behind the load balancer are in an Active/Active configuration (multi-master replication), the load balancer itself has been configured with affinity to just a single PSC node

When talking to customers, they are generally surprised when I mention the above observations. When replication is set up between two or more PSC nodes, all nodes are operating in an Active/Active configuration and any one of the PSC nodes can service incoming requests. However, in a load balanced configuration, a single PSC node is actually "affinitized" to the load balancer, and that node is the one used to provide services to the registered vCenter Servers. From the vCenter Server's point of view, only a single PSC is really active in servicing the requests even though all PSC nodes are technically in an Active/Active state. If you look at the implementation guides for the three supported load balancers (links above), you will see that this artificial "Active/Passive" behavior is actually accomplished by specifying a higher weight/priority on the primary or preferred PSC node.

So what exactly does load balancing the PSC really buy you? Well, it does provide you with higher levels of availability for your PSC infrastructure, but it does this by simply failing over to one of the other available PSC nodes when the primary/preferred PSC node is no longer available or responding. Prior to vSphere 6.0 Update 1, this was the only other option to provide higher availability to your PSC infrastructure outside of using vSphere HA and SMP-FT. If you ask me, this is a pretty complex and potentially costly solution just to get a basic automatic node failover without any of the real benefits of setting up a load balancer in the first place.

In vSphere 6.0 Update 1, we introduced a new capability that allows us to repoint an existing vCenter Server to another PSC node as long as it is part of the same SSO Domain. What is really interesting about this feature is that you can actually get a similar behavior to what you would have gotten with load balancing your PSC minus the added complexity and cost of actually setting up the load balancer and the associated configurations on the PSC.

In the diagram above, instead of using a load balancer as shown on the left, the alternative solution shown on the right is to manually "failover" or repoint to one of the other available and Active PSC nodes when the primary/preferred node is no longer responding. With this solution, you are still deploying the same number of PSCs and setting up replication between the PSC nodes, but instead of relying on the load balancer to perform the failover for you automatically, you would be performing this operation yourself by using the new repoint functionality. The biggest benefit here is that you get the same outcome as the load balanced configuration without the added complexity of setting up and managing a single or multiple load balancers, which in my opinion is a huge cost. At the end of the day, both solutions are fully supported by VMware and it is important to understand what capabilities are provided with using a load balancer and whether it makes sense for your organization to take on this complexity based on your SLAs.
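
For reference, the manual repoint is performed with the cmsso-util utility on the vCenter Server. Here is a minimal dry-run sketch that simply prints the command for review rather than running it. The repoint_command helper is hypothetical and psc-02 is a placeholder hostname for the surviving PSC node; please check the vSphere documentation for the exact options that apply to your version.

```shell
# repoint_command TARGET_PSC: prints the repoint command for review.
repoint_command() {
    printf 'cmsso-util repoint --repoint-psc %s\n' "$1"
}

# Dry run: show what would be executed for a secondary node named psc-02.
repoint_command psc-02
# To actually repoint, run the printed command on the vCenter Server itself.
```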

The only downside to this solution is that when a failure occurs with the primary/preferred PSC, manual intervention is required to repoint to one of the available Active PSC nodes. Would it not be cool if this was automated? ... 🙂

Well, I am glad you asked, as this is exactly what I had thought about. Below is a sneak peek at a log snippet from a script that I had prototyped for the VCSA which automatically runs a scheduled job to periodically check the health of the primary/preferred PSC node. When it detects a failure, it will retry N times, and when it concludes that the node has failed, it will automatically initiate a failover to the available Active PSC node. In addition, if you have an SMTP server configured on your vCenter Server, it can also send out an email notification about the failover. Stay tuned for a future blog post with more details on the script, which can be found here.