The Platform Services Controller (PSC) is a new infrastructure component that was first introduced in vSphere 6.0 that provides common services such as Single Sign-On, Licensing and Certificate Management capabilities for vCenter Server and other VMware-based products. A PSC can be deployed on the same system as the vCenter Server referred to as an Embedded deployment or outside of the vCenter Server which is known as an External PSC deployment. The primary use case for having an External PSC is to be able to take advantage of the new Enhanced Linked Mode (ELM) feature which provides customers with a single pane of glass for managing all of their vCenter Servers from within the vSphere Web Client.
When customers start to plan and design their vSphere 6.0 architecture, a topic that is usually brought up for discussion is whether or not they should be load balancing a pair (up to four) of their PSC's? The idea behind using a load balancer is to provider higher levels of availability for their PSC infrastructure, however it does come as an additional cost both from an Opex and Capex standpoint. More importantly, given the added complexity, does it really provide you with what you think it does?
A couple of things that stood out to me when I look at the process (VMware KB 2113315) of setting up a load balancer (VMware NSX, F5 BIG-IP, & Citrix NetScalar) for your PSC:
- The load balancer is not actually "load balancing" the incoming requests and spreading the load across the different backend PSC nodes
- Although all PSCs behind the load balancer is in an Active/Active configuration (multi-master replication), the load balancer itself has been configured to affinitzed to just a single PSC node
When talking to customers, they are generally surprised when I mention the above observations. When replication is setup between one or more PSC nodes, all nodes are operating in an Active/Active configuration and any one of the PSC nodes can service incoming requests. However, in a load balanced configuration, a single PSC node is actually "affinitized" to the load balancer which will be used to provide services to the registered vCenter Servers. From the vCenter Server's point of view, only a single PSC is really active in servicing the requests even though all PSCs nodes are technically in an Active/Active state. If you look at the implementation guides for the three supported load balancers (links above), you will see that this artificial "Active/Passive" behavior is actually accomplished by specifying a higher weight/priority on the primary or preferred PSC node.
So what exactly does load balancing the PSC really buy you? Well, it does provide you with a higher levels of availability for your PSC infrastructure, but it does this by simply failing over to one of the other available PSC nodes when the primary/preferred PSC node is no longer available or responding. Prior to vSphere 6.0 Update 1, this was the only other option to provide higher availability to your PSC infrastructure outside of using vSphere HA and SMP-FT. If you ask me, this is a pretty complex and potentially costly solution just to get a basic automatic node failover without any of the real benefits of setting up a load balancer in the first place.
In vSphere 6.0 Update 1, we introduced a new capability that allows us to repoint an existing vCenter Server to another PSC node as long as it is part of the same SSO Domain. What is really interesting about this feature is that you can actually get a similar behavior to what you would have gotten with load balancing your PSC minus the added complexity and cost of actually setting up the load balancer and the associated configurations on the PSC.
In the diagram above, instead of using a load balancer as shown in the left, the alternative solution that is shown to the right is to manually "failover" or repoint to the other available and Active PSC nodes when the primary/preferred is no longer responding. With this solution, you are still deploying the same number of PSC's and setting up replication between the PSC nodes, but instead of relying on the load balancer to perform the failover for you automatically, you would be performing this operation yourself by using the new repoint functionality. The biggest benefit here is that you get the same outcome as the load balanced configure without the added complexity of setting up and managing a single or multiple load balancers which in my opinion is huge cost. At the end of the day, both solutions are fully supported by VMware and it is important to understand what capabilities are provided with using a load balancer and whether it makes sense for your organization to take on this complexity based on your SLAs.
The only down side to this solution is that when a failure occurs with the primary/preferred PSC, a manual intervention is required to repoint to one of the available Active PSC nodes. Would it not be cool if this was automated? ... 🙂
Well, I am glad you asked as this is exactly what I had thought about. Below is a sneak peak at a log snippet for a script that I had prototyped for the VCSA which automatically runs a scheduled job to periodically check the health of the primary/preferred PSC node. When it detects a failure, it will retry N-number of times and when concludes that the node has failed, it will automatically initiate a failover to the available Active PSC node. In addition, if you have an SMTP server configured on your vCenter Server, it can also send out an email notification about the failover. Stay tune for a future blog post for more details on the script which can be found here.