How to automatically repoint & failover VCSA to another replicated Platform Services Controller (PSC)?

12.18.2015 by William Lam // 30 Comments

For those of you who read my previous article (if you have not read it, please do so before proceeding), at the very end I showed a screenshot of a script that I had created for the vCenter Server Appliance (VCSA). The script automatically monitors the health of the primary Platform Services Controller (PSC) that the VCSA is connected to and, in the event of a failure, automatically repoints the VCSA and fails over to another healthy PSC. The setup starts with deploying two external PSCs that replicate with each other and then associating the VCSA with just the first PSC, which we will call our primary/preferred PSC node. Both PSCs are in an Active/Active configuration using multi-master replication, and any changes made in SSO on psc-01 (as shown in the diagram) will automatically be replicated to psc-02.

From the vCenter Server's point of view, its requests are serviced by a single PSC only, which is psc-01 as shown in the diagram above. Within the VCSA, a script run from a cron job periodically checks psc-01's connectivity by performing a simple GET operation on the /websso endpoint. If it is unable to connect, the script retries a configurable number of times before declaring that the primary/preferred PSC node is no longer available. At this point, the script automatically repoints the VCSA to the secondary PSC, and within a couple of minutes any users who tried to log in to the vSphere Web Client will be able to log in. This happens transparently behind the scenes without any manual interaction. For users who have already logged in to vCenter Server, those sessions will continue to work unless they have timed out, in which case they would need to log back in.
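
As a rough illustration of the check-and-retry logic described above, the sketch below polls the /websso endpoint with curl and only reports failure after several consecutive misses. It is not the checkPSCHealth.sh script itself: the function names, the example hostname, and the 10-second timeout are assumptions made for illustration.

```shell
#!/bin/bash
# Illustrative sketch only; names below are hypothetical, not taken from checkPSCHealth.sh.
PRIMARY_PSC="${PRIMARY_PSC:-psc-01.example.com}"  # placeholder hostname
NUMBER_CHECKS="${NUMBER_CHECKS:-3}"               # attempts before giving up
SLEEP_TIME="${SLEEP_TIME:-30}"                    # seconds between attempts

# Succeeds if an HTTPS GET on /websso completes within 10 seconds.
# -k skips certificate validation, -s hides progress, -o discards the body.
# Without -f, curl succeeds on any HTTP response, so this tests connectivity only.
psc_is_reachable() {
    curl -k -s -o /dev/null --max-time 10 "https://$1/websso"
}

# Returns 0 as soon as one attempt succeeds, 1 once every attempt has failed.
psc_healthy_after_retries() {
    local host="$1"
    local attempt=1
    while [ "${attempt}" -le "${NUMBER_CHECKS}" ]; do
        if psc_is_reachable "${host}"; then
            return 0
        fi
        # Wait before the next attempt, but not after the final one.
        if [ "${attempt}" -lt "${NUMBER_CHECKS}" ]; then
            sleep "${SLEEP_TIME}"
        fi
        attempt=$((attempt + 1))
    done
    return 1
}
```

A caller would run `psc_healthy_after_retries "${PRIMARY_PSC}"` and start the repoint only when it returns non-zero. If the appliance is configured to use a proxy, adding `--noproxy '*'` to the curl call keeps the check from depending on that proxy.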

The script is configurable in terms of the number of times to check the PSC for connectivity as well as the amount of time to wait in between each check. In addition, if you have an SMTP server configured on the vCenter Server, you can also specify an email address to which the script sends a notification after the failover, alerting administrators to the failed PSC node. Although this example is specific to the VCSA, a similar script could be developed on a Windows platform using the same core foundation.
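
For the notification, one detail worth showing is the message layout: header lines such as Subject: need to start in the first column and be followed by a blank line, otherwise the mailer may not treat them as headers. The sketch below builds such a message. The function name and the use of `sendmail -t` for delivery are assumptions, not a description of the script's actual mail code.

```shell
#!/bin/bash
# Hypothetical helper: prints a simple mail message (headers, blank line, body) on stdout.
# $1 = recipient, $2 = vCenter hostname, $3 = PSC that is now in use
build_failover_email() {
    printf 'To: %s\n' "$1"
    printf 'Subject: PSC Failover Notification\n'
    printf '\n'   # blank line separates headers from body
    printf 'VC %s failed over to passive PSC %s at %s\n' "$2" "$3" "$(date)"
}

# Assumed delivery path (requires a working local mail setup):
#   build_failover_email "admin@example.com" "$(hostname)" "psc-02.example.com" \
#       | /usr/sbin/sendmail -t
```

Building the message with printf rather than an indented here-document avoids accidentally putting whitespace in front of the header lines.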

Disclaimer: This script is not officially supported by VMware; it is intended as an example of what can be done with the cmsso-util utility. Use at your own risk.

To set up a similar configuration, you will need to perform the following:

Step 1 - Deploy two external PSCs that replicate with each other. Select the "Join an SSO domain in an existing vCenter 6.0 platform services controller" option to set up replication, and make sure the second PSC joins the same SSO site as the first.
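
Before moving on, it can be worth confirming that the two PSCs really are replication partners. A minimal sketch, assuming the vdcrepadmin tool is present at its usual PSC 6.0 appliance path and that you run it on one of the PSCs; the wrapper function name is made up, and the host, user, and password are placeholders:

```shell
#!/bin/bash
# Assumed location of vdcrepadmin on a PSC 6.0 appliance; adjust if yours differs.
VDCREPADMIN="${VDCREPADMIN:-/usr/lib/vmware-vmdir/bin/vdcrepadmin}"

# Hypothetical wrapper: list the replication partners of a PSC.
# $1 = PSC hostname, $2 = SSO admin user name, $3 = password
# Note: the password is visible in the process list while the command runs.
show_replication_partners() {
    "${VDCREPADMIN}" -f showpartners -h "$1" -u "$2" -w "$3"
}

# Example (placeholder values):
#   show_replication_partners "psc-01.example.com" "administrator" 'your-sso-password'
```

If replication is set up as described in this step, the output for psc-01 should list psc-02 as a partner.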

Step 2 - Deploy your VCSA and when asked to specify the PSC to connect to, specify the primary/preferred PSC node you had deployed earlier. In my example, this would be psc-01 as seen in the screenshot below.

Step 3 - Download the checkPSCHealth.sh script which can be found here.

Step 4 - SCP the script to your VCSA and store it under /root and then set it to be executable by running the following command:

chmod +x /root/checkPSCHealth.sh

Step 5 - Next, edit the script and adjust the variables listed below. They should all be self-explanatory, and if you do not have an SMTP server set up, you can leave the EMAIL_ADDRESS variable blank.

PRIMARY_PSC - IP/Hostname of Primary PSC
SECONDARY_PSC - IP/Hostname of Secondary PSC (must already be replicating with Primary PSC)
NUMBER_CHECKS - Number of times to check PSC connectivity before failing over (default: 3)
SLEEP_TIME - Number of seconds to wait in between checks (default: 30)
EMAIL_ADDRESS - Email address to notify when a failover occurs (optional)
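
Put together, the edited settings might look like the sketch below. The variable names and defaults are the ones listed above; the hostnames are placeholders, and the `validate_settings` helper is a hypothetical addition that is not part of the downloaded script.

```shell
#!/bin/bash
# Settings as described in Step 5 (hostnames are placeholders).
PRIMARY_PSC="psc-01.example.com"
SECONDARY_PSC="psc-02.example.com"
NUMBER_CHECKS=3
SLEEP_TIME=30
EMAIL_ADDRESS=""   # leave blank if no SMTP server is configured

# Hypothetical sanity check: both PSCs set, both counters non-negative integers.
validate_settings() {
    [ -n "${PRIMARY_PSC}" ]   || { echo "PRIMARY_PSC is empty" >&2; return 1; }
    [ -n "${SECONDARY_PSC}" ] || { echo "SECONDARY_PSC is empty" >&2; return 1; }
    case "${NUMBER_CHECKS}" in
        ''|*[!0-9]*) echo "NUMBER_CHECKS must be a number" >&2; return 1 ;;
    esac
    case "${SLEEP_TIME}" in
        ''|*[!0-9]*) echo "SLEEP_TIME must be a number" >&2; return 1 ;;
    esac
    return 0
}
```

With the defaults, a dead PSC is declared unavailable after three failed checks spaced 30 seconds apart, so roughly a minute plus connection timeouts.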

Step 6 - Lastly, we need to set up a scheduled job using cron. To do so, run the following command:

crontab -e

Copy the snippet below into the crontab of the root user account. The first part sets the default paths and the environment variables that the operation expects. I found that without these paths you will run into issues calling cmsso-util, and I figured it was easiest to take all the VMware variables from the output of the env command and make them available. The very last line is what actually sets up the schedule; in the example below, it runs the script every 5 minutes. You can set up more complex rules for when to run the script; for more info, take a look here.

PATH=/sbin:/usr/sbin:/usr/local/sbin:/usr/local/bin:/usr/bin:/bin:/usr/X11R6/bin:/usr/games:/usr/lib/mit/bin:/usr/lib/mit/sbin:/usr/java/jre-vmware/bin:/opt/vmware/bin

SHELL=/bin/bash
VMWARE_VAPI_HOME=/usr/lib/vmware-vapi
VMWARE_RUN_FIRSTBOOTS=/bin/run-firstboot-scripts
VMWARE_DATA_DIR=/storage
VMWARE_INSTALL_PARAMETER=/bin/install-parameter
VMWARE_LOG_DIR=/var/log
VMWARE_OPENSSL_BIN=/usr/bin/openssl
VMWARE_TOMCAT=/opt/vmware/vfabric-tc-server-standard/tomcat-7.0.55.A.RELEASE
VMWARE_RUNTIME_DATA_DIR=/var
VMWARE_PYTHON_PATH=/usr/lib/vmware/site-packages
VMWARE_TMP_DIR=/var/tmp/vmware
VMWARE_PERFCHARTS_COMPONENT=perfcharts
VMWARE_PYTHON_MODULES_HOME=/usr/lib/vmware/site-packages/cis
VMWARE_JAVA_WRAPPER=/bin/heapsize_wrapper.sh
VMWARE_COMMON_JARS=/usr/lib/vmware/common-jars
VMWARE_TCROOT=/opt/vmware/vfabric-tc-server-standard
VMWARE_PYTHON_BIN=/opt/vmware/bin/python
VMWARE_CLOUDVM_RAM_SIZE=/usr/sbin/cloudvm-ram-size
VMWARE_VAPI_CFG_DIR=/etc/vmware/vmware-vapi
VMWARE_CFG_DIR=/etc/vmware
VMWARE_JAVA_HOME=/usr/java/jre-vmware

*/5 * * * * /root/checkPSCHealth.sh
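
The environment lines above come from the appliance's own shell environment, so the exact values can differ between builds. One way to regenerate them rather than copying the values shown here is sketched below. The function name is made up, and the filter pattern assumes the relevant variables are PATH, SHELL, and anything starting with VMWARE_.

```shell
#!/bin/bash
# Hypothetical helper: print the environment variables cron will need,
# one NAME=value per line, ready to paste at the top of the crontab.
cron_env_lines() {
    env | grep -E '^(PATH|SHELL|VMWARE_[A-Z0-9_]*)='
}

# Example: run as root on the VCSA, then paste the output above the schedule line.
#   cron_env_lines
```

This matters because cron starts jobs with a minimal environment, so anything the interactive shell provides has to be declared in the crontab explicitly.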

Step 7 - Finally, you will probably want to test the script to ensure it is doing what you expect. The easiest way to do this is to disconnect the vNIC on psc-01; depending on how you have configured the script, it should start the failover automatically within a short amount of time. All operations are logged to the system log, which you can find under /var/log/messages.log, and I have tagged the log entries with a prefix of vGhetto-PSC-HEALTH-CHECK so you can easily filter for those messages in syslog, as seen in the screenshot below.
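
To pull just the health-check entries out of the system log during such a test, something along these lines should do. The tag and the log path are the ones named above; the two function names are hypothetical, and using `logger` to write the entries is an assumption about how tagged lines would typically be produced.

```shell
#!/bin/bash
LOG_TAG="vGhetto-PSC-HEALTH-CHECK"

# Write one tagged line to syslog (assumes the standard logger utility is present).
log_health() {
    logger -t "${LOG_TAG}" "$*"
}

# Print only the tagged entries from a log file (defaults to the path named above).
show_health_log() {
    grep -- "${LOG_TAG}" "${1:-/var/log/messages.log}"
}

# Example: follow new entries live while psc-01 is disconnected.
#   tail -f /var/log/messages.log | grep -- "${LOG_TAG}"
```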

If a failover occurs, the script will also log additional output to /root/psc-failover.log, which can be used to troubleshoot a failover that was attempted but failed. To ensure that the script does not try to fail over again, it creates an empty file at /root/ran-psc-failover, which the script checks for at the beginning before proceeding. Once you have verified the script is doing what you expect, you will probably want to manually fail back the VCSA to the original PSC node and then remove the /root/ran-psc-failover file; otherwise the script will not perform its checks when it is scheduled to run.
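
The marker-file guard and the fail back can be sketched as follows. Only the two file paths and the `cmsso-util repoint --repoint-psc` command come from the setup described here; the function names are hypothetical, and writing the marker before the repoint (rather than after) is an assumption about ordering.

```shell
#!/bin/bash
# Illustrative sketch; only the two file paths and cmsso-util come from the article.
FAILOVER_MARKER="${FAILOVER_MARKER:-/root/ran-psc-failover}"
FAILOVER_LOG="${FAILOVER_LOG:-/root/psc-failover.log}"

# Repoint this VCSA to the PSC given as $1.
repoint_to() {
    cmsso-util repoint --repoint-psc "$1"
}

# Attempt a failover to $1 unless one has already been attempted.
failover_once() {
    if [ -e "${FAILOVER_MARKER}" ]; then
        echo "Failover already attempted; remove ${FAILOVER_MARKER} to re-arm."
        return 0
    fi
    # Assumption: the marker is written before repointing so that a failed
    # attempt is not retried on every cron run.
    touch "${FAILOVER_MARKER}"
    repoint_to "$1" >> "${FAILOVER_LOG}" 2>&1
}

# Manual fail back once the preferred PSC is healthy again (placeholder hostname):
#   repoint_to "psc-01.example.com" && rm -f "${FAILOVER_MARKER}"
```

Keeping the fail back manual matches the intent here: the script fails over once and then waits for an administrator to decide when it is safe to return to the preferred node.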

As mentioned earlier, though this is specifically for the VCSA, you can build a similar solution on a Windows system using Windows Task Scheduler and the scripting language of your choice. I, of course, highly recommend that customers take a look at the VCSA for its simplicity in management and deployment, but perhaps that's just my bias 🙂

Comments

Doug McIntyre says

12/18/2015 at 3:42 pm

The procedure for replicating 6.0 PSC's was to download an external python script, VMware-psc-ha-6.0.0.2503195, and do some more extensive steps copying certs, data & scripts back and forth.

Is that all built into 6.0 U1 now?
If so, cool, but I haven't heard of that being built in before?

- Ryan Johnston says
  
  12/18/2015 at 3:52 pm
  
  I believe this was only if you wanted to use a Load Balancer as well. If you just want two PSC's, just deploy and join the second to the same SSO site. Replication between the two takes care of itself as it is already.
  
- William Lam says
  
  12/18/2015 at 4:23 pm
  
  No, setting up PSC replication has always been available in the installer UI for both Windows/VCSA since vSphere 6.0 GA. When you deploy a PSC, you'll be asked to either set up a new instance or join an existing one, and this is how you would set up replication.
  
  The only extra requirement, as Ryan mentioned, applies if you plan to set up a load balancer: there's a set of scripts that you'll need to run to properly configure certs, etc. This is the added complexity of setting up a load balancer.
  
  - Qing L says
    
    02/09/2016 at 11:34 am
    
    Hi William,
    
    this is absolutely add-ons function for failover psc. Thanks!
    
    on the other hand, it would be nice if VMware team could add these kinds of scripts such as space monitor for vCSA 6.0 and PSC, etc. i am not great at scripting but i like to be able to grab and reconfigure for our vCenter infrastructure environment.
    
    in addition, for increase disk space, VMware only put vdx_servicecfg script on vCSA but not PSC. this make a sort of inconsistency.
    
    Regards/Qing
    
P. Cruiser says

12/18/2015 at 7:14 pm

Any chance of support for KB 2131191 which allows repointing between sites?

David Paulus says

12/23/2015 at 2:38 am

Very good article, but I use also SRM and vRealize operations, can you give me the commands to repoint SRM and VROPS to the new PSC?

jnewmaster says

01/06/2016 at 12:50 pm

Hi William,
Is there any chance something like this may be officially implemented within the VCSA at some point?
Also, if you have other PSCs at other sites would it be best if the other sites PSCs replicate with the primary PSC only or with both PSCs?
Thanks.

- William Lam says
  
  01/07/2016 at 8:07 am
  
  If you attended VMworld, there was a session on the future of PSC ... that's all I'll say 🙂
  
  - jnewmaster says
    
    01/07/2016 at 7:45 pm
    
    Went to VMworld. Missed that session. I'll give it a watch now. Thanks!
    
KB1IBT says

01/19/2016 at 7:17 am

Based on your comment about needing to delete /root/ran-psc-failover after testing, does that mean that if this fails over in production that you would need to delete that file in order for them to revert back to their preferred PSC?

- William Lam says
  
  01/19/2016 at 7:39 am
  
  The /root/ran-psc-failover file just ensures the script does not try to re-run after a failover. You will have to manually re-point to your preferred PSC and then remove the file to have the script continue checking. I've left this as a manual process as it's difficult to assume what *could* happen.
  
esx_buf says

02/01/2016 at 1:39 pm

Hi William,
last week I was able to test your great article/script, and it works with just a few modifications in our new environment (6.0U1). Because we have a proxy server in our company, the "curl check" doesn't work, and I think it makes no sense to use the proxy, because if the proxy server were down for patching the script might do a failover. So I use curl with the --noproxy argument, which works fine. The second thing I couldn't get to run was the repointing back to the primary PSC; it won't work with cmsso-util. After a search I found KB2113917, and that solution method worked for me.
Another great article and work, thanks a lot!
Regards,
Franz

jfordbos says

02/17/2016 at 1:40 pm

William, would this work when more than one SSO domain SITE is involved? Your diagram at the top shows a single vCenter and 2 PSC's, but it doesn't indicate that the PSC's are in seperate SSO 'sites'. Guess I'll have to build out a test environment and give it a try.

Along these lines, I'm wondering how best to configure our vCenter environment. We only need 2 vCenters for the entire US. Originally, I've been thinking that I'll install a first PSC, then point vCenter in that physical site to that PSC in that 'SSO site'. Then, install a PSC in the other city/physical site, joining to the SSO domain by pointing to the first installed PSC but configuring as a new SSO site instead of joining the existing SSO site in that first site that was installed. Finally, install the 2nd vCenter and point it to the 2nd install PSC in the same SSO domain but in its particular SSO site.

Then I started thinking---why configure a 2nd SSO site in the 2nd physical location? The WAN links are plenty fast if I ever needed to repoint a vCenter to the other physical site's PSC. Couldn't I simplify my deployment further by not creating a 2nd SSO site in the SSO domain? Why not just have the 2nd PSC join the existing SSO site that was created when the 1st PSC was deployed? Each vCenter would use the PSC specified in its PHYSICAL location at deployment time. Having just a single SSO site with 2 PSC's in it doesn't appear to be one of the many deployment topologies that VMware has made available in various KB's etc UNLESS it's behind a hardware load balancer----and I just plain don't want to use an LB for this deployment because I don't think the complexity is worth it. Thoughts?

- Qing Lin says
  
  02/17/2016 at 1:48 pm
  
  i have been thinking about these 2nd psc with 2nd sites for a while. but it worth to have the 2nd sites if they are in different geo. recently i ran into a problem with one of psc site sync issue with vcenter pointed to it. forturnately it does not screw up the 2nd sites info, so that i can re-point vc into different sites within sso domain.
  
  - jfordbos says
    
    02/18/2016 at 6:20 am
    
    Thanks for your comment. I still am looking for a definitive answer as to whether I can just have one logical SSO 'site' despite having the PSC's (2 in my case, 1 in each city) in seperate phyiscal locations. Why create a 2nd SSO site at all, given my intended layout of 1 vCenter and 1 PSC, per site, in 2 geographical/physical locations? Why not just have a single SSO site within the context of the SSO domain? What's the drawback, if any, of such a layout?
    
    From what I've seen of the configuration options, best practices, and so forth, there's nothing that suggests I can't/shouldn't have a single SSO site for my 2 physical locations. vCenter 'repointing' options that are supported as of 6.0 U1 are of 2 types: 'Intrasite' and 'Intersite', suggesting that I can failover from one PSC to another PSC 'intrasite' (in the single SSO site design I'm considering).
    
  - jfordbos says
    
    02/18/2016 at 8:07 am
    
    Okay---I just now noticed that in Step 1 above, William specifies that the 2nd PSC added must join the same SITE. This script, therefore, only works for PSC's within a singular SSO site, but something tells me William can make it possible. So now I'm going to tear down my test environment and create 2 PSC's but the PSC's will be part of the same SSO site and the vCenter in each physical location will be configured to point to its own PSC within each site. Then I can test William's script as well as go through the process in KB 2113917, which William's script essentially automates for INTRAsite repointing.
    
Ainz13 says

03/03/2016 at 12:20 pm

Great Script. What is the reason why the script does not "move services" after repointed?

Mehul says

03/08/2016 at 9:43 am

Could you please explain in detail how we copying these script and where we copying and what are the commands - I am not Linux experts - step by step with screen shot will be highly appreciate.

Joel says

04/26/2016 at 1:58 pm

William, your site is awesome and so is this script! It literally saved my department thousands of dollars. I just wanted to point out one minor thing. As the script is written currently the automatically generated email that is sent when a failover occurs will have no subject line and the line "Subject: PSC Failover Notification" as the first line in the body of the email. This is due to the spaces in front of the line in the script. It makes the script cleaner and easier to read but it messes with the layout of the email. By removing the spaces in front of the lines "Subject: PSC Failover Notification" and "VC ${VC_HOSTNAME} failed over to passive PSC ${SECONDARY_PSC} at $(date)" the email will look the way it is supposed to.

Thanks again for all your hard work and great ideas.

- William Lam says
  
  04/27/2016 at 8:36 am
  
  Hi Joel,
  
  I really appreciate you sharing this and thanks for the feedback. I've gone ahead and removed the leading whitespace and pushed the changes to my GitHub repo.
  
Domininc Laplante says

04/28/2016 at 8:42 am

Hi William,

you said "If you ask me, this is a pretty complex and potentially costly solution just to get a basic automatic node failover without any of the real benefits of setting up a load balancer in the first place."

If we already got our Load Balancer, I have to run the "SSO HA Script" on both PCS and add the Certificate in my Load balancer for the service that will use SSL ?

But that's a reliable deployment ?

David says

07/26/2016 at 8:19 am

Thank you for your work on this script. I am attempting to setup and having a slight issue...
# sh P10checkPSCHealth.sh
'10checkPSCHealth.sh: line 23: syntax error near unexpected token `
'10checkPSCHealth.sh: line 23: ` for i in $(seq 0 ${NUMBER_CHECKS});

This was originally line 31 but bash was complaining about blank lines, so I removed them.

Domininc Laplante says

09/07/2016 at 6:38 am

Does the PSC automatic FAILOVER will be include in the next release ? So no need to add the Load Balancer Layer ?

Thanks

txolson says

12/13/2016 at 12:50 pm

Thanks for this William (and all your great posts) -- Question: When I failover, , the behavior I see is that existing connections (fat client and web client) are disconnected which contradicts what you descibe above. I have 2 external PSCs and 2 vCenters in ELM. Any ideas?

Charles says

02/09/2017 at 5:33 pm

The script doesn't seem to work with VCSA 6.5

I've replaced the environment variables in cron with the ones from VCSA 6.5, as on this page, but I'm getting this error when I run the script with cron

2017-02-09T20:32:01.284060-05:00 pscas crond[11529]: (root) PAM ERROR (Permission denied)
2017-02-09T20:32:01.284145-05:00 pscas crond[11529]: (root) FAILED to authorize user with PAM (Permission denied)
2017-02-09T20:33:01.285258-05:00 pscas crond[11931]: PAM _pam_load_conf_file: unable to open config for password-auth
2017-02-09T20:33:01.285510-05:00 pscas crond[11931]: PAM _pam_load_conf_file: unable to open config for password-auth
2017-02-09T20:33:01.285591-05:00 pscas crond[11931]: PAM _pam_load_conf_file: unable to open config for password-auth

- Adek says
  
  05/08/2017 at 6:54 pm
  
  When trying to enable scheduled jobs via cron on VMware VCSA 6.5 you see errors below.
  
  2017-04-19T09:56:01.996673-04:00 VCSA crond[104661]: PAM _pam_load_conf_file: unable to open config for password-auth
  2017-04-19T09:56:01.996797-04:00 VCSA crond[104661]: PAM _pam_load_conf_file: unable to open config for password-auth
  2017-04-19T09:56:01.996907-04:00 VCSA crond[104661]: PAM _pam_load_conf_file: unable to open config for password-auth
  2017-04-19T09:56:01.997010-04:00 VCSA crond[104661]: (root) PAM ERROR (Permission denied)
  2017-04-19T09:56:01.997116-04:00 VCSA crond[104661]: (root) FAILED to authorize user with PAM (Permission denied)
  
  The contents of /etc/pam.d/crond had 3 references to “password-auth”, however there was no file in /etc/pam.d called “password-auth”. Changed “password-auth” to “system-auth” in /etc/pam.d/crond, as seen below, and everything works.
  
  account required pam_access.so
  account include system-auth
  session required pam_loginuid.so
  session include system-auth
  auth include system-auth
  
  IMPORTANT: I have no idea how this change may affect other services and cron jobs listed in the /etc/cron.hourly /etc/cron.daily and so on.
  
  - Adek says
    
    05/08/2017 at 7:40 pm
    
    UPDATE:
    After further investigation I discovered that creating password-auth file and populating it with same authentication policies as the system-auth e.g.
    
    auth required pam_tally2.so file=/var/log/tallylog deny=3 onerr=fail even_deny_root unlock_time=86400 root_unlock_time=300
    
    works as well.
    
    I guess by default the crond looks for authentication policy in password-auth and if the file does not exist throws error "PAM _pam_load_conf_file: unable to open config for password-auth"
    Creating this config file resolves the issue.
    
    You can also directly copy system-auth to password-auth by executing "cp system-auth password-auth" on the affected vCenter to quickly resolve the issue.
    
clack1987 says

03/15/2017 at 7:22 pm

Thanks for your great artical. How about this in the latest release vmware vcenter 6.5? did you have any testing for it?

clack1987 says

03/15/2017 at 7:28 pm

Repointing between Sites 1 Repointing the VMware vCenter Server 6.0 between Sites in a vSphere Domain (2131191)

Caution: This operation is no longer supported in vSphere 6.5 and running these steps can cause permanent damage.

VMware is terrible!

Totie Bash says

06/13/2017 at 11:47 pm

Hi William, this sucks if I have vCenter 6.5 HA, 2 external active-active PSC, no load balancer, VSAN stretch cluster. I have to leverage your script to auto-point to a working PSC (but thank you for the script dude). Best is vCenter 6.5 HA with embedded PSC you can leverage HA but you don't get enhance link mode. I am in the same logic as everyone else, I wish this is a built-in vCenter feature, to have primary secondary PSC entry so that vCenter can just detect automagically which PSC to use. Are you taking registration entries for VMware Fling contest?
