Over the weekend, I was troubleshooting an issue reported by one of our VMware Event Broker Appliance (VEBA) users, who was helping test one of our upcoming features. The user found that after rebooting the VEBA appliance, the Antrea interfaces were no longer being re-created and pod networking appeared to be broken.
We initially thought it was related to switching to the latest Photon OS version or updating to the latest Antrea CNI release, since everything else was pretty much the same. Even after reverting both to the versions we initially had, the reboot issue persisted. Even stranger, the current shipping version of the VEBA (v0.6.1) OVA was not experiencing this issue and had no problems with an OS reboot, which is something I have done many times.
The only logical conclusion I could come up with to explain this problem was that a behavior change must have occurred within Photon OS between the time we built the previous appliance and now. While troubleshooting Antrea, it was pointed out that the Kubernetes (K8s) node was probably unhealthy and, if so, I might want to look at the kubelet logs to see if they provided any hints. I initially did not bother looking at the K8s layer, thinking this was related to a change in Antrea, since it handles pod networking. Looking at the kubelet logs, I found a ton of entries like the following:
396 kubelet.go:2243] node "veba" not found
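If you want to check for the same symptom, here is a minimal sketch of a filter for those entries. It assumes kubelet runs as a systemd unit named kubelet so that its logs can be read with journalctl; the function name kubelet_node_errors is just an illustrative helper, not something that ships with the appliance.

```shell
#!/bin/sh
# Sketch: keep only the kubelet log lines that report a missing node.
# kubelet_node_errors is a hypothetical helper name; it reads log text on stdin.
kubelet_node_errors() {
    grep -E 'node "[^"]+" not found'
}

# On the appliance (assumes kubelet is a systemd unit named "kubelet"):
#   journalctl -u kubelet --no-pager | kubelet_node_errors
```

The node name inside the quotes is the interesting part: it shows which name kubelet is trying to register with.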
I thought this was a bit strange, especially as our appliance has its hostname configured with a Fully Qualified Domain Name (FQDN), veba.primp-industries.local, and we had proper entries in both /etc/hostname and /etc/hosts.
Sure enough, when I ran the hostname commands, they all returned the short hostname instead of the FQDN (which was returned properly prior to the reboot):
# hostname
veba
# hostname -s
veba
# hostname -f
veba
We use hostnamectl to change the hostname, so I manually ran that and then restarted kubelet and of course, everything started to function properly again!
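As a rough sketch, the manual recovery looks something like the following. It assumes kubelet is managed by systemd under the unit name kubelet; the restore_fqdn function and the RUN dry-run hook are hypothetical names added for illustration, so you can preview the commands before running them as root.

```shell
#!/bin/sh
# Sketch of the manual recovery: reset the FQDN, then restart kubelet.
# restore_fqdn and RUN are hypothetical names used for illustration.
# Setting RUN=echo prints the commands instead of executing them.
restore_fqdn() {
    fqdn="$1"
    # Restore the fully qualified hostname
    ${RUN:-} hostnamectl set-hostname "$fqdn"
    # Restart kubelet so it picks up the restored name (assumes a systemd unit named "kubelet")
    ${RUN:-} systemctl restart kubelet
}

# On the appliance (as root):
#   restore_fqdn veba.primp-industries.local
```

Afterwards, hostname -f should return the FQDN again.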
While doing an online search to figure out why this behavior was happening, I came across this thread, and it turns out this is somehow related to cloud-init resetting the hostname settings. There were a couple of solutions in the post, but I went with the quickest one, which was simply touching a file called /etc/cloud/cloud-init.disabled. After rebooting, the hostname was preserved. What was even more baffling is that we did not use cloud-init to set up Photon OS. It is built using Packer and leverages its Kickstart automation, so it's possible that cloud-init is used under the hood or that the service was somehow enabled in newer updates.
In any case, after configuring your hostname, make sure to create this file and you will not run into this problem.
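Here is a minimal sketch of that last step. The file path comes from the fix described above; the disable_cloud_init function and its optional root-prefix argument are hypothetical, added only so the step can be tried in a scratch directory first.

```shell
#!/bin/sh
# Sketch: create the marker file that tells cloud-init not to run at boot.
# disable_cloud_init is a hypothetical helper; its optional first argument is
# a root prefix for trying this in a scratch directory. Omit it on a real appliance.
disable_cloud_init() {
    root="${1:-}"
    marker="$root/etc/cloud/cloud-init.disabled"
    mkdir -p "$root/etc/cloud"
    touch "$marker"
    echo "created $marker"
}

# On the appliance (as root), after the hostname is configured:
#   disable_cloud_init
```

Run it after the hostname has been set, then reboot and confirm that hostname -f still returns the FQDN.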