Over the weekend, I was troubleshooting an issue reported by one of our VMware Event Broker Appliance (VEBA) users who was helping test one of our upcoming features. The user found that after rebooting the VEBA appliance, the Antrea interfaces were no longer being re-created and pod networking appeared to be broken.
We initially thought it was related to switching to the latest Photon OS version or updating to the latest Antrea CNI release, since everything else was pretty much the same. Even after reverting both to the versions we initially had, the reboot issue persisted. Even stranger, the current shipping version of the VEBA (v0.6.1) OVA was not experiencing this issue and had no problems with an OS reboot, which is something I have done many times.
The only logical conclusion I could come up with was that a behavior change must have occurred within Photon OS between the time we built the previous appliance and now. While I was troubleshooting Antrea, it was pointed out that the Kubernetes (K8s) node was probably unhealthy and, if so, I might want to look at the kubelet logs to see if they provided any hints. I initially did not bother looking at the K8s layer, thinking this was related to a change in Antrea since it handles pod networking. Looking at the kubelet logs, I found a ton of entries like the following:
396 kubelet.go:2243] node "veba" not found
I thought this was a bit strange, especially as our appliance has its hostname configured with a Fully Qualified Domain Name (FQDN), which is veba.primp-industries.local, and we had proper entries in both /etc/hostname and /etc/hosts.
Sure enough, when I ran hostname, it returned the short hostname instead of the FQDN (which it had returned properly prior to the reboot).
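To make that comparison repeatable, here is a minimal sketch in Python that checks the hostname the kernel currently reports against the contents of /etc/hostname. It is not part of VEBA; the function names `is_short_name_of` and `check_hostname` are hypothetical, and the sketch assumes /etc/hostname holds the FQDN on a single line. As far as I understand, the kubelet derives its node name from the OS-reported hostname unless it is overridden, which is why this mismatch matters.

```python
import socket
from pathlib import Path


def is_short_name_of(runtime_name: str, configured_name: str) -> bool:
    """Return True when runtime_name is only the first label of configured_name."""
    runtime = runtime_name.strip().lower()
    configured = configured_name.strip().lower()
    # A mismatch of interest: names differ, but the configured FQDN
    # starts with the runtime name as its first DNS label.
    return runtime != configured and configured.split(".")[0] == runtime


def check_hostname(hostname_file: str = "/etc/hostname") -> str:
    """Compare the kernel-reported hostname with the configured one."""
    path = Path(hostname_file)
    if not path.exists():
        return f"{hostname_file} not found"
    configured = path.read_text().strip()
    runtime = socket.gethostname()  # what the OS reports right now
    if is_short_name_of(runtime, configured):
        return (f"mismatch: kernel hostname is '{runtime}', "
                f"{hostname_file} has '{configured}'")
    return f"kernel hostname '{runtime}', {hostname_file} has '{configured}'"


if __name__ == "__main__":
    print(check_hostname())
```

Running this before and after a reboot would show whether the runtime hostname has been truncated to its first label while the file on disk still holds the FQDN.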