One thing I love about the VMware Community is the constant sharing of knowledge and information. I always enjoy discovering new tricks and tidbits from the community, especially as it helps me refine my own knowledge and understanding of a given technology or solution.
My good buddy Ariel Sanchez cc'ed me on Twitter yesterday referencing a blog post by Paul Wilk about an issue he was observing in his Nested ESXi environment when configuring vSphere with Tanzu.
This is interesting! Wonder if @lamw ir @eric_shanks have ever seen something like it
- Ariel Sanchez Mora (@arielsanchezmor) November 15, 2020
This was in regard to the dreaded 404 message displayed in the vSphere UI:
HTTP communication could not be completed with status 404
which is actually not unique to a Nested environment. In fact, this cryptic error message was observed even in the first release of vSphere with Tanzu, which was called vSphere with Kubernetes when it shipped with vSphere 7.0.
Although Paul's conclusion on why his fix worked was not exactly correct, it was the fix itself that I was most interested in. Even with the initial vSphere 7.0 release, I had assumed this was just a cosmetic vCenter Server error message. It was not ideal, but like many other customers, I just ignored it as the enablement of Workload Management was still successful.
What helped me connect the dots was the fact that Paul solved the problem by disabling the ESXi firewall, which meant this was actually an ESXi issue. Given this was related to the OVF deployment, I immediately knew what was going on: it is related to an earlier blog post I had shared about a new feature that allows ESXi to "pull" remote OVF/OVA files from an HTTP(S) endpoint. In this case, it was not OVFTool driving the deployment but rather vCenter Server and the Content Library service, which is also responsible for OVF/OVA deployments.
It turns out that as part of deploying the Supervisor VMs, instead of using the typical "push" method for uploading an OVA, vCenter instructs the ESXi host to "pull" the OVA files remotely, and those files are actually hosted on the vCenter Server Appliance (VCSA) itself. Because the ESXi firewall does not allow outbound connections to the port on which the VCSA serves the OVA, the "pull" method fails and the deployment automatically falls back to the old "push" method. This is why you see the error message and then the deployment immediately continues.
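The pull-then-push behavior described above can be illustrated with a small Python sketch. This is not vCenter's code: `deploy_ova`, `blocked_pull` and `ok_transfer` are hypothetical names, and the "transfer" functions are stand-ins that only simulate success or a blocked connection.

```python
from typing import Callable


def deploy_ova(pull: Callable[[], None], push: Callable[[], None]) -> str:
    """Try the host-side "pull" first and fall back to "push" if it fails.

    Both callables are hypothetical stand-ins for the real transfer steps.
    Returns the name of the method that ended up being used.
    """
    try:
        pull()
        return "pull"
    except OSError as err:
        # This is roughly the point where the UI would surface the error
        # before the deployment carries on with the older method.
        print(f"pull failed ({err}); falling back to push")
        push()
        return "push"


def blocked_pull() -> None:
    # Simulates the ESXi firewall blocking outbound traffic to port 5480.
    raise ConnectionError("outbound connection to port 5480 blocked")


def ok_transfer() -> None:
    # Simulates a transfer that completes without error.
    pass


print(deploy_ova(blocked_pull, ok_transfer))
print(deploy_ova(ok_transfer, ok_transfer))
```

The point of the sketch is only that a failed pull is not fatal: the error is reported and the deployment still completes, which matches what shows up in the vSphere UI.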
It took a bit more digging to figure out which port the VCSA was actually serving the OVA file on, because I would have assumed it was 443. It turns out it is being served on 5480, which is also the port for the Virtual Appliance Management Interface (VAMI), which I suspect is due to the fact that it has lighttpd running. The way I figured out 5480 was that I had been spending some time with the vSphere with Tanzu configuration file, which is stored under /etc/vmware/wcp/wcpsvc.yaml, and there is a commented-out setting that points to the OVF for the WCP Agent VM, which is another name for the Supervisor VM:
ovfurl: 'https://this_vc_pnid:5480/wcpagent/photon-ova-%%SIZE%%.ovf'
As you can see from the example, it defaults to 5480. Looking at the URL path, I was able to determine where these files actually lived on the VCSA filesystem and I found there was a symlink from /opt/vmware/share/htdocs/wcpagent pointing to /storage/lifecycle/vmware-wcp/wcpagent which is where the Supervisor VM (Photon) OVAs are stored on the VCSA.
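To make the mapping from URL to filesystem concrete, here is a minimal Python sketch that pulls the port out of the ovfurl template and resolves the path through the symlink target mentioned above. The function name `locate` and the size value "tiny" are assumptions used for illustration; the actual values substituted for %%SIZE%% are not shown in the configuration file.

```python
import posixpath
from typing import Tuple
from urllib.parse import urlsplit

# Template taken from the commented-out line in wcpsvc.yaml.
OVF_URL = "https://this_vc_pnid:5480/wcpagent/photon-ova-%%SIZE%%.ovf"

# Symlink target described in the text.
WCPAGENT_TARGET = "/storage/lifecycle/vmware-wcp/wcpagent"


def locate(ovf_url: str, size: str) -> Tuple[int, str]:
    """Return (port, expected on-disk path) for a given ovfurl template.

    `size` is a hypothetical value substituted for the %%SIZE%% placeholder.
    """
    parts = urlsplit(ovf_url.replace("%%SIZE%%", size))
    # Fall back to the scheme default when the URL carries no explicit port.
    port = parts.port or (443 if parts.scheme == "https" else 80)
    prefix = "/wcpagent/"
    if not parts.path.startswith(prefix):
        raise ValueError(f"unexpected URL path: {parts.path}")
    on_disk = posixpath.join(WCPAGENT_TARGET, parts.path[len(prefix):])
    return port, on_disk


print(locate(OVF_URL, "tiny"))
```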
To actually confirm my suspicion, I needed to configure the ESXi host to allow outbound connectivity to 5480. To do that, I had to use one of my older blog articles from 2011 on how to create a custom ESXi firewall rule, since 5480 is not one of the default ports available for configuration.
Create /etc/vmware/firewall/wcp.xml on the ESXi host with the following configuration:
<ConfigRoot>
  <service>
    <id>wcp</id>
    <rule id='0006'>
      <direction>outbound</direction>
      <protocol>tcp</protocol>
      <porttype>src</porttype>
      <port>5480</port>
    </rule>
  </service>
</ConfigRoot>
Then run the following two ESXCLI commands to load our new firewall configuration and enable the new ruleset:
esxcli network firewall refresh
esxcli network firewall ruleset set -e true -r wcp
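A malformed wcp.xml will not load (a single missing closing tag is enough), so it can be worth checking the file for well-formedness before copying it to a host. The following is a minimal sketch using Python's standard library; `summarize_ruleset` is a hypothetical helper, and it only checks that the XML parses and reports the fields it finds. It does not validate the rule against ESXi's own schema.

```python
import xml.etree.ElementTree as ET

# Same content as the wcp.xml shown above.
WCP_XML = """<ConfigRoot>
  <service>
    <id>wcp</id>
    <rule id='0006'>
      <direction>outbound</direction>
      <protocol>tcp</protocol>
      <porttype>src</porttype>
      <port>5480</port>
    </rule>
  </service>
</ConfigRoot>"""


def summarize_ruleset(xml_text: str) -> dict:
    """Parse a single-service, single-rule firewall file and return its fields.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    service = root.find("service")
    rule = service.find("rule")
    return {
        "ruleset": service.findtext("id"),
        "direction": rule.findtext("direction"),
        "protocol": rule.findtext("protocol"),
        "porttype": rule.findtext("porttype"),
        "port": int(rule.findtext("port")),
    }


print(summarize_ruleset(WCP_XML))
```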
If we now enable Workload Management on our vSphere with Tanzu cluster, you will see that the "Download remote files" task no longer throws a 404 but progresses as expected!
So now that we know why this is happening, the custom ESXi firewall rule is not really a good solution. Since we do not allow for any custom firewall policies in ESXi, customers must create this XML file and then package it up into a custom VIB for any type of automated and scalable solution. It is also not ideal because the optimized deployment workflow should just work out of the box, and if we do require ports to be opened on ESXi, that should be done as part of the service and then disabled when no longer required.
Lastly, I did find it strange that we would host the OVA files behind something other than 443, which is the common choice when serving files over HTTP(S). The VCSA does have another web server, which I thought would have made more sense: the one that serves the main landing page on 443. Since the OVA files are not actually stored in the current htdocs folder but rather symlinked, a quicker and more ideal permanent solution is to just symlink the OVA files into the VCSA's primary htdocs directory and then update the OVA URL in the wcpsvc.yaml configuration file. The other really nice benefit is that you do not have to make any changes to the ESXi firewall or mess with custom firewall policies.
Disclaimer: This is not officially supported by VMware, especially as changes to the VCSA filesystem can be reverted the next time it is patched or upgraded.
Step 1 - SSH to the VCSA, change into the /etc/vmware-vpx/docRoot directory and then run the following command to create the symlink:
ln -s /storage/lifecycle/vmware-wcp/wcpagent wcpagent
Step 2 - Edit /etc/vmware/wcp/wcpsvc.yaml, uncomment the kubevm and ovfurl lines, replace the address with either the hostname or IP address of the VCSA, and remove the port 5480:
kubevm:
  ovfurl: 'https://192.168.30.200/wcpagent/photon-ova-%%SIZE%%.ovf'
Step 3 - Restart the wcp service for the changes to go into effect:
service-control --restart wcp
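The easiest part of Step 2 to get wrong is leaving :5480 in the URL, in which case ESXi will still try to pull from the blocked port. As a small illustration, this Python sketch strips an explicit port from an ovfurl value while leaving the %%SIZE%% placeholder untouched. `drop_port` is a hypothetical helper, not something shipped on the VCSA, and it assumes the URL carries no username or password.

```python
from urllib.parse import urlsplit, urlunsplit


def drop_port(ovf_url: str) -> str:
    """Return the URL without an explicit port, so the scheme default applies."""
    parts = urlsplit(ovf_url)
    # Note: urlsplit().hostname is lowercased and excludes the port.
    host = parts.hostname or ""
    if ":" in host:
        # An IPv6 literal needs its brackets restored.
        host = f"[{host}]"
    return urlunsplit((parts.scheme, host, parts.path, parts.query, parts.fragment))


print(drop_port("https://192.168.30.200:5480/wcpagent/photon-ova-%%SIZE%%.ovf"))
```

Running it against a URL that has no port returns the URL unchanged, so it is safe to apply more than once.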
So there you have it: the reason for, and a solution to, the HTTP 404 error that shows up when enabling vSphere with Tanzu. I will definitely be sharing this analysis with the Engineering team in case they were not aware. Hopefully this will be resolved in a future update so that the error no longer shows up and the system automatically does the right thing.
Carlos says
Nice, many of us where intrigued by this error, specially while trying to troubleshoot why deployment would not work.
Now in your snippet for firewall rule you put porttype "src", but if this is the ESXi consuming the file, wouldn't it be "dst" ?
William Lam says
No, its "src" as I actually tested and verified myself 🙂
Carlos says
Well, that I don't understand then.
Carlos says
were were were 🙂
Steve Ballmer says
Mr. Lam great work as usual.
Jonathan says
Hello,
It's missing the at the end of the config file.
Without it, it won't be able to load the rule.
William Lam says
Thanks for the catch Jonathan! I've fixed it up and removed the other comments 🙂
Jonathan says
Yes sorry for this, I could not delete them by myself 🙂
FoW says
Symbolic link way does not works in 7.0.2.00200.
Eventually I had to turn off the firewall for the entire supervisor cluster. And even now.
FoW says
Sorry, sir.
It was my mistake in confusing situations.
The remote file download and OVF deployment fail on the first attempt, but the OVF deployment is attempted again and succeeds.
The supervisor node configuration fails, but this seems to be a different issue.
It works fine.
Justin says
Before any attempt at a fix, I am getting HTTP communication could not be completed with status 404
fails and then retries and works
tried the above
cd /etc/vmware-vpx/docRoot
ln -s /storage/lifecycle/vmware-wcp/wcpagent wcpagent
kubevm:
ovfurl: 'https://192.168.1.20:5480/wcpagent/photon-ova-%%SIZE%%.ovf'
service-control --restart wcp
Still fails the first time, redeploys
Andrew Wood says
Did you drop the port 5480 from the ovfurl?
gokou340 says
Just an FYI, this has not been fixed in ESXi 8. I am actively deploying Tanzu in our environment that is on ESXi 8.0b and it still gets the same error.
radioplankton says
In case you try William's solution and it doesn't work, you also have to enable the outbound "httpClient" rule on your ESXi hosts.
raudi says
The firewall config file is wrong, you wrote:
src
but it must be:
dst