Tanzu

vSphere with Tanzu stuck in waiting for ESXi hosts to be ready

Robin posted Oct 13, 2024 01:53 PM

Hi,

I posted the same question in the VMware vSphere forum before discovering this one, so apologies if it now appears twice:

I am currently facing issues when deploying Workload Management on the vSphere cluster in my homelab. I'm using NSX networking + ALB, and everything is set up according to the documentation as far as I can tell, but the installation hangs on host preparation. In vCenter I can see that the task "starting service" is repeated for each of the ESXi hosts roughly every 3 minutes. The Spherelet service is running and shows no errors in /var/log/spherelet.log.

On each host in the WCP setup, a warning is shown: "Kubernetes Worker Node is schedulable A general system error occurred. Error message: waiting for node esxhost1.example.com to move to ready state." The configuration task is stuck at this point.

The configuration of all Supervisor VMs completed successfully.

When I connect to one of the Supervisors and run kubectl get node, all of my ESXi hosts are listed and reported as Ready.

root@421c3a4c505c2409b59c1114ece9025d [ ~ ]# kubectl get node
NAME                               STATUS   ROLES                  AGE   VERSION
421c3a4c505c2409b59c1114ece9025d   Ready    control-plane,master   46m   v1.29.7+vmware.wcp.1
421c77c0e1b71771775219aeb4ecb244   Ready    control-plane,master   42m   v1.29.7+vmware.wcp.1
421cc0a0ef3d95c171cbed8040c021f9   Ready    control-plane,master   42m   v1.29.7+vmware.wcp.1
esx03.*************                Ready    agent                  35m   v1.29.3-sph-c8e42be
esx04.*************                Ready    agent                  35m   v1.29.3-sph-c8e42be
esx05.*************                Ready    agent                  37m   v1.29.3-sph-c8e42be
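
For anyone who wants to script this check, here is a minimal Python sketch that scans the text output of kubectl get node and lists every node whose STATUS column is not exactly Ready. The function name and the sample host names are made up for illustration. With the output above it would return nothing, since all nodes report Ready.

```python
def nodes_needing_attention(kubectl_output):
    """Return (name, status) pairs for nodes whose STATUS is not exactly 'Ready'."""
    rows = [line.split() for line in kubectl_output.splitlines() if line.strip()]
    flagged = []
    for fields in rows[1:]:  # skip the header row
        if len(fields) < 2:
            continue  # ignore malformed lines
        name, status = fields[0], fields[1]
        # Statuses such as "NotReady" or "Ready,SchedulingDisabled" are flagged.
        if status != "Ready":
            flagged.append((name, status))
    return flagged


# Hypothetical sample in the same column layout as `kubectl get node`.
SAMPLE = """\
NAME                  STATUS     ROLES                  AGE   VERSION
supervisor-1          Ready      control-plane,master   46m   v1.29.7+vmware.wcp.1
esx03.example.com     Ready      agent                  35m   v1.29.3-sph-c8e42be
esx04.example.com     NotReady   agent                  35m   v1.29.3-sph-c8e42be
"""

if __name__ == "__main__":
    print(nodes_needing_attention(SAMPLE))
```

The output of the real command can be saved to a file on the Supervisor VM and passed to the function as a string.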

In /var/log/vmware/wcp/wcpsvc.log I can see the following messages, but I'm not sure whether they are relevant:

2024-08-18T16:35:32.206Z info wcp [eamagency/resolve.go:169] [opID=vCLS] Successfully invoked resolve on agency &{0xc004f0e638 0xc0004c3840}
2024-08-18T16:35:32.277Z info wcp [] W0818 18:35:32.277611  676647 reflector.go:539] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list clusters: failed to list clusters: the server could not find the requested resource
2024-08-18T16:35:32.277Z info wcp [] W0818 18:35:32.277611  676647 reflector.go:539] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list clusters: failed to list clusters: the server could not find the requested resource
2024-08-18T16:35:32.277Z info wcp [] E0818 18:35:32.277684  676647 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch clusters: failed to list clusters: failed to list clusters: the server could not find the requested resource
2024-08-18T16:35:32.277Z info wcp [] E0818 18:35:32.277684  676647 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch clusters: failed to list clusters: failed to list clusters: the server could not find the requested resource
2024-08-18T16:35:32.277Z info wcp [] E0818 18:35:32.277684  676647 reflector.go:147] pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch clusters: failed to list clusters: failed to list clusters: the server could not find the requested resource

The same error as in the setup process also appears there:

2024-08-18T16:37:03.363Z error wcp [kubelifecycle/node_controller.go:1525] [opID=d06d1726-89c7-4144-891b-06256eff58af-host-70016] Failed to move host esxhost1.example.com to ready state, err:waiting for node esxhost1.example.com to move to ready state
2024-08-18T16:37:03.375Z error wcp [kubelifecycle/node_controller.go:474] [opID=d06d1726-89c7-4144-891b-06256eff58af-host-70016] Failed to realize node {nodeID:host-70016 supervisorID:d06d1726-89c7-4144-891b-06256eff58af} state. Err waiting for node esxhost1.example.com to move to ready state. Will retry.
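
To see which hosts are affected and how often, the "Failed to move host ... to ready state" lines can be counted per host. Below is a minimal Python sketch. The regular expression is based only on the wording of the two log lines above, so it is an assumption that other builds log the same text, and the function name is made up.

```python
import re
from collections import Counter

# Matches the wording seen in wcpsvc.log above; other versions may differ.
FAILED_HOST = re.compile(r"Failed to move host (\S+) to ready state")


def failing_hosts(log_text):
    """Count 'Failed to move host ...' errors per host name."""
    return Counter(match.group(1) for match in FAILED_HOST.finditer(log_text))


# The two log lines quoted above, used as sample input.
SAMPLE = (
    "2024-08-18T16:37:03.363Z error wcp [kubelifecycle/node_controller.go:1525] "
    "[opID=d06d1726-89c7-4144-891b-06256eff58af-host-70016] Failed to move host "
    "esxhost1.example.com to ready state, err:waiting for node "
    "esxhost1.example.com to move to ready state\n"
    "2024-08-18T16:37:03.375Z error wcp [kubelifecycle/node_controller.go:474] "
    "[opID=d06d1726-89c7-4144-891b-06256eff58af-host-70016] Failed to realize node "
    "{nodeID:host-70016 supervisorID:d06d1726-89c7-4144-891b-06256eff58af} state. "
    "Err waiting for node esxhost1.example.com to move to ready state. Will retry.\n"
)

if __name__ == "__main__":
    for host, count in failing_hosts(SAMPLE).items():
        print(host, count)
```

Only the first line matches the pattern, so the follow-up "Failed to realize node" retry message is not counted twice.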

This one might also be relevant; it is shown as an event on each ESXi host and in the WCP log file:

2024-08-18T16:42:54.504Z debug wcp [vclib/guestop.go:213] [opID=66c210b0-d06d1726-89c7-4144-891b-06256eff58af-SecretUploader-vm-77043] Failed to delete file from /dev/shm/secret.tmp: ServerFaultCode: File /dev/shm/secret.tmp was not found

What I already checked/tried:

checked for uniform NTP configuration and tested it
tested DNS on Supervisor, ALB, ESXi, vCenter
restarted wcp service on vCenter
restarted all services on ESXi host
reinstalled one of my ESXi hosts
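
For reference, the DNS test can be repeated in a few lines of Python: resolve each name to an address, resolve that address back, and compare the two names. This is only a sketch of the idea, not the exact test I ran. The function name is made up, and the host names in the usage note are placeholders.

```python
import socket


def check_dns(hostnames, forward=socket.gethostbyname, reverse=socket.gethostbyaddr):
    """Check that each name resolves and that the reverse lookup returns the same name.

    `forward` and `reverse` can be swapped out, for example to test without a network.
    """
    results = {}
    for name in hostnames:
        try:
            ip = forward(name)
            ptr = reverse(ip)[0]  # gethostbyaddr returns (hostname, aliases, addresses)
        except OSError as exc:  # socket.gaierror and socket.herror are OSError subclasses
            results[name] = f"lookup failed: {exc}"
            continue
        same = ptr.lower().rstrip(".") == name.lower().rstrip(".")
        results[name] = "ok" if same else f"mismatch: {ip} -> {ptr}"
    return results
```

Calling check_dns(["esx03.example.com", "vcenter.example.com"]) from a machine that uses the same DNS servers returns a dictionary with one entry per name.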

Does anyone have an idea what I could look for or try?

Franck VIEIRA posted Nov 28, 2024 10:24 AM

Hi Robin,
I have exactly the same issue as you. Since you posted this more than a month ago, did you resolve it?

Thanks a lot for your help !

Franck

Robin posted Nov 28, 2024 02:57 PM

@Franck VIEIRA Nope, no solution yet, sadly. I just gave up and hope it will be fixed in a future release...

Franck VIEIRA posted Nov 29, 2024 02:55 AM

What's strange is that I was able to perform the deployment on another cluster (lab environment) running the same vSphere release, but I'm not able to deploy it on the production cluster. So I have to figure this out :p
If I find something, I'll let you know (I'm trying to reach Broadcom support and the Broadcom teams, but it's not easy...)

Robin posted Nov 29, 2024 04:46 AM

Yeah, it took a couple of months for someone to fix our licenses as well :/. Sadly we do not use Tanzu/TKG at work, so I would really appreciate it if you could share the solution once support gets back to you. Thanks, Franck!

Franck VIEIRA posted Dec 10, 2024 10:02 AM

Hi Robin,

I have an update regarding this issue! I've finally managed to deploy the Supervisor Cluster successfully.

On my side, the issue was in the NSX ALB configuration. Broadcom support told me that only Default-Cloud and Default-Group (the Service Engine Group) are supported for deployment with the NSX ALB Essentials license.

So I deleted the custom cloud that I had configured on NSX ALB at the beginning, edited the Default-Cloud with the correct parameters, and then started the WCP workflow once again, and everything went green! ;)

If you need it, I followed this video, which explains exactly the configuration to perform on NSX ALB with the Essentials license: https://www.youtube.com/watch?v=tZejlqONoN0

I hope this information helps you ;)

Regards,

Franck

Robin posted Dec 16, 2024 11:25 AM

Hi Franck,

This seems to have solved my problem as well, after trying it together with some other steps.

Thank you very much for your help!

Robin