Cluster Agent VM is missing on cluster XYZ (vCLS)

  • 1.  Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 10, 2020 03:42 PM

    I just upgraded to vSphere 7 Update 1, and in the VMs and Templates view I can see that it created the folder for vCLS.  I only have one cluster, and DRS is complaining about the unhealthy state of the vSphere Cluster Service, which makes sense as no vCLS VMs have been created.  Out of curiosity I created a new cluster and moved some hosts into it, but I get the same issue.

     

    In vCenter, when I go to Admin -> vCenter Server Extensions and look at the vSphere ESX Agent Manager, I see that both clusters have alerts with the same message: "Cluster agent VM is missing in the cluster".  That makes sense, since none exist.  It's nice that there is a Resolve All Issues button, but it doesn't resolve any of my issues.

     

    I am poking around trying to find logs that help pinpoint the exact issue but haven't been successful yet.  Has anyone seen this before, or can anyone point me to the logs that would show why the vCLS VMs are not getting created?

     

    All ESXi hosts have been patched to 7.0.1 and have the same CPU make / model (Intel)

     

    ** Update: OK, I found this in the EAM log: "can't provision VM for ClusterAgent due to lack of suitable datastore".  All of my datastores have 100 GB or more free, but I will start down that path. **



  • 2.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 10, 2020 04:32 PM

    Hey,

    Do you have vSAN in this environment?



  • 3.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 10, 2020 04:34 PM

    No, the datastores are iSCSI



  • 4.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 10, 2020 04:42 PM

    vCLS is deployed on the best-ranked datastore that is shared across multiple ESXi hosts. I presume the iSCSI storage you are talking about is presented to all the ESXi hosts in the cluster. How many nodes do you have?



  • 5.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 10, 2020 04:46 PM

    Yes, there are 4 nodes in the cluster, and they all have access to about 12 different iSCSI datastores, all with more than enough space.  I think vCLS needs about 2 GB, and they all have well over 100 GB free.  There is also an NFS datastore that is accessible to all 4 nodes.



  • 6.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 11, 2020 01:59 PM

    Then you definitely have the resources to deploy vCLS, so I can think of two scenarios: one is that this is a bug, and the other is that the service is stuck. Have you tried restarting the EAM service?



  • 7.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 11, 2020 02:38 PM

    We upgraded our vCenter from 6.5 to 7.0 Update 1 yesterday and are experiencing exactly the same problems.

    At the moment DRS doesn't work on any of our clusters because all the vCLS VMs failed to deploy.

    The EAM log states: Can't provision VM for ClusterAgent(XYZ) due to lack of suitable datastore......

    If anyone knows how to fix this please speak up!



  • 8.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 12, 2020 10:40 AM

    I've just had an update from VMware support: it's still with engineering, and they do not have a fix for this issue yet.



  • 9.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 12, 2020 01:32 PM

    OK, glad I am not the only one with this issue.  Hopefully this is addressed sooner rather than later.



  • 10.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 12, 2020 01:57 PM

    After some digging around in the EAM log, I found that it's deploying this OVF:

    -rw-r--r-- 1 root root    36782 Aug  1 17:02 photon-ova-0.0.1-16677410.ovf

    -rw-r--r-- 1 root root     1909 Aug  1 17:02 photon-ova.cert

    -rw-r--r-- 1 root root 75251200 Aug  1 17:02 photon-ova-disk1.vmdk

    -rw-r--r-- 1 root root      148 Aug  1 17:02 photon-ova.mf

    I was wondering if vCLS VMs could be deployed manually, but I don't really want to test my theory on a production vCenter...



  • 11.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 22, 2020 10:47 PM

    Did you get anywhere on this?  I'm having the same issue and seeing the same errors in the EAM log.  I don't have vSAN configured, so the VMware docs don't appear to offer any guidance.



  • 12.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 23, 2020 09:37 AM

    Nope. VMware support said that the vCLS VMs cannot be deployed manually, and they still don't have a fix or an ETA for one.

    The support engineer mistakenly thought that the problem only affected vSAN clusters, so we set them straight on that.

    The last we heard from support was that they were "opening a PR internally to flag the issue further with engineering as there are many similar known issues which are unresolved" and that it could be a while before a fix was available.

    So we reverted to 6.5 last week, which involved a fair amount of turning HA off and on again, disconnecting and reconnecting a bunch of hosts, and a couple of reboots of the 6.5 vCenter before everything would behave properly.

    We are upgrading to 6.7 this week to avoid being affected by the loss of Flash at the end of December.

     



  • 13.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Nov 23, 2020 01:14 PM

    Well, that's a bummer.  I'm running an educational cluster with ~100 resource groups, ~1000 machines, and way too many people touching vCenter, and we're in the middle of final project season.  I think I'm going to cross my fingers that everything remains stable as we limp through the last couple of weeks of the semester.  It's super exciting that VMware put us in this situation, though.



  • 14.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 01, 2020 12:23 PM

    Same problem here. Is there anything new from VMware?



  • 15.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)
    Best Answer

    Posted Dec 02, 2020 08:16 AM

    I had this response from support yesterday:

    I've been working with engineering and they mentioned a similar case to this which suggested the issue had to do with certificates.

    Can you try following the steps in the below KB article to fix the SSL trusts and rebuild service registrations on the affected nodes using lsdoctor?

    https://kb.vmware.com/s/article/80469

    1. Run lsdoctor with the "-t, --trustfix" option to fix any trust issues.

    2. Run lsdoctor with the "-r, --rebuild" option to rebuild service registrations

    However, we already rolled vCenter back to 6.5 and then upgraded it to 6.7, so we cannot test whether this works at the moment. If anyone decides to try it, please reply and let us know whether it worked for you.



  • 16.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 01:04 PM

    before fix:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

    ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
    You can find the report and logs here: /var/log/vmware/lsdoctor

    2020-12-02T11:11:50 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
    2020-12-02T11:11:51 INFO live_checkCerts: Checking services for trust mismatches...
    2020-12-02T11:11:51 INFO generateReport: Listing lookup service problems found in SSO domain
    2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found Port 7444 Found: Please run python ls_doctor.py --stalefix option on this node.
    2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
    2020-12-02T11:11:51 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
    2020-12-02T11:11:51 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-111150.json

    First fix - stalefix:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py --stalefix

    WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
    of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
    Logs can be found here: /var/log/vmware/lsdoctor

    2020-12-02T11:12:36 INFO main: You are running a check on this node for stale 5.x data. NOTE: Please run this script on all VC's or PSC's in the SSO domain to be thorough.

    Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


    Provide password for administrator@vsphere.local:
    2020-12-02T11:12:57 INFO __init__: Retrieved services for machine with hostname: vcenter.masked.domain
    2020-12-02T11:12:57 INFO checkStale: Checking for logbrowser or 5.x vsphere client services...
    2020-12-02T11:12:57 WARNING checkStale: PROBLEM FOUND: logbrowser service found. Attempting to unregister...
    2020-12-02T11:12:57 INFO checkStale: Success!
    2020-12-02T11:12:57 WARNING checkStale: PROBLEM FOUND: stale 5.x webclient service found. Attempting to unregister...
    2020-12-02T11:12:57 INFO checkStale: Success!
    2020-12-02T11:12:57 INFO checkStale: PASSED: 5.x vcenter service not found.
    2020-12-02T11:12:57 INFO backup_machine: Exporting MACHINE_SSL_CERT cert and key
    2020-12-02T11:12:57 INFO checkLegacy: Checking for STS_INTERNAL_SSL_CERT...
    2020-12-02T11:12:57 INFO backup_machine: Exporting MACHINE_SSL_CERT cert and key
    2020-12-02T11:12:57 INFO check_sts_internal: Checking for STS_INTERNAL_SSL_CERT...
    2020-12-02T11:12:57 INFO backup_sts_internal: Backing up STS_INTERNAL_SSL_CERT
    2020-12-02T11:12:57 INFO checkLegacy: PROBLEM FOUND: STS_INTERNAL_SSL_CERT found!
    2020-12-02T11:12:57 INFO replace_sts_internal: Replacing STS_INTERNAL_SSL_CERT with MACHINE_SSL_CERT
    2020-12-02T11:12:57 INFO replace_sts_internal: Successfully replaced STS_INTERNAL_SSL_CERT
    2020-12-02T11:12:57 INFO checkLegacy: Checking for 7444 in legacy services...
    2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sso-adminserver/sdk/vsphere.local
    2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sts/STSService/vsphere.local
    2020-12-02T11:12:57 WARNING checkLegacy: PROBLEM FOUND: Found port 7444 in service registration URL! https://vcenter.masked.domain:7444/sso-adminserver/sdk/vsphere.local
    2020-12-02T11:12:57 INFO checkLegacy: Recreating legacy SSO service registrations...
    2020-12-02T11:12:59 INFO checkLegacy: Successfully recreated legacy SSO endpoints.
    2020-12-02T11:12:59 INFO main: Please restart services on all PSC's and VC's when you're done.

    After stalefix:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

    ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
    You can find the report and logs here: /var/log/vmware/lsdoctor

    2020-12-02T11:22:17 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
    2020-12-02T11:22:17 INFO live_checkCerts: Checking services for trust mismatches...
    2020-12-02T11:22:17 INFO generateReport: Listing lookup service problems found in SSO domain
    2020-12-02T11:22:17 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
    2020-12-02T11:22:17 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
    2020-12-02T11:22:17 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-112217.json

    Second fix - trustfix:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py --trustfix

    WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
    of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
    Logs can be found here: /var/log/vmware/lsdoctor

    2020-12-02T11:22:33 INFO main: You are checking for and fixing SSL trust mismatches in the local SSO site. NOTE: Please run this script one PSC or VC per SSO site.

    Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


    Provide password for administrator@vsphere.local:
    2020-12-02T11:22:42 INFO __init__: Retrieved services from SSO site: Default-First-Site
    2020-12-02T11:22:42 INFO findAndFix: Checking services for trust mismatches...
    2020-12-02T11:22:42 INFO findAndFix: Attempting to reregister d51c3647-4896-4823-acb2-1d1cb3acb48 for vcenter.masked.domain
    2020-12-02T11:22:43 INFO findAndFix: We found 1 mismatch(s) and fixed them
    2020-12-02T11:22:43 INFO main: Please restart services on all PSC's and VC's when you're done.

    After trustfix:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -l

    ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
    You can find the report and logs here: /var/log/vmware/lsdoctor

    2020-12-02T11:39:16 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
    2020-12-02T11:39:17 INFO live_checkCerts: Checking services for trust mismatches...
    2020-12-02T11:39:17 INFO generateReport: Listing lookup service problems found in SSO domain
    2020-12-02T11:39:17 INFO generateReport: No issues detected in the lookup service entries for vcenter.masked.domain (VC 7.0 or CGW).
    2020-12-02T11:39:17 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
    2020-12-02T11:39:17 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcenter.masked.domain-2020-12-02-113916.json

    Problem with vCLS is not fixed.
    Third fix - rebuild:

    root@vcenter [ ~/lsdoctor/lsdoctor-master ]# python ./lsdoctor.py -r

    WARNING: This script makes permanent changes. Before running, please take *OFFLINE* snapshots
    of all VC's and PSC's at the SAME TIME. Failure to do so can result in PSC or VC inconsistencies.
    Logs can be found here: /var/log/vmware/lsdoctor

    2020-12-02T13:36:22 INFO main:
    You have selected the Rebuild function. This is a potentially destructive operation!
    All external solutions and 3rd party plugins that register with the lookup service will
    have to be re-registered. For example: SRM, vSphere Replication, NSX Manager, etc.

    Have you taken offline (PSCs and VCs powered down at the same time) snapshots of all nodes in the SSO domain or supported backups?[y/n]y


    Provide password for administrator@vsphere.local:
    2020-12-02T13:36:27 INFO __init__: Established LS connection to vcenter.masked.domain

    Version Detected
    Deployment type: embedded
    Version: 17004997_7.0.1.00100_vcsa
    ========================

    0. Exit
    1. Generate a template.
    2. Replace all services with new services.
    3. Replace individual service.
    4. Restore services from backup file.

    ========================

    Please select an action: 2

    No template found for 17004997_7.0.1.00100_vcsa. Proceeding to file select.

    2020-12-02T13:36:33 INFO fileSelect: Getting files from /root/lsdoctor/lsdoctor-master/templates
    Please select a file:

    [0] 13010631_6.7.0.30000_vcsa.json
    ..
    [80] 16749653_7.0.0.10700_vcsa.json
    ..
    [96] 15808842_6.5.0.32300_vcsa.json
    Select number:

    As you can see, the current version of lsdoctor is missing a template for build 17004997_7.0.1.00100_vcsa.
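Editor's sketch: the missing-template situation above can be checked before starting the rebuild. The snippet below is a minimal illustration, assuming only that template files are named after the detected version string plus `.json`, as the listing suggests. The function name `find_template` and the short `LISTED` sample are hypothetical and are not part of lsdoctor.

```python
import re

# File names in the lsdoctor listing look like "<build>_<version>_vcsa.json".
TEMPLATE_RE = re.compile(r"^(?P<build>\d+)_(?P<version>[\d.]+)_vcsa\.json$")

# The three names that are visible in the listing above (the rest is elided there).
LISTED = [
    "13010631_6.7.0.30000_vcsa.json",
    "16749653_7.0.0.10700_vcsa.json",
    "15808842_6.5.0.32300_vcsa.json",
]


def find_template(detected, names):
    """Return the template file name that matches `detected` exactly, or None."""
    wanted = detected + ".json"
    if TEMPLATE_RE.match(wanted) is None:
        raise ValueError("unexpected version string: " + detected)
    return wanted if wanted in names else None


# The version lsdoctor detected in the post above.
missing = find_template("17004997_7.0.1.00100_vcsa", LISTED)
present = find_template("16749653_7.0.0.10700_vcsa", LISTED)
print(missing, present)
```

On an appliance, the list of names could be read from the templates directory shown in the output (/root/lsdoctor/lsdoctor-master/templates) with `os.listdir`. The check only reports whether an exact match exists; it does not suggest a substitute template.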

     

     



  • 17.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 01:11 PM

    Thanks for testing - I've given the link to this thread to support so that they can see your output.



  • 18.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 01:19 PM

    from eam.log:

     

    2020-12-02T13:08:09.934Z | INFO | sso-0 | AcquireTokenProvider.java | 53 | [CreateSAMLToken:124579314592473] Acquiring HoK token.
    2020-12-02T13:08:09.980Z | INFO | sts-0 | Workflow.java | 121 | [CreateSAMLToken:124579314592473] FAILED
    com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
    ..
    2020-12-02T13:08:09.983Z | WARN | sts-0 | TagsChecker.java | 157 | [FilterNotAllowedDatastores:56795461689246] Unexpected error filtering datastores by tag category names.
    ..
    2020-12-02T13:08:09.991Z | ERROR | cluster-agent-3 | AuditedJob.java | 106 | JOB FAILED: [#45471321] DeployVmJob(ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null'))
    com.vmware.eam.job.DeployVmJob$DeployVmJobFailure: Can't provision VM for ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null') due to lack of suitable datastore.

     

    It looks like a problem with STS, but the STS check from https://kb.vmware.com/s/article/79248 says STS is OK.
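Editor's sketch: the excerpt above shows that the "lack of suitable datastore" error follows a token failure a few milliseconds earlier. A small filter can pull both messages out of a longer log. This is a rough illustration that assumes the pipe-separated layout seen in the excerpt; `scan` and `MARKERS` are hypothetical names, and the sample text is copied from the post.

```python
import re

# Layout seen in the excerpt:
# "<timestamp> | <LEVEL> | <thread> | <file> | <line> | <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) \| (?P<level>[A-Z]+) \| (?P<thread>\S+) \| "
    r"(?P<source>\S+) \| (?P<lineno>\d+) \| (?P<msg>.*)$"
)

# The two messages this thread keeps running into.
MARKERS = ("Signature validation failed", "lack of suitable datastore")

SAMPLE = """\
2020-12-02T13:08:09.934Z | INFO | sso-0 | AcquireTokenProvider.java | 53 | [CreateSAMLToken:124579314592473] Acquiring HoK token.
2020-12-02T13:08:09.980Z | INFO | sts-0 | Workflow.java | 121 | [CreateSAMLToken:124579314592473] FAILED
com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
2020-12-02T13:08:09.991Z | ERROR | cluster-agent-3 | AuditedJob.java | 106 | JOB FAILED: [#45471321] DeployVmJob(ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null'))
com.vmware.eam.job.DeployVmJob$DeployVmJobFailure: Can't provision VM for ClusterAgent(ID: 'Agent:54654654-6554-5454-4747-4587215768741:null') due to lack of suitable datastore.
"""


def scan(lines):
    """Return (timestamp, level, text) for every line that mentions a marker.

    Exception lines carry no timestamp of their own, so they inherit the
    timestamp and level of the log entry that precedes them.
    """
    hits = []
    ts, level = None, None
    for raw in lines:
        line = raw.strip()
        entry = LINE_RE.match(line)
        if entry:
            ts, level = entry.group("ts"), entry.group("level")
            text = entry.group("msg")
        else:
            text = line
        if any(marker in text for marker in MARKERS):
            hits.append((ts, level, text))
    return hits


hits = scan(SAMPLE.splitlines())
for ts, level, text in hits:
    print(ts, level, text)
```

Seeing the token failure immediately before the datastore error is what points away from storage and toward certificates, which matches the fixes reported later in the thread.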



  • 19.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 01:23 PM

    Same result here.  I had to run stalefix, but everything is clear after that.  Still getting the "lack of suitable datastore" error.  Here are my lsdoctor -l results:

    root@vcsa1 [ ~/lsdoctor-master ]# python lsdoctor.py -l

    ATTENTION: You are running a reporting function. This doesn't make any changes to your environment.
    You can find the report and logs here: /var/log/vmware/lsdoctor

    2020-12-02T13:17:03 INFO main: You are reporting on problems found across the SSO domain in the lookup service. This doesn't make changes.
    2020-12-02T13:17:03 INFO live_checkCerts: Checking services for trust mismatches...
    2020-12-02T13:17:03 INFO generateReport: Listing lookup service problems found in SSO domain
    2020-12-02T13:17:03 INFO generateReport: No issues detected in the lookup service entries for vcsa1.eecs.net (VC 7.0 or CGW).
    2020-12-02T13:17:03 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
    2020-12-02T13:17:03 INFO generateReport: Report generated: /var/log/vmware/lsdoctor/vcsa1.eecs.net-2020-12-02-131703.json
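Editor's sketch: when the `-l` report is not clean, each ERROR line names the option to run next. The snippet below extracts those suggestions from saved console output. It assumes the line format shown earlier in this thread; `suggested_fixes` is a hypothetical helper, and the sample lines are copied from the earlier "before fix" output.

```python
import re

# ERROR lines in the report name the affected node, the problem, and the option to run.
ERROR_RE = re.compile(
    r"^\S+ ERROR generateReport: (?P<node>.+?) found (?P<problem>.+?): "
    r"Please run python ls_doctor\.py (?P<option>--[\w-]+) option"
)

# Raw string: the node name contains a backslash.
SAMPLE = r"""
2020-12-02T11:11:51 INFO generateReport: Listing lookup service problems found in SSO domain
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found Port 7444 Found: Please run python ls_doctor.py --stalefix option on this node.
2020-12-02T11:11:51 ERROR generateReport: default-first-site\vcenter.masked.domain (VC 7.0 or CGW) found SSL Trust Mismatch: Please run python ls_doctor.py --trustfix option on this node.
2020-12-02T11:11:51 INFO generateReport: No issues detected in the lookup service entries for ##NO_HOSTNAME##.
"""


def suggested_fixes(report_text):
    """Return (node, problem, option) for each ERROR line in the report output."""
    findings = []
    for line in report_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            findings.append(
                (match.group("node"), match.group("problem"), match.group("option"))
            )
    return findings


findings = suggested_fixes(SAMPLE)
for node, problem, option in findings:
    print(problem, "->", option)
```

A clean report, like the one in this post, yields an empty list. As several posters found, a clean report does not by itself mean the vCLS deployment will succeed.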



  • 20.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 02:33 PM

    My issue is finally resolved.  Turned out that checksts.py was telling me that there weren't any issues, but there were four certificates (1 leaf and 3 roots).  I've read in a few places that there is only supposed to be one.  I ran fixsts.sh, which dropped me to 3 certs in checksts (1 root and 2 leaf certs).  After that, the vCLS machines showed up almost immediately.



  • 21.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 02, 2020 09:42 PM

    I had the exact same issue and just fixed it by running fixsts.sh :)



  • 22.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Dec 03, 2020 10:08 AM

    My problem was a little "bigger".
    Before running fixsts.sh, we created a test cluster and moved ESXi into it; vCenter didn't create a vCLS VM there either.
    After running fixsts.sh, vCenter tried to create the vCLS VM in the existing cluster and immediately deleted it (every minute!), while in the test cluster it created it without any problems. So fixsts.sh alone didn't help us.

    However, we tried the procedure described in https://kb.vmware.com/s/article/80472, and after disabling and re-enabling vCLS creation, the existing cluster recovered and created vCLS correctly, and DRS is fully functional.

     

    Thanks to All for your help!



  • 23.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Sep 20, 2021 07:57 AM

    fixsts.sh fixed my problem too ... thank you !



  • 24.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Jan 13, 2021 05:47 PM

    Has anyone heard any updates yet?  Is there a 17004997_7.0.1.00100_vcsa template floating about, or an updated lsdoctor?



  • 25.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Jan 27, 2021 09:18 PM

    I swore at this issue for a day or so too, after migrating from VCSA 6.7 to 7.0.1 (build 17327586) yesterday.  In my case, my 6.7 VCSA (whatever the most recent version of 6.7 was on January 24, 2021) had been migrated from version to version multiple times over the years, starting from, I want to say, VCSA 5 (but maybe it was 5.5).  It has also had an AD Certificate Authority issued certificate on it for many years (the cert says it's valid from May 2015 to July 2024, so it's been around for a while).  Eventually (after I opened a ticket with VMware Support 3+ hours ago, to which I haven't gotten a response yet) I stumbled onto this thread, which led me to this route of resolution.

    These are the steps I took, in the order I took them.

    1. Enabled then disabled Retreat Mode as per https://kb.vmware.com/s/article/80472
    2. Ran lsdoctor.py from https://kb.vmware.com/s/article/80469 and had to use the --trustfix and --stalefix options to fix two issues identified.
    3. Ran checksts.py from https://kb.vmware.com/s/article/79248 - this identified multiple root certs with different thumbprints and expiry dates
    4. Ran fixsts.sh from https://kb.vmware.com/s/article/76719
    5. Ran checksts.py from https://kb.vmware.com/s/article/79248 again, which showed that I now have only a single root cert.
    6. Ran "service-control --stop --all" to stop all the services after fixsts.sh finished (as is detailed in the KB article).
    7. Ran "service-control --start --all" to restart all services after fixsts.sh finished (as is detailed in the KB article).

    By the time I made a pitstop for coffee, got the Chrome cache cleared, and managed to get logged back into VC, all the vCLS were finally deployed.

    Incidentally, eam.log showed this prior to running fixsts.sh:

    FAILED:  com.vmware.eam.sso.exception.TokenNotAcquired: Couldn't acquire token due to: Signature validation failed
    Caused by: com.vmware.vapi.std.errors.ServiceUnavailable: ServiceUnavailable (com.vmware.vapi.std.errors.service_unavailable)
    Can't provision VM for ClusterAgent(ID: 'Agent:48c988c8-570a-43d6-a12a-XXXXXXXXXX:null') due to lack of suitable datastore.

    dcc



  • 26.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Mar 11, 2021 05:15 AM

    dcolpitts - that process worked for me. I had previously gone through the lsdoctor script, but it didn't resolve the issue. It was fixsts.sh that was needed, as I had 3 root certs in the system. This vCenter has also been upgraded from previous versions. Thanks for sharing the solution.



  • 27.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted Apr 19, 2021 08:10 PM

    I've just spent hours trying to fix this issue, running all the scripts / commands from VMware.

    Your guide worked for me! Thanks so much for the help!  



  • 28.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted May 14, 2021 01:12 PM

    dcolpitts, I opened a support ticket with VMware and the technician and I ended up using steps 2-7 of your solution.  Thanks.  He also sends his Kudos to you.



  • 29.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted May 17, 2021 02:47 PM

    I raised a ticket for the same problem, pointed out this post to the support engineer, and still ended up waiting hours for them to drip feed me the steps themselves. 

    Thank you dcolpitts



  • 30.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)

    Posted May 17, 2021 04:09 PM

    Slight update on my original instructions.  Getting the scripts onto the vCenter is a pain, so I now just use curl to pull them down.  The overall steps are still the same...

    1. Enabled then disabled Retreat Mode as per https://kb.vmware.com/s/article/80472
    2. Ran lsdoctor.py from https://kb.vmware.com/s/article/80469 and had to use the --trustfix and --stalefix options to fix two issues identified.
    3. Ran checksts.py from https://kb.vmware.com/s/article/79248 - this identified multiple root certs with different thumbprints and expiry dates
    4. Ran fixsts.sh from https://kb.vmware.com/s/article/76719
    5. Ran checksts.py from https://kb.vmware.com/s/article/79248 again, which showed that I now have only a single root cert.
    6. Ran "service-control --stop --all" to stop all the services after fixsts.sh finished (as is detailed in the KB article).
    7. Ran "service-control --start --all" to restart all services after fixsts.sh finished (as is detailed in the KB article).

    SSH to the vCenter appliance with PuTTY, log in as root, and then cut and paste these commands down to the first "--stop--".  Then apply each command / fix as required for your environment.  Note that the curl links were valid at the time I created this post (2021.05.17).

     

    --start cut & paste below here--

     

    curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000NxYfZQAV -o /root/configure_retreat_mode.py

    curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/0685G00000S5Q77QAF -o /root/lsdoctor.zip

    curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/068f400000HW9InAAL -o /root/checksts.py

    curl https://kb.vmware.com/sfc/servlet.shepherd/version/download/068f400000JAn50AAD -o /root/fixsts.sh

    chmod +x /root/fixsts.sh

    unzip /root/lsdoctor.zip -d /root

    cd /root/lsdoctor-master

    python /root/lsdoctor-master/lsdoctor.py -l

     

    --stop--

     

    python /root/lsdoctor-master/lsdoctor.py --stalefix

     

    --stop--

     

    python /root/lsdoctor-master/lsdoctor.py --trustfix

     

    --stop--

     

    python /root/checksts.py

     

    --stop--

     

    cd /root

    /root/fixsts.sh

     

    --stop--

     

    service-control --stop --all

     

    --stop--

     

    service-control --start --all
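Editor's sketch: the "--stop--" markers above amount to "run one step, read the output, then decide". The snippet below keeps the same commands in a list and walks through them one at a time. It is an illustration only: by default it is a dry run that just prints each command, and `run_steps` and its prompt are hypothetical. Executing the steps for real assumes a vCenter appliance with the scripts already downloaded to /root as in the post, and offline snapshots or backups taken as lsdoctor itself warns.

```python
import shlex
import subprocess

# (description, command, working directory); commands are copied from the post above.
STEPS = [
    ("Report lookup service problems", "python /root/lsdoctor-master/lsdoctor.py -l", "/root"),
    ("Remove stale registrations", "python /root/lsdoctor-master/lsdoctor.py --stalefix", "/root"),
    ("Fix SSL trust mismatches", "python /root/lsdoctor-master/lsdoctor.py --trustfix", "/root"),
    ("Check the STS certificates", "python /root/checksts.py", "/root"),
    ("Replace the STS certificate", "/root/fixsts.sh", "/root"),
    ("Stop all services", "service-control --stop --all", "/root"),
    ("Start all services", "service-control --start --all", "/root"),
]


def run_steps(steps, execute=False, ask=input):
    """Walk through the steps in order and return the commands that were shown.

    With execute=False (the default) nothing is run. With execute=True each
    step waits for an answer first: "y" runs it, "s" skips it, and anything
    else stops the walk, which mirrors the "--stop--" markers.
    """
    shown = []
    for title, command, workdir in steps:
        print("== " + title + ": " + command)
        shown.append(command)
        if not execute:
            continue
        answer = ask("Run (y), skip (s), or stop (anything else)? ").strip().lower()
        if answer == "s":
            continue
        if answer != "y":
            break
        # The scripts prompt for input themselves; they inherit this terminal.
        subprocess.run(shlex.split(command), cwd=workdir, check=True)
    return shown


planned = run_steps(STEPS)  # dry run: prints the seven commands, runs none
```

Not every step applies to every environment. The fix options are only needed when the report or checksts.py flags a problem, which is why each step asks before running.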

     



  • 31.  RE: Cluster Agent VM is missing on cluster XYZ (vCLS)