VMware Cloud Director Container Service Extension

  • 1.  Deployment stuck in loop

    Posted Aug 04, 2022 11:15 AM

    Hi,

    The cluster creation never completes and the cluster never becomes ready.
    The vApp and VMs are then removed and recreated, in a loop.

    Output of "cat /var/log/cloud-final.err":
    Note: I am not sure whether we can log in as root over SSH, so the screenshot is from the console.

    ccalvetbeta_0-1659607662110.png

    ccalvetbeta_1-1659610970959.png
    All pods seem OK.

    ccalvetbeta_2-1659611033242.png
    In the journal I don't see any relevant error.

    ccalvetbeta_3-1659611311153.png


    But there are a few earlier errors of type "412 Precondition Failed".

    ccalvetbeta_4-1659611416100.png

    ccalvetbeta_5-1659611492148.png

     

    The latest error in the events seems to be associated with the deletion of the vApp and the load balancer.

    ccalvetbeta_6-1659611620069.png

     

    ccalvetbeta_7-1659611634049.png

     

    Any suggestions on what the next troubleshooting step should be?

     



  • 2.  RE: Deployment stuck in loop

    Posted Aug 04, 2022 12:26 PM

    I just noticed from Software requirements:
    VCD 10.3.3.1 (tested). Will work with VCD 10.3.1 or newer
    NSX-T 3.1.1
    Avi 20.1.3

    Does CSE next work with newer versions of NSX and Avi as well? (In other words, are these minimum versions rather than exact version requirements?)

    (Theoretically, CSE is only supposed to communicate with Cloud Director and let Cloud Director communicate with the other components.)
    In our environment, which was supported for CSE 3.1.3, we are using:
    NSX-T 3.1.3.4
    AVI 21.1.1

    The NSX-T version is managed by VMware Cloud Foundation, so a downgrade is not an option.
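
    Whether the listed versions are minimums is not confirmed anywhere in this thread. If they are, the comparison against the versions in this environment could be sketched as below. The helper names are hypothetical and the version strings are the ones quoted above.

```python
def parse_version(text):
    """Turn a dotted version such as '3.1.3.4' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(installed, minimum):
    # Tuples compare element by element, so (3, 1, 3, 4) >= (3, 1, 1).
    return parse_version(installed) >= parse_version(minimum)


# Assumption: the documented versions are treated as minimums.
listed = {"NSX-T": "3.1.1", "Avi": "20.1.3"}
installed = {"NSX-T": "3.1.3.4", "Avi": "21.1.1"}

for product, minimum in listed.items():
    ok = meets_minimum(installed[product], minimum)
    print(product, installed[product], ">=", minimum, ":", ok)
```

    This only compares version numbers. It says nothing about whether a given combination is actually supported.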



  • 3.  RE: Deployment stuck in loop

    Posted Aug 04, 2022 02:11 PM

    Attached are the logs generated with https://github.com/vmware/cloud-provider-for-cloud-director/blob/main/scripts/generate-k8s-log-bundle.sh

    I have stopped the CSE service to keep access to the ephemeral VM.




  • 4.  RE: Deployment stuck in loop

    Broadcom Employee
    Posted Aug 05, 2022 12:04 AM

    Thanks for the log files. We will look into them and provide any troubleshooting guidance required.



  • 5.  RE: Deployment stuck in loop

    Broadcom Employee
    Posted Aug 05, 2022 12:24 AM

    capi-kubeadm-bootstrap-system/logs.txt shows the following infrastructure issue. We will investigate further and provide troubleshooting guidance.

    "reconciler kind"="KubeadmConfig" "worker count"=10
    I0804 13:47:51.777262 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1634"
    I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
    I0804 13:47:51.817895 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"
    I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"
    ==== END logs for container manager of pod capi-kubeadm-bootstrap-system/capi-kubeadm-bootstrap-controller-manager-56bdcdf797-skq6h ====
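
    To see which Machine objects are stuck waiting, the repeated messages in a log bundle like this one can be counted per machine. The following is a minimal sketch, assuming the quoted key="value" layout shown in the lines above. The function name is hypothetical and the sample lines are copied from this post.

```python
import re
from collections import Counter

# Matches the quoted "key"="value" pairs appended to each controller log line.
PAIR_RE = re.compile(r'"([^"]+)"="([^"]*)"')

WAITING_MSG = "Cluster infrastructure is not ready, waiting"


def count_waiting_by_machine(lines):
    """Count 'infrastructure is not ready' messages per Machine name."""
    counts = Counter()
    for line in lines:
        fields = dict(PAIR_RE.findall(line))
        if fields.get("msg") == WAITING_MSG:
            counts[fields.get("name", "<unknown>")] += 1
    return counts


sample = [
    'I0804 13:47:51.811648 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1641"',
    'I0804 13:47:51.838246 1 kubeadmconfig_controller.go:236] controller/kubeadmconfig "msg"="Cluster infrastructure is not ready, waiting" "kind"="Machine" "name"="beta004-worker-pool-1-8554c96b77-2hdm5" "namespace"="beta004-ns" "reconciler group"="bootstrap.cluster.x-k8s.io" "reconciler kind"="KubeadmConfig" "version"="1642"',
    "==== END logs for container manager ====",
]

for machine, count in count_waiting_by_machine(sample).items():
    print(machine, count)
```

    A machine whose count keeps growing between two log bundles is still waiting on the infrastructure provider. The message itself does not say why.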



  • 6.  RE: Deployment stuck in loop

    Posted Aug 05, 2022 07:31 AM

    Thank you for the reply.
    I can confirm that the "machine" is in pending state.

    I have also noticed pending tasks at the load balancer level.
    I do not know whether they are the cause of this issue or a symptom of it.

    ccalvetbeta_0-1659684467537.png

    The pool has no members. Either the members are missing because the machine is in pending state, or the machine is in pending state because the pool has no members.

    ccalvetbeta_0-1659684660645.png

     





  • 7.  RE: Deployment stuck in loop

    Posted Aug 05, 2022 01:19 PM

    Update: There was an issue with the vCenter service account used by NSX ALB.
    It has been fixed, and the cluster creation now reaches further steps.



  • 8.  RE: Deployment stuck in loop

    Broadcom Employee
    Posted Aug 10, 2022 08:34 PM

    If you could elaborate on what the issue with the service account was and what resolved it, that might be helpful to other users. Thanks.

    Aashima



  • 9.  RE: Deployment stuck in loop

    Posted Sep 07, 2022 11:31 AM

    Hi,
    I do not have full details, but from what I understood:
    NSX ALB communicates with vCenter using a vCenter account dedicated to this purpose. (This account is part of the "Create NSX-T Cloud" setup.)
    It seems that NSX ALB was somehow no longer able to communicate with vCenter, so maybe the password had been changed or something similar.
    Note: I may be mistaken and the issue may have been with the account connecting to NSX Manager, but the concept is the same: an issue with the credentials used by the NSX-T Cloud.
    After the credentials were fixed, the deployment was successful.

    Summary:
    The issue was not related to Tanzu/CSE but to the underlying NSX ALB infrastructure. Unfortunately, it is not easy to pinpoint the origin when looking at errors at the Tanzu/CSE level.
    Hence the feature requests for a "prerequisite" check and/or a wizard showing the progress of a cluster deployment step by step (completed steps, current step, and next steps). That would make it easier to pinpoint the origin of such an issue when a step is stuck.
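
    To illustrate the requested step-by-step view, here is a minimal sketch of an ordered check runner that reports completed, stuck, and pending steps. It is not part of CSE. The step names and the placeholder checks are hypothetical, chosen only to mirror what happened in this thread.

```python
def run_steps(steps):
    """Run (name, check) pairs in order and stop at the first failing check."""
    report = []
    for name, check in steps:
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as a failed step
            ok = False
            name = f"{name} ({exc})"
        report.append((name, "done" if ok else "STUCK"))
        if not ok:
            break
    # Steps after the failing one were never attempted.
    attempted = len(report)
    report.extend((name, "pending") for name, _ in steps[attempted:])
    return report


# Hypothetical steps with placeholder checks, for illustration only.
steps = [
    ("NSX ALB can authenticate to vCenter", lambda: False),
    ("Load balancer pool has members", lambda: True),
    ("Machine objects leave pending state", lambda: True),
]

for name, status in run_steps(steps):
    print(f"{status:8} {name}")
```

    With a view like this, the failing credential step would have shown up first, before the pending machines and the empty pool.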



  • 10.  RE: Deployment stuck in loop

    Broadcom Employee
    Posted Aug 05, 2022 12:03 AM

    Thanks for the screenshots. They show that the ephemeral VM has all pods up and running.
    However, it looks like the target Machine objects are stuck in pending state, in a loop.