VMware Tanzu Kubernetes Grid Integrated Edition

 pks cluster failed

yann s posted Oct 02, 2019 06:44 PM


yann s

How can I fix this issue?

yann s

pks cluster k8s-02


PKS Version:    1.5.0-build.32

Name:      k8s-02

K8s Version:    1.14.5

Plan Name:    pl3

UUID:      da8bae2b-c733-4f11-b802-8cdffce7e52b

Last Action:    CREATE

Last Action State:  failed

Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: da8bae2b-c733-4f11-b802-8cdffce7e52b, broker-request-id: 663a97b1-5172-4a42-b4ab-94161d97d311, task-id: 4330, operation: create, error-message: 0 succeeded, 1 errored, 0 canceled

Kubernetes Master Host: k8s-cluster-02.res01.unicc.org

Kubernetes Master Port: 8443

Worker Nodes:    3

Kubernetes Master IP(s): In Progress

Network Profile Name:



Deployment 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b'


Instance                                     Process State  AZ    IPs  VM CID                                   VM Type      Active
master/59550f17-f971-4b88-a0ad-b554dbd7ace1  running        AZ-2       vm-74097b55-735f-444c-8103-ce0f237a762c  medium.disk  true
master/64f6da4c-32bb-46e2-a7d8-1a48c897daa1  running        AZ-1       vm-891f088e-10a9-42c2-8507-c11699dd49ae  medium.disk  true
master/ebd2c7e8-25d0-4209-9893-61311f4fa43d  running        AZ-3       vm-47e82356-7b82-4408-89ca-a0d530879cad  medium.disk  true
worker/33ff86c0-eb96-4d68-8c58-b136efe17713  running        AZ-3       vm-659e0100-ce31-402e-b1e9-c5b750c8b246  medium.disk  true
worker/4ba5644e-d2f1-4d87-9b28-cd327ea72869  running        AZ-2       vm-f766a7c1-6f96-4562-9e6e-4fd9fbdbc4ba  medium.disk  true
worker/9f01bb0b-ec2f-4cae-b906-8692f29f71a0  running        AZ-1       vm-56095c01-2296-4de9-aa1d-cf8a2cf78d4f  medium.disk  true


yann s

bosh events -d 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b'


Using environment 'boshvm.res01.unicc.org' as client 'ops_manager'




ID  Time  User  Action  Object Type  Object Name  Task ID  Deployment  Instance  Context  Error

30022  Wed Oct 2 15:43:03 UTC 2019  health_monitor  release  lock  lock:deployment:service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  4359  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  -  -  -

30021  Wed Oct 2 15:42:29 UTC 2019  health_monitor  create  alert  1570030949.20359261@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'nsx-node-agent ( - Does not exist - restart. Alert @ 2019-10-02 15:42:29 UTC, severity 1: process is not running'  -

30020  Wed Oct 2 15:42:25 UTC 2019  health_monitor  create  alert  1570030945.384568188@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'nsx-node-agent ( - Execution failed - alert. Alert @ 2019-10-02 15:42:25 UTC, severity 1: failed to start'  -

30019  Wed Oct 2 15:41:50 UTC 2019  health_monitor  create  alert  1570030910.376023438@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'docker ( - Does not exist - restart. Alert @ 2019-10-02 15:41:50 UTC, severity 1: process is not running'  -

30018  Wed Oct 2 15:38:40 UTC 2019  health_monitor  create  alert  d084ce4f-9439-4a6d-88e2-fb7aa60ba735  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  -  message: 'Recreating unresponsive VMs. Alert @ 2019-10-02 15:38:40 UTC, severity 4: Notifying Director to recreate instances: worker/33ff86c0-eb96-4d68-8c58-b136efe17713;  -


Broadcom Employee Shubham Sharma

Hello @yann s, can you share the debug and CPI logs for task-id 4330? You can capture them with:

bosh task 4330 --debug &> 4330.debug

bosh task 4330 --cpi &> 4330.cpi
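The task ID requested above appears in the `Last Action Description` field of the `pks cluster` output earlier in the thread. A minimal sketch, assuming a POSIX shell with `sed`, of pulling that ID out of the text and saving both logs. The function names `extract_task_id` and `collect_task_logs` are hypothetical helpers for illustration, not part of the bosh or pks CLIs.

```shell
#!/bin/sh
# Hypothetical helper: print the number that follows "task-id:" in text read from stdin.
extract_task_id() {
  sed -n 's/.*task-id: \([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Hypothetical helper: save the debug and CPI logs for one BOSH task.
# Assumes the bosh CLI is already targeted at the director and logged in.
collect_task_logs() {
  task_id="$1"
  bosh task "$task_id" --debug > "$task_id.debug" 2>&1
  bosh task "$task_id" --cpi > "$task_id.cpi" 2>&1
}

# Usage (not run here; cluster name taken from the thread):
#   task_id=$(pks cluster k8s-02 | extract_task_id)
#   collect_task_logs "$task_id"
```

The redirection is written as `> file 2>&1` rather than `&>` so the sketch also works in shells other than bash.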

yann s

Hi,

Attached is the requested output.

yann s

cpi output

yann s

debug output

Broadcom Employee Shubham Sharma

It looks like the add-on that rolls out the kube-dns and coredns deployments failed. This article might help you dig a little deeper: https://community.pivotal.io/s/article/How-to-troubleshoot-kube-dns-rollout-failure-when-apply-addons-errand-fails-to-start-all-system-specs-after-1200s

From the events you shared earlier, it looks like nsx-node-agent failed to start, so checking the hyperbus status should help. The article covers those details as well.
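Before the hyperbus check, it can help to confirm which processes the health monitor flagged. A rough sketch, assuming the alert messages keep the `message: '<process> (...` shape seen in the events earlier in the thread; `alerting_processes` is a hypothetical helper, and the hyperbus check itself is left to the linked article.

```shell
#!/bin/sh
# Hypothetical helper: count health_monitor process alerts per process name.
# Reads `bosh events` text on stdin and prints "<count> <process>" lines.
# Messages that do not start with "<process> (" (such as VM recreation notices) are skipped.
alerting_processes() {
  sed -n "s/.*message: '\([^ ]*\) (.*/\1/p" | sort | uniq -c | awk '{print $1, $2}'
}

# Usage (not run here; names taken from the thread):
#   bosh events -d 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b' | alerting_processes
# To see the job state on the affected worker, one option is monit on the VM itself:
#   bosh -d 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b' ssh worker/33ff86c0-eb96-4d68-8c58-b136efe17713
#   sudo -i
#   monit summary
```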

yann s

Thank you, Sharma.

The post helped me find the following:

 /var/vcap/packages/kubernetes/bin/kubectl get all --all-namespaces


kube-system  pod/coredns-95489c5c9-2k9lx  0/1   ContainerCreating  0     31m

kube-system  pod/coredns-95489c5c9-2s5pd  1/1   Running       0     31m

kube-system  pod/coredns-95489c5c9-jfk4m  1/1   Running       0     31m



default    service/kubernetes  ClusterIP  <none>    443/TCP     60m

kube-system  service/kube-dns   ClusterIP  <none>    53/UDP,53/TCP  31m



kube-system  deployment.apps/coredns  2/3   3      2      31m



kube-system  replicaset.apps/coredns-95489c5c9  3     3     2    31m

/var/vcap/packages/kubernetes/bin/kubectl describe pod coredns-95489c5c9-2k9lx -n kube-system

Name:        coredns-95489c5c9-2k9lx

Namespace:     kube-system

Priority:      2000000000

PriorityClassName: system-cluster-critical

Node:        4af86293-823e-44b9-b8a3-f1ad640d655b/

Start Time:     Thu, 03 Oct 2019 09:38:35 +0000

Labels:       k8s-app=kube-dns


Annotations:    seccomp.security.alpha.kubernetes.io/pod: docker/default

Status:       Pending


Controlled By:   ReplicaSet/coredns-95489c5c9



  Container ID:

  Image:     vmware/coredns:1.3.1

  Image ID:

  Ports:     53/UDP, 53/TCP, 9153/TCP

  Host Ports:  0/UDP, 0/TCP, 0/TCP




  State:     Waiting

   Reason:    ContainerCreating

  Ready:     False

  Restart Count: 0


   memory: 170Mi


   cpu:    100m

   memory:   70Mi

  Liveness:   http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5

  Environment: <none>


   /etc/coredns from config-volume (ro)

   /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-mx9mx (ro)


 Type       Status

 Initialized    True

 Ready       False

 ContainersReady  False

 PodScheduled   True



  Type:   ConfigMap (a volume populated by a ConfigMap)

  Name:   coredns

  Optional: false


  Type:    Secret (a volume populated by a Secret)

  SecretName: coredns-token-mx9mx

  Optional:  false

QoS Class:    Burstable

Node-Selectors: <none>

Tolerations:   CriticalAddonsOnly

         node.kubernetes.io/not-ready:NoExecute for 300s

         node.kubernetes.io/unreachable:NoExecute for 300s


 Type   Reason         Age        From                      Message

 ----   ------         ----        ----                      -------

 Normal  Scheduled        32m        default-scheduler               Successfully assigned kube-system/coredns-95489c5c9-2k9lx to 4af86293-823e-44b9-b8a3-f1ad640d655b

 Warning FailedCreatePodSandBox 16s (x8 over 28m) kubelet, 4af86293-823e-44b9-b8a3-f1ad640d655b Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

 Normal  SandboxChanged     16s (x8 over 28m) kubelet, 4af86293-823e-44b9-b8a3-f1ad640d655b Pod sandbox changed, it will be killed and re-created.



Broadcom Employee Shubham Sharma

@yann s, could you check the status of the node the pod is scheduled on? The kubelet logs from that node would help troubleshoot this further.
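A minimal sketch of that check, assuming a POSIX shell with `awk`. The helper `pod_node` is hypothetical and only parses the `Node:` line of `kubectl describe pod` output; the kubelet log path in the comments is an assumption based on the usual BOSH job log layout and may differ on a given installation.

```shell
#!/bin/sh
# Hypothetical helper: print the node name from `kubectl describe pod` text on stdin.
# The "Node:" line has the form "<node-name>/<node-ip>"; only the name is kept.
pod_node() {
  awk '$1 == "Node:" { split($2, parts, "/"); print parts[1]; exit }'
}

# Usage (not run here; pod name taken from the thread):
#   node=$(kubectl describe pod coredns-95489c5c9-2k9lx -n kube-system | pod_node)
#   kubectl get node "$node" -o wide
#   kubectl describe node "$node"
# Kubelet logs on the matching worker VM (path is an assumption):
#   bosh -d <deployment> ssh worker/<instance-id>
#   sudo tail -n 200 /var/vcap/sys/log/kubelet/kubelet.stderr.log
```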

yann s

I've found a workaround.

Create a cluster with one worker, then resize the cluster.

I don't know if anyone else has run into the same issue. With Ops Manager 2.7, PKS 1.5, and NSX-T 2.5, creating a cluster with 3 workers always failed, while creating one with 1 worker works.
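The workaround could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: `last_action_state` and `create_then_resize` are hypothetical helpers, and the state strings `in progress` and `succeeded` are assumed to be what `pks cluster` prints (only `failed` appears in this thread).

```shell
#!/bin/sh
# Hypothetical helper: print the "Last Action State" value from `pks cluster` text on stdin.
last_action_state() {
  sed -n '/^Last Action State:/{s/^Last Action State:[[:space:]]*//;s/[[:space:]]*$//;p;}' | head -n 1
}

# Hypothetical sketch of the workaround: create with one worker, wait, then resize.
# Arguments: cluster name, external hostname, plan name, final worker count.
create_then_resize() {
  name="$1"; host="$2"; plan="$3"; workers="$4"
  pks create-cluster "$name" --external-hostname "$host" --plan "$plan" --num-nodes 1 || return 1
  # Poll once a minute until the create action is no longer in progress.
  while :; do
    state=$(pks cluster "$name" | last_action_state)
    [ "$state" = "in progress" ] || break
    sleep 60
  done
  # Only resize if the single-worker create succeeded.
  [ "$state" = "succeeded" ] || return 1
  pks resize "$name" --num-nodes "$workers"
}

# Usage (not run here; values taken from the thread):
#   create_then_resize k8s-02 k8s-cluster-02.res01.unicc.org pl3 3
```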