VMware Tanzu Kubernetes Grid Integrated Edition


 pks cluster failed

yann s posted Oct 02, 2019 06:44 PM

 

yann s

How can I fix this issue?

yann s

pks cluster k8s-02

 

PKS Version:    1.5.0-build.32

Name:      k8s-02

K8s Version:    1.14.5

Plan Name:    pl3

UUID:      da8bae2b-c733-4f11-b802-8cdffce7e52b

Last Action:    CREATE

Last Action State:  failed

Last Action Description: Instance provisioning failed: There was a problem completing your request. Please contact your operations team providing the following information: service: p.pks, service-instance-guid: da8bae2b-c733-4f11-b802-8cdffce7e52b, broker-request-id: 663a97b1-5172-4a42-b4ab-94161d97d311, task-id: 4330, operation: create, error-message: 0 succeeded, 1 errored, 0 canceled

Kubernetes Master Host: k8s-cluster-02.res01.unicc.org

Kubernetes Master Port: 8443

Worker Nodes:    3

Kubernetes Master IP(s): In Progress

Network Profile Name:

 

 

Deployment 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b'

 

Instance          Process State AZ IPs   VM CID         VM Type  Active

master/59550f17-f971-4b88-a0ad-b554dbd7ace1 running  AZ-2 172.11.1.3 vm-74097b55-735f-444c-8103-ce0f237a762c medium.disk true

master/64f6da4c-32bb-46e2-a7d8-1a48c897daa1 running  AZ-1 172.11.1.2 vm-891f088e-10a9-42c2-8507-c11699dd49ae medium.disk true

master/ebd2c7e8-25d0-4209-9893-61311f4fa43d running  AZ-3 172.11.1.4 vm-47e82356-7b82-4408-89ca-a0d530879cad medium.disk true

worker/33ff86c0-eb96-4d68-8c58-b136efe17713 running  AZ-3 172.11.1.7 vm-659e0100-ce31-402e-b1e9-c5b750c8b246 medium.disk true

worker/4ba5644e-d2f1-4d87-9b28-cd327ea72869 running  AZ-2 172.11.1.6 vm-f766a7c1-6f96-4562-9e6e-4fd9fbdbc4ba medium.disk true

worker/9f01bb0b-ec2f-4cae-b906-8692f29f71a0 running  AZ-1 172.11.1.5 vm-56095c01-2296-4de9-aa1d-cf8a2cf78d4f medium.disk true

 

yann s

bosh events -d 'service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b'

 

Using environment 'boshvm.res01.unicc.org' as client 'ops_manager'

 

 

 

ID    Time       User           Action  Object Type Object Name                      Task ID Deployment            Instance            Context                             Error

 

30022  Wed Oct 2 15:43:03 UTC 2019  health_monitor  release  lock   lock:deployment:service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  4359  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  -  -  -

30021  Wed Oct 2 15:42:29 UTC 2019  health_monitor  create   alert  1570030949.20359261@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'nsx-node-agent (172.11.1.7) - Does not exist - restart. Alert @ 2019-10-02 15:42:29 UTC, severity 1: process is not running'  -

30020  Wed Oct 2 15:42:25 UTC 2019  health_monitor  create   alert  1570030945.384568188@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'nsx-node-agent (172.11.1.7) - Execution failed - alert. Alert @ 2019-10-02 15:42:25 UTC, severity 1: failed to start'  -

30019  Wed Oct 2 15:41:50 UTC 2019  health_monitor  create   alert  1570030910.376023438@localhost  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  worker/33ff86c0-eb96-4d68-8c58-b136efe17713  message: 'docker (172.11.1.7) - Does not exist - restart. Alert @ 2019-10-02 15:41:50 UTC, severity 1: process is not running'  -

30018  Wed Oct 2 15:38:40 UTC 2019  health_monitor  create   alert  d084ce4f-9439-4a6d-88e2-fb7aa60ba735  -  service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b  -  message: 'Recreating unresponsive VMs. Alert @ 2019-10-02 15:38:40 UTC, severity 4: Notifying Director to recreate instances: worker/33ff86c0-eb96-4d68-8c58-b136efe17713;

 

Broadcom Employee Shubham Sharma

Hello @yann s, can you share the task debug logs for task-id 4330?

bosh task 4330 --debug &> 4330.debug

bosh task 4330 --cpi &> 4330.cpi

yann s

Hi,

Attached is the requested output.

yann s

cpi output

yann s

debug output

Broadcom Employee Shubham Sharma

It looks like the add-on that rolls out the kube-dns and coredns deployments failed. This article might help you dig a little deeper: https://community.pivotal.io/s/article/How-to-troubleshoot-kube-dns-rollout-failure-when-apply-addons-errand-fails-to-start-all-system-specs-after-1200s

From the events you shared earlier, it looks like nsx-node-agent failed to start, so checking the hyperbus status will definitely help. The article covers those details as well.
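For example, something along these lines (a rough sketch against the affected worker from the events above; monit is in the standard BOSH location, but the exact nsxcli invocation can vary between NSX-T versions):

bosh -d service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b ssh worker/33ff86c0-eb96-4d68-8c58-b136efe17713

# on the worker VM:
sudo /var/vcap/bosh/bin/monit summary        # nsx-node-agent and docker should report "running"
sudo -i
nsxcli -c "get node-agent-hyperbus status"   # the hyperbus connection should report HEALTHY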

yann s

Thank you, Sharma.

 

The post helped me find the following:

 /var/vcap/packages/kubernetes/bin/kubectl get all --all-namespaces

NAMESPACE   NAME             READY  STATUS       RESTARTS  AGE

kube-system  pod/coredns-95489c5c9-2k9lx  0/1   ContainerCreating  0     31m

kube-system  pod/coredns-95489c5c9-2s5pd  1/1   Running       0     31m

kube-system  pod/coredns-95489c5c9-jfk4m  1/1   Running       0     31m

 

NAMESPACE   NAME         TYPE    CLUSTER-IP   EXTERNAL-IP  PORT(S)     AGE

default    service/kubernetes  ClusterIP  172.30.161.1  <none>    443/TCP     60m

kube-system  service/kube-dns   ClusterIP  172.30.161.2  <none>    53/UDP,53/TCP  31m

 

NAMESPACE   NAME           READY  UP-TO-DATE  AVAILABLE  AGE

kube-system  deployment.apps/coredns  2/3   3      2      31m

 

NAMESPACE   NAME                DESIRED  CURRENT  READY  AGE

kube-system  replicaset.apps/coredns-95489c5c9  3     3     2    31m

/var/vcap/packages/kubernetes/bin/kubectl describe pod coredns-95489c5c9-2k9lx -n kube-system

Name:        coredns-95489c5c9-2k9lx

Namespace:     kube-system

Priority:      2000000000

PriorityClassName: system-cluster-critical

Node:        4af86293-823e-44b9-b8a3-f1ad640d655b/172.11.1.5

Start Time:     Thu, 03 Oct 2019 09:38:35 +0000

Labels:       k8s-app=kube-dns

          pod-template-hash=95489c5c9

Annotations:    seccomp.security.alpha.kubernetes.io/pod: docker/default

Status:       Pending

IP:

Controlled By:   ReplicaSet/coredns-95489c5c9

Containers:

 coredns:

  Container ID:

  Image:     vmware/coredns:1.3.1

  Image ID:

  Ports:     53/UDP, 53/TCP, 9153/TCP

  Host Ports:  0/UDP, 0/TCP, 0/TCP

  Args:

   -conf

   /etc/coredns/Corefile

  State:     Waiting

   Reason:    ContainerCreating

  Ready:     False

  Restart Count: 0

  Limits:

   memory: 170Mi

  Requests:

   cpu:    100m

   memory:   70Mi

  Liveness:   http-get http://:8080/health delay=60s timeout=5s period=10s #success=1 #failure=5

  Environment: <none>

  Mounts:

   /etc/coredns from config-volume (ro)

   /var/run/secrets/kubernetes.io/serviceaccount from coredns-token-mx9mx (ro)

Conditions:

 Type       Status

 Initialized    True

 Ready       False

 ContainersReady  False

 PodScheduled   True

Volumes:

 config-volume:

  Type:   ConfigMap (a volume populated by a ConfigMap)

  Name:   coredns

  Optional: false

 coredns-token-mx9mx:

  Type:    Secret (a volume populated by a Secret)

  SecretName: coredns-token-mx9mx

  Optional:  false

QoS Class:    Burstable

Node-Selectors: <none>

Tolerations:   CriticalAddonsOnly

         node.kubernetes.io/not-ready:NoExecute for 300s

         node.kubernetes.io/unreachable:NoExecute for 300s

Events:

 Type   Reason         Age        From                      Message

 ----   ------         ----        ----                      -------

 Normal  Scheduled        32m        default-scheduler               Successfully assigned kube-system/coredns-95489c5c9-2k9lx to 4af86293-823e-44b9-b8a3-f1ad640d655b

 Warning FailedCreatePodSandBox 16s (x8 over 28m) kubelet, 4af86293-823e-44b9-b8a3-f1ad640d655b Failed create pod sandbox: rpc error: code = DeadlineExceeded desc = context deadline exceeded

 Normal  SandboxChanged     16s (x8 over 28m) kubelet, 4af86293-823e-44b9-b8a3-f1ad640d655b Pod sandbox changed, it will be killed and re-created.

 

 

Broadcom Employee Shubham Sharma

@yann s, could you check the status of the node that the pod is running on? The kubelet logs from that node will be helpful to troubleshoot this further.
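For example, a minimal sketch (the node in the describe output, 4af86293-823e-44b9-b8a3-f1ad640d655b / 172.11.1.5, corresponds to worker/9f01bb0b-ec2f-4cae-b906-8692f29f71a0 in the instance list above; the kubelet log path assumes the usual BOSH job log directory):

/var/vcap/packages/kubernetes/bin/kubectl describe node 4af86293-823e-44b9-b8a3-f1ad640d655b

bosh -d service-instance_da8bae2b-c733-4f11-b802-8cdffce7e52b ssh worker/9f01bb0b-ec2f-4cae-b906-8692f29f71a0
sudo tail -n 200 /var/vcap/sys/log/kubelet/kubelet.stderr.log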

yann s

I've found a workaround.

Create a cluster with one worker, then resize the cluster (see the sketch below).

I don't know if anyone else has run into the same issue: with Ops Manager 2.7 + PKS 1.5 + NSX-T 2.5, creating a cluster with 3 workers always failed, while it works with 1 worker.
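Roughly, with the standard PKS CLI (plan and hostname taken from the cluster details above):

pks create-cluster k8s-02 --external-hostname k8s-cluster-02.res01.unicc.org --plan pl3 --num-nodes 1

# wait until "pks cluster k8s-02" shows Last Action State: succeeded, then scale out
pks resize k8s-02 --num-nodes 3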