VMware Tanzu Kubernetes Grid Integrated Edition

 CoreDNS failing on new cluster

David Holder posted Jun 21, 2019 06:40 PM

Hi everyone,

 

I'm running PKS 1.4.1 (I observed this in 1.4 as well) with NSX-T 2.4.1, and the PKS deployment was run with the NSX-T errands enabled, all of which passed.

 

After building a cluster I'm noticing that the CoreDNS pods are failing quite regularly:

 

david@srv-jmp-01:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE
kube-system   coredns-54586579f6-lv6t9   0/1     CrashLoopBackOff   7          21m
kube-system   coredns-54586579f6-wf2sl   0/1     CrashLoopBackOff   8          21m
kube-system   coredns-54586579f6-xq7p7   0/1     CrashLoopBackOff   7          21m

Logs:

 

david@srv-jmp-01:~$ kubectl logs coredns-54586579f6-lv6t9 -n kube-system
.:53
2019-06-21T15:33:10.120Z [INFO] CoreDNS-1.3.1
2019-06-21T15:33:10.120Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-06-21T15:33:10.120Z [INFO] plugin/reload: Running configuration MD5 = 71a0a36d5100fe9c474bb60e380cfd52
2019-06-21T15:33:30.124Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:57845->169.254.0.2:53: i/o timeout
2019-06-21T15:33:31.870Z [ERROR] plugin/errors: 2 localhost. A: unreachable backend: read udp 172.17.2.3:49264->169.254.0.2:53: i/o timeout
2019-06-21T15:33:33.123Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:35367->169.254.0.2:53: i/o timeout
2019-06-21T15:33:34.435Z [ERROR] plugin/errors: 2 localhost. AAAA: unreachable backend: read udp 172.17.2.3:44951->169.254.0.2:53: i/o timeout
2019-06-21T15:33:36.126Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:56298->169.254.0.2:53: i/o timeout
2019-06-21T15:33:36.872Z [ERROR] plugin/errors: 2 localhost. AAAA: unreachable backend: read udp 172.17.2.3:53871->169.254.0.2:53: i/o timeout

Services:

david@srv-jmp-01:~$ kubectl get svc -n kube-system
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
kube-dns               ClusterIP   10.100.200.2     <none>        53/UDP,53/TCP   28m
kubernetes-dashboard   NodePort    10.100.200.225   <none>        443:30504/TCP   28m
metrics-server         ClusterIP   10.100.200.118   <none>        443/TCP         28m

Testing DNS fails:

 

david@srv-jmp-01:~$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.100.200.2
Address 1: 10.100.200.2 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
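To separate in-cluster records (answered by CoreDNS's kubernetes plugin) from names that have to be forwarded upstream, the next checks I plan to run from the busybox pod look roughly like this (just a sketch; it relies on busybox's built-in nslookup):

kubectl exec -ti busybox -- nslookup kubernetes.default.svc.cluster.local   # in-cluster record, no upstream needed
kubectl exec -ti busybox -- nslookup www.google.com                         # external record, must be forwarded upstream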

 

resolv.conf from a test pod:

 

david@srv-jmp-01:~$ kubectl exec busybox cat /etc/resolv.conf
nameserver 10.100.200.2
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

 

ConfigMap:

 

david@srv-jmp-01:~$ kubectl describe configmap coredns -n kube-system
Name:         coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","data":{"Corefile":".:53 {\n errors\n health\n kubernetes cluster.local in-addr.arpa ip6.arpa {\n pods in...

Data
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf {
        policy sequential # needed for workloads to be able to use BOSH-DNS
    }
    cache 30
    loop
    reload
    loadbalance
}

Events:  <none>

Has anyone come across this before, or have any suggestions on what to try?

 

Thanks,

 

David

 

Daniel Mikusa

>2019-06-21T15:33:31.870Z [ERROR] plugin/errors: 2 localhost. A: unreachable backend: read udp 172.17.2.3:49264->169.254.0.2:53: i/o timeout

 

Not exactly sure what's happening, but this jumps out as a red flag. You shouldn't see these errors in the CoreDNS logs. You might want to do some troubleshooting to see why traffic on port 53 appears to be blocked or not allowed.
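One way to check that from the pod network is a throwaway pod with DNS tooling in it, pointed straight at the upstream that CoreDNS is timing out against. A rough sketch (the image name is just an example; use any image that ships dig):

kubectl run -ti --rm dnstest --image=tutum/dnsutils --restart=Never -- dig @169.254.0.2 www.google.com +time=2 +tries=1

If that times out too, the problem is between the pod network and 169.254.0.2 rather than anything CoreDNS-specific.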

Daniel Mikusa

If there are no networking issues, the other cause could be that the service is not available/listening on the port. That IP could be for Bosh DNS. Did you check the Bosh DNS logs for errors? Make sure it's running and responding.
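On a BOSH-deployed VM the bosh-dns job normally logs under /var/vcap/sys/log/bosh-dns/, so something along these lines from a master or worker VM should show whether it's healthy (exact file names can vary by release version):

# after bosh ssh'ing onto the node
sudo /var/vcap/bosh/bin/monit summary | grep bosh-dns   # is the job running?
sudo tail -n 50 /var/vcap/sys/log/bosh-dns/*.log        # recent bosh-dns output/errors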

Daniel Mikusa

Also, it's probably worth trying to query DNS directly. From the VM and from a CoreDNS container, run `dig @169.254.0.2 www.google.com` (or whatever name you want to look up).
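For getting onto the VMs to run that, something like the following usually works for PKS-provisioned clusters; the deployment name follows the service-instance_<cluster UUID> pattern, so adjust to whatever `bosh deployments` actually shows:

bosh deployments                               # find the service-instance_<uuid> deployment for this cluster
bosh -d service-instance_<uuid> vms            # list the master/worker instances
bosh -d service-instance_<uuid> ssh worker/0   # then run the dig from there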

David Holder

Just to add: the DNS config from one of the master nodes:

 

master/11286437-3d71-46ed-aa60-76b2a7c63084:~$ cat /etc/resolv.conf
# This file was automatically updated by bosh-dns
nameserver 169.254.0.2
nameserver 172.16.10.30

 

David Holder

Hi Daniel,

 

No firewalls are blocking traffic; NSX-T is currently configured to allow everything. If I'm reading that correctly, the pod is trying to reach DNS on 169.254.0.2:53, which is the resolver created by bosh-dns on the respective Kubernetes nodes?
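To confirm that, what I plan to check on one of the nodes is whether anything is actually listening on that address (a sketch; assumes ss is available on the stemcell):

# on a master or worker VM
sudo ss -lnup | grep ':53'   # UDP listeners on port 53
sudo ss -lntp | grep ':53'   # TCP listeners on port 53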

David Holder

User error on my part. I typo'd the pod CIDR range as 172.16.17.0/24, which cannot be used. After changing it, the cluster is working as expected.
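For anyone who hits the same symptoms, a quick sanity check is to compare the pod CIDRs the cluster thinks it has against your infrastructure subnets (a sketch; the podCIDR field may be empty depending on the CNI):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'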