VMware Tanzu Kubernetes Grid Integrated Edition

 CoreDNS failing on new cluster

David Holder posted Jun 21, 2019 06:40 PM

Hi everyone,

 

I'm running PKS 1.4.1 (I observed this in 1.4 as well) with NSX-T 2.4.1, and the PKS deployment was run with the NSX-T errands enabled, all of which passed.

 

After building a cluster I'm noticing that the CoreDNS pods are failing quite regularly:

 

david@srv-jmp-01:~$ kubectl get pods --all-namespaces
NAMESPACE     NAME                       READY   STATUS             RESTARTS   AGE
kube-system   coredns-54586579f6-lv6t9   0/1     CrashLoopBackOff   7          21m
kube-system   coredns-54586579f6-wf2sl   0/1     CrashLoopBackOff   8          21m
kube-system   coredns-54586579f6-xq7p7   0/1     CrashLoopBackOff   7          21m

Logs:

 

david@srv-jmp-01:~$ kubectl logs coredns-54586579f6-lv6t9 -n kube-system
.:53
2019-06-21T15:33:10.120Z [INFO] CoreDNS-1.3.1
2019-06-21T15:33:10.120Z [INFO] linux/amd64, go1.11.4, 6b56a9c
CoreDNS-1.3.1
linux/amd64, go1.11.4, 6b56a9c
2019-06-21T15:33:10.120Z [INFO] plugin/reload: Running configuration MD5 = 71a0a36d5100fe9c474bb60e380cfd52
2019-06-21T15:33:30.124Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:57845->169.254.0.2:53: i/o timeout
2019-06-21T15:33:31.870Z [ERROR] plugin/errors: 2 localhost. A: unreachable backend: read udp 172.17.2.3:49264->169.254.0.2:53: i/o timeout
2019-06-21T15:33:33.123Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:35367->169.254.0.2:53: i/o timeout
2019-06-21T15:33:34.435Z [ERROR] plugin/errors: 2 localhost. AAAA: unreachable backend: read udp 172.17.2.3:44951->169.254.0.2:53: i/o timeout
2019-06-21T15:33:36.126Z [ERROR] plugin/errors: 2 8972820080328324719.7794984999946480610. HINFO: unreachable backend: read udp 172.17.2.3:56298->169.254.0.2:53: i/o timeout
2019-06-21T15:33:36.872Z [ERROR] plugin/errors: 2 localhost. AAAA: unreachable backend: read udp 172.17.2.3:53871->169.254.0.2:53: i/o timeout

Services:

david@srv-jmp-01:~$ kubectl get svc -n kube-system
NAME                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)         AGE
kube-dns               ClusterIP   10.100.200.2     <none>        53/UDP,53/TCP   28m
kubernetes-dashboard   NodePort    10.100.200.225   <none>        443:30504/TCP   28m
metrics-server         ClusterIP   10.100.200.118   <none>        443/TCP         28m

Testing DNS fails:

 

david@srv-jmp-01:~$ kubectl exec -ti busybox -- nslookup kubernetes.default
Server:    10.100.200.2
Address 1: 10.100.200.2 kube-dns.kube-system.svc.cluster.local

nslookup: can't resolve 'kubernetes.default'
command terminated with exit code 1
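To separate in-cluster records (answered by CoreDNS's kubernetes plugin) from names that have to be forwarded upstream, the next checks I plan to run from the busybox pod look roughly like this (just a sketch; it relies on busybox's built-in nslookup):

kubectl exec -ti busybox -- nslookup kubernetes.default.svc.cluster.local   # in-cluster record, no upstream needed
kubectl exec -ti busybox -- nslookup www.google.com                         # external record, must be forwarded upstream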

 

resolv.conf from a test pod:

 

david@srv-jmp-01:~$ kubectl exec busybox cat /etc/resolv.conf
nameserver 10.100.200.2
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

 

ConfigMap:

 

david@srv-jmp-01:~$ kubectl describe configmap coredns -n kube-system
Name:         coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","data":{"Corefile":".:53 {\n errors\n health\n kubernetes cluster.local in-addr.arpa ip6.arpa {\n pods in...

Data
====
Corefile:
----
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        upstream
        fallthrough in-addr.arpa ip6.arpa
    }
    prometheus :9153
    proxy . /etc/resolv.conf {
        policy sequential # needed for workloads to be able to use BOSH-DNS
    }
    cache 30
    loop
    reload
    loadbalance
}

Events:  <none>

Has anyone come across this before, or have any suggestions on what to try?

 

Thanks,

 

David

 

Daniel Mikusa

>2019-06-21T15:33:31.870Z [ERROR] plugin/errors: 2 localhost. A: unreachable backend: read udp 172.17.2.3:49264->169.254.0.2:53: i/o timeout

 

Not exactly sure what's happening, but this jumps out as a red flag. You shouldn't see these errors in the CoreDNS logs. You might want to do some troubleshooting to see why traffic on port 53 appears to be blocked or not allowed.
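One way to check that from the pod network is a throwaway pod with DNS tooling in it, pointed straight at the upstream that CoreDNS is timing out against. A rough sketch (the image name is just an example; use any image that ships dig):

kubectl run -ti --rm dnstest --image=tutum/dnsutils --restart=Never -- dig @169.254.0.2 www.google.com +time=2 +tries=1

If that times out too, the problem is between the pod network and 169.254.0.2 rather than anything CoreDNS-specific.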

Daniel Mikusa

If there are no networking issues, the other cause could be that the service is not available/listening on the port. That IP could be for Bosh DNS. Did you check the Bosh DNS logs for errors? Make sure it's running and responding.
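On a BOSH-deployed VM the bosh-dns job normally logs under /var/vcap/sys/log/bosh-dns/, so something along these lines from a master or worker VM should show whether it's healthy (exact file names can vary by release version):

# after bosh ssh'ing onto the node
sudo /var/vcap/bosh/bin/monit summary | grep bosh-dns   # is the job running?
sudo tail -n 50 /var/vcap/sys/log/bosh-dns/*.log        # recent bosh-dns output/errors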

Daniel Mikusa

Also, it's probably worth trying to query DNS directly. From the VM and from a CoreDNS container, run `dig @169.254.0.2 www.google.com` (or whatever name you want to look up).
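For getting onto the VMs to run that, something like the following usually works for PKS-provisioned clusters; the deployment name follows the service-instance_<cluster UUID> pattern, so adjust to whatever `bosh deployments` actually shows:

bosh deployments                               # find the service-instance_<uuid> deployment for this cluster
bosh -d service-instance_<uuid> vms            # list the master/worker instances
bosh -d service-instance_<uuid> ssh worker/0   # then run the dig from there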

David Holder

Just to add: the DNS config from one of the master nodes:

 

master/11286437-3d71-46ed-aa60-76b2a7c63084:~$ cat /etc/resolv.conf
# This file was automatically updated by bosh-dns
nameserver 169.254.0.2
nameserver 172.16.10.30

 

David Holder

Hi Daniel,

 

No firewalls are blocking traffic; NSX-T is currently configured to allow everything. If I'm reading that correctly, the pod is trying to reach DNS on 169.254.0.2:53, which is the resolver created by bosh-dns on the respective Kubernetes nodes?
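To confirm that, what I plan to check on one of the nodes is whether anything is actually listening on that address (a sketch; assumes ss is available on the stemcell):

# on a master or worker VM
sudo ss -lnup | grep ':53'   # UDP listeners on port 53
sudo ss -lntp | grep ':53'   # TCP listeners on port 53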

David Holder

User error on my part. I typo'd the pod CIDR range as 172.16.17.0/24, which cannot be used. After changing it, the cluster is working as expected.
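For anyone who hits the same symptoms, a quick sanity check is to compare the pod CIDRs the cluster thinks it has against your infrastructure subnets (a sketch; the podCIDR field may be empty depending on the CNI):

kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'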