In a healthy Kubernetes environment, the CoreDNS service is the silent resolver. It ensures that when your application tries to reach orders-db, the request is routed to the correct internal IP. But what happens during those critical first seconds of a cluster start if your primary configured DNS server is unreachable?
Many engineers are surprised to find that instead of simply failing, the cluster often falls back to public defaults: Google (8.8.8.8) and Cloudflare (1.1.1.1). While this keeps the cluster alive, it creates a "non-critical" failure for security and internal service discovery: non-critical in the sense that the cluster comes up and keeps running, but fatal in that it can no longer resolve the internal applications served by your internal DNS.
The Chain of Command: How Kubernetes Resolves DNS
To understand the fallback, you have to look at the Kubelet on each node. By default, the Kubelet looks at the host’s /etc/resolv.conf to determine how to handle DNS for the pods it manages.
- Pod Request: A pod requests myservice.namespace.svc.cluster.local.
- CoreDNS Check: CoreDNS checks its local records for a cluster-internal match.
- Upstream Forwarding: If the address is external (e.g., google.com), CoreDNS consults its forward plugin configuration.
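In a default kubeadm-style deployment, this hand-off is visible in the CoreDNS Corefile: cluster-internal names are answered by the kubernetes plugin, and everything else is forwarded to whatever nameservers the node's /etc/resolv.conf lists. A minimal sketch (your Corefile may differ):

```
.:53 {
    errors
    health
    # Cluster-internal names are answered from the Kubernetes API
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Everything else is forwarded to the node's resolv.conf nameservers
    forward . /etc/resolv.conf
    cache 30
}
```

That single `forward . /etc/resolv.conf` line is the bridge through which host-level DNS settings leak into the cluster.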
The "Silent" Fallback Mechanism
If you have configured a custom upstream DNS server in your environment (common in corporate labs), Kubernetes expects that server to be available at the moment the CoreDNS pods initialize.
However, if the primary upstream server is down or blocked during those first lookups:
- The resolv.conf Inheritance: If the host node (the Ubuntu/WSL instance) has a multi-tiered DNS configuration, the Kubelet inherits every nameserver listed in /etc/resolv.conf.
- Hardcoded Resilience: Many modern Linux distributions and cloud-init scripts ship with Google (8.8.8.8) and Cloudflare (1.1.1.1) pre-configured as secondary or tertiary "safety nets."
- CoreDNS Behavior: If CoreDNS cannot reach the first IP in its forward list, it cycles to the next available one. If your corporate DNS is lagging, CoreDNS quickly hops over to 1.1.1.1.
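The cycling behaviour described above can be sketched in a few lines of Python. This is illustrative only, not CoreDNS's actual implementation; the server IPs and the 2000 ms timeout are invented placeholders:

```python
# Sketch of sequential upstream fallback: try each server in order,
# paying a full timeout for every server that does not answer.
UPSTREAMS = ["10.0.0.53", "8.8.8.8", "1.1.1.1"]  # internal DNS first, public "safety nets" after
TIMEOUT_MS = 2000  # hypothetical per-server wait before moving on

def resolve(name, reachable):
    """Try each upstream in order; return (server_used, elapsed_ms)."""
    elapsed = 0
    for server in UPSTREAMS:
        if reachable(server):
            return server, elapsed
        elapsed += TIMEOUT_MS  # timed out waiting on this server
    raise RuntimeError(f"all upstreams failed for {name}")

# Healthy: the internal server answers immediately.
print(resolve("example.com", lambda s: True))              # ('10.0.0.53', 0)

# Internal DNS down: the query silently lands on a public resolver
# after one full timeout -- and internal names now return NXDOMAIN.
print(resolve("example.com", lambda s: s != "10.0.0.53"))  # ('8.8.8.8', 2000)
```

The key point the sketch makes: the caller never learns that a fallback happened; it only sees extra latency on the first query and, later, failures for internal names.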
Why is this a Problem?
You might think, "As long as it works, who cares?" But for an ITOM professional, this fallback causes three major issues:
- Loss of Internal Visibility: Public DNS servers have zero knowledge of your internal corporate lab IPs. If your cluster falls back to Google, it won't be able to resolve your internal repo1 or your eulab-vpn-itom endpoints.
- Security Leaks: Your cluster starts sending "internal" lookup requests to the public internet, leaking information about your internal service architecture to external providers.
- Intermittent Performance "Ghosts": Even after the primary DNS recovers, the cluster may stay on the fallback until the CoreDNS pods are restarted, leading to inconsistent behavior where some services work and others don't.
Measured Impact: Performance vs. Reliability
Referring to our previous I/O and network speed tests, the delay of a failed DNS lookup can be interpreted by the application as a network timeout.
| Scenario | DNS Provider | Result | Latency Impact |
| --- | --- | --- | --- |
| Healthy | Internal Lab DNS | Success | < 5 ms |
| Delayed Start | Internal (Timeout) | Fallback Triggered | +2000 ms (wait time) |
| Fallback Active | Google/Cloudflare | External OK / Internal FAIL | High (failures) |
How to Fix It
To prevent your cluster from "escaping" to the public web during a DNS outage:
- Explicit Forwarding: Manually define your CoreDNS Corefile to forward only to specific internal IPs, and remove any reference to 8.8.8.8 from your host's /etc/resolv.conf.
- Kubelet Flags: Use the --resolv-conf flag to point the Kubelet to a specific, hardened DNS file that contains no public fallbacks.
- Health Checks: Implement a startup probe that verifies the internal DNS server is reachable before application pods are allowed to go "Ready."
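The first fix amounts to replacing `forward . /etc/resolv.conf` with an explicit server list in the coredns ConfigMap. A sketch, where 10.0.0.53 stands in for your internal lab DNS:

```
.:53 {
    errors
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    # Forward only to the internal lab DNS -- no /etc/resolv.conf,
    # so a host-level 8.8.8.8 entry can never leak in.
    forward . 10.0.0.53
    cache 30
}
```

With this Corefile, an internal DNS outage now fails loudly (SERVFAIL) instead of silently escaping to the public internet, which is usually the behavior you want in a corporate lab.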
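The health-check fix can be expressed as a startupProbe on the application container. A sketch, where internal-repo.lab.local and 10.0.0.53 are hypothetical placeholders for a known internal name and your internal resolver:

```yaml
# Sketch: hold the pod back until the internal DNS answers for a
# name that only it can resolve. Names and IP are placeholders.
startupProbe:
  exec:
    command:
      - sh
      - -c
      - "nslookup internal-repo.lab.local 10.0.0.53"
  periodSeconds: 5
  failureThreshold: 12   # allow up to ~60s for the internal DNS to come up
```

Because the startup probe gates readiness and liveness checks, the pod simply waits for DNS instead of starting, failing its internal lookups, and crash-looping.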