DX Operational Observability


How Network I/O Delays Suffocate Your Containers

By Jorg Mertin posted 28 days ago


In the world of microservices and Kubernetes, containers promise agility, scalability, and efficiency.

We package our applications, deploy them, and expect lightning-fast performance. But often, there's a silent killer lurking beneath the surface, especially when our applications start to scale: network-induced I/O delays.

It's easy to assume I/O is all about disk reads and writes. However, in a distributed container environment, a significant portion of "I/O" is actually network-bound. Every time your application needs to talk to another microservice, a database, or a persistent storage volume like NFS, it is performing network I/O.

The Real-World Performance Gap: Local vs. Network I/O

To understand the tangible impact of network-induced I/O delays, we compared local disk performance against NFS-attached storage. The data below illustrates how shifting from local bus communication to a network protocol can significantly degrade your IOPS (Input/Output Operations Per Second) and throughput. We designed the test so that the differences are clearly visible.
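Benchmarks of this kind are typically produced with a tool such as fio. The job file below is a sketch of a 4 KiB random-read test in that style; the paths, file size, and queue depth are assumptions for illustration, not the exact parameters used for the numbers in this article.

```ini
[randread-test]
ioengine=libaio       ; asynchronous I/O engine on Linux
rw=randread           ; random reads, as in the first table row
bs=4k                 ; 4 KiB blocks for the random-access tests
direct=1              ; bypass the page cache to measure the device
iodepth=32            ; outstanding I/Os per job (assumed value)
size=4g               ; size of the test file (assumed value)
runtime=60            ; run for 60 seconds
time_based=1
directory=/mnt/test   ; point at local disk or at the NFS mount to compare
```

Running the same job file once against a local directory and once against an NFS mount makes the two storage paths directly comparable.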

| Access Type | Metric | Broadcom Recommended | Local Disk (RAID 0) | NFS Test Performance |
|---|---|---|---|---|
| Random Read | IOPS | 122k | 163k | 14k |
| Random Read | Speed | 476 MiB/s | 636 MiB/s | 56 MiB/s |
| Random Write | IOPS | 29k | 75k | 13k |
| Random Write | Speed | 115 MiB/s | 295 MiB/s | 50 MiB/s |
| Seq. Read (64 KiB) | IOPS | 82k | 58k | 9k |
| Seq. Read (64 KiB) | Speed | 5172 MiB/s | 3638 MiB/s | 562 MiB/s |
| Seq. Write (64 KiB) | IOPS | 41k | 2k | 8k |
| Seq. Write (64 KiB) | Speed | 2622 MiB/s | 174 MiB/s | 505 MiB/s |

You also need to account for the impact this has further down the chain of applications: each delay propagates downstream, much like congestion in road traffic.

Key Observations from the Data

  • Devastating Random I/O Latency: In random read scenarios—typical for database workloads—NFS performance dropped to just 14k IOPS, a massive decrease compared to the 163k IOPS achieved by local disk RAID 0.

  • Failure to Meet Recommendations: While the local disk setup comfortably exceeds Broadcom’s recommended 122k IOPS for random reads, the NFS environment fails to reach even 12% of that target.

  • Sequential Read Impact: Sequential read speeds also suffer heavily, with NFS reaching only 562 MiB/s, compared to the 3638 MiB/s seen on local disk RAID 0 and the Broadcom recommendation of 5172 MiB/s.
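As a quick sanity check on the percentages above, the ratios follow directly from the table figures (a minimal Python sketch; the numbers are from the benchmark table, the variable names are ours):

```python
# Random-read IOPS figures from the benchmark table above.
recommended_iops = 122_000  # Broadcom recommended
local_raid0_iops = 163_000  # local disk, RAID 0
nfs_iops = 14_000           # NFS-attached storage

# NFS as a fraction of the recommendation and of local disk.
pct_of_recommended = nfs_iops / recommended_iops * 100
pct_of_local = nfs_iops / local_raid0_iops * 100

print(f"NFS reaches {pct_of_recommended:.1f}% of the recommended IOPS")  # 11.5%
print(f"NFS reaches {pct_of_local:.1f}% of local RAID 0 IOPS")           # 8.6%
```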

Why Network I/O Becomes a Bottleneck in Containers

  1. Increased Hops & Indirection: Every microservice call or sidecar proxy in a service mesh adds processing time and latency.

  2. Shared Resources: Containers on a single node share the same physical network interface; a "noisy neighbor" can starve others of bandwidth.

  3. Remote Storage Latency: As shown in the data, moving storage to the network (NFS) transforms high-speed local disk operations into much slower network calls.
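The compounding effect of these factors can be illustrated with a toy model (a sketch with made-up latency figures, not measurements): in a synchronous call chain, every hop's processing time and network delay add up serially, so a per-hop penalty multiplies across the chain.

```python
def end_to_end_latency_ms(hops):
    """Total latency of a synchronous call chain: each hop's
    service time and network/storage delay add up serially."""
    return sum(service_ms + io_ms for service_ms, io_ms in hops)

# Hypothetical 4-service chain: (processing ms, I/O delay ms) per hop.
# The same chain with fast local storage vs. slow network storage:
local_storage = [(2, 1), (3, 1), (2, 1), (5, 1)]
nfs_storage   = [(2, 8), (3, 8), (2, 8), (5, 8)]

print(end_to_end_latency_ms(local_storage))  # 16
print(end_to_end_latency_ms(nfs_storage))    # 44
```

The processing work is identical in both runs; the slower per-hop I/O alone nearly triples the end-to-end response time.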

What Can You Do?

  • Local Caching: Use in-memory caching to reduce the frequency of network calls.

  • Network-Aware Scheduling: Use Kubernetes features to co-locate interdependent services on the same node.

  • Fetch Once, Cache Locally: When developing applications, make sure data is requested only once over a remote connection, then work from a local cache for further processing.
  • Observability is Key: Deploy robust full-stack observability solutions like Broadcom DX O2. By leveraging eBPF probes, you can track network latency and I/O wait times at the kernel level with zero-touch, identifying these hidden bottlenecks before they sabotage your applications.
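The "fetch once, cache locally" advice can be sketched in a few lines of Python; here `functools.lru_cache` stands in for any local cache, and `fetch_user` is a hypothetical remote call, not a real API.

```python
from functools import lru_cache

CALLS = 0  # counts how often the "network" is actually hit

@lru_cache(maxsize=1024)
def fetch_user(user_id: int) -> dict:
    """Hypothetical remote lookup; real code would call a service or DB."""
    global CALLS
    CALLS += 1
    return {"id": user_id, "name": f"user-{user_id}"}

# Three lookups for the same user cost only one remote round trip.
for _ in range(3):
    fetch_user(42)

print(CALLS)  # prints 1
```

In production, an in-memory cache would also need a size bound and an expiry policy so stale data does not linger; `lru_cache` handles only the size bound.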
