A colleague recently pointed me to “11 Myths of RoCE.” Previously, RoCE’s version proliferation had put me in mind of a certain boxing movie hexology. But this article’s remarkable assertions brought to mind a more cultish classic, the Rocky Horror franchise, which is many things: a parody, a tribute, a stage musical and a movie, often with live performance art. Its characters are never quite what they seem.
Not all the article’s 11 myths seem like myths (as in, “does anyone really think that?”) but, in fairness, once upon a time debunking 5 or 7 myths was plenty. Now, thanks to another cult parody tribute, you must go up to 11, spawning extra myths to debunk. For the sake of enjoyment, a willing suspension of disbelief comes in handy.
Also, to appreciate a good parody, familiarity with the original will help. I can only recap the species of RoCE I know of:
- RoCE, the original. “RDMA over Converged Ethernet” premiered in 2010, produced and directed by the IBTA. A link layer protocol that runs directly over “lossless” Ethernet using PFC (Priority Flow Control), RoCEv1 is definitely not routable. Since iWARP does RDMA over TCP, RoCE is faster. Okay, check.
- The first sequel, RoCEv2, a.k.a. “routable RoCE”, debuted in 2014, with UDP/IP added to the cast. V2 adds the optional “Congestion Notification Packet,” or CNP, to exploit the IETF’s Explicit Congestion Notification (ECN), the end-to-end congestion signaling scheme originally built for TCP/IP. Using CNPs assumes RFC 3168 routers plus sender and receiver algorithms, but these apparently didn’t make the final editing cut.
- A 2015 SIGCOMM DCQCN paper (by Microsoft, Mellanox, and UC Santa Barbara) defines some CNP algorithms. It’s not really a sequel (more like a director’s cut reissue with “new unseen footage”?) but arguably a distinct version.
- Next up, another Microsoft (Azure) production, also presented in a SIGCOMM paper (that says RDMA needs lossless and PFC). The paper (which weirdly says the C in RoCE is “commodity”) seems pro-RoCE but also lists several RoCE-related problems, including “livelock” and “deadlock”. The unlikely solution involved plenty of special vendor code, plus setting Ethernet’s PFC field from IP’s DSCP field. It’s not a “layer violation”, it’s a “feature!” As an oft-referenced “large network,” it’s arguably a “de facto” standard. One competitor snarks that it’s “RoCEv4.” The IBTA disavows the term, but it’s a bit sticky, or at least tacky.
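For readers who want to see the plumbing behind two of these bullets, here is a minimal sketch, not anyone's shipping code. The bit layout (six DSCP bits plus two ECN bits in the IP ToS / Traffic Class byte, with `11` meaning "congestion experienced") comes from RFC 2474 and RFC 3168. The function names and the DSCP-to-priority table are my own illustrative assumptions; a real deployment picks its own mapping.

```python
from typing import Tuple

# ECN codepoints from RFC 3168 (low two bits of the ToS / Traffic Class byte).
ECN_NOT_ECT = 0b00  # sender is not ECN-capable
ECN_ECT_1 = 0b01    # ECN-capable transport, codepoint 1
ECN_ECT_0 = 0b10    # ECN-capable transport, codepoint 0
ECN_CE = 0b11       # congestion experienced, set by a router


def split_tos(tos: int) -> Tuple[int, int]:
    """Split the 8-bit ToS / Traffic Class byte into (DSCP, ECN)."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("ToS must fit in one byte")
    return tos >> 2, tos & 0b11


def mark_congestion(tos: int) -> int:
    """Mimic an ECN-aware router: set CE on ECN-capable packets.

    Packets marked Not-ECT are returned unchanged; a real router
    would have to drop those to signal congestion.
    """
    dscp, ecn = split_tos(tos)
    if ecn == ECN_NOT_ECT:
        return tos
    return (dscp << 2) | ECN_CE


# HYPOTHETICAL operator-chosen table: DSCP value -> PFC priority (0-7).
# These numbers are made up for illustration, not a vendor default.
DSCP_TO_PFC_PRIORITY = {26: 3, 48: 6}


def pfc_priority(tos: int, default: int = 0) -> int:
    """Pick the PFC priority from the IP header's DSCP instead of a VLAN tag."""
    dscp, _ = split_tos(tos)
    return DSCP_TO_PFC_PRIORITY.get(dscp, default)
```

With this toy table, a packet carrying DSCP 26 and ECT(0) lands in priority 3, and a congested router would flip its ECN bits to CE rather than drop it. What the receiver and sender then do about that mark is the part the CNP algorithms have to fill in.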
Well, that’s where I thought things stood as I studied the article… 2 to 4 versions of RoCE, all needing lossless Ethernet for good performance. So I was boggled that the first “myth” of RoCE was that it needs a lossless network. Wait… what? Without lossless, RoCE needs to retransmit, just like iWARP. Then the article soon says that RoCE beats other Ethernet-based RDMA like iWARP. With my “parody bit” still not turned on, I shook my head, and slogged through a dismissal of “deployment difficulties”. As if. Dell EMC’s Erik Smith has posted about “the level of complexity required to properly configure it to avoid issues with congestion spreading.” (Erik’s blog isn’t official, but this related video is.) “Interoperability between vendors is unreliable” is another supposed myth. But there’s no standard for CNP algorithms, so vendors are free to choose their own, and yet they must interoperate?
As I grumbled about these pseudo-myths, I was startled to hear from another colleague about a quiet (art film?) new RoCE production, disavowing PFC. Whoa! Time to extend that earlier recap!
- This new “implementation” uses new CNP sender and receiver algorithms (and format?) to enable UDP-based RoCE to do “selective retransmit” (like TCP in RFC 2018). This oxymoronic RoCE runs on vanilla Ethernet, and outperforms RoCE over PFC-based lossless Ethernet.
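Why would selective retransmit matter so much on a lossy network? Here is a toy model, not the actual algorithm of any RoCE implementation. It compares how many packets get resent when the sender rewinds to the first missing packet and resends everything after it (go-back-N, used here purely as a contrasting baseline) versus resending only what was lost (the idea behind TCP SACK in RFC 2018). The function names are mine, and the model assumes a single burst in which every retransmission arrives safely.

```python
from typing import Iterable


def _checked(total: int, lost: Iterable[int]) -> set:
    """Validate loss indices against the burst size and de-duplicate them."""
    lost_set = set(lost)
    if any(seq < 0 or seq >= total for seq in lost_set):
        raise ValueError("lost sequence numbers must be in range(total)")
    return lost_set


def go_back_n_resends(total: int, lost: Iterable[int]) -> int:
    """Resend everything from the first missing packet to the end of the burst."""
    lost_set = _checked(total, lost)
    if not lost_set:
        return 0
    return total - min(lost_set)


def selective_resends(total: int, lost: Iterable[int]) -> int:
    """Resend only the packets that actually went missing."""
    return len(_checked(total, lost))


if __name__ == "__main__":
    burst = 1000
    dropped = {10, 500}
    print(go_back_n_resends(burst, dropped))   # 990
    print(selective_resends(burst, dropped))   # 2
```

In this simplified picture, two drops in a 1000-packet burst cost 990 resends one way and 2 the other, which is roughly why a rewind-style scheme wants a lossless fabric while a selective one can live without it.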
Darn, they were right! RoCE has morphed again, and its need for lossless Ethernet is, bizarrely, now a myth. And I’d bought the myth! Hilarious! What a knee slapper!
In my defense, the acronym itself says “converged Ethernet,” an old synonym for DCB, which uses PFC. This latest non-PFC RoCE is clearly a new version. I shall call him RoCEv5. (Side note: I asked a Broadcom contact, who told me that their RNICs cannot, ahem, interoperate with this new mode.) A brief Inigo Montoya moment is understandable among observers. What, exactly, does “RoCE” mean?
The “myth buster” article says:
- “RoCE” started in 2010: v1, runs directly on converged PFC Ethernet and can’t scale
- “RoCE” has been deployed at scale: v3 (or v4) emerged in 2015 (?), needs PFC Ethernet
- “RoCE” doesn’t need lossless Ethernet: v5, described in 2017, not (yet) deployed at scale
But each bullet is only true for one version, and they are all different! The ambiguous language glosses over RoCE’s lack of a stable, well-specified version and adds to confusion about the protocol.
In all honesty, this new PFC-free version is actually a good thing. I hope that the RoCE (re-)inventors can get the word out and make v5 sit still. Maybe they can even write a fully specified, interoperable standard and give it a less oxymoronic name!
It is worth recognizing, though, that when the RoCE crowd cries U.N.C.L.E. on lossless Ethernet, they are staking a claim for pretty good performance on non-deterministic, best-effort infrastructure. It’ll work great much of the time, but now and then it won’t work as well. That’s good stuff for a number of non-mission-critical applications. Mission-critical enterprise storage is just not one of those applications.