VMware NSX

  • 1.  IPv6 Connections fail after changing NSX Routing (Global Interface) MTU; Benefits of setting Routing MTU to >1500?

    Posted Feb 17, 2025 07:47 AM
    Edited by left_right Feb 17, 2025 11:41 AM

    Hello,

    Recently we have begun to roll out dual-stack networks for our workload VMs. Currently, IPv6 addresses are provisioned via RA by the NSX gateways; the default ND profile is used.

    The current MTU settings are:

    physical fabric: 9216

    vSphere DVS: 9000

    NSX Tunnel Endpoint: 9000, RTEP not used
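
    To confirm that the underlay actually carries jumbo frames end to end, the TEP path can be checked from an ESXi host with vmkping against a remote TEP. This is just a quick sketch; the vmkernel interface name (vmk10) and the target are placeholders that depend on the environment:

    [root@esxi:~] vmkping ++netstack=vxlan -I vmk10 -d -s 8972 <remote_tep_ip>
    # -d sets "don't fragment"; -s 8972 leaves room for the 28 bytes of ICMP/IPv4 overhead within a 9000-byte MTU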

    Following a few recommendations I have set the Global Interface MTU to 1700 (+200 from the default). The MTU value on individual gateway interfaces is not set; I want to use the global configuration.

    Those settings work fine for IPv4; however, after running some tests with IPv6, issues regarding fragmentation occur.

    For the tests I have deployed two Ubuntu Server 24 systems (src_system and dst_system), each placed in a different subnet.

    After deployment, the IP configuration looks like this:

    root@src_system:~# ip a | grep mtu
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    2: ens192: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    
    root@src_system:~# ip -6 route
    <src_system_ip6_net>::/64 dev ens192 proto ra metric 1024 expires 2591994sec mtu 1700 hoplimit 64 pref medium
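
    To double-check what the gateway is actually announcing, the RA contents (including the MTU option) can be dumped on the guest, for example with rdisc6 from the ndisc6 package. This is not part of the test itself, just an optional verification step (assuming the package is installed):

    root@src_system:~# apt install ndisc6   # only if not already present
    root@src_system:~# rdisc6 ens192        # solicits an RA and prints its options, including the advertised MTU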
    

    For a simple connectivity test I have set up nginx on one of the systems (dst_system), listening on ports 80 and 443 (SSL).

    The connection works for IPv4 HTTP, IPv4 HTTPS, and IPv6 HTTP, but fails for IPv6 HTTPS:

    root@src_system:~# curl -k -v https://[<remote_ip6>]:443
    *   Trying [<remote_system_ip6>]:443...
    * Connected to <remote_system_ip6> (<remote_ip6>) port 443
    * ALPN: curl offers h2,http/1.1
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    (timeout here)

    tcpdump on the remote system shows the received packets:

    root@<dst_system>:~# tcpdump -v -i ens192 src <src_system_ip6>
    tcpdump: listening on ens192, link-type EN10MB (Ethernet), snapshot length 262144 bytes
    12:28:32.763934 IP6 (flowlabel 0x676c2, hlim 63, next-header TCP (6) payload length: 40) <src_system_ip6>.57112 > <dst_system>.https: Flags [S], cksum 0x9479 (correct), seq 3800441539, win 63960, options [mss 1640,sackOK,TS val 1120252730 ecr 0,nop,wscale 7], length 0
    12:28:32.764197 IP6 (flowlabel 0x676c2, hlim 63, next-header TCP (6) payload length: 32) <src_system_ip6>.57112 > <dst_system>.https: Flags [.], cksum 0x6dc9 (correct), ack 4042874102, win 500, options [nop,nop,TS val 1120252731 ecr 69467120], length 0
    12:28:32.766115 IP6 (flowlabel 0x676c2, hlim 63, next-header TCP (6) payload length: 549) <src_system_ip6>.57112 > <dst_system>.https: Flags [P.], cksum 0x0cc0 (correct), seq 0:517, ack 1, win 500, options [nop,nop,TS val 1120252733 ecr 69467120], length 517
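
    Since IPv6 routers never fragment packets in transit, an oversized reply should normally be answered with an ICMPv6 "Packet Too Big" message towards its sender. As an additional check (not part of the capture above), such messages can be watched for on either system; their absence would point to a local drop rather than a PMTUD problem along the path:

    root@src_system:~# tcpdump -i ens192 -n icmp6
    # look for "ICMP6, packet too big, mtu ..." lines in the output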

    The next step was to change the MTU on the announced route:

    root@<src_system>:~# ip -6 route replace <src_system_ip6_net>::/64 dev ens192 mtu 1500 hoplimit 64 pref medium
    root@<src_system>:~# ip -6 r
    <src_system_ip6_net>::/64 dev ens192 metric 1024 mtu 1500 hoplimit 64 pref medium
    fe80::/64 dev ens192 proto kernel metric 256 pref medium
    fe80::/64 dev ens224 proto kernel metric 256 pref medium
    default proto static metric 1024 pref medium
            nexthop via <src_system_ip6_net>::1 dev ens192 weight 1
            nexthop via fe80:<link_local> dev ens192 weight 1

    This change resulted in curl being able to establish an SSL connection to the destination server:

    root@<src_system>:~# curl -k -v https://[<dst_system_ip6>]:443
    *   Trying [<dst_system_ip6>]:443...
    * Connected to <dst_system_ip6> (<dst_system_ip6>) port 443
    * ALPN: curl offers h2,http/1.1
    * TLSv1.3 (OUT), TLS handshake, Client hello (1):
    * TLSv1.3 (IN), TLS handshake, Server hello (2):
    * TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
    * TLSv1.3 (IN), TLS handshake, Certificate (11):
    * TLSv1.3 (IN), TLS handshake, CERT verify (15):
    * TLSv1.3 (IN), TLS handshake, Finished (20):
    ...

    The setting is not permanent; after rebooting, the announced MTU changed back. This time I changed the MTU setting of the network adapter on the source system. By default the MTU is 1500; I changed it using netplan and rebooted the system:

    network:
      version: 2
      renderer: networkd
      ethernets:
        ens192:
          mtu: 1700
          dhcp4: no
          dhcp6: no
          addresses:
            - <src_system_ipv4>/24
            - <src_system_ipv6>/64
          routes:
            - to: default
              via: <src_system_ipv4_gw>
            - to: ::/0
              via: <src_system_ipv6_gw>

    Right after rebooting the curl command worked.
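
    As a side note, the change can also be activated and verified without a reboot; these are just the standard checks, using the same interface name as above:

    root@src_system:~# netplan apply
    root@src_system:~# ip link show ens192 | grep mtu    # the link MTU should now report 1700
    root@src_system:~# ip -6 route                       # the route MTU from the RA should now match the link MTU
    root@src_system:~# networkctl status ens192          # networkd's view of the link, including its MTU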

    The question is what the optimal setting would be, hence the title of this thread:

    1. Let the Gateway MTU sit at the default value of 1500, since most systems use a default adapter MTU of 1500?
    2. Change the adapter MTU of deployed systems to a higher value? This would be disruptive, though.
    3. If possible, do not provide a gateway MTU at all and let the hosts handle the MTU size? I do not think this is possible, though.

    I have read a great write-up by @Francois Tallet on all the MTU settings in NSX: https://community.broadcom.com/applications-networking-security/blogs/francois-tallet/2024/11/21/mtus-in-nsx/?CommunityKey=b76535ef-c5a2-474d-8270-3e83685f020e

    In the past, as recommended in that document, we set the Gateway MTU to 8800 in one of our environments; however, that was IPv4 only.

    In a paper on path MTU discovery, Cisco recommends using 1500 on IPv6 links: ip6-mtu-path-disc.pdf

    Could anyone expand on this?

    -----------------------------------------------------------------------------------

    Edit:

    I forgot to add the relevant MTU system settings. The system should accept the RA MTU:

    net.ipv6.conf.all.accept_ra_mtu = 1
    net.ipv6.conf.all.mtu = 1280
    net.ipv6.conf.default.accept_ra_mtu = 1
    net.ipv6.conf.default.mtu = 1280
    net.ipv6.conf.ens192.accept_ra_mtu = 1
    net.ipv6.conf.ens192.mtu = 1500
    net.ipv6.conf.ens224.accept_ra_mtu = 1
    net.ipv6.conf.ens224.mtu = 1500
    net.ipv6.conf.lo.accept_ra_mtu = 1
    net.ipv6.conf.lo.mtu = 65536
    net.ipv6.route.mtu_expires = 600
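
    Note that with the networkd renderer these kernel sysctls may not tell the whole story: when systemd-networkd manages the interface and processes router advertisements itself, the kernel's own RA handling is typically disabled (accept_ra = 0), and the accept_ra_mtu values above never come into play. A quick way to check which component is consuming the RAs (just a sketch):

    root@src_system:~# sysctl net.ipv6.conf.ens192.accept_ra   # 0 usually means networkd handles RAs in userspace
    root@src_system:~# networkctl status ens192                # shows whether networkd manages the link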

    I did some more troubleshooting. In tests with ping6, it seems that the problem lies with the destination server not being able to reply with fragmented packets, so that the reply does not even get sent out.

    Configuration:

    • src_server & dst_server: adapter MTU 1500, RA (SLAAC) MTU 1700. Here it seems that the MTU sent by the router is ignored; the adapters remain at 1500.
    • gateway: MTU 1700 set on L3 interface
    root@<source_server>:~# ping6 -c 3 -s 1500 <dst_server_ip6>
    PING <dst_server_ip6> (<dst_server_ip6>) 1500 data bytes
    
    --- <dst_server_ip6> ping statistics ---
    3 packets transmitted, 0 received, 100% packet loss, time 2084ms

    You can see on the destination server that the packets are fragmented and transported to the destination, but the reply is not fragmented, is bigger (1508 bytes) than the adapter MTU of 1500, and as a result is not sent out:

    root@<dst_server>:~# tcpdump -i ens192 -n host <src_server_ip6>
    tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
    listening on ens192, link-type EN10MB (Ethernet), snapshot length 262144 bytes
    16:44:22.003989 IP6 <src_server_ip6> > <dst_server_ip6>: frag (0|1448) ICMP6, echo request, id 1459, seq 1, length 1448
    16:44:22.004026 IP6 <src_server_ip6> > <dst_server_ip6>: frag (1448|60)
    16:44:22.004065 IP6 <dst_server_ip6> > <src_server_ip6>: ICMP6, echo reply, id 1459, seq 1, length 1508 << does not reach source server
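
    To gather evidence that the reply is dropped locally on the destination server rather than somewhere in the fabric, the interface and IPv6 protocol counters can be watched while the test runs. This is a rough check only; which counter, if any, increments depends on where exactly the stack discards the oversized packet:

    root@<dst_server>:~# ip -s link show ens192                      # TX errors/drops on the adapter
    root@<dst_server>:~# grep -E 'Ip6Out|Ip6Frag' /proc/net/snmp6    # IPv6 output and fragmentation counters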

    I suspect that on the source server the MTU advertised by the gateway is ignored, so the server sends out fragmented packets. But on the destination server the advertised MTU is applied somehow, so the destination tries to respond with a 1508-byte reply and gets stuck on the 1500 adapter MTU.

    Similar to the workaround above, I set the adapter MTU to be equal to the router interface MTU, this time applying it directly with

     ip link set mtu 1700 dev ens192

    so that the adapter MTU matches the router MTU.

    In this configuration, the echo requests as well as the echo replies get properly fragmented, sent, and returned:

    root@<src_server>:~# ping6 -c 3 -s 1900 <dst_server_ip6>
    PING <dst_server_ip6> (<dst_server_ip6>) 1900 data bytes
    1908 bytes from <dst_server_ip6>: icmp_seq=1 ttl=63 time=1.12 ms
    1908 bytes from <dst_server_ip6>: icmp_seq=2 ttl=63 time=0.592 ms
    1908 bytes from <dst_server_ip6>: icmp_seq=3 ttl=63 time=0.646 ms
    
    --- <dst_server_ip6> ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2004ms
    rtt min/avg/max/mdev = 0.592/0.787/1.123/0.238 ms
    
    
    TCP dump on destination server:
    16:54:24.275703 IP6 <src_server_ip6> > <dst_server_ip6>: frag (0|1648) ICMP6, echo request, id 1475, seq 1, length 1648
    16:54:24.275723 IP6 <src_server_ip6> > <dst_server_ip6>: frag (1648|260)
    16:54:24.275756 IP6 <dst_server_ip6> > <src_server_ip6>: frag (0|1648) ICMP6, echo reply, id 1475, seq 1, length 1648
    16:54:24.275794 IP6 <dst_server_ip6> > <src_server_ip6>: frag (1648|260)
    

    The workaround is feasible for newly deployed systems.

    However, if IPv6 addresses are provisioned via SLAAC in NSX with an MTU that does not match the "default" MTU on workload servers, users will not be able to use IPv6 addressing without changing their systems' configuration.

    -----------------------------------------

    Edit 2:

    In another workaround I set the adapters on both systems to static IPv6 addressing. To deactivate RA I used the "accept-ra: no" flag in netplan (see the sketch below). After applying the netplan config and rebooting, the systems came up with the static addresses and could reach each other; fragmentation worked fine on both sides.
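
    A minimal sketch of what the netplan fragment could look like for this, essentially the earlier configuration with accept-ra: no added (placeholders as above):

    network:
      version: 2
      renderer: networkd
      ethernets:
        ens192:
          accept-ra: no
          dhcp4: no
          dhcp6: no
          addresses:
            - <src_system_ipv4>/24
            - <src_system_ipv6>/64
          routes:
            - to: default
              via: <src_system_ipv4_gw>
            - to: ::/0
              via: <src_system_ipv6_gw>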

    I did an additional test to see if the router link MTU gets discovered correctly. For this, I set the MTU on the source server to 9000 and left the MTU on the router at 1700. First I sent echo requests with a 1900-byte payload and fragmentation allowed, then the same size with fragmentation disallowed (-M probe):

    root@Master-Project-41579:~# ping6 -c 3 -s 1900 <dst_server_ip6>
    PING <dst_server_ip6> (<dst_server_ip6>) 1900 data bytes
    1908 bytes from <dst_server_ip6>: icmp_seq=1 ttl=63 time=0.955 ms
    1908 bytes from <dst_server_ip6>: icmp_seq=2 ttl=63 time=0.508 ms
    1908 bytes from <dst_server_ip6>: icmp_seq=3 ttl=63 time=0.642 ms
    
    --- <dst_server_ip6> ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 2053ms
    rtt min/avg/max/mdev = 0.508/0.701/0.955/0.187 ms
    root@Master-Project-41579:~# ping6 -c 3 -s 1900 -M probe <dst_server_ip6>
    PING <dst_server_ip6> (<dst_server_ip6>) 1900 data bytes
    From fe80::50:56ff:fe56:4452%ens192 icmp_seq=1 Packet too big: mtu=1700
    From fe80::50:56ff:fe56:4452%ens192 icmp_seq=2 Packet too big: mtu=1700
    From fe80::50:56ff:fe56:4452%ens192 icmp_seq=3 Packet too big: mtu=1700
    
    --- <dst_server_ip6> ping statistics ---
    3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2030ms
    

    The question remains: why, when autoconfiguration is on, is the MTU setting not applied to the adapter?

    Switching to static IP addressing would mean more IPAM management overhead, but if it works, I guess that is better than the headaches caused by SLAAC.



  • 2.  RE: IPv6 Connections fail after changing NSX Routing (Global Interface) MTU; Benefits of setting Routing MTU to >1500?

    Posted Feb 18, 2025 08:23 AM

    Hi,

    It looks like you're hitting this networkd issue:

    https://github.com/systemd/systemd/issues/33160

    Hope that helps.




  • 3.  RE: IPv6 Connections fail after changing NSX Routing (Global Interface) MTU; Benefits of setting Routing MTU to >1500?

    Posted Feb 25, 2025 09:57 AM

    Thank you for the suggestion, but would the linked issue affect how the MTU is applied on the interface based on the router advertisement? The issue says "networkd does not include the MTU of the interface in emitted RAs" - in my case the affected system does not send RAs; the router does.

    The systemd version in use was systemd 255 (255.4-1ubuntu8.4).
    I did not upgrade systemd directly; instead I applied a release upgrade to 24.10, which comes with systemd version 256.5-2ubuntu3.1.
    The router advertisement comes with a correct AdvLinkMTU of 9000.

    The issue remains, as in the original post: I can send an ICMPv6 echo (ping6 with -s 9000) from this system to another on a different subnet. The remote system responds with a packet size based on the announced MTU of 9000:
    15:51:25.328998 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:0|1448) ICMP6, echo request, id 1781, seq 1
    15:51:25.329018 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:1448|1448)
    15:51:25.329020 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:2896|1448)
    15:51:25.329023 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:4344|1448)
    15:51:25.329026 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:5792|1448)
    15:51:25.329029 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 1456) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:7240|1448)
    15:51:25.329033 IP6 (flowlabel 0x87f98, hlim 2, next-header Fragment (44) payload length: 328) <ubuntu_system> > <remote_ubuntu_system>: frag (0x4c667267:8688|320)
    15:51:25.329089 IP6 (flowlabel 0xde8e2, hlim 3, next-header Fragment (44) payload length: 8960) <remote_ubuntu_system> > <ubuntu_system>: frag (0x397c6877:0|8952) ICMP6, echo reply, id 1781, seq 1
    15:51:25.329120 IP6 (flowlabel 0xde8e2, hlim 3, next-header Fragment (44) payload length: 64) <remote_ubuntu_system> > <ubuntu_system>: frag (0x397c6877:8952|56)

    The source system properly handles fragmentation; the remote system uses a packet size equal to the MTU received with the RA.

    I also wanted to recreate the issue on Windows Server 2019; however, it does not seem to be affected.



  • 4.  RE: IPv6 Connections fail after changing NSX Routing (Global Interface) MTU; Benefits of setting Routing MTU to >1500?

    Posted Feb 26, 2025 07:24 AM

    If you scroll down the systemd-networkd issue mentioned, #33160, there is a pull request, https://github.com/systemd/systemd/pull/33360, that would close the issue,

    but between July 2024 and August 2024 it went from being tagged good-to-merge to needs-rework,

    so it is not merged yet, not readily available as a component upgrade, and, needless to say, not yet available in a new distribution.
    To be honest, I haven't looked much into the issue, but if it really is your case, you will want to keep track of this PR's status.