VMware NSX

  • 1.  BGP timer issue I think

    Posted Mar 26, 2023 02:50 AM

    This is a lab setup where I am working on some different designs. I have two Cisco switches connected with a BGP peering, and two hypervisors, each with two uplinks for NSX, one to each switch. I have an Edge gateway on each hypervisor. I am using the Edge design where each uplink is pinned to a switch VLAN that is not spanned to the other switch. Everything works, and my T0 shows all peers are up. This is a Federated environment, so I am stretching a T0 and a T1, but this is not a Federation issue. There are some other Cisco switches for the other "site" as well.

    If I take one of the Edge VMs down, I only lose a few pings to my VMs. My BGP hold and keepalive timers are 4 and 1 seconds.

    Likewise, if I just unplug an uplink from a switch, I only lose a couple of pings. However, when I plug the uplink back into the switch, I lose pings for about 30 seconds, even to the second site, so my RTEPs are affected too. I cannot understand why. On each Cisco switch I have also set the BGP advertisement interval to 5 seconds.
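
    For illustration, the switch-side BGP configuration would look roughly like the following. The neighbor address and AS numbers are placeholders, not the actual lab values; the timers and advertisement interval match what is described above (keepalive 1 s, hold 4 s, advertisement interval 5 s).

        router bgp 65001
         ! placeholder neighbor address and remote AS for the T0 uplink peering
         neighbor 192.168.10.2 remote-as 65000
         ! keepalive 1 s, hold time 4 s
         neighbor 192.168.10.2 timers 1 4
         ! minimum interval between BGP route advertisements to this peer
         neighbor 192.168.10.2 advertisement-interval 5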

    This is probably not enough information, but I am just learning. I am hoping someone can point me to where I should look to understand this behavior when plugging an uplink back into a switch. Thank you.



  • 2.  RE: BGP timer issue I think

    Posted Mar 27, 2023 11:32 AM

    Hi,

    Enable BFD on the T0 and on the Cisco switches, then try again.
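
    BFD gives much faster failure detection than the BGP hold timer alone. On the Cisco side the configuration would look roughly like this (interface name, neighbor address, and timers are illustrative assumptions; on the NSX side BFD is enabled per BGP neighbor on the T0 gateway):

        interface TenGigabitEthernet1/0/1
         ! illustrative BFD timers: 300 ms tx/rx, declare down after 3 missed packets
         bfd interval 300 min_rx 300 multiplier 3
        !
        router bgp 65001
         ! placeholder neighbor address; tie the BGP session state to BFD
         neighbor 192.168.10.2 fall-over bfd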



  • 3.  RE: BGP timer issue I think

    Posted Mar 27, 2023 12:56 PM

    Thank you. I cannot remember if I had that set. I should also clarify that I found out after posting that the segment gateway is actually still pingable; it is just the VMs on the segment that are unreachable until after about 30 seconds. I am not sure why I never tested that while looking at this for several days.



  • 4.  RE: BGP timer issue I think

    Posted Aug 17, 2023 07:09 AM

    When one top-of-rack switch is taken down, everything fails over to the other physical NIC and we usually see one lost ping. When the switch comes back up, everything fails back to the recovered physical NIC, but this time we get a large amount of packet loss. Why? Because when the switch brings the link back up, ESXi starts failing back after 100 ms, but the switch is not yet ready to forward traffic. How long that takes varies by vendor and switch type. We can change the network teaming failback delay to avoid this problem; normally we change it to 30,000 or 40,000 ms.
     
    To modify the TeamPolicyUpDelay, in the vSphere Client go to each ESXi host, then Configure > Advanced System Settings > Edit. Change Net.TeamPolicyUpDelay to 30000 and test again to see whether it works better in your environment.
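
    The same setting can also be changed from the ESXi shell with esxcli. This is just a sketch, assuming the 30 second value suggested above; the value is in milliseconds:

        # set the teaming failback delay to 30 seconds (value in milliseconds)
        esxcli system settings advanced set -o /Net/TeamPolicyUpDelay -i 30000

        # verify the new value
        esxcli system settings advanced list -o /Net/TeamPolicyUpDelay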