VMware NSX


  • 1.  NSX Upgrade from 4.2.1.3 - NSX Managers become unresponsive

    Posted Mar 29, 2026 05:10 AM
    Edited by Mark Koh Mar 30, 2026 04:55 AM

    Hi all,

    We were upgrading NSX from 4.2.1.3 to 4.2.3.3.

    1. 18 clusters of Edge Nodes were upgraded successfully.

    2. During the upgrade of the NSX component on the Transport Nodes, the TN upgrade was successful until the NSX Managers were moved to already upgraded hosts (by DRS, because another host entered maintenance mode). The NSX Managers lost network connectivity. The NSX upgrade was paused automatically when this happened, because all managers were unresponsive. Even after a vMotion to hosts that had not been upgraded yet, the same issue persisted. Broadcom support checked netstat on the hosts, and the NSX Manager was showing MAC address 00:00:00:00:00. (See attached screenshot.)

    The NSX Managers are connected to a non-NSX vDS and a VLAN segment.

    We migrated the NSX Managers to a vSwitch, but the issue still persisted. After an NSX Manager was rebooted, the connection worked again. The NSX Managers were then migrated back to the vDS and the issue was gone.

    However, we are now in the middle of the upgrade (2 Transport Nodes and the 3-node NSX Manager cluster remain to be upgraded). We are waiting for support to analyze the logs.

    While support was collecting logs on our ESXi hosts, the collection triggered a PSOD. After rebooting ESXi, everything returned to an operational state.

    Has any of you run into a similar issue, and could it be connected to the JDK bug affecting the current version of our NSX Managers (4.2.1.3)?

    Best regards 



    -------------------------------------------



  • 2.  RE: NSX Upgrade from 4.2.1.3 - NSX Managers become unresponsive

    Posted Mar 31, 2026 03:03 AM

    Hi Matic,

    About six months ago, we upgraded NSX from version 4.2.0.2 to 4.2.3 without any issues. However, we followed the KB recommendations and restarted all NSX Manager nodes before starting the upgrade procedure and at each stage of the upgrade.

    Additionally, as a precaution, we also rebooted all ESXi hosts before upgrading the VIBs.

    I hope you are able to identify the root cause and share your findings here.

    We are currently in the middle of an upgrade from 4.2.3 to 4.2.3.3. So far, everything is going smoothly with the Edge cluster upgrades.

    Also, please note my recent post:
    https://community.broadcom.com/discussion/upgrading-to-nsx-4233-any-gotchas-with-the-newly-added-known-issue-3626240-for-edge-teps

    I wouldn't be surprised if this turns out to be more than just a documentation issue.

    Best regards,
    Serhii

    -------------------------------------------



  • 3.  RE: NSX Upgrade from 4.2.1.3 - NSX Managers become unresponsive

    Posted Apr 14, 2026 01:45 AM

    Hi all,

    After this happened, support checked the logs of the NSX Managers, Edge Nodes and ESXi hosts and found 2 issues that are already documented in the Broadcom KB. Below is the response from Broadcom support. Our NSX Managers did not get any memory region once migrated to another host because of a memory exhaustion issue. As a result, the NSX Managers could not bring the MGMT interface into the UP state within the OS.

    Total memory regions used by all vNICs is 8516108800 bytes, which is around 8 GB. But it's actually more in this case (3 vNICs per VM), and multiple vNICs of an Edge VM requesting the same memory region is leading to XMAP exhaustion.
    The XMAP space, which is 32 GB by default, is nearing exhaustion because Edge memory regions take more memory when the memory regions from multiple vNICs overlap, due to a known issue (fixed in 9.1).

    Considering the above facts, below workaround needs to be implemented in the environment on all hosts that will be hosting Edge VMs, irrespective of NSX version:

    1. Increase the default XMAP space to 64GB.
      This requires a host reboot after applying the configuration. You can implement step 1 of the KB below:
      https://knowledge.broadcom.com/external/article/398914/power-on-vm-fails-with-the-errror-failed.html
    2. Increase P2M Buffer from 5 to 32 at host level:
      This is to avoid any other P2M-related issues during vMotion.
      When multiple VMs are migrated together to a host where this is not set to the maximum buffer limit, issues such as the one in the KB below can occur; they are caused by a shortage of P2M buffer slots, which can result in a map failure: https://knowledge.broadcom.com/external/article?legacyId=76387

    Kind regards,

    Matic 

    -------------------------------------------



  • 4.  RE: NSX Upgrade from 4.2.1.3 - NSX Managers become unresponsive

    Posted Apr 14, 2026 05:20 AM

    Hi Matic,

    Thank you for sharing your experience and the response from Broadcom support.

    Could you please share more details about your environment, such as the number of VMs per host, the presence of "monster" VMs, the number of VMs with more than 2–3 vNICs, the number and size of your Edge nodes, and whether you are using Enhanced Data Path (EDP) (and in which mode)?

    This would help us better understand the issue and assess whether it could potentially occur in our infrastructure. Based on the KB, it is unclear in which specific environments this issue can be reproduced.

    Thank you.

    -------------------------------------------



  • 5.  RE: NSX Upgrade from 4.2.1.3 - NSX Managers become unresponsive

    Posted Apr 15, 2026 06:24 AM

    Hi Serhii,

    Happy to share more details to help the community avoid this "perfect storm."

    Our environment is quite dense, which I believe was the main trigger. We have around 20 Edge Clusters (approx. 40 Edge Nodes total), and in our design, we can have a maximum of 30 Edge Nodes per ESXi host during maintenance or failover scenarios.

    Here are the specifics that led to the XMAP exhaustion:

    • Edge Node Size: Large

    • vNIC Configuration: Each Edge Node has 2 vNICs (Management is on a separate vSwitch, so only 2 vNICs are requesting XMAP on the vDS)

    • The "Math" behind the crash: Support clarified that with a Ring Size of 2048, each vNIC requests a memory region of exactly 425,805,440 bytes (~400 MB) in the XMAP table.

      • With 2 vNICs per Edge, that's 800 MB per Node.

      • With 30 Nodes on a single host, the requirement is ~24 GB just for the Edges.

      • Add the NSX Managers, other infrastructure VMs, and the kernel bug (where regions don't overlap correctly), and you quickly exceed the default 32 GB XMAP limit.

    • The Bug: A known issue in the kernel prevents these memory regions from overlapping/sharing correctly. During the upgrade, as vNICs are detached and re-attached, "zombie" entries or lack of overlapping caused the XMAP table to hit the limit.

    • Management Impact: This is why our NSX Managers lost connectivity. When they migrated to a host that was already "XMAP-starved" by the Edge Nodes, the kernel couldn't allocate the required memory region for the Manager's vNIC, resulting in the MAC 00:00:00:00:00 state.
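
For anyone who wants to run the same numbers for their own Edge density, the arithmetic above can be sketched in a few lines of Python. This is only a rough estimate and not a Broadcom tool: the per-vNIC region size is the figure support quoted for a ring size of 2048, the function names are made up for illustration, and treating "32 GB" and "64 GB" as GiB is an assumption. It also ignores the overlap bug and every non-Edge VM, so real usage will be higher.

```python
# Rough XMAP estimate for Edge VMs on one ESXi host.
# Assumption: "GB" in the thread is treated as GiB (2**30 bytes).
GIB = 1024 ** 3

# Per-vNIC memory region at ring size 2048, as quoted by support in this thread.
REGION_BYTES_PER_VNIC = 425_805_440

DEFAULT_XMAP_BYTES = 32 * GIB  # default XMAP space
RAISED_XMAP_BYTES = 64 * GIB   # value recommended in the workaround


def edge_xmap_bytes(edge_nodes, vnics_per_edge=2,
                    region_bytes=REGION_BYTES_PER_VNIC):
    """Bytes of XMAP requested by the Edge vNICs alone (hypothetical helper)."""
    return edge_nodes * vnics_per_edge * region_bytes


def xmap_share(used_bytes, xmap_bytes):
    """Fraction of the XMAP space consumed by used_bytes."""
    return used_bytes / xmap_bytes


# Worst case from the post: 30 Edge Nodes with 2 vNICs each on one host.
worst_case = edge_xmap_bytes(30)
print(f"Edges alone: {worst_case / GIB:.1f} GiB")
print(f"Share of 32 GiB: {xmap_share(worst_case, DEFAULT_XMAP_BYTES):.0%}")
print(f"Share of 64 GiB: {xmap_share(worst_case, RAISED_XMAP_BYTES):.0%}")
```

As a cross-check, the same formula with 10 Edge Nodes (20 vNICs) gives exactly the 8516108800 bytes from the support response, so the two figures in this thread are consistent with each other.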

    Current Action Plan: We are implementing the workaround (64 GB XMAP and 32 P2M buffers) exclusively on the hosts that are designated to host Edge Nodes. We don't see a need to apply this to the general compute clusters at this time.

    If you have high-density Edge placements or use the 2048 Ring Size, I would strongly recommend checking your XMAP usage before triggering the upgrade.

    Best regards, Matic

    -------------------------------------------