ESXi-Arm Fling


CPU hang with Fling 2.1

  • 1.  CPU hang with Fling 2.1

    Posted Jan 06, 2025 09:28 AM

    I upgraded in place from the last 1.0 Fling to the latest 2.1.

    We're getting a lot of CPU hangs and stack traces on Debian 12 / testing.

    Is there a known issue?

    3 x RPI4 8GB, 07.12.24 EEPROM, RPI 1.38 EFI

    [ 5736.409823] watchdog: Watchdog detected hard LOCKUP on cpu 2
    [ 5736.410110] Modules linked in: veth nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables libcrc32c overlay nls_ascii nls_cp437 crct10dif_ce vfat vmwgfx fat drm_ttm_helper ttm drm_kms_helper sg drm efi_pstore configfs nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sr_mod sd_mod cdrom ahci libahci libata scsi_mod scsi_common vmxnet3
    [ 5736.410240] Sending NMI from CPU 1 to CPUs 2:
    [ 5736.410292] NMI backtrace for cpu 2
    [ 5736.410337] CPU: 2 UID: 996 PID: 5205 Comm: postgres Not tainted 6.12.6-arm64 #1  Debian 6.12.6-1
    [ 5736.410346] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [ 5736.410350] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 5736.410359] pc : __rmqueue_pcplist+0x58c/0xd70
    [ 5736.410405] lr : __rmqueue_pcplist+0x548/0xd70
    [ 5736.410411] sp : ffff8000832e35f0
    [ 5736.410414] x29: ffff8000832e36a0 x28: 000000000000003f x27: ffff00013f587f30
    [ 5736.410422] x26: fffffdffc4822980 x25: 0000000000000000 x24: ffff00013f603640
    [ 5736.410430] x23: ffff00013f587f00 x22: 0000000000000001 x21: fffffdffc2e78788
    [ 5736.410437] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
    [ 5736.410458] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [ 5736.410465] x14: 0000000000000100 x13: 0000000000000000 x12: 0000000000000000
    [ 5736.410472] x11: 0000000000000040 x10: 000000000000003f x9 : ffff8000803a8ea8
    [ 5736.410479] x8 : 00000000ffffffff x7 : ffff00013f587f30 x6 : ffff8000832e35f0
    [ 5736.410486] x5 : ffff00013f603640 x4 : ffff00013f587f30 x3 : 000000000000001a
    [ 5736.410492] x2 : fffffdffc2e78788 x1 : ffff00013f587f30 x0 : ffff00013f603bc0
    [ 5736.410500] Call trace:
    [ 5736.410503]  __rmqueue_pcplist+0x58c/0xd70
    [ 5736.410522]  get_page_from_freelist+0x6b0/0x1b30
    [ 5736.410526]  __alloc_pages_noprof+0x170/0xf20
    [ 5736.410529]  alloc_pages_mpol_noprof+0x98/0x208
    [ 5736.410547]  alloc_pages_noprof+0x50/0xd0
    [ 5736.410551]  folio_alloc_noprof+0x1c/0x70
    [ 5736.410556]  filemap_alloc_folio_noprof+0x144/0x160
    [ 5736.410567]  __filemap_get_folio+0x21c/0x3f0
    [ 5736.410572]  ext4_da_write_begin+0x118/0x2a8 [ext4]
    [ 5736.410632]  generic_perform_write+0xd8/0x268
    [ 5736.410636]  ext4_buffered_write_iter+0x74/0x140 [ext4]
    [ 5736.410657]  ext4_file_write_iter+0x70/0x8c0 [ext4]
    [ 5736.410676]  vfs_write+0x24c/0x3b8
    [ 5736.410689]  __arm64_sys_pwrite64+0xb4/0xf0
    [ 5736.410693]  invoke_syscall+0x6c/0x100
    [ 5736.410716]  el0_svc_common.constprop.0+0x48/0xf0
    [ 5736.410722]  do_el0_svc+0x24/0x38
    [ 5736.410727]  el0_svc+0x38/0x120
    [ 5736.410752]  el0t_64_sync_handler+0x120/0x130
    [ 5736.410758]  el0t_64_sync+0x190/0x198
     
    getconf PAGE_SIZE
    4096
     
    6.12.6-arm64 #1 SMP Debian 6.12.6-1 (2024-12-21) aarch64 GNU/Linux
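
For anyone collecting similar logs, the watchdog lines can be pulled out of a saved dmesg dump with a small script. This is only a sketch: the function name `find_lockups` is made up for illustration, and the patterns cover just the two message shapes quoted in this thread (hard LOCKUP and soft lockup), so other kernel versions may word them differently.

```python
import re

# Optional "[ 5736.409823]" style timestamp in front of the message.
TS = r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?"

# "watchdog: Watchdog detected hard LOCKUP on cpu 2"
HARD = re.compile(TS + r"watchdog: Watchdog detected hard LOCKUP on cpu (?P<cpu>\d+)")

# "watchdog: BUG: soft lockup - CPU#0 stuck for 38s! [kcompactd0:32]"
SOFT = re.compile(
    TS + r"watchdog: BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>.+?):(?P<pid>\d+)\]"
)


def find_lockups(log_text):
    """Return one dict per hard or soft lockup line found in log_text."""
    events = []
    for line in log_text.splitlines():
        hard = HARD.search(line)
        soft = SOFT.search(line)
        m = hard or soft
        if not m:
            continue
        event = {
            "kind": "hard" if hard else "soft",
            "time": float(m["ts"]) if m["ts"] else None,
            "cpu": int(m["cpu"]),
        }
        if soft:
            event["stuck_s"] = int(m["secs"])
            event["task"] = m["task"]
            event["pid"] = int(m["pid"])
        events.append(event)
    return events
```

Feeding it the output of `dmesg` from each affected guest gives a quick per-CPU, per-task list of stalls that can be compared across VMs.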


  • 2.  RE: CPU hang with Fling 2.1

    Posted Jan 07, 2025 08:48 AM

    Same on Ubuntu with kernel 5.15.x.

    [30392.792642] watchdog: BUG: soft lockup - CPU#0 stuck for 38s! [kcompactd0:32]
    [30392.797306] Modules linked in: tls xt_nat xt_tcpudp nf_conntrack_netlink veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink overlay vmw_vsock_vmci_transport vsock binfmt_misc nls_iso8859_1 joydev input_leds vmw_vmci sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid vmwgfx ttm crct10dif_ce drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm xhci_pci ahci xhci_pci_renesas vmxnet3 aes_neon_bs aes_neon_blk crypto_simd cryptd
    [30392.797619] CPU: 0 PID: 32 Comm: kcompactd0 Not tainted 5.15.0-130-generic #140-Ubuntu
    [30392.797629] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [30392.797633] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [30392.797643] pc : isolate_freepages_block+0x3ac/0x4b0
    [30392.797688] lr : isolate_freepages_block+0x328/0x4b0
    [30392.797692] sp : ffff80000afab9f0
    [30392.797694] x29: ffff80000afab9f0 x28: 0000000000000800 x27: ffff80000afabd38
    [30392.797701] x26: 0000000000095800 x25: 0000000000000001 x24: 0000000000000000
    [30392.797708] x23: 0000000000000001 x22: ffff80000afabb30 x21: 0000000000000006
    [30392.797714] x20: fffffc000055a800 x19: 00000000000956a1 x18: 0000000000000000
    [30392.797720] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [30392.797725] x14: ffff80000a96be60 x13: ffff80000a96b948 x12: ffff00007fbfdf80
    [30392.797731] x11: ffff00007fbfdf80 x10: 0000000000000001 x9 : 00000000f0000080
    [30392.797737] x8 : ffff80000aa95cf8 x7 : 0000000000000020 x6 : ffff0000001fa080
    [30392.797743] x5 : ffff80000aa95a70 x4 : 0000000000000001 x3 : ffff80000afabd38
    [30392.797749] x2 : 0000000000000000 x1 : ffff00007fbfe5d0 x0 : 0000000000000000
    [30392.797755] Call trace:
    [30392.797759]  isolate_freepages_block+0x3ac/0x4b0
    [30392.797764]  isolate_freepages+0x1c4/0x360
    [30392.797767]  compaction_alloc+0x74/0x90
    [30392.797771]  unmap_and_move+0x6c/0x3fc
    [30392.797779]  migrate_pages+0x364/0x61c
    [30392.797783]  compact_zone+0x2b8/0x684
    [30392.797787]  proactive_compact_node+0x90/0xdc
    [30392.797791]  kcompactd+0x208/0x4d4
    [30392.797794]  kthread+0x110/0x114
    [30392.797799]  ret_from_fork+0x10/0x20



  • 3.  RE: CPU hang with Fling 2.1

    Broadcom Employee
    Posted Jan 08, 2025 10:51 AM

    Hi Xeroxxx,

    I will have to try to reproduce it to get a better idea of what's going on. How many vCPUs and how much memory does your VM have?

    I suppose the Linux distribution doesn't matter. It was OK with Fling v1, and it is happening on both Debian and Ubuntu.

    Cyprien




  • 4.  RE: CPU hang with Fling 2.1

    Posted Jan 08, 2025 04:29 PM

    Hello Cyprien,

    Thanks for moving this from X to here. I sent you additional logs on Pastebin showing errors that are not filesystem-related.

    Currently 3 x RPI4 hosts at 2000 MHz, EEPROM 07.12.2024, EFI 1.38. Advanced settings: Mem.ShareForceSalting 0, Mem.MemZipMaxPct 20.

    There are VMs with 1 vCPU / 384 MB, 2 vCPU / 1024 MB, 2 vCPU / 2048 MB, and one with 4 vCPU / 6144 MB on one host running GitLab in Docker. The last one runs for about 10 minutes and then freezes; it keeps running with 5 GB, though.

    Mainly Debian 12 testing, one VM with USB passthrough of a card reader, and some Ubuntu and Arch.

    All VMs are stored on a shared iSCSI LUN backed by an SSD (queue depth 128). Local storage is used only for host cache and local swap.

    Thank you
    Xeroxxx



  • 5.  RE: CPU hang with Fling 2.1

    Posted Jan 14, 2025 09:08 AM
    Edited by Broadcom Platform Admin Jan 16, 2025 08:47 AM

    I might have found the issue.

    It seems to be related to Mem.ShareForceSalting = 0. Setting it back to 2 on one host solves the issue.

    I used it to save precious memory with a lot of small machines.

    Can you replicate the issue by setting it to 0?

    EDIT: Never mind, it still happens.

    [102941.193995] CPU: 0 UID: 0 PID: 8465 Comm: kworker/0:0 Tainted: G             L     6.12.6-arm64 #1  Debian 6.12.6-1
    [102941.194004] Tainted: [L]=SOFTLOCKUP
    [102941.194006] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [102941.194012] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
    [102941.194036] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [102941.194040] pc : __memcpy+0x128/0x240
    [102941.194053] lr : vmw_diff_memcpy+0x348/0x670 [vmwgfx]
    
    [100163.653874] CPU: 1 PID: 227739 Comm: ib_tpool_worker Not tainted 5.15.0-130-generic #140-Ubuntu
    [100163.653883] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [100163.653886] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [100163.653910] pc : arch_local_irq_enable+0xc/0x2c
    [100163.653956] lr : copy_process+0xb3c/0x12b0

    Cheers

    Xeroxxx




  • 6.  RE: CPU hang with Fling 2.1

    Broadcom Employee
    Posted Jun 16, 2025 05:19 PM
    Edited by Cyprien Laplace 10 days ago

    Hi Xeroxxx, can you try adding monitor_control.disable_mmu_largepages = "TRUE" in your .vmx (or /etc/vmware/config)?

    EDIT: fixed the global config file path.
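
For scripting this change across several VMs, here is a minimal sketch of adding or updating one option in the text of a .vmx file. It assumes the usual one `key = "value"` pair per line layout and matches the key exactly as written. The helper name `set_vmx_option` and the sample keys in the usage note are illustrative only, and the VM should be powered off before its .vmx is rewritten.

```python
def set_vmx_option(vmx_text, key, value):
    """Return vmx_text with `key = "value"` set exactly once.

    An existing line for the key is replaced in place, extra duplicates
    are dropped, and the option is appended if it was not present.
    """
    new_line = f'{key} = "{value}"'
    out = []
    replaced = False
    for line in vmx_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            if not replaced:
                out.append(new_line)
                replaced = True
            continue  # skip any duplicate definitions of the same key
        out.append(line)
    if not replaced:
        out.append(new_line)
    return "\n".join(out) + "\n"
```

Usage, on a hypothetical two-line file: `set_vmx_option('displayName = "debian12"\nnumvcpus = "2"\n', "monitor_control.disable_mmu_largepages", "TRUE")` returns the same text with the new option appended as the last line. Running it a second time leaves the text unchanged.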




  • 7.  RE: CPU hang with Fling 2.1

    Posted Jun 17, 2025 05:01 PM

    Hello Cyprien,

    I set it in the VMX file of the virtual machine while it was stopped. It did not solve the problem.

    watchdog: BUG: soft lockup - CPU#2 stuck for 26s! [containerd-shim:2848]




  • 8.  RE: CPU hang with Fling 2.1

    Posted Jun 18, 2025 01:49 PM

    I was able to work around the issue by reserving CPU resources.




  • 9.  RE: CPU hang with Fling 2.1

    Broadcom Employee
    Posted 11 days ago
    Edited by Cyprien Laplace 11 days ago

    Hi all, I still think those soft lockups are related to memory allocations. To better understand the state of the VMs, can you SSH into the ESXi shell and run the following command:

    (for i in $(vsish -e ls /vm) ; do echo; echo VM $i:; vsish -e cat /memory/lpage/vmLPage/$i ; done; for i in $(vsish -e ls /memory/buddy/) ; do echo buddy $i; vsish -e cat /memory/buddy/$i ; done) > output.txt

    It will dump the memory allocation state and allocation statistics for each running VM into the output.txt file. Please attach the output here (or send it to me at <first>.<last>@broadcom.com).
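
To compare hosts, the resulting output.txt can be split back into one chunk per VM and per buddy entry. The sketch below relies only on the header lines that the command itself echoes (`VM <id>:` and `buddy <name>`). The body of each section is kept as raw lines because the vsish field layout is not shown in this thread, and the optional trailing slash on the names is an assumption about how `vsish -e ls` prints entries. The function name `split_sections` is illustrative.

```python
import re

# Header lines written by the echo calls in the one-liner above.
# The trailing "/" is optional because its presence is an assumption.
HEADER = re.compile(r"^(?:VM (?P<vm>\S+?)/?:|buddy (?P<buddy>\S+?)/?)$")


def split_sections(text):
    """Map ("vm", id) or ("buddy", name) to the raw lines of that section."""
    sections = {}
    current = None
    for line in text.splitlines():
        m = HEADER.match(line.strip())
        if m:
            current = ("vm", m["vm"]) if m["vm"] else ("buddy", m["buddy"])
            sections[current] = []
        elif current is not None and line.strip():
            sections[current].append(line.rstrip())
    return sections
```

A body line that happens to look like one of the headers would be misread as a new section, so treat the result as a convenience for eyeballing, not as a validated parser.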

    Note: none of the Arm guests have a balloon driver yet, so each VM will at some point try to use all of its memory. If the system is overcommitted, some VMs can stall while waiting for memory to become available (through various mechanisms).
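
A quick way to check whether a host is overcommitted in this sense is to add up the configured memory of the VMs placed on it and compare that with the host's RAM. The sketch below is plain arithmetic. The VM sizes are the ones mentioned earlier in the thread, but putting all four on a single 8 GB Pi is a hypothetical placement, and the calculation ignores whatever memory ESXi itself needs.

```python
def overcommit_ratio(vm_memory_mb, host_memory_mb):
    """Configured guest memory divided by physical host memory."""
    return sum(vm_memory_mb) / host_memory_mb


def is_overcommitted(vm_memory_mb, host_memory_mb):
    """True when the guests could, in total, ask for more than the host has."""
    return overcommit_ratio(vm_memory_mb, host_memory_mb) > 1.0


# Hypothetical placement: all four VM sizes from this thread on one 8 GB host.
vms_mb = [384, 1024, 2048, 6144]   # 9600 MB configured in total
host_mb = 8192                     # 8 GB Raspberry Pi 4

ratio = overcommit_ratio(vms_mb, host_mb)   # 9600 / 8192 = 1.171875
```

A ratio above 1.0 means that once every guest has touched all of its memory, the host has to find the difference somewhere, which is the situation described in the note above.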

    Cheers,
    Cyprien




  • 10.  RE: CPU hang with Fling 2.1

    Posted Jun 18, 2025 01:52 PM

    Apologies, not trying to hijack this post, and I can create a new post if needed, but I think I have the same issue.

    1x RPi 4 running ESX 7 (Build 18175197) - Been rock solid and stable for over a year

    1x RPi 4 running ESX 8 (Build 24449057).  All VMs lock-up at some point (OpenBSD, RHEL 10 & Debian)

    I also applied the MMU configuration change to each VM and rebooted the ESXi host. Still locking up. I attached two screenshots.




  • 11.  RE: CPU hang with Fling 2.1

    Community Manager
    Posted Jun 18, 2025 01:55 PM

    Your uploaded attachments did make it, but not in this view. We are working with the vendor to deploy the patch. You can see all attachments on the Threads tab here: https://community.broadcom.com/vmware-cloud-foundation/communities/community-home/digestviewer?communitykey=b75c6afd-0c0c-4f39-89a4-018ed3a892d3



    ------------------------------
    Thank you
    Jason
    Broadcom Community Platform Admin, IT
    ------------------------------



  • 12.  RE: CPU hang with Fling 2.1

    Posted Jun 18, 2025 01:49 PM

    I get the same issue on my Pi4. I was able to work around it by reserving CPU for a VM based on the number of vCPUs it was configured with. At least that seems to be working for me.
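
As a rough sketch of what "based on the number of vCPUs" works out to: a full CPU reservation is the vCPU count multiplied by the per-core clock, and the reservations on a host cannot add up to more than cores times clock. The 1500 MHz figure below is an assumption (the stock Pi 4 clock; overclocked hosts will differ), the function names are illustrative, and ESXi keeps some capacity for itself, so the real ceiling is lower than this check suggests.

```python
def full_cpu_reservation_mhz(vcpus, core_mhz):
    """MHz needed to back every vCPU with a full physical core."""
    return vcpus * core_mhz


def fits_on_host(reservations_mhz, host_cores, core_mhz):
    """True if the summed reservations do not exceed raw host capacity."""
    return sum(reservations_mhz) <= host_cores * core_mhz


CORE_MHZ = 1500    # assumed stock Raspberry Pi 4 clock
HOST_CORES = 4     # Raspberry Pi 4 has four cores

two_vcpu_vm = full_cpu_reservation_mhz(2, CORE_MHZ)   # 3000 MHz
one_vcpu_vm = full_cpu_reservation_mhz(1, CORE_MHZ)   # 1500 MHz
```

With these assumed numbers, one 2-vCPU VM plus one 1-vCPU VM reserve 4500 of the 6000 MHz available, so only a handful of fully reserved VMs fit on a single Pi.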




  • 13.  RE: CPU hang with Fling 2.1

    Posted Jun 23, 2025 09:10 AM

    Sadly, reserving CPU resources matching the vCPUs does not solve it for me.

    I got it almost instantly after startup:

    [  272.019825] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [kswapd0:91]



    Did you configure the vCPUs as cores on one socket or as multiple sockets?




  • 14.  RE: CPU hang with Fling 2.1

    Posted Jun 23, 2025 09:11 AM

    Reserving CPU didn't work for me. I just had a VM running Debian lock up (again). I think I'll need to revert to ESXi 7.x if Broadcom can't figure out why this is happening.




  • 15.  RE: CPU hang with Fling 2.1

    Posted Jun 25, 2025 09:00 AM

    Let me back up. In my setup (RPi 8GB, USB-to-SATA SSD) I have a Home Assistant VM, which is the one I keep running 24/7. The others come up and down as part of my home lab. After I upgraded, I had two issues. The Home Assistant VM would do the same thing as reported here, and USB passthrough (USB for a UPS) would also randomly stop working. The only way to make the Home Assistant VM stable was to reserve vCPU (1 core, 2 sockets) and then reserve all the memory. One fixed the VM CPU issue, the other fixed the USB passthrough issue. I didn't write down which fixed which for my setup.

    As of right now the VM has been running well over 2 months without either issue. 




  • 16.  RE: CPU hang with Fling 2.1

    Posted 30 days ago

    Still locks up for me after reserving all memory and CPU. Not a great experience 😥