ESXi-Arm Fling

  • 1.  CPU hang with Fling 2.1

    Posted Jan 06, 2025 09:28 AM

    Upgraded in place from the last 1.0 Fling to the latest 2.1.

    We're getting a lot of CPU hangs and stack traces on Debian 12 / testing.

    Is there a known issue?

    3 x RPI4 8GB, 07.12.24 EEPROM, RPI 1.38 EFI

    [ 5736.409823] watchdog: Watchdog detected hard LOCKUP on cpu 2
    [ 5736.410110] Modules linked in: veth nf_conntrack_netlink xt_nat xt_tcpudp xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfrm_algo xt_addrtype nft_compat nf_tables libcrc32c overlay nls_ascii nls_cp437 crct10dif_ce vfat vmwgfx fat drm_ttm_helper ttm drm_kms_helper sg drm efi_pstore configfs nfnetlink vsock_loopback vmw_vsock_virtio_transport_common vsock efivarfs ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 crc32c_generic sr_mod sd_mod cdrom ahci libahci libata scsi_mod scsi_common vmxnet3
    [ 5736.410240] Sending NMI from CPU 1 to CPUs 2:
    [ 5736.410292] NMI backtrace for cpu 2
    [ 5736.410337] CPU: 2 UID: 996 PID: 5205 Comm: postgres Not tainted 6.12.6-arm64 #1  Debian 6.12.6-1
    [ 5736.410346] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [ 5736.410350] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [ 5736.410359] pc : __rmqueue_pcplist+0x58c/0xd70
    [ 5736.410405] lr : __rmqueue_pcplist+0x548/0xd70
    [ 5736.410411] sp : ffff8000832e35f0
    [ 5736.410414] x29: ffff8000832e36a0 x28: 000000000000003f x27: ffff00013f587f30
    [ 5736.410422] x26: fffffdffc4822980 x25: 0000000000000000 x24: ffff00013f603640
    [ 5736.410430] x23: ffff00013f587f00 x22: 0000000000000001 x21: fffffdffc2e78788
    [ 5736.410437] x20: 0000000000000000 x19: 0000000000000000 x18: 0000000000000000
    [ 5736.410458] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [ 5736.410465] x14: 0000000000000100 x13: 0000000000000000 x12: 0000000000000000
    [ 5736.410472] x11: 0000000000000040 x10: 000000000000003f x9 : ffff8000803a8ea8
    [ 5736.410479] x8 : 00000000ffffffff x7 : ffff00013f587f30 x6 : ffff8000832e35f0
    [ 5736.410486] x5 : ffff00013f603640 x4 : ffff00013f587f30 x3 : 000000000000001a
    [ 5736.410492] x2 : fffffdffc2e78788 x1 : ffff00013f587f30 x0 : ffff00013f603bc0
    [ 5736.410500] Call trace:
    [ 5736.410503]  __rmqueue_pcplist+0x58c/0xd70
    [ 5736.410522]  get_page_from_freelist+0x6b0/0x1b30
    [ 5736.410526]  __alloc_pages_noprof+0x170/0xf20
    [ 5736.410529]  alloc_pages_mpol_noprof+0x98/0x208
    [ 5736.410547]  alloc_pages_noprof+0x50/0xd0
    [ 5736.410551]  folio_alloc_noprof+0x1c/0x70
    [ 5736.410556]  filemap_alloc_folio_noprof+0x144/0x160
    [ 5736.410567]  __filemap_get_folio+0x21c/0x3f0
    [ 5736.410572]  ext4_da_write_begin+0x118/0x2a8 [ext4]
    [ 5736.410632]  generic_perform_write+0xd8/0x268
    [ 5736.410636]  ext4_buffered_write_iter+0x74/0x140 [ext4]
    [ 5736.410657]  ext4_file_write_iter+0x70/0x8c0 [ext4]
    [ 5736.410676]  vfs_write+0x24c/0x3b8
    [ 5736.410689]  __arm64_sys_pwrite64+0xb4/0xf0
    [ 5736.410693]  invoke_syscall+0x6c/0x100
    [ 5736.410716]  el0_svc_common.constprop.0+0x48/0xf0
    [ 5736.410722]  do_el0_svc+0x24/0x38
    [ 5736.410727]  el0_svc+0x38/0x120
    [ 5736.410752]  el0t_64_sync_handler+0x120/0x130
    [ 5736.410758]  el0t_64_sync+0x190/0x198
     
    getconf PAGE_SIZE
    4096
     
    6.12.6-arm64 #1 SMP Debian 6.12.6-1 (2024-12-21) aarch64 GNU/Linux


  • 2.  RE: CPU hang with Fling 2.1

    Posted Jan 07, 2025 08:48 AM

    Same on Ubuntu with a 5.15.x kernel.

    [30392.792642] watchdog: BUG: soft lockup - CPU#0 stuck for 38s! [kcompactd0:32]
    [30392.797306] Modules linked in: tls xt_nat xt_tcpudp nf_conntrack_netlink veth xt_conntrack nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 bridge stp llc xfrm_user xfrm_algo nft_counter xt_addrtype nft_compat nf_tables nfnetlink overlay vmw_vsock_vmci_transport vsock binfmt_misc nls_iso8859_1 joydev input_leds vmw_vmci sch_fq_codel dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua efi_pstore ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid vmwgfx ttm crct10dif_ce drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm xhci_pci ahci xhci_pci_renesas vmxnet3 aes_neon_bs aes_neon_blk crypto_simd cryptd
    [30392.797619] CPU: 0 PID: 32 Comm: kcompactd0 Not tainted 5.15.0-130-generic #140-Ubuntu
    [30392.797629] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [30392.797633] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [30392.797643] pc : isolate_freepages_block+0x3ac/0x4b0
    [30392.797688] lr : isolate_freepages_block+0x328/0x4b0
    [30392.797692] sp : ffff80000afab9f0
    [30392.797694] x29: ffff80000afab9f0 x28: 0000000000000800 x27: ffff80000afabd38
    [30392.797701] x26: 0000000000095800 x25: 0000000000000001 x24: 0000000000000000
    [30392.797708] x23: 0000000000000001 x22: ffff80000afabb30 x21: 0000000000000006
    [30392.797714] x20: fffffc000055a800 x19: 00000000000956a1 x18: 0000000000000000
    [30392.797720] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
    [30392.797725] x14: ffff80000a96be60 x13: ffff80000a96b948 x12: ffff00007fbfdf80
    [30392.797731] x11: ffff00007fbfdf80 x10: 0000000000000001 x9 : 00000000f0000080
    [30392.797737] x8 : ffff80000aa95cf8 x7 : 0000000000000020 x6 : ffff0000001fa080
    [30392.797743] x5 : ffff80000aa95a70 x4 : 0000000000000001 x3 : ffff80000afabd38
    [30392.797749] x2 : 0000000000000000 x1 : ffff00007fbfe5d0 x0 : 0000000000000000
    [30392.797755] Call trace:
    [30392.797759]  isolate_freepages_block+0x3ac/0x4b0
    [30392.797764]  isolate_freepages+0x1c4/0x360
    [30392.797767]  compaction_alloc+0x74/0x90
    [30392.797771]  unmap_and_move+0x6c/0x3fc
    [30392.797779]  migrate_pages+0x364/0x61c
    [30392.797783]  compact_zone+0x2b8/0x684
    [30392.797787]  proactive_compact_node+0x90/0xdc
    [30392.797791]  kcompactd+0x208/0x4d4
    [30392.797794]  kthread+0x110/0x114
    [30392.797799]  ret_from_fork+0x10/0x20
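
    To compare traces like the two above across several VMs, a small script can condense a kernel log into one line per lockup. This is only a minimal sketch: the patterns are modelled on the log lines quoted in this thread and may not match other kernel versions, and the names `summarize` and `lockups.py` are made up for illustration.

```python
import re
import sys

# Patterns modelled on the log lines quoted in this thread (assumption:
# other kernel versions may word these messages differently).
HARD = re.compile(r"Watchdog detected hard LOCKUP on cpu (\d+)")
SOFT = re.compile(r"BUG: soft lockup - CPU#(\d+)")
CPU_LINE = re.compile(r"\bCPU: (\d+)\b.*\bComm: (\S+)")
PC = re.compile(r"\bpc : ([\w.]+)\+0x")


def new_event(kind, cpu):
    """Create an empty record for one lockup trace."""
    return {"kind": kind, "cpu": cpu, "comm": None, "pc": None}


def summarize(log_text):
    """Return one dict per trace: lockup kind, CPU, task name, pc symbol."""
    events = []
    current = None
    for line in log_text.splitlines():
        header = HARD.search(line) or SOFT.search(line)
        cpu = CPU_LINE.search(line)
        pc = PC.search(line)
        if header:
            kind = "hard" if HARD.search(line) else "soft"
            current = new_event(kind, int(header.group(1)))
            events.append(current)
        elif cpu:
            # Excerpts without a watchdog header start at the "CPU:" line.
            if current is None or current["comm"] is not None:
                current = new_event("unknown", int(cpu.group(1)))
                events.append(current)
            current["comm"] = cpu.group(2)
        elif pc and current is not None and current["pc"] is None:
            current["pc"] = pc.group(1)
        elif "SOFTLOCKUP" in line and current and current["kind"] == "unknown":
            # The taint flag line identifies a soft lockup after the fact.
            current["kind"] = "soft"
    return events


if __name__ == "__main__":
    for event in summarize(sys.stdin.read()):
        print(event["kind"], event["cpu"], event["comm"], event["pc"])
```

    It could be run inside a guest as `dmesg | python3 lockups.py`. Grouping the output by the pc symbol should show quickly whether the hangs cluster in the page allocator and compaction paths or are spread across unrelated code.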



  • 3.  RE: CPU hang with Fling 2.1

    Broadcom Employee
    Posted Jan 08, 2025 10:51 AM

    Hi Xeroxxx,

    I will have to try to reproduce it to get a better idea of what's going on. How many vCPUs and how much memory does your VM have?

    I suppose the Linux distribution doesn't matter: it was fine with Fling v1, and it is happening on both Debian and Ubuntu.

    Cyprien




  • 4.  RE: CPU hang with Fling 2.1

    Posted Jan 08, 2025 04:29 PM

    Hello Cyprien,

    thanks for moving from X to here. I sent you additional logs on Pastebin showing non-filesystem-related errors.

    Currently 3 x RPI4 hosts at 2000 MHz, EEPROM 07.12.2024, EFI 1.38. Advanced settings: Mem.ShareForceSalting 0, Mem.MemZipMaxPct 20.

    There are VMs with 1 vCPU / 384 MB, 2 vCPU / 1024 MB, 2 vCPU / 2048 MB, and one with 4 vCPU / 6144 MB on one host running GitLab in Docker. The last one runs for 10 minutes and then freezes; it keeps running with 5 GB, though.

    Mainly Debian 12 testing, one VM with USB passthrough of a card reader, some Ubuntu and Arch.

    All VMs are stored on a shared iSCSI LUN backed by an SSD (queue depth 128). Local storage is used only for host cache and local swap.

    Thank you
    Xeroxxx



  • 5.  RE: CPU hang with Fling 2.1

    Posted Jan 14, 2025 09:08 AM
    Edited by Jason McClellan Jan 16, 2025 08:47 AM

    I might have found the issue.

    It seems to be related to Mem.ShareForceSalting = 0. Setting it back to 2 on one host solves the issue.

    I used it to save on precious memory with a lot of small machines.

    Can you replicate the issue with setting it to 0?

    EDIT: Never mind, it still happens.

    [102941.193995] CPU: 0 UID: 0 PID: 8465 Comm: kworker/0:0 Tainted: G             L     6.12.6-arm64 #1  Debian 6.12.6-1
    [102941.194004] Tainted: [L]=SOFTLOCKUP
    [102941.194006] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [102941.194012] Workqueue: events drm_fb_helper_damage_work [drm_kms_helper]
    [102941.194036] pstate: a0000005 (NzCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [102941.194040] pc : __memcpy+0x128/0x240
    [102941.194053] lr : vmw_diff_memcpy+0x348/0x670 [vmwgfx]
    
    [100163.653874] CPU: 1 PID: 227739 Comm: ib_tpool_worker Not tainted 5.15.0-130-generic #140-Ubuntu
    [100163.653883] Hardware name: VMware, Inc. VMware20,1/VBSA, BIOS VMW201.00V.24405116.BA64.2411261552 11/26/2024
    [100163.653886] pstate: 40400005 (nZcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [100163.653910] pc : arch_local_irq_enable+0xc/0x2c
    [100163.653956] lr : copy_process+0xb3c/0x12b0

    Cheers

    Xeroxxx