VMware vSphere


TKG Supervisor Cluster Constant Disk IO

  • 1.  TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 06:58 AM

    Is it expected for the supervisor cluster VMs to constantly be writing to disk at roughly 8-16MB/s? Each one is writing at that rate.

I spun up Workload Management in my homelab, and since the supervisor cluster was deployed there has been constant writing to disk, with the VMs reporting ~4,000 write ops/sec and more than 30MB/s written to disk by just these 3 VMs.

    root@420631715fd538d484265a9a44e451cb [ /usr ]# /usr/bin/python3 /usr/bin/dstat
    You did not select any stats, using -cdngy by default.
    Color support is disabled as python-curses is not installed.
    /usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
    --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai stl| read writ| recv send| in out | int csw
    14 10 41 35 0| 210k 16M| 0 0 | 0 0 | 17k 47k
    10 9 43 37 0| 0 16M| 27k 59k| 0 0 | 15k 50k
    13 9 42 35 0| 0 14M| 39k 100k| 0 0 | 16k 45k
    8 11 45 35 0| 0 15M| 13k 38k| 0 0 | 18k 48k
    12 9 45 34 0| 0 15M| 35k 95k| 0 0 | 17k 49k
    11 12 43 34 0| 0 15M| 29k 63k| 0 0 | 21k 43k

     

    root@42068944986ef297cc64a8a2a234c0dd [ /usr/bin ]# /usr/bin/python3 /usr/bin/dstat
    You did not select any stats, using -cdngy by default.
    Color support is disabled as python-curses is not installed.
    /usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
    --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai stl| read writ| recv send| in out | int csw
    10 9 44 37 0| 162k 16M| 0 0 | 0 0 | 15k 41k
    12 10 39 38 0| 0 16M| 16k 7118B| 0 0 | 16k 38k
    7 7 44 41 0| 0 16M| 147k 58k| 0 0 | 17k 40k
    8 8 45 39 0| 0 15M| 22k 9355B| 0 0 | 19k 41k
    9 9 45 38 0| 0 16M| 137k 38k| 0 0 | 19k 40k
    9 10 42 39 0| 0 16M| 28k 11k| 0 0 | 17k 36k
    7 8 44 41 0| 0 16M| 56k 26k| 0 0 | 17k 39k
    8 9 43 39 0| 0 16M| 22k 10k| 0 0 | 17k 37k

     

    root@4206a24f8bfc148c68b68e56e223fa7f [ ~ ]# /usr/bin/python3 /usr/bin/dstat
    You did not select any stats, using -cdngy by default.
    Color support is disabled as python-curses is not installed.
    /usr/bin/dstat:515: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    if isinstance(self.val[name], collections.Sequence) and not isinstance(self.val[name], six.string_types):
    --total-cpu-usage-- -dsk/total- -net/total- ---paging-- ---system--
    usr sys idl wai stl| read writ| recv send| in out | int csw
    20 16 38 26 0| 163k 9567k| 0 0 | 0 0 | 10k 27k
    18 15 40 27 0| 0 10M| 50k 21k| 0 0 |7850 26k
    18 17 38 27 0| 0 9292k| 23k 11k| 0 0 |8936 25k
    15 15 41 29 0| 0 8816k| 52k 23k| 0 0 | 10k 26k
    14 14 41 31 0| 0 9464k| 29k 15k| 0 0 | 11k 26k
    23 15 34 28 0| 0 9700k| 50k 22k| 0 0 | 12k 27k
    17 17 38 28 0| 0 9024k| 25k 10k| 0 0 | 11k 26k
    18 12 40 30 0| 0 9696k| 52k 24k| 0 0 | 11k 26k
    18 17 38 27 0| 0 8968k| 25k 13k| 0 0 | 11k 26k
    21 18 33 27 0| 0 9312k| 42k 18k| 0 0 | 13k 26k
    36 24 20 20 0| 0 10M| 22k 10k| 0 0 | 15k 33k
    15 13 43 28 0| 0 8524k| 55k 24k| 0 0 | 10k 25k
    17 19 37 27 0| 0 8888k| 27k 15k| 0 0 | 11k 27k
    14 16 45 25 0| 0 9880k| 46k 20k| 0 0 | 11k 33k
    13 17 45 26 0| 0 10M|6903B 4904B| 0 0 |9881 32k

I validated this by reviewing statistics on my storage and the VM statistics.

    https://imgur.com/a/7xqKz1Q

I have not yet figured out how to determine what/where it's writing to, but... that seems a bit much, no? 16MB/sec is a lot of text if it's just logging, and it's all writes.

I've also deployed this numerous times with the same results. When I used the "large" deployment size, I think it was a touch lower.



  • 2.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 07:46 PM

Yes, I also see about 500 TBW per year (2.5 PBW over 5 years; that will chew through your vSAN cache tier) per SupervisorControlPlaneVM in 7.0 U1.

    About 90-95% of the traffic is caused by

    /fluent-bit/bin/fluent-bit -c /fluent-bit/etc/fluent-bit.conf --plugin /fluent-bit/bin/out_syslog.so
     
    You can check it by running pidstat -d
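
For example (a rough sketch; pidstat comes from sysstat, which appears to be available on these VMs given the command above works):

pidstat -d 5 3     # per-process disk I/O, 3 samples at 5-second intervals
                   # fluent-bit should dominate the kB_wr/s column if it is the main writer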
     
Also, the CPU load on the SupervisorControlPlaneVM is sometimes very high, but it immediately goes down when logging in via SSH:
     
[attached screenshot: Screenshot 2021-01-25 203947.png]

     

    Edit:

When checking the logs of the fluent-bit pods, I can see that they each generate about 700 log lines per second like this:

    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY
    [2021/01/25 19:52:04] [debug] [input:tail:tail.0] inode=262901 events: IN_MODIFY



  • 3.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 08:31 PM

I found that the file being written is /var/log/vmware/fluentbit/consolidated.log, and the error is as follows:

    kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935094, {"log":"2021-01-25T20:49:23.935062778Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.934908315Z\",\"caller\":\"storagepool/listener.go:98\",\"msg\":\"Not able to establish connection with VC. Restarting property collector.\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:98\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
    kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.935898, {"log":"2021-01-25T20:49:23.93586061Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935757796Z\",\"caller\":\"vsphere/virtualcenter.go:232\",\"msg\":\"failed to obtain user session with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:232\\nsigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:205\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]
    kube.var.log.containers.vsphere-csi-controller-7c646978db-5q5g9_vmware-system-csi_vsphere-syncer-d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359.log: [1611607763.936509, {"log":"2021-01-25T20:49:23.935989826Z stderr F {\"level\":\"error\",\"time\":\"2021-01-25T20:49:23.935803811Z\",\"caller\":\"vsphere/virtualcenter.go:207\",\"msg\":\"Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host\",\"stacktrace\":\"sigs.k8s.io/vsphere-csi-driver/pkg/common/cns-lib/vsphere.(*VirtualCenter).Connect\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/common/cns-lib/vsphere/virtualcenter.go:207\\nsigs.k8s.io/vsphere-csi-driver/pkg/syncer/storagepool.initListener.func1\\n\\t/build/mts/release/bora-17165480/cayman_vsphere_csi_driver/vsphere_csi_driver/src/pkg/syncer/storagepool/listener.go:96\"}","kubernetes":{"pod_name":"vsphere-csi-controller-7c646978db-5q5g9","namespace_name":"vmware-system-csi","pod_id":"429e3101-682d-4e45-9e70-9f2c66641b32","labels":{"app":"vsphere-csi-controller","pod-template-hash":"7c646978db","role":"vsphere-csi"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"420de87c5115994bb75310e8bc809bed","container_name":"vsphere-syncer","docker_id":"d5cb3d3dca1abff78161f01569ad7c3971f0cdc62b98029cda99890b412d4359","container_hash":"localhost:5000/vmware/syncer@sha256:9cb2cca0700423dccd393bbc32040280b67bf7b111736c04bc7515c286e7d0b1","container_image":"localhost:5000/vmware/syncer:v2.1.0-32c3ebd"}}]

     

    And the root cause is this:

    Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

     

Photon OS 3.0 is not able to resolve .local domains; see the VMware KB articles "DNS Related Troubleshooting With Unified Access Gateway" (50120424) and "DNS Related Troubleshooting With Unified Access Gateway 3.7 and newer which is based on Photon 3" (78611).

    So I will have to change my homelab's domain.

    Do you use .local too as your domain?



  • 4.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 08:53 PM

That was a red herring; the IOPS have just moved to another controller VM now.

But I found that the file being written is /var/log/vmware/fluentbit/consolidated.log, and the error is as follows:

(same vsphere-csi-controller / vsphere-syncer log excerpt as in reply #3 above)

    And the root cause is this:

    Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

But now I am not sure why it cannot resolve this; all the DNS servers I use can indeed resolve vc.local.

    I will look at this more in-depth now



  • 5.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 09:08 PM

That was a red herring; the IOPS have just moved to another controller VM now.

But I found that the file being written is /var/log/vmware/fluentbit/consolidated.log, and the error is as follows:

(same vsphere-csi-controller / vsphere-syncer log excerpt as in reply #3 above)

    And the root cause is this:

    Cannot connect to vCenter with err: Post \\\"https://vc.local:443/sdk\\\": dial tcp: lookup vc.local: no such host

But now I am not sure why it cannot resolve this; all the DNS servers I use can indeed resolve vc.local.

    I will look at this more in-depth now

    Edit:

Photon OS 3.0 is not able to resolve .local domains; see the VMware KB articles "DNS Related Troubleshooting With Unified Access Gateway" (50120424) and "DNS Related Troubleshooting With Unified Access Gateway 3.7 and newer which is based on Photon 3" (78611).

    So I will have to change my homelab's domain.

    Do you use .local too as your domain?



  • 6.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 09:25 PM

My thorough answer always gets deleted, but it boils down to:

Do not use a .local domain for your vCenter. Photon OS 3.0 fails to resolve .local names, so the controller tries to connect to your vCenter thousands of times, which generates a large volume of log output in /var/log/vmware/fluentbit/consolidated.log.
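
A quick way to check where the resolution actually breaks on a SupervisorControlPlaneVM (vc.local is just the hostname from the logs above; substitute your own, and treat this as a rough sketch):

nslookup vc.local       # asks the configured DNS server directly
getent hosts vc.local   # goes through the full resolver path (NSS/systemd-resolved), which is where .local handling can differ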



  • 7.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 25, 2021 10:16 PM

I am using a .local domain.

I took a look at that log and can find no references to my vCenter by name or FQDN, and no DNS errors.

    The VM can also resolve the hostname just fine. I provided the DNS Search Domain during workload management deployment.

    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter.incendiary.local
    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep vcenter

    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# cat /var/log/vmware/fluentbit/consolidated.log | grep 192.168.1.100
    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# ping vcenter
    PING vcenter.incendiary.local (192.168.1.100) 56(84) bytes of data.
    64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=1 ttl=62 time=4.51 ms
    64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=2 ttl=62 time=3.85 ms
    64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=3 ttl=62 time=4.72 ms
    64 bytes from vcenter.incendiary.local (192.168.1.100): icmp_seq=4 ttl=62 time=3.79 ms

    Only errors I can find are (primarily) related to certificates.

    1:57102: remote error: tls: bad certificate","kubernetes":{"pod_name":"vmware-system-license-operator-controller-manager-7d8c9679l8dcq","namespace_name":"vmware-system-license-operator","pod_id":"c9941aeb-e0f3-470e-b4ed-1088226e751d","labels":{"control-plane":"license-operator","pod-template-hash":"7d8c9679b8"},"annotations":{"kubernetes.io/psp":"wcp-privileged-psp"},"host":"42068944986ef297cc64a8a2a234c0dd","container_name":"manager","docker_id":"1dc94e93d423362e797945988248e7df7c8fcb9d5b79124ec57a95a79644a446","container_hash":"localhost:5000/vmware/license-operator@sha256:ada038288ec9b5b752bfc0977cbbc3de694bf4e901043dce9620a06ef8fe0a23","container_image":"localhost:5000/vmware/license-operator:031b086"}}]

Though there are failures calling a webhook:

    kube.var.log.containers.kube-apiserver-42068944986ef297cc64a8a2a234c0dd_kube-system_kube-apiserver-22b56fc4db84352c49181d41d260294ffa395a38af5bbadcacd3891653ea7d1d.log: [1611612175.872795, {"log":"2021-01-25T22:02:55.872387647Z stderr F W0125 22:02:55.872299 1 dispatcher.go:128] Failed calling webhook, failing open default.validating.license.supervisor.vmware.com: failed calling webhook \"default.validating.license.supervisor.vmware.com\": Post https://vmware-system-license-operator-webhook-service.vmware-system-license-operator.svc:443/supervisor-license-validate-licenseoperator-vmware-com-v1alpha1-kuberneteslicense?timeout=30s: x509: certificate signed by unknown authority","kubernetes":{"pod_name":"kube-apiserver-42068944986ef297cc64a8a2a234c0dd","namespace_name":"kube-system","pod_id":"4d13c00e-cef1-40c7-b5eb-193377285045","labels":{"component":"kube-apiserver","tier":"control-plane"},"annotations":{"kubeadm.kubernetes.io/kube-apiserver.advertise-address.endpoint":"10.0.13.130:6443","kubernetes.io/config.hash":"edadeb83c3d11c3a33abe5011f6cc5d4","kubernetes.io/config.mirror":"edadeb83c3d11c3a33abe5011f6cc5d4","kubernetes.io/config.seen":"2021-01-25T04:06:51.110073647Z","kubernetes.io/config.source":"file","kubernetes.io/psp":"wcp-privileged-psp"},"host":"42068944986ef297cc64a8a2a234c0dd","container_name":"kube-apiserver","docker_id":"c5d9a31d70ab45347fbc983aa8b854d99385791d28220557502382d0e0e3267c"}}]

What specifically would I need to look for in here to determine or validate whether it is an issue with .local?

Thanks for the tip on pidstat; a tool I had not yet found. I do see that the writes are caused by fluent-bit.



  • 8.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 26, 2021 03:27 AM

OK, so I'm not sure why grep didn't find it, but I did find the error lines where it lists my vCenter. Out of the 63,645 lines of log, though, vcenter is only found 683 times.
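
One guess as to why the earlier grep came up empty (pure speculation): the log might reference the vCenter with different casing, or the file may have rotated since. A case-insensitive count sidesteps the first possibility:

grep -ic vcenter /var/log/vmware/fluentbit/consolidated.log   # -i ignores case, -c counts matching lines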

Then I pulled /var/log/pods/vmware-system-csi_vsphere-csi-controller-7c646978db-dv8tn_4b4f30e2-85d2-4d2a-8d39-908f1127d51a/vsphere-syncer/3.log

It's getting an error on POST.

If I curl the URL without HTTPS it errors with a recv failure. If I curl with HTTPS it returns nothing.

    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# curl vcenter.incendiary.local:443/sdk
    curl: (56) Recv failure: Connection reset by peer
    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]# curl https://vcenter.incendiary.local:443/sdk
    root@42068944986ef297cc64a8a2a234c0dd [ ~ ]#
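
For what it's worth, an empty body from a GET against /sdk is not necessarily a failure, since it is a SOAP endpoint that expects POST requests. A verbose request at least confirms DNS and the TLS handshake (hostname taken from the output above; this is just a quick check, not a proper SDK call):

curl -vk https://vcenter.incendiary.local/sdk   # -v shows resolution and handshake details, -k skips certificate verification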

    Yet it does still find the datastores.



  • 9.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 26, 2021 03:28 AM

Looks like /var/log/vmware/audit/kube-apiserver.log logs a lot too: 1 GB in 6 hours, though ls doesn't display 3 of the files. It is averaging 1 new log file every 30-45 minutes, at 100 MB each. The time shown by ls doesn't match up with the actual log lines or the date/time changed as displayed in WinSCP.
12:30-18:30 is 11 files @ 1.04 GB.
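
A rough way to watch that growth with standard tools (just a sketch; the glob assumes the rotated files sit next to the path above, and watch may not be installed):

du -ch /var/log/vmware/audit/kube-apiserver*.log    # total size of the current audit log files
watch -n 60 'du -sh /var/log/vmware/audit'          # re-check the directory size every minute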

Also seeing this every minute in the coredns log:
    /var/log/pods/kube-system_coredns-5dc469f6f-8bdfl_7d43881e-8f67-4958-a628-b3ad9ee2f3ba/coredns/6.log

    Seeing more bad certificate errors.
    /var/log/pods/vmware-system-license-operator_vmware-system-license-operator-controller-manager-7d8c9679qkmmk_81254b45-d221-49ef-a4dd-ce2e15ad5f56/manager/0.log

     



  • 10.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 26, 2021 08:04 AM

Let me raise this with a few folks; that sounds very extreme indeed.



  • 11.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 26, 2021 10:10 PM

Thank you. I would say 2.5 TBW per 24 hours is a touch on the high side.

That is, if my maths are correct, it works out to about 2.5 TB written per day.
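
(Rough check: the ~30 MB/s combined from the dstat output in the first post × 86,400 seconds per day ≈ 2.6 TB/day, so 2.5 TBW per day is in the right ballpark.)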

I would be interested to know if there is some sort of issue with .local domains.

    In part, because of this VMware TAM Lab 067 video where they are using a .local domain.

    https://www.youtube.com/watch?v=bQ4FcieZlZg



  • 12.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Jan 27, 2021 07:07 AM

    Agreed, engineering is looking into it.



  • 13.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 03, 2021 02:18 AM

    Any updates? Any additional info I might be able to provide?



  • 14.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 03, 2021 08:54 AM

    I have no update, other than it is being looked in to.



  • 15.  RE: TKG Supervisor Cluster Constant Disk IO
    Best Answer

    Posted Feb 04, 2021 08:51 AM

The issue has been confirmed in our engineering environment as well; it is being investigated and will hopefully be fixed in a future release. If you run this in production, I would recommend filing an SR for it and submitting all the details, so that it can be attached to the engineering issue.



  • 16.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 04, 2021 03:52 PM

    Good to hear it isn't something isolated to my environment. Thank you!
    In my case this is just a homelab, so not a huge concern.

We do intend to (eventually) deploy this at work though, which is also on a .local domain.
Not sure on the timeframe though, as we are still working to get that environment upgraded to 7.0 U1.

    Is the issue related to .local domains or something else?



  • 17.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 17, 2021 08:26 AM

No, it appears to be related to the log level for fluent-bit.



  • 18.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 23, 2021 08:56 PM

I had the same problem with the latest vSphere with Tanzu version (vCenter Server 7.0 U1d / v1.18.2-vsc0.0.7-17449972). I solved it by changing the fluent-bit log level from debug to warn in the Supervisor Control Plane. It was spamming debug messages:

    [debug] [input:tail:tail.0] inode=526818 events: IN_MODIFY

    See this for details:

    https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/
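
In case the link ever disappears, the gist of the change I made (a rough sketch; the daemonset name is my assumption, check it with kubectl get ds -n vmware-system-logging):

kubectl edit configmap fluentbit-config -n vmware-system-logging      # change Log_Level from debug to warn in fluent-bit.conf
kubectl rollout restart daemonset fluentbit -n vmware-system-logging  # assumed daemonset name; restarts the fluent-bit pods so they pick up the new config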

     



  • 19.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 24, 2021 07:52 AM

    Yes, that is indeed the way to get around the problem.



  • 20.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 26, 2021 08:19 AM

    Just found this on the problem: https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/

I checked my system and we're seeing 8,000 IOPS from the supervisor VMs. I will create an SR instead of fixing it myself according to the blog.



  • 21.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted Feb 26, 2021 01:44 PM

    I got an answer back on my SR. It was a simple fix for us:

    You can disable fluentbit logging by:

     

1. SSH into the VCSA.
2. Edit /etc/vmware/wcp/wcpsvc.yaml on the VCSA and change logging_fluentbit_enabled: true to logging_fluentbit_enabled: false.
3. Run service-control --restart wcp.
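
For reference, the same edit done non-interactively (a sketch based on the steps above; back up the file first):

cp /etc/vmware/wcp/wcpsvc.yaml /etc/vmware/wcp/wcpsvc.yaml.bak
sed -i 's/logging_fluentbit_enabled: true/logging_fluentbit_enabled: false/' /etc/vmware/wcp/wcpsvc.yaml
service-control --restart wcp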

     

    From almost 8000 IOPS to ~250 IOPS:

[attached screenshot: Joffer77_0-1614347104848.png]

     



  • 22.  RE: TKG Supervisor Cluster Constant Disk IO

    Posted May 02, 2021 06:47 PM

    Looks like the default logging level was finally fixed in 7.0 U2a.

Just upgraded my vCenter and deployed Workload Management again. I am not seeing that extraneous disk activity.

Reviewing the YAML for vmware-system-logging shows a log level of error. I did not see what the setting was before, but per the blog post on virten.net, one of the solutions could be to manually update the YAML to lower the log level and then roll out the change.

    https://www.virten.net/2021/02/vsphere-with-tanzu-supervisorcontrolplanevm-excessive-disk-write-io/

     

    Current YAML settings:

    kubectl edit configmaps fluentbit-config --namespace vmware-system-logging

fluent-bit.conf: |
  [SERVICE]
      Flush          1
      Log_Level      error
      Daemon         off
      Parsers_File   parsers.conf
      Plugins_File   plugins.conf
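
If you just want to confirm the level on your own deployment without opening an editor, something like this works (read-only):

kubectl get configmap fluentbit-config -n vmware-system-logging -o yaml | grep -i log_level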