
High Component Usage


  • 1.  High Component Usage

    Posted Jul 29, 2021 12:39 PM

    Hello All,

     

    I have a strange issue: I have 2 clusters (an 8-node and a 10-node), both with Cloud Director on top, and the component usage is extremely high.

    There are only around 800 VMs on the cluster, and I can't understand how the component usage is spiralling out of control.

    LordofVxRail_0-1627563421810.png

     

    I use the default FTT1 policy

    LordofVxRail_1-1627562096847.png

     

    LordofVxRail_2-1627562167474.png

     



  • 2.  RE: High Component Usage

    Posted Jul 29, 2021 12:41 PM

    LordofVxRail_0-1627562445845.png

     



  • 3.  RE: High Component Usage

    Posted Jul 29, 2021 12:43 PM

    any ideas as to how this is happening?

     

    thanks!

     

     



  • 4.  RE: High Component Usage

    Posted Jul 29, 2021 12:53 PM

    You have ~77,000 components in this cluster, and these make up its 11,877 Objects. I am going to make an educated guess that you are storing these mostly as PFTT=1,SFTT=1,SFTM=RAID5 (i.e. a RAID5+RAID5 storage policy); such a policy uses a minimum of 9 components per Object.

     

    Thus if you are using such a storage policy then this is completely expected behaviour.

     

    I would also advise redacting the hostnames more thoroughly; as they are now, it is fairly trivial to work out their names.



  • 5.  RE: High Component Usage

    Posted Jul 29, 2021 12:56 PM

    hey, thanks for the reply

     

    my storage policy is FTT1, default, nothing complex at all, which is why I am finding this so strange. 

     

     



  • 6.  RE: High Component Usage

    Posted Jul 29, 2021 01:00 PM

    LordofVxRail_0-1627563513473.png

    all vms are compliant 

     

    LordofVxRail_1-1627563555692.png

    LordofVxRail_2-1627563579282.png

     



  • 7.  RE: High Component Usage

    Posted Jul 29, 2021 01:32 PM

    Can you please check the layout of some Objects from the CLI:
    # esxcli vsan debug object list --all > /tmp/objout123

     

    Then it is just a case of looking at the output and the layout of the Objects to confirm whether they are stored as regular RAID1 (i.e. 3 components per Object).



  • 8.  RE: High Component Usage

    Posted Jul 29, 2021 01:51 PM

    Sure, I can check that. Here is the vsan.obj_status_report output:

     

    /localhost/DC02/computers> vsan.obj_status_report xxxxxxxxxxx
    2021-07-29 13:09:44 +0000: Querying all VMs on vSAN ...
    2021-07-29 13:09:49 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-01.xxxxxxxxx ...
    2021-07-29 13:09:49 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-02.xxxxxxxxx ...
    2021-07-29 13:09:49 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-04.xxxxxxxxx ...
    2021-07-29 13:09:50 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-10.xxxxxxxxx ...
    2021-07-29 13:09:50 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-06.xxxxxxxxx ...
    2021-07-29 13:09:50 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-09.xxxxxxxxx ...
    2021-07-29 13:09:50 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-08.xxxxxxxxx ...
    2021-07-29 13:09:52 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-03.xxxxxxxxx ...
    2021-07-29 13:09:52 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-05.xxxxxxxxx ...
    2021-07-29 13:09:52 +0000: Querying DOM_OBJECT in the system from xxxxxxxxx-07.xxxxxxxxx ...
    2021-07-29 13:09:53 +0000: Querying all disks in the system from xxxxxxxxx-01.xxxxxxxxx ...
    2021-07-29 13:09:54 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-01.xxxxxxxxx ...
    2021-07-29 13:09:54 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-02.xxxxxxxxx ...
    2021-07-29 13:09:54 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-04.xxxxxxxxx ...
    2021-07-29 13:09:54 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-10.xxxxxxxxx ...
    2021-07-29 13:09:54 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-06.xxxxxxxxx ...
    2021-07-29 13:09:55 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-09.xxxxxxxxx ...
    2021-07-29 13:09:55 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-08.xxxxxxxxx ...
    2021-07-29 13:09:55 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-03.xxxxxxxxx ...
    2021-07-29 13:09:56 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-05.xxxxxxxxx ...
    2021-07-29 13:09:56 +0000: Querying LSOM_OBJECT in the system from xxxxxxxxx-07.xxxxxxxxx ...
    2021-07-29 13:09:57 +0000: Querying all object versions in the system ...
    2021-07-29 13:09:59 +0000: Got all the info, computing table ...

    Histogram of component health for non-orphaned objects

    +-------------------------------------+------------------------------+
    | Num Healthy Comps / Total Num Comps | Num objects with such status |
    +-------------------------------------+------------------------------+
    | 6/6 (OK)                            | 1543                         |
    | 5/5 (OK)                            | 8318                         |
    | 8/8 (OK)                            | 304                          |
    | 7/7 (OK)                            | 1297                         |
    | 4/4 (OK)                            | 113                          |
    | 3/3 (OK)                            | 297                          |
    | 12/12 (OK)                          | 2                            |
    | 75/75 (OK)                          | 1                            |
    | 36/36 (OK)                          | 2                            |
    +-------------------------------------+------------------------------+
    Total non-orphans: 11877


    Histogram of component health for possibly orphaned objects

    +-------------------------------------+------------------------------+
    | Num Healthy Comps / Total Num Comps | Num objects with such status |
    +-------------------------------------+------------------------------+
    +-------------------------------------+------------------------------+
    Total orphans: 0

    Total v10 objects: 11877
    /localhost/DC02/computers>
    /localhost/DC02/computers>
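For reference, the histogram in this post can be turned into totals with a little arithmetic. This is only a minimal sketch in Python: the rows are copied by hand from the RVC output, and the variable names are made up for illustration.

```python
# (total components per object, number of objects), copied by hand from the
# "non-orphaned objects" histogram printed by vsan.obj_status_report.
histogram = [
    (6, 1543), (5, 8318), (8, 304), (7, 1297), (4, 113),
    (3, 297), (12, 2), (75, 1), (36, 2),
]

total_objects = sum(count for _, count in histogram)
total_components = sum(comps * count for comps, count in histogram)
# Objects with exactly 3 components, the layout expected for plain RAID1.
three_component_objects = sum(count for comps, count in histogram if comps == 3)

print(total_objects)                               # 11877
print(total_components)                            # 63873
print(round(total_components / total_objects, 2))  # 5.38
print(three_component_objects)                     # 297
```

So only 297 of the 11,877 objects have the 3-component layout, while the largest bucket (8,318 objects) has 5 components each.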



  • 9.  RE: High Component Usage

    Posted Jul 29, 2021 02:15 PM

    ok so I guess this is not "RAID 1"

    LordofVxRail_1-1627568043758.png

    most objects have more than 3 components.....

     

     

     



  • 10.  RE: High Component Usage

    Posted Jul 29, 2021 02:37 PM

    This might be a better example: a 40.00 GB VM, "RAID 1", 5 components... which I assume is an indication of some issue with the vSAN policy?

     

    LordofVxRail_0-1627569365382.png

     



  • 11.  RE: High Component Usage

    Posted Jul 29, 2021 02:45 PM

    well, here is the issue I guess:

     

    hostFailuresToTolerate: 2

    hostFailuresToTolerate: 2
    [root@zzzzz:~] grep "hostFailuresToTolerate: 2" /tmp/objout123|wc -l
    11426
    [root@zzzzz:~]

     

    Even though VC shows FTT1 for the policy... oh well, I guess the best thing is to create a new FTT1 policy and apply it to all VMs?
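The grep above only counts one value. A rough sketch of tallying every hostFailuresToTolerate value in the dump at once, assuming the setting always appears in the `hostFailuresToTolerate: N` form shown above. The function name and the sample text are hypothetical; on the host you would read /tmp/objout123 instead.

```python
import re
from collections import Counter

def tally_ftt(text):
    """Count how many lines carry each hostFailuresToTolerate value."""
    values = re.findall(r"hostFailuresToTolerate:\s*(\d+)", text)
    return Counter(int(v) for v in values)

# Hypothetical excerpt standing in for the real dump. On the host:
#   with open("/tmp/objout123") as f:
#       counts = tally_ftt(f.read())
sample = """
hostFailuresToTolerate: 2
hostFailuresToTolerate: 1
hostFailuresToTolerate: 2
"""

counts = tally_ftt(sample)
print(sorted(counts.items()))  # [(1, 1), (2, 2)]
```

These are line counts, not object counts, so if the setting is printed more than once per object the numbers will be inflated accordingly.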



  • 12.  RE: High Component Usage

    Posted Jul 29, 2021 03:40 PM

    If that's the Storage Policy you want applied to them then yes.

     

    Hold up - re-read your Storage Policy description - that is FTT=2; change it to FTT=1 if that is what you intended to do here.



  • 13.  RE: High Component Usage

    Posted Jul 29, 2021 07:54 PM

    yep, that's what I'm hoping to achieve

     

    A bit of history on this cluster: I originally set FTT2 (over a year ago), then reverted to FTT1 about 6 months ago. As you can see from the VC screenshots, the SPBM looks OK, but at the lower level FTT is still 2... weird.

     

    I'll go ahead and make a new FTT1 policy and apply it across the estate and cross my fingers & toes.



  • 14.  RE: High Component Usage

    Posted Jul 29, 2021 08:02 PM

    thanks for all your suggestions so far, it's appreciated. 

     



  • 15.  RE: High Component Usage

    Posted Jul 29, 2021 09:44 PM

    Just to clarify what seems to be misunderstood here: there are two aspects at play, FTM (Failure Tolerance Method, e.g. RAID1/RAID5/RAID6) and FTT (Failures To Tolerate). If you assign an FTT=2,FTM=RAID1 Storage Policy (as you have here), this basically says: store 3 replicas of the data, plus 2 Witness components for quorum, since an odd number of total components is needed when each component has a single vote. Thus the compliance view of the policy is indeed correct.
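The replica and witness counts described above can be written as a small formula. This is a sketch that assumes the usual minimum layout for an unstriped RAID1 object (FTT+1 replicas and FTT witnesses); the function name is made up, and real objects can end up with a different witness count depending on placement.

```python
def raid1_min_components(ftt):
    """Return (replicas, witnesses, total) for an unstriped RAID1 object.

    Assumption: ftt + 1 data replicas plus ftt witness components,
    which gives an odd total of 2 * ftt + 1.
    """
    replicas = ftt + 1
    witnesses = ftt
    return replicas, witnesses, replicas + witnesses

for ftt in (1, 2):
    print(ftt, raid1_min_components(ftt))
# 1 (2, 1, 3)
# 2 (3, 2, 5)
```

That matches the 3 components expected for FTT=1 and the 5 components seen on the FTT=2 objects in this thread.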

     

    From the output you shared though:

    +-------------------------------------+------------------------------+
    | Num Healthy Comps / Total Num Comps | Num objects with such status |
    +-------------------------------------+------------------------------+
    | 6/6 (OK)                            | 1543                         |
    | 5/5 (OK)                            | 8318                         |
    | 8/8 (OK)                            | 304                          |
    | 7/7 (OK)                            | 1297                         |
    | 4/4 (OK)                            | 113                          |
    | 3/3 (OK)                            | 297                          |
    | 12/12 (OK)                          | 2                            |
    | 75/75 (OK)                          | 1                            |
    | 36/36 (OK)                          | 2                            |
    +-------------------------------------+------------------------------+
    Total non-orphans: 11877

     

    It looks like a load of different policies are actually applied: e.g. a 6/6 Object might be FTT=2,FTM=RAID6 and a 5/5 Object FTT=2,FTM=RAID1, while anything with more components could just be auto-striped (due to size being >255GB per component) and thus use more components.
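To get a feel for how the 255GB split inflates the count, here is a rough estimate of the data components of a RAID1 object. It is only a sketch: the function name is hypothetical, it assumes each replica is simply cut into chunks of at most 255GB, and it ignores witness components and any stripe width set in the policy.

```python
import math

MAX_COMPONENT_GB = 255  # per-component size limit mentioned above

def raid1_data_components(size_gb, ftt):
    """Estimate data components: (ftt + 1) replicas, each split at 255GB."""
    chunks_per_replica = max(1, math.ceil(size_gb / MAX_COMPONENT_GB))
    return (ftt + 1) * chunks_per_replica

print(raid1_data_components(40, 2))   # 3
print(raid1_data_components(600, 2))  # 9
print(raid1_data_components(600, 1))  # 6
```

The 40GB case lines up with the 5-component VM from earlier in the thread (3 data components plus 2 witnesses under FTT=2), while the 600GB figure is a made-up size showing how a large disk multiplies the count.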