Thank you very much for the reply, I noticed you are very active in the VSAN forum!
>> "700k read IOPS, 180k write IOPS sustained" , "90k read ; 66k write IOPS"
> - Vendors spec-sheet for devices will generally have multiple strings attached to these stats including terms such as "up to"
> and/or tested on file-systems or test types that are not the same as functional-vSAN so take with a grain of salt.
Right, but enterprise-class disks quote steady-state performance figures, and those are real. The particular model I mentioned here is the Micron 9100 PCIe, which has been tested at these figures; I didn't consider it important to mention that before. I believe the filesystem has little to do with those performance figures, as the SSD does not understand OS filesystems, be it NTFS or VMFS.
Moreover, if NTFS performance is tested at 180,000 random 4kB IOPS (because this is what you can find on the internet - Windows 2012 performance tests), then VMFS or any other filesystem can't deliver only 50,000 IOPS - whether it is 155,000 or 175,000, honestly I don't care too much. BTW, these SSDs are often tested with virtual machines running on VMware and they achieve the specified numbers, so these figures really are achievable in a VMware environment. All in all, I have little reason to question the manufacturer's specifications.
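Just to show how I read those spec numbers, a quick back-of-the-envelope conversion of my own (the IOPS figures are the vendor's steady-state specs, the 4kB I/O size is my assumption from the test descriptions):

# Rough conversion of the vendor's steady-state spec to throughput.
# Assumption: 4 kB random I/O, IOPS figures from the Micron 9100 spec sheet.
read_iops = 700_000
write_iops = 180_000
io_size_kb = 4

read_mb_s = read_iops * io_size_kb / 1024    # ~2734 MB/s
write_mb_s = write_iops * io_size_kb / 1024  # ~703 MB/s
print(f"reads: {read_mb_s:.0f} MB/s, writes: {write_mb_s:.0f} MB/s")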
>> "capacity layer is almost full, for example with 9TB of data"
> - Best practice is to utilize ~70% to allow for headspace (as as much as possible should be thin to benefit from dedupe) vSAN starts moving data proactively between disks once they reach 81%
> (with default settings) - though this should be relatively balanced assuming R5 FTM and not too many huge Objects.
The 70% fill guideline matters more for spinning (magnetic) disks than for SSDs, because beyond that threshold magnetic disks start to deteriorate heavily in performance due to the use of inner tracks.
Yes, sure, the same 80%+ relocation principle applies to SSD capacity disks, but they are not as sensitive as magnetic disks from a performance perspective - and performance has always been the main reason for this 70% threshold and why migrations start after 80%.
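To put numbers on those thresholds, here is my own arithmetic, assuming the ~10TB capacity layer per host I describe further down:

# Free-capacity thresholds on a ~10 TB all-flash capacity layer per host.
# Assumptions: 10 TB raw per disk group, the ~70% slack-space guideline and
# the ~80% per-device rebalance trigger mentioned in the quoted reply.
capacity_tb = 10.0

usable_at_70pct = capacity_tb * 0.70    # 7.0 TB before I hit the guideline
rebalance_point = capacity_tb * 0.80    # 8.0 TB, where proactive rebalancing starts
print(f"70% guideline: {usable_at_70pct:.1f} TB, 80% rebalance point: {rebalance_point:.1f} TB")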
>> "after compression/dedupe 2.5 : 1"
> - Whether this is a feasible ratio depends on the data, its size and distribution and % space utilized.
Sure, it depends - I wrote "let's say". So let's pretend we achieved that compression/dedupe ratio :smileyhappy:
>> "Are data calculated, uncompressed, transmitted over network to new destination where they are compressed and stored again ?"
> - Data is deduped/compressed as it is written to disk and is deduped per disk-group, so no.
You are right that data are compressed/deduped as they are written from the cache layer to the capacity layer, and that this is disk-group specific - sure, this is written everywhere.
What I didn't find is a relevant source saying "you have X hosts with compression/dedupe, one of them fails, data are reconstructed mathematically like in classical RAID5 arrays, so they don't care about being compressed or not". No doubt VSAN was created by an extremely capable team; I just didn't find anybody confirming or denying what I wrote regarding compression/dedupe. There might be decompression necessary for some specific reason I have absolutely no clue about.
The reason I'm asking: in the scenario I made up (9TB of compressed data on each host, single disk group, 2.5:1 dedupe/compression ratio), it makes a huge difference whether 9TB or 22.5TB (2.5 x 9TB) has to be transferred over the same network. This is essentially about the time necessary to recover from a disk-group failure; yes, I understand it depends on other factors, but the amount of data is still an extremely significant one.
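A quick sketch of why I care, using only the numbers from my made-up scenario:

# Data that would travel over the network after a disk-group failure, depending
# on whether components can be rebuilt in their deduped/compressed form or have
# to be rehydrated first. Made-up scenario: 9 TB stored per disk group at 2.5:1.
stored_tb = 9.0
dedupe_ratio = 2.5

compressed_transfer = stored_tb                 # 9 TB if data moves as stored
rehydrated_transfer = stored_tb * dedupe_ratio  # 22.5 TB if it is decompressed first
print(f"as stored: {compressed_transfer} TB, rehydrated: {rehydrated_transfer} TB "
      f"({rehydrated_transfer / compressed_transfer:.1f}x more)")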
> Regarding resync:
> It depends on what the cause of failure and how fast this is resolved e.g.:
> - Physically faulted cache-tier, capacity-tier means new DG (Disk-Group) created after replacement
> - Controller/disk driver/firmware or other hardware/power/networking issue and disk-group comes back intact then *should* only require a partial delta resync but how much depends on time and the rate of data-change (+ as their are only four available DGs for components it can't rebuild until it gets all four available).
Right. I was curious about the degraded state (= fatal failure) with a full component rebuild.
Partial delta resyncs in the case of an absent state are not my concern, because the amount of data to synchronize is far smaller compared to a full resync.
>> I have seen resyncs in similar configurations go at 1TB+ an hour but I don't keep track of specifics such as higher stripe-width or drive-type/quality
>> that might improve this, multiple available nodes and DGs (and controllers if applicable) per node is definitely preferable if possible.
Phenomenal info, thank you very much. 1TB/hour is my dream, because it means a full component resync overnight even with this huge capacity per host.
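A rough check against that figure with my own scenario numbers (not a claim about actual VSAN resync speed, just the arithmetic behind "overnight"):

# Resync-duration check against the ~1 TB/hour rate mentioned above.
# These are my scenario's data volumes, not measured vSAN behaviour.
resync_rate_tb_per_h = 1.0

hours_as_stored = 9.0 / resync_rate_tb_per_h    # ~9 hours  -> fits an overnight window
hours_rehydrated = 22.5 / resync_rate_tb_per_h  # ~22.5 hours -> clearly does not
print(hours_as_stored, hours_rehydrated)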
>> "Would you recommend against 10TB disk groups ?"
> - I have seen larger with less disks and smaller cache that I wouldn't advise, yours seems reasonable enough and NVMe-cache should help (*maybe* capacity drives
> a smidge bigger than ideal but again this depends on the data - if a lot of larger Objects/components than 2TB may be beneficial over anything smaller).
Only 600GB will be used from the cache layer, right. It seems I can't create more than one disk group per host, because I'm going to use 1U rack servers with 10 SAS bays and three PCI-Express slots, of which only a single one is free. At the same time, I see little point in SAS SSDs for the cache and capacity layers, because:
- they are hooked to the same disk controller, and I only have one controller in each host with no possibility to extend
- limited performance compared to PCIe NVMe devices
- I'm not going to hide five or six flash capacity devices with 66k write IOPS each behind a single cache device with 70k write IOPS (see the quick sum after this list)
- questionable performance of the only disk controller I will have in the server, especially when it has to de-stage data from SAS cache to SAS capacity - queues, latencies, etc.
- questionable performance of a single SAS cache disk during destage (concurrent read and write operations, no longer "write only")
- usage of scarce 2.5" slots which I'd rather dedicate to capacity drives.
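The quick sum behind the cache-bottleneck point above, using the vendor steady-state figures I quoted (my own arithmetic, not a vSAN sizing rule):

# Why a single SAS cache device looks like a bottleneck in front of SAS capacity.
# Assumptions: 5-6 capacity SSDs at ~66k steady-state write IOPS each vs. one
# SAS cache SSD at ~70k write IOPS.
capacity_ssd_write_iops = 66_000
cache_ssd_write_iops = 70_000

for n in (5, 6):
    aggregate = n * capacity_ssd_write_iops
    print(f"{n} capacity SSDs: {aggregate:,} write IOPS behind a "
          f"{cache_ssd_write_iops:,} IOPS cache device")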
I understand the design implications of one DG versus two DGs, such as a bigger failure domain, better performance, more data to rebuild in case of failure, etc. This is the specific reason I'm asking about a 10TB all-flash disk group, as that is a little too much for my taste - in case of failure, that is a helluva lot of data to reconstruct!! From a performance perspective, I'm replacing two SAS cache drives with a single PCIe NVMe device with even better performance figures (two SAS drives are not going to provide 700k read IOPS and 180k write IOPS combined). I'm also playing the economy game here, so I don't have a free hand in choosing drives.
Someone might ask, "why do you want 10TB per host, why don't you do 5TB per host and twice as many hosts?" I have to scale up because of costs - VSAN licensing is going to kill the economics. In my case, I'm going to pay more for VSAN licenses than for the hardware itself, and yes, I'm talking about 60TB of capacity SSDs in total plus another 7TB in NVMe drives!! Six hosts, about 10TB of capacity each.
Every single big 10TB host costs more to license than to equip with 10TB of SSDs, pheww. Every small 5TB host would cost about twice as much to license as to equip with SSDs. Unfortunately we are budget constrained, so I can't go the scale-out way.
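To make the scale-up vs. scale-out argument concrete, here is the shape of the comparison I did. The per-host license cost and per-TB SSD cost below are placeholders I made up for illustration, not real prices:

# Purely illustrative scale-up vs. scale-out comparison for the same 60 TB total.
# license_per_host and ssd_cost_per_tb are placeholder relative costs, NOT quotes.
license_per_host = 1.0    # relative cost of licensing one host
ssd_cost_per_tb = 0.08    # relative cost of 1 TB of flash capacity

def cluster_cost(hosts, tb_per_host):
    return hosts * license_per_host + hosts * tb_per_host * ssd_cost_per_tb

scale_up = cluster_cost(6, 10)    # six big 10 TB hosts
scale_out = cluster_cost(12, 5)   # twelve small 5 TB hosts, same total capacity
print(scale_up, scale_out)        # SSD spend is identical, license spend doubles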
Additional questions, if I may: a 10TB flash capacity layer with only 600GB of cache (a 1.2TB PCIe NVMe drive, but only 600GB used). Yay or nay? It's not the recommended 10% - will it be a real problem in an all-flash environment? Reads always go directly from the capacity layer. Writes... no more than 600GB will be used regardless of capacity, so... so what?
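For reference, the ratio I end up with, compared against the often-quoted 10% cache sizing guideline (my own arithmetic only):

# Cache-to-capacity ratio in my design vs. the ~10% guideline mentioned above.
# 1.2 TB NVMe cache device, of which at most 600 GB is used as write buffer.
usable_cache_gb = 600
capacity_gb = 10_000

ratio_pct = usable_cache_gb / capacity_gb * 100   # 6%
print(f"{ratio_pct:.0f}% cache-to-capacity vs. the 10% guideline")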