VMware vSphere


ESX4 + Nehalem Host + vMMU = Broken TPS !

kohlerma - Jul 27, 2009 03:26 PM

  • 1.  ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 24, 2009 11:34 AM

    Since upgrading our 2-host lab environment from 3.5 to 4.0, we are seeing poor Transparent Page Sharing (TPS) performance on our new Nehalem-based HP ML350 G6 host.

    Host A : ML350 G6 - 1 x Intel E5504, 18GB RAM

    Host B : Whitebox - 2 x Intel 5130, 8GB RAM

    Under ESX 3.5, TPS worked correctly on both hosts, but on ESX 4.0 only the older Intel 5130-based host appears to be able to scavenge inactive memory from the VMs.

    To test this out, I created a new VM from an existing Win2k3 system disk (just to ensure it wasn't an old option in the .vmx file causing the issue). The VM was configured as hardware version 7 and was installed with the latest Tools from the 4.0 release.

    During the test the VM was idle, reporting only 156MB of its 768MB as in use. The VM was vMotioned between the two hosts, and as can be seen from the attached performance graph, there is a very big difference in active memory usage.

    I've also come across an article by Duncan Epping at yellow-bricks.com that may point the cause as being vMMU...

    MMU article

    If vMMU is turned off in the VM settings and the VM restarted then TPS operates as expected on both hosts. (See second image)

    So if it comes down to choosing between the two, would you choose TPS over vMMU or vice versa?



  • 2.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted May 24, 2009 11:40 AM

    Well, it's not broken. When memory is scarce, ESX will apparently start breaking the large pages into small pages, which will then be TPS'ed after a while. It's not only Nehalem, by the way; AMD RVI has the same side effect. I've already raised this internally and the developers are looking into it.

    Duncan

    VMware Communities User Moderator | VCP | VCDX

    If you find this information useful, please award points for "correct" or "helpful".



  • 3.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 24, 2009 12:28 PM

    Duncan,

    Thanks for the response, and I'll bow to your experience as to whether TPS is still functional in the presence of vMMU, but I'd argue that from a user's perspective something certainly appears broken...

    What led me to investigate this was that I have a number of VMs currently alarming at 95% memory usage; however, on investigation within the VM itself, Windows is reporting:

    Physical Mem = 1024MB

    In Use = 471MB

    Available = 525MB

    Sys Cache = 630MB

    Which can in no way be construed as memory starved.

    Regards,

    Iain



  • 4.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted May 24, 2009 07:50 PM

    I know; as far as I know, it's something that's being investigated... It seems vCenter reports this info incorrectly. I will contact the developers again, and I'd like to ask you to call support. Have them escalate this to development.

    Duncan

    VMware Communities User Moderator | VCP | VCDX

    If you find this information useful, please award points for "correct" or "helpful".



  • 5.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 24, 2009 08:31 PM

    Thanks Duncan, I've raised SR#1303220991 and referenced this thread.

    Regards,

    Iain



  • 6.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 24, 2009 08:43 PM

    I just came across exactly the same behaviour on a Dell R710 equipped with two Xeon 5520 quad-cores and 36GB of memory. I created some new W2003 machines, all showing 95-98% guest memory usage in the vSphere Client, but inside the guest everything looks pretty normal. AND (and now it gets bad) when comparing the subjective speed inside the guest, it is much slower than the same machine on my previous ESX 3.5u4.

    This is not good.

    regards,

    Joerg



  • 7.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted May 24, 2009 09:05 PM

    Keep me posted!

    Duncan

    VMware Communities User Moderator | VCP | VCDX

    If you find this information useful, please award points for "correct" or "helpful".



  • 8.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 02:43 AM

    Transparent page sharing works only for small pages (we are investigating an efficient way to implement it for large pages). On EPT/NPT-capable systems, using large pages offers better MMU performance, so ESX takes advantage of large pages transparently. It is possible that you are not getting the same level of TPS benefits on EPT/NPT systems for this reason.



  • 9.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 02:48 AM

    If you want, you can also try disabling the use of large pages:

    Go to the Advanced Settings dialog box and choose Mem.

    Set Mem.AllocGuestLargePage to 0.

    This should improve TPS.
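
    On classic ESX you can also make this change from the service console; a minimal sketch, assuming the standard esxcfg-advcfg tool (verify the option path on your build):

    # show the current value (ESX 4.0 defaults to 1 = large pages enabled)
    esxcfg-advcfg -g /Mem/AllocGuestLargePage

    # disable large-page backing of guest memory
    esxcfg-advcfg -s 0 /Mem/AllocGuestLargePage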



  • 10.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 07:52 AM

    Hmmm, I must strongly disagree. "The same level of benefits" is highly understated. Let me tell you this one: yesterday I tried to start the VMware Tools installer on a freshly installed W2003 with 1GB and 1 CPU, out of the box. I did it simultaneously on an ESX 3.5 (Dell R710) and on an ESX 4 (Dell R710), EXACTLY the same machine. The ESX 3.5 machine did it in 23 seconds; the ESX 4 machine did it in 320 seconds. Does that sound good? I repeat: this is NOT good. This has to be fixed ASAP.

    best,

    Joerg



  • 11.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 08:48 AM

    Joerg,

    I'm not experiencing the same performance hit that you are seeing.

    Performance has been good, and the TPS problem has been the only issue so far that would stop me pushing ESX4 onto our production environment.

    Regards,

    Iain



  • 12.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 08:35 AM

    Thanks, I can confirm that after setting Mem.AllocGuestLargePage to 0 and vMotioning the VMs off then back onto the Nehalem host, TPS is again operating, with active memory now down to less than 20% for all VMs.

    Can you confirm whether the above setting still uses hardware assist for the MMU but with the smaller (TPS-friendly) page size, or does it have the effect of turning off hardware-assisted MMU?



  • 13.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 09:22 AM

    Hi there

    Perhaps I am a bit slow, but I do not understand the full extent of the problem.

    1) Are all VMs (32-bit, 64-bit, Windows, Linux) affected by this vMMU bug on Nehalem hardware?

    2) Currently it seems that all VMs have vMMU set to "Automatic".

    • When I move to our Nehalem infrastructure should I be setting vMMU to "forbid" and then rebooting?

    • Or, is the better solution to set Mem.AllocGuestLargePage to 0 (and rebooting) as kichaonline suggested?



  • 14.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 09:28 AM

    I can confirm all VM types are affected.

    As to what the best solution is... I'll leave that to the more knowledgeable members of this community.

    Regards,

    Iain



  • 15.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 01:26 PM

    Hello everyone,

    I have successfully managed to upgrade our little farm to vSphere 4. The problem is that since then, all my guests have started to get little red exclamation points. Memory usage goes up to 2GB and just stays there no matter what, even if the reported usage in the guest is 50%. I am used to guest memory usage going to the max, but then it usually went down; now it just seems stuck.

    • I have upgraded vmware tools on the guests

    • I have upgraded the vmware virtual hardware

    • I have forced the CPU/MMU Virtualization to use Intel VT

    Guest is Win2K8 with 2GB of ram and 2vCPU.

    I am not sure if this is the same issue you guys have, but it seems kind of weird to me, since I didn't have this behavior before the upgrade.

    Cheers !



  • 16.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 01:35 PM

    neyz,

    Yes, it sounds as though you are hitting the same problem. (As a matter of interest, what hardware is your host?)

    You should be able to solve the problem by using the workaround provided earlier...

    On your hosts Advanced Settings set Mem.AllocGuestLargePage to 0 and either restart the VM or migrate the VM off then back onto the host.

    Regards,

    Iain



  • 17.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 25, 2009 02:15 PM

    Hello,

    Modifying this option seems to have resolved the problem. Does anyone know whether it affects the performance of the VMs?

    We are running on new Dell R610s with 32GB of RAM and 5500s inside.

    Thanks !



  • 18.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted May 25, 2009 08:40 PM

    Yes, it will have an impact on performance. One of the reasons there's a performance increase with virtualized MMU is large pages.

    Duncan

    VMware Communities User Moderator | VCP | VCDX

    If you find this information useful, please award points for "correct" or "helpful".



  • 19.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 26, 2009 02:42 AM

    Disabling large pages for the guest does not disable hardware MMU. It can, however, have a performance impact if the workload has a lot of TLB misses (which usually happens when the workload accesses a large number of memory pages and the TLB cache is not sufficient to hold all those memory references). The additional nesting of page table pages with hardware MMU makes a TLB miss more expensive (this is by hardware design and is not induced by software). Using large pages reduces TLB miss counts, and that's why ESX transparently tries to use large pages for VMs on EPT/NPT hardware. This is an optimization ESX makes to maximize performance when using hardware MMU.

    The only problem is that when large pages are used, TPS needs to find identical 2M chunks (as compared to 4K chunks when small pages are used), and the likelihood of finding a match is much lower (unless the guest writes all zeroes to a 2M chunk), so ESX does not attempt to collapse large pages. That's why memory savings due to TPS go down when all the guest pages are mapped by large pages by the hypervisor. Internally we are investigating better ways to address this issue.

    So this is not a bug; it's a tradeoff between better memory savings and slightly better performance. Also, this issue is not new to ESX 4.0; it happens with ESX 3.5 as well if you have an NPT-capable (AMD) processor. On Intel EPT-capable processors you will only see this issue with ESX 4.0, because ESX 3.5 does not use hardware MMU on Intel EPT processors (it only supports NPT).

    For more information on large pages and their impact on NPT, see http://www.vmware.com/files/pdf/large_pg_performance.pdf. The artifact that you are noticing is also described in Appendix B. The option that I provided disables large pages for all guests and is also documented in the whitepaper. I will find out if there is a per-VM config option to selectively enable large pages for workloads that might benefit from them.
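
    To put a rough number on that likelihood (back-of-the-envelope arithmetic, not from the post above): a 2M large page spans 2048KB / 4KB = 512 small pages, so collapsing it requires finding another 2M region that matches in all 512 positions at once. If a given 4K page finds a match with probability p, a whole large page matches with probability roughly p^512; even p = 0.9 gives 0.9^512 ≈ 4 x 10^-24, which is why in practice only fully zeroed 2M chunks ever get shared.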



  • 20.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 26, 2009 05:52 AM

    Sorry for all the questions, but I'm afraid your response raised more queries...

    1) I had assumed hardware MMU replaced software MMU, but this may not be the case? You talk of nested page tables, which I assume means the s/w MMU is still running. If this is the case, is there any benefit in running hardware MMU with small pages?

    2) The "Large Page Performance" article talks of conguring the guest to use Large Pages, this isn't something I had done for any of our guests. Do the tools handle that now or is it still a required step?

    3) I have to disagree as to whether this is a bug or not. Large pages make 80% of our VMs alarm due to excess memory usage, and negate one of the main differentiators between ESX and Hyper-V.



  • 21.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !
    Best Answer

    Posted May 26, 2009 06:14 PM

    >>You talk of nesting page tables which I assume means s/w MMU is still running.

    Nope, nested paging is a hardware MMU feature. The page table structures are maintained by software, but the hardware (the MMU unit) does the page table walk to fetch the information from the page table structure (or fetches it from the TLB cache if it is already there). Software MMU does not use nested page tables; instead it uses shadow page tables, and the hardware directly walks the shadow page tables (so there is no additional cost for TLB misses).

    >>is there any benefit in running hardware MMU with small pages?

    Oh yes absolutely.

    >>The "Large Page Performance" article talks of conguring the guest to use Large Pages, this isn't something I had done for any of our guests. Do the tools handle that now or is it still a required step?

    There are different levels of large page support. Applications inside the guest can request large pages from the OS, and the OS can assign large pages if it has contiguous free memory. But OS-mapped large pages (i.e., physical pages) may or may not be backed by actual large pages (machine pages) in the hypervisor. For instance, the guest may think it has 2M chunks, but the hypervisor may use 4K chunks to map that 2M chunk to the guest; in this case multiple 4K pages have to be accessed by the hypervisor, so there is no performance benefit even though the guest uses large pages. In ESX 3.5 we introduced support for large pages in the hypervisor: with this support, whenever the guest maps large pages, we explicitly go and find machine large pages to back them. This helps performance, as demonstrated by the whitepaper.

    In addition to this, on NPT/EPT machines the hypervisor also opportunistically tries to back all guest pages (small or large) with large pages. For instance, even if the guest is not mapping large pages, contiguous 4K regions of the guest can be mapped by a single large page by the hypervisor; this helps performance (by reducing TLB misses). This is why you see large pages in use even though the guest is not explicitly requesting them.

    >>I have to disagree as to whether this is a bug or not. Large pages make 80% of our VMs alarm due to excess memory usage, and negate one of the main differences between ESX and Hyper-V.

    There are two issues here. 1) TPS not producing much memory savings when large pages are used - this is not a bug. It is a tradeoff choice that you have to make, and the workaround I suggested lets you choose which tradeoff you want on NPT/EPT hardware. 2) The VM memory alarm - this is a separate issue and is not dependent on page sharing. The VM memory usage alarm turns red whenever the guest's active memory usage goes high. Guest active memory is estimated through random statistical sampling, and the algorithm the hypervisor uses overestimates active memory when guest small pages are backed by large pages (since the active memory estimate is done with reference to machine pages) - this is a bug. For now you can simply ignore the alarm (since it is a false alarm); I was told we will be fixing it pretty soon. Note, however, that this will only fix the alarm; the memory usage of the VM will remain the same.
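
    A back-of-the-envelope count of why TLB misses cost more under nested paging (standard x86 arithmetic, not ESX-specific): with 4-level guest page tables and 4-level nested page tables, each reference in the guest walk itself requires a nested walk, so a worst-case TLB miss can take (4+1) x (4+1) - 1 = 24 memory references, versus 4 for a native or shadow-page-table walk. Backing guest memory with 2M pages removes one level from the walk, which is the cost ESX is offsetting by defaulting to large pages here.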



  • 22.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 26, 2009 08:54 PM

    Thank you for taking the time to produce such a detailed response, it has certainly helped my understanding of what is and isn't happening.

    Regards,

    Iain



  • 23.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted May 27, 2009 01:10 AM

    Thanks, Kichaonline, for the accurate information. A small correction -- we are currently investigating ways to fix the high memory usage issue as well. Regarding TPS, as noted earlier this should not lead to any performance degradation. When a 2M guest memory region is backed with a machine large page, the VMkernel installs page sharing hints for the 512 small (4K) pages in the region. If the system gets overcommitted at a later point, the machine large page will be broken into small pages, and the previously installed page sharing hints help to quickly share the broken-down small pages. So low TPS numbers when a system is undercommitted do not mean that we won't reap the benefits of TPS when the machine gets overcommitted. Thanks.



  • 24.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 28, 2009 01:15 AM

    We are seeing the same issue on our ESX servers that have been upgraded from 3.5 to 4.0. After the upgrade, most of our VMs are in alarm for excessive memory, but inside the guest OS there is plenty of free memory. For example, most of our Linux servers (CentOS x64) show 1.5GB free inside the guest, but the Infrastructure Client reports they are using a full 2.0GB. If I shut down/reboot a VM it will go down to 1.0-1.5GB of used memory, but it eventually creeps back up and never goes down.

    We are using quad-core Xeon 5500s with 48GB of RAM.

    If 3.5 couldn't do hardware MMU with these processors, would disabling hardware MMU give us the same performance we were seeing in 3.5? Or are we better off using hardware MMU and setting Mem.AllocGuestLargePage to 0?

    It sounds like this may be a separate issue from TPS; it seems as if ESX4 isn't reclaiming free memory that the guest isn't actively using.

    Rajesh, do you have any details regarding the high memory issue that you said is being investigated?



  • 25.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 28, 2009 01:59 AM

    If I'm not mistaken, the Intel Xeon 5500s are Nehalem processors, so yes, you will notice the high memory usage alarm in 4.0 and not in ESX 3.5. As I mentioned earlier, you can ignore this alarm as it is a false positive (it happens only when large pages are in use) and it will be fixed pretty soon.

    >>are we better off using hardware MMU and setting Mem.AllocGuestLargePage to 0?

    You should always use hardware MMU whenever possible. Setting Mem.AllocGuestLargePage to zero is a workaround to get instant TPS benefits, but as a side effect it will also fix the alarm problem (since large pages get disabled with this option).

    >>seems as if ESX4 isn't reclaiming free memory that the guest isn't actively using

    That is not correct. ESX reclaims unused (but previously allocated) memory through ballooning and through TPS. The problem is that when large pages are used, TPS doesn't kick in instantly, so you wouldn't get the instant gratification of noticing TPS memory savings. However, when your system is under memory overcommitment, the VMkernel memory scheduler will break large pages into small pages transparently and collapse them with other shareable pages - this feature, called "share before swap", is new to ESX 4.0. So you still get the same benefits from TPS, but only at the time of memory overcommitment.

    To summarize, when you use ESX 4.0 on Nehalem/Barcelona/Shanghai (EPT/NPT) systems:

    a) ignore the high memory usage alarm - this will be fixed pretty soon

    b) don't worry about TPS - it will kick in automatically when your system is under memory overcommitment



  • 26.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 28, 2009 02:03 AM

    Just one more item to the summary

    c) Use the Mem.AllocGuestLargePage=0 workaround only if you need instant TPS sharing benefits (you will, however, need to vMotion the VMs off and back onto the host for the change to take effect)

    I will also draft a KB article and publish it soon for the benefit of others.
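
    If you want to watch whether TPS is kicking in, a quick sketch using the service-console esxtop (assuming the standard memory-screen counters):

    # on the ESX service console (or resxtop from the remote CLI):
    esxtop
    # press 'm' for the memory screen: the PSHARE/MB line shows the
    # shared / common / saving figures for transparent page sharing,
    # and the "MEM overcommit avg" field in the header shows whether
    # the host is actually overcommitted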



  • 27.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 28, 2009 03:46 PM

    Thanks for the summary, it was very helpful.



  • 28.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jun 24, 2009 12:38 PM

    I've raised SR #1401115468 about this same issue: a Dell R710 showing high guest memory usage in both vCenter and when connected directly to the host with the vSphere Client. It seems to affect both 2003 and 2008 guests.

    Is there anything official from VMware out there on this yet?



  • 29.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jun 30, 2009 07:23 PM

    I'm seeing this same issue now as well. We have a start-up environment: one host is a DL380 G5 running ESX 3.5 U4, the other is a DL380 G6 with a 5540 running ESXi 4. I'm seeing the high active memory issue on the 2008 and 2003 servers running on the ESXi 4 host. I'm hoping to hear of a fix soon as well; we are about to start rolling out a large number of DL380 G6s.



  • 30.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 01, 2009 09:20 AM

    I have the same problem on my Dell PE1950 with Intel 5400-series CPUs.

    I have just updated from 3.5u4 to 4.0 and the memory usage is too high.



  • 31.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 01, 2009 02:52 PM

    What kind of processors are in those PE1950s? I thought the issue we're seeing was fairly certain to be specific to Nehalem hosts...



  • 32.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 01, 2009 03:29 PM

    I know the issue is for Nehalem hosts, but I'm not the only one who has this problem with a different processor (see the other posts).

    My processor is a quad-core Xeon E5410 2.33 GHz.



  • 33.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 01, 2009 05:14 PM

    OK. I haven't seen any other posts re: other processors, so I'll take your word for it. I thought this specific issue was isolated to the Nehalems...



  • 34.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 03, 2009 03:05 PM

    I am having this issue and I have AMD hosts. For me, changing the advanced setting and then vMotioning off and back did not fix the issue. The VM is still in alarm for memory usage (80+%) while the guest OS reports much lower usage (25%).



  • 35.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 07, 2009 03:04 PM

    I am seeing the same issue on quad-core Xeon processors. These VMs ran fine on ESX 3.5. Changing the advanced setting and vMotioning did not help for us either. A fix would be nice.



  • 36.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 12:53 PM

    This one is eating my nerves!!! A simple fix just for the incorrect guest memory reporting problem (it states 97% all the time) would be the least I expected for the first update interval. But it is NOT fixed!!!!!

    I just installed a fresh W2008 x64 with 4GB of vRAM onto a Dell R710 (Nehalem 5520); inside, the guest says it consumes 1.2GB, while the vSphere Client says MARK, ALARM and 97%.

    vmware, when will this be fixed?

    best,

    Joerg



  • 37.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 02:36 PM

    I am having the same problem in a non-production environment. I have three ESXi 4.0 hosts that were very recently on 3.5. The processor/machine types are listed below:

    1) Intel Xeon E5405 (Dell PowerEdge 2900)

    2) Intel Xeon E5410 (Dell PowerEdge 2950)

    3) AMD Opteron 2382 (Dell PowerEdge 2970)

    Interestingly, I have no memory problems with (1) or (2). However, I am observing a problem on (3), and ONLY with 64-bit guests. I have a couple of 32-bit guests that are at 5-13% guest memory utilization, but all of the 64-bit guests are at 81-97%. I should also mention that I attempted the advanced-settings workaround suggested in this thread, to no avail. I also tried toggling the MMU options on guests through the three available settings, again to no avail.

    Furthermore, I should make it clear that it's not just the guest memory that is being reported as high on these guests; it's the consumed host memory. On literally all of the 64-bit machines, the amount of consumed host memory is near the limit of what it could be.

    This is a pretty serious problem for us.



  • 38.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 03:12 PM

    The bad news - support and other reps are all implying that the intention is likely not to confront this issue until the first update interval (i.e., no stand-alone "patch").

    The (maybe) good news - support also said the first update interval is likely "sooner than you think". Granted, that's likely just to keep me quiet and patient, BUT, here's hoping... :-\

    This is actually causing a huge problem for us. This is our first 'major' virtualization initiative. While everything has been going extremely smoothly and actual performance is great, this bug is causing a lot of uncertainty and nervousness with management, who will not allow the virtualization of our "core" servers until it is corrected. It's reducing their faith in the product. I actually had one VP asking me yesterday about "that Microsoft product..."



  • 39.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 07:15 PM

    We are aware of this issue and it will be fixed in the first update.



  • 40.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 07:26 PM

    Great to hear! Any inside info on when we can expect to see this first update?

    Thank you!!!



  • 41.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 16, 2009 09:18 PM

    I really don't want to be barefaced - I REALLY don't! But this issue IS serious and should have been fixed in the GA release. Now you are telling us it will be fixed in the first update. When will this first update be? I am still waiting for this to be fixed before I put these Dell R710s into production! Please give us a date! This is more than important.

    best,

    Joerg



  • 42.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 17, 2009 07:49 PM

    Hoi,

    I had the same problems on two RX300 S5s with Nehalem Xeon X5550 CPUs. All 64-bit systems, whether Linux or Windows 2008, showed 90% guest memory. I solved it, and now it is all pretty good. First I set Mem.AllocGuestLargePage to 0, then I shut down my Win 2008 VM.

    After that I set the CPU/MMU virtualization from Automatic to Software in the options menu of the 64-bit VM.

    I started the VM and waited 5 minutes; all seemed good. The next day I shut down the VM again and set CPU/MMU virtualization back to Automatic.

    After starting and waiting 5 minutes, all seemed good; the guest memory went down to 2%.

    Then I shut down again -> set the mode from Automatic to Intel VT (EPT) and started the VM. (I believe that in Automatic mode, ESX picks the hardware virtualization anyway.)

    Guest memory OK, VM memory parameters OK.

    I did this with 4 Windows 2008 64-bit VMs, and the result was the same for all of them.

    Now it's all good.

    ESX-Version: ESX 4.0.0, 175625
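
    For reference, the per-VM "CPU/MMU Virtualization" selector being toggled above is stored in the VM's .vmx file. A sketch of the entries it is generally understood to map to - an assumption, not confirmed in this thread:

    monitor.virtual_exec = "automatic"
    monitor.virtual_mmu = "automatic"

    where virtual_exec accepts "automatic" or "hardware", and virtual_mmu accepts "automatic", "software" or "hardware".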



  • 43.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted Jul 17, 2009 08:06 PM

    Setting Mem.AllocGuestLargePage to 0 may reduce performance for some workloads, so please set this config option to 0 only after careful analysis of your VM's workload.

    If your ESX server's memory is undercommitted, then Nehalem and AMD Barcelona/Shanghai machines may show high "Guest Memory Usage". But it does not have any bad side effects, so you can choose to ignore the high guest memory usage.

    When an ESX server's memory is overcommitted, you are unlikely to see high "Guest Memory Usage", and the ESX server will reclaim memory from unimportant idle VMs so that it can satisfy the memory demand of active and important VMs.

    Thanks.



  • 44.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 17, 2009 08:24 PM

    Rajesh, there have been several reports of this issue on other types of processors.



  • 45.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted Jul 17, 2009 10:00 PM

    Charadeur,

    Even on other processors, you can choose to ignore the high "Guest Memory Usage" if your ESX server's (and cluster's) memory is undercommitted, because it should not have any effect on the performance of VMs.

    Thanks.



  • 46.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 28, 2009 02:50 PM

    Unfortunately, I am overcommitted on memory, and this is a real problem for us.



  • 47.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 04, 2009 01:14 AM

    Hey Guys -

    I am also experiencing the same problem on HP BL490s with Nehalem processors. I built 5 new Server 2008 64-bit VMs with 4GB of RAM each. All are acting as domain controllers, and each VM shows 3.6-3.9GB of guest RAM usage. When looking at Task Manager within the VM, actual RAM usage is a little over 1GB. Any update on when this will be fixed?



  • 48.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 04, 2009 01:20 AM

    Edited to remove out-of-office message.

    Message was edited by: joergriether



  • 49.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 04, 2009 01:20 AM

    Hello,

    I will be out of the office until 21/08 inclusive.

    For all your usual requests, you can contact Nicolas VASSILIADIS on +33892895312 or by email: nvassiliadis@blueacacia.com

    Have a good day,

    Alexandre NEY



  • 50.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 04, 2009 09:52 PM

    I've read through this thread, and I'm surprised to find this has been a known issue for as long as it has. I am seeing the same issues with my Nehalem-based Dell R710s, but I'm not sure I understand enough to make an educated decision as to whether I should set Mem.AllocGuestLargePage to 0 or not, because quite frankly a lot of the discussion is over my head. How will I know whether there is a real issue with a guest VM, or whether it's just a false positive? I am inclined to leave things as they are until there is an update, but we are about to push a large production SQL DB that is currently a physical machine into vSphere4. It is W2K3 Enterprise x64 and SQL x64 with 16GB of memory. It is bothersome to see it reporting 98% guest memory when there is not a single user logged in!



  • 51.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 04, 2009 09:56 PM

    Time to break out the old-school Windows methods of monitoring your server performance!

    Task Manager, Perfmon, etc...

    And yes, we're all (impatiently) waiting for this to be fixed...



  • 52.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 07:06 AM

    I'm seeing this same issue. We have Primergy RX300 S5 server with Intel Xeon E5540 CPUs.

    I hope the problem will be fixed soon.



  • 53.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 12:46 PM

    My advice: I would not roll out my critical SQL server if I were overcommitted on ESX memory.



  • 54.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 02:16 PM

    Oh, I wish I had found this thread before migrating half my VMs to new hosts.

    SNMP is going nuts on me and I am in a real tight spot.

    I am not permitted to disable the alerts, and I cannot distinguish between a VM that's crying wolf and a VM that is in real trouble.

    My boss is blaming me for not knowing about this before migrating half of my guest machines to the new Nehalem-powered 380 G6 HP servers.

    So do I migrate the VMs back? I will have to remove Tools and reinstall the previous version, and then do it all AGAIN when this supposed fix comes out.

    I can see it now: my new host servers will sit doing nothing (except consuming power) till November. What a colossal waste of time.



  • 55.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 02:26 PM

    I wish I had something more positive to share - but multiple support techs have all implied that we're not looking at a fix until around November at the earliest. I'd love to believe threads like this would expedite things, but let's look at it from their perspective...

    A) Release an update prematurely, before full testing cycles, and risk what they define as an 'annoyance' growing into an actual functionality 'problem'.

    B) Wait the full testing cycle and do it "right".

    I can understand that side of things, but it really doesn't help those of us who find ourselves in a difficult place because of this. My boss would likely have my head (my job?) if he knew that right now, on our brand new ESX4 servers, there is no alerting enabled because of this problem. I'm very anxious to get alerts turned on and fully functioning, but I just can't because of this.

    You likely need to migrate those VMs back if disabling the alerts is not an option. Yes - you're sharing in the nightmare many of the rest of us are...

    My boss wants me to look into a potential VMware ACE/View deployment (replacing our entire desktop infrastructure with VMware), but I'm extremely reluctant because of all of this... I'm such a huge VMware fanboy, but this is truly doing some damage to my confidence...



  • 56.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 02:35 PM

    That's where VMware is making the mistake: classifying this as an 'annoyance'. Not being able to monitor your virtual infrastructure is more than an annoyance; it's a serious problem for several customers.

    With how aggressive competition has become in the virtualization market, VMware can't afford not to listen to customer demands... especially with their brand new flagship product -- it's really tarnishing an otherwise great release.

    Wake up VMware and listen to your customers -- we want an out-of-cycle hotfix for this issue now!



  • 57.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 02:41 PM

    As I understand it (feel free to correct me or write a counter-rant if you know better; this is just my point of view), this issue is an unfortunate default setting rather than a bug, or, as someone stated earlier, a tradeoff decision you have to make between TPS and large pages.

    Yes, ESX4 has large pages enabled by default, with Mem.AllocGuestLargePage set to 1.

    With those large pages, the chance of finding duplicate pages is pretty much zero, hence you won't get any benefit from TPS.

    Now you should ask yourself: do you really need large pages, and does your application actually support them at all (hint: it most likely doesn't)? According to the whitepaper on large page performance, http://www.vmware.com/files/pdf/large_pg_performance.pdf, the benefits of large pages are marginal, and only conceivable if your application actually makes use of them (another hint: pretty much none do).

    I don't see why you shouldn't at least give setting Mem.AllocGuestLargePage to 0 a try. The setting is easily toggled and doesn't even require a reboot of the host (vMotion the VMs off and back, though). All it does is disable mapping your VMs' memory to large pages on the host, using the good ol' small pages just like ESX 3.x did by default.

    I have no idea why large pages were enabled by default to begin with, and I guess there won't be a real bugfix (unless that fix sets Mem.AllocGuestLargePage to 0 as the default setting), since you just can't share duplicate large pages that don't exist to begin with.



  • 58.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 02:51 PM

    We have set Mem.AllocGuestLargePage to 0 on some test hosts, and it does essentially make the issue 'go away'. While I'd prefer a true resolution, I agree with MKguy: unless we can find hard evidence that this creates a performance issue, it should be set on your affected hosts at least until a fix is released.



  • 59.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 04:02 PM

    >>Now you should ask yourself the question, do you really need large pages, does your application actually support it at all (hint: it most likely doesn't)? According to the whitepaper on large page performance, the benefits of large pages are marginal, and only conceivable if your application actually makes use of them (another hint: pretty much none do).

    Thanks for the link to the doc. So basically it is not enabled in Win2k3 x64 by default anyway (which I verified on my new SQL server VM), and it's not even clear that SQL supports it or has it enabled by default. So I don't see how it could hurt me to set Mem.AllocGuestLargePage to 0, because in essence the guest OS isn't making use of the feature anyway.



  • 60.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 04:59 PM

    Yes, thank you MKguy... I did give it a try; I made the change 2 days ago.

    My problem, as stated earlier, is the alerting and monitoring of my VMs and SNMP going nuts (or did I not make that clear?).

    They still show alerts in the status WHILE the guest memory has been showing OK since the change.

    So please tell me again that "this issue is rather an unfortunate default setting than a bug".

    Either way, my boss does not care; I am on the hook for it, even if fixing it means I have to go back to 3.5.

    Bottom line: in the interest of keeping my job... I get to work this weekend for free!

    Yay.



  • 61.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:08 PM

    Help me understand something. Rajesh said you can ignore it if your memory is not overcommitted. So if my memory IS overcommitted, can I apply the Mem.AllocGuestLargePage workaround, and as long as I don't need large pages it will be OK?

    Even if that is the case, I am also experiencing the same thing dephcon5 is.



  • 62.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:22 PM

    Forgive me, guys, but my father always taught me growing up not to treat the symptoms but to solve the problem...

    We're essentially talking about band-aids here, not a "fix".



  • 63.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:35 PM

    I agree with that, and it was my bad for calling it a fix. As you say, it is a band-aid. However, if all I have is a band-aid, maybe I can stop the bleeding until a real fix is forthcoming. VMware?

    Lucky for me, my boss is OK with turning the alerts off vs. reinstalling 3.5. And it does seem that setting Mem.AllocGuestLargePage to 0 normalizes the memory usage reporting, with the exception of the alerts. At first I thought it did not, but I just wasn't being patient enough.



  • 64.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:41 PM

    OK, on a long shot I restarted the vpxa service and the management service, and the vCenter alerts stopped AND the VMs are behaving much like they did on 3.5.

    To summarize:

    I have SEVEN DL380 G6 HP servers, each with two X5560 Intel Nehalem quad-cores and 72GB of RAM. My vCenter server is a slightly older G5 HP server with SQL 05 Enterprise (for whatever that is worth).

    My ESX 4 servers are new and were not upgraded from 3.5; however, all the VMs were migrated from my old 3.5 ESX host, and Tools was updated at the time of migration.

    I set Mem.AllocGuestLargePage to 0 and then migrated all the VMs off and back. (I waited 2 hours; no change.)

    I restarted the vCenter server services (no change). I then rebooted my vCenter server. (No change.)

    On a long shot I decided to restart the vpxa and mgmt services on each ESX host, and now my VMs are no longer showing in an alarm state.

    I'm not sure I would call this fixed, but... the performance of my VMs seems to be as good as I had expected from the upgrade in hardware, the alarm state seems to be correct now in that I am not flooded with SNMP, and vCenter looks normal.

    Everything else seems to be working OK, and when I intentionally flooded a test print server with traffic, it did generate an alarm, which at least tells me I am at a good baseline going forward.

    For me, the ticket was restarting the vpxa and mgmt services AFTER setting Mem.AllocGuestLargePage to 0.

    Your mileage may vary...



  • 65.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:46 PM

    Will I still be able to use this prescribed workaround if I don't have the ability to use vMotion (using direct-attached storage on only one host)?



  • 66.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:53 PM

    You will have to shut down the guests and restart the host, but it should work fine.



  • 67.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 09:59 PM

    Yes, SSH to your ESX boxen. Log in as root, or escalate to root with su -

    Then run service vmware-vpxa restart

    and service mgmt-vmware restart

    I did NOT restart any hosts or VMs to execute these commands.

    I think you could reboot the ESX boxes and accomplish the same thing, but I did not try that, so I cannot verify it.
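
    Putting the thread's whole workaround together, a sketch of the sequence posters here report working (classic ESX 4.0 service console assumed):

    # 1. disable large-page backing of guest memory
    esxcfg-advcfg -s 0 /Mem/AllocGuestLargePage
    # 2. vMotion each VM off and back (or power-cycle it) so the
    #    setting takes effect for that VM
    # 3. restart the management agents so vCenter re-reads the stats
    service vmware-vpxa restart
    service mgmt-vmware restart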



  • 68.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 06, 2009 01:07 PM

    Dephcon5, thanks, that works great. Even though this is a workaround, it does make me feel better that the VMs are acting normally. I am a little disappointed that VMware does not have the complete steps documented.



  • 69.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 06, 2009 03:22 PM

    Several people have, understandably, asked about when this issue will be fixed. We are on track to resolving the problem in Patch 2, which is expected in mid to late September.

    In the meantime, disabling large page usage as a temporary work-around is probably the best approach, but I would like to reiterate that this causes a measurable loss of performance. So once the patch becomes available, it is a good idea to go back and reenable large pages.

    Also a small clarification. Someone asked if the temporary work-around would be "free" (i.e., have no performance penalty) for Win2k3 x64, which doesn't enable large pages by default. While this may seem plausible, it is however not the case.

    When running a virtual machine, there are two levels of memory mapping in use: from guest linear to guest physical address, and from guest physical to machine address. Large pages provide benefits at each of these levels. A guest that doesn't enable large pages in the first level mapping will still get performance improvements from large pages if they can be used for the second level mapping. (And, unsurprisingly, large pages provide the biggest benefits when both mappings are done with large pages.)

    You can read more about this in the "Memory and MMU Virtualization" section of this document:

    http://www.vmware.com/resources/techresources/10036

    Thanks,

    Ole



  • 70.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 06, 2009 03:39 PM

    Thank you for your reply, Ole. The one thing that's been bothering me the most is the lack of input from VMware.

    I appreciate you chiming in...



  • 71.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 07, 2009 12:19 PM

    I wanted to throw one thing out there for the Server 2008 64-bit people. We found that if we allocated 4GB of RAM to a virtual Server 2008 64-bit server (this may apply to 32-bit also), the OS would cache data in physical memory based on what the OS "thought" you would use. We found a service from Microsoft to control the memory usage and were able to put a cap on it. Here is the link:

    After setting this service up on all our Server 2008 64-bit VMs and setting Mem.AllocGuestLargePage to 0, we are now noticing almost a 50% decrease in the amount of physical RAM used on the ESX servers. Before enabling the Microsoft Windows Dynamic Cache Service and the Mem.AllocGuestLargePage change, we were using about 110GB of RAM on our ESX cluster. After putting the fixes in and rebooting ESX hosts and VMs, memory usage is down below 50GB. Performance in the VMs has stayed about the same.

    Hopefully this helps some other people.

    Ben



  • 72.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 17, 2009 06:58 PM

    Does this mean only Nehalem hosts have this problem, as the thread title ("ESX4 + Nehalem Host + vMMU = Broken TPS !") suggests?

    So AMD is not affected by this bug?



  • 73.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 17, 2009 07:00 PM

    I have observed this problem in my environment on an AMD-based server (see my previous post).



  • 74.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 12:55 PM

    Is there any more news re: a date for a fix for this?



  • 75.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 12:59 PM

    VMware, now would be a good time to at least give us a rough estimate of a date for a fix. The current situation is intolerable.



  • 76.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 01:50 PM

    Honestly, this is ridiculous. I am seriously considering moving our virtualized environment over to Hyper-V or a different solution because of the non-existent response from VMware on this issue. The whole situation indicates to me that VMware does not care about supporting its product. Not to mention that I have run into countless other serious bugs since version 4 of ESX + vCenter.



  • 77.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 02:13 PM

    There are other VERY major things which are also unanswered, like that one and that one.

    No response from VMware at all, not even an apology or a rough date for a solution. Oh, for one of them there is at least a KB article with a REALLY poor "workaround-workaround", but no solution.



  • 78.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 02:17 PM

    Good morning. I understand your concerns, and I'll try to find out more specific information today. At this point, the only information I have is the rough date estimate that I posted on August 6.

    Sincerely,

    Ole



  • 79.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 02:24 PM

    Not to mention that the vCenter client is not supported at all on Windows 7, and the workaround doesn't work on RTM. VMware seems to be going rapidly downhill at a very inopportune time (considering all the increased competition in the virtualization space now).



  • 80.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 02:50 PM

    I have information from my system vendor (a VMware partner) that VMware will release a patch this month (September) that solves the problem.

    Regards Toby



  • 81.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 10:33 PM

    Not to mention that the vCenter client is not supported at all on Windows 7, and the workaround doesn't work on RTM.

    The workaround works for me. I am running the vCenter client on Win7 RTM.

    Edit: Oops, upon re-reading I see you mentioned the vCenter client. I double-checked, and I am running the VMware vSphere Client under Win7 RTM. Sorry about the mix-up.



  • 82.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 10, 2009 11:37 PM

    The problem will be fixed in Patch 02, which we currently expect to be available approximately September 30.

    Thanks,

    Ole



  • 83.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 14, 2009 10:08 PM

    I had experienced the same issues in our hosting environment after standing up an 8-host cluster using HP DL380 G6 systems with Intel X5560 Nehalem processors. Long story short, I set Mem.AllocGuestLargePage to 0, put each host into Maintenance Mode to force a vMotion of the running VMs, then took it back out of Maintenance Mode. Bingo, no more issues or alarms... It seems straightforward, and the tradeoff in terms of performance has gone unnoticed to this point. We run a slew of systems ranging from Win2003 to Win2008 (32- and 64-bit) with SQL, Exchange, Oracle, etc. No impact to customer-facing applications or query jobs. As a definite plus, I saw memory utilization on the hosts go from an average of 45% down to 20%, and VM guest memory utilization went from 95%+ in some cases down to 2%. I understand the benefits of hardware MMU and fully intend to use it where it makes sense, but as a whole the 4K page size does seem to handle the TPS function at the level most clients should expect, considering it's one of the main selling points of VMware ESX over the other virtualization suites out there.

    My big question is this: if you set Mem.AllocGuestLargePage on a host to 0, is there a way to granularly enable the 2M large page size for a VM independently? This would allow me to enable large memory pages for my heavier-hitting VMs without setting it across the board.

    Thanks in advance.



  • 84.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted Sep 14, 2009 10:32 PM

    Ryan,

    >>My big question is this: if you set Mem.AllocGuestLargePage on a host to 0, is there a way to granularly enable the 2M large page size for a VM independently? This would allow me to enable large memory pages for my heavier-hitting VMs without setting it across the board.

    Unfortunately, no. Once you disable large pages at the host level, you cannot turn on large pages at per-VM level.

    Thanks.



  • 85.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 14, 2009 10:45 PM

    Got it. Thanks for the quick reply... I understand that the 2M page size effectively breaks down into 4K pages on memory overcommit at the host level, but I won't see that benefit for a little while on the hosts I have running. We're set up for 30-host clusters with an average of 30-40 VMs across 72GB RAM configs. The hosts peak occasionally, but overcommit isn't standard practice for us. I'll wait for the patch to drop and test in our labs for the final verdict before I go back to large pages as the default. For now all seems quiet, and I can breathe easy knowing we haven't shot our VM:host ratio in the foot by going to ESX 4.0.



  • 86.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted Sep 15, 2009 12:08 AM

    There are two issues discussed in this thread.

    1) High guest memory usage while using large pages

    This issue will be fixed in Patch 02.

    2) Low (or zero) TPS numbers while using large pages

    Please check my earlier post in this thread. Low TPS numbers while host memory is undercommitted are totally fine, because TPS will automatically kick in if/when host memory gets overcommitted. However, we have made a few TPS improvements that will help to share zeroed pages in some special scenarios even while host memory is undercommitted. For example, the Windows OS zeroes all guest memory at power-on, and the improved TPS will share the guest pages that are zeroed by the guest OS boot code. Once the guest starts actually using the zeroed memory, ESX will back the guest memory with large pages to reduce TLB misses and improve performance. This TPS improvement will most likely ship in the first update.

    Thanks.
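
    A rough illustration of the zero-page case (my arithmetic, not VMware's): a freshly booted Windows VM with 4GB of RAM has 4GB / 4KB = 1,048,576 zeroed small pages; once TPS collapses them they all map to a single shared machine page, so nearly the full 4GB stays reclaimed until the guest actually touches its memory.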



  • 87.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 15, 2009 09:16 PM

    Does the "zeroed out" behavior also take place on Linux boot up as well? A quick Google search didn't yield any results.



  • 88.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 15, 2009 09:18 PM

    Linux does not zero memory eagerly during boot.

    Sincerely,

    Ole



  • 89.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 16, 2009 06:35 PM

    As a quick question: if EPT is large-page based, is it possible to align the guest operating system's large/huge pages to the VMM's pages? It seems like there would be a benefit in doing that, similar to aligning VMDKs to 64K blocks when attached to NAS/SAN.



  • 90.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 16, 2009 07:00 PM

    Yes, we preferentially use large pages to back guest

    memory where the guest is also using large pages. In

    other words: it's done :smileyhappy:

    Thanks,

    Ole



  • 91.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 21, 2009 05:58 PM

    I have to admit I'm confused about how the hypervisor would know the alignment of the memory pages in the guest operating system. For example, when I lock huge pages in Linux, I have to make a kernel parameter change in sysctl.conf defining how many huge pages (2MB pages) I should use. How does vSphere know which pages I have set to huge pages, since not all of the guest's memory can be set to huge pages?
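
    For concreteness, the Linux knob being described is the vm.nr_hugepages sysctl; a minimal example (values are illustrative):

    # /etc/sysctl.conf - reserve 512 x 2MB huge pages (1GB) at boot
    vm.nr_hugepages = 512

    # apply without rebooting:
    sysctl -p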



  • 92.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Broadcom Employee
    Posted Sep 22, 2009 11:28 PM

    Yes, it is difficult for the hypervisor to know which guest pages are being used as large pages inside the guest. vSphere uses various techniques to deduce this info, and the techniques used depend on whether software or hardware MMU (i.e., RVI or EPT) is in use.

    When vSphere is using software MMU:

    - The hypervisor reads guest page tables to set up shadow page tables, so it can infer whether the guest is mapping a memory region large or small.

    When vSphere is using hardware MMU:

    - The hypervisor normally does not need to read guest page tables. If host memory is undercommitted and there are plenty of free host large pages, vSphere backs guest memory with host large pages even when the guest is not mapping the memory region large. In some benchmarks, mapping all guest memory with host large pages shows much improved performance even when the guest is not using large pages. As discussed in this thread, backing all guest memory large has some side effects with respect to "guest memory usage" estimation and TPS. These will be fixed soon.

    - Sometimes it may not be possible to map all guest memory with large pages (e.g., in memory overcommit situations), so vSphere selectively reads guest page tables and uses sophisticated heuristics to determine which guest memory regions need to be mapped large.

    Note that large-page backing of guest memory may not be possible if host memory is fragmented. To avoid this, vSphere implements efficient memory defragmentation algorithms.

    Thanks.



  • 93.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 05, 2009 08:52 PM

    "For me, the ticket was the restarting of the vxpa and mgmt services AFTER the "set Mem.AllocGuestLargePage to 0"."

    I take it this was done from ssh. Forgive my ignorance but can that be done with guest running on the host or do they need to be vmotioned first?



  • 94.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 23, 2009 07:31 PM

    Why is there still no KB article on this issue? At least I could not find one searching on active memory usage or Nehalem processors. Why the secrecy over something that is affecting a large part of your install base? Thanks.



  • 95.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 23, 2009 07:35 PM

    I'm out of office from 17/9/2009 until Monday 28/9/2009.

    Urgent matters can be directed to our service desk on +45 89498599.



  • 96.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 25, 2009 12:19 PM

    The patch has been made available. I'm deploying as we speak.



  • 97.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 25, 2009 12:21 PM

    Where did you find the patch?



  • 98.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 25, 2009 12:29 PM

    Here's the kb article -



  • 99.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 25, 2009 12:43 PM

    I got it through Update Manager, but above is indeed the article. I can also tell you it is working as expected. The VMs started on the patched host (we have patched one so far) are nicely settling their active memory to expected values (in our case 25-40% instead of 98%).



  • 100.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 26, 2009 01:17 AM

    Just chiming in --

    I just finished updating my 3 hosts (using esxupdate) with all 3 of the patch bundles that have been released since June. All 3 are back up and running hosting VMs again. It only took a couple of minutes before the memory utilization on all VMs on patched hosts dropped from 92-98% to an average of 5-20% (perhaps I'm giving out too much memory!). I've still got a lot of "warnings" on my VMs -- but I likely just haven't waited long enough for them to clear. It appears the patch really does work.
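    For anyone doing the same from the service console, the esxupdate invocation I used was roughly this (the bundle file name below is illustrative - use the names from the KB article):

    ```
    # Install a downloaded patch bundle from the ESX service console
    # (bundle file name is illustrative)
    esxupdate --bundle=ESX400-200909001.zip update

    # Confirm the host's new build number afterwards
    vmware -v
    ```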

    Now if only they'll release the update to allow full guest USB pass-through, I'll be a happy camper...



  • 101.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 26, 2009 12:23 PM

    Alright --

    It's the next morning. All of my VMs are still reporting low memory utilization, as is now "normal", however I've still got memory alarms that won't clear themselves.

    I've attempted to restart the VC services, and went as far as rebooting the entire VirtualCenter server, but the alarms still won't go away. There's even one alarm on a host left over from last night citing "Battery on Controller 0 -- Charging". That happens every time I reboot the host, but it normally goes away after a few minutes. I've updated the hardware, and the actual battery no longer has an alarm on it, but there's still that yellow alert next to the host.

    Any ideas on how to clear all of these old alerts?



  • 102.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 26, 2009 04:20 PM

    I also had the problem with old alarms pending. Go to the top level in vCenter, click "Alarms" and then "Definitions". Edit one of the definitions (don't change anything) and then save it. After this, the old alarms were gone in my environment. Try it - good luck, Paul



  • 103.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 26, 2009 08:47 PM

    Thank you, Paul1!

    All alarms cleared themselves after I edited the alarm. This is now the first time I've seen my VM environment completely free of errors. It's great! How exciting....



  • 104.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Oct 20, 2009 08:32 AM

    @Paul1 thanks for posting this solution. I installed the patch from KB 1012514 and "changed" the VM memory alarm definition. Now everything seems to be normal again. :smileyhappy:



  • 105.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted May 20, 2010 12:27 PM

    So I hate to open old wounds, but here we go:

    I have some new Dell machines with Westmere processors (the same generation as Nehalem), and I am having the same memory reporting issues. When I check for patch 200909401, its compliance status shows "Obsoleted by host". My question is: should I force this patch onto my hosts, or am I missing something else?



  • 106.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 15, 2010 11:54 PM

    We experienced the same issue in our production environment, and we made the change through the advanced option, simply disabling large page support. Any idea on the ETA for that fix?






    Cheers,

    Chad King

    VCP-410 | Server+

    "If you find this post helpful in anyway please award points as necessary"



  • 107.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 21, 2010 12:27 PM

    With ESX 4 U2 and Dell Nehalem hosts, I still see this problem. I have changed Mem.AllocGuestLargePage to 0 in the advanced settings, and 24 hours later the memory is still jacked up. Is there another place to disable this for the time being?



  • 108.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 22, 2010 05:43 AM

    Keep in mind you have to vMotion your VMs off the host and back onto it... did you do that after making the changes?

    CJK



  • 109.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Sep 22, 2010 01:18 PM

    thanks



  • 110.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Aug 06, 2009 12:58 AM

    Following up on my previous post: we are currently testing the fix.

    We could not get the fix into the next patch, but if testing goes OK, it will be in the patch after that.



  • 111.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 20, 2009 07:50 PM

    Add another customer to the list. We have Dell R710s that are experiencing the issue too. VMware, please fix this soon!



  • 112.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jun 30, 2009 09:45 PM

    Just chiming in - I'm seeing the same problem on my Dell R710s. I've got an open support case for this with VMware. Hopefully we'll see a bug fix shortly, though this is my first "bug" experience with VMware, so I'm unsure how soon is realistic...



  • 113.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 27, 2009 03:26 PM

    Same issue on Dell M710 and M610.



  • 114.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 28, 2009 10:33 AM

    Me too, on 2x R610. VMware, when can we expect a fix?



  • 115.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Oct 13, 2010 11:31 AM

    I hate to drag up an old thread, but I'm still getting this error on a new IBM X3650M3. I've patched the server to the latest patch level of ESXi using Update Manager; I'm now on ESXi 4.0.0, build 228255. I'd imagined that this problem would have been fixed in the later versions of ESXi, but apparently not. I've obviously tried the "Mem.AllocGuestLargePage" change, to no avail - this is with full reboots of the host server, just to be on the safe side. (The commands I've been using are sketched below.)
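    Since ESXi has no local service console, I've been making the change remotely with the vSphere CLI; this is roughly it (syntax from my notes, so double-check it against your vCLI version):

    ```
    # From a vSphere CLI (vCLI) workstation -- ESXi has no local service console
    vicfg-advcfg --server my-esxi-host --username root -s 0 Mem.AllocGuestLargePage
    vicfg-advcfg --server my-esxi-host --username root -g Mem.AllocGuestLargePage   # verify it stuck
    ```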

    I've nothing else left to try that I know of - any suggestions from the gurus on here? :smileysad:

    Cheers



  • 116.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Oct 29, 2010 05:05 PM

    What about ESXi 4.1 and vMMU and TPS?

    Is it recommended to force vMMU on Terminal Servers?



  • 117.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jan 18, 2011 03:10 PM

    I think, based on what I've seen so far, that this problem still exists in ESXi 4.1.



  • 118.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jan 18, 2011 04:00 PM

    Agreed. We have several installations on ESXi 4 and 4.1 and see this issue.



  • 119.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jan 25, 2011 10:50 AM

    I can also confirm that turning off large page support on all hosts and then vMotioning the VMs does indeed kick TPS back into life.

    This quickly begins to save gigabytes of RAM on ESXi 4.1.
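    You can watch the savings accumulate in esxtop's memory view (the figures below are illustrative):

    ```
    # On the host (or via resxtop for ESXi), open the memory screen
    esxtop          # then press 'm' for the memory view

    # The PSHARE/MB line near the top reports TPS activity, e.g.:
    #   PSHARE/MB:  4230 shared,   984 common:  3246 saving
    ```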



  • 120.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jan 25, 2011 11:02 AM

    For those interested - there's a recent blog post which explains the position from VMware...

    http://blogs.vmware.com/uptime/2011/01/new-hardware-can-affect-tps.html

    Regards

    Mike



  • 121.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Feb 07, 2011 09:54 PM

    Hi guys,

    I am going through the same issue. However, the behaviour the blog describes - "If memory contention develops, the VMKernel will automatically switch to small pages and implement TPS in an effort to free up memory" - is working nicely for me, and it kicked in a few days after I set up the cluster. This may give some people some consolation, I thought. :smileyhappy:
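    If you're impatient, I believe the TPS scan rate itself is also tunable via advanced options (the defaults below are my understanding for ESX 4.x, so double-check on your build):

    ```
    # TPS scan-rate knobs (values noted are my understanding of ESX 4.x defaults)
    esxcfg-advcfg -g /Mem/ShareScanGHz    # pages scanned per second per GHz of host CPU (default 4)
    esxcfg-advcfg -g /Mem/ShareScanTime   # minutes to scan an entire VM's memory once (default 60)
    ```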



  • 122.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jun 10, 2011 03:20 AM

    If I enable the advanced setting, do I then need to vMotion all my VMs off the host so that it's empty before vMotioning them back, or can I just vMotion each VM out and back in again, one at a time?

    Just thinking about the best way to do it, because if I have to vMotion every VM off the affected host at once, I think I'm going to need to shut some down.

    Thanks.



  • 123.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Jul 13, 2011 09:58 AM

    I got this error last year when I upgraded from ESX 3.5 to 4.0; setting AllocGuestLargePage to 0 did the trick.

    Now that I'm upgrading to ESXi 4.1 U1, I'm getting high memory allocation with no guest memory usage.
    This time nothing seems to work...
    Setting the large page option to 0, vMotioning in and out - I'm still getting the same trouble.



  • 124.  RE: ESX4 + Nehalem Host + vMMU = Broken TPS !

    Posted Oct 13, 2011 08:09 PM

    I am on ESXi 4.1, and when I vMotion my VMs from a DL380 G5 to a G7, the memory consumption goes up and private guest memory hits 100%.

    When I vMotion back to the G5, TPS kicks in again. Yes, vMMU will leverage large pages, and so TPS will not "function" until there is memory contention on the host.

    What about this issue on vSphere 5? Can anyone test it?