I tried this MS condition variables example in a "what the heck" spirit, not expecting much difference:
http://msdn.microsoft.com/en-us/library/ms686903%28v=vs.85%29.aspx
I enclose a VS 2010 project. On 64-bit Win7 it seems to run at about 20 million consumer/producer wakes per second on a 2 GHz 2-core CPU, BOTH on physical hardware and on ESX 5 (I don't have system-level access to check the logs and tell whether ESX is using binary translation or VT).
Notice that in this test I am not trying to measure memory bandwidth, i.e. how fast I can transfer data bidirectionally between processes, but rather how fast I can wake the consumer. In my case the consumer is a real-time database that carries out a field read/write transaction at each wake in the db process. From my tests, CPU utilization in terms of transactions (i.e. client producer W, db consumer R, consumer W, producer R) is much higher with a controlled wake-up of the consumer db engine than with the consumer haphazardly polling the producer. The OS scheduler misinterprets the consumer's polling as useful "work", whereas it is just wasting memory bandwidth on unnecessary read cycles, polling a data-ready variable set by the client.
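For anyone who wants to reproduce the idea without the enclosed project, here is a minimal portable sketch of the controlled-wake round trip. It is NOT the MS example or my project: it uses std::condition_variable as a stand-in for WakeConditionVariable / SleepConditionVariableCS, the names PingPong, WakeResult and measure_wake_rate are made up for illustration, and the counter increment only stands in for the db transaction.

```cpp
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <thread>

// Hypothetical shared state for one producer and one consumer thread.
struct PingPong {
    std::mutex m;
    std::condition_variable request_cv;  // producer -> consumer
    std::condition_variable reply_cv;    // consumer -> producer
    bool request_ready = false;
    bool reply_ready = false;
    std::uint64_t transactions = 0;      // stand-in for db read/write work
};

struct WakeResult {
    std::uint64_t transactions;
    double wakes_per_second;
};

// Runs `rounds` producer W -> consumer R/W -> producer R round trips and
// reports how many consumer wakes per second were achieved.
WakeResult measure_wake_rate(std::uint64_t rounds) {
    PingPong pp;

    std::thread consumer([&] {
        for (std::uint64_t i = 0; i < rounds; ++i) {
            std::unique_lock<std::mutex> lock(pp.m);
            // Sleep until the producer signals; no polling of the flag.
            pp.request_cv.wait(lock, [&] { return pp.request_ready; });
            pp.request_ready = false;
            ++pp.transactions;           // assumed placeholder for the transaction
            pp.reply_ready = true;
            lock.unlock();
            pp.reply_cv.notify_one();
        }
    });

    const auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < rounds; ++i) {
        std::unique_lock<std::mutex> lock(pp.m);
        pp.request_ready = true;
        pp.request_cv.notify_one();      // wake the consumer
        pp.reply_cv.wait(lock, [&] { return pp.reply_ready; });
        pp.reply_ready = false;
    }
    const auto end = std::chrono::steady_clock::now();
    consumer.join();

    const double seconds = std::chrono::duration<double>(end - start).count();
    const double rate = seconds > 0.0 ? static_cast<double>(rounds) / seconds : 0.0;
    return WakeResult{pp.transactions, rate};
}
```

Calling measure_wake_rate(1000000) and printing wakes_per_second gives a number comparable in spirit to the rates I quote, but the absolute value will depend on the OS, the C++ runtime and the hypervisor, so don't expect it to match the Win7 figure.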
If this is true (?) and I haven't made any clumsy mistakes, it would really be too good to be true imho: it would seem to be doing something like 1 wake per 100 cycles (2 GHz / 20 million wakes per second). Possibly it is running with VT and FlexPriority, because that is very close to the 400 cycles mentioned. I am aware that modern CPUs don't work in terms of cycles per opcode but rather the other way around, i.e. many opcodes per cycle where pipelining allows, so at first glance this would seem to "blow your socks off". So the way to go definitely seems to be Win7, which was probably developed virtualisation-aware from the word "go": it's about 300x faster than Win XP 32-bit. Unfortunately WakeConditionVariable / SleepConditionVariableCS seem to work only between threads, not between processes. I could create a remote thread in the db process, but it would still have to be signaled from another process. I wonder if perhaps with the VMware VMCI socket API it's possible to get low-latency inter-process communication?
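The cycles-per-wake figure is just the clock rate divided by the wake rate. A tiny sketch of that arithmetic (the function name cycles_per_wake is made up, and it gives an average budget per wake, not a measured latency):

```cpp
// Average number of CPU clock cycles that elapse per wake,
// given the clock frequency in Hz and the measured wakes per second.
double cycles_per_wake(double clock_hz, double wakes_per_second) {
    return clock_hz / wakes_per_second;
}
```

With my numbers, cycles_per_wake(2e9, 20e6) gives 100. A rate 300x lower on the same 2 GHz clock would work out to roughly 30,000 cycles per wake.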
I also looked at using WM_COPYDATA, but it's really slow, just like normal sockets: about 50k signals/s.
There is also UMS (User-Mode Scheduling) in Win 7, but again it works only between threads, and I need inter-process scheduling.
I also tried calling into a kernel device driver with this MS example:
http://code.msdn.microsoft.com/windowshardware/Event-d245ecb4
with these results on win 7 64 bit :
event 180 k/s both physical and ESX
irp 180 k/s physical 110 k/s ESX
but I would need to make 2 blocking DeviceIoControl calls, from both the producer and the consumer, thus halving the rate and making it no faster than inter-process event/WaitForSingleObject at 100 k/s ... ?
I enclose a Visual Studio 2013 example, HardwareEventSample.rar, if anyone is interested in benchmarking it. You need to sign event.sys on Win7 or you won't be able to load the driver. Use "event 0 0" for the IRP method and "event 0 1" for the event method.
Another method is APCs (asynchronous procedure calls) with QueueUserAPC. I enclose a Visual Studio 2005 project that does:
on Win 7 64-bit, 500 k/s both physical and ESX, which is the highest rate of one-directional consumer wakes I could get on ESX so far (I have the producer polling for the reply in order not to incur inter-process waiting with APC)
on XP 32-bit it's only 80 k/s wakes, though, compared with 500 k/s on physical XP (I also tried setting consumer/producer affinity:
DWORD_PTR res1 = SetThreadAffinityMask( GetCurrentThread(), 1 ); // in the consumer thread: mask 1 = first core
DWORD_PTR res1 = SetThreadAffinityMask( GetCurrentThread(), 2 ); // in the producer thread: mask 2 = second core
but it didn't change the thread switching rate appreciably, possibly because Windows is not really real-time oriented)