Hi Everyone,
I have a complex problem that spans multiple layers of software and hardware and is a bit of a puzzle.
Background.
I have inherited from my predecessor a pair of HPE servers running ESXi as the physical hosts for our virtual machine environment. They are managed by a VMware vCenter Server Appliance (VCSA), which is itself a VM residing on one of the two physical hosts.
Both physical hosts have a pair of four-port 1GbE network interface cards (NICs). Each host has a Management network, a VM LAN network and a Backup network configured. We do not have Enterprise Plus licensing, so I believe Distributed Switches are not available to us.
On the second of the two physical hosts I have a Windows Server VM with Veeam Backup & Replication installed, plus a backup proxy on another VM on the same host. During backups we had two issues:
1. CPU and datastore usage on the second physical host was hitting maximum.
2. Backups of VMs on the first physical host were slow because there was no proxy on that host.
As a result, in collaboration with Veeam Support, I split and reconfigured the backup jobs so that no proxy tried to back itself up (which helped with the extreme duration of some backups), throttled the datastore to prevent high latency, and configured a proxy on the first physical host to handle the backup jobs for its own host.
All seemed fine until I noticed that internet traffic and other software on the new proxy VM on the first physical host were now using the Backup LAN, and DNS resolution of its host name was returning the wrong IP. This had caused a web server and a FlexLM licence server to bind to the wrong NIC and IP address.
Much research later, I discovered that all of the multihomed VMs had a similar problem: each had two IP addresses registered in DNS. I researched best practice, found our configuration wasn't compliant, and so I:
- removed the DNS servers and default gateway from the Backup LAN NIC on each VM,
- put in a static route to the gateway for the backup subnet, and
- finally, added hosts file entries (both FQDN and short name) on each of the proxy VMs to force each proxy's DNS name to resolve to its assigned Backup LAN IP address.
It is now a textbook multihomed configuration according to KBs from both Microsoft and Veeam (the commands are sketched below).
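For reference, this is roughly what those steps amounted to (the first two on every multihomed VM, the hosts entries on the proxies). It is only a sketch: all names and addresses are placeholders, with 10.20.0.0/24 standing in for the backup subnet and "Backup" for the NIC's connection name, not our real values:
```
:: Backup NIC: keep the static IP, clear the default gateway and DNS servers
netsh interface ipv4 set address name="Backup" static 10.20.0.11 255.255.255.0
netsh interface ipv4 set dnsservers name="Backup" source=static address=none

:: Persistent static route for the backup subnet via the backup gateway
route -p add 10.20.0.0 mask 255.255.255.0 10.20.0.1

:: Pin the Veeam server and proxy names to their Backup LAN addresses
echo 10.20.0.10  veeam01.example.local  veeam01>>%SystemRoot%\System32\drivers\etc\hosts
echo 10.20.0.11  proxy01.example.local  proxy01>>%SystemRoot%\System32\drivers\etc\hosts
```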
This solved the duplicate DNS entries pointing to separate IP addresses, but from Veeam's point of view it killed the backup proxy on the first physical host.
Low-level ping and pathping between the Veeam server and the proxy on the second physical host are fine and lightning fast, presumably because the traffic is handled internally by what I assume is the hidden 10GbE virtual switch between VMs in ESXi (I say "hidden" because in Hyper-V the equivalent switch is visible in the GUI). The problem seems to be with the proxy VM on the first physical host.
From there, there is a significant delay when resolving the host name of either the Veeam server or the proxy on the second physical host, but eventually the name does resolve to the IP on the backup network adapter. The traffic then claims (I am not certain it is true) to use the default route to the gateway on the backup network to reach the VM on the second physical host, and there is no packet loss.
The fly in the ointment is that it's not stable (disconnections from Veeam and configuration warnings about the ESXi hosts), and there is a significant delay in hop resolution when using pathping or tracert (examples below). So I dug deeper with Veeam's help and found we still have issues connecting to the proxy on the first physical host, and issues with DNS name resolution of the ESXi hosts themselves.
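The tests were along these lines, run from the Veeam server ("proxy01" is a placeholder for the proxy on the first host). The -n and -d switches helped separate per-hop reverse-DNS delay from real path latency:
```
ping proxy01          :: resolves, but only after a noticeable pause
pathping -n proxy01   :: -n suppresses reverse DNS lookups per hop
tracert -d proxy01    :: -d does the same for tracert
```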
Looking into the network configuration of the ESXi hosts, there is only one default gateway showing on the TCP/IP stacks, and it points to the management subnet.
There are three virtual switches:
- Management: two port groups and one vmkernel adapter with an IP on the management subnet, uplinked to a single physical NIC:
  - Management Network (vmkernel adapter)
  - Management LAN (two VMs attached)
- VM: two port groups and no vmkernel adapters, uplinked to a team of four physical NICs in a Cisco LAG at the switch:
  - VM LAN (no vmkernel)
  - VoIP LAN (no vmkernel)
- Backup: two port groups and a single vmkernel adapter with an IP on the backup subnet, uplinked to two physical NICs in a Cisco LAG at the switch:
  - Backup LAN (vmkernel adapter)
  - Backup (a single VM attached, no vmkernel)
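For anyone wanting to check my reading of this, the same layout can be pulled from each host over SSH with standard esxcli commands (nothing site-specific here):
```
esxcli network vswitch standard list   # vSwitches, their uplinks and port groups
esxcli network ip interface list       # vmkernel adapters and their port groups
esxcli network ip interface ipv4 get   # vmkernel IPv4 addresses
```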
What worries me is that there is only one default gateway for everything, and it is set on the Management vmkernel, pointing at the gateway of the management interface on the network switch.
In the default TCP/IP stack there is a routing table with three entries: the two IP subnets (management and backup), each with a gateway of 0.0.0.0 (i.e. directly connected), and a default network of 0.0.0.0 pointing to the default gateway on the management subnet.
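Rendered the way esxcli reports it (addresses are placeholders again: 10.10.0.0/24 standing in for management, 10.20.0.0/24 for backup), the table looks like this:
```
# esxcli network ip route ipv4 list
Network    Netmask        Gateway    Interface  Source
---------  -------------  ---------  ---------  ------
default    0.0.0.0        10.10.0.1  vmk0       MANUAL   <- the one default route
10.10.0.0  255.255.255.0  0.0.0.0    vmk0       MANUAL   <- management, on-link
10.20.0.0  255.255.255.0  0.0.0.0    vmk1       MANUAL   <- backup, on-link
```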
Questions
My limited knowledge of IP networking tells me that, with that route table, every packet leaving either IP subnet, whichever physical NIC it uses, has to be sent to that one default gateway IP.
I suspect this means that, despite separating the traffic into port groups and virtual switches inside ESXi, across two separate four-port physical NICs, we are actually sending all the traffic to a single gateway on the switch.
Or does the ESXi virtual switch forward based on the IP address, default gateway and/or static routes of the guest OS inside the VM? That is, do the headers of the IP packets sent by the guest determine the routing through the virtual switch? If so, what purpose does the route table in the default TCP/IP stack serve?
Can anyone supply an answer to this?
If it is the case that the default route inside the ESXi default TCP/IP stack takes precedence, can anyone show me how to set up the TCP/IP stacks correctly: which ESXi components (vmkernel adapters and standard virtual switches) I will need, and how to configure them? I have done a lot of reading of docs and KBs and am sadly none the wiser. The one thing I have found for certain is that custom TCP/IP stacks seem to require CLI access to the host; my best guess at those steps is sketched below. Also, please tell me what, if anything, I need to do on the physical Cisco switch the hosts are connected to. Ideally I need the management, VM and backup networks isolated onto their correct subnets and gateways while using the correct physical NICs.
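This is only a sketch from that reading, assuming a custom netstack is the right tool here. The stack name "backupStack", vmk1 and all addresses are placeholders, and I would appreciate someone confirming or correcting it before I run anything:
```
# 1. Create a custom TCP/IP stack for backup traffic
esxcli network ip netstack add -N "backupStack"

# 2. Recreate the backup vmkernel adapter on that stack (a vmkernel adapter
#    cannot be moved between stacks, so it has to be removed and re-added)
esxcli network ip interface remove -i vmk1
esxcli network ip interface add -i vmk1 -p "Backup LAN" -N "backupStack"
esxcli network ip interface ipv4 set -i vmk1 -I 10.20.0.5 -N 255.255.255.0 -t static

# 3. Give the new stack its own default gateway on the backup subnet
esxcli network ip route ipv4 add -N "backupStack" -n default -g 10.20.0.1
```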
I could of course be barking up the wrong tree entirely.
Looking forward to whatever advice or assistance I can get.
All the best
Nick