Thanks VMware! My previous post referenced VMware’s KB article about the inbox Broadcom driver with ESX 4.1 which can cause a PSOD.
The workaround was to use the vsish console utility to disable IP Checksum support. You can’t even make this change with PowerCLI, as vsish uses private APIs according to PowerCLI guru LucD.
Worse still, on further reading, this change doesn’t persist across reboots! Really, VMware, you can’t seriously be offering a solution to a major problem that stops working after a reboot.
Come on everyone…VMware, Broadcom, HP need to get together and sort this out!
Until then I wouldn’t recommend that anyone with Broadcom NICs deploy ESX 4.1, unless you can get clever and run the workaround as a script when an ESX host boots.
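For what it’s worth, one way to get clever on ESX 4.1 (classic ESX with a service console, not ESXi) is to append the vsish command to /etc/rc.local so it re-runs at every boot. The vsish node path, the vsish binary location and the vmnic names below are my assumptions from reading around the KB, not something I’ve lifted verbatim from it, so treat this as a sketch and verify it against the KB article before relying on it:

```shell
# /etc/rc.local fragment -- ESX 4.1 service console only (not ESXi).
# Re-applies the checksum-offload workaround at every boot, since
# vsish changes do not survive a reboot.
# ASSUMPTIONS: the vsish path below, the CAP_IP_CSUM node name and
# the vmnic list are illustrative -- check them against the KB.
for nic in vmnic0 vmnic1; do
    /usr/lib/vmware/bin/vsish -e set /net/pNics/$nic/hwCapabilities/CAP_IP_CSUM 0
done
```

Because rc.local runs late in the boot sequence, the mask is applied before VMs power on but after the driver has loaded, which is exactly the window you want.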
Citrix has just released some more information about its CPU masking technology for XenServer. Citrix calls it heterogeneous resource pools, a feature which requires a XenServer Enterprise or Platinum license. The technology is similar to VMware’s Enhanced VMotion Compatibility (EVC).
These features use capabilities built into the CPUs, either Intel’s FlexMigration or AMD’s Extended Migration, to apply a CPU mask so that a CPU appears to provide fewer features than it actually does. This allows pools or clusters of hosts with different CPUs (from the same vendor) to support live migrations.
This is extremely useful, as even CPUs with the same model number can have small differences which could cause XenMotion / VMotion to fail. Newer generation servers with faster CPUs and even additional cores can be added into existing pools / clusters without any downtime.
Generally you need to start with the hosts that have the lowest-capability CPUs and then add the newer ones; when a newer host joins the pool / cluster, a mask is applied that hides any CPU features not available on the original CPUs. This can be done with all VMs online, as the mask is applied to a new host before any VMs run on it, maintaining compatibility with the existing pool / cluster.
If you are adding older hosts into a pool / cluster, however, you need to amend the mask on the existing hosts, which means all VMs must be shut down first, as a guest VM cannot downgrade its CPU capabilities while running.
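The logic above boils down to set intersection: the pool’s effective feature set is the features common to every member host, and the mask hides the rest. Here is a toy sketch of that idea in plain shell (the host names and CPU flag lists are invented for the example, this is not XenServer code):

```shell
# Toy illustration of CPU masking: a pool's effective feature set is
# the intersection of all member hosts' CPU flags.
# Host flag lists are made up for the example.
old_host="fpu vme sse sse2 sse3 ssse3 sse4_1"
new_host="fpu vme sse sse2 sse3 ssse3 sse4_1 sse4_2 aes"

pool=""
for flag in $old_host; do
    # keep a flag only if the other host has it too
    case " $new_host " in *" $flag "*) pool="$pool $flag" ;; esac
done
pool=${pool# }   # trim leading space

echo "Masked pool feature set: $pool"
# The newer host's sse4_2 and aes are hidden by the mask, so VMs can
# migrate between the two hosts without seeing a capability change.
```

This also shows why adding an older host is disruptive: it shrinks the intersection, so the existing hosts’ masks must change underneath running VMs, which isn’t allowed.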
Citrix has also helpfully released a Heterogeneous CPU Pool Self-Test Kit so you can check your CPU compatibility.
I’ve managed to find a good white paper on HP’s site which may help Cisco people understand more about Virtual Connect and Flex-10.
There’s good information explaining the thing that most often freaks out Cisco people about Virtual Connect: no requirement for spanning tree. Also included are explanations of Flex-NICs, switch stacking and how best to configure the Cisco side of the connection.
The blurb according to HP:
“A technical discussion of the HP Virtual Connect 3.1x features and use with a Cisco Catalyst network infrastructure”
VMware has just released an advisory for the Broadcom bnx2x inbox driver in ESX 4.1 which will affect HP servers, including blades.
ESX/ESXi 4.1 with Broadcom bnx2x Inbox driver version 1.54.1.v41.1-1vmw experiences a loss of network connectivity and a purple diagnostic screen
The resolution (which is only a workaround) says the problem manifests when the IP Checksum feature is used, which is on by default in ESX and ESXi 4.1. This feature moves checksumming (if there’s such a word) from the OS network stack to the adapter, and with this driver it can cause a driver or firmware panic.
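To show what is actually being offloaded, here’s a quick sketch of the standard Internet checksum (RFC 1071) computed over a sample IP header in plain shell. On a host with offload disabled this per-packet arithmetic runs in the OS stack; with IP Checksum offload on, the NIC does it instead. The header bytes are the textbook example, nothing Broadcom-specific:

```shell
# RFC 1071 Internet checksum over a sample 20-byte IP header
# (checksum field zeroed). This is the per-packet work that the
# IP Checksum offload feature moves from the OS stack to the NIC.
set -- 45 00 00 3c 1c 46 40 00 40 06 00 00 ac 10 0a 63 ac 10 0a 0c

sum=0
while [ $# -gt 0 ]; do
    # add the next 16-bit word (two bytes)
    sum=$(( sum + 0x$1 * 256 + 0x$2 ))
    shift 2
done

# fold any carries back into 16 bits, then take the one's complement
while [ $sum -gt 65535 ]; do
    sum=$(( (sum >> 16) + (sum & 65535) ))
done
csum=$(( sum ^ 65535 ))

printf 'IP header checksum: %04x\n' $csum   # -> b1e6
```

Trivial per packet, but at 10Gb line rates it adds up, which is why offloading it to the adapter is the default, and why a driver bug in that path is so painful.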
Now, the question is: is this a serious enough problem to immediately go out and disable checksumming? The KB doesn’t say, but this looks like it could be a serious issue.
Hopefully some more information will be forthcoming from VMware / HP / Broadcom but at least you’ve been warned!