What does VMware’s EOL of vCenter Server Heartbeat mean for availability?
VMware has very surprisingly and suddenly stopped selling vCenter Server Heartbeat as of 2nd June 2014. If you have already purchased vCenter Server Heartbeat you will still get support until 2018, so there is no need to panic that the whole carpet has been pulled from under your feet. It does raise the question, though: what do you do going forward to make your vCenter installation more highly available if you need it?
In the EOL announcement, VMware suggests first of all making your vCenter a VM so that you can take advantage of HA to provide high availability. If for some reason you cannot run vCenter as a VM (and you really need to ask yourself why) and it is or needs to be physical, then the only option is to use a backup product to be able to restore vCenter if it fails.
I wrote a post way back in November 2010, “Why vCenter is letting VMware’s side down”. Based on vCenter 4.1, it highlighted the growing reliance on vCenter from so many VMware management products and the ever larger number of VMs that vCenter can manage. vCenter was no longer just a management product that didn’t need to be up all the time. Since then more and more applications have come to rely on vCenter: think cloud scale with vCloud Director and vCloud Automation Center. My 2010 post detailed issues I had had with a VDI environment which came crashing down when vCenter failed. Think how many VDI users or cloud VMs you may have. If vCenter is down for even an hour or, realistically far worse, until you can successfully restore vCenter from a backup, what would that mean to your business? During this time not a single VDI user can connect to their desktop and none of your cloud tenants can provision a VM.
It may not just be the vCenter service itself that fails but something internal like Linked Mode, which previously caused me to lose all permissions and roles. In 2010 I migrated a pretty large scale VDI environment to XenServer, primarily to mitigate the single point of failure of vCenter. Sure, vCenter has become far more reliable since the 4.1 days and the database corruptions I experienced, but vCenter is still not inherently built for high availability and is far more complicated with its now numerous additional services.
vCenter Server Heartbeat was an OEM product from Neverfail. You basically created a duplicate copy of your vCenter and SQL installation running on another physical server or VM, and Heartbeat would keep them in sync and allow you to fail over between them if there was an issue. Heartbeat protected not just your VM or physical server but your OS and SQL installation as well. Hopefully you even had the partners on separate storage. Someone could screw up the OS or SQL installation of one of your vCenter pair or even delete a LUN, or a disk shelf could die, and you would still be able to fail over vCenter and keep running.
Sure, Heartbeat had its issues. It used to be painful to keep the secondary server on the domain as it was in effect hidden when not active, and the WAN failover mode was never reliable enough at updating the DNS entry. It didn’t protect against database corruptions or any file changes in the vCenter directories as these would be immediately synced over to the partner. It was a little fiddly to set up but once up and running at least gave you a little more peace of mind. I think it was sometimes a difficult sell where customers didn’t understand the value of protecting vCenter, and having to purchase an additional product other than SRM to do this was confusing.
We have no idea what VMware is planning for the future to make vCenter more robust. Will we move to a more federated, self-replicating vCenter, taking some cues from the much improved SSO in vSphere 5.5? Will vCenter perhaps only be delivered in future as an appliance with cloud scale performance and availability?
What to do now?
If you have already bought Heartbeat from VMware, nothing really changes. You just know no more development will happen, so you need to plan for the future.
If you don’t already have or want Heartbeat, I encourage you to think about what a failure of vCenter would mean for you. Ask yourself the standard DR questions: how long can you go without vCenter, and how up to date does it have to be when brought back up? For a cloud environment that is continually provisioning critical VMs you will need a pretty up to date vCenter back up and running pretty quickly. Perhaps your VDI environment is a little more static and, although you need it back up pretty quickly, you can afford to go back a few hours when you restore. Perhaps your private cloud server provisioning process uses vCAC, which would need to connect to vCenter, but you don’t actually deploy that many VMs a day, so you can survive with vCenter down for a little longer.
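Those two DR questions are the usual recovery time objective (how long can it be down) and recovery point objective (how far back can you go). As a rough illustration only, here is a minimal Python sketch that checks a backup schedule against them. The `RecoveryTarget` class, the `meets_targets` function and all of the numbers are made up for this example and are not recommendations.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTarget:
    """Hypothetical DR targets for one vCenter instance."""
    name: str
    rto: timedelta  # longest outage you can tolerate
    rpo: timedelta  # oldest restore point you can tolerate


def meets_targets(target, backup_interval, restore_duration):
    """Return (rpo_ok, rto_ok) for a backup schedule and restore time.

    Worst case, vCenter fails just before the next backup runs, so the
    state you lose is roughly one full backup interval.
    """
    rpo_ok = backup_interval <= target.rpo
    rto_ok = restore_duration <= target.rto
    return rpo_ok, rto_ok


# Illustrative numbers only: a busy cloud versus a more static VDI estate.
cloud = RecoveryTarget("cloud", rto=timedelta(minutes=30), rpo=timedelta(minutes=15))
vdi = RecoveryTarget("vdi", rto=timedelta(minutes=30), rpo=timedelta(hours=4))

hourly_backup = timedelta(hours=1)
live_mount_restore = timedelta(minutes=20)

cloud_result = meets_targets(cloud, hourly_backup, live_mount_restore)
vdi_result = meets_targets(vdi, hourly_backup, live_mount_restore)
print("cloud (rpo_ok, rto_ok):", cloud_result)
print("vdi   (rpo_ok, rto_ok):", vdi_result)
```

With these made-up figures an hourly backup is fine for the VDI case but too coarse for the cloud case, even though the restore itself is quick enough for both.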
Perhaps you need to think about splitting up your vCenters. Rather than relying on a massive one, have a few more and create management pods to reduce your failure domain. This obviously has licensing implications but could be justified by ensuring only part of your cloud or VDI environment is down rather than everything.
Make it a VM
First of all make vCenter a VM or, even better, look towards using the vCenter Server Appliance. Although the appliance is a newer deployment model, having everything in one place has its advantages: fewer independent moving parts to go wrong, an integrated database and an easier upgrade process. Yes, you will need to host your Update Manager and old style vCenter plug-ins separately but these don’t form part of the core availability requirement. Some people argue that with the appliance you cannot back up the database independently, but I would counter that by saying back up the VM in its entirety as you would many of your other VMs; the database and OS are one package.
Having vCenter as a VM just makes some of your life easier. You are immediately protected against hardware failures with HA, can work around your maintenance process with vMotion, and have a better chance of getting the resources you need with DRS. If you also have multiple vCenter VMs, you can make your operational life easier by ensuring vCenter doesn’t manage the cluster it sits on. If you have two vCenters managing two separate clusters for example, ensure vCenter A on Cluster A manages Cluster B and vCenter B on Cluster B manages Cluster A. This means you can always manage a vCenter VM from the other vCenter, even when the vCenter inside that VM is down.
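The cross-management rule is easy to get wrong as environments grow, so it can be worth checking mechanically. Below is a small Python sketch of such a check, assuming you can describe your layout as two simple mappings. The function name `self_managed` and the vCenter and cluster names are hypothetical, and in practice the mappings would come from your own inventory data.

```python
def self_managed(placement, manages):
    """Return the vCenters that manage the cluster their own VM runs on.

    placement: vCenter name -> cluster hosting that vCenter VM
    manages:   vCenter name -> set of clusters that vCenter manages
    """
    return sorted(
        vc for vc, cluster in placement.items()
        if cluster in manages.get(vc, set())
    )


# Hypothetical layout matching the two-vCenter example above.
placement = {"vcenter-a": "cluster-a", "vcenter-b": "cluster-b"}

crossed = {"vcenter-a": {"cluster-b"}, "vcenter-b": {"cluster-a"}}
uncrossed = {"vcenter-a": {"cluster-a"}, "vcenter-b": {"cluster-b"}}

print("crossed layout, offenders:", self_managed(placement, crossed))
print("uncrossed layout, offenders:", self_managed(placement, uncrossed))
```

An empty list means no vCenter depends on itself to manage its own VM.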
Backups and Snapshots
You will also need to rely on backups or snapshots to recover your vCenter VM. This is where having multiple vCenters on different clusters can help, as you can use one vCenter to take and recover VM snapshots for the others. Don’t rely on only one rolling snapshot/backup. You are going to need more options when you restore your VM from an hour ago and realise the corruption is still there. If you are using a backup product, what is your restore time? Some backup products allow you to live mount a point in time backup and power on the VM. This would be far quicker than having to restore a whole LUN or .vmdk file from backup storage to primary storage. Please don’t consider tape backups, you should know why. You could also use storage array snapshots to recover your VM using a fast cloning method.
Build a restore plan
Basically you need a way to get vCenter back from a VM snapshot, storage snapshot or backup really fast. More than just recovering the data, you need a quick way to initiate the restore. Here automation can help. You could build some PowerCLI scripts that can connect to another vCenter, or to an individual ESXi host if a single vCenter is unavailable. This script could either restore a VM snapshot or talk to your storage array and do the (quick) LUN/file restore and power on the VM. Get clever and have the ability either to list all available snapshots and select one, or to pass in a snapshot name or time so you can go back further in time. Don’t be dumb and host that PowerCLI script on your vCenter…obviously. Test this all, as any backup protection is useless without a restore.
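The "select one, or pass in a name or time" part is worth getting right before you need it in anger. Here is a minimal Python sketch of just that selection logic, not the restore itself. The `Snapshot` class, the `choose_snapshot` function and the sample data are all hypothetical. In a real PowerCLI script the list would come from something like `Get-Snapshot` and the revert would be done with `Set-VM` and its `-Snapshot` parameter, or with your array vendor’s tooling.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one restore point for the vCenter VM."""
    name: str
    created: datetime


def choose_snapshot(snapshots, name=None, before=None):
    """Pick a restore point.

    name given:    return the snapshot with exactly that name
    before given:  return the newest snapshot taken at or before that time
    neither given: return the newest snapshot
    Raises LookupError when nothing matches.
    """
    if name is not None:
        for snap in snapshots:
            if snap.name == name:
                return snap
        raise LookupError(f"no snapshot named {name!r}")

    candidates = [s for s in snapshots if before is None or s.created <= before]
    if not candidates:
        raise LookupError("no snapshot old enough")
    return max(candidates, key=lambda s: s.created)


# Made-up restore points for illustration.
snaps = [
    Snapshot("hourly-0900", datetime(2014, 6, 2, 9, 0)),
    Snapshot("hourly-1000", datetime(2014, 6, 2, 10, 0)),
    Snapshot("hourly-1100", datetime(2014, 6, 2, 11, 0)),
]

latest = choose_snapshot(snaps)
earlier = choose_snapshot(snaps, before=datetime(2014, 6, 2, 10, 30))
named = choose_snapshot(snaps, name="hourly-0900")
print(latest.name, earlier.name, named.name)
```

The `before` option is the one that matters when the most recent restore point turns out to contain the same corruption and you need to step back again.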
We’ll have to wait and see what the future brings for vCenter availability. I was surprised that VMware EOL’d Heartbeat with no warning, and I have no idea why they didn’t wait until a new solution was available.