Recovering VMDKs on NetApp NFS Datastores

In my day job, I look after the day to day server operations of a university that makes extensive use of vmware and netapp storage. When I started there, and saw they were using NFS for their datastores, I reversed judgement on if they were crazy-smart or just crazy. Thankfully it was the former – crazy-smart.

Using NetApp NFS for VMDK storage allows us to do all sort of cool stuff, especially with regards to backups/recovery/migration. But it had been tedious, especially if someone wanted a single file restored from their VM.. we had to copy the entire VMDK out of the snapshot directory, mount it on another VM somewhere, find the file, and get it back to the customer somehow. And if it was on our secondary filer, we had to do a flexclone, and mount that onto one of the 96 ESX hosts we had, copy the file out.. etc

Wheels spin sometimes, and an idea comes to you. Remember Inception? and all the layers? Going deeper etc? It’s like that.

/home/user/file.txt -> ext3 -> LVM LV -> LVM VG -> LVM PV -> /dev/sda1 -> ESX -> VMDK -> NFS Datastore -> NetApp Data OnTap -> WAFL -> Disks ..

Over the last week, my co-workers and I have been building up a system to make this easier and less disruptive to the infrastructure (which is good for everyone, the less changes you have to make to production, the better). This gist is this..

We have a secured VM, with a couple of NICs – one standard access port, one a VLAN trunk carrying our NAS networks, including the one that the VM Blades use to mount their storage.

Inside this VM, we do magic…

So, 96 blades – that’s a fairly large VM infrastructure. We have two separate environments, in 6 clusters, two routing domains, etc, running a total of 1050+ VMs at last count. Each cluster with their own datastores, diverse physical locations, etc. One of the service improvement projects that I got our great team to do was to implement were someĀ datastores, mounted onto all the clusters, routed where needed. Performance didn’t have to be great, just good enough, and on 10Gb NFS, yeah, it’s pretty good. We have an ISOs datastore, a Templates datastore and a Transfer datastore. The Transfer one was new – the others we’d had for a while.

On our secured VM, we have the Transfer datastore mounted read-write using NFS, as well as the snapvault repository versions of our datastores (mounted read only for safety, but the files are read-only anyway). This now means that if we have to do a full VM recover, we have a simple process –

  • Shut down the VM
  • Edit the settings to remove the hard drives you want to recover (I know, it sound wrong to me too, but trust me..)
  • Storage vMotion the VM onto the Transfer datastore (which, since it doesn’t have any disks, is quick)
  • Locate the version of the VMDK you want in the .snapshot directory of the snapvault location (We have a simple shell script to list all versions)
  • Copy the VMDK files (remember the -flat.vmdk) from the snapvault location into the appropriate directory on theĀ Transfer datastore, using cp &, then running watch ls -l on the destination, if you want a progress indicator
  • Re-add the storage from the vmware settings, finding it in the place you just copied it
  • Power On VM, check it works, then hand back control to customer, and start a storage vMotion to relocate storage back into the correct primary datastore

All done! No messing around on the NetApp making flexclones and mounting them, cleaning them up etc. Depending on your level of risk tolerance, you could copy the VMDK back to the primary location also mounted via NFS, but we consider the small delay of the storage vMotion to be a price worth paying for peace of mind.