Recovering VMDKs on NetApp NFS Datastores

In my day job, I look after the day-to-day server operations of a university that makes extensive use of VMware and NetApp storage. When I started there, and saw they were using NFS for their datastores, I reserved judgement on whether they were crazy-smart or just crazy. Thankfully it was the former: crazy-smart.

Using NetApp NFS for VMDK storage allows us to do all sorts of cool stuff, especially with regards to backups/recovery/migration. But it had been tedious, especially if someone wanted a single file restored from their VM: we had to copy the entire VMDK out of the snapshot directory, mount it on another VM somewhere, find the file, and get it back to the customer somehow. And if it was on our secondary filer, we had to do a FlexClone, mount that onto one of the 96 ESX hosts we had, copy the file out, etc.

Wheels spin sometimes, and an idea comes to you. Remember Inception? and all the layers? Going deeper etc? It’s like that.

/home/user/file.txt -> ext3 -> LVM LV -> LVM VG -> LVM PV -> /dev/sda1 -> ESX -> VMDK -> NFS Datastore -> NetApp Data ONTAP -> WAFL -> Disks ..

Over the last week, my co-workers and I have been building up a system to make this easier and less disruptive to the infrastructure (which is good for everyone: the fewer changes you have to make to production, the better). The gist is this..

We have a secured VM, with a couple of NICs – one standard access port, one a VLAN trunk carrying our NAS networks, including the one that the VM Blades use to mount their storage.

Inside this VM, we do magic…

So, 96 blades: that’s a fairly large VM infrastructure. We have two separate environments, in 6 clusters, two routing domains, etc, running a total of 1050+ VMs at last count. Each cluster has its own datastores, diverse physical locations, etc. One of the service improvement projects that I got our great team to do was to implement some datastores mounted onto all the clusters, routed where needed. Performance didn’t have to be great, just good enough, and on 10Gb NFS, yeah, it’s pretty good. We have an ISOs datastore, a Templates datastore and a Transfer datastore. The Transfer one was new; the others we’d had for a while.

On our secured VM, we have the Transfer datastore mounted read-write using NFS, as well as the SnapVault repository versions of our datastores (mounted read-only for safety, but the files are read-only anyway). This now means that if we have to do a full VM recovery, we have a simple process:

  • Shut down the VM
  • Edit the settings to remove the hard drives you want to recover (I know, it sounds wrong to me too, but trust me..)
  • Storage vMotion the VM onto the Transfer datastore (which, since it doesn’t have any disks, is quick)
  • Locate the version of the VMDK you want in the .snapshot directory of the snapvault location (We have a simple shell script to list all versions)
  • Copy the VMDK files (remember the -flat.vmdk) from the snapvault location into the appropriate directory on the Transfer datastore, using cp &, then running watch ls -l on the destination, if you want a progress indicator
  • Re-add the storage from the vmware settings, finding it in the place you just copied it
  • Power On VM, check it works, then hand back control to customer, and start a storage vMotion to relocate storage back into the correct primary datastore
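
The "list all versions" step above is easy to sketch. Our real script is shell and isn't reproduced here; what follows is a rough Python equivalent under assumptions: the function name and arguments are made up for illustration, and it assumes the usual layout where each snapshot appears as a subdirectory of .snapshot at the root of the NFS mount.

```python
import os


def list_vmdk_versions(datastore_root, vm_dir, vmdk_name):
    """Return (snapshot, path, size, mtime) for every snapshot holding the VMDK.

    Assumes the layout <datastore_root>/.snapshot/<snapshot>/<vm_dir>/<vmdk_name>.
    Results are sorted oldest first by the file's modification time.
    """
    snap_root = os.path.join(datastore_root, ".snapshot")
    versions = []
    for snap in os.listdir(snap_root):
        candidate = os.path.join(snap_root, snap, vm_dir, vmdk_name)
        # Snapshots taken before the VM existed simply won't contain the file.
        if os.path.isfile(candidate):
            st = os.stat(candidate)
            versions.append((snap, candidate, st.st_size, st.st_mtime))
    versions.sort(key=lambda v: v[3])
    return versions
```

Calling it with the read-only mount point, the VM's directory and the name of the -flat.vmdk gives you a list to pick the restore point from.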

All done! No messing around on the NetApp making flexclones and mounting them, cleaning them up etc. Depending on your level of risk tolerance, you could copy the VMDK back to the primary location also mounted via NFS, but we consider the small delay of the storage vMotion to be a price worth paying for peace of mind.
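
If cp & plus watch ls -l feels a bit too hands-on for the copy step, the same idea fits in a few lines. This is only a sketch, not what we run: the function name, chunk size and report callback are assumptions, and unlike some copy tools it makes no attempt to preserve sparse regions of the file.

```python
import os


def copy_with_progress(src, dst, chunk_size=16 * 1024 * 1024, report=print):
    """Copy src to dst in chunks, calling report() with a progress line per chunk.

    Returns the number of bytes copied and raises if the destination size
    does not match the source size afterwards.
    """
    total = os.path.getsize(src)
    copied = 0
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
            copied += len(chunk)
            percent = 100 * copied // max(total, 1)
            report(f"{copied}/{total} bytes ({percent}%)")
    # Cheap sanity check before re-adding the disk to the VM.
    if os.path.getsize(dst) != total:
        raise IOError(f"size mismatch copying {src} to {dst}")
    return copied
```

The size check at the end is the cheapest possible verification; a checksum of both files would be stronger, at the cost of reading everything twice.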

Your site gets compromised, what do you do?

.. make people unable to use authentication methods that don’t involve giving you a password, that’s what!

Following on from the Gawker account hack, I have gone and changed the passwords on a bunch of accounts. I may not have actually used a password I generated for Gawker, but it seemed prudent.

Lifehacker have a page up here which details the response..

Including this bit:

2) What if I logged in using Facebook Connect? Was my password compromised?
No. We never stored passwords of users who logged in using Facebook Connect. We have, however, disabled Facebook Connect logins temporarily.

*facepalm*

So what you’re saying is, not only are you incompetent, and let people steal your user/password database, you’ve now switched off the one login method that would stop it from happening again??

Nothing pisses me off more than websites that require you to register or login to look at attachments on forums, for example. Facebook Connect (or ideally OpenID) is an awesome solution to the problem of having to create/maintain/worry about accounts on every site on the internet. I mean sure, there are idiots in marketing who love the idea of “rich user engagement” from tying people to their site with an account, but I think they severely overestimate their own importance.

.. but don’t get me started on janrain/rpx’s recent change that suggests you put your paypal username/password into HTML hosted on an insecure site so you can join the social engagement “story”. That’s just stupid.

Fixing GPT partition tables for OSX

With our upcoming visit to Australia, we’re doing backups before we go away. But alas! Elizabeth’s USB drive didn’t work. It became unmounted, and when she plugged it back in, no volumes were found!

.. by OSX

Never wanting to throw away the contents of a drive, I started digging. On a Linux box, I used parted to look at the disk and found that it did indeed know about all the partitions that should be on there, but for whatever reason, they weren’t being enumerated.

Satisfied the data was still there, I went back to my Mac and started poking around. I could see that /dev/disk1 existed, and had no partitions, just as OSX would have me believe. Using the gpt command line utility, I got the following:

# gpt -r show -l /dev/disk1
     start        size  index  contents
         0           1
         1           1         Pri GPT header
         2          32         Pri GPT table
        34           6
        40      409600      1  GPT part - "EFI System Partition"
    409640  1464471472      2  GPT part - "Time Machine Backups"
1464881112      262151
1465143263          32         Sec GPT table
1465143295           1         Sec GPT header
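
A quick way to convince yourself that a layout like this is intact is to check that every row starts exactly where the previous one ends. The sketch below hard-codes the start and size columns from the output above; the function names are made up for illustration, and the 512-byte sector size used in the final comment is an assumption, not something gpt reported.

```python
# (start, size, contents) rows copied from the gpt output above.
GPT_ROWS = [
    (0, 1, ""),
    (1, 1, "Pri GPT header"),
    (2, 32, "Pri GPT table"),
    (34, 6, ""),
    (40, 409600, "EFI System Partition"),
    (409640, 1464471472, "Time Machine Backups"),
    (1464881112, 262151, ""),
    (1465143263, 32, "Sec GPT table"),
    (1465143295, 1, "Sec GPT header"),
]


def find_gaps(rows):
    """Return (start, size, next_start) for each row that does not end where the next begins."""
    problems = []
    for (start, size, _), (next_start, _, _) in zip(rows, rows[1:]):
        if start + size != next_start:
            problems.append((start, size, next_start))
    return problems


def total_sectors(rows):
    """Sector count implied by the last row (its start plus its size)."""
    start, size, _ = rows[-1]
    return start + size


# find_gaps(GPT_ROWS) is empty and total_sectors(GPT_ROWS) is 1465143296,
# which at an assumed 512 bytes per sector is 750153367552 bytes (a 750 GB drive).
```

No gaps or overlaps, and the secondary header sits in the very last sector, which fits with the partitions still being where they should be.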

Twirling my evil moustache, I thought if I could relabel one of those partitions, it would make it rewrite both copies of the partition table, and she should be apples.

# gpt label -i 2 -l "Time Machine Backups" /dev/disk1
/dev/disk1s2 labeled

But no. I then wondered if /usr/sbin/diskarbitrationd was saying anything helpful about the situation, so I ran it in debug mode (edited /System/Library/LaunchDaemons/com.apple.diskarbitrationd.plist as root to add the -d flag to startup), then ran kill -HUP `cat /var/run/diskarbitrationd.pid` followed by tail -f /var/log/diskarbitrationd.log, and got this:

18:11:14 probed disk, id = /dev/disk1, with cd9660, failure.
18:11:14 probed disk, id = /dev/disk1, with exfat, ongoing.
18:11:14 probed disk, id = /dev/disk1, with exfat, failure.
18:11:14 probed disk, id = /dev/disk1, with msdos, ongoing.
18:11:14 probed disk, id = /dev/disk1, with msdos, failure.
18:11:14 probed disk, id = /dev/disk1, with ntfs, ongoing.
18:11:14 probed disk, id = /dev/disk1, with ntfs, failure.
18:11:14 probed disk, id = /dev/disk1, with ufs, ongoing.
18:11:14 probed disk, id = /dev/disk1, with ufs, failure.
18:11:14 probed disk, id = /dev/disk1, no match.

Good effort though, right? I mean, I’m sure Apple must expect regular users to put diskarbitrationd into debug mode on a regular basis.

Anyway.

Found out from this blog post that gdisk was available for OSX. Downloaded, installed and ran it:

# gdisk /dev/disk1
GPT fdisk (gdisk) version 0.6.13

Partition table scan:
MBR: not present
BSD: not present
APM: not present
GPT: present

Found valid GPT with corrupt MBR; using GPT and will write new
protective MBR on save.

Command (? for help): ?

To avoid prolonging the story any more, I wrote the partition table to disk, and hey presto, there’s all the data back.
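
For the curious, the kind of check behind gdisk's "valid GPT with corrupt MBR" message can be approximated by looking at a few well-known byte offsets. This is a simplified sketch, not gdisk's actual logic: it assumes 512-byte sectors and a backup header in the last sector, only looks at signatures (no CRC validation), and the function name is hypothetical. It takes any seekable binary file object, so try it on a disk image rather than a live device.

```python
import io


def inspect_gpt(disk, sector_size=512):
    """Report which GPT-related signatures are present on a seekable binary stream."""
    disk.seek(0, io.SEEK_END)
    last_lba = disk.tell() // sector_size - 1

    # LBA 0: an MBR ends with 0x55 0xAA, and a protective MBR has a
    # partition entry of type 0xEE (entries start at offset 446, 16 bytes
    # each, with the type byte at offset 4 inside the entry).
    disk.seek(0)
    mbr = disk.read(sector_size)
    has_boot_signature = len(mbr) >= 512 and mbr[510:512] == b"\x55\xaa"
    has_protective_entry = len(mbr) >= 512 and any(
        mbr[446 + 16 * i + 4] == 0xEE for i in range(4)
    )

    # LBA 1: the primary GPT header starts with the ASCII signature "EFI PART".
    disk.seek(sector_size)
    primary_ok = disk.read(8) == b"EFI PART"

    # Last LBA: the backup GPT header carries the same signature.
    disk.seek(last_lba * sector_size)
    backup_ok = disk.read(8) == b"EFI PART"

    return {
        "protective_mbr": has_boot_signature and has_protective_entry,
        "primary_gpt_header": primary_ok,
        "backup_gpt_header": backup_ok,
    }
```

A result with both GPT headers present but protective_mbr false would match what gdisk printed above: GPT fine, MBR not.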

So what did we learn from this? Neither Apple, nor Linux, will try using a backup GPT if the primary one becomes fubared.

And despite all assurances to the contrary, USB bus-powered 2.5 inch HDDs only just work with OSX’s meager power provision, and if they get unplugged, they won’t have enough juice to flush caches.

So BC is getting a new Premier

The big news in BC yesterday was that Gordon Campbell stepped down as Premier. Some were loudly proclaiming victory, or expressing happiness at his departure.

As he put it in his statement: When public debate becomes focused on one person, instead of what is in the best interest of British Columbians, we have lost sight about what is important. When that happens, it’s time for a change.

Cause let’s look at the mess he left BC in:

  • One of the lowest unemployment rates in Canada
  • Third highest average hourly wage in Canada
  • Lowest tax rate for low-income (0%) and middle-income families in Canada
  • Up to a 70% tax reduction for low income families
  • Opened 80 new schools, increased education funding every year, more seats in universities, highest per-pupil funding in Canada
  • Balanced budgets for 9 years until the biggest recession in half a century
  • 42% reduction in provincial budgets before service cuts
  • A provincial credit rating that has been upgraded 7 times in a row to AAA (the highest possible)
  • Biggest real GDP growth in Canada
  • $195 million in new Arts grants
  • $80 million in new permanent sport grants and funding
  • 20% increase in the amount paid per person by income assistance
  • Low-income support program spending up by more than 4x
  • Reduced carbon and greenhouse gas emissions: the most aggressive targets set in Canada, with legal enforcement in place

(ganked from voice_of_experience on reddit)

Oh wait, that’s the good stuff.

And yet, there’s a downside.. apparently some people don’t like the HST (which, when you look at what else the province gives, is actually a reasonable measure..) or they didn’t like the Olympics (what are you going to do about that now? it worked out fine. Sure it cost a lot of money..) or that the Canada Line doesn’t have enough capacity (it grew faster than expected, that’s success isn’t it?), or that he once got arrested for drink driving (let me tell you about Ralph…)

I’m confident of history’s view of this period in politics. Also, has anyone seen Idiocracy? No? Never mind, seems like it’s playing out in politics right now, what with this and the Tea Party..

Australian food

We went to a place called Moose’s Downunder for lunch on Sunday, which bills itself as providing a little bit of home and a unique Australian experience in Vancouver.

Well it’s certainly as described on the box. It seems to be staffed entirely by Australians, many of whom are from Perth like the owner. I had an Aussie Burger, with Beetroot + Fried Egg + Pineapple. It did indeed remind me of home. Also the chairs were EXACTLY the same as the ones that KK’s/The Last Drop in Crawley used to have before it turned upmarket. Down to the varnish on the arms turning gooey and coming off.

On the downside, just like home they charge for drink refills and extra sauces. So just like home, you don’t have to tip, right? :P I kid, I kid. I did tip, as is the local custom.

Boeing Aviation Geek Fest 2010

Today was the 2010 Boeing Aviation Geek Fest.

Let me begin by saying, going on the Boeing tour at the best of times is pretty geeky. This, on the other hand, is a once-a-year tour they don’t promote heavily, but the aviation geeks find out about it one way or another.. It’s slightly more expensive than the regular tour, but it’s really for the hardcore fans.

We started off the day.. well, first, getting here from Canada. We left home and drove to Sumas. Took about 1.75 hours to get across the border.. first a 60 minute lineup to get to the border, then another 45 minutes in with the good people of Immigration to get our I-94 waiver forms (mostly waiting in lines; despite it not being the usual “tourist” border, they were still very nice), then zooming down the highway and getting to the Future of Flight and “checking in” for 1330 hours.

The AGF day started with a session from Boeing’s professional aviation geek, Michael Lombardi, who is employed as an aviation historian. He went through the last 40 years of Boeing, and gave some fun insights and back stories, then a bit of a Q+A, then some chatting with each other over free candy (yay halloween), then the tour.

Let me step back.. the regular Boeing tour is pretty cool, you walk on high level platforms and look out over a sight which is similar to the construction of the USS Enterprise in the most recent Star Trek movie. This tour, on the other hand, is at ground level, walking on the actual factory floor, and through, around and on planes in various stages of production. Sweeet. You have to wear eye protection, just in case, and watch your step through and around cables. It’s an amazing facility up close.

Inside the factory we saw 777 LN903 for Turkish Airlines up close and personal, getting to kick the tires, almost literally, in addition to actually walking in and around the pieces that would make up LN908 for EgyptAir. As well as that, we saw the first 747-8i in final body join, a bunch of 787s (including the first 3 for Air India) and the 787 static test article.

Then they dragged us out of the factory, with some difficulty, and back onto the bus, which did a tour of the KPAE flightline parking lot. I believe a record for the loudest cheer for doing a left-hand turn was set this day when this was announced. We went up and around all the planes waiting for final fit-out and delivery (this site has pictures of them from afar). Saw 777s for V Australia and Air New Zealand, as well as all the 787s for ANA, and a bunch of 747-8Fs for Cargolux, Korean Airlines and Cathay Pacific Cargo.

Then it was back to the Future of Flight center for Pizza and networking with other geeks before heading off to our hotel.

Everyone knows planes are big, even “small” planes like the 737, but the size of the 747s and 777s is pretty amazing. I gush on the regular factory tour, and it’s probably more interesting for most people than the one we did, but the fact is that almost every international airliner in service today was made in either this factory, or Airbus’s in Toulouse.

What Boeing makes here is pretty much the pinnacle of humankind’s knowledge of technology and ability to build machines, and it’s an amazing privilege to get up close and personal on the factory floor. Future of Flight is an amazing center at the best of times, and I have to say, today was an amazing day. I feel so lucky to have been able to attend. Very few members of the public get to do factory floor tours. Between this year’s and last year’s there was some overlap, so probably fewer than 75 people have done this one.

So thank you very much to Future of Flight, Boeing Commercial Aircraft and Airline Reporter for organising the day! Look forward to next year’s!

See also: Photos from the Stratodeck

Dear Cisco, wtf are you thinking?

As an expatriated person, I find myself thinking of home sometimes. Video conferencing with people from the old country is fun, so I thought I’d have a look at the details on Cisco’s new Umi video conferencing unit.

Let me say, I have no idea what they’re thinking here. It’s for home use. It costs $599. Then, you have to pay $24/month for a plan to use it. To call other people who have a Umi.

Because it doesn’t work with Skype, or FaceTime. Or anything other than Google Video chat (which is itself free for non PSTN calls).

So basically, you’re charging as much as a computer + webcam (which you could hook up to a TV), you can’t connect to Skype, and you’re charging a monthly fee for something everyone else is giving away for free.

Let me know how that works out for you…