Recovering ODA VMs From Lost ACFS Snapshots

August 15, 2019

This is a continuation of my previous post regarding dropped ACFS snapshots.  In this scenario, a user logged in to a virtualized ODA system and deleted the underlying ACFS snapshots for multiple virtual machines on the host.  Oracle describes how to back up and restore guest VMs on an ODA in MOS note #1633166.1.

The basic process for getting a clean backup of a VM is to shut it down, take a snapshot of the filesystem, and then start the VM back up.  Shutting down the VM ensures that the filesystem inside the guest is consistent - once the snapshot has been taken, you can copy it off to another location to preserve your backup.
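As a rough sketch, a simple variant of that flow - one that skips the intermediate snapshot and copies the files while the VM is down - looks like this (names and paths are from our environment; the MOS note is the authority on the exact procedure):

oakcli stop vm actemp        # quiesce the guest so its filesystem is consistent
cp -R /u01/app/sharedrepo/oda1repo/.ACFS/snaps/actemp/VirtualMachines/actemp /archive/oda/
oakcli start vm actemp       # bring the guest back online

The snapshot step in the note exists to shorten that downtime window - the VM starts right after the snapshot is taken, and the copy runs against the snapshot afterward.

Thankfully, we had backups of the VMs on this system.  Unfortunately, the MOS note's recovery instructions expect the original ACFS snapshot to still be there.  This left us in a bit of a bind, because we didn't have the underlying directory structure in place to restore the VMs.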

Since the snapshot is created via the oakcli command, we were a bit nervous about just recreating the snapshot and copying files.  We decided that the best course of action would be to remove the VM from the OAK registry and let oakcli recreate the snapshot.  This required a bit of trickery.  The high-level plan to get our VMs back online was:

  1. Use oakcli to delete the VM
  2. Recreate the VM with oakcli in order to rebuild the directory structure and permissions on the ACFS filesystem
  3. Overwrite the disk images with backups
  4. Start the restored VM

There were a few challenges with this approach - trying to delete the VM gave an error:

[root@oda1db01 ~]# oakcli delete vm actemp

OAKERR:7037 Error encountered while deleting VM - <oakvmres><OakMessages><OakMessage><id>0</id><text>OAKERR:7006 Invalid vm.cfg file passed with parameters - /OVS/Repositories/oda1repo/.ACFS/snaps/actemp/VirtualMachines/actemp/vm.cfg [Errno 2] No such file or directory: '/OVS/Repositories/oda1repo/.ACFS/snaps/actemp/VirtualMachines/actemp/vm.cfg'</text></OakMessage></OakMessages></oakvmres>

It looks like oakcli tries to read the vm.cfg file before deleting the snapshot, but the snapshot is already gone.  We learned that if we recreated the snapshot, we could run the delete command:

[root@oda1db01 ~]# acfsutil snap create -w actemp /u01/app/sharedrepo/oda1repo
acfsutil snap create: Snapshot operation is complete.
[root@oda1db01 ~]# oakcli delete vm actemp

Deleted VM : actemp

The VM also had an attached vdisk, which was unfortunately deleted along with the snapshots, so we'll need to restore that vdisk as well.  We saw a different error when we tried to delete the vdisk via oakcli:

[root@oda1db01 ~]# oakcli delete vdisk actemp_u01 -repo oda1repo

OAKERR:7059 Error encountered while deleting oakvdk_actemp_u01 vdisk: Vdisk Still attached to Vm

That isn't what I wanted to see - especially since the VM had already been deleted.  The oakcli command checks a file inside the shared repository, oakres.xml, which tracks all of the VMs and vdisks registered there.  Unfortunately, the entire file is written on a single line, so I pulled it into an XML formatter to make it readable.  The entry for the vdisk was at the end of the file:

<VDisk>
   <Name>oakvdk_actemp_u01_oda1repo</Name>
   <RepoName>oda1repo</RepoName>
   <TypeName>VDiskType</TypeName>
   <VmAttached>1</VmAttached>
   <Type>local</Type>
   <Size>1G</Size>
</VDisk>
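If you'd rather not paste the file into an external formatter, xmllint can pretty-print it directly (a sketch - I'm assuming oakres.xml lives near the root of the repo mount, so confirm the path with find first):

find /u01/app/sharedrepo/oda1repo -maxdepth 3 -name oakres.xml   # locate the registry file
xmllint --format /u01/app/sharedrepo/oda1repo/oakres.xml | less  # readable, indented output

Whatever you do, keep a copy of the original (cp oakres.xml oakres.xml.bak) before hand-editing it.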

In this case, we just need to modify the "VmAttached" value from 1 to 0.  After changing that, we can create a snapshot and delete the vdisk:

[root@oda1db01 ~]# acfsutil snap create -w oakvdk_actemp_u01 /u01/app/sharedrepo/oda1repo
acfsutil snap create: Snapshot operation is complete.
[root@oda1db01 ~]# oakcli delete vdisk actemp_u01 -repo oda1repo

Deleted VDISK : oakvdk_actemp_u01

Now we're in good shape - the related VM and vdisk have been removed from the OAK registry, and we can begin restoring the VMs.  First, we create new VMs and vdisks using the standard oakcli commands - this creates the necessary directory structure and populates oakres.xml with new entries:

[root@oda1db01 ~]# oakcli clone vm actemp -vmtemplate ol7 -repo oda1repo -node 1

Cloned VM : actemp

[root@oda1db01 ~]# oakcli create vdisk actemp_u01 -repo oda1repo -size 1G -type local

Created Vdisk : oakvdk_actemp_u01

After this is complete, I would run "oakcli configure vm" to update the CPU, memory, and failover settings if they differ from what's in the template.  In my case, the template already has what I need, so I'm good there.
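For reference, that would look something like the following - the values here are purely illustrative, so check the clone's current settings first:

oakcli show vm actemp                              # review current vcpu/memory settings
oakcli configure vm actemp -vcpu 4 -memory 8192M   # sized to match the original guest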

Now I'm ready to copy my backups into the new directories - perform this step for any VMs and vdisks that were deleted.  Since I cloned the VM from the same original template, the file names match my backup and I don't need to rename anything.  Once the copy is complete, attach the vdisk to the VM.

[root@oda1db01 ~]# cd /u01/app/sharedrepo/oda1repo/.ACFS/snaps/actemp/VirtualMachines/actemp/
[root@oda1db01 actemp]# ls -l
total 31464372
-rw------- 1 root root 32212254720 Aug  2 10:58 0004fb000012000059ee12fd88927567.img
-rw------- 1 root root         627 Aug  2 10:58 vm.cfg
[root@oda1db01 actemp]# rm -Rf 0004fb000012000059ee12fd88927567.img
[root@oda1db01 actemp]# cp /archive/oda/actemp/0004fb000012000059ee12fd88927567.img .
[root@oda1db01 actemp]# ls -l
total 31465448
-rw------- 1 root root 32212254720 Aug  2 11:19 0004fb000012000059ee12fd88927567.img
-rw------- 1 root root         627 Aug  2 10:58 vm.cfg
[root@oda1db01 actemp]# cd /u01/app/sharedrepo/oda1repo/.ACFS/snaps/oakvdk_actemp_u01/VirtualDisks/
[root@oda1db01 VirtualDisks]# ls -l
total 1053168
-rw-r--r-- 1 root root 1073741824 Aug  9 11:05 oakvdk_actemp_u01
[root@oda1db01 VirtualDisks]# rm -Rf oakvdk_actemp_u01
[root@oda1db01 VirtualDisks]# cp /archive/oda/actemp/oakvdk_actemp_u01 .
[root@oda1db01 VirtualDisks]# ls -l
total 1048576
-rw-r--r-- 1 root root 1073741824 Aug  9 11:26 oakvdk_actemp_u01
[root@oda1db01 VirtualDisks]# oakcli modify vm actemp -attachvdisk actemp_u01

Configured VM : actemp. Changes will take effect on next restart of VM.
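At this point it's worth a quick sanity check that the registry reflects the new attachment (a sketch - the exact output varies by oakcli version):

oakcli show vm actemp                        # the vdisk list should include oakvdk_actemp_u01
oakcli show vdisk actemp_u01 -repo oda1repo  # confirms the vdisk is registered in the repo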

Before I can start up the VM, I need to clear out any loop devices that were originally created on dom0.  Because dom0 accesses the repository via NFS, deleting the snapshots out from under Xen left stale file handles behind.  We can check the status using the losetup command:

[root@oda1db01d0 ~]# losetup -a
/dev/loop0: [0903]:33685522 (/OVS/Repositories/odabaseRepo/VirtualMachines/oakDom1/System.i*)
/dev/loop1: [0903]:33685523 (/OVS/Repositories/odabaseRepo/VirtualMachines/oakDom1/u01.img)
/dev/loop2: [0903]:33685521 (/OVS/Repositories/odabaseRepo/VirtualMachines/oakDom1/swap.img*)
/dev/loop3: [001e]:36028797018964023 (/OVS/Repositories/oda1repo/.ACFS/snaps/odaoem01/VirtualMachine*)
/dev/loop4: [001e]:54043195528446009 (/OVS/Repositories/oda1repo/.ACFS/snaps/oakvdk_odaoem01_u01/Vir*)
loop: can't get info on device /dev/loop5: Stale file handle <----remove loop device
loop: can't get info on device /dev/loop6: Stale file handle <----remove loop device

[root@oda1db01d0 ~]# losetup -d /dev/loop5
[root@oda1db01d0 ~]# losetup -d /dev/loop6
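With only two stale devices, detaching them by hand is fine.  If a host has accumulated many, a one-liner like this can sweep them all (a sketch - the "can't get info" messages arrive on stderr, hence the 2>&1):

losetup -a 2>&1 | awk '/Stale file handle/ {gsub(":","",$7); print $7}' | xargs -r -n1 losetup -d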

Now, we should be able to start the VM.  Let's see what we get:

[root@oda1db01 ~]# oakcli start vm actemp

Started VM : actemp on Node Number : 0

That looks good - give it a minute to come up, and then we can log in:

[root@oda1db01 ~]# ssh acolvin@actemp
acolvin@actemp's password:
Last login: Wed Jul 31 17:49:08 2019 from 10.9.236.68
[acolvin@actemp ~]$

There you go - we've successfully restored the backup.  The one caveat is that if your backup snapshot was taken while the VM was running, you may need to run fsck inside the guest.
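The /forcefsck flag file is one low-effort way to schedule that check for the next boot - it's honored by both SysV init and systemd-fsck on Oracle Linux guests (a sketch):

sudo touch /forcefsck   # request a full filesystem check at the next boot
sudo reboot

Otherwise, you're good to go.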
