Where’d My VMs Go? Adventures in Simple Linux Forensics

By | August 9, 2019

**While the story is true, any identifying details have been changed to protect both the guilty and the innocent**

I woke up to a message the other day from a colleague on a project where I occasionally provide assistance.  He was a bit panicked - they use an Oracle Database Appliance as a utility server for their environment.  The ODA is virtualized, and provides the database backend for their RMAN catalog, OEM repository, and a few other databases, but also contains the virtual machines for their other management functions: OEM, ansible, source code repositories, and other things.

The colleague had been notified that all of the VMs were reachable by ping, but he couldn't access them via SSH or the VM console in the ODA.  When we ran oakcli show vm, the VMs all appeared to be running:

[root@oda1db01 ~]# oakcli show vm
 
NAME     NODENUM   MEMORY   VCPU    STATE   REPOSITORY
actemp         0    8192M      4   ONLINE     oda1repo
odaoem01       0   32768M     16   ONLINE     oda1repo
odaoem02       1   32768M     16   ONLINE     oda1repo

In order to recreate the issue, I built a VM on one of the internal Enkitec ODA systems, and named it actemp.  I checked the actemp VM.  Sure enough, it would reply to pings, but I couldn't get in with SSH or any via any other service that would normally be listening.  We tried to bounce the VM, and noticed that we had a bigger issue:

[root@oda1db01 ~]# oakcli start vm actemp

Start VM : actemp on Node Number : 0 failed.
DETAILS:
Attempting to start vm on node:0=>FAILED.<OAKERR:7007 Error encountered while starting VM - Error: Unable to open config file: /OVS/Repositories/oda1repo/.ACFS/snaps/actemp/VirtualMachines/actemp/vm.cfg>

At this point, we started to look at the underlying storage.  On a virtualized ODA, guest VMs are stored in a shared ACFS repository.  I initially checked the filesystem to see if any files were present in the directory:

[root@oda1db01 ~]# ls -al /u01/app/sharedrepo/oda1repo/.ACFS/snaps/
total 36
drwxr-xr-x 9 root root 4096 Jul 31 05:58 .
drwxr-xr-x 5 root root 4096 Apr 10 2018 ..
drwxr-xr-x 8 root root 4096 Apr 18 2018 odaoem01
drwxr-xr-x 8 root root 4096 Apr 18 2018 odaoem02

Sure enough, the directories were gone.  Whenever a VM is created, an ACFS snapshot is made - this allows for easy cloning of VMs, and you can take a snapshot to get an image backup of a running VM.  ACFS snapshots can be viewed with the "acfsutil snap info <mountpoint>" command.  The files were definitely gone, but were the snapshots still there?  I ran acfsutil and saw this:

[root@oda1db01 ~]# acfsutil snap info /u01/app/sharedrepo/oda1repo/
snapshot name:               odaoem01
snapshot location:           /u01/app/sharedrepo/oda1repo/.ACFS/snaps/odaoem01
RO snapshot or RW snapshot:  RW
parent name:                 /u01/app/sharedrepo/oda1repo/
snapshot creation time:      Wed Apr 18 15:17:03 2018
storage added to snapshot:   32220020736   (  30.01 GB )

snapshot name:               odaoem02
snapshot location:           /u01/app/sharedrepo/oda1repo/.ACFS/snaps/odaoem02
RO snapshot or RW snapshot:  RW
parent name:                 /u01/app/sharedrepo/oda1repo/
snapshot creation time:      Wed Apr 18 15:29:14 2018
storage added to snapshot:   32220643328   (  30.01 GB )

    number of snapshots:  2
    snapshot space usage: 64440664064  ( 60.02 GB )

uh oh....that's not good.  There should normally be snapshots for any VM templates, VMs, or virtual disks that have been created on the ODA.  Now that we've established that the snapshots themselves were gone, it looked like we were facing two possibilities - a bug wiped out the snapshots, or someone had run the commands to delete our snapshots.  We checked the bash history for previous use of acfsutil, but that didn't show anything.  This didn't rule out whether a user deleted the snapshots, it just means that we need to dig deeper.

Thankfully, acfsutil logs all activities in /u01/app/<gi_owner>/diag/crs/<hostname>/trace/acfsutil_<pid>.trc.  I checked the directory, and saw a handful of trace files listed in there.  Looking at one, I saw our problem:

[root@oda1db01 trace]# cat acfsutil_34147.trc
Trace file /u01/app/grid/diag/crs/oda1db01/crs/trace/acfsutil_34147.trc
Oracle Database 12c Clusterware Release 12.2.0.1.0 - Production Copyright 1996, 2016 Oracle. All rights reserved.
2019-07-31 09:16:52.569 : SNAPDEL:1416106464: [34147] Start: /sbin/acfsutil.bin snap delete actemp /u01/app/sharedrepo/oda1repo
2019-07-31 09:16:55.058 : SNAPDEL:1416106464: [34147] Returned: 0

There we have it - something actually ran an "acfsutil snap delete actemp /u01/app/sharedrepo/oda1repo" on the host, which deleted the snapshot.  If we look at the trace file, we can see exactly when it happened - July 31 at 9:16AM.  What else can we find out?  One of the nice utilities that Oracle installs on their engineered systems is OSwatcher, which collects many stats and metrics at the operating system level.  On the ODA, it's stored in /opt/oracle/oak/oswbb/archive - in that directory are other directories for each of the commands that are saved.  In our case, we'll look at oswps, which is collecting the running processes every 5 seconds.  Since it's running at 5 second intervals, we may not see everything, but let's take a look and see if there are any executions of acfsutil on the 31st.

[root@oda1db01 oswps]# zgrep acfsutil oda1db01_ps_19.07.31.*
oda1db01_ps_19.07.31.0900.dat:4 D root 34147 88190 0 80 0 - 59527 OfsWai 09:16 pts/5 00:00:00 /sbin/acfsutil.bin snap delete actemp /u01/app/sharedrepo/oda1repo
oda1db01_ps_19.07.31.0900.dat:4 D root 35398 88190 0 80 0 - 59526 call_r 09:17 pts/5 00:00:00 /sbin/acfsutil.bin snap delete actemp_vdisk_u01 /u01/app/sharedrepo/oda1repo

I used zgrep because OSWatcher rolls files every hour and compresses them with gzip.  On Exadata, they use bgzip.  In that case, you'd use bzgrep instead of zgrep.

Sure enough, we have a couple of entries in there - we can see that PIDs 34147 and 35398 were the processes that were executed.  What's more interesting is that we can see the parent PID that was used to invoke the commands - 88190.  Now we're getting somewhere - if we can see what that PID was doing, then we'll have a better idea of the cause.  Again, we'll grep for 88190 in that file.  We don't need to use zgrep here, since we're reading an uncompressed dat file:

[root@oda1db01 oswps]# grep 88190 oda1db01_ps_19.07.31.0900.dat
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 do_wai 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 do_wai 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash
4 D root 34147 88190 0 80 0 - 59527 OfsWai 09:16 pts/5 00:00:00 /sbin/acfsutil.bin snap delete actemp /u01/app/sharedrepo/oda1repo
4 S root 88190 87883 0 80 0 - 2906 do_wai 08:35 pts/5 00:00:00 -bash
4 D root 35398 88190 0 80 0 - 59526 call_r 09:17 pts/5 00:00:00 /sbin/acfsutil.bin snap delete actemp_vdisk_u01 /u01/app/sharedrepo/oda1repo
4 S root 88190 87883 0 80 0 - 2906 do_wai 08:35 pts/5 00:00:00 -bash
4 S root 88190 87883 0 80 0 - 2906 wait_w 08:35 pts/5 00:00:00 -bash

Now, we can see that acfsutil wasn't being called by a script or a scheduled job, it came from the bash shell.  One more grep for the parent of the bash shell, and we'll get some more information:

[root@oda1db01 oswps]# grep 87883 oda1db01_ps_19.07.31.0900.dat
4 S root 87883 6619 0 80 0 - 25541 poll_s 08:35 ? 00:00:00 sshd: root@pts/5
4 S root 88190 87883 0 80 0 - 27130 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 87883 6619 0 80 0 - 25541 poll_s 08:35 ? 00:00:00 sshd: root@pts/5
4 S root 88190 87883 0 80 0 - 27130 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 87883 6619 0 80 0 - 25541 poll_s 08:35 ? 00:00:00 sshd: root@pts/5
4 S root 88190 87883 0 80 0 - 27130 wait_w 08:35 pts/5 00:00:00 -bash
4 S root 87883 6619 0 80 0 - 25541 poll_s 08:35 ? 00:00:00 sshd: root@pts/5

That settles it - someone logged in via SSH and ran the acfsutil snap delete commands.  Now, to find out who?  From there, we can search the authentication logs in /var/log/secure. In the event that this was done with malicious intent, hopefully syslogs are being shipped to a remote server where you can check.  In my case, nobody was trying to cover their tracks, so I looked for PID 87883 in /var/log/secure:

[root@oda1db01 oswps]# grep 87883 /var/log/secure
Jul 31 08:35:29 oda1db01 sshd[87883]: Accepted password for root from 10.10.10.34 port 51633 ssh2
Jul 31 08:35:29 oda1db01 sshd[87883]: pam_unix(sshd:session): session opened for user root by (uid=0)
Jul 31 09:22:43 oda1db01 sshd[87883]: pam_unix(sshd:session): session closed for user root

There you have it - someone logged in as root from 10.10.10.34 at 8:35AM, ran the commands, and logged out at 9:22AM.  If we are able to track down who was connected to that IP, then you have everything you need.  Ideally, users should be logging in with their own account and using sudo to run commands - that makes it easier to track who runs what, when.  Needless to say, that configuration is high on this project's list of priorities.  It's set up on most of the systems, but the ODA hadn't made it there yet.

Now that we've found out what happened, how do we recover from this mess?

Leave a Reply

Your email address will not be published.