Exadata Cell Patching – USB I/O Errors

By | February 5, 2017

I was playing around with the Exadata X2-2 in the Enkitec lab this weekend, and hit an interesting issue when patching the storage servers.  We were taking the system up to version 12.1.2.3.3 for testing purposes.  I fired off the patchmgr script, and one of the storage servers failed when beginning the first phase of the patching cycle:

[root@enkdb03 patch_12.1.2.3.3.161208]# ./patchmgr -cells cell_group -patch -ignore_alerts
********************************************************************************
NOTE Cells will reboot during the patch or rollback process.
NOTE For non-rolling patch or rollback, ensure all ASM instances using
NOTE the cells are shut down for the duration of the patch or rollback.
NOTE For rolling patch or rollback, ensure all ASM instances using
NOTE the cells are up for the duration of the patch or rollback.

WARNING Do not start more than one instance of patchmgr.
WARNING Do not interrupt the patchmgr session.
WARNING Do not alter state of ASM instances during patch or rollback.
WARNING Do not resize the screen. It may disturb the screen layout.
WARNING Do not reboot cells or alter cell services during patch or rollback.
WARNING Do not open log files in editor in write mode or try to alter them.

NOTE All time estimates are approximate.
NOTE You may interrupt this patchmgr run in next 60 seconds with CONTROL-c.
********************************************************************************

2017-02-05 17:10:24 -0600 [INFO] Disabling /var/log/cellos cleanup on this node for the duration of the patchmgr session.

2017-02-05 17:11:27 -0600 :Working: DO: Check cells have ssh equivalence for root user. Up to 10 seconds per cell ...
2017-02-05 17:11:28 -0600 :SUCCESS: DONE: Check cells have ssh equivalence for root user.
2017-02-05 17:11:34 -0600 :Working: DO: Initialize files, check space and state of cell services. Up to 1 minute ...
2017-02-05 17:12:03 -0600 :SUCCESS: DONE: Initialize files, check space and state of cell services.
2017-02-05 17:12:03 -0600 :Working: DO: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction. Up to 40 minutes ...
2017-02-05 17:12:20 -0600 Wait correction of degraded md11 due to md partner size mismatch. Up to 30 minutes.
2017-02-05 17:12:23 -0600 :SUCCESS: DONE: Copy, extract prerequisite check archive to cells. If required start md11 mismatched partner size correction.
2017-02-05 17:12:23 -0600 :Working: DO: Check prerequisites on all cells. Up to 2 minutes ...
2017-02-05 17:13:21 -0600 :SUCCESS: DONE: Check prerequisites on all cells.
2017-02-05 17:13:21 -0600 :Working: DO: Copy the patch to all cells. Up to 3 minutes ...
2017-02-05 17:13:58 -0600 :SUCCESS: DONE: Copy the patch to all cells.
2017-02-05 17:14:00 -0600 :Working: DO: Execute plugin check for Patch Check Prereq ...
2017-02-05 17:14:00 -0600 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 22909764 v1.0. Details in logfile /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stdout.
2017-02-05 17:14:00 -0600 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 17854520 v1.3. Details in logfile /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stdout.
2017-02-05 17:14:00 -0600 :SUCCESS: No exposure to bug 17854520 with non-rolling patching
2017-02-05 17:14:00 -0600 :INFO: Patchmgr plugin start: Prereq check for exposure to bug 22468216 v1.0. Details in logfile /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stdout.
2017-02-05 17:14:01 -0600 :SUCCESS: Patchmgr plugin complete: Prereq check passed for the bug 22468216
2017-02-05 17:14:01 -0600 :INFO : Patchmgr plugin start: Prereq check for exposure to bug 24625612 v1.0.
2017-02-05 17:14:01 -0600 :INFO : Details in logfile /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stdout.
2017-02-05 17:14:02 -0600 :SUCCESS: Patchmgr plugin complete: Prereq check passed for the bug 24625612
2017-02-05 17:14:02 -0600 :SUCCESS: DONE: Execute plugin check for Patch Check Prereq.
2017-02-05 17:14:07 -0600 1 of 5 :Working: DO: Initiate patch on cells. Cells will remain up. Up to 5 minutes ...
2017-02-05 17:14:19 -0600 1 of 5 :SUCCESS: DONE: Initiate patch on cells.
2017-02-05 17:14:19 -0600 2 of 5 :Working: DO: Waiting to finish pre-reboot patch actions. Cells will remain up. Up to 45 minutes ...
2017-02-05 17:15:19 -0600 Wait for patch pre-reboot procedures

TIMEOUT for following cells0600 Minutes left 045

enkcel06: enkcel06 192.168.8.221 2017-02-05 17:16:24 -0600: [INFO] Patch is in state: init. 2017-02-05 17:16:25 -0600
FAILED With Exception for following cells

enkcel04: enkcel04 192.168.8.219 2017-02-05 17:16:23 -0600: 2017-02-05 17:16:25 -0600
enkcel05: enkcel05 192.168.8.220 2017-02-05 17:16:23 -0600: 2017-02-05 17:16:25 -0600
2017-02-05 17:16:25 -0600 2 of 5 :FAILED: Details in files <cell_name>.log /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stdout, /u01/stage/patches/12.1.2.3.3/patch_12.1.2.3.3.161208/patchmgr.stderr
2017-02-05 17:16:25 -0600 2 of 5 :FAILED: DONE: Waiting to finish pre-reboot patch actions.
[ERROR] Patch failed. Please run cleanup before retrying.

Looking at the final lines,  you can see that enkcel04 and enkcel05 seem to be tripping things up.  The first thing to do is look in the enkcel04.log and enkcel05.log files found in the patch's working directory.  Sure enough, I see the error:

[root@enkdb03 patch_12.1.2.3.3.161208]# tail -20 enkcel05.log
enkcel05: [0m Copying ./cellbits/debugos.tbz to /mnt/usb.make.cellboot/./cellbits ...
enkcel05: _EXIT_EXCEPTION_Cell enkcel05 192.168.8.220 2017-02-05 17:15:20 -0600:
enkcel05:
enkcel05: 2017-02-05 17:16:22 -0600 [INFO]: patchmgr launch attempt from enkdb03.enkitec.com_192.168.8.217_u01_stage_patches_12.1.2.3.3_patch_12.1.2.3.3.161208.
enkcel05: 2017-02-05 17:16:22 -0600 [INFO]: dostep called: 6ipw patch non_rolling 900 3600 900 36000 default
enkcel05: [40;36m[INFO ]
enkcel05: [0m Copying ./cellbits/cellrpms.tbz to /mnt/usb.make.cellboot/./cellbits ...
enkcel05: [40;36m[INFO ]
enkcel05: [0m Copying ./cellbits/exaos.tbz to /mnt/usb.make.cellboot/./cellbits ...
enkcel05: [40;36m[INFO ]
enkcel05: [0m Copying ./cellbits/commonos.tbz to /mnt/usb.make.cellboot/./cellbits ...
enkcel05: [40;36m[INFO ]
enkcel05: [0m Copying ./cellbits/cell.bin to /mnt/usb.make.cellboot/./cellbits ...
enkcel05: _EXIT_EXCEPTION_Cell enkcel05 192.168.8.220 2017-02-05 17:16:23 -0600:
enkcel05: device node not found
enkcel05: device node not found
enkcel05: [EXCEPTION] CELLBOOT USB has IO errors. Replace the CELLBOOT USB before retrying patch.
enkcel05: Cell enkcel05 192.168.8.220
enkcel05: _EXIT_EXCEPTION_Cell enkcel05 192.168.8.220 2017-02-05 17:16:23 -0600:
enkcel05: enkcel05 192.168.8.220 2017-02-05 17:16:23 -0600: 2017-02-05 17:16:25 -0600

From looking at the output, the upgrade script was in the middle of recreating the backup USB device before patching when it complained about I/O errors on the device.  Like any good administrator should, I ran an exachk on the system the day before patching, and ran the prerequisite check right before firing off the patch.  Both of those items passed without complaining about the USB device.  Looking closer at the host, the USB device seems ok:

[root@enkcel04 ~]# imageinfo

Kernel version: 2.6.39-400.243.1.el6uek.x86_64 #1 SMP Wed Nov 26 09:15:35 PST 2014 x86_64
Cell version: OSS_12.1.2.1.0_LINUX.X64_141206.1
Cell rpm version: cell-12.1.2.1.0_LINUX.X64_141206.1-1.x86_64

Active image version: 12.1.2.1.0.141206.1
Active image activated: 2015-01-23 13:11:39 -0600
Active image status: success
Active system partition on device: /dev/md6
Active software partition on device: /dev/md8

Cell boot usb partition: /dev/sdac1
Cell boot usb version: 12.1.2.1.0.141206.1

Inactive image version: 12.1.2.3.3.161208
Inactive system partition on device: /dev/md5
Inactive software partition on device: /dev/md7

Inactive marker for the rollback: /boot/I_am_hd_boot.inactive
Inactive grub config for the rollback: /boot/grub/grub.conf.inactive
Inactive kernel version for the rollback: 2.6.39-400.128.17.el5uek
Rollback to the inactive partitions: Impossible

I tried to recreate the USB device, and it passed through clearly.  There wasn't anything interesting in the log files found in /root/_cellupd_dpullec_/_p_ on the cells.  Finally, I checked on MOS, and found 'Bug 25466351 : [EXCEPTION] CELLBOOT USB HAS IO ERRORS. REPLACE THE CELLBOOT USB BEFORE RETRYING" - this looks more like it!  As it turns out, this is due to information pulled from the current logs during the patchmgr run.  Running "logrotate -f /etc/logrotate.conf" was all that needed to be done to bypass this.  After rotating the logs, I reran patchmgr and everything went through without a hitch.

Leave a Reply

Your email address will not be published.