Stuck Timezone Upgrades and Smart Scans

I heard a strange report from a client late last week - they weren't getting any smart scans on their Exadata - a data warehouse, no less (and we checked to see if cell_offload_processing was set to true). Queries would run normally, but for some reason, the "cell physical IO interconnect bytes returned by smart scan" statistic matched the "cell physical IO bytes eligible for predicate offload" statistic one for one. Also, we saw a strange message about "cell num smart IO sessions using passthru mode due to timezone" being incremented each time we ran a smart scan. There are 3 listed stats for reasons you would get passthru: user, cellsrv, and timezone. We didn't appear to have any differences between the timezones on the hosts, or within the database. Here's some output from SQLT showing the problems:

cell num smart IO sessions using passthru mode due to timezone	1
cell physical IO bytes eligible for predicate offload	30189412352
cell physical IO interconnect bytes	30189412352
cell physical IO interconnect bytes returned by smart scan	30189412352
cell scans	1

At this point, we didn't have much information to go by, so we tried to track things down - first, I wanted to look at the storage servers to see if there was anything weird going on there. I went to look at the alert log on the cell in /var/log/oracle/diag/asm/cell//trace/alert.log. For some reason, it had not been updated on the cells since June. Not sure why, we decided to perform a rolling bounce of cellsrv, which cleared up that problem. I looked through the trace files, but didn't see anything special. From there, we looked inside of cellcli to see if there was something weird going on with IORM or a quarantine:

CellCLI> list iormplan detail
	 name:              	 cel01_IORMPLAN
	 catPlan:
	 dbPlan:
	 objective:         	 auto
	 status:            	 active

CellCLI> list quarantine detail

Nothing there, either. In fact, the cells didn't seem to be busy at all, one of the primary reasons for seeing sessions go into passthru mode. At this point, it was time to try to isolate whether the problem was specific to one database, or all of the databases on the machine. I created a test table with a fair amount of rows (6M) and ran a simple "select count(*) from acolvin_test;" to see what would happen.

First, from the problem database:

USER@DUMB> @fsx
Enter value for sql_text:
Enter value for sql_id: 91qy20qp5fsrk

SQL_ID	       CHILD  PLAN_HASH  EXECS	AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT						  
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------
91qy20qp5fsrk	   0  628493321      1	      .95      0 Yes		    .00 select count(*) from acolvin_test 

USER@DUMB> @mystats
Enter value for event_name: %cell%IO%

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes					     304193536
cell physical IO bytes saved during optimized file creation			     0
cell physical IO bytes saved during optimized RMAN file restore 		     0
cell physical IO bytes eligible for predicate offload			     304160768  <--
cell physical IO bytes saved by storage index					     0
cell physical IO bytes sent directly to DB node to balance CPU			     0
cell smart IO session cache lookups						     0
cell smart IO session cache hits						     0
cell smart IO session cache soft misses 					     0
cell smart IO session cache hard misses 					     0
cell smart IO session cache hwm 						     0
cell num smart IO sessions in rdbms block IO due to user			     0
cell num smart IO sessions in rdbms block IO due to no cell mem 		     0
cell num smart IO sessions in rdbms block IO due to big payload 		     0
cell num smart IO sessions using passthru mode due to user			     0
cell num smart IO sessions using passthru mode due to cellsrv			     0
cell num smart IO sessions using passthru mode due to timezone			     1  <--
cell num smart file creation sessions using rdbms block IO mode 		     0
cell physical IO interconnect bytes returned by smart scan		     304160768  <--
cell simulated physical IO bytes eligible for predicate offload 		     0
cell simulated physical IO bytes returned by predicate offload			     0
cell IO uncompressed bytes							     0

As you can see, on the problem database, bytes eligible for offload and bytes returned are the same, along with the passthru metric. Weird, right? The next thing that I wanted to test was whether the same data would give this problem in another database. That way, we could pin the issue down to database-specific, or if it was system wide. I exported the table and imported into another database, and tested things out:

USER@SMART> @fsx
Enter value for sql_text:
Enter value for sql_id: b5qxn5afbdw0s

SQL_ID	       CHILD  PLAN_HASH  EXECS	AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT						  
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------
b5qxn5afbdw0s	   0  628493321      1	      .25      0 Yes	      83.59 select count(*) from acolvin_test 

USER@SMART> @mystats
Enter value for event_name: %cell%IO%

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes					      49810608
cell physical IO bytes saved during optimized file creation			     0
cell physical IO bytes saved during optimized RMAN file restore 		     0
cell physical IO bytes eligible for predicate offload			     303521792  <--
cell physical IO bytes saved by storage index					     0
cell physical IO bytes sent directly to DB node to balance CPU			     0
cell smart IO session cache lookups						     0
cell smart IO session cache hits						     0
cell smart IO session cache soft misses 					     0
cell smart IO session cache hard misses 					     0
cell smart IO session cache hwm 						     0
cell num smart IO sessions in rdbms block IO due to user			     0
cell num smart IO sessions in rdbms block IO due to no cell mem 		     0
cell num smart IO sessions in rdbms block IO due to big payload 		     0
cell num smart IO sessions using passthru mode due to user			     0
cell num smart IO sessions using passthru mode due to cellsrv			     0
cell num smart IO sessions using passthru mode due to timezone			     0  <--
cell num smart file creation sessions using rdbms block IO mode 		     0
cell physical IO interconnect bytes returned by smart scan		      47901872  <--
cell simulated physical IO bytes eligible for predicate offload 		     0
cell simulated physical IO bytes returned by predicate offload			     0
cell IO uncompressed bytes						     303808512

Look at that, we get a smart scan! Of course, we can tell that we see a smart scan because the number of bytes returned is lower than the number of bytes eligible for offload. With this information in tow, I ran an exachk against the database, also looking for any nondefault parameters. Nothing special there.

At this point, Kerry got involved, and sent an email to one of his friends inside of Oracle. We were told that it may be due to a database left in a timezone upgrade state - something I hadn't really thought of. This is a new 11.2.0.3 database, and the client most likely hasn't done anything with the timezones. I started digging around on MOS and found a note about timezone upgrades in 11.2 (Updating the RDBMS DST version in 11g Release 2 (11.2.0.1 and up) using DBMS_DST (Doc ID 977512.1)). It gave me a few pointers to see what was going on:

SQL> SELECT PROPERTY_NAME, SUBSTR(property_value, 1, 30) value
2 FROM DATABASE_PROPERTIES
3 WHERE PROPERTY_NAME LIKE 'DST_%'
4 ORDER BY PROPERTY_NAME;

PROPERTY_NAME		       VALUE
------------------------------ ------------------------------
DST_PRIMARY_TT_VERSION	       14
DST_SECONDARY_TT_VERSION       4
DST_UPGRADE_STATE	       DATAPUMP(1)

When I saw this, I new something was off. The note said to clean up any pending data pump jobs, of which there were 2 that were not running. Even after cleaning those jobs up, the upgrade state still continued to be "DATAPUMP(1)" - which didn't have many hits on Metalink. I did find one more note that would help, though - "DataPump Import Fails With Errors ORA-39097 ORA-39065 ORA-56935 (Doc ID 1307959.1)" - this note mentioned a bug that is still under investigation, along with a workaround:

USER@DUMB> ALTER SESSION SET EVENTS '30090 TRACE NAME CONTEXT FOREVER, LEVEL 32';

Session altered.

USER@DUMB> exec dbms_dst.unload_secondary;

PL/SQL procedure successfully completed.

USER@DUMB> SELECT PROPERTY_NAME, SUBSTR(property_value, 1, 30) value
2 FROM DATABASE_PROPERTIES
3 WHERE PROPERTY_NAME LIKE 'DST_%'
4 ORDER BY PROPERTY_NAME;

PROPERTY_NAME		       VALUE
------------------------------ ------------------------------
DST_PRIMARY_TT_VERSION	       14
DST_SECONDARY_TT_VERSION       0
DST_UPGRADE_STATE	       NONE

After running these steps, I tried running the smart scan one more time:

USER@DUMB> select count(*) from acolvin_test;

COUNT(*)
----------
3999996

Elapsed: 00:00:00.53
USER@DUMB> @mystats
Enter value for event_name: %cell%IO%

NAME										 VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes					      48326208
cell physical IO bytes saved during optimized file creation			     0
cell physical IO bytes saved during optimized RMAN file restore 		     0
cell physical IO bytes eligible for predicate offload			     303702016  <--
cell physical IO bytes saved by storage index					     0
cell physical IO bytes sent directly to DB node to balance CPU			     0
cell smart IO session cache lookups						     0
cell smart IO session cache hits						     0
cell smart IO session cache soft misses 					     0
cell smart IO session cache hard misses 					     0
cell smart IO session cache hwm 						     0
cell num smart IO sessions in rdbms block IO due to user			     0
cell num smart IO sessions in rdbms block IO due to no cell mem 		     0
cell num smart IO sessions in rdbms block IO due to big payload 		     0
cell num smart IO sessions using passthru mode due to user			     0
cell num smart IO sessions using passthru mode due to cellsrv			     0
cell num smart IO sessions using passthru mode due to timezone			     0  <--
cell num smart file creation sessions using rdbms block IO mode 		     0
cell physical IO interconnect bytes returned by smart scan		      47908416  <--
cell simulated physical IO bytes eligible for predicate offload 		     0
cell simulated physical IO bytes returned by predicate offload			     0
cell IO uncompressed bytes						     303742976

After this, things started looking much better.

One thought on “Stuck Timezone Upgrades and Smart Scans”

Kerry Osborne September 18, 2013

This was an awesome display of team work. Karen, Carlos, Tanel and you all collaborating on the same problem. (not to mention the unnamed Oracle resource who pointed us in the right direction) 🙂 Impressive that you were the one to resolve the issue out of that crowd (even if it was more in your wheel house and Tanel was bouncing between continents for a big chunk of the time). Nice job.

Reply ↓

One thought on “Stuck Timezone Upgrades and Smart Scans”

Leave a Reply Cancel reply