I heard a strange report from a client late last week - they weren't getting any smart scans on their Exadata - a data warehouse, no less (and we checked to see if cell_offload_processing was set to true). Queries would run normally, but for some reason, the "cell physical IO interconnect bytes returned by smart scan" statistic matched the "cell physical IO bytes eligible for predicate offload" statistic one for one. Also, we saw a strange message about "cell num smart IO sessions using passthru mode due to timezone" being incremented each time we ran a smart scan. There are 3 listed stats for reasons you would get passthru: user, cellsrv, and timezone. We didn't appear to have any differences between the timezones on the hosts, or within the database. Here's some output from SQLT showing the problems:
cell num smart IO sessions using passthru mode due to timezone | 1 |
cell physical IO bytes eligible for predicate offload | 30189412352 |
cell physical IO interconnect bytes | 30189412352 |
cell physical IO interconnect bytes returned by smart scan | 30189412352 |
cell scans | 1 |
At this point, we didn't have much information to go by, so we tried to track things down - first, I wanted to look at the storage servers to see if there was anything weird going on there. I went to look at the alert log on the cell in /var/log/oracle/diag/asm/cell//trace/alert.log. For some reason, it had not been updated on the cells since June. Not sure why, we decided to perform a rolling bounce of cellsrv, which cleared up that problem. I looked through the trace files, but didn't see anything special. From there, we looked inside of cellcli to see if there was something weird going on with IORM or a quarantine:
CellCLI> list iormplan detail
name: cel01_IORMPLAN
catPlan:
dbPlan:
objective: auto
status: active
CellCLI> list quarantine detail
Nothing there, either. In fact, the cells didn't seem to be busy at all, one of the primary reasons for seeing sessions go into passthru mode. At this point, it was time to try to isolate whether the problem was specific to one database, or all of the databases on the machine. I created a test table with a fair amount of rows (6M) and ran a simple "select count(*) from acolvin_test;" to see what would happen.
First, from the problem database:
USER@DUMB> @fsx
Enter value for sql_text:
Enter value for sql_id: 91qy20qp5fsrk
SQL_ID CHILD PLAN_HASH EXECS AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------
91qy20qp5fsrk 0 628493321 1 .95 0 Yes .00 select count(*) from acolvin_test
USER@DUMB> @mystats
Enter value for event_name: %cell%IO%
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes 304193536
cell physical IO bytes saved during optimized file creation 0
cell physical IO bytes saved during optimized RMAN file restore 0
cell physical IO bytes eligible for predicate offload 304160768 <--
cell physical IO bytes saved by storage index 0
cell physical IO bytes sent directly to DB node to balance CPU 0
cell smart IO session cache lookups 0
cell smart IO session cache hits 0
cell smart IO session cache soft misses 0
cell smart IO session cache hard misses 0
cell smart IO session cache hwm 0
cell num smart IO sessions in rdbms block IO due to user 0
cell num smart IO sessions in rdbms block IO due to no cell mem 0
cell num smart IO sessions in rdbms block IO due to big payload 0
cell num smart IO sessions using passthru mode due to user 0
cell num smart IO sessions using passthru mode due to cellsrv 0
cell num smart IO sessions using passthru mode due to timezone 1 <--
cell num smart file creation sessions using rdbms block IO mode 0
cell physical IO interconnect bytes returned by smart scan 304160768 <--
cell simulated physical IO bytes eligible for predicate offload 0
cell simulated physical IO bytes returned by predicate offload 0
cell IO uncompressed bytes 0
As you can see, on the problem database, bytes eligible for offload and bytes returned are the same, along with the passthru metric. Weird, right? The next thing that I wanted to test was whether the same data would give this problem in another database. That way, we could pin the issue down to database-specific, or if it was system wide. I exported the table and imported into another database, and tested things out:
USER@SMART> @fsx
Enter value for sql_text:
Enter value for sql_id: b5qxn5afbdw0s
SQL_ID CHILD PLAN_HASH EXECS AVG_ETIME AVG_PX OFFLOAD IO_SAVED_% SQL_TEXT
------------- ------ ---------- ------ ---------- ------ ------- ---------- ----------------------------------
b5qxn5afbdw0s 0 628493321 1 .25 0 Yes 83.59 select count(*) from acolvin_test
USER@SMART> @mystats
Enter value for event_name: %cell%IO%
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes 49810608
cell physical IO bytes saved during optimized file creation 0
cell physical IO bytes saved during optimized RMAN file restore 0
cell physical IO bytes eligible for predicate offload 303521792 <--
cell physical IO bytes saved by storage index 0
cell physical IO bytes sent directly to DB node to balance CPU 0
cell smart IO session cache lookups 0
cell smart IO session cache hits 0
cell smart IO session cache soft misses 0
cell smart IO session cache hard misses 0
cell smart IO session cache hwm 0
cell num smart IO sessions in rdbms block IO due to user 0
cell num smart IO sessions in rdbms block IO due to no cell mem 0
cell num smart IO sessions in rdbms block IO due to big payload 0
cell num smart IO sessions using passthru mode due to user 0
cell num smart IO sessions using passthru mode due to cellsrv 0
cell num smart IO sessions using passthru mode due to timezone 0 <--
cell num smart file creation sessions using rdbms block IO mode 0
cell physical IO interconnect bytes returned by smart scan 47901872 <--
cell simulated physical IO bytes eligible for predicate offload 0
cell simulated physical IO bytes returned by predicate offload 0
cell IO uncompressed bytes 303808512
Look at that, we get a smart scan! Of course, we can tell that we see a smart scan because the number of bytes returned is lower than the number of bytes eligible for offload. With this information in tow, I ran an exachk against the database, also looking for any nondefault parameters. Nothing special there.
At this point, Kerry got involved, and sent an email to one of his friends inside of Oracle. We were told that it may be due to a database left in a timezone upgrade state - something I hadn't really thought of. This is a new 11.2.0.3 database, and the client most likely hasn't done anything with the timezones. I started digging around on MOS and found a note about timezone upgrades in 11.2 (Updating the RDBMS DST version in 11g Release 2 (11.2.0.1 and up) using DBMS_DST (Doc ID 977512.1)). It gave me a few pointers to see what was going on:
SQL> SELECT PROPERTY_NAME, SUBSTR(property_value, 1, 30) value
2 FROM DATABASE_PROPERTIES
3 WHERE PROPERTY_NAME LIKE 'DST_%'
4 ORDER BY PROPERTY_NAME;
PROPERTY_NAME VALUE
------------------------------ ------------------------------
DST_PRIMARY_TT_VERSION 14
DST_SECONDARY_TT_VERSION 4
DST_UPGRADE_STATE DATAPUMP(1)
When I saw this, I new something was off. The note said to clean up any pending data pump jobs, of which there were 2 that were not running. Even after cleaning those jobs up, the upgrade state still continued to be "DATAPUMP(1)" - which didn't have many hits on Metalink. I did find one more note that would help, though - "DataPump Import Fails With Errors ORA-39097 ORA-39065 ORA-56935 (Doc ID 1307959.1)" - this note mentioned a bug that is still under investigation, along with a workaround:
USER@DUMB> ALTER SESSION SET EVENTS '30090 TRACE NAME CONTEXT FOREVER, LEVEL 32';
Session altered.
USER@DUMB> exec dbms_dst.unload_secondary;
PL/SQL procedure successfully completed.
USER@DUMB> SELECT PROPERTY_NAME, SUBSTR(property_value, 1, 30) value
2 FROM DATABASE_PROPERTIES
3 WHERE PROPERTY_NAME LIKE 'DST_%'
4 ORDER BY PROPERTY_NAME;
PROPERTY_NAME VALUE
------------------------------ ------------------------------
DST_PRIMARY_TT_VERSION 14
DST_SECONDARY_TT_VERSION 0
DST_UPGRADE_STATE NONE
After running these steps, I tried running the smart scan one more time:
USER@DUMB> select count(*) from acolvin_test;
COUNT(*)
----------
3999996
Elapsed: 00:00:00.53
USER@DUMB> @mystats
Enter value for event_name: %cell%IO%
NAME VALUE
---------------------------------------------------------------------- ---------------
cell physical IO interconnect bytes 48326208
cell physical IO bytes saved during optimized file creation 0
cell physical IO bytes saved during optimized RMAN file restore 0
cell physical IO bytes eligible for predicate offload 303702016 <--
cell physical IO bytes saved by storage index 0
cell physical IO bytes sent directly to DB node to balance CPU 0
cell smart IO session cache lookups 0
cell smart IO session cache hits 0
cell smart IO session cache soft misses 0
cell smart IO session cache hard misses 0
cell smart IO session cache hwm 0
cell num smart IO sessions in rdbms block IO due to user 0
cell num smart IO sessions in rdbms block IO due to no cell mem 0
cell num smart IO sessions in rdbms block IO due to big payload 0
cell num smart IO sessions using passthru mode due to user 0
cell num smart IO sessions using passthru mode due to cellsrv 0
cell num smart IO sessions using passthru mode due to timezone 0 <--
cell num smart file creation sessions using rdbms block IO mode 0
cell physical IO interconnect bytes returned by smart scan 47908416 <--
cell simulated physical IO bytes eligible for predicate offload 0
cell simulated physical IO bytes returned by predicate offload 0
cell IO uncompressed bytes 303742976
After this, things started looking much better.
This was an awesome display of team work. Karen, Carlos, Tanel and you all collaborating on the same problem. (not to mention the unnamed Oracle resource who pointed us in the right direction) 🙂 Impressive that you were the one to resolve the issue out of that crowd (even if it was more in your wheel house and Tanel was bouncing between continents for a big chunk of the time). Nice job.