HAIP and Exadata

July 1, 2015

If you've run an exachk report, you may have seen the following message with regard to your databases:

Status:    FAIL
Type:      Database Check
Message:   Database parameter CLUSTER_INTERCONNECTS is NOT set to the recommended value
Status On: db01:dbm011, db02:dbm012

This check commonly fails when a database is created on Exadata without using the custom "Exadata" templates included with the Database Configuration Assistant (DBCA).  These customized templates include a multitude of recommended parameter settings found in MOS note #1274318.1 (Exadata Setup/Configuration Best Practices) - one of which is the CLUSTER_INTERCONNECTS parameter.  This parameter determines which IP addresses will be used for communication between database instances in the cluster.  If left unset, the instance will default to the high availability IP addresses (HAIP) on the interfaces that Grid Infrastructure has defined to host the cluster interconnect.
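
If you want to see which addresses each instance is actually using for the interconnect - and where that setting came from - the GV$CLUSTER_INTERCONNECTS view will show it. A minimal sketch; output will obviously vary by system:

SQL> -- one row per interface per instance; SOURCE shows whether the address
SQL> -- came from clusterware/the OS or from the cluster_interconnects parameter
SQL> select inst_id, name, ip_address, source
  2  from gv$cluster_interconnects
  3  order by inst_id, name;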

What is HAIP?

HAIP is a feature introduced back in 11.2.0.2 that allows administrators to use multiple network interfaces for the cluster interconnect without needing to configure any kind of bonding.  Interfaces identified within clusterware as interconnect interfaces automatically receive IP addresses when clusterware starts.  These addresses fall within the familiar 169.254.0.0/16 space, which is otherwise mostly seen when a DHCP interface is unable to acquire an address.  Because each of the interfaces receives an IP, HAIP allows for an easy active/active cluster interconnect configuration without host- and switch-based bonding such as LACP.  On Exadata, Oracle recommends not using this feature, hence the finding in the exachk report above.  There is no way to disable the feature - the only way to ensure that it is not used is to manually set the CLUSTER_INTERCONNECTS parameter.
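
To see HAIP in action on a running cluster, you can check which interfaces clusterware has registered for the interconnect and then look for the 169.254 aliases it brings up on them. A rough sketch - the GRID_HOME path and interface names are assumptions and will differ per system:

# interfaces registered with clusterware and their roles
# (public vs. cluster_interconnect)
$GRID_HOME/bin/oifcfg getif

# the HAIP addresses show up as 169.254.x.x aliases on the
# cluster_interconnect interfaces
ip addr | grep 169.254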

HAIP and Exadata

Like many Oracle features that can cause issues (I'm looking at you, automatic memory management), things will run fine for a long time until they hit the breaking point.  We have seen cases where a database with an unset CLUSTER_INTERCONNECTS parameter runs fine for months or years, but when it finally fails, the failure is a noticeable one.

In one case, we had a client with an X3-8 full rack running 12.1.0.2 with the April 2015 patch release.  I received an email saying that one day, some of the database instances across the cluster crashed and would not restart.  For each affected database, the other instance continued to run without issue, but the failures were spread across the two nodes.  The email included a chart like this (these aren't the real database names):

Database   Node 1   Node 2
DW         UP       UP
HR         DOWN     UP
OID        UP       DOWN
HIST       UP       DOWN
STG        UP       UP

There hadn't been any changes made to the databases in question - they had been patched and upgraded six weeks earlier, and had been running without any issues until this day.  Naturally, the first thing I did when I heard about the problem was to try to start one of the down instances.  Here's what came back from the start attempt for the instance that would not start:

PRCR-1013 : Failed to start resource ora.hr.db
PRCR-1064 : Failed to start resource ora.hr.db on node db01
CRS-5017: The resource action "ora.hr.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107" in "/u01/app/grid/diag/crs/db01/crs/trace/crsd_oraagent_oracle.trc".

CRS-2674: Start of 'ora.hr.db' on 'db01' failed

How about that for an explanation? Here's what was in the alert log for the instance that failed to start. This section is from when it was attempting to negotiate with the other instance:

Fri Jun 26 13:52:46 2015
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [153600 - 204800]
Fri Jun 26 13:52:46 2015
Reconfiguration started (old inc 0, new inc 12)
List of instances (total 2) :
 1 2
My inst 1 (I'm a new instance)
 Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
 Communication channels reestablished
Fri Jun 26 13:52:47 2015
 * domain 0 valid = 1 according to instance 2
Fri Jun 26 13:52:48 2015
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
Fri Jun 26 14:01:47 2015
LMD0 (ospid: 119535) waits for event 'process diagnostic dump' for 1 secs.
Fri Jun 26 14:01:52 2015
LMS0 (ospid: 119579) received an instance eviction notification from instance 2 [2]
Fri Jun 26 14:01:52 2015
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x2
The instance eviction map is 1
Errors in file /u01/app/oracle/diag/rdbms/hr/hr1/trace/hr1_lmhb_120095.trc  (incident=1004937):
ORA-29770: global enqueue process LMD0 (OSID 119535) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/hr/hr1/incident/incdir_1004937/hr1_lmhb_120095_i1004937.trc
Fri Jun 26 14:01:54 2015
Received an instance abort message from instance 2

That gives a little more information - it looks like the starting instance is being shut down by the other node before it can finish coming up. Maybe the running instance can tell us why it's kicking the new instance out. Here is the same reconfiguration section from the instance that was running fine:

Reconfiguration started (old inc 14, new inc 16)
List of instances (total 2) :
 1 2
New instances (total 1) :
 1
My inst 2
 Global Resource Directory frozen
 Communication channels reestablished
 Master broadcasted resource hash value bitmaps
 Non-local Process blocks cleaned out
 LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
 Set master node info
 Submitted all remote-enqueue requests
 Dwn-cvts replayed, VALBLKs dubious
 All grantable enqueues granted
 Submitted all GCS remote-cache requests
 Fix write in gcs resources
Fri Jun 26 14:02:54 2015
Reconfiguration complete (total time 6.8 secs)
Fri Jun 26 14:03:54 2015
Increasing number of real time LMS from 0 to 7
Fri Jun 26 14:05:16 2015
LMS0 (ospid: 34439) has detected no messaging activity from instance 1
LMS0 (ospid: 34439) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Jun 26 14:05:16 2015
Communications reconfiguration: instance_number 1 by ospid 34439
Fri Jun 26 14:06:15 2015
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Fri Jun 26 14:06:15 2015
Dumping diagnostic data in directory=[cdmp_20150626140615], requested by (instance=1, osid=30058 (LMS0)), summary=[abnormal instance termination].

Now we are getting somewhere. It looks like there is a cluster communication issue between the nodes. Everything else seemed OK, but I noticed the following in the alert log of the instance that was starting up:

Fri Jun 26 13:52:36 2015
Cluster communication is configured to use the following interface(s) for this instance
  169.254.12.191
  169.254.65.68
  169.254.177.146
  169.254.214.117
cluster interconnect IPC version: Oracle RDS/IP (generic)

The instance was attempting to use the HAIP addresses instead of the static IP addresses configured on the InfiniBand interfaces. Knowing that this was not the recommended setting, and knowing that I had seen MOS notes in the past about issues when using HAIP on Exadata, I looked at the CLUSTER_INTERCONNECTS setting on the database:

SQL> show parameter interconnect

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects                string

As expected, the parameter was not set. We set the parameter for each of the instances in the spfile:

SQL> alter system set cluster_interconnects='172.16.0.185:172.16.0.186:172.16.0.187:172.16.0.188' sid='hr1' scope=spfile;

System altered.

SQL> alter system set cluster_interconnects='172.16.0.189:172.16.0.190:172.16.0.191:172.16.0.192' sid='hr2' scope=spfile;

System altered.
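
The addresses used above are the static IPs already assigned to the InfiniBand interfaces on each node - one address per private interface, per node. A rough way to collect them (the interface names ib0 through ib3 are an assumption here; check oifcfg getif for the actual names on your system):

# run on each node; lists the IPv4 addresses on the private IB interfaces,
# filtering out any 169.254.x.x HAIP aliases that may also be present
for i in ib0 ib1 ib2 ib3; do
  ip -4 addr show $i | awk '/inet / {print $2}' | grep -v '^169\.254\.'
done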

After setting the parameter, we were able to bring up the instance without any issues:

SQL> startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.

Total System Global Area 1.7180E+10 bytes
Fixed Size                  5304248 bytes
Variable Size            4889346120 bytes
Database Buffers         1.2147E+10 bytes
Redo Buffers              138514432 bytes
SQL> 
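
Once the instance is back up, it's worth a quick sanity check that the parameter actually took effect - the alert log should now list the static 172.16 addresses instead of the 169.254 HAIP ones, and the view from earlier should report the parameter as the source of the addresses:

SQL> show parameter cluster_interconnects

SQL> select inst_id, ip_address, source from gv$cluster_interconnects;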

While it is interesting to see the adverse effects of leaving this parameter unset, the outage could have been avoided in the first place if the exachk failure had been recognized and addressed.  exachk has a robust and ever-changing list of checks, and this is a good example of why it should be run regularly and its findings reviewed.
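
As a point of reference, running exachk is straightforward - stage the latest version from MOS on the first database node and kick it off. The path below is just an example, and flags vary a bit between exachk releases:

# -a runs the full set of best practice and patch checks and produces
# the HTML report containing findings like the one at the top of this post
cd /opt/oracle.SupportTools/exachk    # or wherever exachk was staged
./exachk -a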

2 thoughts on “HAIP and Exadata”

  1. Anisur

    I appreciate and like this 12c/Exadata advisory on the CLUSTER_INTERCONNECTS parameter. One of the more valuable solutions with regard to 12c on Exadata.

  2. Daniel

    I have encountered another issue when using HAIP on Exadata.
    When applying rolling patches to the IB switches, HAIP failover was delayed by more than a minute while a switch rebooted.

