If you've run an exachk report, you may have seen the following message with regard to your databases:
Status | Type | Message | Status On | Details |
---|---|---|---|---|
FAIL | Database Check | Database parameter CLUSTER_INTERCONNECTS is NOT set to the recommended value | db01:dbm011, db02:dbm012 | View |
This check is commonly seen when a database is created on Exadata without using the custom "Exadata" templates included with the database creation assistant. These customized templates include a multitude of recommended parameter settings found in MOS note #1274318.1 (Exadata Setup/Configuration Best Practices) - one of which is the CLUSTER_INTERCONNECTS parameter. This parameter determines which IP addresses are used for communication between database instances in the cluster. If left unset, the instance will default to the high availability IP addresses (HAIP) on the interfaces defined by Grid Infrastructure to host the cluster interconnect.
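A quick way to see which addresses an instance is actually using for the interconnect is the V$CLUSTER_INTERCONNECTS view; the SOURCE column indicates where each address came from, so an instance running on HAIP will show 169.254 addresses rather than the static private addresses. For example:
SQL> select name, ip_address, is_public, source from v$cluster_interconnects;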
What is HAIP?
HAIP is a feature that was introduced back in 11.2.0.2 which allows administrators to utilize multiple network interfaces for the cluster interconnect without needing to configure any kind of bonding. Interfaces identified within clusterware to be used for the interconnect will automatically receive IP addresses when clusterware starts. These IP addresses fall within the familiar 169.254.0.0/16 space, which is most often seen when a DHCP interface is unable to acquire an address. Because each of the interfaces receives an IP, HAIP allows for an easy active/active cluster interconnect configuration without needing to configure host- and switch-based bonding with LACP. On Exadata, Oracle recommends not using this feature, hence the finding on the exachk report seen above. There is no way to disable the feature - the only way to ensure that it is not used is to manually set the CLUSTER_INTERCONNECTS parameter.
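If you want to see HAIP in action on a cluster, the addresses are managed by the ora.cluster_interconnect.haip resource in the lower (init) clusterware stack, and oifcfg shows which interfaces clusterware has marked for the interconnect. A quick sketch, run as the Grid Infrastructure owner (GRID_HOME is just a placeholder for your Grid Infrastructure home):
# which interfaces are designated for the cluster interconnect
$GRID_HOME/bin/oifcfg getif
# the clusterware resource that manages the HAIP addresses
$GRID_HOME/bin/crsctl stat res ora.cluster_interconnect.haip -init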
HAIP and Exadata
Like many Oracle features that can cause issues (I'm looking at you, automatic memory management), things will run fine for a long time - until they hit a breaking point. We have seen cases where a database with an unset CLUSTER_INTERCONNECTS parameter runs fine for months or years, but when it finally fails, it's a noticeable failure.
In one case, we had a client with an X3-8 full rack running 12.1.0.2 with the April 2015 patch release. I received an email saying that one day, some of the database instances had crashed across the cluster and would not restart. The surviving instances continued to run without issue, but the failures were spread across the two nodes. The email included a chart like this (these aren't the real database names):
Database | Node 1 | Node 2 |
---|---|---|
DW | UP | UP |
HR | DOWN | UP |
OID | UP | DOWN |
HIST | UP | DOWN |
STG | UP | UP |
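To get this kind of picture from the cluster itself, srvctl reports where each instance of a database is running. A quick sketch against the HR database from the table above (the database and node names are the placeholders used throughout this post):
$ srvctl status database -d hr
Instance hr1 is not running on node db01
Instance hr2 is running on node db02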
There weren't any changes made to the databases in question - they had been patched and upgraded six weeks earlier and had been running without any issues until this day. Naturally, the first thing I did when I heard about the problem was to try to start one of the down instances. Here's what I got back when the instance refused to start:
PRCR-1013 : Failed to start resource ora.hr.db
PRCR-1064 : Failed to start resource ora.hr.db on node db01
CRS-5017: The resource action "ora.hr.db start" encountered the following error:
ORA-03113: end-of-file on communication channel
Process ID: 0
Session ID: 0 Serial number: 0
. For details refer to "(:CLSN00107" in "/u01/app/grid/diag/crs/db01/crs/trace/crsd_oraagent_oracle.trc".
CRS-2674: Start of 'ora.hr.db' on 'db01' failed
How about that for an explanation? Here's what was in the alert log of the instance that failed to start; this section is from when it was attempting to negotiate with the other instance:
Fri Jun 26 13:52:46 2015
* Load Monitor used for high load check
* New Low - High Load Threshold Range = [153600 - 204800]
Fri Jun 26 13:52:46 2015
Reconfiguration started (old inc 0, new inc 12)
List of instances (total 2) :
1 2
My inst 1 (I'm a new instance)
Global Resource Directory frozen
* allocate domain 0, invalid = TRUE
Communication channels reestablished
Fri Jun 26 13:52:47 2015
* domain 0 valid = 1 according to instance 2
Fri Jun 26 13:52:48 2015
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Fri Jun 26 13:52:48 2015
Fri Jun 26 13:52:48 2015
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fri Jun 26 14:01:47 2015
LMD0 (ospid: 119535) waits for event 'process diagnostic dump' for 1 secs.
Fri Jun 26 14:01:52 2015
LMS0 (ospid: 119579) received an instance eviction notification from instance 2 [2]
Fri Jun 26 14:01:52 2015
LMON received an instance eviction notification from instance 2
The instance eviction reason is 0x2
The instance eviction map is 1
Errors in file /u01/app/oracle/diag/rdbms/hr/hr1/trace/hr1_lmhb_120095.trc (incident=1004937):
ORA-29770: global enqueue process LMD0 (OSID 119535) is hung for more than 70 seconds
Incident details in: /u01/app/oracle/diag/rdbms/hr/hr1/incident/incdir_1004937/hr1_lmhb_120095_i1004937.trc
Fri Jun 26 14:01:54 2015
Received an instance abort message from instance 2
That gives a little more information - it looks like the instance is being shut down by the other node before it can finish starting. Maybe the running instance can tell us why it's kicking the new instance out. Here is the same reconfiguration section from the alert log of the instance that was running fine:
Reconfiguration started (old inc 14, new inc 16)
List of instances (total 2) :
1 2
New instances (total 1) :
1
My inst 2
Global Resource Directory frozen
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
LMS 4: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 1: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 2: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 6: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 0: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 5: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
LMS 3: 0 GCS shadows cancelled, 0 closed, 0 Xw survived
Set master node info
Submitted all remote-enqueue requests
Dwn-cvts replayed, VALBLKs dubious
All grantable enqueues granted
Submitted all GCS remote-cache requests
Fix write in gcs resources
Fri Jun 26 14:02:54 2015
Reconfiguration complete (total time 6.8 secs)
Fri Jun 26 14:03:54 2015
Increasing number of real time LMS from 0 to 7
Fri Jun 26 14:05:16 2015
LMS0 (ospid: 34439) has detected no messaging activity from instance 1
LMS0 (ospid: 34439) issues an IMR to resolve the situation
Please check LMS0 trace file for more detail.
Fri Jun 26 14:05:16 2015
Communications reconfiguration: instance_number 1 by ospid 34439
Fri Jun 26 14:06:15 2015
Evicting instance 1 from cluster
Waiting for instances to leave: 1
Fri Jun 26 14:06:15 2015
Dumping diagnostic data in directory=[cdmp_20150626140615], requested by (instance=1, osid=30058 (LMS0)), summary=[abnormal instance termination].
Now we are getting somewhere - it looks like there is a cluster communication issue between the nodes. Everything else seemed fine, but I noticed the following in the alert log of the failing instance as it was starting up:
Fri Jun 26 13:52:36 2015
Cluster communication is configured to use the following interface(s) for this instance
169.254.12.191
169.254.65.68
169.254.177.146
169.254.214.117
cluster interconnect IPC version: Oracle RDS/IP (generic)
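Those 169.254 addresses are the HAIP addresses, which show up on the compute nodes as secondary (labeled) addresses layered on top of the static InfiniBand addresses. A quick way to see both on a node, assuming the private interfaces are named ib0 through ib3 (names vary by model, and the output below is illustrative):
$ ip addr show ib0 | grep 'inet '
    inet 172.16.0.185/24 brd 172.16.0.255 scope global ib0
    inet 169.254.12.191/18 brd 169.254.63.255 scope global ib0:1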
The instance was attempting to use the HAIP addresses instead of the static IP addresses configured on the InfiniBand interfaces. Knowing that this was not the recommended setting, and knowing that I had seen MOS notes in the past about issues when using HAIP on Exadata, I looked at the CLUSTER_INTERCONNECTS setting on the database:
SQL> show parameter interconnect
NAME TYPE VALUE
------------------------------------ ----------- ------------------------------
cluster_interconnects string
As expected, the parameter was not set. We set the parameter for each of the instances via the spfile:
SQL> alter system set cluster_interconnects='172.16.0.185:172.16.0.186:172.16.0.187:172.16.0.188' sid='hr1' scope=spfile;
System altered.
SQL> alter system set cluster_interconnects='172.16.0.189:172.16.0.190:172.16.0.191:172.16.0.192' sid='hr2' scope=spfile;
System altered.
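CLUSTER_INTERCONNECTS is a static parameter, which is why scope=spfile is required; the new values only take effect at the next instance startup. Before restarting, the spfile entries can be double-checked from SQL*Plus, for example:
SQL> show spparameter cluster_interconnects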
After setting the parameter, we were able to bring up the instance without any issues:
SQL> startup nomount
ORA-32004: obsolete or deprecated parameter(s) specified for RDBMS instance
ORACLE instance started.
Total System Global Area 1.7180E+10 bytes
Fixed Size 5304248 bytes
Variable Size 4889346120 bytes
Database Buffers 1.2147E+10 bytes
Redo Buffers 138514432 bytes
SQL>
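Once the instances come back up with the parameter in place, the alert log section shown earlier changes accordingly - instead of the 169.254 HAIP addresses, you would expect to see something along these lines (an illustrative excerpt based on the addresses set above):
Cluster communication is configured to use the following interface(s) for this instance
172.16.0.185
172.16.0.186
172.16.0.187
172.16.0.188
cluster interconnect IPC version: Oracle RDS/IP (generic)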
While it is interesting to see the adverse effects of leaving this parameter unset, the outage could have been avoided in the first place if the exachk failures had been recognized and addressed. The exachk script has a very robust and ever-changing list of checks, and this is a good example of why it should be run regularly.
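If you don't already run it on a schedule, exachk can be kicked off manually from any compute node; a minimal sketch, assuming it has been staged under /opt/oracle.SupportTools/exachk (the install location and options vary by exachk version):
$ cd /opt/oracle.SupportTools/exachk
$ ./exachk -a    # run the best practice and recommended patch checks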
I have encountered another issue when using HAIP on Exadata: when rolling patching the InfiniBand switches, HAIP failover was delayed by more than a minute while a switch rebooted.