I was tasked a while back with modifying the IP addresses assigned to the InfiniBand network on an X6-2 rack. We were configuring a ZFS appliance to be shared between two Exadata clusters, and we wanted to keep them on separate network ranges. In this case, we were moving the interfaces from 192.168.10.x/22 (the Exadata default network range) to 192.168.14.x/22. The process is documented very well by Oracle in the Exadata Maintenance Guide, but after going through the procedure, I hit an issue. I could start CRS on one node, but any additional node would fail to start after a 10-minute timeout:
[root@db02 ~]# crsctl start crs -wait
CRS-4123: Starting Oracle High Availability Services-managed resources
CRS-2672: Attempting to start 'ora.mdnsd' on 'db02'
CRS-2672: Attempting to start 'ora.evmd' on 'db02'
CRS-2676: Start of 'ora.mdnsd' on 'db02' succeeded
CRS-2676: Start of 'ora.evmd' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.gpnpd' on 'db02'
CRS-2676: Start of 'ora.gpnpd' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.gipcd' on 'db02'
CRS-2676: Start of 'ora.gipcd' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.drivers.acfs' on 'db02'
CRS-2676: Start of 'ora.drivers.acfs' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.cssdmonitor' on 'db02'
CRS-2676: Start of 'ora.cssdmonitor' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.cssd' on 'db02'
CRS-2672: Attempting to start 'ora.diskmon' on 'db02'
CRS-2676: Start of 'ora.diskmon' on 'db02' succeeded
CRS-2676: Start of 'ora.cssd' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.cluster_interconnect.haip' on 'db02'
CRS-2672: Attempting to start 'ora.ctssd' on 'db02'
CRS-5021: Check of storage failed: details at "(:CLSN00117:)" in "/u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc"
CRS-2676: Start of 'ora.ctssd' on 'db02' succeeded
CRS-2676: Start of 'ora.cluster_interconnect.haip' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.asm' on 'db02'
CRS-2676: Start of 'ora.asm' on 'db02' succeeded
CRS-2672: Attempting to start 'ora.storage' on 'db02'
ORA-27300: OS system dependent operation:sslssunreghdlr failed with status: 0
ORA-27301: OS failure message: Error 0
ORA-27302: failure occurred at: sskgpreset1
CRS-2883: Resource 'ora.storage' failed during Clusterware stack start.
CRS-4406: Oracle High Availability Services synchronous start failed.
CRS-4000: Command Start failed, or completed with errors.
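For anyone working through a similar failure, crsctl stat res -t -init (run as root) lists the lower-stack, OHASD-managed resources and their current states, which makes it easy to confirm how far the stack got - in this case everything up to ora.asm came online and ora.storage was the piece that never started:
[root@db02 ~]# crsctl stat res -t -init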
It was strange - I could start CRS on any node in the cluster, but it would only ever run on a single node. As soon as I tried to start CRS on a second node, I'd get the same error every time. I checked the CRS alert log, but it didn't tell me much:
2018-05-15 16:29:04.864 [OCSSD(94878)]CRS-1713: CSSD daemon is started in hub mode
2018-05-15 16:30:15.237 [OCSSD(94878)]CRS-1707: Lease acquisition for node db02 number 3 completed
2018-05-15 16:30:16.337 [OCSSD(94878)]CRS-1605: CSSD voting file is online: o/192.168.14.11;192.168.14.12/DBFS_DG_CD_02_cel02; details in /u01/app/oracle/diag/crs/db02/crs/trace/ocssd.trc.
2018-05-15 16:30:16.400 [OCSSD(94878)]CRS-1605: CSSD voting file is online: o/192.168.14.13;192.168.14.14/DBFS_DG_CD_02_cel03; details in /u01/app/oracle/diag/crs/db02/crs/trace/ocssd.trc.
2018-05-15 16:30:16.760 [OCSSD(94878)]CRS-1605: CSSD voting file is online: o/192.168.14.15;192.168.14.16/DBFS_DG_CD_02_cel04; details in /u01/app/oracle/diag/crs/db02/crs/trace/ocssd.trc.
2018-05-15 16:30:20.771 [OCSSD(94878)]CRS-1601: CSSD Reconfiguration complete. Active nodes are db02 db04 .
2018-05-15 16:30:20.799 [ORAROOTAGENT(94048)]CRS-5021: Check of storage failed: details at "(:CLSN00117:)" in "/u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc"
2018-05-15 16:30:22.779 [OCSSD(94878)]CRS-1720: Cluster Synchronization Services daemon (CSSD) is ready for operation.
2018-05-15 16:30:22.793 [OCTSSD(101356)]CRS-8500: Oracle Clusterware OCTSSD process is starting with operating system process ID 101356
2018-05-15 16:30:23.754 [OCTSSD(101356)]CRS-2403: The Cluster Time Synchronization Service on host db02 is in observer mode.
2018-05-15 16:30:24.606 [ORAROOTAGENT(94048)]CRS-5019: All OCR locations are on ASM disk groups [DBFS_DG], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc".
2018-05-15 16:30:24.833 [OCTSSD(101356)]CRS-2407: The new Cluster Time Synchronization Service reference node is host db04.
2018-05-15 16:30:24.834 [OCTSSD(101356)]CRS-2401: The Cluster Time Synchronization Service started on host db04.
2018-05-15 16:30:36.026 [ORAROOTAGENT(94048)]CRS-5019: All OCR locations are on ASM disk groups [DBFS_DG], and none of these disk groups are mounted. Details are at "(:CLSN00140:)" in "/u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc".
2018-05-15 16:40:32.357 [ORAROOTAGENT(94048)]CRS-5818: Aborted command 'start' for resource 'ora.storage'. Details at (:CRSAGF00113:) {0:5:3} in /u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc.
2018-05-15 16:40:36.499 [ORAROOTAGENT(94048)]CRS-5017: The resource action "ora.storage start" encountered the following error:
2018-05-15 16:40:36.499+Storage agent start action aborted. For details refer to "(:CLSN00107:)" in "/u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc".
2018-05-15 16:40:36.501 [OHASD(93851)]CRS-2757: Command 'Start' timed out waiting for response from the resource 'ora.storage'. Details at (:CRSPE00221:) {0:5:3} in /u01/app/oracle/diag/crs/db02/crs/trace/ohasd.trc.
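Every one of those messages points at a marker string like (:CLSN00140:) in ohasd_orarootagent_root.trc, so grepping for the marker (path taken from the messages above) is the quickest way to jump to the relevant section of what can be a very large trace file:
[root@db02 ~]# grep -n "CLSN00140" /u01/app/oracle/diag/crs/db02/crs/trace/ohasd_orarootagent_root.trc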
I started digging through the trace files, and I spotted something in the ocssd.trc file that looked like it might help:
2018-05-15 16:43:50.588 : CSSD:28108544: clssgmRPC: rpc 0x7f7e02559250 (RPC#67) tag(43002a) sent to node 4
2018-05-15 16:43:50.589 : CSSD:39147264: clssgmRPCDone: rpc 0x7f7e02559250 (RPC#67) state 6, flags 0x202
2018-05-15 16:43:50.618 : CSSD:25212672: clssgmGroupState: requested group state of unknown group OSM_LOCAL_ALL
2018-05-15 16:43:50.694 : CSSD:25212672: clssgmMemberPublicInfo: group ASM_NETWORK not found
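That "group ASM_NETWORK not found" message pointed squarely at the ASM network. Two quick, generic cross-checks at this stage (run on a node where the stack is up): srvctl config asm shows the ASM configuration, and crsctl stat res -t lists the clusterware resources, including the ASM network listeners (named after the listener, e.g. ora.ASMNET1LSNR_ASM.lsnr on a 12c Flex ASM cluster):
[oracle@db01 ~]$ srvctl config asm
[oracle@db01 ~]$ crsctl stat res -t | grep -i lsnr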
I checked the network settings stored in the OCR, and it showed both the old and new IB networks registered (these are /22 subnets, so the 192.168.10.x addresses fall within the 192.168.8.0/22 network and the new 192.168.14.x addresses within 192.168.12.0/22, which is why oifcfg reports those subnet addresses rather than the host ranges):
[root@db01 ~]# oifcfg getif
bondeth0.2063 10.144.63.0 global public
ib0 192.168.12.0 global cluster_interconnect,asm <------ new network range
ib1 192.168.12.0 global cluster_interconnect,asm <------ new network range
ib0 192.168.8.0 global cluster_interconnect,asm <------ old network range
ib1 192.168.8.0 global cluster_interconnect,asm <------ old network range
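For reference, these registrations are managed with oifcfg. Once no node needs the old subnet any longer, the stale entries can be dropped and the result re-checked with something along these lines - subnets taken from the output above, and the exact sequencing matters, so follow the documented procedure rather than this sketch:
[root@db01 ~]# oifcfg delif -global ib0/192.168.8.0
[root@db01 ~]# oifcfg delif -global ib1/192.168.8.0
[root@db01 ~]# oifcfg getif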
After digging and cross-referencing with some MOS notes, I figured out that this was down to the ASM network listeners introduced in 12c. When I ran srvctl, I saw two listeners still registered against the old InfiniBand subnet:
[oracle@db01 ~]$ srvctl config listener -asmlistener
Name: ASMNET1LSNR_ASM
Type: ASM Listener
Owner: oracle
Subnet: 192.168.8.0 <------ old network range
Home:
End points: TCP:1526
Listener is enabled.
Listener is individually enabled on nodes:
Listener is individually disabled on nodes:
Name: ASMNET2LSNR_ASM
Type: ASM Listener
Owner: oracle
Subnet: 192.168.8.0 <------ old network range
Home:
End points: TCP:1527
Listener is enabled.
Listener is individually enabled on nodes:
Listener is individually disabled on nodes:
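To see exactly which endpoints those listeners were tied to, lsnrctl status with the listener name (run as the Grid Infrastructure owner with the Grid home environment set) reports what each one is, or is not, listening on:
[oracle@db01 ~]$ lsnrctl status ASMNET1LSNR_ASM
[oracle@db01 ~]$ lsnrctl status ASMNET2LSNR_ASM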
I found MOS note #283684.1, which describes the process for modifying the cluster interconnect when using Flex ASM. It turned out that my issue boiled down to the nodes not being able to communicate on the ASM network: even though the new network showed up in oifcfg, the nodes were still trying to reach each other through the ASM listeners registered on the old subnet. I had to add a new ASM network listener on the new subnet and remove the old ones:
[oracle@db01 ~]$ srvctl add listener -asmlistener -l ASMIBNETLSNR_ASM -subnet 192.168.12.0
[oracle@db01 ~]$ srvctl update listener -listener ASMNET1LSNR_ASM -asm -remove -force
[oracle@db01 ~]$ srvctl update listener -listener ASMNET2LSNR_ASM -asm -remove -force
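A quick sanity check before restarting anything is to re-run the earlier queries and make sure only the new listener, on the new subnet, remains:
[oracle@db01 ~]$ srvctl config listener -asmlistener
[oracle@db01 ~]$ oifcfg getif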
Once I cleaned up the listeners, I was able to start Clusterware on all of the nodes without any issues.
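If you run into the same thing, crsctl check cluster -all is a handy final check that CSS, CRS and EVM are healthy on every node once they're all back up:
[root@db01 ~]# crsctl check cluster -all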