We had a client running into a strange issue on their Exadata: new connections coming in through the SCAN were failing. Troubleshooting showed that one of the SCAN listeners was not accepting requests from new sessions, even though its VIP and listener were running and everything looked normal.
We had the following SCAN setup:
SCAN VIP # | VIP IP |
1 | 172.25.2.70 |
2 | 172.25.2.68 |
3 | 172.25.2.69 |
For some reason, sessions trying to connect via VIP #2 on the SCAN were not getting through, as a tnsping from the application server showed:
[oracle@s8270a30-phx ~]$ tnsping 172.25.2.68
TNS Ping Utility for Linux: Version 11.2.0.3.0 - Production on 18-MAR-2014 20:02:33
Copyright (c) 1997, 2011, Oracle. All rights reserved.
Used parameter files:
/u01/app/oracle/product/11.2.0.3/dbhome_1/network/admin/sqlnet.ora
Used EZCONNECT adapter to resolve the alias
Attempting to contact (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=TCP)(HOST=172.25.2.68)(PORT=1521)))
TNS-12541: TNS:no listener
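Since only one of the three SCAN VIPs was misbehaving, it helps to test each VIP on its own rather than the SCAN name. The following is a minimal sketch, not part of the original troubleshooting: `tns_status` is a hypothetical helper that reduces one tnsping run (read from stdin) to either `OK` or the first TNS error code, assuming the output format shown above.

```shell
# tns_status (hypothetical helper): summarize one tnsping run read from stdin.
# Prints "OK", the first TNS-nnnnn error code, or "UNKNOWN" if neither appears.
tns_status() {
  awk '
    /^OK \(/       { print "OK"; found = 1; exit }
    /^TNS-[0-9]+:/ { sub(/:.*/, "", $1); print $1; found = 1; exit }
    END            { if (!found) print "UNKNOWN" }
  '
}

# Usage sketch (needs tnsping on the PATH; the IPs are the SCAN VIPs from the table above):
# for ip in 172.25.2.70 172.25.2.68 172.25.2.69; do
#   printf '%s %s\n' "$ip" "$(tnsping "$ip" | tns_status)"
# done
```

A VIP that answers ping but returns TNS-12541 is worth a closer look, because it means some host owns the address but nothing on it is listening on port 1521.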
Everything looked healthy on the cluster. The IPs were up and running, and the SCAN listeners reported normal status:
[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.60 Bcast:172.25.255.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:23230221735 errors:0 dropped:0 overruns:1061 frame:0
TX packets:38652899593 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3129281507199 (2.8 TiB) TX bytes:41491136417663 (37.7 TiB)
bondeth0:1 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.68 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:2 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.69 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:3 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.66 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:6 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.64 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:7 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.67 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
[oracle@dm03db01 ~]$ lsnrctl status LISTENER_SCAN2 | head -20
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 20-MAR-2014 20:29:35
Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2)))
STATUS of the LISTENER
------------------------
Alias LISTENER_SCAN2
Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date 07-MAR-2014 09:32:45
Uptime 13 days 9 hr. 56 min. 50 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File /u01/app/oracle/diag/tnslsnr/dm03db01/listener_scan2/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN2)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.25.2.68)(PORT=1521)))
Services Summary...
[oracle@dm03db01 ~]$ lsnrctl status LISTENER_SCAN3 | head -20
LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 20-MAR-2014 20:31:42
Copyright (c) 1991, 2011, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN3)))
STATUS of the LISTENER
------------------------
Alias LISTENER_SCAN3
Version TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date 16-FEB-2014 10:28:03
Uptime 32 days 9 hr. 3 min. 39 sec
Trace Level off
Security ON: Local OS Authentication
SNMP OFF
Listener Parameter File /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File /u01/app/11.2.0.3/grid/log/diag/tnslsnr/dm03db01/listener_scan3/alert/log.xml
Listening Endpoints Summary...
(DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN3)))
(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.25.2.69)(PORT=1521)))
Services Summary...
This is a half rack X2-2, with only 2 compute nodes licensed. This is why we had the following interfaces up and running on dm03db01:
Interface | Related IP |
bondeth0 | Host IP |
bondeth0:1 | SCAN2 VIP |
bondeth0:2 | SCAN3 VIP |
bondeth0:3 | dm0303-vip |
bondeth0:6 | dm0304-vip |
bondeth0:7 | dm0301-vip |
Because nodes 3 and 4 are not in use, CRS was shut down on them, hence the extra VIPs up and running on node 1. After taking a look at the issue, I shut down the listener and VIP associated with SCAN2. During this process, I ran a continuous ping against the SCAN2 VIP from our application server, s8270a30-phx:
[oracle@dm03db01 ~]$ srvctl stop scan_listener -i 2
[oracle@dm03db01 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node dm03db02
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is not running
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node dm03db01
[oracle@dm03db01 ~]$ ps -ef | grep lsnr
oracle 316 26489 0 08:44 pts/0 00:00:00 grep lsnr
oracle 11426 1 0 Feb16 ? 00:38:23 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle 11458 1 0 Feb16 ? 00:52:43 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
[oracle@dm03db01 ~]$ srvctl stop scan -i 2
[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.60 Bcast:172.25.255.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:23218232564 errors:0 dropped:0 overruns:1061 frame:0
TX packets:38633150465 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3126152011061 (2.8 TiB) TX bytes:41468962228200 (37.7 TiB)
bondeth0:2 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.69 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:3 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.66 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:6 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.64 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:7 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.67 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
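Before reading anything into a ping result, it is worth confirming that the VIP really is gone from the local node. This is a small sketch under the assumption that ifconfig prints the old `inet addr:` format seen above; `vip_released` is a hypothetical helper name.

```shell
# vip_released (hypothetical helper): exit 0 when the given IP does NOT appear
# in old-style ifconfig output read from stdin, exit 1 when it is still plumbed.
# The trailing space in the pattern stops 172.25.2.6 from matching 172.25.2.68.
vip_released() {
  ! grep -qF "inet addr:$1 "
}

# Usage sketch:
# /sbin/ifconfig | vip_released 172.25.2.68 && echo "172.25.2.68 is no longer on this node"
```

If the address is released locally and still answers ping from another machine on the subnet, some other host must be holding it.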
After shutting down the IP and listener, the ping was still replying. Something other than dm03db01 was answering for 172.25.2.68, which confirmed the theory that the IP was in use somewhere else. Next, we had to track it down, which was easy to do via the arp utility on the application server:
[root@s8270a30-phx ~]# arp -n
Address HWtype HWaddress Flags Mask Iface
172.25.2.62 ether 00:21:28:E7:D9:47 C bond0
172.25.2.64 ether 00:21:28:E7:D7:5B C bond0
172.25.2.63 ether 00:21:28:E7:D4:61 C bond0
172.25.1.7 ether 00:50:56:BD:01:50 C bond0
172.25.0.5 ether 00:00:0C:9F:F0:07 C bond0
172.25.2.68 ether 00:21:28:E7:D9:47 C bond0
172.25.2.61 ether 00:21:28:E7:D1:EB C bond0
172.25.2.60 ether 00:21:28:E7:D7:5B C bond0
172.25.2.65 ether 00:21:28:E7:D1:EB C bond0
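The useful detail in this table is which other addresses share a MAC with the suspect IP. As a rough sketch (the function name `same_mac_ips` is made up for illustration, and it assumes the `arp -n` column layout shown above), the lookup can be automated like this:

```shell
# same_mac_ips (hypothetical helper): given an IP as $1 and `arp -n` output on
# stdin, print every other IP in the table that resolves to the same MAC.
# Exits non-zero when the IP is not in the ARP table at all.
same_mac_ips() {
  awk -v ip="$1" '
    $2 == "ether" { mac[$1] = $3 }   # remember IP -> MAC for each entry
    END {
      if (!(ip in mac)) exit 1
      for (a in mac)
        if (a != ip && mac[a] == mac[ip]) print a
    }
  '
}

# Usage sketch:
# arp -n | same_mac_ips 172.25.2.68
```

Run against the table above, this prints 172.25.2.62, the address to look up next.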
The MAC address answering for 172.25.2.68 (00:21:28:E7:D9:47) was the same one listed for 172.25.2.62. A simple nslookup showed that 172.25.2.62 belongs to the dm03db03 server:
[root@s8270a30-phx ~]# nslookup 172.25.2.62
Server: 172.25.1.7
Address: 172.25.1.7#53
62.2.25.172.in-addr.arpa name = dm0303.xxxxx.pvt.
From here, we went to the dm03db03 server, and checked the interfaces that were running:
[root@dm03db03 ~]# ifconfig
bondeth0 Link encap:Ethernet HWaddr 00:21:28:E7:D9:47
inet addr:172.25.2.62 Bcast:172.25.255.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fee7:d947/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:162693018 errors:0 dropped:0 overruns:0 frame:0
TX packets:81208365 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12825636230 (11.9 GiB) TX bytes:85377835236 (79.5 GiB)
bondeth0:1 Link encap:Ethernet HWaddr 00:21:28:E7:D9:47
inet addr:172.25.2.66 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:2 Link encap:Ethernet HWaddr 00:21:28:E7:D9:47
inet addr:172.25.2.68 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
It looks like when CRS was last shut down on this server, it was running the SCAN2 VIP (172.25.2.68) and didn't properly release it, nor the node VIP associated with the host (172.25.2.66). Disabling those interfaces should get rid of the duplicate IP issue:
[root@dm03db03 ~]# ifconfig bondeth0:1 down
[root@dm03db03 ~]# ifconfig bondeth0:2 down
[root@dm03db03 ~]# ifconfig
bondeth0 Link encap:Ethernet HWaddr 00:21:28:E7:D9:47
inet addr:172.25.2.62 Bcast:172.25.255.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fee7:d947/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:162728761 errors:0 dropped:0 overruns:0 frame:0
TX packets:81225700 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:12828169943 (11.9 GiB) TX bytes:85397251157 (79.5 GiB)
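On a node where CRS is intentionally down, any alias interface that is still up is a candidate leftover. The sketch below lists them from old-style ifconfig output; `list_alias_ips` is a hypothetical helper, and the approach only makes sense on nodes that should not be hosting VIPs at all, since on a live cluster node the aliases are legitimate.

```shell
# list_alias_ips (hypothetical helper): print "interface ip" for every alias
# interface (name contains ":") found in old-style ifconfig output on stdin.
list_alias_ips() {
  awk '
    /Link encap:/ { iface = $1 }          # start of a new interface stanza
    /inet addr:/ && iface ~ /:/ {         # IPv4 line belonging to an alias
      for (i = 1; i <= NF; i++)
        if ($i ~ /^addr:/) { sub(/^addr:/, "", $i); print iface, $i }
    }
  '
}

# Dry-run usage sketch: print the commands instead of running them.
# /sbin/ifconfig | list_alias_ips | while read -r ifc ip; do
#   echo "ifconfig $ifc down   # releases $ip"
# done
```

Printing the commands first leaves room to check each address against the cluster's VIP list before taking anything down.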
At this point, the ping stopped responding, so we started the SCAN VIP and listener on dm03db01.
[oracle@dm03db01 ~]$ srvctl start scan -i 2
[oracle@dm03db01 ~]$ srvctl start scan_listener -i 2
[oracle@dm03db01 ~]$ ps -ef | grep lsnr
oracle 5483 1 0 08:47 ? 00:00:02 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
oracle 10702 26489 0 10:24 pts/0 00:00:00 grep lsnr
oracle 11426 1 0 Feb16 ? 00:38:27 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle 11458 1 0 Feb16 ? 00:52:54 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.60 Bcast:172.25.255.255 Mask:255.255.248.0
inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
RX packets:23230221735 errors:0 dropped:0 overruns:1061 frame:0
TX packets:38652899593 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:3129281507199 (2.8 TiB) TX bytes:41491136417663 (37.7 TiB)
bondeth0:1 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.68 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:2 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.69 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:3 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.66 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:6 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.64 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
bondeth0:7 Link encap:Ethernet HWaddr 00:21:28:E7:D7:5B
inet addr:172.25.2.67 Bcast:172.25.7.255 Mask:255.255.248.0
UP BROADCAST RUNNING MASTER MULTICAST MTU:1500 Metric:1
After this, we attempted to hit the VIP with tnsping from the application server:
[oracle@s8270a30-phx ~]$ tnsping 172.25.2.68
TNS Ping Utility for Linux: Version 11.2.0.3.0 - Production on 28-MAR-2014 10:30:43
Copyright (c) 1997, 2011, Oracle. All rights reserved.
Used parameter files:
/u01/app/oracle/product/11.2.0.3/dbhome_1/network/admin/sqlnet.ora
Used EZCONNECT adapter to resolve the alias
Attempting to contact (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=TCP)(HOST=172.25.2.68)(PORT=1521)))
OK (0 msec)
After this, the applications stopped seeing the "random" connection issues, and everything was back to normal.
Interesting issue.
wow. so the suggestion is to turn on the left part of the Rack? :-d