SCAN VIP Troubleshooting

By | March 28, 2014

We had a client running into a strange issue on their Exadata: some new connections coming in through the SCAN were failing. Troubleshooting showed that one of the SCAN listeners was not accepting requests from new sessions, even though the VIP and listener were running and everything looked normal.

We had the following SCAN setup:

SCAN VIP #   VIP IP
1            172.25.2.70
2            172.25.2.68
3            172.25.2.69

For some reason, sessions trying to connect via VIP #2 on the SCAN were not getting through.

[oracle@s8270a30-phx ~]$ tnsping 172.25.2.68

TNS Ping Utility for Linux: Version 11.2.0.3.0 - Production on 18-MAR-2014 20:02:33
Copyright (c) 1997, 2011, Oracle.  All rights reserved.

Used parameter files:
/u01/app/oracle/product/11.2.0.3/dbhome_1/network/admin/sqlnet.ora

Used EZCONNECT adapter to resolve the alias
Attempting to contact (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=TCP)(HOST=172.25.2.68)(PORT=1521)))
TNS-12541: TNS:no listener
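
To narrow down which SCAN VIP is misbehaving, each address can be probed in turn. The following is a minimal sketch, assuming tnsping is on the PATH and that its final output line is either "OK (N msec)" or a TNS- error, as in the output above. The function name check_vips is hypothetical, not an Oracle utility.

```shell
#!/bin/bash
# Probe each SCAN VIP with tnsping and print the last line of its output.
# check_vips is a hypothetical helper name, not part of any Oracle tooling.
check_vips() {
    local ip result
    for ip in "$@"; do
        # Assumption: the final line of tnsping output carries the verdict,
        # either "OK (N msec)" or a TNS- error code.
        result=$(tnsping "$ip" 2>&1 | tail -n 1)
        printf '%s\t%s\n' "$ip" "$result"
    done
}
```

It would be called with the three addresses from the table, for example: check_vips 172.25.2.70 172.25.2.68 172.25.2.69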

Everything looked fine on the cluster: the IPs were up and running, and the SCAN listeners reported a normal status:

[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0  Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.60  Bcast:172.25.255.255  Mask:255.255.248.0
          inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:23230221735 errors:0 dropped:0 overruns:1061 frame:0
          TX packets:38652899593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3129281507199 (2.8 TiB)  TX bytes:41491136417663 (37.7 TiB)

bondeth0:1 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.68  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:2 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.69  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:3 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.66  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:6 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.64  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:7 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.67  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

[oracle@dm03db01 ~]$ lsnrctl status LISTENER_SCAN2 | head -20

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 20-MAR-2014 20:29:35

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN2)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_SCAN2
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                07-MAR-2014 09:32:45
Uptime                    13 days 9 hr. 56 min. 50 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/oracle/diag/tnslsnr/dm03db01/listener_scan2/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN2)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.25.2.68)(PORT=1521)))
Services Summary...

[oracle@dm03db01 ~]$ lsnrctl status LISTENER_SCAN3 | head -20

LSNRCTL for Linux: Version 11.2.0.3.0 - Production on 20-MAR-2014 20:31:42

Copyright (c) 1991, 2011, Oracle.  All rights reserved.

Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=IPC)(KEY=LISTENER_SCAN3)))
STATUS of the LISTENER
------------------------
Alias                     LISTENER_SCAN3
Version                   TNSLSNR for Linux: Version 11.2.0.3.0 - Production
Start Date                16-FEB-2014 10:28:03
Uptime                    32 days 9 hr. 3 min. 39 sec
Trace Level               off
Security                  ON: Local OS Authentication
SNMP                      OFF
Listener Parameter File   /u01/app/11.2.0.3/grid/network/admin/listener.ora
Listener Log File         /u01/app/11.2.0.3/grid/log/diag/tnslsnr/dm03db01/listener_scan3/alert/log.xml
Listening Endpoints Summary...
  (DESCRIPTION=(ADDRESS=(PROTOCOL=ipc)(KEY=LISTENER_SCAN3)))
  (DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=172.25.2.69)(PORT=1521)))
Services Summary...

This is a half-rack X2-2 with only two compute nodes licensed, which is why the following interfaces were up and running on dm03db01:

Interface    Related IP
bondeth0     Host IP
bondeth0:1   SCAN2 VIP
bondeth0:2   SCAN3 VIP
bondeth0:3   dm0303-vip
bondeth0:6   dm0304-vip
bondeth0:7   dm0301-vip

Because nodes 3 and 4 are not in use, CRS was shut down on them, hence the extra VIPs running on node 1. To test whether something else was answering on the SCAN2 address, I shut down the listener and VIP associated with SCAN2 while running a continuous ping from our application server, s8270a30-phx:

[oracle@dm03db01 ~]$ srvctl stop scan_listener -i 2
[oracle@dm03db01 ~]$ srvctl status scan_listener
SCAN Listener LISTENER_SCAN1 is enabled
SCAN listener LISTENER_SCAN1 is running on node dm03db02
SCAN Listener LISTENER_SCAN2 is enabled
SCAN listener LISTENER_SCAN2 is not running
SCAN Listener LISTENER_SCAN3 is enabled
SCAN listener LISTENER_SCAN3 is running on node dm03db01
[oracle@dm03db01 ~]$ ps -ef | grep lsnr
oracle     316 26489  0 08:44 pts/0    00:00:00 grep lsnr
oracle   11426     1  0 Feb16 ?        00:38:23 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle   11458     1  0 Feb16 ?        00:52:43 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
[oracle@dm03db01 ~]$ srvctl stop scan -i 2
[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0  Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.60  Bcast:172.25.255.255  Mask:255.255.248.0
          inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:23218232564 errors:0 dropped:0 overruns:1061 frame:0
          TX packets:38633150465 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3126152011061 (2.8 TiB)  TX bytes:41468962228200 (37.7 TiB)

bondeth0:2 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.69  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:3 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.66  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:6 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.64  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:7 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.67  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

Even with the VIP and listener shut down, the ping kept replying, which confirmed the theory that the IP was in use somewhere else. The next step was to track it down, which was easy to do with the arp utility on the application server:

[root@s8270a30-phx ~]# arp -n
Address                  HWtype  HWaddress           Flags Mask            Iface
172.25.2.62              ether   00:21:28:E7:D9:47   C                     bond0
172.25.2.64              ether   00:21:28:E7:D7:5B   C                     bond0
172.25.2.63              ether   00:21:28:E7:D4:61   C                     bond0
172.25.1.7               ether   00:50:56:BD:01:50   C                     bond0
172.25.0.5               ether   00:00:0C:9F:F0:07   C                     bond0
172.25.2.68              ether   00:21:28:E7:D9:47   C                     bond0
172.25.2.61              ether   00:21:28:E7:D1:EB   C                     bond0
172.25.2.60              ether   00:21:28:E7:D7:5B   C                     bond0
172.25.2.65              ether   00:21:28:E7:D1:EB   C                     bond0

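The MAC address answering for 172.25.2.68 (00:21:28:E7:D9:47) is the same one listed for 172.25.2.62, and a simple nslookup showed that 172.25.2.62 belongs to the dm03db03 server: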

[root@s8270a30-phx ~]# nslookup 172.25.2.62
Server:		172.25.1.7
Address:	172.25.1.7#53

62.2.25.172.in-addr.arpa	name = dm0303.xxxxx.pvt.

From here, we went to the dm03db03 server, and checked the interfaces that were running:

[root@dm03db03 ~]# ifconfig
bondeth0  Link encap:Ethernet  HWaddr 00:21:28:E7:D9:47
          inet addr:172.25.2.62  Bcast:172.25.255.255  Mask:255.255.248.0
          inet6 addr: fe80::221:28ff:fee7:d947/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:162693018 errors:0 dropped:0 overruns:0 frame:0
          TX packets:81208365 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12825636230 (11.9 GiB)  TX bytes:85377835236 (79.5 GiB)

bondeth0:1 Link encap:Ethernet  HWaddr 00:21:28:E7:D9:47
          inet addr:172.25.2.66  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:2 Link encap:Ethernet  HWaddr 00:21:28:E7:D9:47
          inet addr:172.25.2.68  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

It looks like when CRS was last shut down on this server, it was running the SCAN2 VIP (172.25.2.68) and didn't properly release it, nor the node VIP on 172.25.2.66. Bringing those alias interfaces down should get rid of the duplicate IP issue.

[root@dm03db03 ~]# ifconfig bondeth0:1 down
[root@dm03db03 ~]# ifconfig bondeth0:2 down
[root@dm03db03 ~]# ifconfig
bondeth0  Link encap:Ethernet  HWaddr 00:21:28:E7:D9:47
          inet addr:172.25.2.62  Bcast:172.25.255.255  Mask:255.255.248.0
          inet6 addr: fe80::221:28ff:fee7:d947/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:162728761 errors:0 dropped:0 overruns:0 frame:0
          TX packets:81225700 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:12828169943 (11.9 GiB)  TX bytes:85397251157 (79.5 GiB)
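
To check other idle nodes for the same leftover aliases, the interface listing can be filtered down to alias interfaces only. This is a minimal sketch that assumes the older net-tools ifconfig format shown in this post ("inet addr:" lines). The function name list_aliases is hypothetical.

```shell
#!/bin/bash
# Read net-tools style `ifconfig` output on stdin and print each alias
# interface (a name containing ":") together with its IPv4 address.
# list_aliases is a hypothetical helper name.
list_aliases() {
    awk '
        /^[a-zA-Z]/ { iface = $1 }            # interface header line
        /inet addr:/ && iface ~ /:/ {
            split($2, a, ":")                 # $2 looks like "addr:172.25.2.68"
            print iface, a[2]
        }
    '
}
```

Run as "/sbin/ifconfig | list_aliases". On a node where CRS is stopped, one would generally expect this to print nothing, so any output is worth a closer look.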

At this point, the ping stopped responding, so we started the SCAN2 VIP and listener again on dm03db01.

[oracle@dm03db01 ~]$ srvctl start scan -i 2
[oracle@dm03db01 ~]$ srvctl start scan_listener -i 2
[oracle@dm03db01 ~]$ ps -ef | grep lsnr
oracle    5483     1  0 08:47 ?        00:00:02 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN2 -inherit
oracle   10702 26489  0 10:24 pts/0    00:00:00 grep lsnr
oracle   11426     1  0 Feb16 ?        00:38:27 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER_SCAN3 -inherit
oracle   11458     1  0 Feb16 ?        00:52:54 /u01/app/11.2.0.3/grid/bin/tnslsnr LISTENER -inherit
[oracle@dm03db01 ~]$ /sbin/ifconfig
bondeth0  Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.60  Bcast:172.25.255.255  Mask:255.255.248.0
          inet6 addr: fe80::221:28ff:fee7:d75b/64 Scope:Link
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1
          RX packets:23230221735 errors:0 dropped:0 overruns:1061 frame:0
          TX packets:38652899593 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:3129281507199 (2.8 TiB)  TX bytes:41491136417663 (37.7 TiB)

bondeth0:1 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.68  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:2 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.69  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:3 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.66  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:6 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.64  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

bondeth0:7 Link encap:Ethernet  HWaddr 00:21:28:E7:D7:5B
          inet addr:172.25.2.67  Bcast:172.25.7.255  Mask:255.255.248.0
          UP BROADCAST RUNNING MASTER MULTICAST  MTU:1500  Metric:1

After this, we attempted to hit the VIP with tnsping from the application server:

[oracle@s8270a30-phx ~]$ tnsping 172.25.2.68

TNS Ping Utility for Linux: Version 11.2.0.3.0 - Production on 28-MAR-2014 10:30:43

Copyright (c) 1997, 2011, Oracle.  All rights reserved.

Used parameter files:
/u01/app/oracle/product/11.2.0.3/dbhome_1/network/admin/sqlnet.ora

Used EZCONNECT adapter to resolve the alias
Attempting to contact (DESCRIPTION=(CONNECT_DATA=(SERVICE_NAME=))(ADDRESS=(PROTOCOL=TCP)(HOST=172.25.2.68)(PORT=1521)))
OK (0 msec)

Since then, the applications have stopped hitting the "random" connection issues, and all is back to normal.

 
