Exadata Interconnect Addressing Tips

By | June 13, 2011

Having done a handful of Exadata implementations, there’s always been one piece of the configuration that’s bothered me more than anything else.  In the process of ordering an Exadata, Oracle sends the customer a “Configuration Worksheet” that asks questions about how the system should be configured.  It’s standard stuff:  hostnames, DNS and NTP servers, UID and GID for the oracle/dba/oinstall (that’s another sore spot) accounts, and IP addresses for the various interfaces.  The worksheet comes as a nifty PDF that the customer can modify to suit the needs of the Exadata system.

Unfortunately, the PDF does not allow the customer to modify the IP range used for the IB network.  The only option from this form is to use the network 192.168.8.0/22 with the hosts using 192.168.10.1 – 192.168.10.22 (for a full rack).  Why the /22, you might ask?  Oracle recommends using a subnet mask of 255.255.252.0 so that multiple Exadata systems can be connected, along with an Exalogic, and whatever other products they have down the line that will connect with Exadata on the IB network.  It would be nice if Oracle would allow customers to define this network range themselves, instead of sticking everybody in the 192.168.8.0/22 network.  Some say that it won’t be a problem, because the interconnect is non-routable, but I disagree.
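To make the /22 concrete, here is a minimal sketch using Python's standard `ipaddress` module. It only does the arithmetic for the worksheet's default network; the variable names are mine and are not part of any Oracle tooling.

```python
import ipaddress

# Default InfiniBand network from the configuration worksheet.
ib_net = ipaddress.ip_network("192.168.8.0/22")

# A /22 is the 255.255.252.0 mask and covers four consecutive /24 blocks.
print(ib_net.netmask)
print(ib_net[0], "-", ib_net[-1])
print(ib_net.num_addresses)

# The full-rack host range quoted above sits inside that network.
first_host = ipaddress.ip_address("192.168.10.1")
last_host = ipaddress.ip_address("192.168.10.22")
hosts_in_range = first_host in ib_net and last_host in ib_net
print(hosts_in_range)
```

The point is that the default claims everything from 192.168.8.0 through 192.168.11.255, not just the 22 addresses the hosts actually use.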

The problem is that if that subnet is used anywhere else in the enterprise, those systems will not be able to connect to the Exadata at all.  If the Exadata nodes have a route that sends 192.168.8.0/22 to the IB network, how will they respond to packets coming from a valid host using one of those IPs?  The Exadata will never be able to respond, since the routing tables tell the Exadata host to send those packets to bondib0.
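A quick way to catch this before the install is to compare the planned IB network against the subnets already in use. The sketch below is only an illustration: the `find_overlaps` helper and the list of enterprise subnets are hypothetical, and the real list would come from your own network team.

```python
import ipaddress

def find_overlaps(ib_network, enterprise_networks):
    """Return the enterprise subnets that overlap the planned IB network."""
    ib = ipaddress.ip_network(ib_network)
    return [
        net for net in enterprise_networks
        if ipaddress.ip_network(net).overlaps(ib)
    ]

# Hypothetical subnets already routed elsewhere in the enterprise.
enterprise = [
    "10.1.0.0/16",
    "192.168.10.0/24",
    "192.168.12.0/24",
    "172.16.4.0/22",
]

conflicts = find_overlaps("192.168.8.0/22", enterprise)
print(conflicts)
```

Any subnet that shows up in `conflicts` holds hosts that the Exadata nodes would try to reach over InfiniBand instead of Ethernet.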

For example, say an Exadata has been configured to use 192.168.12.0/24 for its IB network.  The client access and management networks are 192.168.10.0/24 and 192.168.11.0/24 respectively.  Say that I create a new network outside of the IB switch network that uses 192.168.12.0/24 and create the associated routes to allow this network to talk to 192.168.10.0 and 192.168.11.0.  Gateway for all networks is .1.  If I have a host on the 192.168.12.0/24 (ethernet) network, it can access anything on 192.168.10.0 and 192.168.11.0 except any Exadata hosts.

I have my Macbook connected to a network with the address 192.168.12.100. 192.168.10.15 is a (non-Exadata) host on the 192.168.10.0 network, while enkdb01 is one of our Exadata compute nodes.

Andy-Colvins-Macbook:~ acolvin$ ifconfig en0
en0: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        ether 00:1f:f3:59:6f:ac
        inet6 fe80::21f:f3ff:fe59:6fac%en0 prefixlen 64 scopeid 0x5
        inet 192.168.12.100 netmask 0xffffff00 broadcast 192.168.12.255
        media: autoselect (100baseTX <full-duplex,flow-control>)
        status: active
 
Andy-Colvins-Macbook:~ acolvin$ ping -c 5 192.168.10.15
PING 192.168.10.15 (192.168.10.15): 56 data bytes
64 bytes from 192.168.10.15: icmp_seq=0 ttl=62 time=1.139 ms
64 bytes from 192.168.10.15: icmp_seq=1 ttl=62 time=1.185 ms
64 bytes from 192.168.10.15: icmp_seq=2 ttl=62 time=1.062 ms
64 bytes from 192.168.10.15: icmp_seq=3 ttl=62 time=1.082 ms
64 bytes from 192.168.10.15: icmp_seq=4 ttl=62 time=1.146 ms
 
--- 192.168.10.15 ping statistics ---
5 packets transmitted, 5 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 1.062/1.123/1.185/0.045 ms
 
Andy-Colvins-Macbook:~ acolvin$ traceroute -n -m 3 192.168.10.15
traceroute to 192.168.10.15 (192.168.10.15), 3 hops max, 52 byte packets
 1  192.168.12.1  0.867 ms  0.438 ms  0.482 ms
 2  192.168.10.15  1.085 ms  0.904 ms  0.773 ms
 
Andy-Colvins-Macbook:~ acolvin$ ping -c 5 enkdb01
PING enkdb01.enkitec.com (192.168.8.201): 56 data bytes
Request timeout for icmp_seq 0
Request timeout for icmp_seq 1
Request timeout for icmp_seq 2
Request timeout for icmp_seq 3
 
--- enkdb01.enkitec.com ping statistics ---
5 packets transmitted, 0 packets received, 100.0% packet loss
 
Andy-Colvins-Macbook:~ acolvin$ traceroute -n -m 3 enkdb01
traceroute to enkdb01.enkitec.com (192.168.8.201), 3 hops max, 52 byte packets
 1  192.168.12.1  0.738 ms  0.559 ms  0.353 ms
 2  * * *
 3  * * *

As you can see, we are able to ping 192.168.10.15, but can’t ping enkdb01. What’s more interesting is to see what’s going on inside enkdb01:

[acolvin@enkdb01 ~]$ /sbin/route -v
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
192.168.12.0    *               255.255.255.0   U     0      0        0 bond0
192.168.8.0     *               255.255.252.0   U     0      0        0 eth0
169.254.0.0     *               255.255.0.0     U     0      0        0 bond0
default         router.enkitec. 0.0.0.0         UG    0      0        0 eth0
 
[acolvin@enkdb01 ~]$ sudo /usr/sbin/tcpdump -i eth0 | grep "192.168.12"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 96 bytes
07:41:22.359865 IP 192.168.12.100 > enkdb01.enkitec.com: ICMP echo request, id 32399, seq 0, length 64
07:41:23.359831 IP 192.168.12.100 > enkdb01.enkitec.com: ICMP echo request, id 32399, seq 1, length 64
07:41:24.360013 IP 192.168.12.100 > enkdb01.enkitec.com: ICMP echo request, id 32399, seq 2, length 64
07:41:25.360139 IP 192.168.12.100 > enkdb01.enkitec.com: ICMP echo request, id 32399, seq 3, length 64
07:41:26.360343 IP 192.168.12.100 > enkdb01.enkitec.com: ICMP echo request, id 32399, seq 4, length 64
244 packets captured
244 packets received by filter
0 packets dropped by kernel

Packets are coming in, but not going anywhere. If we look at the Infiniband interface, you can see what it’s trying to do:

[acolvin@enkdb01 ~]$ sudo /usr/sbin/tcpdump -i bond0 | grep "192.168.12.100"
tcpdump: WARNING: arptype 32 not supported by libpcap - falling back to cooked socket
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on bond0, link-type LINUX_SLL (Linux cooked), capture size 96 bytes
07:41:22.361029 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
07:41:23.361060 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
07:41:24.361125 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
07:41:26.361099 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
07:41:27.361208 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
07:41:28.361182 arp who-has 192.168.12.100 tell enkdb01.enkitec.com hardware #32
3089 packets captured
3089 packets received by filter
0 packets dropped by kernel

Just as we expected, enkdb01 is looking at the routing table and sending ARP requests over bond0 (the Infiniband interface) to find out what MAC has 192.168.12.100. If you’ve worked on a RAC system before, it should not surprise you that the interconnect IPs need to be separate. Oracle’s checkip scripts that look for potential network issues do not even run any checks against the Infiniband network. The moral of the story is that you shouldn’t gloss over the Infiniband network even though it’s “non-routable.”
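The route lookup behind this can be approximated in a few lines of Python. This is a simplified model and not how the kernel is implemented: it takes the four routes from the `route -v` output on enkdb01, picks the most specific match for a destination, and ignores metrics and policy routing. The `pick_interface` name is mine.

```python
import ipaddress

# Routes copied from the routing table on enkdb01 (network, interface).
ROUTES = [
    ("192.168.12.0/24", "bond0"),   # Infiniband
    ("192.168.8.0/22", "eth0"),
    ("169.254.0.0/16", "bond0"),
    ("0.0.0.0/0", "eth0"),          # default route
]

def pick_interface(dest, routes=ROUTES):
    """Return the interface of the most specific route matching dest."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for net, iface in routes:
        network = ipaddress.ip_network(net)
        if addr in network:
            matches.append((network.prefixlen, iface))
    # Longest prefix wins; the default route (/0) always matches.
    return max(matches)[1]

# The reply to the MacBook is sent out the Infiniband bond, not eth0.
print(pick_interface("192.168.12.100"))
# A host outside the overlapping range falls through to the default route.
print(pick_interface("10.0.0.5"))
```

The echo requests arrive on eth0, but the reply to 192.168.12.100 matches the directly connected /24 on bond0, so the host ARPs for the address on a network where it does not exist.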


7 thoughts on “Exadata Interconnect Addressing Tips”

  1. Paresh

    Hi Andy,

    Thanks for sharing this informative article. What does Oracle say if and when you asked about changing IP range used for the IB network?

    I think it should be possible for Oracle to honor this as you can do the same for all future Exadata and Exalogic servers you order and have the same subnet for IB (and whatever else Oracle wants to sell us ;).

    Unless Oracle has some hard-coded values in their code for this subnet to optimize IB traffic (I have no clue if this is possible, just a wild guess). I think Oracle hard-coding IP values in code is highly unlikely.

    Thanks,
    Paresh

    1. acolvin Post author

      It shouldn’t be a problem, as long as the change gets put in before the ACS installation starts. Ideally, I’d love to be able to provide that service to clients, but that’s a no-go so far. It could partly be solved by the HAIP feature in 11.2.0.2, which may be the direction that Oracle looks to take the interconnect.

  2. Pingback: Log Buffer #225, A Carnival of the Vanities for DBAs | The Pythian Blog

  3. Vishal Gupta

    Andy/Paresh,

    It is possible to change the IP address range of the IB network. The information submitted to Oracle via the PDF-based configuration sheet is entered into an XLS-based configuration sheet called dbm_configurator.xls, which is used to generate a few config files. This dbm_configurator.xls can be found under the /opt/oracle.SupportTools/onecommand folder on a compute node after it has been imaged with the compute node image. Usually, the filled-in sheet and the generated config files (like cell_group, *_group, dbmachine.params, etc.) are placed on the first compute node, which is used to drive the entire Exadata rack build process using onecommand.

    So you can ask for this dbm_configurator.xls file to be sent to you (or you could use one from an existing Exadata if you have one) and fill it in instead of the PDF. I did the same for one of my clients. We have different IP ranges for our Exadata IB network (e.g. we have 192.168.54.1/22).

    This comes in handy when you want to connect Prod and UAT Exadata to the same backup infrastructure (TSM Server) over IB for faster backups from prod and faster restores to UAT. Backup and restore times are reduced from 3-4 days over 1GbE to 4-8 hours over 40Gb IPoIB.

    I write a blog about all this.

    1. acolvin Post author

      Yes, the IPs can be set from the dbm_configurator spreadsheet, but my experience with ACS has been that they’re reluctant to change things like that, or anything outside of the standard configuration. From the dbm_configurator, there are many things that can be done outside of the standard ACS installation.

      1. Vishal Gupta

        I have been able to tell ACS to change things outside of their standard dbm_configuration PDF. I usually give them the pre-filled dbm_configurator spreadsheet, but the difficulty is that the spreadsheet version keeps changing with newer cell images, as Oracle keeps enhancing it. Since the spreadsheet also generates a set of files (dbmachine.params, dbMachine_prefix, DBM.DAT, etc.) when you click generate files, if one supplies all these generated files to ACS, they can simply import them into their latest and greatest version of the dbm_configurator spreadsheet. But all this requires one to know a bit about Exadata and onecommand beforehand. So if someone is new to Exadata, they would not have a clue that all these details can be changed. Getting Exadata installed and configured for a completely new customer can be quite overwhelming without prior Exadata knowledge. I guess that’s where Exadata consultants like you and me add value and give the right inputs to ACS as per the customer’s site and requirements. ACS does not make an effort to find out the customer’s future plans.

        Oracle knows very well that any customer who is backing up their Exadata racks over Infiniband will at some point want to restore those backups onto a dev/test environment as well. If someone tried to restore these backups over 1Gbps network interfaces (as found in V2 models, though X2 models have 10Gbps interfaces), it would take ages. So usually a customer will want to connect both the prod and test/dev Exadata racks to the same tape media servers over Infiniband. If prod, test, and dev all have the same internal Infiniband IPs, that creates a problem. So ACS should ask up front, in the PDF file itself, what internal IP address range the customer requires. If the customer is not aware, ACS should advise them of these future requirements so that they can make informed decisions. But that would be the ideal situation.

        Nevertheless, all internal IPs can be changed later on, though it involves cluster-wide downtime. I have very recently changed all the internal Infiniband IPs for a test Exadata rack, so that the same TSM media server could be connected to both the prod and test racks using Infiniband. This enables backups and restores to run in a faster and more timely manner.

        1. Andy Colvin Post author

          It is definitely something that can be changed, but usually isn’t thought about until after the box is up and running. And as we all know, it’s tougher to get downtime once people are working on the system, even if it is just developers. If you can get to the spreadsheet early enough, you can generate those parameter files, and change the values in there. What would really be nice is if Oracle would allow people other than ACS to do the installation, because we’d truly be able to customize it from the beginning, instead of having to repeat the process.

