Inside the Oracle Database Appliance – Part 1
We've had a few weeks to play around with the ODA in our office, and I've been able to crack it open and get to into the software and hardware that powers it.
For starters, the system runs a new model of Sun Fire - the X4370 M2. The 4U chassis is basically 2 separate 2U blades (Oracle is calling them system controllers - SCs) that have direct attached storage on the front. Here's a listing of the hardware in each SC:
| Sun X4370M2 System Controller Components (2 SCs per X4370M2) |
|
| CPU | 2x 6-core Intel Xeon X5675 3.06GHz |
| Memory | 96GB 1333MHz DDR3 |
| Network | 2x 10GbE (SFP+) PCIe card 4x 1GbE PCIe card 2x 1GbE onboard |
| Internal Storage | 2x 500GB SATA for operating system 1x 4GB USB internal |
| RAID Controller | 2x SAS-2 LSI HBA |
| Shared Storage | 20x 600GB 3.5" SAS 15,000 RPM hard drives 4x 73GB 3.5" SSDs |
| External Storage | 2x external MiniSAS ports |
| Operating System | Oracle Enterprise Linux 5.5 x86-64 |
Pictures of a real live ODA after the break.
If you're anything like me, the first thing you wanted to know about the ODA is what's inside? Follow along as we walk through the hardware involved.
What's an SC?
The SC is essentially Oracle's term for the blades that sit inside the Sun Fire X4370 M2.
From this view, the back of the SC is at the bottom. At the top (front), are 2 connections that plug into the chassis of the X4370 M2. The SCs slide out from the back of the chassis. Looking closer at the back of the SC, we have the external connections
From the back, you can see the PCI cards on the left, and the onboard ports on the right. In the middle are the fan modules. On the left, we have 4, gigabit ethernet ports, 2, 10GbE (SFP+) ports, and the 2 external SAS connections. On the right are the dual gigabit ethernet (onboard) ports, the serial and network ports for the ILOM, and your standard VGA and USB ports. Above the onboard ports are the 500GB SATA hard drives used for the operating system.
What's Inside?
As mentioned above, there are 2 Seagate 500GB serial ATA hard drives that are used for the operating system:
Along with the serial ATA drives is a 4GB USB flash stick that can be used to create a bootable rescue installation of Oracle Enterprise Linux. Also, this drive is used for some firmware updates.
As for the disk controllers, each SC has 2 LSI controllers. One is on the internal PCIe slot, and another is on a standard PCIe slot. They are the SAS9211-8i controller.
Operating system
One of the many things that I found interesting on the box was that OEL 5.5 was installed, not one of the newer releases. Also, the server is running the RedHat compatible kernel, and not the Unbreakable Enterprise Kernel (UEK), which is the default on newer releases of OEL 5.
[root@xlnxrda01 ~]# uname -a Linux xlnxrda01 2.6.18-194.32.1.0.1.el5 #1 SMP Tue Jan 4 16:26:54 EST 2011 x86_64 x86_64 x86_64 GNU/Linux [root@xlnxrda01 ~]# cat /etc/redhat-release Red Hat Enterprise Linux Server release 5.5 (Tikanga) |
Disk configuration
As for the storage, Oracle has removed one of my biggest peeves with the Exadata storage servers. On Exadata, the operating system resides on 30GB partitions on the first 2 hard disks. Because of this, a 30GB griddisk has to be created on the remaining 10 disks, which becomes the DBFS_DG diskgroup (formerly SYSTEMDG). With Exadata, this diskgroup becomes the location for OCR/voting files, unless DATA or RECO is high redundancy. In that case, DBFS_DG is just wasted space. Anyways, going back to the ODA (that was the point of this post, wasn't it?), this problem is no longer present thanks to the 2 500GB 2.5" SATA drives in the back. These drives utilize software RAID (just like the Exadata storage servers), but don't take advantage of the active/inactive partition scheme:
[root@patty ~]# parted /dev/sda print
Model: ATA SEAGATE ST95001N (scsi)
Disk /dev/sda: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 32.3kB 107MB 107MB primary ext3 boot, raid
2 107MB 500GB 500GB primary raid
Information: Don't forget to update /etc/fstab, if necessary.
[root@patty ~]# parted /dev/sdb print
Model: ATA SEAGATE ST95001N (scsi)
Disk /dev/sdb: 500GB
Sector size (logical/physical): 512B/512B
Partition Table: msdos
Number Start End Size Type File system Flags
1 32.3kB 107MB 107MB primary ext3 boot, raid
2 107MB 500GB 500GB primary raid
Information: Don't forget to update /etc/fstab, if necessary.
[root@patty ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
104320 blocks [2/2] [UU]
md1 : active raid1 sdb2[1] sda2[0]
488279488 blocks [2/2] [UU]
unused devices:
[root@patty ~]# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroupSys-LogVolRoot
30G 7.4G 21G 27% /
/dev/md0 99M 17M 77M 18% /boot
/dev/mapper/VolGroupSys-LogVolOpt
59G 5.8G 50G 11% /opt
/dev/mapper/VolGroupSys-LogVolU01
97G 188M 92G 1% /u01
tmpfs 48G 0 48G 0% /dev/shm |
Let's look at the LVM setup a little closer. We have a physical volume with a 465GB volume group (VolGroupSys). From here, we have LogVolRoot (30GB), LogVolOpt (60GB), and LogVolU01 (100GB). That leaves us with more than 250GB free on each SC for either adding new filesystems, growing existing filesystems, or taking LVM snapshots.
[root@patty ~]# pvdisplay --- Physical volume --- PV Name /dev/md1 VG Name VolGroupSys PV Size 465.66 GB / not usable 3.44 MB Allocatable yes PE Size (KByte) 32768 Total PE 14901 Free PE 8053 Allocated PE 6848 PV UUID Q99xBf-AMdf-so7d-VF8J-cIjB-cHeQ-cbWzSD [root@patty ~]# vgdisplay --- Volume group --- VG Name VolGroupSys System ID Format lvm2 Metadata Areas 1 Metadata Sequence No 5 VG Access read/write VG Status resizable MAX LV 0 Cur LV 4 Open LV 4 Max PV 0 Cur PV 1 Act PV 1 VG Size 465.66 GB PE Size 32.00 MB Total PE 14901 Alloc PE / Size 6848 / 214.00 GB Free PE / Size 8053 / 251.66 GB VG UUID 96cHoA-qhpG-A2hr-tbG1-c1oW-k9lI-MGURFU [root@patty ~]# lvdisplay --- Logical volume --- LV Name /dev/VolGroupSys/LogVolRoot VG Name VolGroupSys LV UUID 5qpcEM-aPPA-hGGf-GBBs-nwRi-HUNN-qrsIQp LV Write Access read/write LV Status available # open 1 LV Size 30.00 GB Current LE 960 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:0 --- Logical volume --- LV Name /dev/VolGroupSys/LogVolOpt VG Name VolGroupSys LV UUID Ge110F-s9oB-Eqak-muo0-3yWn-GEuN-klI1Rs LV Write Access read/write LV Status available # open 1 LV Size 60.00 GB Current LE 1920 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:1 --- Logical volume --- LV Name /dev/VolGroupSys/LogVolU01 VG Name VolGroupSys LV UUID lDBl2R-ZpX7-QZxJ-t8wI-DKde-kANX-zPf3XP LV Write Access read/write LV Status available # open 1 LV Size 100.00 GB Current LE 3200 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:2 --- Logical volume --- LV Name /dev/VolGroupSys/LogVolSwap VG Name VolGroupSys LV UUID Z2shPb-Dbe6-oVIR-hq2z-nDHs-xfNW-43miCW LV Write Access read/write LV Status available # open 1 LV Size 24.00 GB Current LE 768 Segments 1 Allocation inherit Read ahead sectors auto - currently set to 256 Block device 253:3 |
Network Configuration
The ODA has 8 (6 GbE and 2 10GbE) physical ethernet ports available to you, along with 2 internal fibre ports that are used for a built-in cluster interconnect. Here's the output of running "ethtool" on the internal NICs:
[root@patty ~]# ethtool eth0
Settings for eth0:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: external
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000001 (1)
Link detected: yes
[root@patty ~]# ethtool eth1
Settings for eth1:
Supported ports: [ FIBRE ]
Supported link modes: 1000baseT/Full
Supports auto-negotiation: Yes
Advertised link modes: 1000baseT/Full
Advertised auto-negotiation: Yes
Speed: 1000Mb/s
Duplex: Full
Port: FIBRE
PHYAD: 0
Transceiver: external
Auto-negotiation: on
Supports Wake-on: d
Wake-on: d
Current message level: 0x00000001 (1)
Link detected: yes |
Surprisingly, these NICs aren't bonded, so we have 2 separate cluster interconnects, which means that we also have 2 HAIP devices. Also, eth1 and eth2 (the onboard NICs) were used to create a bond for the public traffic.
[patty:oracle:+ASM1] /home/oracle > oifcfg getif eth0 192.168.16.0 global cluster_interconnect eth1 192.168.17.0 global cluster_interconnect bond0 192.168.8.0 global public |
Note that the default configuration of the ODA doesn't include a management network, like the Exadata does. That doesn't mean that you can't set up a management network, just that it's not part of the initial setup process.
There's a little overview of what's inside the ODA. The next piece in the series will go into a little more detail on the disks, as well as the Oracle configuration.







September 21st, 2011 - 12:16
X4370M2? That’s just silly when there was no M1.
September 21st, 2011 - 12:23
Don’t put it past Oracle….they started the database product with version 2.
September 21st, 2011 - 15:56
The would have been a Sun product in development called X4370. Since Oracle’s acquisition of Sun the product has been further developed and thus tagged M2.
September 22nd, 2011 - 04:06
the Nehalem product without M2
all westmere server has M2 tag
September 22nd, 2011 - 04:30
Nic post questions
1)can you share the cluster interconnect cables and Internal card that has two ge and
2)there is 4 more slots open, can one add more SSD?
3)why the name oakcli?
4)what is the function of two UART port?
5)any picture of two SAS HBA connect to both server?
thx
September 22nd, 2011 - 09:45
@laotsao – The 2 UART ports are used for the internal cluster interconnect. Because the UDA is designed to only be used as a 2-node RAC, they eliminated the need for a cluster interconnect that is cabled. As for adding more SSD, there’s no room…It’s got 24 disk slots, and has 20 hard disks, 4 SSD. oakcli comes from “Oracle Appliance Kit CLI.”
When I’m back in the office, I’ll try to get some more pics of the inside. We’ll see if I can get the guys to let me take one of the nodes down.
September 22nd, 2011 - 14:53
thx, yes there is no more slots open with 4 SSD and 20 SAS HDD
September 23rd, 2011 - 05:10
Great post, Andy!
Only one thing: The DBFS_DG is not wasted space if you implement the DBFS database (therefore the new name of the diskgroup) with it’s tablespaces there. That is the recommended way to host flat files (that you may use for SQL*Loader or External Tables) on Exadata.
November 1st, 2011 - 08:59
Andy are there any gotchas with the shared storage ?
Can you address it from each controller on each SC ? Assuming you want a volume on each SC.
Do you have to take the triple protection on the controller or could you take a two raid 6 volumes at 4800GB each ?
Cheers Vin
November 2nd, 2011 - 09:51
Vin,
The shared storage is configured to be used within ASM diskgroups. By default, there are 3 diskgroups created: DATA, RECO, and REDO. If you want a shared filesystem between the 2 SCs, you will use either external NFS storage, or create a volume using ACFS. It is my understanding that the only protection for the ASM disks is through ASM redundancy, which we have seen to be very resilient. High redundancy is how the box was set up by the configurator, and there was not an option through the GUI to change that. If you are running this in a production environment, I would definitely recommend running with high redundancy. There is no hardware RAID used on the ASM disks. The 500GB disks that are in the back are isolated to each SC.
November 13th, 2011 - 20:40
One thing that I mentioned above were the 2 external SAS ports. I’ve heard from a couple of people at Oracle that the ODA does not support using these external connections. It sounds like (no official confirmation) that the only supported methods of storage expansion are using NFS (preferably direct NFS) and iSCSI. We’re working on iSCSI in our lab, and it’s not as straightforward as you would expect. Results on that in a future post.
January 19th, 2012 - 15:59
Hi there ,
Can you please tell me what are ip address requirement for EE single node installation?
Is that 2 ip per ODA is enough for EE single node installation?
Thanks
Ahmed
January 23rd, 2012 - 15:16
Ahmed,
No matter what your configuration is, you will need 8 IP addresses. That includes 2 for the ILOMs (each SC has an ILOM), 2 for the SCs, 2 for VIPs (each SC will have a vip), and 2 for the scan (because the cluster will only have 2 nodes, it only needs 2 IPs for the scan). While it isn’t required to have the ILOMs connected to the network, it is definitely recommended. Also, even if you chose to not run RAC for the ODA, you will still get a clustered grid infrastructure, which will utilize the VIPs and scan. This is included free of charge when you license enterprise edition.
February 9th, 2012 - 08:00
Hi,
There is a requirement that the interconnect will use switched network and not cross cable.
How is it implemented in ODA ? Do they have internal switch on the somehow changed the concept and are using cross cable
Hadar
February 9th, 2012 - 15:20
They’re not really switched interfaces, but the internal NICs use the onboard Intel 82576 chip. From the RAC FAQ (note #220970.1):
——————————————————————————————
Is crossover cable supported as an interconnect with RAC on any platform ?
NO. CROSS OVER CABLES ARE NOT SUPPORTED. The requirement is to use a switch:
Detailed Reasons:
1) cross-cabling limits the expansion of RAC to two nodes
2) cross-cabling is unstable:
a) Some NIC cards do not work properly with it. They are not able to negotiate the DTE/DCE clocking, and will thus not function. These NICS were made cheaper by assuming that the switch was going to have the clock. Unfortunately there is no way to know which NICs do not have that clock.
b) Media sense behaviour on various OS’s (most notably Windows) will bring a NIC down when a cable is disconnected. Either of these issues can lead to cluster instability and lead to ORA-29740 errors (node evictions).
Due to the benefits and stability provided by a switch, and their afforability ($200 for a simple 16 port GigE switch), and the expense and time related to dealing with issues when one does not exist, this is the only supported configuration.
From a purely technology point of view Oracle does not care if the customer uses cross over cable or router or switches to deliver a message. However, we know from experience that a lot of adapters misbehave when used in a crossover configuration and cause a lot of problems for RAC. Hence we have stated on certify that we do not support crossover cables to avoid false bugs and finger pointing amongst the various parties: Oracle, Hardware vendors, Os vendors etc…
——————————————————————————————
It’s my understanding that Oracle has tested against these chips and has verified that the issues above are not present on this particular chip.