Monthly Archives: September 2011

What’s New With Exadata – September 2011

Over the past few weeks, I’ve been working on some new (and older) installations of Exadata, and came across a few items that piqued my interest.  Each of these things had been on my mind for a while, but it’s nice to see them finally resolved.

The first is a small change to the installation tree of the Oracle homes on Exadata.  With the release of 11.2.0.2, Oracle created a new “best practice” of installing all patch sets out of place, into a new Oracle home.  While this makes it really easy to roll back a patch, the default naming convention for Oracle homes on Exadata became a bit of a sticky situation.  If your 11.2.0.2 Grid Infrastructure home was at /u01/app/11.2.0/grid, where would you put your 11.2.0.3 home when it’s ready to come out?  This was the topic of more than a few discussions around the Enkitec office.  Do you extend the version in the directory path out another digit to 11.2.0.3, or append the version to the home name (/u01/app/11.2.0/grid_11.2.0.2, etc.)?  Well, Oracle has put this discussion to rest.  Your new Oracle home directories on Exadata are:

Grid Infrastructure - /u01/app/11.2.0.2/grid
Database - /u01/app/oracle/product/11.2.0.2/dbhome_1
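
To make the out-of-place model concrete, here’s a rough sketch of what the directory tree could look like once a future 11.2.0.3 patch set has been installed alongside.  The 11.2.0.3 paths are purely my assumption, extrapolated from the convention above:

# hypothetical layout after an out-of-place 11.2.0.3 installation
# (the 11.2.0.3 paths below are assumed, not announced)
$ ls -d /u01/app/*/grid /u01/app/oracle/product/*/dbhome_1
/u01/app/11.2.0.2/grid
/u01/app/11.2.0.3/grid
/u01/app/oracle/product/11.2.0.2/dbhome_1
/u01/app/oracle/product/11.2.0.3/dbhome_1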

Read on for another change (it has to do with bundle patches).

Inside the Oracle Database Appliance – Part 1

We’ve had a few weeks to play around with the ODA in our office, and I’ve been able to crack it open and get into the software and hardware that powers it.

For starters, the system runs a new model of Sun Fire – the X4370 M2.  The 4U chassis is basically 2 separate 2U blades (Oracle is calling them system controllers – SCs) that have direct attached storage on the front.  Here’s a listing of the hardware in each SC:

Sun Fire X4370 M2 System Controller Components (2 SCs per X4370 M2):

  • CPU: 2x 6-core Intel Xeon X5675 3.06GHz
  • Memory: 96GB 1333MHz DDR3
  • Network: 2x 10GbE (SFP+) PCIe card, 4x 1GbE PCIe card, 2x 1GbE onboard
  • Internal Storage: 2x 500GB SATA for operating system, 1x 4GB internal USB
  • RAID Controller: 2x SAS-2 LSI HBA
  • Shared Storage: 20x 600GB 3.5″ 15,000 RPM SAS hard drives, 4x 73GB 3.5″ SSDs
  • External Storage: 2x external MiniSAS ports
  • Operating System: Oracle Enterprise Linux 5.5 x86-64

Pictures of a real live ODA after the break.


Oracle Announces Oracle Database Appliance

Oracle has announced a new product, the “Oracle Database Appliance,” or ODA (pronounced oh-duh) as I like to call it.  Enkitec has been fortunate enough to get our hands on a test box.  Be sure to check out my post on a deep dive (LINK GOES HERE) inside the ODA.

The gist of the ODA is that it’s a small RAC (though RAC isn’t required) in a box.  Contrary to the rumors, it’s not a “mini-Exadata” as some people have speculated.  As you would expect, there’s no capability for smart scans.  The ODA does build on one of Exadata’s big advantages: rapid installation time.  In a typical Oracle installation, a surprising amount of time is lost just getting a server ready to host an Oracle database.  The following things usually have to be done before a system is ready:

  • racking and cabling the system to power and network
  • connecting servers to the SAN
  • allocating LUNs on the SAN
  • installing the operating system
  • configuring the operating system for Oracle database use (kernel and memory settings; a sample is sketched after this list)
  • mounting LUNs from the SAN and ensuring multipathing
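
For a sense of what that OS configuration step involves, here are a few of the kernel parameters a DBA would normally have to set by hand in /etc/sysctl.conf before an Oracle install.  The values below are illustrative minimums only, not a complete or tuned list:

# illustrative kernel settings normally hand-tuned before an Oracle install
# (values are examples only; real settings depend on memory and workload)
fs.aio-max-nr = 1048576
fs.file-max = 6815744
kernel.shmall = 2097152
kernel.shmmax = 4294967296
kernel.shmmni = 4096
kernel.sem = 250 32000 100 128
net.ipv4.ip_local_port_range = 9000 65500
net.core.rmem_max = 4194304
net.core.wmem_max = 1048576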

With the ODA, you only have to perform the first task.  Everything else is taken care of.  The OS is installed and optimized, storage connected, and multipathing configured.  It may not sound like much, but how many projects have you seen delayed because the SAN switch wasn’t zoned correctly, etc?

While many people will say that this machine doesn’t appeal to a mass market, there are plenty of Oracle shops that could use a system with 12-24 cores and 4TB of usable space.  It’s not built to be a data warehouse or OLTP beast…just a really solid machine with plenty of redundancy running an Oracle database.

OpenWorld 2011 Presentation – Sizing the FRA

I’ll be presenting at OpenWorld this year with Cristobal Pedregal-Martin from Oracle. Our session is titled “How to Best Configure, Size, and Monitor the Oracle Database Fast Recovery Area.” While it may be an afterthought for many DBAs, the FRA is something that requires some planning, especially in Exadata environments. Cris will be speaking on guidelines for sizing and maintaining the FRA, while I’ll be adding nuggets of wisdom based on my experience in the field. It should be a good session all around. We’re session number 13445, Moscone South 304, Thursday at 3:00. Plan accordingly, as I’m sure it will be a packed house.

The abstract of our talk is:

“The Oracle Database fast recovery area (FRA) provides storage and automated space management for recovery-related files and is a key piece of your high-availability strategy. This session covers best practices for configuring, sizing, and monitoring the FRA. It explains how your choice of logs and backups managed by the FRA affects database availability and discusses how to size the FRA to satisfy your recovery requirements, including those addressed by Oracle Flashback. It also explores how the FRA uses and recycles storage space to enable you to better estimate, define, and monitor your recovery retention policies and flashback windows. Finally, the session presents some common data protection scenarios and discusses how to configure the FRA in each for best results.”
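
As background to the abstract, the FRA is governed by two initialization parameters, and the V$RECOVERY_AREA_USAGE view shows what is consuming it.  A minimal sketch (the size and diskgroup name are placeholders, not recommendations):

-- the size must be set before (or along with) the destination; values are placeholders
ALTER SYSTEM SET db_recovery_file_dest_size = 500G SCOPE=BOTH SID='*';
ALTER SYSTEM SET db_recovery_file_dest = '+RECO' SCOPE=BOTH SID='*';

-- see what is consuming the FRA and how much of it is reclaimable
SELECT file_type, percent_space_used, percent_space_reclaimable, number_of_files
FROM   v$recovery_area_usage;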

Exadata Storage on Demand

One of the common refrains regarding Exadata storage is that there’s no real capacity for adding storage as your database grows.  The routine was always to let the storage guys dole out storage as needed, keeping tight reins on where their precious gigabytes (now terabytes) went.  When a database outgrew the storage it was allocated, a new LUN was requested, and after much gnashing of teeth, it was given to the systems group to present to the database.

Just like many things with Exadata, this process is turned on its head.  The standard Exadata way is to give all of the storage to ASM and let the DBAs make sure they don’t run around drunk on the amount of raw storage available.  But what if, like most environments, you’re going to grow into your storage requirements over time?  What many people won’t tell you is that you don’t necessarily have to license every component on an Exadata simply because it’s available for purchase (more on that in a future post).

Say that you’re in the market for an Exadata, and while a half rack may suit your needs today, in 12 or 18 months you’ll need the space provided by a full rack.  While it is possible to purchase an upgrade later, remember that you will be given whatever Oracle’s current Exadata hardware is at that time.  If you originally purchased a V2 last year, and Oracle is only offering X2-2 (or whatever gets announced at OpenWorld) components, you will end up with dissimilar compute and storage nodes.  Certain operations, such as decryption (thanks to the hardware assist available in the X2), will perform better on the X2 storage cells than on the V2 storage cells, which leads to inconsistent performance.  If you need consistent hardware across the rack, but don’t need all of it from day one (for either logistical or financial reasons), it is possible to license only what you need.  Granted, you will have to pay for all of the hardware up front, but the support and license costs are only paid when you actually put the hardware to use.  Some people may balk at this approach, but it’s essentially what storage administrators have been doing for years.  This is just storage that’s isolated to a particular system, instead of being available to a larger group of systems.

But, what happens when you need to add storage?  Do you have to take an outage?  Do you need to bounce the cluster?  The answer is no, and the process is pretty simple.  In my case, we were working with a half rack that was only licensed as a 1/4 rack.  That means we purchased 7 storage cells, but are only licensing 3.  With storage server licensing at $120,000 per cell (12 disks at $10,000 per disk), that’s a savings of $480,000 in licenses, not to mention the support costs.

The system was originally configured as a half rack, so all of the griddisks were created, and the ASM diskgroups were configured to use all 7 storage servers.  To get back to the licensed number of storage servers, we removed cells 4 through 7 one at a time, performing a rebalance in between.  The easiest way to do this was to set the DISK_REPAIR_TIME attribute for each diskgroup to 1 minute through SQL*Plus:

SYS:+ASM1>select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g  where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;
 
Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     3.6h
DBFS_DG                        disk_repair_time     3.6h
RECO_MOS1                      disk_repair_time     3.6h
 
SYS:+ASM1>alter diskgroup DBFS_DG set attribute 'disk_repair_time'='1m';
 
Diskgroup altered.
 
SYS:+ASM1>alter diskgroup RECO_MOS1 set attribute 'disk_repair_time'='1m';
 
Diskgroup altered.
 
SYS:+ASM1>alter diskgroup DATA_MOS1 set attribute 'disk_repair_time'='1m';
 
Diskgroup altered.
 
SYS:+ASM1> select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g  where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;
 
Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     1m
DBFS_DG                        disk_repair_time     1m
RECO_MOS1                      disk_repair_time     1m

By doing this, ASM will drop the disks and rebalance the diskgroup after they have been offline for 1 minute.  Setting the value this low is only appropriate during the process of dropping the unlicensed storage servers from the grid.  After we have dropped them, the value will be reset to the default of 3.6 hours.  Now, we can shut off one of the storage cells.  After ASM notices that the disks are no longer available and the repair timer expires, the disks are dropped and a rebalance is started.  When the rebalance is complete, the process is repeated until we are down to the licensed number of cells.  After the storage servers have been removed from ASM, the disk repair timer is set back to the default, and the /etc/oracle/cell/network-config/cellip.ora file on each compute node is modified to list only the storage cells that are licensed.  While this isn’t required, it will prevent ASM from querying the cells that aren’t being used for Exadata storage, so the total disk discovery time will be shorter, as it’s not waiting for the unused cells to time out.

[acolvin@enkdb01 ~]$ cat /etc/oracle/cell/network-config/cellip.ora 
cell="192.168.12.5"
cell="192.168.12.6"
cell="192.168.12.7"
#cell="192.168.12.8"
#cell="192.168.12.9"
#cell="192.168.12.10"
#cell="192.168.12.11"

This is all fairly routine (boring) stuff.  The good part is what happens when we need to add capacity.  Say that something in the database has changed, and you need more space quickly.  You don’t have to wait to order a single storage cell, price out an expansion rack, or go through the process of ordering and installing an upgrade kit.  Simply log in to each compute node and uncomment the line in /etc/oracle/cell/network-config/cellip.ora that relates to the storage cell you’re powering on, then boot up the cell.  There is no need to bounce CRS for the new value in the cellip.ora file to take effect.  Once the cell has booted up and cellsrv is running, ASM will notice the disks are available, add them to the relevant diskgroups, and start a rebalance to move data onto them.  You’ll see the following lines in the ASM alert log:

Tue Sep 13 18:36:09 2011
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=X.X.X.X)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Tue Sep 13 18:52:47 2011
Starting background process XDWK
Tue Sep 13 18:52:48 2011
XDWK started with pid=29, OS id=10912 
Tue Sep 13 18:52:50 2011
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
NOTE: Assigning number (2,30) to disk (o/192.168.10.11/DBFS_DG_CD_07_mos1cel07)
NOTE: Assigning number (2,31) to disk (o/192.168.10.11/DBFS_DG_CD_09_mos1cel07)
NOTE: Assigning number (2,32) to disk (o/192.168.10.11/DBFS_DG_CD_05_mos1cel07)
NOTE: Assigning number (2,33) to disk (o/192.168.10.11/DBFS_DG_CD_10_mos1cel07)
NOTE: Assigning number (2,34) to disk (o/192.168.10.11/DBFS_DG_CD_04_mos1cel07)
NOTE: Assigning number (2,35) to disk (o/192.168.10.11/DBFS_DG_CD_02_mos1cel07)
NOTE: Assigning number (2,36) to disk (o/192.168.10.11/DBFS_DG_CD_03_mos1cel07)
NOTE: Assigning number (2,37) to disk (o/192.168.10.11/DBFS_DG_CD_11_mos1cel07)
NOTE: Assigning number (2,38) to disk (o/192.168.10.11/DBFS_DG_CD_06_mos1cel07)
NOTE: Assigning number (2,39) to disk (o/192.168.10.11/DBFS_DG_CD_08_mos1cel07)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: membership refresh pending for group 2/0xe224a1b (DBFS_DG)
Tue Sep 13 18:52:56 2011
GMON querying group 2 at 10 for pid 19, osid 29830
NOTE: cache opening disk 30 of grp 2: DBFS_DG_CD_07_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_07_mos1cel07
NOTE: cache opening disk 31 of grp 2: DBFS_DG_CD_09_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_09_mos1cel07
NOTE: cache opening disk 32 of grp 2: DBFS_DG_CD_05_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_05_mos1cel07
NOTE: cache opening disk 33 of grp 2: DBFS_DG_CD_10_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_10_mos1cel07
NOTE: cache opening disk 34 of grp 2: DBFS_DG_CD_04_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_04_mos1cel07
NOTE: cache opening disk 35 of grp 2: DBFS_DG_CD_02_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_02_mos1cel07
NOTE: cache opening disk 36 of grp 2: DBFS_DG_CD_03_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_03_mos1cel07
NOTE: cache opening disk 37 of grp 2: DBFS_DG_CD_11_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_11_mos1cel07
NOTE: cache opening disk 38 of grp 2: DBFS_DG_CD_06_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_06_mos1cel07
NOTE: cache opening disk 39 of grp 2: DBFS_DG_CD_08_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_08_mos1cel07
NOTE: Attempting voting file refresh on diskgroup DBFS_DG
GMON querying group 2 at 11 for pid 19, osid 29830
SUCCESS: refreshed membership for 2/0xe224a1b (DBFS_DG)
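
Once the disks are back online, a quick query against V$ASM_DISK verifies that the re-added cell’s disks have joined their diskgroups.  A minimal sketch, assuming the cell IP shown in the alert log above:

SYS:+ASM1> select failgroup, mode_status, count(*) disks, round(sum(total_mb)/1024) gb
             from v$asm_disk
            where path like 'o/192.168.10.11/%'
            group by failgroup, mode_status;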

After the rebalance is complete, the storage has been added, and everything is ready to go.  No downtime needed.  Keep in mind that the same process applies when adding capacity in other ways, whether by purchasing individual storage cells or an Exadata storage expansion rack.  In those cases, the new griddisks will need to be created to match the existing griddisk sizes.
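
As a rough sketch of that step, the existing sizes can be pulled from any current cell with CellCLI and reused when creating griddisks on the new storage.  The prefixes and the size shown here are assumptions for illustration only:

-- on an existing cell, note the current griddisk sizes
CellCLI> list griddisk attributes name, size

-- on the new cell, create griddisks using matching prefixes and the sizes reported above
-- (the prefix and size values below are placeholders)
CellCLI> create griddisk all harddisk prefix=DATA_MOS1, size=423G
CellCLI> create griddisk all harddisk prefix=RECO_MOS1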