Exadata V2 Battery Replacement
For some reason, I've been working on lots of Exadata V2 systems in the past few months. One of the issues that I've been coming across for these clients is a failure in the battery that is used by the RAID controller. These batteries were originally expected to last 2 years. Unfortunately, a defect causes them to reach their end of life after approximately 18 months. The local Sun reps should have access to a schedule that says when the "regular maintenance" should occur. For one client, it wasn't caught until the batteries had run down completely and the disks were in WriteThrough mode. This can be seen by running MegaCli64. Here is the output to check the WriteBack/WriteThrough status for 2 different compute nodes (V2 is first, X2-2 is second):
[enkdb01:root] /root > dmidecode -s system-product-name
SUN FIRE X4170 SERVER
[enkdb01:root] /root > /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL | grep "Cache Policy"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy : Disabled

[root@enkdb03 ~]# dmidecode -s system-product-name
SUN FIRE X4170 M2 SERVER
[root@enkdb03 ~]# /opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -LALL -aALL | grep "Cache Policy"
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy : Disabled
If you have a V2 and you haven't replaced the batteries yet, it's worth running these commands to see what state your RAID controllers are in. To find out what this means for you, read on after the break.
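If you want to run this check across many nodes, the comparison can be scripted. Here is a minimal Python sketch that flags any logical drive whose current write policy differs from its default. The function name is made up for illustration, and the sketch assumes the output looks exactly like the grep results above.

```python
def degraded_cache_policies(megacli_output):
    """Return (default, current) write-policy pairs that do not match.

    Assumes each logical drive prints a "Default Cache Policy" line
    followed by a "Current Cache Policy" line, as in the output above.
    """
    default = None
    mismatches = []
    for line in megacli_output.splitlines():
        label, sep, value = line.partition(":")
        if not sep:
            continue
        label = label.strip()
        # The write policy is the first comma-separated field of the value.
        write_policy = value.split(",")[0].strip()
        if label == "Default Cache Policy":
            default = write_policy
        elif label == "Current Cache Policy" and default is not None:
            if write_policy != default:
                mismatches.append((default, write_policy))
            default = None
    return mismatches


V2_OUTPUT = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy : Disabled
"""

X2_OUTPUT = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Disk Cache Policy : Disabled
"""

print(degraded_cache_policies(V2_OUTPUT))
print(degraded_cache_policies(X2_OUTPUT))
```

An empty list means the controller is running with its default policy. Anything else is worth a closer look at the battery.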
New Exadata Full Stack Patches
Oracle has announced a new patching strategy for Exadata, starting with databases running 11.2.0.3. Oracle will be moving away from the monthly bundle patch philosophy, which was panned by many administrators as coming too often to keep up with, given the tight schedules held around most Exadata systems. Instead, Oracle will be releasing a Quarterly Database Patch for Exadata, or QDPE. The QDPE will most likely be released in conjunction with the standard critical patch updates (CPUs). Oracle will still release interim bundle patches, but recommends that customers install only the QDPEs unless they have a specific need for a bundle patch. Note that so far, the QDPEs are only being released for 11.2.0.3 - Linux x86_64, SPARC Solaris (supercluster), and Solaris x86_64.
In addition to the QDPE release, Oracle has announced a "full stack QDPE" - the Quarterly Full Stack Download Patch, or QFSDP. This "full stack" patch includes all of the latest software that can be found in MOS note #888828.1. The January 2012 QFSDP includes:
- Infrastructure Software
  - Exadata Storage Server version 11.2.2.4.2
  - Exadata Infiniband Switch version 1.3.3-2
  - Exadata PDU firmware version 1.04
- Database
  - 11.2.0.3 January 2012 QDPE
  - Opatch 11.2.0.1.9
  - OPlan
- Systems Management
  - Patches for 11g OEM agents
  - Management plugins for 11g OEM
  - Patches for 11g OEM management server
No word on when Oracle will start including patches for the new OEM 12c. Keep in mind that this is just a collection of patches; they all still need to be installed as if they were downloaded separately. Oracle does not yet have a mechanism in place to apply the QDPE, storage server patches, Infiniband switch patches, etc. in one fell swoop.
The current QDPE patch is January 2012 (patch #13513783), and the current QFSDP is January 2012 (patch #13551280).
Exadata Critical Patch for 11.2.2.3.x through 11.2.2.4.1
Oracle has released a critical patch for storage server versions 11.2.2.3.x through 11.2.2.4.1. While 11.2.2.4.1 was released last week, there were a few one-off patches from 11.2.2.4.0 that didn't seem to make it into the release. Oracle has since released 11.2.2.4.2 (patch #13513611, supplemental note #1388400.1). Similar to 11.2.2.4.1, this release looks to patch several outstanding issues. Here's the list of bugs fixed from the readme for 11.2.2.4.2:
12764521 INFINIBAND DIAG COMMANDS (LIKE IBDIAGNET AND IBNETDISCOVER) ARE NOT WORKING
13083530 10 GB-E BONDED INTERFACES FAILING- EXADATA
13410353 AFTER UPGRADE TO 11.2.2.4 INFINIBAND CMDS IBDIAGNET, IBNETDISCOVER NOT WORKING
13489032 CHECKHWNFWPROFILE DOES NOT DETECT FAILED FLASH FDOM
13489445 ORA-600 [OSSMISC:OSSMISC_TIMER] WHEN NTPD DETECTED 6 MILLISECOND TIME DIFFERENCE
13512932 FIX INSTALLED WORKAROUND FOR NTP UPDATE BUG 13489445
As you can see, the previously mentioned bugs have been fixed. There's another bug that was fixed in 11.2.2.4.1 that could be an issue for anybody running 11.2.2.3.x through 11.2.2.4.0. This bug (13454147) can remove the flashcache from a cell that has an uptime of 6 months or greater. Fortunately, Oracle has released a patch that includes fixes for these critical issues in the event that you can't quickly upgrade to 11.2.2.4.2. I wouldn't advise running this version for at least a couple of weeks; I always advise clients to wait that long for the early adopters to weed out any major issues.
Applying the critical patch only takes a minute, and doesn't take the storage servers or database instances offline. After it's done, a restart of cellsrv needs to be scheduled, but that can be done in a rolling fashion. Read on for an example of applying this patch. As always, do not apply any patch to a production system before appropriately testing against a non-production system!
Exadata Diskgroup Planning
As business has picked up since OpenWorld (didn't think that was possible, but that's another story for another day), we have been seeing more customers adopt or seriously look at Exadata as an option for new hardware implementations. While many will complain that there isn't enough room for customization in the rigid process of configuring an Exadata system, there are still many possibilities to make your Exadata your own, whether it's during the initial configuration phase or shortly thereafter. Of course, some of these modifications can be difficult to implement after the system is up and running with users logging in. I'm planning on starting a series of posts regarding a couple of the hot-button topics with regard to Exadata configuration - ASM diskgroup layout (the topic for today), role separated vs standard authentication, and so on. As these topics have no right answers, I'm more than open to a dialogue where you may disagree. On to the good stuff!
A Quick Primer - The Exadata Storage Architecture
Ok...so we're looking at Exadata specifically in this post. In the examples listed below, we'll discuss a quarter rack, since it's the easiest to diagram. To expand to half or full racks, just adjust the number of cells (7, 14) and disks (84, 168) accordingly. To see the relationship between the compute nodes (database servers), Infiniband switches, and storage servers, refer to figure 1:

Figure 1 - Exadata Infiniband/Storage Connectivity
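The cell and disk counts above are simple multiples of each other, since every storage cell holds 12 disks. Here's a small Python sketch of that arithmetic. The dictionary and function names are made up for illustration, and the quarter rack is taken to be the standard 3 storage cells.

```python
DISKS_PER_CELL = 12  # each storage cell holds 12 disks

# Storage cells per rack size: 3 for a quarter rack, then 7 and 14.
CELLS_PER_RACK = {"quarter": 3, "half": 7, "full": 14}


def disk_count(rack_size):
    """Total number of disks for the given rack size."""
    return CELLS_PER_RACK[rack_size] * DISKS_PER_CELL


for size in ("quarter", "half", "full"):
    print(size, CELLS_PER_RACK[size], "cells,", disk_count(size), "disks")
```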
Exadata 11.2.2.4.0 10GbE Issue Resolved
It appears that Oracle has resolved the 10GbE driver issue that was introduced in version 11.2.2.4.0. There is an updated note (1376664.1) that includes the patch to fix it. The issue was apparently related to TCP segmentation offloading, and can be fixed by installing the patch found in the note listed above. It does not require a reboot, and is similar to the fix for the IDT switch bug fixed in 11.2.2.4.0. Note that this bug only affected X2-2 systems utilizing 10 gigabit ethernet on the compute nodes. Oracle again recommends installing the 11.2.2.4.0 minimal pack on compute nodes.
After starting the service, users should see the following:
(root)# service disable10gigtso_13083530 start
Skipping igb interface eth0 using driver version 2.1.0-k2-1 - TSO disable unnecessary
Found ixgbe interface eth4 using driver version 2.0.84-k2 - Disabling TSO ... [SUCCESS]
Found ixgbe interface eth5 using driver version 2.0.84-k2 - Disabling TSO ... [SUCCESS]
Exadata Critical Issues with 11.2.2.4.0 Patch
Oracle has updated the supplemental note for 11.2.2.4.0 (1348647.1) with a handful of critical issues. First, there is an issue reported by Vishal Gupta regarding 10 gigabit ethernet on the database servers. Currently, there is no workaround for this bug, and customers utilizing 10 GbE on the database servers are advised to stay on 11.2.2.3.5 for the minimal pack while upgrading the storage servers to 11.2.2.4.0.
Additionally, there is a new issue that can be seen if the firmware on a disk fails to upgrade correctly during the patching process. If the firmware does not update correctly, it is possible that cellsrv will drop the celldisk (and corresponding griddisks), causing a loss of data on those disks. Before upgrading a cell to 11.2.2.4.0, check the status of your physical disks from cellcli with the following command:
cellcli -e 'list physicaldisk attributes luns where physicalInsertTime = null'
The command should return no output. If it does, Oracle recommends rebooting each cell that returns output from this command. After the reboot has completed, run the command again to verify that the disks are ready to be patched. Here is the output from one of Enkitec's quarter rack systems (note that there is no output given, so the disks are ok to be upgraded):
[enkcel01:root] /root > dcli -g cell_group -l root cellcli -e 'list physicaldisk attributes luns where physicalInsertTime = null'
[enkcel01:root] /root >
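On a larger system, it can help to boil the dcli output down to a list of cells that need attention. Below is a rough Python sketch. It assumes dcli prefixes each output line with the cell name and a colon. The sample lines are invented for illustration and did not come from a real system, and the function name is hypothetical.

```python
def cells_needing_reboot(dcli_output):
    """Return the sorted names of cells that produced any output.

    Assumes every non-empty line has the form "<cellname>: <text>".
    """
    cells = set()
    for line in dcli_output.splitlines():
        if not line.strip():
            continue
        host, sep, _rest = line.partition(":")
        if sep:
            cells.add(host.strip())
    return sorted(cells)


# Invented sample output: two cells report disks with a null insert time.
SAMPLE_OUTPUT = """\
enkcel02: 0_5
enkcel02: 0_7
enkcel03: 0_1
"""

print(cells_needing_reboot(SAMPLE_OUTPUT))
print(cells_needing_reboot(""))  # no output at all: nothing to reboot
```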
Remember that these issues are not mentioned in the standard README files included in the 11.2.2.4.0 patch. Before applying any Exadata Storage Server patches, always consult the supplemental note that is referenced in the supported versions for Exadata note (#888828.1), as these notes are updated after a patch has been released.
Looking Back on 60+ Exadata Implementations
After seeing this press release, I couldn't help but think back on the last year and a half that I've been working on Exadata, and all of the interesting projects and implementations we've worked on. When you think about the number of Exadata systems that are out there (Oracle claims over 1,000) and the number that we at Enkitec have sold (29), it's pretty impressive, at least to me: 75% of all Exadata systems in North America not sold by Oracle were sold by Enkitec.
Going back over a few of them, we've worked with the following packaged applications:
- eBusiness Suite
- PeopleSoft
- OBIEE
- Informatica
- Oracle Warehouse Builder
Not to mention a number of custom applications based around code that was developed in house. There have been OLTP, data warehouse, and mixed load environments. We've moved 9.2 databases into Exadata using export/import, 11.2 databases using RMAN, and more than a few live migrations/upgrades using GoldenGate.
One of the first Exadata systems we worked on was our own, back when information was limited (if you think it's hard to get info today, imagine what it was like when there weren't many out there). We had no help going through the configuration worksheets. I'll always remember looking it over and saying "You mean I need HOW many IPs for a quarter rack?!?!" From there, we learned about the system from building ours from the ground up. We chose not to purchase the Oracle installation service, and through a couple of "learning experiences" we picked up quite a few valuable skills on the internals and core of Exadata. Without having our own box to break and fix, we wouldn't have learned what we did. We ran through the quarter rack to half rack upgrade, and learned the hard way that without labels for the cables, your upgrade isn't going to get very far.
From there, we started with a few engagements as Exadata started to take root in the Dallas area. We took on a project with a customer that had 2 half rack systems and wanted one of them split into 2 quarter racks. I even got to do a weekend-long patch-a-thon on a V2 system that was tabbed by Oracle as the "Exadata Basic" system that had 1 database server, 1 storage cell, and 1 Infiniband switch. That was a really interesting process and setup. We had another client that was running on a maxed out T3 SPARC system, and needed to get off of it badly. Their database was dying a slow death as the number of active sessions hogged the CPUs until there weren't any more resources left. We quickly moved them over to an M5000 while we worked out a path to move them from 10.2 on SPARC to Exadata with limited downtime. We used GoldenGate to keep the M5000 and Exadata databases in sync, then cut over once things were ready to go.
We took on clients needing to consolidate massive numbers of databases from various architectures and versions all onto one Exadata frame. One client migrated and consolidated 30 databases onto 2 quarter rack systems...all with the help of smart scans, and good resource management. We performed a few more split rack configurations along the way to help customers save on power costs, as buying 2 half racks wasn't feasible when looking at leasing costs for floor space in the datacenter.
2 of the more interesting implementations were more recent. One was a migration from a Sun e20k to an X2-8. The design included migrating a heavily transactional OLTP system with a separate data warehouse. In the past, they were unable to get both databases running on the same host, as one would completely overrun the other. We were able to combine the databases (~25TB) into one database and migrate them using GoldenGate, minimizing the cutover window to a couple of hours (mostly for application reconfiguration). Now that they're live on the X2-8, they're able to run reports that would never finish before. Processes that took hours now run in a matter of minutes. Full backups that took 48 hours to run now finish in under 10 hours. It's really cool to see the power of the system once you get it up and running.
The other interesting implementation was something you don't see very often: Exadata without RAC. I know, you probably wouldn't expect it, but it is possible (and supported) to run Exadata without RAC. From this standpoint, it becomes more of an HA, consolidation type of system. I'll have more on this in a future post, but basically, you create a clustered grid infrastructure (which means one set of ASM diskgroups if you so desire), and run single instance databases. That was definitely one of the coolest installs we've done, just because it's so unique.
All this to say - we've seen quite a bit of Exadata this past year or two, and I can't wait to see what's in store for the future. I'm sure that at some point, we'll see somebody running an Exadata on Solaris, a SPARC supercluster or two, and who knows what else Oracle is going to announce in the near future. Here's to another 60 implementations and beyond!
What’s New With Exadata – September 2011
Over the past few weeks, I've been working on some new (and older) installations of Exadata, and came across a few items that piqued my interest. Each of these things had been on my mind for a while, but it's nice to see them finally resolved.
The first is a small change to the installation tree of the Oracle homes on Exadata. With the release of 11.2.0.2, Oracle created a new "best practice" of performing all patch sets out of place into a new home. While this makes it really easy to roll back a patch, the default naming convention for Oracle homes on Exadata became a bit of a sticky situation. If your 11.2.0.2 Grid Infrastructure home was at /u01/app/11.2.0/grid, where would you put your 11.2.0.3 home when it's ready to come out? This was the topic of more than a few discussions around the Enkitec office. Do you extend the version out another digit to 11.2.0.3, or version the home (/u01/app/11.2.0/grid_11.2.0.2, etc.)? Well, Oracle has put this discussion to rest. Your new Oracle home directories on Exadata are:
Grid Infrastructure - /u01/app/11.2.0.2/grid
Database - /u01/app/oracle/product/11.2.0.2/dbhome_1
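If you script your installs, the convention is easy to capture. This is just a sketch with made-up function names. It assumes that later patch sets follow the same pattern with the full four-part version in the path, which Oracle has only shown for 11.2.0.2 here.

```python
def grid_home(version):
    """Grid Infrastructure home for a four-part version such as 11.2.0.2."""
    return f"/u01/app/{version}/grid"


def db_home(version, home_number=1):
    """Database home for a four-part version; home_number selects dbhome_N."""
    return f"/u01/app/oracle/product/{version}/dbhome_{home_number}"


print(grid_home("11.2.0.2"))
print(db_home("11.2.0.2"))
```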
Read on for another change (it has to do with bundle patches).
Exadata Storage on Demand
One of the common refrains regarding Exadata storage is that there's no real capacity for adding storage as your database grows. The routine was always to let the storage guys dole out storage as needed, keeping tight reins on where their precious gigabytes (now terabytes) went. When a database outgrew the storage it was allocated, a new LUN was requested, and after much gnashing of teeth, it was given to the systems group to present to the database.
Just like many things with Exadata, this process is turned on its head. The standard Exadata way is to give all of the storage to ASM, and allow the DBAs to make sure that they don't run around drunk off of the amount of raw storage available. But what if you're like most environments, where you're going to grow into your storage requirements over time? What many people won't tell you is that you don't necessarily have to license every component on an Exadata simply because it's available for purchase (more on that in a future post).
Say that you're in the market for an Exadata, and while a half rack may suit your needs today, in 12 or 18 months, you'll be needing the space provided by a full rack. While an upgrade is available for purchase, remember that you will be given whatever Oracle's current Exadata hardware is at that time. If you originally purchased a V2 last year, and Oracle is only offering X2-2 (or whatever gets announced at OpenWorld) components, you will end up with dissimilar compute and storage nodes. Certain processes like decryption (due to the hardware assist on decryption available in the X2) will perform better on the X2 storage cells than on the V2 storage cells, which leads to inconsistent performance. If you need to have consistent hardware across the rack, but don't have a need for all of it from day one (for either logistical or financial reasons), it is possible to license only what you need. Granted, you will have to pay for all of the hardware up front, but the support and licensing costs are only paid when you actually use the features. Some people may balk at this approach, but it's essentially what storage administrators have been doing for years. This is just storage that's isolated to a particular system, instead of being available to a larger group of systems.
But, what happens when you need to add storage? Do you have to take an outage to add storage? Do you need to bounce the cluster? The answer is that it's pretty simple. In my case, we were working with a half rack that was only licensed for a 1/4 rack. That means that we have purchased 7 storage cells, but are only licensing 3. With the storage server licensing at $120,000 per cell (12 disks at $10,000 per disk), that's a savings of $480,000 in licenses, not to mention the support costs.
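For anyone who wants to plug in their own numbers, here's the licensing arithmetic as a short Python sketch. The prices are the figures quoted above and the names are made up for illustration. Support costs are left out.

```python
DISKS_PER_CELL = 12
LICENSE_PER_DISK = 10_000  # dollars per disk, as quoted above


def license_cost(cells):
    """Storage server license cost in dollars for the given number of cells."""
    return cells * DISKS_PER_CELL * LICENSE_PER_DISK


purchased_cells = 7  # half rack
licensed_cells = 3   # quarter rack

savings = license_cost(purchased_cells) - license_cost(licensed_cells)
print(f"License savings: ${savings:,}")
```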
The system was originally configured as a half rack, so all of the griddisks were created, and the ASM diskgroups were configured to use 7 storage servers. To get back to the licensed number of storage servers, we removed cells 5 through 7 one at a time, and performed a rebalance in between. The easiest way to do this was to set the DISK_REPAIR_TIME attribute for each diskgroup to 1 minute through sqlplus:
SYS:+ASM1>select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;

Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     3.6h
DBFS_DG                        disk_repair_time     3.6h
RECO_MOS1                      disk_repair_time     3.6h

SYS:+ASM1>alter diskgroup DBFS_DG set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1>alter diskgroup RECO_MOS1 set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1>alter diskgroup DATA_MOS1 set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1> select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;

Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     1m
DBFS_DG                        disk_repair_time     1m
RECO_MOS1                      disk_repair_time     1m
By doing this, ASM will dismount the disks and rebalance the diskgroup after a disk has been offline for 1 minute. Setting the value this low is only to be used during the process of dropping the unlicensed storage servers from the grid. After we have dropped them, the value will be reset to the default value of 3.6 hours. Now, we can shut off one of the storage cells. After ASM has noticed that the disks are no longer available, the disks are dismounted and a rebalance is started. When the rebalance is complete, the process is repeated until we are down to the licensed number of cells. After the storage servers have been removed from ASM, the disk repair timer is set back to default, and the /etc/oracle/cell/network-config/cellip.ora file on each compute node is modified to only search for the storage cells that are licensed. While this isn't required, it will prevent ASM from querying the cells that aren't being used for Exadata storage, so the total disk discovery time will be shorter, as it's not waiting for the unused cells to time out.
[acolvin@enkdb01 ~]$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.12.5"
cell="192.168.12.6"
cell="192.168.12.7"
#cell="192.168.12.8"
#cell="192.168.12.9"
#cell="192.168.12.10"
#cell="192.168.12.11"
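Since the same edit has to be made on every compute node, it's a good candidate for a small script. The Python sketch below comments or uncomments a single cell entry in the text of a cellip.ora file. It only works on a string, the function name is made up, and it assumes one cell="..." entry per line as in the listing above. Back up the real file before writing anything back to it.

```python
def set_cell_enabled(cellip_text, ip, enabled):
    """Comment or uncomment the cell="<ip>" line; other lines are unchanged."""
    entry = f'cell="{ip}"'
    result = []
    for line in cellip_text.splitlines():
        # Strip any leading '#' so both active and commented entries match.
        if line.lstrip("#").strip() == entry:
            line = entry if enabled else "#" + entry
        result.append(line)
    return "\n".join(result) + "\n"


CELLIP = 'cell="192.168.12.5"\n#cell="192.168.12.8"\n'

# Enable the cell at 192.168.12.8 before powering it on.
print(set_cell_enabled(CELLIP, "192.168.12.8", True))
```

Calling it with enabled=False does the reverse, which covers the case of trimming the file down to the licensed cells.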
This is all fairly routine (boring) stuff. The good part is what happens when we need to add capacity. Say that something in the database has changed, and you need more space quickly. You don't have to wait to order a single storage cell, price out an expansion rack, or go through the process of ordering and installing an upgrade kit. Simply log in to each compute node and uncomment the line in /etc/oracle/cell/network-config/cellip.ora that relates to the storage cell you're powering on, then boot up the cell. There is no need to bounce CRS to get the new value in the cellip.ora file to take. Once the cell has booted up and cellsrv is running, ASM will take over and notice the disks are available, add them to the relevant diskgroups, and start a rebalance to get the data moved over. You'll see the following lines in the alert log for ASM:
Tue Sep 13 18:36:09 2011
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=X.X.X.X)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Tue Sep 13 18:52:47 2011
Starting background process XDWK
Tue Sep 13 18:52:48 2011
XDWK started with pid=29, OS id=10912
Tue Sep 13 18:52:50 2011
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
NOTE: Assigning number (2,30) to disk (o/192.168.10.11/DBFS_DG_CD_07_mos1cel07)
NOTE: Assigning number (2,31) to disk (o/192.168.10.11/DBFS_DG_CD_09_mos1cel07)
NOTE: Assigning number (2,32) to disk (o/192.168.10.11/DBFS_DG_CD_05_mos1cel07)
NOTE: Assigning number (2,33) to disk (o/192.168.10.11/DBFS_DG_CD_10_mos1cel07)
NOTE: Assigning number (2,34) to disk (o/192.168.10.11/DBFS_DG_CD_04_mos1cel07)
NOTE: Assigning number (2,35) to disk (o/192.168.10.11/DBFS_DG_CD_02_mos1cel07)
NOTE: Assigning number (2,36) to disk (o/192.168.10.11/DBFS_DG_CD_03_mos1cel07)
NOTE: Assigning number (2,37) to disk (o/192.168.10.11/DBFS_DG_CD_11_mos1cel07)
NOTE: Assigning number (2,38) to disk (o/192.168.10.11/DBFS_DG_CD_06_mos1cel07)
NOTE: Assigning number (2,39) to disk (o/192.168.10.11/DBFS_DG_CD_08_mos1cel07)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: membership refresh pending for group 2/0xe224a1b (DBFS_DG)
Tue Sep 13 18:52:56 2011
GMON querying group 2 at 10 for pid 19, osid 29830
NOTE: cache opening disk 30 of grp 2: DBFS_DG_CD_07_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_07_mos1cel07
NOTE: cache opening disk 31 of grp 2: DBFS_DG_CD_09_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_09_mos1cel07
NOTE: cache opening disk 32 of grp 2: DBFS_DG_CD_05_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_05_mos1cel07
NOTE: cache opening disk 33 of grp 2: DBFS_DG_CD_10_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_10_mos1cel07
NOTE: cache opening disk 34 of grp 2: DBFS_DG_CD_04_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_04_mos1cel07
NOTE: cache opening disk 35 of grp 2: DBFS_DG_CD_02_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_02_mos1cel07
NOTE: cache opening disk 36 of grp 2: DBFS_DG_CD_03_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_03_mos1cel07
NOTE: cache opening disk 37 of grp 2: DBFS_DG_CD_11_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_11_mos1cel07
NOTE: cache opening disk 38 of grp 2: DBFS_DG_CD_06_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_06_mos1cel07
NOTE: cache opening disk 39 of grp 2: DBFS_DG_CD_08_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_08_mos1cel07
NOTE: Attempting voting file refresh on diskgroup DBFS_DG
GMON querying group 2 at 11 for pid 19, osid 29830
SUCCESS: refreshed membership for 2/0xe224a1b (DBFS_DG)
After the rebalance is complete, the storage has been added, and everything is ready to go. No downtime needed. Keep in mind that the same processes are in place for adding other storage through purchasing single storage cells, or adding Exadata expansion racks. In these cases, the griddisks will need to be configured to match the existing griddisk sizes.
