One of the common refrains regarding Exadata storage is that there’s no real capacity for adding storage as your database grows. The routine was always to let the storage guys dole out storage as needed, keeping tight reins on where their precious gigabytes (now terabytes) went. When a database outgrew the storage it was allocated, a new LUN was requested, and after much gnashing of teeth, it was given to the systems group to present to the database.
Just like many things with Exadata, this process is turned on its head. The standard Exadata way is to give all of the storage to ASM, and allow the DBAs to make sure that they don’t run around drunk off of the amount of raw storage available. But what if you’re like most environments, where you’re going to grow into your storage requirements over time? What many people won’t tell you is that you don’t necessarily have to license every component on an Exadata simply because it’s available for purchase (more on that in a future post).
Say that you’re in the market for an Exadata, and while a half rack may suit your needs today, in 12 or 18 months you’ll need the space provided by a full rack. While an upgrade is available for purchase, remember that you will be given whatever Oracle’s current Exadata hardware is at that time. If you originally purchased a V2 last year, and Oracle is only offering X2-2 (or whatever gets announced at OpenWorld) components, you will end up with dissimilar compute and storage nodes. Certain processes like decryption (due to the hardware assist on decryption available in the X2) will perform better on the X2 storage cells than on the V2 storage cells, which leads to inconsistent performance depending on which cells serve a request. If you need consistent hardware across the rack, but don’t need all of it from day one (for either logistical or financial reasons), it is possible to license only what you need. Granted, you will have to pay for all of the hardware up front, but the support and licensing costs are only paid when you actually use the features. Some people may balk at this approach, but it’s essentially what storage administrators have been doing for years. This is just storage that’s isolated to a particular system, instead of being available to a larger group of systems.
But what happens when you need to add storage? Do you have to take an outage to add storage? Do you need to bounce the cluster? The answer is that it’s pretty simple. In my case, we were working with a half rack that was only licensed as a quarter rack. That means that we had purchased 7 storage cells, but were only licensing 3. With storage server licensing at $120,000 per cell (12 disks at $10,000 per disk), leaving 4 cells unlicensed is a savings of $480,000 in licenses, not to mention the support costs.
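The savings figure is simple arithmetic, and a small shell sketch makes it easy to rerun with your own numbers. The function name `deferred_license_cost` is made up for illustration, and the prices are just the figures quoted above, not a current price list.

```shell
#!/bin/sh
# Hypothetical helper: license cost deferred by leaving cells unlicensed.
# Usage: deferred_license_cost <purchased_cells> <licensed_cells> \
#                              <disks_per_cell> <license_per_disk>
deferred_license_cost() {
    # (unlicensed cells) * (disks per cell) * (license cost per disk)
    echo $(( ($1 - $2) * $3 * $4 ))
}

# Figures from this post: 7 cells purchased, 3 licensed,
# 12 disks per cell, $10,000 per disk.
deferred_license_cost 7 3 12 10000
```

With the numbers from this system, the function prints 480000.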
The system was originally configured as a half rack, so all of the griddisks were created, and the ASM diskgroups were configured to use 7 storage servers. To get back to the licensed number of storage servers, we removed cells 4 through 7 one at a time, performing a rebalance after each removal. The easiest way to do this was to set the DISK_REPAIR_TIME attribute for each diskgroup to 1 minute through sqlplus:
SYS:+ASM1>select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;

Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     3.6h
DBFS_DG                        disk_repair_time     3.6h
RECO_MOS1                      disk_repair_time     3.6h

SYS:+ASM1>alter diskgroup DBFS_DG set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1>alter diskgroup RECO_MOS1 set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1>alter diskgroup DATA_MOS1 set attribute 'disk_repair_time'='1m';

Diskgroup altered.

SYS:+ASM1> select g.name "Diskgroup", a.name "Attribute", a.value "Value" from v$asm_attribute a, v$asm_diskgroup g where a.group_number=g.group_number and a.name='disk_repair_time' order by 1;

Diskgroup                      Attribute            Value
------------------------------ -------------------- ------------------------------
DATA_MOS1                      disk_repair_time     1m
DBFS_DG                        disk_repair_time     1m
RECO_MOS1                      disk_repair_time     1m
By doing this, ASM will dismount the disks and rebalance the diskgroup after a disk has been offline for 1 minute. A value this low should only be used while dropping the unlicensed storage servers from the grid; once they have been dropped, the value is reset to the default of 3.6 hours. Now, we can shut off one of the storage cells. After ASM notices that the disks are no longer available, it dismounts them and starts a rebalance. When the rebalance is complete, the process is repeated until we are down to the licensed number of cells. After the storage servers have been removed from ASM, DISK_REPAIR_TIME is set back to its default, and the /etc/oracle/cell/network-config/cellip.ora file on each compute node is modified to list only the storage cells that are licensed. While this isn’t required, it prevents ASM from querying the cells that aren’t being used for Exadata storage, so the total disk discovery time is shorter because ASM is not waiting for the unused cells to time out.
[acolvin@enkdb01 ~]$ cat /etc/oracle/cell/network-config/cellip.ora
cell="192.168.12.5"
cell="192.168.12.6"
cell="192.168.12.7"
#cell="192.168.12.8"
#cell="192.168.12.9"
#cell="192.168.12.10"
#cell="192.168.12.11"
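Editing cellip.ora by hand on every compute node is easy to get wrong, so here is a minimal sketch of toggling a single cell entry with sed. The `set_cell` function is a hypothetical helper, not an Oracle tool, and it assumes the one-entry-per-line `cell="IP"` layout shown above. The demo runs against a scratch copy rather than the real file.

```shell
#!/bin/sh
# Hypothetical helper: comment out or re-enable one cell entry in a
# cellip.ora-style file. Usage: set_cell <file> <ip> on|off
set_cell() {
    file=$1 ip=$2 state=$3
    # Escape the dots so they match literally in the sed pattern.
    ip_re=$(printf '%s' "$ip" | sed 's/\./\\./g')
    case $state in
        # off: prefix the matching entry with '#'
        off) sed "s/^cell=\"${ip_re}\"/#&/" "$file" > "$file.new" ;;
        # on: strip the leading '#' from the matching entry
        on)  sed "s/^#\(cell=\"${ip_re}\"\)/\1/" "$file" > "$file.new" ;;
        *)   echo "state must be on or off" >&2; return 1 ;;
    esac
    mv "$file.new" "$file"
}

# Demo on a scratch copy; the real file lives at
# /etc/oracle/cell/network-config/cellip.ora on each compute node.
demo=$(mktemp)
printf '%s\n' 'cell="192.168.12.7"' '#cell="192.168.12.8"' > "$demo"
set_cell "$demo" 192.168.12.8 on
cat "$demo"
rm -f "$demo"
```

Because the closing quote is part of the pattern, toggling 192.168.12.1 would not touch an entry for 192.168.12.10 or 192.168.12.11. On a real system you would want a backup of the file first, and the change has to be repeated on every compute node.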
This is all fairly routine (boring) stuff. The good part is what happens when we need to add capacity. Say that something in the database has changed, and you need more space quickly. You don’t have to wait to order a single storage cell, price out an expansion rack, or go through the process of ordering and installing an upgrade kit. Simply log in to each compute node and uncomment the line in /etc/oracle/cell/network-config/cellip.ora that relates to the storage cell you’re powering on, then boot up the cell. There is no need to bounce CRS to get the new value in the cellip.ora file to take. Once the cell has booted up and cellsrv is running, ASM will take over and notice the disks are available, add them to the relevant diskgroups, and start a rebalance to get the data moved over. You’ll see the following lines in the alert log for ASM:
Tue Sep 13 18:36:09 2011
ALTER SYSTEM SET local_listener='(DESCRIPTION=(ADDRESS_LIST=(ADDRESS=(PROTOCOL=TCP)(HOST=X.X.X.X)(PORT=1521))))' SCOPE=MEMORY SID='+ASM2';
Tue Sep 13 18:52:47 2011
Starting background process XDWK
Tue Sep 13 18:52:48 2011
XDWK started with pid=29, OS id=10912
Tue Sep 13 18:52:50 2011
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: disk validation pending for group 2/0xe224a1b (DBFS_DG)
NOTE: Assigning number (2,30) to disk (o/192.168.10.11/DBFS_DG_CD_07_mos1cel07)
NOTE: Assigning number (2,31) to disk (o/192.168.10.11/DBFS_DG_CD_09_mos1cel07)
NOTE: Assigning number (2,32) to disk (o/192.168.10.11/DBFS_DG_CD_05_mos1cel07)
NOTE: Assigning number (2,33) to disk (o/192.168.10.11/DBFS_DG_CD_10_mos1cel07)
NOTE: Assigning number (2,34) to disk (o/192.168.10.11/DBFS_DG_CD_04_mos1cel07)
NOTE: Assigning number (2,35) to disk (o/192.168.10.11/DBFS_DG_CD_02_mos1cel07)
NOTE: Assigning number (2,36) to disk (o/192.168.10.11/DBFS_DG_CD_03_mos1cel07)
NOTE: Assigning number (2,37) to disk (o/192.168.10.11/DBFS_DG_CD_11_mos1cel07)
NOTE: Assigning number (2,38) to disk (o/192.168.10.11/DBFS_DG_CD_06_mos1cel07)
NOTE: Assigning number (2,39) to disk (o/192.168.10.11/DBFS_DG_CD_08_mos1cel07)
SUCCESS: validated disks for 2/0xe224a1b (DBFS_DG)
NOTE: membership refresh pending for group 2/0xe224a1b (DBFS_DG)
Tue Sep 13 18:52:56 2011
GMON querying group 2 at 10 for pid 19, osid 29830
NOTE: cache opening disk 30 of grp 2: DBFS_DG_CD_07_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_07_mos1cel07
NOTE: cache opening disk 31 of grp 2: DBFS_DG_CD_09_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_09_mos1cel07
NOTE: cache opening disk 32 of grp 2: DBFS_DG_CD_05_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_05_mos1cel07
NOTE: cache opening disk 33 of grp 2: DBFS_DG_CD_10_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_10_mos1cel07
NOTE: cache opening disk 34 of grp 2: DBFS_DG_CD_04_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_04_mos1cel07
NOTE: cache opening disk 35 of grp 2: DBFS_DG_CD_02_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_02_mos1cel07
NOTE: cache opening disk 36 of grp 2: DBFS_DG_CD_03_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_03_mos1cel07
NOTE: cache opening disk 37 of grp 2: DBFS_DG_CD_11_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_11_mos1cel07
NOTE: cache opening disk 38 of grp 2: DBFS_DG_CD_06_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_06_mos1cel07
NOTE: cache opening disk 39 of grp 2: DBFS_DG_CD_08_MOS1CEL07 path:o/192.168.10.11/DBFS_DG_CD_08_mos1cel07
NOTE: Attempting voting file refresh on diskgroup DBFS_DG
GMON querying group 2 at 11 for pid 19, osid 29830
SUCCESS: refreshed membership for 2/0xe224a1b (DBFS_DG)
After the rebalance is complete, the storage has been added, and everything is ready to go. No downtime needed. Keep in mind that the same processes are in place for adding other storage through purchasing single storage cells, or adding Exadata expansion racks. In these cases, the griddisks will need to be configured to match the existing griddisk sizes.
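Both the removal and the addition steps hinge on knowing when a rebalance has finished, and the removal procedure ends with putting disk_repair_time back to its default. As a rough sketch, the shell function below only prints the SQL for those two tasks so you can review it before running anything. The function name `reset_repair_time_sql` is made up for illustration, and the diskgroup names are the ones from this system. The query against v$asm_operation assumes that view lists only operations still in progress, so an empty result suggests the rebalance is done.

```shell
#!/bin/sh
# Hypothetical helper: print the SQL that restores disk_repair_time to
# the 3.6h default on each named diskgroup, followed by a query for
# ASM operations (such as rebalances) that are still running.
reset_repair_time_sql() {
    for dg in "$@"; do
        printf "alter diskgroup %s set attribute 'disk_repair_time'='3.6h';\n" "$dg"
    done
    # Expect no rows here once no rebalance is in progress.
    printf '%s\n' 'select group_number, operation, state, est_minutes from v$asm_operation;'
}

# Diskgroup names from this post; substitute your own.
reset_repair_time_sql DATA_MOS1 DBFS_DG RECO_MOS1
```

After reviewing the generated statements, you could pipe them into a sqlplus session connected to the ASM instance. How you connect is specific to your environment, so it is left out here.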