Posts Tagged ‘Oracle RAC’

Oracle ASM and EMC PowerPath

Wednesday, May 28th, 2008

Setting up Oracle ASM disks is rather simple, and the procedure can be easily obtained from here, for example. This is nice and pretty, and works well for most environments.

EMC PowerPath creates meta devices which utilize the underlying paths, as scsi_mod sees them in Linux, without hiding them (unlike IBM’s RDAC, for example). As a result, each LUN can be viewed and accessed either through the PowerPath meta device (/dev/emcpower*) or through the underlying SCSI disk device (/dev/sd*). You can list the existing paths of a single meta device by running the command

powermt display dev=emcpowera

where ‘emcpowera’ is an example. It can be any of your PowerPath meta devices. The output lists the underlying SCSI devices.

During startup, Oracle ASM (startup script: /etc/init.d/oracleasm) scans all block devices for ASM headers. On a system with many LUNs, this can take a while (half an hour, and sometimes much more). Worse, since ASM scans the available block devices in a semi-random order, chances are very high that a /dev/sd* device will be used instead of the matching /dev/emcpower* block device. This degrades performance where PowerPath has been set up in an active-active configuration (because the multipathing is bypassed). Moreover, a failure of that specific link makes the LUN inaccessible, regardless of any other existing paths to it.

To "set things right", you need to edit /etc/sysconfig/oracleasm and exclude all ‘sd’ devices from the ASM scan.
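As a minimal sketch, the relevant fragment of /etc/sysconfig/oracleasm could look like the following. The variable names are the ones found in the stock file shipped with ASMLib; setting ORACLEASM_SCANORDER as well is an addition not mentioned above, so check both against your own copy of the file.

```shell
# /etc/sysconfig/oracleasm (fragment)

# Assumption: scan the PowerPath meta devices first (prefix match on the device name)
ORACLEASM_SCANORDER="emcpower"

# Exclude the underlying single-path SCSI devices from the scan
ORACLEASM_SCANEXCLUDE="sd"
```

After editing the file, restarting the oracleasm service (or running /etc/init.d/oracleasm scandisks) should make the new scan rules take effect.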

To verify that you’re actually using the right block device:

/etc/init.d/oracleasm listdisks

Select any one of the disk group (DG) disks, and then query it:

/etc/init.d/oracleasm querydisk DATA1
Disk "DATA1" is a valid ASM disk on device [120, 6]

The numbers are the major and minor numbers of the block device. You can easily find the device through this command:

ls -la /dev/ | grep MAJOR | grep MINOR

In our example, the MAJOR will be 120, and the MINOR will be 6. The result should be a single block device.
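The plain grep can also match on sizes, dates or device names that happen to contain the same digits. A slightly stricter variant, as a sketch only, compares the major and minor columns directly. The function name match_major_minor is hypothetical, and the field positions assume the usual GNU ls -l layout for device nodes.

```shell
# Hypothetical helper: read `ls -l` output on stdin and print the names of
# block devices whose major/minor numbers match the two arguments.
match_major_minor() {
    awk -v maj="$1" -v min="$2" '
        # Block device lines start with "b"; field 5 is "MAJOR," and field 6 is MINOR
        $1 ~ /^b/ && $5 == (maj ",") && $6 == min { print $NF }
    '
}

# Usage: ls -l /dev | match_major_minor 120 6
```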

If you’re using EMC PowerPath, your block device major will be 120 or close to it. If you’re (mistakenly) using one of the underlying paths, your major will be 8 or a nearby number. If you’re using Linux LVM, your major will be around 253. The expected result when using EMC PowerPath is always a major of 120, meaning one of the /dev/emcpower* devices is in use.
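To run this check over every ASM disk rather than one at a time, something along these lines may help. It is a sketch: the function names are hypothetical, the parsing assumes the querydisk output format shown in the example above, and the expected major of 120 is taken from this post and may differ on your system.

```shell
# Hypothetical helper: extract the major number from a querydisk line on stdin,
# e.g. 'Disk "DATA1" is a valid ASM disk on device [120, 6]' gives 120
asm_disk_major() {
    sed -n 's/.*on device \[\([0-9]*\), *[0-9]*].*/\1/p'
}

# Hypothetical helper: warn about every ASM disk that is not on major 120
check_asm_disks() {
    for disk in $(/etc/init.d/oracleasm listdisks); do
        major=$(/etc/init.d/oracleasm querydisk "$disk" | asm_disk_major)
        if [ "$major" != "120" ]; then
            echo "WARNING: $disk is on major ${major:-unknown}, not a PowerPath device"
        fi
    done
}
```

Calling check_asm_disks as root prints nothing when all disks sit on PowerPath meta devices.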

Excluding the ‘sd’ devices from the scan also decreases the boot time rather dramatically.

HP MSA1000 controller failover

Tuesday, March 27th, 2007

HP MSA1000 is an entry-level disk storage system that can communicate via different types of interfaces, such as SCSI and FC, and supports FC failover. This FC failover, however, is controller failover and not path failover. It means that if the primary controller fails entirely, the backup controller will "kick in". However, if a multipath-capable client loses its primary interface, there is no guarantee that communication with the disks through the backup controller will work.

The symptom I encountered was that the secondary path, while exposing the disks to the server (with the primary path down for one of the servers), did not allow any SCSI I/O operations. This prevented the Linux server’s SCSI layer from accessing the disks. They did appear when running "cat /proc/scsi/scsi", but they were not detected by, for example, "fdisk -l", and the system logs filled up with "SCSI Error" messages.

About a month ago, after almost two years, a new firmware update was released (it can be found here). Two versions exist: Active/Passive and Active/Active.

I have upgraded the MSA1000 storage device.

After installing the Active/Active firmware upgrade (note to Linux users: you must have X to run the "msa1500flash" utility) and power cycling the MSA1000 device, things started to look good.

I tested failover with a person on-site disconnecting fiber connections on demand, and it worked great: about 2-5 seconds of failover time.

Since this system runs Oracle RAC and uses OCFS2, I had to update the failed-node timeout to 31 seconds (per Oracle’s OCFS site, which includes some really good tips).
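The parameter itself is not named here. If the 31 refers to O2CB_HEARTBEAT_THRESHOLD in /etc/sysconfig/o2cb (an assumption), note that this setting counts heartbeat iterations rather than seconds: the OCFS2 FAQ gives the relation as threshold = (timeout in seconds / 2) + 1, so a value of 31 corresponds to a 60-second timeout. A small sketch of that arithmetic, with a hypothetical function name:

```shell
# Hypothetical helper: convert a desired heartbeat timeout (in seconds)
# into an O2CB_HEARTBEAT_THRESHOLD value, using threshold = timeout / 2 + 1
o2cb_threshold_for() {
    echo $(( $1 / 2 + 1 ))
}

# Usage: o2cb_threshold_for 60
```

The resulting value would then go into /etc/sysconfig/o2cb on every node of the cluster, since the nodes need to agree on it.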

So real High Availability can be achieved after upgrading the MSA1000 firmware.