HP MSA1000 controller failover

The HP MSA1000 is an entry-level disk storage array that can communicate over different types of interfaces, such as SCSI and FC, and supports FC failover. This FC failover, however, is controller failover and not path failover. It means that if the primary controller fails entirely, the backup controller will “kick in”. However, if a multipath-capable client loses its primary interface, there is no guarantee that communication with the disks through the backup controller will work.

The symptom I encountered was that while the primary path was down for one of the servers, the secondary path still exposed the disks to that server but did not allow any SCSI I/O operations. This prevented the Linux server’s SCSI layer from accessing the disks. They did appear when running “cat /proc/scsi/scsi”, but they were not detected by, for example, “fdisk -l”, and the system logs filled up with “SCSI Error” messages.
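To tell “listed but unreadable” apart from “really gone”, one can compare what /proc/scsi/scsi reports with whether a device node actually answers a read. Below is a minimal Python sketch, assuming the usual “Host: … Channel: … Id: … Lun: …” layout of /proc/scsi/scsi. The function names parse_proc_scsi and can_read are my own (hypothetical), and mapping a SCSI entry to its /dev/sdX node is left to the reader.

```python
import re

# Matches the "Host: scsi0 Channel: 00 Id: 00 Lun: 01" header line.
HOST_RE = re.compile(
    r"Host:\s*(\S+)\s+Channel:\s*(\d+)\s+Id:\s*(\d+)\s+Lun:\s*(\d+)"
)
# Matches the "Vendor: ... Model: ... Rev: ..." line that follows it.
MODEL_RE = re.compile(r"Vendor:\s*(.*?)\s+Model:\s*(.*?)\s+Rev:")


def parse_proc_scsi(text):
    """Parse the text of /proc/scsi/scsi into a list of device dicts."""
    devices = []
    for line in text.splitlines():
        host = HOST_RE.search(line)
        if host:
            devices.append({
                "host": host.group(1),
                "channel": int(host.group(2)),
                "id": int(host.group(3)),
                "lun": int(host.group(4)),
                "vendor": "",
                "model": "",
            })
            continue
        model = MODEL_RE.search(line)
        if model and devices:
            # Attach vendor/model to the most recent "Host:" entry.
            devices[-1]["vendor"] = model.group(1)
            devices[-1]["model"] = model.group(2)
    return devices


def can_read(path, nbytes=512):
    """Return True if the first bytes of the device can be read.

    Hypothetical helper: a read may be served from cache, so a True
    result is weaker evidence than a False one.
    """
    try:
        with open(path, "rb") as dev:
            dev.read(nbytes)
        return True
    except OSError:
        return False
```

Feed parse_proc_scsi the contents of /proc/scsi/scsi, then call can_read (as root) on the matching device nodes. An entry that is listed while its node fails to read would match the symptom described above.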

About a month ago, after almost two years, a new firmware update was released (can be found here). It comes in two versions: Active/Passive and Active/Active.

I have upgraded the MSA1000 storage device.

After installing the Active/Active firmware upgrade (note to Linux users: you must have X running to use the “msa1500flash” utility) and power cycling the MSA1000 device, things started to look good.

I tested it with a person on-site disconnecting fiber connections on demand, and it worked great. Failover took about 2-5 seconds.

Since this system runs Oracle RAC on OCFS2, I had to update the failed-node timeout to 31 seconds (per Oracle’s OCFS site, which includes some really good tips).
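For reference, the OCFS2 disk heartbeat is usually tuned through O2CB_HEARTBEAT_THRESHOLD, which counts 2-second heartbeat iterations rather than seconds. As far as I recall from the OCFS2 FAQ, the relation is threshold = (timeout in seconds / 2) + 1. The sketch below assumes that formula; the helper names are hypothetical, and it is worth checking the result against the FAQ for your OCFS2 version before touching a cluster.

```python
HEARTBEAT_INTERVAL_SECONDS = 2  # assumed O2CB disk heartbeat interval


def threshold_for_timeout(timeout_seconds):
    """Threshold value for a desired node timeout (assumed FAQ formula)."""
    return timeout_seconds // HEARTBEAT_INTERVAL_SECONDS + 1


def timeout_for_threshold(threshold):
    """Node timeout in seconds implied by a given threshold value."""
    return (threshold - 1) * HEARTBEAT_INTERVAL_SECONDS
```

Under this assumption, a threshold of 31 corresponds to a timeout of 60 seconds. Either way, the timeout has to be comfortably longer than the 2-5 second failover measured above.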

So real high availability can be achieved after upgrading the MSA1000 firmware.


3 Responses to “HP MSA1000 controller failover”

  1. Ian Harper Says:

    Shalom,

    I have been looking at your blog and also your entries on the Redhat Certified forum.

    We also have the MSA1000, and recently two disks (which were a mirror of each other) came up with fail lights on, and the Oracle database (on ASM on the MSA1000) died and wouldn’t recover. We had to get Oracle in to recover the data with their DUL utility. Have you experienced anything like this?

    Also, have you any experience of RHEL on the DL145 G3 servers?

    Finally, how easy is it to get work in Israel if you’re not Israeli or Jewish?

    Toda raba
    Ian

  2. Ez-Aton Says:

    Hi.
    Answers to your questions:

    1. I have successfully avoided using ASM because of the complicated procedure required to recover data. I know a person who created a generic application using the tnslsnr just to allow access to this data, and he is one of the better Oracle DBAs I know.

    I know that a hot backup (which cannot happen in an ASM environment) or an archive log backup can do a decent job of recovering the DB, although I don’t know how to do it (I could search for it on Google, but I haven’t had to do it yet).

    It seems odd to me that two disks failed at the same time. Maybe they failed at different times, and you didn’t monitor the storage, and therefore didn’t get a warning in time?

    I have never used Oracle’s DUL utility, and have no experience with it.

    I have experience with DL145. What is the problem?

  3. sandrar Says:

    Hi! I was surfing and found your blog post… nice! I love your blog. 🙂 Cheers! Sandra. R.
