Archive for the ‘Disk Storage’ Category

Raw devices for Oracle on RedHat (RHEL) 5

Tuesday, October 21st, 2008

There is major confusion among DBAs regarding how to set up raw devices for Oracle RAC or Oracle Clusterware. This confusion is caused by the change RedHat made in how raw devices are defined.

Raw devices are actually a manifestation of character devices pointing to block devices. Character devices are non-buffered, so they act as FIFO, and have no OS cache, which is why Oracle likes them so much for Clusterware CRS and voting.

On other Unix types, each disk device commonly has two device nodes: a block device (e.g. /dev/dsk/c0d0t0s1) and a character device (e.g. /dev/rdsk/c0d0t0s1). This is not the case on Linux, so a special “raw” (character) device has to be defined for each partition we want to participate in the cluster, either as a CRS or a voting disk.

On RHEL4, raw devices were set up easily using the simple and coherent file /etc/sysconfig/rawdevices, which included an internal example. On RHEL5 this is not the case, and the udev subsystem has to be customized instead, in a rather less documented way.

Check out the source of this information, at this entry about raw devices. I will add it here, anyhow, with a slight explanation:

1. Add to /etc/udev/rules.d/60-raw.rules:

ACTION=="add", KERNEL=="sdb1", RUN+="/bin/raw /dev/raw/raw1 %N"

2. To set permission (optional, but required for Oracle RAC!), create a new /etc/udev/rules.d/99-raw-perms.rules containing lines such as:

KERNEL=="raw[1-2]", MODE="0640", GROUP="oinstall", OWNER="oracle"

Notice this:

  1. The raw-perms.rules file name has to begin with the number 99, which defines its place in the order in which rules are applied, so that it is processed after all other rules. Using lower numbers might cause permissions to be incorrect.
  2. The following permissions have to apply:
  • OCR Device(s): root:oinstall, mode 0640
  • Voting device(s): oracle:oinstall, mode 0666
  3. You don’t have to use raw devices for ASM volumes on Linux, as the ASMLib library is very effective and easier to manage.
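
The two rule files above can also be generated for several partitions at once. The following is only a sketch and not part of the original procedure: the function names gen_raw_rules and gen_raw_perms are made up for illustration, and the partition names in the example are placeholders. The rules are printed to standard output so they can be reviewed before being added under /etc/udev/rules.d/.

```shell
#!/bin/sh
# Hypothetical helpers: print udev rules in the format shown above.

# gen_raw_rules PART...  binds each partition to /dev/raw/raw1, raw2, ...
gen_raw_rules() {
    n=1
    for part in "$@"; do
        printf 'ACTION=="add", KERNEL=="%s", RUN+="/bin/raw /dev/raw/raw%d %%N"\n' "$part" "$n"
        n=$((n + 1))
    done
}

# gen_raw_perms RAWDEV OWNER GROUP MODE  prints one permissions rule
gen_raw_perms() {
    printf 'KERNEL=="%s", MODE="%s", GROUP="%s", OWNER="%s"\n' "$1" "$4" "$3" "$2"
}

# Example (placeholder partitions; review the output before installing it):
#   gen_raw_rules sdb1 sdc1 >> /etc/udev/rules.d/60-raw.rules
#   gen_raw_perms raw1 root oinstall 0640   >> /etc/udev/rules.d/99-raw-perms.rules
#   gen_raw_perms raw2 oracle oinstall 0666 >> /etc/udev/rules.d/99-raw-perms.rules
```

Once the rules have been applied, running raw -qa lists the current raw device bindings.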

    Oracle RAC with EMC iSCSI Storage Panics

    Tuesday, October 14th, 2008

    I have had a system panicking when running the configuration below:

    • RedHat RHEL 4 Update 6 (4.6) 64bit (x86_64)
    • Dell PowerEdge servers
    • Oracle RAC 11g with Clusterware 11g
    • EMC iSCSI storage
    • EMC PowerPath
    • Vote and Registry LUNs are accessible as raw devices
    • Data files are accessible through ASM with ASMLib

    During reboots or shutdowns, the system used to panic just before the actual power cycle. Unfortunately, I do not have a screen capture of the panic…

    Tracing the problem, it seems that iSCSI, PowerIscsi (EMC PowerPath for iSCSI) and networking services are being brought down before the “killall” service stops the CRS.

    The init.crs service script was never executed with the “stop” argument during shutdown, because it never left a lock file (for example, in /var/lock/subsys). The rc scripts only stop services which have such a lock file, so the init.crs links in /etc/rc.d/rc6.d and /etc/rc.d/rc0.d had no effect.

    I have solved it by changing the /etc/init.d/init.crs script a bit:

    • On “Start” action, touch a file called /var/lock/subsys/init.crs
    • On “Stop” action, remove a file called /var/lock/subsys/init.crs

    Also, although I’m not sure it is necessary, I have changed the SysV execution order of the init.crs script in /etc/rc.d/rc0.d and /etc/rc.d/rc6.d from wherever it was (K96 in one case and K76 in another) to K01, so it would be executed with the “stop” parameter early during the shutdown or reboot cycle.

    It solved the problem, although future upgrades to Oracle ClusterWare will require being aware of this change.
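
The lock file change can be sketched as follows. This is not the actual content of init.crs; crs_lockfile is a hypothetical helper that would be called from the existing start and stop branches of the script, and LOCKFILE is made overridable only so the sketch can be tried outside /var/lock/subsys.

```shell
#!/bin/sh
# Path taken from the text above; override LOCKFILE for experiments.
LOCKFILE="${LOCKFILE:-/var/lock/subsys/init.crs}"

# crs_lockfile start|stop  (hypothetical helper name)
# Creates the lock file on start and removes it on stop, so that the
# rc scripts consider the service "running" and call it with "stop".
crs_lockfile() {
    case "$1" in
        start) touch "$LOCKFILE" ;;
        stop)  rm -f "$LOCKFILE" ;;
    esac
}
```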

    Hot adding Qlogic LUNs – the new method

    Friday, August 8th, 2008

    I have demonstrated how to hot-add LUNs to a Linux system with Qlogic HBA. This has become irrelevant with the newer method, available for RHEL4 Update 3 and above.

    The new method is as follows:

    echo 1 > /sys/class/fc_host/host<ID>/issue_lip
    echo "- - -" > /sys/class/scsi_host/host<ID>/scan

    Replace “<ID>” with your relevant HBA ID.

    Notice: the argument in the second line consists of three plain ASCII dashes separated by spaces (wildcards for channel, target and LUN), not a specially formatted Unicode dash.
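
On a machine with several HBA ports, the same two writes can be repeated for every FC host. A minimal sketch, assuming the sysfs layout used above; rescan_fc_hosts is a hypothetical name, and the optional argument (an alternative sysfs root) exists only so the loop can be tried against a dummy directory tree. The scan argument is assumed to be three dashes separated by spaces.

```shell
#!/bin/sh
# rescan_fc_hosts [SYSFS_ROOT]  (hypothetical helper, default root: /sys)
# Issues a LIP on every FC host and then rescans the matching SCSI host.
rescan_fc_hosts() {
    sysfs="${1:-/sys}"
    for fc in "$sysfs"/class/fc_host/host*; do
        [ -d "$fc" ] || continue          # no FC hosts: the glob did not match
        host="${fc##*/}"                  # e.g. "host3"
        echo 1 > "$fc/issue_lip"
        echo "- - -" > "$sysfs/class/scsi_host/$host/scan"
    done
}

# On a real system (as root):
#   rescan_fc_hosts
```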

    HP EVA SSSU and fixed LUN WWID

    Monday, July 14th, 2008

    Linux works perfectly well with multiple storage links using dm-multipath. Not only that, but HP has released their own spawn of dm-multipath, which is optimized (or so claimed, but, anyhow, well configured) to work with EVA and MSA storage devices.

    This is great, however, what do you do when mapping volume snapshots through dm-multipath? For each new snapshot, you get a new WWID, which will map to a new “mpath” name, or to the raw WWID (if “user_friendly_names” is not set). This can, and will, wreak havoc on remote scripts. After each reboot, the new and shiny snapshot will acquire a new name, thus making scripting a hellish experience.

    For the time being I have not tested ext3 labels. I suspect that using labels will fail, as the dm-multipath device does not hide the underlying sd devices, and thus the system might detect the same label more than once: once for each underlying device, and once for the dm-multipath device on top of them.

    A solution which is both elegant and useful is to fix the snapshots’ WWID through a small alteration to the SSSU command. Append a string such as this to the snap create command:

    WORLD_WIDE_LUN_Name="6300-0000-0000-0000-0010-0000"

    Don’t use the numbers supplied here. “Invent” your own :-)

    Mind you that you must use dashes, else the command will fail.

    Doing so will allow you to always use the same WWID for the snapshots, and thus – save tons of hassle after system reboot when accessing snapshots through dm-multipath.

    Oracle ASM and EMC PowerPath

    Wednesday, May 28th, 2008

    Setting up Oracle ASM disks is rather simple, and the procedure can be easily obtained from here, for example. This is nice and pretty, and works well for most environments.

    EMC PowerPath creates meta devices which utilize the underlying paths, as scsi_mod sees them in Linux, without hiding them (unlike IBM’s RDAC, for example). This results in the ability to view and access each LUN either through the PowerPath meta device (/dev/emcpower*) or through the underlying SCSI disk device (/dev/sd*). You can obtain the existing paths of a single meta device by running the command

    powermt display dev=emcpowera

    where ‘emcpowera’ is an example. It can be any of your power meta devices. You will see the underlying SCSI devices.

    During startup, Oracle ASM (startup script: /etc/init.d/oracleasm) scans all block devices for ASM headers. On a system with many LUNs, this can take a while (half an hour, and sometimes much more). Not only that, but since ASM scans the available block devices in a semi-random order, the chances are very high that a /dev/sd* device will be used instead of the /dev/emcpower* block device. This results in degraded performance where an active-active configuration has been set for PowerPath (because it will not be used), and moreover, a failure of that specific link will result in failure to access the LUN, regardless of any other existing paths to it.

    To "set things right", you need to edit /etc/sysconfig/oracleasm, and exclude all ’sd’ devices from ASM scan.
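
As a sketch of what that exclusion can look like: ASMLib reads two scan settings from /etc/sysconfig/oracleasm, ORACLEASM_SCANORDER and ORACLEASM_SCANEXCLUDE, which take space-separated device name prefixes. The values below assume PowerPath devices named emcpower*, as in this post; adjust them to your own naming.

```shell
# /etc/sysconfig/oracleasm (fragment)
# Scan the PowerPath meta devices first...
ORACLEASM_SCANORDER="emcpower"
# ...and skip the underlying single-path SCSI devices entirely
ORACLEASM_SCANEXCLUDE="sd"
```

The disks need to be rescanned (for example with /etc/init.d/oracleasm scandisks, or by restarting the service) before the change is visible.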

    To verify that you’re actually using the right block device:

    /etc/init.d/oracleasm listdisks

    Select any one of the DG disks, and then

    /etc/init.d/oracleasm querydisk DATA1
    Disk "DATA1" is a valid ASM disk on device [120, 6]

    The numbers are the major and minor of the block device. You can easily find the device through this command:

    ls -la /dev/ | grep MAJOR | grep MINOR

    In our example, the MAJOR will be 120, and the MINOR will be 6. The result would look like a single block device.
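
The double grep can match unrelated numbers (a date or a size, for example), so a slightly stricter lookup may be useful. This is a sketch only: find_blockdev is a hypothetical helper that reads ls -l output from standard input and assumes the usual long-listing layout, where the major (followed by a comma) and the minor appear as the fifth and sixth fields.

```shell
#!/bin/sh
# find_blockdev MAJOR MINOR  (hypothetical helper)
# Reads "ls -l" output on stdin and prints the names of block devices
# whose major/minor numbers match exactly.
find_blockdev() {
    awk -v maj="$1" -v min="$2" \
        '$1 ~ /^b/ && $5 == maj "," && $6 == min { print $NF }'
}

# Usage with the numbers from the example above:
#   ls -l /dev | find_blockdev 120 6
```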

    If you’re using EMC PowerPath, your block device major would be 120 and around that number. If you’re (mistakenly) using one of the underlying paths, your major would be 8 and nearby numbers. If you’re using Linux LVM, your major would be around the number 253. The expected result, when using EMC PowerPath is always with major of 120 – always using the /dev/emcpower* devices.

    This also decreases the boot time rather dramatically.

    dm-multipath and loss of all paths

    Tuesday, May 13th, 2008

    dm-multipath is a great tool. Its abilities were proven to me on many occasions, and I’m sure I’m not the only one. NetApp, for example, use it. HP use it as well (a slightly modified version, and still), and it works.

    A problem I have encountered is as follows: if a single path fails, the device-mapper continues to work correctly (as expected) and the remaining path becomes active. However, if the last link fails, all processes which require disk access hang. It means that many tests which search for a given process still pass even when this process is hung on a (forever) delayed access to the filesystem. Also, tests which attempt to write/read a file to/from such a stale filesystem hang themselves, which can bring down an entire system. Assume we have a cron job which creates a file every minute: every new process hangs immediately, so after an hour we’ll have 60 more processes, and after a day 1440 additional processes, all in uninterruptible sleep (D) and waiting for the disk to come back.
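
Since the piled-up processes all sit in state D, counting them is one cheap way to notice the situation without touching the affected filesystem. A minimal sketch, not a complete monitoring check: count_dstate is a hypothetical helper that reads process state codes (one per line, as printed by ps -eo stat=) from standard input.

```shell
#!/bin/sh
# count_dstate  (hypothetical helper)
# Reads process state codes on stdin and prints how many of them are
# in uninterruptible sleep (state code starting with "D").
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Usage:
#   ps -eo stat= | count_dstate
# A number that keeps growing from run to run suggests processes are
# blocked on I/O that never completes.
```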

    Certain detection systems actually fail to auto-detect cases of stale filesystems when using dm-multipath. This is caused by a (default) option called “1 queue_if_no_path”. I discovered that when this option is omitted, such as in the configuration below (only the “device” section):

    device
    {
        vendor "NETAPP"
        product "LUN"
        getuid_callout "/sbin/scsi_id -g -u -s /block/%n"
        prio_callout "/sbin/mpath_prio_ontap /dev/%n"
        # features "1 queue_if_no_path"
        hardware_handler "0"
        path_grouping_policy group_by_prio
        failback immediate
        rr_weight uniform
        rr_min_io 128
        path_checker readsector0
    }

    multiple path failures will actually result in the filesystem layer reporting I/O errors (which is good). A disk accessed through this configuration can be mounted with special parameters, such as (for ext2/3): errors=continue, errors=remount-ro or errors=panic. The last one is my favorite, as it ensures data integrity through a self-fencing mechanism.

    HP EVA bug – Snapshot removed through sssu is still there

    Friday, May 2nd, 2008

    This is an interesting bug I have encountered:

    The output of an sssu command should look like this:

    EVA> DELETE STORAGE "\Virtual Disks\Linux\oracle\SNAP_ORACLE"

    EVA>

    However, the snapshot (SNAP_ORACLE in this case) remains visible until “Ok” is pressed in the web interface.

    This happened to me on HP EVA with HP StorageWorks Command View EVA 7.0 build 17.

    When a second delete command is then given, it looks like this:

    EVA> DELETE STORAGE "\Virtual Disks\Linux\oracle\SNAP_ORACLE"

    Error: Error cannot get object properties. [ Deletion completed]

    EVA>

    When this command is given for a non-existing snapshot, it looks like this:

    EVA> DELETE STORAGE "\Virtual Disks\Linux\oracle\SNAP_ORACLE"

    Error: \Virtual Disks\Linux\oracle\SNAP_ORACLE not found

    So I run the removal command twice (scripted) on an sssu session without “halt_on_errors”. This removes the snapshots correctly.

    Quick provisioning of virtual machines

    Friday, February 1st, 2008

    When one wants to achieve fast provisioning of virtual machines, several solutions come to mind. The one I prefer uses Linux LVM snapshot capabilities to duplicate one working machine into a few.

    This can happen, of course, only if the host running VMware-Server is Linux.

    LVM snapshots have one vast disadvantage: performance. When a block on the source of the snapshot is changed for the first time, the original block is copied to the copy-on-write (COW) space of each and every snapshot. This means that creating a 1GB file on a volume with ten snapshots results in a total of 10GB of data copied across your disks. You cannot ignore this performance impact.

    LVM2 has support for read/write snapshots. I have come up with a nice way of utilizing this capability to my benefit. An R/W snapshot which is being changed does not replicate its changes to any other snapshot. All changes are considered local to this snapshot, and are maintained only in its own COW space. So adding a 1GB file to a snapshot has zero impact on the rest of the snapshots or volumes.

    The idea is quite simple, and it works like this:

    1. Create an adequate logical volume with a given size (I used 9GB for my own purposes). The name of the LV in my case will be /dev/VGVM3/centos-base

    2. Mount this LV on a directory, and create a VM inside it. In my case, it’s in /vmware/centos-base

    3. Install the VM as the baseline for all your future VMs. If you might not want Apache on some of them, don’t install it on the baseline.

    4. Install vmware-tools on the baseline.

    5. Disable the service “kudzu”

    6. Update as required

    7. In my case I always use DHCP. You can set it to obtain its IP once from a given location, or whatever you feel like.

    8. Shut down the VM.

    9. In the VM’s .vmx file add a line like this:

    uuid.action = "create"

    Below are the two scripts I have used to create and destroy VMs. They create the snapshot, mount it and register it, including a new MAC and UUID.

    create-replica.sh:

    #!/bin/sh
    # This script will replicate VMs from a given (predefined) source to a new system
    # Written by Ez-Aton, http://www.tournament.org.il/run
    # Arguments: name

    # FUNCTIONS BE HERE
    test_can_do () {
        # To be able to snapshot, we need a set of things to happen
        if [ -d "$DIR/$TARGET" ] ; then
            echo "Directory already exists. You don't want to do it..."
            exit 1
        fi
        # LV device nodes are not regular files, so test with -e and not -f
        if [ -e "$VG/$TARGET" ] ; then
            echo "Target snapshot exists"
            exit 1
        fi
        if [ "$(vmrun list | grep -c "$DIR/$SRC/$SRC.vmx")" -gt "0" ] ; then
            echo "Source VM is still running. Shut it down before proceeding"
            exit 1
        fi
        if [ "$(vmware-cmd -l | grep -c "$DIR/$TARGET/$SRC.vmx")" -ne "0" ] ; then
            echo "VM already registered. Unregister first"
            exit 1
        fi
    }

    do_snapshot () {
        # Take the snapshot
        lvcreate -s -n "$TARGET" -L "$SNAPSIZE" "$VG/$SRC"
        RET=$?
        if [ "$RET" -ne "0" ]; then
            echo "Failed to create snapshot"
            exit 1
        fi
    }

    mount_snapshot () {
        # This function creates the required directories and mounts the snapshot there
        mkdir "$DIR/$TARGET"
        mount "$VG/$TARGET" "$DIR/$TARGET"
        RET=$?
        if [ "$RET" -ne "0" ]; then
            echo "Failed to mount snapshot"
            exit 1
        fi
    }

    alter_snap_vmx () {
        # This function will alter the name in the VMX and make it the $TARGET name
        grep -v "displayName" "$DIR/$TARGET/$SRC.vmx" > "$DIR/$TARGET/$TARGET.vmx"
        echo "displayName = \"$TARGET\"" >> "$DIR/$TARGET/$TARGET.vmx"
        cat "$DIR/$TARGET/$TARGET.vmx" > "$DIR/$TARGET/$SRC.vmx"
        \rm "$DIR/$TARGET/$TARGET.vmx"
    }

    register_vm () {
        # This function will register the VM to VMware
        vmware-cmd -s register "$DIR/$TARGET/$SRC.vmx"
    }

    # MAIN
    if [ -z "$1" ]; then
        echo "Arguments: The target name"
        exit 1
    fi

    # Parameters:
    SRC=centos-base         # The name of the source image, and the source dir
    PREFIX=centos           # All targets will be created in the name centos-$NAME
    DIR=/vmware             # My VMware VMs default dir
    SNAPSIZE=6G             # My COW space
    VG=/dev/VGVM3           # The name of the VG
    TARGET="$PREFIX-$1"

    test_can_do
    do_snapshot
    mount_snapshot
    alter_snap_vmx
    register_vm
    exit 0

    remove-replica.sh:

    #!/bin/sh
    # This script will remove a snapshot machine
    # Written by Ez-Aton, http://www.tournament.org.il/run
    # Arguments: machine name

    # FUNCTIONS
    does_it_exist () {
        # Check if the described VM exists
        if [ "$(vmware-cmd -l | grep -c "$DIR/$TARGET/$SRC.vmx")" -eq "0" ]; then
            echo "No such VM"
            exit 1
        fi
        if [ ! -e "$VG/$TARGET" ]; then
            echo "There is no matching snapshot volume"
            exit 1
        fi
        # The fifth column of lvs output is the snapshot origin
        if [ "$(lvs "$VG/$TARGET" | awk '{print $5}' | grep -c "$SRC")" -eq "0" ]; then
            echo "This is not a snapshot, or a snapshot of the wrong LV"
            exit 1
        fi
    }

    ask_a_thousand_times () {
        # This function verifies that the right thing is actually done
        echo "You are about to remove a virtual machine and an LVM. Details:"
        echo "Machine name: $TARGET"
        echo "Logical Volume: $VG/$TARGET"
        printf "Are you sure? (y/N): "
        read RES
        if [ "$RES" != "Y" ] && [ "$RES" != "y" ]; then
            echo "Decided not to do it"
            exit 0
        fi
        echo ""
        echo "You have asked to remove this machine"
        printf "Again: Are you sure? (y/N): "
        read RES
        if [ "$RES" != "Y" ] && [ "$RES" != "y" ]; then
            echo "Decided not to do it"
            exit 0
        fi
        echo "Removing VM and snapshot"
    }

    shut_down_vm () {
        # Shut down the VM and unregister it
        vmware-cmd "$DIR/$TARGET/$SRC.vmx" stop hard
        vmware-cmd -s unregister "$DIR/$TARGET/$SRC.vmx"
    }

    remove_snapshot () {
        # Umount and remove the snapshot
        umount "$DIR/$TARGET"
        RET=$?
        if [ "$RET" -ne "0" ]; then
            echo "Cannot umount $DIR/$TARGET"
            exit 1
        fi
        lvremove -f "$VG/$TARGET"
        RET=$?
        if [ "$RET" -ne "0" ]; then
            echo "Cannot remove snapshot LV"
            exit 1
        fi
    }

    remove_dir () {
        # Removes the mount point
        rmdir "$DIR/$TARGET"
    }

    # MAIN
    if [ -z "$1" ]; then
        echo "No machine name. Exiting"
        exit 1
    fi

    # PARAMETERS:
    DIR=/vmware             # VMware default VMs location
    VG=/dev/VGVM3           # The name of the VG
    PREFIX=centos           # Prefix to the name. All these VMs will be called centos-$NAME
    TARGET="$PREFIX-$1"
    SRC=centos-base         # The name of the baseline image, LVM, etc. All are the same

    does_it_exist
    ask_a_thousand_times
    shut_down_vm
    remove_snapshot
    remove_dir

    exit 0

    Pros:

    1. Very fast provisioning. It takes almost five seconds, and that’s because my server is somewhat loaded.

    2. Dependable: KISS at its best.

    3. Conservative on space

    4. Conservative on I/O load (unlike the traditional use of LVM snapshot, as explained in the beginning of this section).

    Cons:

    1. Cannot merge the contents of a snapshot back into the main image (the LVM team will implement it in the future, I think)

    2. Cannot take a snapshot of a snapshot (same as above)

    3. If the COW space of any of the snapshots is full (viewable through the command ‘lvs’) then on boot, the source LV might not become active (confirmed RH4 bug, and this is the system I have used)

    4. My script does not edit/alter /etc/fstab (I decided that this would be rather risky, and it was not worth the effort at this time)

    5. My script does not check if there is enough available space in the VG. This is not required, as the script fails if the creation of the LV fails
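
Given the third point, it may be worth watching how full the snapshots are. A minimal sketch, assuming the lvs fields lv_name and snap_percent; full_snapshots is a hypothetical helper, and the 80 percent threshold in the usage line is only an example value.

```shell
#!/bin/sh
# full_snapshots LIMIT  (hypothetical helper)
# Reads "lv_name snap_percent" pairs on stdin and prints the names of
# snapshots whose COW space usage is at or above LIMIT percent.
# Lines without a percentage (non-snapshot LVs) are ignored.
full_snapshots() {
    awk -v limit="$1" 'NF >= 2 && $2 + 0 >= limit + 0 { print $1 }'
}

# Usage, with the VG name used in this post:
#   lvs --noheadings -o lv_name,snap_percent VGVM3 | full_snapshots 80
```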

    You are most welcome to contribute any further changes done to this script. Please maintain my URL in the script if you decide to use it.

    Thanks!

    iSCSI target/client for Linux in 5 whole minutes

    Tuesday, December 4th, 2007

    I was playing a bit with an iSCSI initiator (client) and decided to see how complicated it is to set up shared storage (for my purposes) through iSCSI. This proves to be quite easy…

    On the server:

    1. Download iSCSI Enterprise Target from here, or you can install scsi-target-utils from Centos5 repository

    2. Compile (if required) and install on your server. Notice – you will need kernel-devel packages

    3. Create a test Logical Volume:

    lvcreate -L 1G -n iscsi1 /dev/VolGroup00

    4. Edit your /etc/ietd.conf file to look something like this:

    Target iqn.2001-04.il.org.tournament:diskserv.disk1
    Lun 0 Path=/dev/VolGroup00/iscsi1,Type=fileio
    InitialR2T Yes
    ImmediateData No
    MaxRecvDataSegmentLength 8192
    MaxXmitDataSegmentLength 8192
    MaxBurstLength 262144
    FirstBurstLength 65536
    DefaultTime2Wait 2
    DefaultTime2Retain 20
    MaxOutstandingR2T 8
    DataPDUInOrder Yes
    DataSequenceInOrder Yes
    ErrorRecoveryLevel 0
    HeaderDigest CRC32C,None
    DataDigest CRC32C,None
    # various target parameters
    Wthreads 8

    5. Start iscsi-target service:

    /etc/init.d/iscsi-target start

    On the client:

    1. Install open-iscsi package. It will be called iscsi-initiator-utils for RHEL5 and Centos5

    2. Run detection command:

    iscsiadm -m discovery -t sendtargets -p <server IP address>

    3. You should get a nice reply, something like this (<IP> refers to the server’s IP):

    <IP>:3260,1 iqn.2001-04.il.org.tournament:diskserv.disk1

    4. Login to the devices using the following command:

    iscsiadm -m node -T iqn.2001-04.il.org.tournament:diskserv.disk1 -p <IP>:3260,1 -l

    5. Run fdisk to view your new disk

    fdisk -l

    6. To disconnect the iSCSI device, run the following command:

    iscsiadm -m node -T iqn.2001-04.il.org.tournament:diskserv.disk1 -p <IP>:3260,1 -u

    This does not set up the iSCSI initiator to connect at boot time. You will have to google your own distro and its nuts and bolts for that, but it gives you a proof of concept of a working iSCSI setup.
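
When discovery returns more than one target, the login commands can be derived from the discovery output. The sketch below only prints the commands instead of running them, so they can be reviewed first; login_cmds is a hypothetical helper, and it assumes each discovery line has the two-field form shown in step 3 (portal, then target name).

```shell
#!/bin/sh
# login_cmds  (hypothetical helper, dry run only)
# Reads discovery output ("<portal> <target>") on stdin and prints one
# iscsiadm login command per target, in the form used in step 4.
login_cmds() {
    while read -r portal target; do
        [ -n "$target" ] || continue      # skip blank or malformed lines
        echo "iscsiadm -m node -T $target -p $portal -l"
    done
}

# Usage:
#   iscsiadm -m discovery -t sendtargets -p <server IP address> | login_cmds
```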

    Good luck!

    Acquiring and exporting external disk software RAID and LVM

    Wednesday, August 22nd, 2007

    I had one of my computers die a short while ago. I wanted to get the data inside its disk into another computer.

    Using the magical and rather cheap USB2SATA I was able to connect the disk, however, the disk was part of a software mirror (md device) and had LVM on it. Gets a little complicated? Not really:

    (connect the device to the system)

    Now we need to query which device it is:

    dmesg

    It is quite easy. In my case it was /dev/sdk (don’t ask). It showed something like this:

    usb 1-6: new high speed USB device using address 2
    Initializing USB Mass Storage driver…
    scsi5 : SCSI emulation for USB Mass Storage devices
    Vendor: WDC WD80 Model: WD-WMAM92757594 Rev: 1C05
    Type: Direct-Access ANSI SCSI revision: 02
    SCSI device sdk: 156250000 512-byte hdwr sectors (80000 MB)
    sdk: assuming drive cache: write through
    SCSI device sdk: 156250000 512-byte hdwr sectors (80000 MB)
    sdk: assuming drive cache: write through
    sdk: sdk1 sdk2 sdk3
    Attached scsi disk sdk at scsi5, channel 0, id 0, lun 0

    This is good. The original system was RH4, so the standard structure is /boot on the first partition, swap and then one large md device containing LVM (at least – my standard).

    Let’s list the partitions, just to be sure:

    # fdisk -l /dev/sdk

    Disk /dev/sdk: 80.0 GB, 80000000000 bytes
    255 heads, 63 sectors/track, 9726 cylinders
    Units = cylinders of 16065 * 512 = 8225280 bytes

       Device Boot      Start         End      Blocks   Id  System
    /dev/sdk1   *           1          13      104391   fd  Linux raid autodetect
    /dev/sdk2              14         144     1052257+  82  Linux swap
    /dev/sdk3             145        9726    76967415   fd  Linux raid autodetect

    Good. As expected. Let’s activate the md device:

    # mdadm --assemble /dev/md2 /dev/sdk3
    mdadm: /dev/md2 has been started with 1 drive (out of 2).

    It’s going well. Now we have the md device active, and we can try to scan for LVM:

    # pvscan

    PV /dev/md2 VG SVNVG lvm2 [73.38 GB / 55.53 GB free]

    Next, we activate the VG. Notice the name, SVNVG (see the note at the bottom):

    # vgchange -a y /dev/SVNVG
    3 logical volume(s) in volume group "SVNVG" now active

    Now we can list the LVs and mount them on our desired location:

    # lvs
      LV       VG    Attr   LSize  Origin Snap%  Move Log Copy%
      LogVol00 SVNVG -wi-a-  2.94G
      LogVol01 SVNVG -wi-a-  4.91G
      VarVol   SVNVG -wi-a- 10.00G

    Mounting:

    mount /dev/SVNVG/VarVol /mnt/

    and it’s all ours.

    To remove the connected disk, we need to reverse the above process.

    First, we will umount the volume:

    umount /mnt

    Now we need to disable the Volume Group:

    # vgchange -a n /dev/SVNVG
    0 logical volume(s) in volume group "SVNVG" now active

    0 logical volumes active means we were able to disable the whole VG.

    Disable the MD device:

    # mdadm --manage -S /dev/md2

    Now we can disconnect the physical disk (actually, the USB) and continue with our lives.

    A note: RedHat systems name their volume group VolGroup00 by default. You cannot have two VGs with the same name! If you activate a VG which originated from a RH system and used the default name, and your current system uses the same default, you need to connect the disk to another system (non-RH would do fine) and change the VG name using vgrename before you can proceed.
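
Before connecting a disk from another RedHat system, it may be worth checking whether its likely VG name is already taken locally. A minimal sketch, assuming vgs is available on the current system; vg_name_in_use is a hypothetical helper that reads VG names (one per line) from standard input.

```shell
#!/bin/sh
# vg_name_in_use NAME  (hypothetical helper)
# Reads VG names on stdin; exits 0 if NAME is among them, 1 otherwise.
vg_name_in_use() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage:
#   if vgs --noheadings -o vg_name | vg_name_in_use VolGroup00; then
#       echo "VolGroup00 already exists here; rename the foreign VG first"
#   fi
```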