Posts Tagged ‘iSCSI’

Multiple iSCSI interface on the same subnet

Friday, April 14th, 2017

As far as routing goes, it is a very bad idea to place multiple network interfaces with IPs of a single network (subnet). The routing table, which decides which interface to route the data through, reads the table line by line, thus – all traffic goes through a single interface of the batch (-> of the interfaces “living” on the same network). A common way of working around it is by using network teaming (Linux = bonding) to handle round-robin or active/passive of multiple interfaces on a single network, and that means that the interfaces are “bonded” into a single logical interface, which, then, has no routing issues whatsoever.

Having multiple network interfaces with IPs of the same network does not help with throughput, but does it increase redundancy? The simple answer is “no” – It does not. In Linux, when an interface is disconnected, it does not go down automatically (unlike Windows, for example), meaning that the routing table does not get updated with remaining interfaces. The traffic would still be targeted at that “dead” interface. Moreover – even if Linux did do that, this is a solution for a single type of a problem. What if someone changed the switch to have a different network VLAN on that particular (one of two or more) port? Link remains up, but communication doesn’t get anywhere.

These problems are more noticeable when the single network is an iSCSI network, and the system needs both multiple path access to the iSCSI storage, and redundancy becomes an issue, because iSCSI network is commonly – a critical network.

It is common to connect multiple iSCSI networks and not multiple ports on a single network, maintaining fabric-like separation of networks, and devices, where possible. However – this article will point at what to do if the network layout is not under your control and you need to place multiple network interfaces on the same physical network.

As mentioned before, the routing table will make sure that all your communication to that particular network go by the first routed network interface. However – the iSCSI daemon (open-iscsi) has the capability to bind to specific interfaces. To do so, you need to define an interface to iSCSI. Do so by running the command:

iscsiadm -m iface -I NIC_NAME –op=new

The name is user defined. I recommend using the same names used by the OS, like ‘eth5’ or ‘ib0’ or ‘eno1’ – to keep it simple. This is a declarative definition only. To actually bind the ‘iface’ to the real interface device, you need to run the following command:

iscsiadm -m iface -I NIC_NAME  –op=update -n iface.hwaddress -v 00:AA:BB:CC:DD:EE

Use the real interface MAC address. Then you can discover and login to targets with the defined interfaces only:

iscsiadm -m discovery -t st -p TARGET_IP -I NIC1 -I NIC2

iscsiadm -m node -L all -I NIC1 -I NIC2

According to the documentations, this will require setting the sysctl net.ipv4.conf.default.rp_filter to 0.

This should do the trick, so now multipathing should show the right amount of paths.

iSCSI persistent configurations agains us all

Thursday, November 19th, 2009

Using iSCSI with dm-multipath is rather common setup. With iSCSI running over Ethernet cables, which are too easy to disconnect (either on purpose or by mistake), being cheap and common technology – multipath becomes a must. If you have multiple network links, this is only expected that you use multipath for your iSCSI configuration. It’s cheap, it’s easy and it works.

This, however, comes with a price tag. Not money – the components are cheap and common, but there are configuration acts which should take place.

It is easy to find info either in the open-iscsi documentations, the Internet, whatever, and I will go over them just below, but there are some catches which one should be aware of.

Per the common documentation, unlike regular iSCSI communication, when dealing with multipath, you would like iSCSI to fail rather quickly and let the SCSI layer handle the errors, thus letting dm-multipath handle the errors and do its work.

The configuration directives are rather simple. In the iscsid.conf file (on RHEL5 located in /etc/iscsi/iscsid.conf ), you need to change the value

node.session.timeo.replacement_timeout =

To a very short period of time. By default, it is set to 120 seconds, which are two minutes before anyone will notify the SCSI subsystem of any disk IO errors. A good value would be 5 seconds, which should allow for very short network disconnection (which could happen) and still – let the dm-multipath manage errors fast enough so that applications would not fail on disk timeouts.

Another two parameters which should be defined are the following (read the comments above them in the config file):

node.conn[0].timeo.noop_out_interval =
node.conn[0].timeo.noop_out_timeout =

These values control the interval in which the iSCSI layer tests communication to the targets.

Also, in multipath.conf you will need to set the following feature, so that IOPs will not be lost:

features		"1 queue_if_no_path"

These configuration directives can be found in these two pages from RedHat:



This is nice and pretty. However, if you have failed to do so at start, and defined your iSCSI targets based on the default configurations, you will notice that it still takes very long for iSCSI to notify the SCSI subsystem of the errors. You could check the values used by iSCSI through running the command:

iscsiadm -m node -T <target name>

Check out especially the line called node.session.timeo.replacement_timeout. Its value is the one which decides the actual behavior of iSCSI.

To change it, there are several methods. One of them is to clean up the iSCSI persistent configurations, located in /var/lib/iscsi and then re-login to the iSCSI targets. Only then you will have the new target configuration.

Check again with iscsiadm as described above, and check that this value matches.

Oracle RAC with EMC iSCSI Storage Panics

Tuesday, October 14th, 2008

I have had a system panicking when running the mentioned below configuration:

  • RedHat RHEL 4 Update 6 (4.6) 64bit (x86_64)
  • Dell PowerEdge servers
  • Oracle RAC 11g with Clusterware 11g
  • EMC iSCSI storage
  • EMC PowerPate
  • Vote and Registry LUNs are accessible as raw devices
  • Data files are accessible through ASM with libASM

During reboots or shutdowns, the system used to panic almost before the actual power cycle. Unfortunately, I do not have a screen capture of the panic…

Tracing the problem, it seems that iSCSI, PowerIscsi (EMC PowerPath for iSCSI) and networking services are being brought down before “killall” service stops the CRS.

The service file was never to be executed with a “stop” flag by the start-stop of services, as it never left a lock file (for example, in /var/lock/subsys), and thus, its existence in /etc/rc.d/rc6.d and /etc/rc.d/rc0.d is merely a fake.

I have solved it by changing /etc/init.d/ script a bit:

  • On “Start” action, touch a file called /var/lock/subsys/
  • On “Stop” action, remove a file called /var/lock/subsys/

Also, although I’m not sure about its necessity, I have changed script SYSV execution order in /etc/rc.d/rc0.d and /etc/rc.d/rc6.d from wherever it was (K96 in one case and K76 on another) to K01, so it would be executed with the “stop” parameter early during shutdown or reboot cycle.

It solved the problem, although future upgrades to Oracle ClusterWare will require being aware of this change.

iSCSI target/client for Linux in 5 whole minutes

Tuesday, December 4th, 2007

I was playing a bit with iSCSI initiator (client) and decided to see how complicated it is to setup a shared storage (for my purposes) through iSCSI. This proves to be quite easy…

On the server:

1. Download iSCSI Enterprise Target from here, or you can install scsi-target-utils from Centos5 repository

2. Compile (if required) and install on your server. Notice – you will need kernel-devel packages

3. Create a test Logical Volume:

lvcreate -L 1G -n iscsi1 /dev/VolGroup00

4. Edit your /etc/ietd.conf file to look something like this:

Lun 0 Path=/dev/VolGroup00/iscsi1,Type=fileio
InitialR2T Yes
ImmediateData No
MaxRecvDataSegmentLength 8192
MaxXmitDataSegmentLength 8192
MaxBurstLength 262144
FirstBurstLength 65536
DefaultTime2Wait 2
DefaultTime2Retain 20
MaxOutstandingR2T 8
DataPDUInOrder Yes
DataSequenceInOrder Yes
ErrorRecoveryLevel 0
HeaderDigest CRC32C,None
DataDigest CRC32C,None
# various target parameters
Wthreads 8

5. Start iscsi-target service:

/etc/init.d/iscsi-target start

On the client:

1. Install open-iscsi package. It will be called iscsi-initiator-utils for RHEL5 and Centos5

2. Run detection command:

iscsiadm -m discovery -t sendtargets -p <server IP address>

3. You should get a nice reply. Something like this. <IP> refers to the server’s IP


4. Login to the devices using the following command:

iscsiadm -m node -T -p <IP>:3260,1 -l

5. Run fdisk to view your new disk

fdisk -l

6. To disconnect the iSCSI device, run the following command:

iscsiadm -m node -T -p <IP>:3260,1 -u

This will not allow you to set the iSCSI initiator during boot time. You will have to google your own distro and its bolts and nuts, but this will allow you a proof of concept of a working iSCSI

Good luck!

Ontap Simulator, and some insights about NetApp

Tuesday, May 9th, 2006

First and foremost – the Ontap simulator, a great tool which surely can assist in learning NetApp interface and utilization, lacks in performance. It has some built-in limitations – No FCP, no disks (virtual disks) larger than 1GB (per my trial-and-error. I might find out I was wrong somehow, and put in on this website), and low performance. I’ve got about 300KB/s transfer rate both on iSCSI and on NFS. To make sure it was not due to some network hog hiding somewhere on my net(s), I’ve even tried it from the host of the simulator itself, but to no avail. Low performance. Don’t try to use it as your own home iSCSI Target. Better just use Linux for this purpose, with the drivers obtained from here (It’s one of my next steps into “shared storage(s) for all”).

Another issue – After much reading through NetApp documentation, I’ve reached the following concepts of the product. Please correct me if you see fit:

The older method was to create a volume (vol create) directly from disks. Either using raid_dp or raid4.

The current method is to create aggregations (aggr create) from disks. Each aggregate consists of raid groups. A raid group (rg) can be made up of up to eight physical disks. Each group of disks (an rg) has one or two parity disks, depending on the type of raid (raid 4 uses one parity, and raid_dp uses “double parity”, as its name can suggest).

Actually, I can assume that each aggregation is formatted using the WAFL filesystem, which leads to the conclusion that modern (flex) volumes are logical “chunks” of this whole WAFL layout. In the past, each volume was a separated WAFL formatted unit, and each size change required adding disks.

This separation of the flex volume from the aggregation suggests to me the possibility of multiple-root capable WAFL. It can explain the lack of requirement for a continuous space on the aggregation. This eases the space management, and allows for fast and easy “cloning” of volumes.

I believe that the new “clone” method is based on the WAFL built-in snapshot capabilities. Although WAFL Snapshots are supposed to be space conservatives, they require a guaranteed space on the aggregation prior to committing the clone itself. If the aggregation is too crowded, they will fail with the error message “not enough space”. If there is enough for snapshots, but not enough to guarantee a full clone, you’ll get a message saying “space not guaranteed”.

I see the flex volumes as some combination between filesystem (WAFL) and LVM, living together on the same level.

LUNs on NetApp: iSCSI and/or Fibre LUNs are actually managed as a single (per-LUN) large file contained within a volume. This file has special permissions (I was not able to copy it or modify it while it was online and I had root permissions. However, I am rather new to NetApp technology), and it is being exported as a disk outside. Much like an ISO image (which is a large file containing a whole filesystem layout) these files contain a whole disk layout, including partition tables, LVM headers, etc – just like a real disk.

Thinking about it, it’s neither impossible nor very surprising. A disk is no more than a container of data, of blocks, and if you can utilize the required communication protocol used for accessing it and managing its blocks (aka, the transport layer on which filesystem can access the block data), you can, with just a little translation interface, set up a virtual disk which will behave just like any regular disk.

This brings us to the advantages of NetApp’s WAFL – the ability to minimize I/O while maintaining a set of snapshots for the system – a list of per-block modification history. It means you can “snapshot” your LUN, being physically no more than a file on a WAFL-based volume, and you can go back with your data to a previous date – an hour, a day, a week. Time travel for your data.

There are, unfortunately, some major side effects. If you’ve read the WAFL description from NetApp, my summary will be inaccurate at best. If you haven’t, it will be enough, but still you are most encouraged to read it. The idea is that this filesystem is made out of multi-layers of pointers, and of blocks. A pointer can point to more than one block. When you commit a snapshot, you do not change the pointers, you do not move data, you just modify the set of pointers. When there is any change in the data (meaning a block is changed), the pointer points to the alternate block instead of the previous (historical) block, but keeps reference of the older block’s location. This way, only modified blocks are actually recreated, while any unmodified data remains on the same spot on the physical disk. An additional claim of NetApp is that their WAFL is optimized for the raid4 and raid_dp they use, and utilizes it in a smart manner.

The problem with WAFL, as can be easily seen, is fragmentation. For CIFS and NFS, it does not cause much of a problem, as the system is very capable of read-ahead just to solve this issue. However, A LUN (which is supposed to act as a continuous layout, just like any hard-drive or raid-array in the world and on which various file-system related operations occur) gets fragmented.

Unlike CIFS or NFS, LUN read-ahead is harder to predict, as the client tries to do just the same. Unlike real disks, NetApp LUNs do not behave, performance-wise, like the hard-drive layout any DB or FS has learned to expect and was best optimized for. It means, for my example, that on a DB with lots of small changes, that the DB itself would have tried to commit changes in large write operations, committed every so and so interval, and would thrive to commit them as close to each other, as continuous as possible. On NetApp LUN this will cause fragmentation, and will result in lower write (and later read) performance.

That’s all for today.