Tuesday, March 26, 2013

How to Cable an EMC VNXe for Failover

I really struggled with cabling a new VNXe for failover with vSphere. It's not difficult; there's just not a lot of documentation or explanation out there for it. The best practices docs and how-tos treat iSCSI like a third-class citizen and assume you'd really rather be using Fibre Channel or NFS. So here's how you really do it, with a little explanation about why. We'll get into the details of the vSphere configuration after covering the hardware end of things.

*Disclaimer: I am not an EMC expert, nor am I guaranteeing nothing horrible will happen to your equipment or vSphere environment. Use this guide at your own risk.

At its very core, this is how your network will look with a VNXe for failover. This assumes you have 2 physical NICs in the host to provide multiple physical hardware paths between the host(s) and the network switches. All cabling is single lines with no trunking.

Note that the VNXe has a single network line from each SP to each switch. This is the simplest way to cable. The SPs can handle NIC teaming using LACP trunks, but we're going to use vSphere's round robin scheme to get the same functionality. Requiring LACP trunks means relying on a higher-end switch, such as a Cisco Catalyst 3560 or an HP Procurve 2910al. Using vSphere's built-in capabilities you can get away with a less expensive switch, but be careful about your throughput and switching capacities! You can get burned skimping on switches.

I chose to have each NIC on the SPs set to a different IP address. In this case:

  • SP A eth2 = 192.168.50.10
  • SP A eth3 = 192.168.51.10
  • SP B eth2 = 192.168.50.20
  • SP B eth3 = 192.168.51.20
Be sure that when you cable the SP NICs you connect the correct subnets together! We'll end up with a NIC in the 192.168.50.xxx subnet and another NIC in the 192.168.51.xxx subnet on each host. Make sure the 50.xxx NICs are on the same switch and the 51.xxx NICs are on the other switch.
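Once the host-side VMkernel ports exist (we'll create them in a moment), a quick way to sanity-check the addressing and cabling is vmkping, which on newer ESXi builds accepts -I to force traffic out a specific VMkernel interface. The vmk names below are just examples from my layout, not requirements:

    # Ping each SP port out of the matching host interface
    # (vmk1 = host NIC on the .50 subnet, vmk2 = host NIC on the .51 subnet)
    vmkping -I vmk1 192.168.50.10    # SP A eth2
    vmkping -I vmk1 192.168.50.20    # SP B eth2
    vmkping -I vmk2 192.168.51.10    # SP A eth3
    vmkping -I vmk2 192.168.51.20    # SP B eth3

If any of these fail, recheck which switch each SP port and host NIC landed on before going any further.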


Moving on to the vSphere end of things. iSCSI port binding requires that each VMkernel port be tied to a single specific physical NIC. You can assign multiple physical ports to a vSwitch, but then each VMkernel needs its switch failover order overridden so that exactly one physical port is active for that kernel. This can get confusing, and since I don't have a whole lot of vSwitches in my environment I chose to create one iSCSI VMkernel per vSwitch. VMware advises that the number of vSwitches can impact performance, so be aware of how that may affect your environment.
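If you'd rather script this than click through the vSphere Client, roughly the same setup looks like this with esxcli on an ESXi 5.x host. The vSwitch names, port group names, vmnic numbers, and host IPs are assumptions from my example layout; substitute your own:

    # First iSCSI path: one vSwitch, one uplink, one VMkernel on the .50 subnet
    esxcli network vswitch standard add --vswitch-name=vSwitch1
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch1 --uplink-name=vmnic2
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch1 --portgroup-name=iSCSI-50
    esxcli network ip interface add --interface-name=vmk1 --portgroup-name=iSCSI-50
    esxcli network ip interface ipv4 set --interface-name=vmk1 --ipv4=192.168.50.101 --netmask=255.255.255.0 --type=static

    # Second iSCSI path: same thing on the .51 subnet
    esxcli network vswitch standard add --vswitch-name=vSwitch2
    esxcli network vswitch standard uplink add --vswitch-name=vSwitch2 --uplink-name=vmnic3
    esxcli network vswitch standard portgroup add --vswitch-name=vSwitch2 --portgroup-name=iSCSI-51
    esxcli network ip interface add --interface-name=vmk2 --portgroup-name=iSCSI-51
    esxcli network ip interface ipv4 set --interface-name=vmk2 --ipv4=192.168.51.101 --netmask=255.255.255.0 --type=static

With only one physical uplink per vSwitch there's nothing to override in the failover order, which is the whole appeal of the one-VMkernel-per-vSwitch approach.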



Now that we've created 2 vSwitches, each with its own iSCSI VMkernel, we can assign them to the iSCSI software initiator adapter. Configure the Storage Adapters on your ESXi host and view the properties for the iSCSI software adapter. Under the Network Configuration you will have the option to add ports to the initiator. Add both of the iSCSI VMkernels created earlier.
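The same port binding can be done from the ESXi command line. The adapter name vmhba33 below is just a common default for the software iSCSI initiator; verify yours first:

    # Find the software iSCSI adapter name (often vmhba33, but check)
    esxcli iscsi adapter list

    # Bind both iSCSI VMkernel ports to the software initiator
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk1
    esxcli iscsi networkportal add --adapter=vmhba33 --nic=vmk2

    # Confirm the bindings
    esxcli iscsi networkportal list --adapter=vmhba33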


Once you've configured either Dynamic or Static discovery for your LUNs and rescanned the adapter for devices, we can move on to configuring storage.
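For reference, Dynamic (Send Targets) discovery and the rescan can also be driven from esxcli. The SP addresses come from the list above, and vmhba33 is again an assumed adapter name:

    # Add each SP port as a Send Targets (dynamic discovery) entry
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.50.10:3260
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.51.10:3260
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.50.20:3260
    esxcli iscsi adapter discovery sendtarget add --adapter=vmhba33 --address=192.168.51.20:3260

    # Rescan the adapter so the new LUNs show up
    esxcli storage core adapter rescan --adapter=vmhba33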

Add your storage as you normally would and go to the new datastore's properties. Click on the Manage Paths button in the lower right.


If everything is cabled correctly and operating like it should, you should see two paths listed and a Path Selection policy set to Fixed. To set your paths to Active/Active, choose Round Robin. If you want Active/Standby, choose Fixed. Be sure to hit the Change button before closing the window so your Path Selection policy is saved.
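The path policy can also be checked and changed per device from the command line. The naa identifier below is a placeholder; pull the real one from the device list first:

    # List devices and note the naa.xxxxxxxx identifier for the VNXe LUN
    esxcli storage nmp device list

    # Switch that device from Fixed to Round Robin (Active/Active)
    esxcli storage nmp device set --device=naa.xxxxxxxx --psp=VMW_PSP_RR

    # Verify the policy stuck
    esxcli storage nmp device list --device=naa.xxxxxxxx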

Now that multipathing is set up you can move on to testing. You're doing this in a non-production environment, right? Remember that if you pull cables while testing and something isn't right, you'll lose datastore connectivity!

Go ahead and start pulling cables off the switch to see how the paths are affected. You should see that when you pull one cable from the primary SP for a LUN that the datastore stays online. If you pull the corresponding cable from the other SP you should see a dead link show up in the manage paths interface for storage configuration. Pull the second cable from the primary SP and you should still have connectivity.
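The path states are easy to watch from the command line while you're yanking cables; a failed path should show up as dead rather than active. The naa identifier is again a placeholder for your LUN:

    # Show every path to the VNXe LUN and its details
    esxcli storage nmp path list --device=naa.xxxxxxxx

    # Or a quick summary of path states across all devices
    esxcli storage core path list | grep State: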

It's worth mentioning that some folks online have said it can take up to 60-90 seconds for an SP to fail over to the backup SP. Make sure you take that into account while testing. I seem to have nearly immediate failover. Your mileage may vary.

Explanation:

The VNXe has a nifty design. There are internal communication links between the SPs. When they detect over the SAN network that an SP NIC is down, the network traffic for that particular port is redirected to the corresponding port on the other SP. For instance, if eth2 on SP A loses connectivity, the traffic is redirected to eth2 on SP B. The same goes for the other ports. This is why EMC tells you to cable each SP identically.

The only time you have a true failover scenario is when an SP actually goes down, say for a reboot after an update. Any NIC failures are handled by redirecting traffic.

Because of the massive amount of traffic generated, you really must NOT run SAN traffic over a network carrying other general traffic. Not only is that terrible design in a fairly epic fashion, you run the risk of cross-SP traffic flooding your network. I had that happen once when I cabled to a port that I forgot to put in the proper VLAN. Not fun.