Hopelessly Over-provisioned

Tuesday, July 16, 2013

Hidden gem in vCOps Foundation or "How much can you really downsize?" - Part 1

Setting the stage

I've been working intensely with an enterprise-licensed vCenter Operations Manager lately and find it a powerful tool to analyse, monitor and optimize vSphere landscapes (and possibly others; I have not had the opportunity to work with the AWS adapter, for instance). On the side I play around with a few Foundation-level instances at various customers.

Recently I stumbled across a very annoying issue with the custom interface that I will just briefly outline here, without going into it too deeply just yet. vCOps allows you to create custom tags and assign them to resources. Google it, it's well worth using this feature as it effectively groups resources. So instead of creating a dashboard and filtering for a whole bunch of resources, you end up filtering for your custom resource tag only. Whenever you refresh the dashboard, vCOps will then pull the metrics from the tagged resources. Tag another resource (say you add a datastore to your environment and tag it using your custom tag) and your dashboard will automatically display the newly added resource and its metrics.

The problem is that it only works so well with the heatmap widget, but that's not the topic of the post. Should you want info on this, feel free to reach out via the comments below or Twitter @Str0hhut.

In a recent discussion on the issue with my friend Iwan (@e1_ang) he pointed out that the vApp version of vCOps allows custom grouping of resources while the Windows version does not. That got me thinking about what else I might have missed in the vApp version. So I started comparing the user interfaces. In the vApp version you can indeed create a group by clicking the configuration link in the top right corner and then "Manage Group Types". However at the moment it is beyond me to figure out how to assign resources to my newly created group. Info on this is also appreciated via comments or Twitter.

The hidden gem

Back to topic. While scouting out the menu of the vApp I discovered an interesting link that I highly recommend all aspiring downsizers and everyone else interested to click on.

What follows is a dashboard that is remarkably similar to the custom dashboards of a licensed vCOps edition. And aside from what vCOps says about itself in the regular dashboard, this one will give you a lot more detailed information about what's happening. I have downsized my deployment by quite a bit (the UI VM has been capped to 3 GB RAM and 1 vCPU, the Analytics VM to 5 GB and 1 vCPU as well), but have been thinking it has been running quite well so far. Sure, health of the Analytics VM is at 51 and there is a memory constraint, but other than that it feels quite snappy, graphs are all up and running, data is plentiful and (almost) everything works as expected. Well, almost. I've been suspecting that the restricted resources might be the reason why the "Normal" calculation is not working for every resource. The newly found dashboard confirms that at a rather detailed level.

As you can see from the screenshot, both my collection and analytics tier appear to be in bad shape, whereas my presentation tier is happily buzzing along. Furthermore, if you drill into the tiers you'll get an in-depth view of which services are affected.

This is the view you get when you drill into the tree as follows:

double click Collection
double click vCenter Operations Collector (the bottom most icon of the resulting tree)
double click vCenter Operations Adapter (top right corner of the resulting tree)

I then selected the OS resource from the health tree and expanded the Memory Usage folder in the metric selector to find some interesting metrics.

Interestingly enough it's not the Analytics VM that is swapping like crazy, but the UI VM.

So I started asking myself how I could find out which of the two VMs is really undersized. It's obvious that the Analytics VM needs more power (or is it? Read on!), as the analytics tier is on red alert. Going back to the default dashboard, both VMs report that they are memory constrained. Both VMs are demanding 100% of their configured memory resources, however both are using only about 50% of what they have. Best guess at this point is a memory constraint on the host vCOps is running on. Both vCenter client and vCOps provide plenty of evidence that this is the case. This does not come as a surprise: yes, the host is very much constrained, and that is the reason why I sized the vCOps VMs down to begin with.

I'm still curious about why the UI VM is swapping so much when the Analytics VM does not seem to be swapping at all. Better yet, when I logged into each VM to see what the Linux kernel has to say about it, I found very different suggestions:

Analytics VM

localhost:~ # free -m
             total       used       free     shared    buffers     cached
Mem:          4974       4937         37          0         12        789
-/+ buffers/cache:       4134        839
Swap:         4102          0       4102

UI VM:

vcops:~ # free -m
             total       used       free     shared    buffers     cached
Mem:          3018       2999         19          0          0        391
-/+ buffers/cache:       2607        411
Swap:         4102       1264       2838

Wait, what? So the Analytics VM is not swapping; we knew that already. In fact, the kernel of the Analytics VM somehow even managed to allocate a small amount of memory as an I/O buffer! It seems to me that, despite what vCOps thinks of itself, from an OS point of view the Analytics VM is sized just about right.
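To compare the two `free -m` outputs above with numbers rather than by eye, here is a minimal sketch in Python. It assumes the older procps layout shown above (with the "-/+ buffers/cache" line); the function name `parse_free` is made up for illustration.

```python
def parse_free(output: str) -> dict:
    """Parse `free -m` output (old procps layout) into a dict of MB values."""
    stats = {}
    for line in output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Mem:":
            stats["mem_total"] = int(fields[1])
        elif fields[0] == "-/+":
            # "-/+ buffers/cache: <used> <free>" leaves out buffers and page cache
            stats["app_used"] = int(fields[2])
            stats["app_free"] = int(fields[3])
        elif fields[0] == "Swap:":
            stats["swap_total"] = int(fields[1])
            stats["swap_used"] = int(fields[2])
    return stats


ANALYTICS = """\
             total       used       free     shared    buffers     cached
Mem:          4974       4937         37          0         12        789
-/+ buffers/cache:       4134        839
Swap:         4102          0       4102
"""

UI = """\
             total       used       free     shared    buffers     cached
Mem:          3018       2999         19          0          0        391
-/+ buffers/cache:       2607        411
Swap:         4102       1264       2838
"""

for name, text in (("Analytics", ANALYTICS), ("UI", UI)):
    s = parse_free(text)
    app_pct = 100 * s["app_used"] / s["mem_total"]
    swap_pct = 100 * s["swap_used"] / s["swap_total"]
    print(f"{name}: {app_pct:.0f}% of RAM held by processes, "
          f"{swap_pct:.0f}% of swap in use")
```

With the values above this reports roughly 83% of RAM held by processes and no swap in use on the Analytics VM, against roughly 86% and 31% of swap in use on the UI VM.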

Summary

This is no summary of what is happening in this particular environment or of how well the vCOps vApp handles downsizing just yet. I have not gathered nearly enough information to fully analyze the situation and draw conclusions; my brain is buzzing with ideas and paths to follow. For now I have set memory reservations for both VMs to force the host to provide each with its entitlement. I will have to think this over and investigate some more in the days to come, as well as observe how the memory reservations change the picture. Stay tuned, this may get interesting.

To be continued...

Friday, June 14, 2013

Multi-NIC-vMotion (not so) deep dive

This morning when I opened my mailbox I found a message regarding our Multi-NIC-vMotion setup failing. Despite what KB2007467 says, there is another way of doing Multi-NIC-vMotion, which I will go into in this article. But what I find most interesting is the methodology vSphere applies when choosing the vMotion interface to migrate a VM.

Multi-NIC-vMotion on a single vSwitch

The above-referenced KB article describes a Multi-NIC-vMotion setup on a single vSwitch with multiple uplinks. When you create a VMkernel port for vMotion, you need to override the default NIC teaming order of the vSwitch, as vMotion VMkernel ports can only utilize one physical uplink at a time (the same applies to VMkernel ports used for software iSCSI). Thus for a vSwitch with two uplinks you need to create two VMkernel ports with vMotion activated, where each VMkernel port uses one of the uplinks as active and the other as unused (not standby).

Alternative Multi-NIC-vMotion setup using multiple switches

In our environment we use multiple virtual switches to separate traffic. There is a vSwitch with two uplinks for customer traffic to the VMs, there are three vSwitches for admin, NAS and backup access to the VMs and there is a dedicated vSwitch for vMotion traffic. The dedicated switch sports two uplinks and has been configured with two VMKernel interfaces as described in KB2007467. It has recently been migrated to dvSwitch without any problems.

The other virtual switches, with the exception of the customer vSwitch, are heavily underutilized. Thus it seemed only logical to create VMkernel ports on each of those switches for vMotion usage. The ESX hosts have 1 TB of RAM each, but unfortunately are equipped with 1 Gbit/s NICs only. Putting one of those monster hosts into maintenance mode is a lengthy process.
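To put "lengthy" into numbers, here is a rough back-of-the-envelope sketch. It assumes the worst case: the full 1 TB (taken as 1024 GiB) is allocated to VMs and has to be copied exactly once, the links run at full line rate, and dirty pages never need re-sending. Real evacuations will differ, so treat the result as a lower bound for that worst case only.

```python
def evacuation_hours(ram_gib: float, nics: int, gbit_per_nic: float = 1.0) -> float:
    """Lower bound, in hours, for copying ram_gib of VM memory over `nics` links."""
    bytes_to_move = ram_gib * 1024**3
    # 1 Gbit/s is 10^9 bits per second, i.e. 125 MB/s per link at line rate
    bytes_per_second = nics * gbit_per_nic * 1e9 / 8
    return bytes_to_move / bytes_per_second / 3600


for nics in (1, 2, 5):
    print(f"{nics} NIC(s): at least {evacuation_hours(1024, nics):.1f} h")
```

Under these assumptions a single 1 Gbit/s link needs well over two hours, which is why spreading vMotion over every underutilized uplink is attractive.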

On the physical switch side we are using a dedicated VLAN for vMotion (and PXE boot for that matter).

Choosing the right interface for the job

After a lot of tinkering and testing, we came up with the following networking scheme to facilitate vMotion properly. Our vMotion IPs are spread over multiple network ranges as follows:

ESX01 - Last octet of management IP: 51

vmk1 - 172.16.1.51/24
vmk2 - 172.16.2.51/24
vmk3 - 172.16.3.51/24

ESX02 - Last octet of management IP: 52

vmk1 - 172.16.1.52/24
vmk2 - 172.16.2.52/24
vmk3 - 172.16.3.52/24

and so on. The network segments are not routed, as per VMware's suggestion. Thus the individual VMKernel ports of a single host cannot communicate with each other, there will be no "crossing over" into other network segments.
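The addressing scheme above can be sketched in a few lines of Python. This is only an illustration of the convention (last octet of the management IP reused in every vMotion segment); the function names are made up, and the three /24 segments are the ones listed above.

```python
import ipaddress


def vmotion_plan(mgmt_last_octet: int, segments=(1, 2, 3)) -> dict:
    """Map vmk1..vmkN to 172.16.<segment>.<last octet>/24, as in the scheme above."""
    plan = {}
    for vmk_index, third_octet in enumerate(segments, start=1):
        plan[f"vmk{vmk_index}"] = ipaddress.ip_interface(
            f"172.16.{third_octet}.{mgmt_last_octet}/24"
        )
    return plan


def reachable_pairs(src: dict, dst: dict) -> list:
    """Pairs of (source vmk, destination vmk) that share a subnet, so need no routing."""
    return [
        (s_name, d_name)
        for s_name, s_if in src.items()
        for d_name, d_if in dst.items()
        if d_if.ip in s_if.network
    ]


esx01 = vmotion_plan(51)
esx02 = vmotion_plan(52)
print(reachable_pairs(esx01, esx02))
```

Because the segments are not routed, only matching ports end up paired (vmk1 with vmk1, vmk2 with vmk2, vmk3 with vmk3).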

When we vMotion a VM from host ESX01 to ESX02, the host initiates network connections based on its routing table: ESX01.vmk1 to ESX02.vmk1, ESX01.vmk2 to ESX02.vmk2 and so on. Only if vMotion is for some reason not enabled on one of the VMkernel ports will the host try to connect to a different port, and only then will the vMotion fail.

The reason for splitting the network segments into class C sized (/24) ranges is simple: the physical layer is split into separate islands which do not interconnect. For this specific network segment, a /22 netmask would do fine and all vMotion VMkernel ports could happily talk to each other. However, since the "frontend" and "backend" edge switches are not connected, this cannot be facilitated.

When we check the logs (/var/log/vmkernel.log), we can see however that all vMotion ports are being used:

2013-06-14T05:30:23.991Z cpu7:8854653)Migrate: vm 8854654: 3234: Setting VMOTION info: Source ts = 1371189392257031, src ip = <172.16.2.114> dest ip = <172.16.2.115> Dest wid = 8543444 using SHARED swap
2013-06-14T05:30:23.994Z cpu17:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.2.114'
2013-06-14T05:30:23.995Z cpu7:8854653)Tcpip_Vmk: 1059: Affinitizing 172.16.2.114 to world 8854886, Success
2013-06-14T05:30:23.995Z cpu7:8854653)VMotion: 2425: 1371189392257031 S: Set ip address '172.16.2.114' worldlet affinity to send World ID 8854886
2013-06-14T05:30:23.996Z cpu13:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.2.114'
2013-06-14T05:30:23.996Z cpu14:8910)MigrateNet: vm 8910: 1998: Accepted connection from <172.16.2.115>
2013-06-14T05:30:23.996Z cpu14:8910)MigrateNet: vm 8910: 2068: dataSocket 0x410045b36c50 receive buffer size is 563272
2013-06-14T05:30:23.996Z cpu13:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 1 added.
2013-06-14T05:30:23.996Z cpu13:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.3.114'
2013-06-14T05:30:23.996Z cpu14:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 2 added.
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.4.114'
2013-06-14T05:30:23.996Z cpu14:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 3 added.
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.1.114'
2013-06-14T05:30:23.997Z cpu12:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 4 added.
2013-06-14T05:30:23.997Z cpu12:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.0.114'
2013-06-14T05:30:23.997Z cpu12:8854886)VMotionUtil: 3087: 1371189392257031 S: Stream connection 5 added.
2013-06-14T05:30:38.505Z cpu19:8854654)VMotion: 3878: 1371189392257031 S: Stopping pre-copy: only 43720 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~571.132 MB/s, 13975% t2d)

This makes for some impressive bandwidth (~571.132 MB/s) on measly 1 Gbit/s NICs. The reason why our platform was having issues was a few ports where vMotion was disabled. Thankfully a few lines of PowerCLI code solved that problem:

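# Hosts are addressed by IP: 192.168.100.107 through 192.168.100.118
$IPmask = "192.168.100."
$vmks = @()

for ($i = 107; $i -le 118; $i++) {
   for ($j = 1; $j -le 6; $j++) {
       # vmk5 is deliberately left out
       if ($j -eq 5) { continue }
       $vmk = Get-VMHost -Name "$IPmask$i" | Get-VMHostNetworkAdapter -Name "vmk$j"
       # Collect every VMkernel port that has vMotion switched off
       if ($vmk.VMotionEnabled -eq $false) { $vmks += $vmk }
   }
}

# Enable vMotion on the collected ports
foreach ($vmk in $vmks) {
   $vmk | Set-VMHostNetworkAdapter -VMotionEnabled $true
}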
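To check from the vmkernel.log excerpt above how many VMkernel ports actually took part in a migration, a small Python sketch like the following could help. The two regular expressions are derived only from the log lines quoted above, so other ESXi builds may word these messages differently; the `summarize` helper is hypothetical.

```python
import re

# Patterns taken from the log lines quoted above
BOUND = re.compile(r"Successfully bound connection to vmknic '([\d.]+)'")
BANDWIDTH = re.compile(r"network bandwidth ~([\d.]+) MB/s")


def summarize(log_text: str):
    """Return the distinct vmknic IPs used and the reported bandwidth in MB/s."""
    vmknics = sorted(set(BOUND.findall(log_text)))
    match = BANDWIDTH.search(log_text)
    bandwidth = float(match.group(1)) if match else None
    return vmknics, bandwidth


SAMPLE = """\
2013-06-14T05:30:23.994Z cpu17:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.2.114'
2013-06-14T05:30:23.996Z cpu13:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.3.114'
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.4.114'
2013-06-14T05:30:23.996Z cpu14:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.1.114'
2013-06-14T05:30:23.997Z cpu12:8854886)MigrateNet: 1174: 1371189392257031 S: Successfully bound connection to vmknic '172.16.0.114'
2013-06-14T05:30:38.505Z cpu19:8854654)VMotion: 3878: 1371189392257031 S: Stopping pre-copy: only 43720 pages left to send, which can be sent within the switchover time goal of 0.500 seconds (network bandwidth ~571.132 MB/s, 13975% t2d)
"""

vmknics, bandwidth = summarize(SAMPLE)
print(f"{len(vmknics)} vmknics used: {', '.join(vmknics)}")
print(f"{bandwidth / len(vmknics):.1f} MB/s per vmknic on average")
```

For the excerpt above this finds five distinct vmknics and an average of about 114 MB/s per port, close to the roughly 125 MB/s a 1 Gbit/s link can carry at line rate.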

Monday, June 10, 2013

One more thing I don't like about Debian Wheezy

I'm running a Wheezy iSCSI target that has already caused some headaches. Today I wanted to add two VMDKs to the Wheezy VM to be able to provide more storage to my test cluster. In the past that was easy. Just add your disks, log in to the Linux box and issue

echo "scsi add-single-device a b c d" > /proc/scsi/scsi
(Usage: http://www.tldp.org/HOWTO/archived/SCSI-Programming-HOWTO/SCSI-Programming-HOWTO-4.html)

However, with Wheezy there is no /proc/scsi/scsi. The reason is that it has been disabled in the kernel config.

root@debian:/proc# grep SCSI_PROC /boot/config-3.2.0-4-amd64
# CONFIG_SCSI_PROC_FS is not set

Wtf?! (pardon my French!)

The solution, however, is quite simple, and annoying in itself as well. All you need to do is install the scsitools package. Thankfully, the list of dependencies on a relatively fresh Debian installation (I installed VMware tools, thus it has the kernel headers, gcc, make, perl and iscsitarget incl. modules) is quite short...

fontconfig-config{a}
libdrm-intel1{a}
libdrm-nouveau1a{a}
libdrm-radeon1{a}
libdrm2{a}
libffi5{a}
libfontconfig1{a}
libfontenc1{a}
libgl1-mesa-dri{a}
libgl1-mesa-glx{a}
libglapi-mesa{a}
libice6{a}
libpciaccess0{a}
libsgutils2-2{a}
libsm6{a}
libutempter0{a}
libx11-xcb1{a}
libxaw7{a}
libxcb-glx0{a}
libxcb-shape0{a}
libxcomposite1{a}
libxdamage1{a}
libxfixes3{a}
libxft2{a}
libxi6{a}
libxinerama1{a}
libxmu6{a}
libxpm4{a}
libxrandr2{a}
libxrender1{a}
libxt6{a}
libxtst6{a}
libxv1{a}
libxxf86dga1{a}
libxxf86vm1{a}
scsitools
sg3-utils{a}
tcl8.4{a}
tk8.4{a}
ttf-dejavu-core{a}
x11-common{a}
x11-utils{a}
xbitmaps{a}
xterm{a}

That's all it takes for you to run "rescan-scsi-bus", which will discover your disks. That was easy, wasn't it?
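As an aside, and not something I used above: even without /proc/scsi/scsi, Linux kernels expose a `scan` file per SCSI host under /sys/class/scsi_host, and writing the wildcard "- - -" (channel, target, LUN) to it asks that host to rescan. A minimal sketch in Python, assuming it is run as root and that sysfs is mounted in the usual place; the function name is made up:

```python
import glob
import os


def rescan_scsi_hosts(sysfs_root: str = "/sys/class/scsi_host") -> list:
    """Write the wildcard "- - -" to every host's scan file and return the paths."""
    scanned = []
    for scan_file in sorted(glob.glob(os.path.join(sysfs_root, "host*", "scan"))):
        with open(scan_file, "w") as handle:
            # "- - -" means: all channels, all targets, all LUNs
            handle.write("- - -\n")
        scanned.append(scan_file)
    return scanned
```

Calling `rescan_scsi_hosts()` as root should make newly added disks show up without pulling in the X11 stack; the `sysfs_root` parameter only exists so the function can be tried against a scratch directory first.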

Friday, June 7, 2013

Access denied. Your IP address [A.B.C.D] is blacklisted. - OpenVPN to the rescue!

Ok, so some of your ISP's fellow customers got their boxes infected and are now part of a botnet (in this specific case apparently the name of the trojan is "Pushdo", "Pushdo is usually associated with the Cutwail spam trojan, as part of a Zeus or Spyeye botnet." src.: http://cbl.abuseat.org). "Doesn't bother me" you may think. "I got all my gear secured" you may think.

Well, that's where you're wrong.

It does bother you!

Upon my morning round of blogs I realized I couldn't access http://longwhiteclouds.com/ any more. Instead I was being greeted with this friendly message:

Access denied. Your IP address [A.B.C.D] is blacklisted. If you feel this is in error please contact your hosting providers abuse department.

This is just one effect. I have been having a seriously choppy internet experience for the past two or three days that I'd like to throw into the pot of symptoms I am seeing.

A bit of research quickly revealed what was going on. As a part-time mail server admin for my company I know that we use spamhaus.org (among other services and mechanisms) for spam checking. A check in the Blocklist Removal Center provided information about the source and reason for the blockage. Just enter the IP in question and click on Lookup. I find myself both in the Policy Based Blocklist and in the Composite Blocking List, and possibly elsewhere, too.

Suggestions

Well, firstly, let's be sociable and inform our ISP. They may know already and be working on the case, or not.

But that doesn't help me right now! I wanna read blogs now!

OpenVPN to the rescue

Luckily I have access to a corporate OpenVPN-based network. Unlike other solutions, this network does not per se route all traffic but just provides access to the corporate network. However, in this case I wish to do just that.

If all I am worried about is longwhiteclouds.com, I can just set a static route to the tun interface IP like so:

user@box> ip r | grep tun0
192.168.1.0/24 via 172.16.5.17 dev tun0
192.168.5.0/24 via 172.16.5.17 dev tun0
172.16.5.17 dev tun0 proto kernel scope link src 172.16.5.18
192.168.7.0/24 via 172.16.5.17 dev tun0
user@box> ifconfig tun0 | grep inet
          inet addr:172.16.5.18 P-t-P:172.16.5.17 Mask:255.255.255.255
user@box> sudo route add -host longwhiteclouds.com gw 172.16.5.18

But how do you route everything through the tunnel? Firstly you need to set a static route to your provider's VPN endpoint. Once that is out of the way you can reset your default gateway to your own tunnel.

user@box> ip r | grep default
default via 192.168.1.1 dev eth0
user@box> grep remote /etc/openvpn/corporate_vpn.conf
#remote vpn.example.com 1194
remote 1.2.3.4 1194
tls-remote vpn
user@box> sudo route add -host 1.2.3.4 gw 192.168.1.1
user@box> sudo route del default
user@box> sudo route add default gw 172.16.5.18
user@box> ip r
default via 172.16.5.18 dev tun0 scope link
[...]
1.2.3.4 via 192.168.1.1 dev eth0

Now everything is swell again in network land; your requests are happily traversing the VPN tunnel.
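The order of the three route commands above matters: the host route to the VPN endpoint has to exist before the default route is replaced, otherwise the tunnel's own packets would try to go through the tunnel. A small Python sketch that only builds and prints the commands (it does not run them); the helper names are made up, and the addresses are the ones from the example above:

```python
def default_gateway(ip_route_output: str) -> str:
    """Extract the gateway from the 'default via <gw> ...' line of `ip r` output."""
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields[:2] == ["default", "via"]:
            return fields[2]
    raise ValueError("no default route found")


def full_tunnel_commands(vpn_endpoint: str, current_gateway: str, tunnel_gateway: str) -> list:
    """Return the route commands in the order they must be issued."""
    return [
        # 1. Pin the VPN endpoint to the existing gateway so the tunnel stays up
        ["route", "add", "-host", vpn_endpoint, "gw", current_gateway],
        # 2. Drop the old default route
        ["route", "del", "default"],
        # 3. Send everything else into the tunnel
        ["route", "add", "default", "gw", tunnel_gateway],
    ]


gateway = default_gateway("default via 192.168.1.1 dev eth0\n")
for command in full_tunnel_commands("1.2.3.4", gateway, "172.16.5.18"):
    print("sudo " + " ".join(command))
```

To go back, the same steps apply in reverse: delete the tunnel default route, restore the original one and remove the host route.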

user@box> tracepath longwhiteclouds.com
1: 172.16.5.18                                          0.349ms pmtu 1350
1: 172.16.5.1                                         312.647ms
1: 172.16.5.1                                         314.739ms
[...] until they finally reach their destination

Hope that helps someone at some point...

Btw.: Excuse the formatting, I'm not too happy with blogger these days.

Monday, June 3, 2013

iscsitarget-dkms broken in Debian Wheezy

Now that was disappointing. An aging iSCSI bug has resurfaced in Debian's latest and greatest stable release, Wheezy, or in numbers, 7. It's rendering Debian's iSCSI package useless. Upon scanning a Debian target using an initiator, e.g. ESXi's software iSCSI adapter, the following messages pop up:

Jun 3 04:30:44 debian kernel: [ 242.785518] Pid: 3006, comm: istiod1 Tainted: G O 3.2.0-4-amd64 #1 Debian 3.2.41-2+deb7u2
Jun 3 04:30:44 debian kernel: [ 242.785521] Call Trace:
Jun 3 04:30:44 debian kernel: [ 242.785537] [<ffffffffa03103f1>] ? send_data_rsp+0x45/0x1f4 [iscsi_trgt]
Jun 3 04:30:44 debian kernel: [ 242.785542] [<ffffffffa03190d3>] ? ua_pending+0x19/0xa5 [iscsi_trgt]
Jun 3 04:30:44 debian kernel: [ 242.785550] [<ffffffffa0317da8>] ? disk_execute_cmnd+0x1cf/0x22d [iscsi_trgt]
[...]

With ESXi in particular the LUN will eventually show up (after a bunch of timeouts, I suppose), but it is not usable in any way and may disconnect at any time.

Solution:

Thankfully there is a solution to the dilemma. After some googling around I found this, again rather old, thread in an Ubuntu forum describing the very same issue. Combined with the knowledge of the aforementioned bug I followed the instructions, grabbed the latest set of iscsitarget-dkms sources, compiled them and, what do you know, it works like a charm.

Tuesday, May 28, 2013

pvscsi vs. LSI Logic SAS

I've talked about me being a poor man with not much of a lab before, to great lengths. It shall suffice to say that this has not changed since I started this blog. However I do have access to quite a bit of infrastructure to test and play around with. And so I did.

In a recent innovations meeting one of my colleagues suggested the use of pvscsi over the default LSI Logic drivers. The idea being the same as with vmxnet3 over e1000g adapters to save CPU resources. However that does not automatically yield a performance improvement. Other people have talked about their findings in the past and VMware themselves have said something about it too. To my surprise my findings were a bit different than proposed by VMware.

The CPU utilization difference between LSI and PVSCSI at hundreds of IOPS is insignificant. But at larger numbers of IOPS, PVSCSI can save a lot of CPU cycles.

My setup was very simple: a 64-bit W2K8R2 VM with 4 GB RAM and 2 vCPUs on an empty ESX cluster with empty storage. I ran my tests during off hours, so impact from other VMs on the possibly shared storage is unlikely; the assigned FC storage LUNs are for test purposes only. (Unfortunately I do not know for sure how the storage is set up in detail, so I don't know whether the arrays are shared or dedicated. The controllers will be shared, however.) Apart from the OS drives the VM had two extra VMDKs, each using its own dedicated virtual SCSI controller, one pvscsi and one LSI Logic SAS.

I might have done something seriously wrong but here's what I found:

Using Iometer's default access specification (100% random access at 2 KB block size, 67% read) I did indeed find very significant differences, but not what I had expected:

pvscsi: Avg Latency: 6.28ms, Avg IOPS: 158, Avg CPU Load 53%
LSI: Avg Latency: 3.16ms, Avg IOPS: 316, Avg CPU Load 34%

Multiple runs confirmed these findings.

Later, as I changed the access specs to a more real-world scenario, VMware's proposition became more and more true and the values approached each other. At 60% random IO both adapters managed roughly 300 IOPS at 10% CPU load.

Conclusion

I cannot conclude much as I know too little about the storage configuration. However I wanted to see what happened if I scaled up a little. Using the very same storage I deployed a NexentaStor CE, gave it 16GB Ram for caching and 2 VMDKs on the same datastores as the initial VM (each Eager Zeroed Thick) and configured a Raid0-ZPool. I configured 4 zVol LUNs inside the storage appliance and handed them out via iSCSI, migrated the W2K8 VM into the provided storage (and nested ESXi for that matter, just to make it a little more irrelevant) and ran the same tests again. Now utilizing multiple layers of caching I got quite different values:

pvscsi: Avg Latency 1.59ms, Avg IOPS 626, Avg CPU Load 11%
LSI: Avg Latency 1.72ms, Avg IOPS 582, Avg CPU Load 21%

The performance impact is indeed insignificant; none of this is interesting for enterprise workloads. The CPU utilization difference is significant however, as it nearly doubles! As I said before, all of this is irrelevant and pretty much a waste of time; it just shows that the platform doesn't have the bang to properly make use of a paravirtualized SCSI controller to begin with. To me that is a little disappointing and an eye opener.

Follow up

Overriding capacity management I migrated the VM onto a production cluster to see whether the storage systems there are a bit more capable. However again the results are not what I expected:

pvscsi: Avg Latency 0.58ms, Avg IOPS 1708, Avg CPU Load 17%
LSI: Avg Latency 0.47ms, Avg IOPS 2126, Avg CPU Load 21%

Again I conclude that none of this is relevant, unfortunately, and I'm going to have to question the engineering team who set up this storage platform to find some answers as to how they decided what to set up.
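One sanity check that can be run on all three result sets is Little's law: the average number of I/Os in flight equals IOPS multiplied by average latency. The sketch below just does that arithmetic on the numbers reported above; the labels are mine.

```python
# (label, average latency in ms, average IOPS), copied from the results above
RESULTS = [
    ("test cluster, pvscsi", 6.28, 158),
    ("test cluster, LSI", 3.16, 316),
    ("NexentaStor, pvscsi", 1.59, 626),
    ("NexentaStor, LSI", 1.72, 582),
    ("production, pvscsi", 0.58, 1708),
    ("production, LSI", 0.47, 2126),
]


def in_flight(latency_ms: float, iops: float) -> float:
    """Little's law: average outstanding I/Os = throughput * latency."""
    return iops * latency_ms / 1000


for label, latency_ms, iops in RESULTS:
    print(f"{label}: {in_flight(latency_ms, iops):.2f} I/Os in flight")
```

Every run comes out at almost exactly one outstanding I/O. If that reflects a queue depth of 1 in the Iometer setup (which I believe is its default, but have not verified for these runs), then IOPS here is simply the inverse of latency, and the tests compare per-I/O latency rather than how the controllers behave under a deep queue, which is where pvscsi is said to save CPU cycles.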

Invalid configuration for device '0'

This dreaded message came upon me just now when I tried to reconnect a VM. I had previously shut this VM down, exported it as an OVF and imported it into a test environment to run some Iometer tests against more powerful storage to compare pvscsi performance to LSI Logic SAS. After the test environment's trial license expired I threw the entire thing away and wanted to reconnect my original VM, only to find the above-mentioned dreaded message.

Following VMware's KB 2014469 on this issue, I first verified that the VM was indeed connected to a free port. I then migrated it using vMotion to a different host, still no good. The third option did however do the trick and thus helped me learn something new about ESXi. It can in fact reload a VM's configuration at runtime and thus resync it with vCenter. And thanks to awk being available, Option 3 can easily be shortened to a one-liner:

vim-cmd vmsvc/reload $(vim-cmd vmsvc/getallvms | grep -i VMNAME | awk '{print $1}')