by Rick Dehlinger, Citrix
As I promised in my February Project Silverton update, the “NetScaler VPX on HyperFlex” project has been progressing nicely, albeit more slowly than I’d hoped or expected.
In some regards this might seem like an odd project – after all, pretty much every modern XenApp/XenDesktop deployment includes NetScaler, regardless of whether VPX’s or physical appliances are in use. This project, however, gives us a chance to explore many universal questions that come up when people begin considering investing in new on-premises virtual infrastructure for Citrix workloads.
I’m fortunate to have Esther Barthel (@virtuES_IT) – Citrix CTP and NetScaler automation guru, working with me on this one. I’ve deployed tons of these suckers, but still wouldn’t consider myself a NetScaler expert by anyone’s definition. The more you dig into them, the more you realize that they’re amazingly versatile as well as powerful. The deeper you get, the more you realize there’s SO much more functionality you know nothing about. Esther balances my skillsets out, giving us a good foundation for pulling this project off. And where our current skills drop off, we’re both fortunate to have a community of experts – both inside and outside of Citrix – to lean on for advice. Blessing counted!
As we were planning this project, we identified the following questions we would try to answer:
- Is running NetScaler VPX on an HCI appliance OK?
- If I run it alongside my desktop/application workload, what does it do to the scalability/user experience?
- How ‘big’ should these appliances be?
- How should they be configured for satisfactory results?
- What other things should I be concerned about if I deploy this way?
As you might expect, we’re learning a lot more on this journey than just the answers we were seeking! It seems we could already write a book on the topic, but since we both have ‘day jobs’, we’ll have to be satisfied distilling them into bite-sized chunks we can share with you here. Without further ado, let’s get started!
1 – Our Plan
Our plan started out pretty simply, and looked something like this:
- Create a reproducible baseline workload that would stress a Cisco HyperFlex cluster of 4 nodes (with no NetScaler).
- Measure this baseline workload, identifying where our bottlenecks lie.
- Re-route sessions through NetScaler appliances (running both on and off the HyperFlex cluster).
- Measure the impact on the baseline workload.
- Share what we learn.
As the details of our journey unfold, you’ll see it took us in quite a few directions we didn’t anticipate!
2 – Our Test Rig
Our test cluster consists of 4 servers which identify themselves as “Cisco HX240c M4SX HyperFlex System”. They’re 2U, ‘pizza box’ form factor hosts, which are managed by a redundant pair of Cisco UCS 6248UP Fabric Interconnects. It’s a pretty sweet setup, and one worthy of a follow-up blog to give you the details as the finished product (physical gear, ESX install/config, SDS layer install/config, etc.) is much more than the sum of the parts – it’s a very well configured, production ready bit of virtual infrastructure, ready to take whatever you throw at it.
In lieu of the full tour, I’ll share a couple other pertinent configuration details that are directly relevant to our testing. Each host sports two Xeon E5-2630v3 CPU’s (8 cores each at 2.40GHz) and 256GB of RAM (composed of multiple sticks of 32GB DDR4-2133-MHz RDIMM/PC4-17000/dual rank/x4/1.2v). Each host boots ESXi 6.0 Patch 3 from a pair of mirrored 64GB SD cards.
The storage configuration is interesting – and when I say interesting, I say that in a positive way (I hate dealing with storage!). Inside the chassis are a pair (I believe) of 120GB drives that are used for/by the Cisco HX Data Platform storage controller VM (version 1.8.1.c for these tests), as well as a bunch of externally accessible drives (including one 1.6TB 2.5 inch Enterprise performance 6G SATA SSD and eleven 1.2 TB 12G SAS 10K RPM SFF HDD’s). All of the externally accessible drives are managed by the storage controller VM, which ‘collects’ the local storage from each host, pools it into one big chunk of provisionable capacity, and presents it back to vSphere as blessedly simple shared storage. To the guy who hates having to think about this stuff, it’s nigh fantastic. All I’ve had to do is click through a simple UI (basically, how big do you want it?) and the rest of the work is done for me. Fantastic – and a great testament to this platform’s ability to break down management silos in IT departments.
3 – Our Baseline Workload
In my February Project Silverton update I already spilled the beans on where we landed with the workload configuration: Server 2012/R2 with Office 2016, provisioned via MCS. We had quite a few options we could have chosen: we’ve got multiple OS options already configured via the automation framework Eric Haavarstein (@xenappblog) set up for us, including Server 2012/R2, Windows 10, and Server 2016. The framework also makes it easy to reproduce the same software stack across OS versions, and to leverage those builds across multiple deployment methods (including traditional full clones, MCS full clones, MCS non-persistent clones, and PVS). The Cisco HyperFlex gear also gives us all of those options since the storage it provides is top notch – it performs very well, has plenty of capacity, and delivers fantastic deduplication rates.
We settled on Server 2012/R2 with Office 2016 deployed via MCS/non-persistent clones, for one primary reason – that’s what Cisco ran for their latest HyperFlex study (1200 XenApp users on an 8-node HyperFlex Cluster), giving us both some valid numbers to check our results against AND some really bright folks we can compare notes with and learn from.
Before I move on, one quick note about the software stack on top of Server 2012/R2. We deliberately kept this as simple as possible, excluding some items I really want to have in there (such as the ControlUp agent and the Citrix Workspace Environment Manager agent) so it’s not only easily reproducible, but also to minimize the influence/impact such agents can have on scalability or performance. As such, our stack is limited to essentially the Citrix VDA, Office 2016, VMware Tools, and the LoginVSI guest tools.
4 – Our XenApp VM Configuration
Our next question came up as we were trying to decide how many VM’s (of what CPU/memory configuration) we should run to ‘properly’ saturate the HyperFlex cluster. Folks may debate us on this, but we leveraged Cisco’s rule of thumb to get started. Their rule:
(Total # of cores including HyperThreading) – (# of cores used by the HyperFlex storage controller VM) – (4 cores for the hypervisor) = (# cores available to XenApp VM’s)
Our HyperFlex cluster is running Intel Xeon E5-2630 v3’s. As mentioned earlier, each of the 4 hosts in our test cluster has two 8-core CPU’s, which gives us a total of 32 hyperthreaded (logical) CPU’s per host to start with. The HX Data Platform installer configures the virtual storage controllers with 8 vCPU’s. If we leave 4 vCPU’s for ESXi to play with, that leaves us with (32 – 8 – 4 =) 20 vCPU’s to allocate to XenApp server VM’s.
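For those who like to see the arithmetic spelled out, here’s a minimal Python sketch of that rule of thumb using the numbers from our hosts. The function and parameter names are my own invention for illustration, not anything from Cisco’s tooling.

```python
def xenapp_vcpus_per_host(sockets, cores_per_socket, controller_vcpus,
                          hypervisor_vcpus=4, threads_per_core=2):
    """Rule of thumb: logical CPUs minus storage controller minus hypervisor."""
    logical_cpus = sockets * cores_per_socket * threads_per_core
    return logical_cpus - controller_vcpus - hypervisor_vcpus


# Our hosts: two 8-core E5-2630 v3 CPUs, 8 vCPUs for the HX storage controller VM
available = xenapp_vcpus_per_host(sockets=2, cores_per_socket=8, controller_vcpus=8)
print(available)  # 20
```

Plug in your own socket and core counts to get a starting point for a different host model.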
Note that Cisco’s latest CVD leverages Xeon E5-2690v4 processors with 14 cores per socket – MUCH more CPU power to put to work, and a worthwhile upgrade investment by most standards! But as the old adage states “beggars can’t be choosers” so we tested what we have… 😉
So now we had a target core count per host of 20 vCPU’s – the next questions we tried to answer were “how many vCPU’s per XenApp server?” and “how many XenApp servers per host?” Since this is far from Cisco’s first rodeo, we started with their optimal/tested configuration of 6 vCPU’s and 24GB of RAM per VM. We could have just left it at that, but that would have been too easy… 😉
Had we rolled with 6 vCPU’s, we could ‘fit’ 3 XenApp servers per host, consuming a total of 18 vCPU’s. Since we calculated we had 20 vCPU’s to work with, this configuration would seem to leave some processing power on the table… That started us on a quest to try to validate the ‘best’ configuration for our setup. We considered the following, but ended up testing only two of the three configurations given our time constraints:
- 3 XenApp servers/host at 6 vCPU’s per server
- 4 XenApp servers/host at 5 vCPU’s per server (untested)
- 5 XenApp servers/host at 4 vCPU’s per server
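The three candidates above fall out of simple integer division against our 20 available vCPU’s per host. Here’s a rough sketch (the `fit` helper is hypothetical, written just for this post) showing how many VM’s each option packs onto a host and how many vCPU’s it leaves unallocated:

```python
AVAILABLE_VCPUS = 20  # per host, from the rule-of-thumb calculation
HOSTS = 4             # nodes in our test cluster


def fit(vcpus_per_vm, available=AVAILABLE_VCPUS):
    """Return (VMs per host, vCPUs left unallocated) without over-provisioning CPU."""
    vms = available // vcpus_per_vm
    return vms, available - vms * vcpus_per_vm


for vcpus in (6, 5, 4):
    vms, spare = fit(vcpus)
    print(f"{vcpus} vCPU: {vms} VMs/host, {vms * HOSTS} VMs/cluster, {spare} vCPUs spare")
```

Only the 6 vCPU option leaves vCPU’s unallocated (2 per host); the 5 and 4 vCPU options both consume all 20.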
A quick note on VM/host memory: each of the hosts in our cluster came with 256GB of 2133MHz RAM, and each storage controller VM consumes 72GB out of the box. For this first round of tests we’ll only be running a max of say 6 VM’s per host, so we weren’t super scientific about memory allocation as we expected to exhaust CPU well ahead of memory with this workload. Cisco’s 6 vCPU setup used 24GB/XenApp server, so we stuck with that, ratcheting it back to 20GB for the 4 vCPU configuration. We’ll get more scientific about memory usage at some point in the future, but given the competition for time from our day jobs, we’re leaving it at that for now. It’s important to note that Cisco recommends that memory is never over-subscribed for production systems, and that’s a recommendation I can happily abide by.
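As a quick sanity check that neither configuration over-subscribes memory, here’s a small sketch using the figures above. It’s deliberately simplistic: it ignores the hypervisor’s own memory overhead, and the helper name is made up for illustration.

```python
HOST_RAM_GB = 256
CONTROLLER_VM_GB = 72  # storage controller VM, out of the box


def unallocated_gb(vms_per_host, gb_per_vm):
    """RAM left on a host after the storage controller VM and the XenApp VMs."""
    return HOST_RAM_GB - CONTROLLER_VM_GB - vms_per_host * gb_per_vm


print(unallocated_gb(3, 24))  # 112 (6 vCPU configuration)
print(unallocated_gb(5, 20))  # 84  (4 vCPU configuration)
```

A negative result would mean the configuration over-subscribes memory, which is exactly what Cisco advises against in production.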
5 – Our Test Parameters
As you might expect, we used LoginVSI to generate and analyze our testing. We settled on their pre-defined “Knowledge Worker” workload, and ran the tests in benchmark mode for consistency and repeatability. All sessions are launched during a 48-minute ramp-up time, and we’re not ‘counting’ a run as successful unless less than 2% of the sessions fail, regardless of whether or not VSIMax has been reached. For those that like to geek out on such things, here’s the VSI summary screen for a common test case:
A couple other tidbits worth sharing:
- For each of the test runs I’ll be sharing here, we ran them on ‘fresh’ VDA’s. We either re-provisioned the catalog or performed an “update machines” from Citrix Studio, ensuring that we had clean images for all passes.
- We did not pre-create profiles or leverage Citrix User Profile Manager or roaming user profiles, so the profile creation process is included in login times and server load. (We’ll probably come back to this one with a future test ‘cause inquiring minds will probably want to know!)
- We pushed the configuration as far as we could in 25 user increments, settling on the threshold you’ll see below. We pushed it harder/further than Cisco does in their CVD’s (which are built to be able to handle the target user load even with the failure of one physical host in the cluster) because we wanted to be able to measure the impact of running NetScaler VPX’s alongside the hosted shared desktops. If we’re not pushing it close to failure it’d be tougher to do. …and I like to watch infrastructure bleed. Can’t help myself!
- We let the XenApp servers ‘settle’ for at least 30 minutes prior to each run. By watching individual VM resource consumption, you can pretty clearly see that Windows Server has some ‘busy work’ it performs on startup, and that it settles down after about 20 minutes. In the screenshot below, the VM was powered on shortly after 5:15PM, and settled down to an idle about 20 minutes later:
Before you go comparing these numbers to Cisco…
One last thing – once I get into the numbers, I know some of y’all are going to immediately crack open Cisco’s latest test report with XenApp 7.11 on HyperFlex and compare numbers. I sure as heck have! I won’t fully analyze and compare the numbers in this blog (there are some important lessons that can come of this, worthy of their own article) but I’ll leave you with a few things to chew on:
- We’re testing the same workload (VSI Knowledge Worker on XenApp 7.11/Server 2012 R2/Office 2016 on HyperFlex/vSphere 6) so there are absolutely parallels to be drawn.
- Cisco’s test-bed was an 8 node HyperFlex cluster, running on the HX220 nodes. Here we’re testing against a 4 node cluster running on the HX240 nodes. Since the HX Data Platform is essentially taking local storage from each node, pooling it together, then presenting it back to the cluster as one network attached storage cluster, logic would suggest that datastores presented by an 8 node HX cluster would perform better than a 4 node cluster doing the same, though Cisco’s testing suggests that it’s the CPU’s that make the difference.
- Cisco’s test-bed has far superior CPU’s. They’ve not only got a generational leap (Xeon v4’s vs. our Xeon v3’s) they’ve also got 12 more physical cores per host running at a higher clock speed (2.6 vs. 2.4 GHz).
- Cisco’s test-bed has more RAM (inconsequential for our scenario, as we have plenty) running at a higher clock speed (2400 MHz vs. 2133 MHz).
- As I previously alluded to, Cisco’s testing is at N+1, meaning that the top load they report can be served by a cluster that’s down by 1 node. This is not inconsequential – if they were testing the same way I am, I expect they’d be reporting back a ‘high water mark’ closer to 1370 users vs. 1200 users!
- We’ve deliberately set the bar lower by only taking basic measures to ‘tune’ the XenApp build stack underneath these results. We haven’t done this to Bogart any good stuff, but in future tests we’re going to attempt to measure the impact of some stack optimizations. Cisco, on the other hand, has already performed some of these optimizations in their stack.
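To show where my ‘closer to 1370’ guess in the N+1 point comes from, here’s a back-of-the-napkin sketch. It assumes user capacity scales linearly with node count, which is my simplification and not something Cisco claims; the function name is hypothetical.

```python
def full_cluster_estimate(n_plus_one_users, nodes):
    """Scale a load sized for (nodes - 1) hosts up to all hosts, assuming linear scaling."""
    return n_plus_one_users * nodes / (nodes - 1)


# Cisco's reported load of 1200 users on an 8-node cluster, sized to survive one node failure
print(round(full_cluster_estimate(1200, 8)))  # 1371
```

In other words, if 7 nodes can carry 1200 users, all 8 should carry roughly 1200 x 8/7 under that assumption.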
6 – Our Results – Baseline Workload
Had enough of the background details yet? If not, feel free to ask away – I’m easy to find! (Twitter – @rickd4real, email – email@example.com, or comment on this article). For the rest of y’all impatient types, let’s get to the meat of the results!
Configuration 1: 20 XenApp servers (4 vCPU, 20GB RAM, 40GB vdisk)
First up – our 4 vCPU configuration. My hunch was that this was going to provide us with the best results as the configuration leveraged all 20 of our ‘available’ vCPU’s per host, and didn’t push the sessions/VM count into the ridiculous category:
Let’s start with the Summary chart, straight out of the LoginVSI Analyzer tool:
For the un-initiated, let’s take a look at the key numbers and metrics from this run, which is a representative sample from many similar passes captured while performing this exercise. The first number to look at is the VSIBase, which was recorded at 638. This measurement equates to the amount of time it takes for a test pass to complete on a system before it’s under a load. This is a “very good” baseline score per LoginVSI, and aside from the 616 Cisco recorded on their 8 node HyperFlex cluster, it’s the best I’ve personally seen yet! I’ve heard rumors of scores as low as 530(!) but haven’t gotten the details yet.
Next let’s notice that this run was a total of 525 sessions. Only two of them got stuck (well under the 2% failure threshold we’re using to render a test invalid) giving us a successful run at 523 sessions.
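Here’s how that pass/fail check works out, as a minimal sketch. The 2% threshold is our own test criterion from earlier in this post, and the function name is just for illustration.

```python
def run_is_valid(launched, stuck, max_failure_rate=0.02):
    """A run counts only if fewer than 2% of the launched sessions got stuck."""
    return stuck / launched < max_failure_rate


launched, stuck = 525, 2
print(f"{stuck / launched:.2%}")      # 0.38%
print(run_is_valid(launched, stuck))  # True
print(launched - stuck)               # 523 successful sessions
```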
Now let’s look at the VSIMax Overview chart from the LoginVSI report:
Let’s focus on the blue line – the VSI Index Average. This line gives us a running average of the test execution time as the user count on the system goes up. It goes up smoothly as the user load increases, and tops out at 1380, still well below the VSI threshold of 1639. My first reaction to this chart is usually “VSIMax not reached? Clearly not pushing it hard enough!”. By most people’s standards (including LoginVSI’s measure) I’d be correct, but keep in mind that we’re looking for a clean run here – i.e. a run with less than 2% stuck sessions and no failures. When we took this test another notch up (to 550) we couldn’t generate predictable or consistent results – sessions started failing left and right. I’m pretty confident we’ll revisit this one in the future, but for now let’s call it and move on.
Now let’s take a look at the storage cluster performance up to peak workload (525 users on at 11:45):
This chart shows detailed measurements over the last hour, and I caught it leading up to the peak of 525 users. The most important thing to notice here is the bottom graph – latency. As you’d expect it has an occasional spike here and there, but for the most part hovers at or under 5ms throughout the test.
Now let’s take a look at the CPU and memory usage at the cluster level. The chart below shows cluster wide CPU/Memory utilization, captured at the high point during the workload (525 users):
My back of the napkin math says that about 79.77% of the CPU is consumed cluster-wide, as is about 67.56% of the total memory. I did capture ESXTOP data during these runs as well (and may fold them into a later blog) but this gives us a good, high level picture of what’s going on with our cluster.
Configuration 2: 12 XenApp Servers (6vCPU, 24GB RAM, 40GB vDisk)
With a strong showing already on the books for the 20 XenApp server config at 4 vCPU each, it was time to run the same tests against a 6 vCPU configuration. When we bump up the vCPU counts per VM, we’ve got to drop the VM count down so we don’t over-provision CPU, which leaves us with a total of 12 XenApp VM’s in this test:
Here’s how it fared, starting with the VSI Analyzer Summary:
Let’s start with the VSI baseline score of 648. That’s still considered “very good” and within the range of normalcy we’ve seen in our testing. 638 is the best, with most results (even different configs) hovering anywhere between 638 and 650. Next, let’s move on to the VSImax. We had 3 stuck sessions this time – still underneath the 2% threshold – but this time we actually hit VSImax – at 515 sessions.
Now let’s check the VSImax detail chart:
This chart shows some variability in the VSI base as the user count goes up – notice the drops in the performance line? This’ll become more clear once we compare it with the 4 vCPU load in a bit. This chart also shows us the intersection point where we crossed the VSIMax threshold at 515 users.
Now let’s look at the storage performance chart, again starting at the beginning of the test through to peak user load:
The numbers don’t look too far off from what we saw with the 4vCPU test. Importantly, the latency figures stay well under 10ms, bobbing around the 5ms line at the end. In all – very comparable results to what we saw with 4 vCPU, though I’ve got a feeling that upon closer inspection we’ll notice some differences. Hopefully my ESXTOP analysis will clear things up (when I get around to it!).
Finally, let’s take a look at a cluster level snapshot of CPU and Memory utilization, again at peak user load:
Applying my same back of the napkin math to get a general idea, I get CPU at 89.41% utilized and about 56.62% memory utilization.
7 – …and the Winner Is:
Let’s take these results and see if we can draw a conclusion on who the winner is, starting with the VSI numbers:
|Test Name||vWS12MCS-4_525_+ESXTOP 2017-02-13-1042||vWS12MCS-6_525_+ESXTOP 2017-02-13-1858|
|Test Description||20 XenApp Servers, 4 vCPU each||12 XenApp Servers, 6 vCPU each|
|Corrected VSImax v4||523 Sessions||515 Sessions|
|VSI Threshold reached?||NO||YES|
|VSIbaseline average response time (ms)||638||648|
|VSImax avg. response time (ms)||1380||1636|
|VSImax threshold reached at (sessions)||WAS NOT REACHED||518|
|Sessions not responding||2||3|
If we overlay the VSIMax comparison chart between the two runs, it also tells us a pretty clear tale:
For those of you who can’t read the micro print, the light blue line is the 4 vCPU configuration, and it’s clearly performing better than the 6 vCPU configuration (lower response time is better).
For the sake of simplicity, let’s say the storage performance was a wash, though I’ve got a feeling the ESXTOP metrics will show the 6 vCPU configuration to have higher average IOPS, throughput, and latency numbers…
Finally, let’s look at the cluster resource consumption numbers, using my napkin math:
|Cluster level utilization at Peak||4 vCPU x 20 XenApp VM’s||6 vCPU x 12 XenApp VM’s|
|CPU utilization (% of total CPU)||79.77%||89.51%|
|Memory utilization (% of total)||67.62%||56.62%|
…and there we have it – from our tests on this specific hardware, a 4 vCPU configuration for XenApp VM’s appears to be the clear winner. The 4 vCPU configuration didn’t reach VSIMax, had fewer stuck sessions, and gave us more total users than the 6 vCPU configuration, while utilizing roughly 10 percentage points less overall CPU on the cluster!
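If you want to poke at the comparison yourself, here’s a small sketch that turns the two runs into a few ratios. The session counts and CPU percentages are taken from the summary tables above; the `summarize` helper and the ‘sessions per CPU point’ ratio are my own napkin metrics, not LoginVSI outputs.

```python
# Figures from the two summary tables above
runs = {
    "4 vCPU x 20 VMs": {"sessions": 523, "vms": 20, "cpu_pct": 79.77},
    "6 vCPU x 12 VMs": {"sessions": 515, "vms": 12, "cpu_pct": 89.51},
}


def summarize(run):
    """Derive sessions per VM and sessions per percentage point of cluster CPU."""
    return {
        "sessions_per_vm": run["sessions"] / run["vms"],
        "sessions_per_cpu_point": run["sessions"] / run["cpu_pct"],
    }


for name, run in runs.items():
    s = summarize(run)
    print(f"{name}: {s['sessions_per_vm']:.1f} sessions/VM, "
          f"{s['sessions_per_cpu_point']:.2f} sessions per CPU point")

# Gap in cluster CPU utilization between the two runs, in percentage points
cpu_gap = runs["6 vCPU x 12 VMs"]["cpu_pct"] - runs["4 vCPU x 20 VMs"]["cpu_pct"]
print(f"CPU gap: {cpu_gap:.2f} percentage points")
```

The 4 vCPU configuration serves more sessions for every point of cluster CPU it burns, which lines up with the conclusion above.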
8 – Next Steps
Let’s do a quick recap – shall we? We’ve now got a workload configuration we can lay down consistently (regardless of provisioning tech in play) thanks to Eric’s automation framework. We’ve got a process for executing consistent load/scale tests using LoginVSI and some more automation goodness from Eric. We’ve got a baseline performance benchmark for our cluster configuration that’s pretty well documented and reproducible, and we’ve determined that a VM configuration of 20 XenApp servers with 4 vCPU’s each will give us great performance while not over-committing our most in-demand resource – CPU.
So – what’s next? Well, more articles of course! Here’s where my head’s at currently, and what you can expect soon:
Part 2: Testing the Baseline Workload through NetScaler VPX (off-cluster)
In the next post of this series, we’ll talk through how we managed to send sessions through a NetScaler Gateway, even though LoginVSI’s built-in launchers can’t do it. We’ll also run our base workload through an external NetScaler Gateway and see if we can notice any impact on the performance results.
Part 3: Testing the Baseline Workload through NetScaler VPX (on HyperFlex)
In the third post of this series, we’ll go over the basic configuration of the NetScaler VPX we’ll be running alongside our hosted shared desktop workload. We’ll tweak our LoginVSI launch parameters to use it, and finally we’ll execute and analyze some tests to see how performance and scale compare when you run your NetScaler VPX’s alongside your desktop workload.
Part 4: Post-game Analysis and Lessons Learned
In the final post of this series (for now! ;-)) I’ll run back through what we learned from this exercise. I may even bring ESXTOP numbers into the mix to see if we can find anything interesting! Finally, I’ll lay out our plan for answering the remaining questions on our list.
Stay tuned for some more Silverton goodness – coming to a browser near you!