Citrix introduced a feature with XenApp/XenDesktop 7.7 called “Local Host Cache” (LHC). When this feature was released, there were some limitations, but as time wore on and Citrix began to understand the technology, limitations were removed or reduced.
There are still some limitations on the technology compared to 6.5 IMA LHC, but it looks much better today than a year ago, and undoubtedly it will be better a year from now. However, the documented limitations of the Local Host Cache have me quite curious.
As an example, going through the Local Host Cache documentation, one of the big “limitations” is the number of VDAs. When LHC was introduced, the limit was 5,000 VDAs, but there was never a limit on the number of RDS sessions. The closest information I could find was this old LHC sizing guide that is now out of date. In it, the author tested 100,000 RDS users and 5,075 VDI VDAs. His findings for RDS sessions were difficult to parse, if only because the raw data is not available, but still, some information can be gleaned from the images.
The ‘theoretical min’ row is the absolute minimum time 100,000 users would take to log on if the environment was able to process 20 launches per second, giving 1 hour 23 minutes 20 seconds. In these tests, the 0 applications row managed 1:30:57 in the 6 vCPU case, and 1:30:48 in the 8 vCPU case. The performance of the Active Directory domain will have some impact on how quickly users are authenticated.
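The arithmetic behind the 'theoretical min' figure is easy to reproduce. Here is a minimal sketch in Python; the function name is my own and not something from the sizing guide, and it assumes a perfectly steady launch rate with no other bottleneck:

```python
def theoretical_min_logon_time(users: int, launches_per_second: float) -> str:
    """Best-case time to log on `users` at a fixed launch rate, as H:MM:SS."""
    total_seconds = round(users / launches_per_second)
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


# 100,000 users at 20 launches per second, as in the sizing guide
print(theoretical_min_logon_time(100_000, 20))  # 1:23:20
```

Any measured result, such as the 1:30:57 in the 6 vCPU case, can only be equal to or slower than this number.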
From the images, enumerating 400 applications at 20 requests per second (8 vCPU, LHC on) appears to have taken 1:55:12(!)
I had a couple of concerns regarding the article.
The first concern is that the testing was done using a very old processor, an AMD 8431 that was released June 1st, 2009. This processor was six and a half years old when the article came out. The processor the author was using has a single thread rating of 854. A CPU three years newer, the Intel E5-2680 (still quite old!), scores a single thread rating of 1674. Nearly 2x the performance! It would be extremely handy to understand the actual performance limitation of LHC in a few CPU scenarios (low, medium and high performing tiers). I would imagine that this is more realistic today than deciding that your controllers will live on hardware that is sorely out of date.
The second concern is that he tested with 7.12. There is nothing wrong with that in itself, as that was the release at the time he tested, but 7.14 brought dramatic performance improvements to LHC. Enough improvements that a 2x VDA density was achieved for a single zone and 8x the number of VDAs for a site! Retesting with the improved LHC would have been nice, and it is even more important now because these improvements are in the 7.15 LTSR and people may be sizing their brokers and/or farms with now outdated and potentially incorrect assumptions.
I had experimented with testing the performance of the brokers at enumerating applications (400 applications actually!) on XenApp 7.13, and I found I could enumerate the applications at a crazy concurrent rate of 200/sec and it completed all requests in less than 1000ms. To offer some real world perspective on a fair-sized environment (concurrent user count of ~14,500), I examined the peak user logon rate and the rate was ~14/sec during peak logon time. This means that over a 15-minute timespan in the morning, this 6.5 environment satisfies 12,600 RDS logons. When I tested a 7.X playground environment, it stayed in lockstep with 6.5 until it outperformed it when pushed with extremely high concurrent logon rates.
With everything I’ve said, I’m going to test the LHC performance on 3 classes of processors.
Low Performance Tier, Intel E5-2670, 2.60GHz Sandy Bridge (Released Q1 2012), CPUMark single thread rating 1587.
Mid Performance Tier, Intel E5-2660 v4 2.00GHz Broadwell (Released Q1 2016), CPUMark single thread rating 1826.
High Performance Tier, Intel E5-2690 v4 2.60GHz Broadwell (Released Q1 2016), CPUMark single thread rating 1927.
In order to remove storage from factoring into this testing, I’m going to put the LHC database on a 1.5GB RAMDisk.
I’m going to configure my Broker VM for best performance. The broker will be configured as a 2 socket, 8 core system (16 vCPU total). The hosts this VM will reside on will have 2 sockets with their respective processors with no other VMs residing on them with Hyper-threading enabled. With LHC on, the theory is 4 of those cores will go to the SQL Server Express instance. The VM will have 8GB RAM. The VM will be NUMA aware.
In order to test performance, I’m going to use WCAT to spin up a fixed number of concurrent application enumeration requests against the broker.
Citrix has some excellent performance counters that measure the load against the LHC. “Citrix High Availability XML Service – Concurrent Transactions” accurately measures the load WCAT was reporting, so this is excellent! It means that the number of user enumeration requests was spot on. The counter “Citrix High Availability XML Service – Avg. Transaction Time” measured how long it took before I got back a response for the requests. With these two counters I can measure my load and how long it took for my RDS session application enumeration to respond.
I configured WCAT to add 10 concurrent connections every minute for 10 minutes, up to a maximum of 100 concurrent users requesting enumeration. Why concurrent? Well, that’s just what WCAT does. However, concurrent testing makes this test very different from what was described in the original Citrix test. The Local Host Cache Sizing article states that testing was not at a fixed concurrent amount, but at a rate. That rate was 20 enumeration attempts per second. My testing at 20 enumeration attempts per second shows that the LHC, on these processors, chews through them like they are nothing. If the processor can finish every request before the next batch arrives (one second later in this instance) then you’ll just come up with the “theoretical limit.” Example:
Each packet was processed before the “second” was complete, thus the performance will always be at the “theoretical limit” unless the processing time exceeds the “rate.”
Testing “concurrent” enumeration, however, can process more transactions per second because it ensures there are always 5 enumeration requests occurring. If these transactions take less than a second then more users could be processed per second.
In my simplified examples, the 5 enumeration requests each second will only ever complete 5 requests per second. For the concurrent enumeration requests, the range is 9-10 requests per second. Twice the work accomplished!
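The difference between the two load models can be sketched with a toy calculation. This is only an illustration under assumptions of my own: every request takes a fixed 0.5 seconds (a made-up value, not a measurement) and requests never slow each other down. The function names are hypothetical as well.

```python
import math


def fixed_rate_completed(rate_per_sec: int, service_time: float, duration: int) -> int:
    """Open loop: `rate_per_sec` new requests start at the top of every second."""
    completed = 0
    for second in range(duration):
        # A batch only counts if it finishes inside the test window.
        if second + service_time <= duration:
            completed += rate_per_sec
    return completed


def concurrent_completed(concurrency: int, service_time: float, duration: int) -> int:
    """Closed loop: each client sends its next request as soon as the last one returns."""
    # Small epsilon guards against floating point rounding just below a whole number.
    requests_per_client = math.floor(duration / service_time + 1e-9)
    return concurrency * requests_per_client


duration = 10        # seconds of simulated load
service_time = 0.5   # assumed seconds per enumeration (hypothetical)

print(fixed_rate_completed(5, service_time, duration) / duration)   # 5.0 per second
print(concurrent_completed(5, service_time, duration) / duration)   # 10.0 per second
```

With the fixed rate, the broker sits idle for the second half of every second, so the result can never exceed the rate itself. With fixed concurrency, that idle time is filled with more requests.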
Can we find out how many concurrent enumeration requests are completed in a given second then?
Yes! Citrix offers a 3rd counter that I will key in on: “Citrix High Availability XML Service – Transactions/sec.” This counter will tell me how many requests were completed in a given second. This counter provides me with an actual count of the “real work.”
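As a sanity check, the three counters should roughly agree with each other. In a steady state, Little's law says the number of concurrent transactions equals the completion rate multiplied by the average transaction time. The sketch below is my own helper, not something Citrix provides, and the input values are made up for illustration rather than taken from my results. Because the counters are sampled averages, expect the measured values to land near this estimate rather than exactly on it.

```python
def expected_transactions_per_sec(concurrent: float, avg_time_ms: float) -> float:
    """Little's law rearranged: throughput = concurrency / average time in system."""
    return concurrent * 1000.0 / avg_time_ms


# Hypothetical reading: 20 concurrent transactions averaging 100 ms each
print(expected_transactions_per_sec(20, 100))  # 200.0
```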
Here is an image of the raw data:
The red line is the number of transactions done per second, the green line is the concurrent number of users requesting enumeration and the blue line is how long it took to satisfy each request.
The raw data:
|Conc. requests||Trans/sec||Avg. Trans. Time (ms)||Trans/sec||Avg. Trans. Time (ms)||Trans/sec||Avg. Trans. Time (ms)|
And now, the pretty graphs:
Average Transaction Time. Lower is better
Number of transactions per second. Higher is better.
Per processor comparisons:
Satisfying a theoretical 100,000 XenApp users at 20 concurrent requests would take:
So, what is all of this information telling us? Knowing your peak concurrent rate is important. Making sure the VM that will host your LHC is configured correctly and runs on the best hardware possible will give you the best possible performance in the event you need to use your LHC.
These are very surprising results! The high performance tier processor is over 2x faster than the low performance tier processor! I wasn’t expecting such a discrepancy given the CPUMark numbers.
The LHC does appear to operate at optimum performance at fewer than 20 concurrent connections. Unfortunately, I do not know of a way to govern LHC to a maximum number of concurrent connections. You can create an artificial limit by creating zones to carve up your environment. If the LHC needs to kick in for that scenario, having the environment carved up will reduce the number of connections per broker, as the load will be spread over multiple zones, thus improving performance.
In the end, the performance of the LHC appears to be quite good when viewed from a XenApp workload perspective. Ensuring users get their applications efficiently and quickly, even during a major outage like a database outage, is important and the local host cache implementation of XenApp 7.15 looks to be up to the task.