Wednesday, May 18, 2016

Updates - Sprinting down the yellow-brick road

It feels like a whole geologic age has passed since my last, triumphant return to this blog - a return after which I never updated it again.  Old stuff has been fixed and revamped, new stuff has come online, and plenty of new mysteries have been revealed and illuminated in equal measure.

Given what all's happened, and what all is happening soon, and since I don't get much work done sitting in a touchdown space away from my desk and monitors and Han Solo, I thought it a good opportunity to regroup and update this thing.

VMWare Horizon View

One of the big projects that got built, rebuilt, expanded, and will soon have to be migrated is the Horizon View project.  For my part, it meant learning the types of traffic the product generates and the methods for moving it, then building the NetScaler configuration accordingly.

Here are some of the highlights:
  1. The Workspace aggregate portal, as well as the VMWare-only View portal, and all their accessory pieces were stood up on our NetScaler MPX 7500s.  Using separate DNS zones, our internal traffic leverages the internal configuration directly, while external clients proxy through a special View Access Point configuration via our DMZ appliances.
  2. The final config involves separate VIPs for separate View configurations, based on the current separation of our Carrollton and Austin datacenter resources.  Frankly, this split led to a pretty messy configuration - one that I hope to change in future revisions of the build - but each separate TM piece is in itself a pretty simple build.
  3. Though I didn't get to finish developing it, I remain convinced CSW is absolutely viable with this product so long as two things are true: people at the table aren't afraid to buck VMWare's recommended build documentation, and you're fully offloading SSL at the ADC despite said documentation.  Workspace is particularly simple, and most of View's complexity lies in their insistence that you use SSL (with strict thumbprint verification) all the way to the back-end.
  4. Persistence groups pretty soundly manage the persistence requirements of the product, something I was glad to see after fears about persistence were on track to drive an even more [unnecessarily] expansive implementation.  Adding each View-related virtual server in the config (TCP 4172, UDP 4172, BLAST 8443, SSL 443) to a SOURCEIP-enabled persistence group (b/c you can't use cookies when a UDP resource is added) keeps the Access Point appliances pretty happy - see the sketch just after this list.
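For reference, the persistence-group piece of the build on the NetScaler CLI looks roughly like the following.  The virtual server names and the timeout are made up for illustration; the real config uses our own naming and timers.

> bind lb group grp-view-persist lb-vs-view-ssl-443
> bind lb group grp-view-persist lb-vs-view-pcoip-tcp-4172
> bind lb group grp-view-persist lb-vs-view-pcoip-udp-4172
> bind lb group grp-view-persist lb-vs-view-blast-8443
> set lb group grp-view-persist -persistenceType SOURCEIP -timeout 10

As I recall, the first bind creates the group implicitly; the set command then applies SOURCEIP persistence across every member, so a client that lands on one virtual server sticks to the same Access Point for all the others.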
We've learned a lot, and there's yet more to figure out in future revisions.  Since the whole thing's currently built on NetScaler MPX appliances for which I've an active project (described in a later section) to evacuate and decommission, we'll have a few more opportunities to further refine the config.  I plan to add GSLB support and, in the process, hopefully cut down on the separate Carrollton/Austin namespaces we have today.  I also hope to test using the NetScaler as the IDP for authentication and pass authenticated clients through to both Workspace and View, mostly as part of a larger endeavor to test using central, front-end Unified Gateway portals for collections of CHST applications that ease the experience with recently mandated two-factor authentication.

New SDX platform

Arguably the biggest new thing that's come along is our new NetScaler SDX 11520 environment.  Each of our datacenters has two of them, and I've now stood up several internal- and DMZ-facing virtual instances to use for all manner of fun stuff.  

I've had to learn a ton about NetScaler as a network appliance to make all this happen.  One of the biggest, hardest lessons so far has been routing and network interfaces.  The short version of this lesson goes something like this: because the default gateway is bound to the management interface (e.g., int 0/1), data traffic you'd rather send along an LACP channel (e.g., LA/1) absolutely will use the management interface instead unless the proper routing build is in place.

What exactly constitutes "proper routing build" is still a subject of learning for me, but in my own testing I've found it to include a couple of things:
  • Policy-based routes that ensure traffic to and from the NSIP uses the same MAC in both directions (i.e., int 0/1), and that anything that isn't management-related traffic (i.e., SNIP- and VIP-network traffic rather than traffic sourced from the NSIP or the management-network SNIP) uses the same MAC in both directions (i.e., LA/1).
  • Static routes that define a gateway for subnets for which the appliance has no subnet IP.  I found this particularly important when trying to use a subnet IP for one network to route to a host in another network. 
The PBR side of things definitely took some trial and error, since I'm not hugely familiar with PBRs (or, more honestly, with routing in general).  I did ultimately settle upon a configuration that, according to a slew of NSTRACE evaluations, successfully keeps the SRC and DST MAC addresses looking as they should - something roughly like the sketch below.
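A simplified sketch of that idea, with made-up addresses: here the NSIP is 10.1.10.10 with its gateway at 10.1.10.1 (reachable via int 0/1), and the data-side SNIP is 10.1.20.11 with its gateway at 10.1.20.1 (reachable via LA/1).

> add ns pbr pbr-mgmt ALLOW -srcIP 10.1.10.10 -nextHop 10.1.10.1 -priority 10
> add ns pbr pbr-data ALLOW -srcIP 10.1.20.11 -nextHop 10.1.20.1 -priority 20
> apply ns pbrs

Traffic sourced from the NSIP gets pushed back out through the management gateway, traffic sourced from the data SNIP gets pushed out through the data-side gateway, and the apply command is what actually activates the PBRs.  The real build has more entries than this, but that's the gist.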

The static routes were a bit more straightforward; the hardest part was figuring out why and when one was necessary.  There is a hierarchy to how the NetScaler determines which SNIP to use to communicate with which network: unless explicitly configured to do otherwise, it will always use the SNIP that's logically closest to the target network.

I knew this going into the configuration, but I didn't account for how that affected which interface it used.  Defining a SNIP adds that SNIP as the default gateway for that particular network; however, to use that SNIP to talk to another network, the NetScaler moves up its routing chain to figure out where to go.  Without a static route that explicitly defines a gateway inside the subnet of the SNIP you wish to use, the NS has no choice but to use its default route.  Its default route, however, leverages the management interface by default, resulting in traffic entering one way but leaving another.
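Using the same made-up addressing as the PBR sketch above, the static route that saves the day looks something like this, where 10.1.40.0/24 is a server network the appliance holds no SNIP in, and 10.1.20.1 is the gateway sitting in the data SNIP's subnet:

> add route 10.1.40.0 255.255.255.0 10.1.20.1

With that route in place, the NetScaler reaches 10.1.40.x hosts through the data-side gateway (and therefore LA/1) instead of falling back to the default route on the management interface.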

My challenge now, of course, is taking what I've recently learned about all this and retroactively correcting the instances I'd already stood up.  I didn't notice the problem initially because I was dealing only with internal-only instances, and there was no internal firewall stopping these asymmetrically transmitted packets.  Everything looked OK until we got to the DMZ, where a FW was blocking traffic from the management network to the interior.  It wasn't until we started combing through packet capture data that we saw the SRC and DST MACs very obviously not matching up, and down the rabbit hole we went.

AirWatch

The main center-stage attraction in the works right now is AirWatch, and it's aimed at replacing the current McAfee mobile management suite in use today.  The build on my end has been pretty interesting, since it's involved a lot of content switching.  I was able to leverage some new skills involving string maps to build the CSW action and policy.  This approach lets me build a single action and a single policy, both of which trigger based upon the key:value pairs kept in the referenced string map (see the sketch below).  The net effect is that any new addition to the config simply requires a new key:value pair (and whatever LB build is required, obviously), rather than a whole new CS action, policy, and binding.
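A rough sketch of the pattern, with hypothetical map entries, paths, and virtual server names (the real paths and LB targets are obviously our own):

> add policy stringmap sm-airwatch
> bind policy stringmap sm-airwatch deviceservices lb-vs-aw-deviceservices
> bind policy stringmap sm-airwatch devicemanagement lb-vs-aw-console
> add cs action cs-act-aw-map -targetVserverExpr "HTTP.REQ.URL.PATH.GET(1).MAP_STRING(\"sm-airwatch\")"
> add cs policy cs-pol-aw-map -rule "HTTP.REQ.URL.PATH.GET(1).IS_STRINGMAP_KEY(\"sm-airwatch\")" -action cs-act-aw-map
> bind cs vserver cs-vs-airwatch -policyName cs-pol-aw-map -priority 100

The policy fires whenever the first path element is a key in the string map, and the action looks that same key up to pick the target LB virtual server.  Adding a new application path is just one more "bind policy stringmap" line (plus the LB build it points at); the CS action, policy, and binding never change.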

Computationally, as I understand it, evaluating string maps is a lot cheaper than evaluating even a basic policy expression, especially when there are a lot of policy expressions through which to iterate for a given request.  Realistically, for the amount of traffic we push, this difference in computational overhead - if any exists - likely has no bearing on end user experience - it's just a much more elegant approach to an already cool build process.  :)

The product is comprised of a management console, device management servers, secure email gateway ("SEG") servers, some application tunneling proxies, and a few other extraneous pieces.  Each set of servers leverages a distinct web app or service path, so the CSW logic has been fairly straightforward to define.  It has taken some trial and error, some awkward questions to the vendor, and some good ol' fashioned Fiddler tracing to figure some pieces out, but overall the build has gone well.  We've successfully tested all components in an internal-only build, and just this week we've tested some of the public-facing components as well.

There's a ton more, and a ton more detail in which I plan to explore all this - I've just gotten woefully behind in documentation in recent weeks.

More to come!

~Fin~

Thursday, November 19, 2015

Back in the saddle, just in time for my head to explode

I've no sooner gotten back onto the NetScaler horse than I've experienced the feeling of that horse freaking out and taking off at full tilt over an unforgiving pasture of new things!

Crazy horse metaphor aside, I've a lot to figure out and not a whole lot of time in which to do so:
  • Architect and implement a brand new SDX environment.  All of it.
  • AAA policies and their proper build and application, including those intended for use with 2-factor authentication (Duo, apparently?)
  • GSLB between primary and secondary datacenters
  • Proper implementation of FIPS cert signing services for Epic ePrescription
  • App delivery/load balancing for brand new VMWare Horizon/View/Workspace VDI
:|

All challenges I look forward to tackling, and I plan on capturing as much of that experience as I can here.  I've already benefited from my previous documentation efforts; Past Me saved Current Me from having to completely relearn how the 10.5+ UI works with Policy Labels. 

Thanks, Past Me!

Dusting things off (without getting it in my hair)

Something else I look forward to doing here is what I did at BMHCC: consolidating Epic's web applications behind a shiny new SDX/VPX environment while keeping our SSL Labs scores as high as possible.  Content Switch ALL THE THINGS!

I've already begun this work using HSWeb, since it's by far the easiest to manage.  I've had to dust off some of the CSW policy expression stuff and learn a few new things in this new 11.x interface, but so far the build is going well (a rough sketch below).  My biggest challenge now is getting the names of all production and non-production servers, what applications they run, and which versions of those applications they run.  Currently, most are built behind a set of F5 appliances, and those VIPs will either need to move or be replaced.
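The content-switching logic for the HSWeb piece itself is pretty simple; a sketch with made-up names and a placeholder hostname looks something like this:

> add cs policy cs-pol-hsweb -rule "HTTP.REQ.HOSTNAME.TO_LOWER.EQ(\"hsweb.example.org\")"
> bind cs vserver cs-vs-epicweb -policyName cs-pol-hsweb -targetLBVserver lb-vs-hsweb-prd -priority 100

One front-end CS virtual server, one policy per application hostname (or path), and each policy just hands the request to the right LB virtual server behind the curtain.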

More to follow!

Tuesday, May 6, 2014

MyChart Performance Weirdness - Part 1


Here’s the scenario:


The Epic MyChart patient portal is accessible to clients through two logical pathways: to the patient population at large from the public internet using the Epic DMZ NetScaler appliance and to Hyperspace users from the internal network using the existing McKesson NetScaler appliance.


Each path to MyChart uses the core services available to it, like DNS.  Patients connecting from the outside to mychart.baptistoncare.org resolve the name publicly, while Hyperspace users connecting from the inside to mychart.bmhcc.org resolve the name using internal DNS resources.


Check it.

Graph 1: External Access




Graph 2: Internal Access




Wat. :|


Notice that the “stair step” between each transfer is almost exactly 1 second.  I’m not good enough at math to appreciate with any accuracy the transfer performance difference here.  Instead, my plebeian mind conjures terms like terribad, or silly-bad, to fill the gap.

The game is afoot!


The first thing I was able to confirm with certainty was that performance to the server itself was just as good as it was from the outside.  My problems apparently lie in the NetScaler config for this particular application, since James confirmed we’re not aware of any reports of trouble from any of the other applications running through it.

Despite no other reports of trouble, I checked NetScaler resource utilization first.  It was pretty uninteresting.


From here, I started pulling off SSL at the various steps of the config to identify any problems that might exist in my certificate build.


The first thing was to create an HTTP-only service group.
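Something along these lines, with placeholder server names and addresses:

> add server mychart-web01 192.0.2.21
> add serviceGroup sg-mychart-http HTTP
> bind serviceGroup sg-mychart-http mychart-web01 80

Same back-end server as the production SSL service group, just spoken to over plain HTTP on port 80.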




I then had to create a few different test virtual servers to host different configurations without affecting actual MyChart traffic.  James had already established a TCP test server, but I needed something a bit more fanciful.  Enter my test HTTP and SSL virtual servers.
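Roughly like this, again with placeholder names, addresses, and cert names:

> add lb vserver vs-mychart-test-http HTTP 192.0.2.110 80
> add lb vserver vs-mychart-test-ssl SSL 192.0.2.111 443
> bind ssl vserver vs-mychart-test-ssl -certkeyName mychart-certkey
> bind lb vserver vs-mychart-test-http sg-mychart-http
> bind lb vserver vs-mychart-test-ssl sg-mychart-http

Swapping which service group (HTTP or SSL) sits behind each test virtual server is what produced the results matrix below.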




I used different combinations of my SSL and HTTP virtual servers and service groups to better understand where in the transaction my slowness was manifesting itself.

Here’s the breakdown of the results:


Virtual Server    Service Group    Result (Pass = Normal / Fail = Slow)
HTTP              HTTP             Pass
HTTP              SSL              Pass
SSL               HTTP             Pass
SSL               SSL              Fail



Wat wat. :|


Not 100% sure what this means, I performed an NSTRACE while using the problematic config.

Weeding through an NSTRACE file sucks


It is, however, a useful exercise, as it reveals all manner of network gossip going on around the resources in question.  I tried to configure a trace that only looked at the traffic for a particular virtual server, but it gave me everything anyway.  
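For what it’s worth, the trace I attempted looked something like this (placeholder VIP address); filtering on the virtual server’s IP was supposed to narrow things down, but as noted, I got the whole conversation anyway:

> start nstrace -size 0 -filter "CONNECTION.DSTIP.EQ(192.0.2.111)"
> stop nstrace

The -size 0 bit captures full packets rather than truncated ones, which matters when you want to see the TLS handshakes and not just the headers.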


There are only a couple of things that really stand out to me from other applications’ traffic:

  • Lots of out-of-order TCP traffic
  • Tons of TLSv1 “Server Hello” and cipher renegotiation

What these two facts mean is still beyond me.  I’ve reached out to Chris Lancaster, Citrix Bro and NetScaler Extraordinaire, to meet with me later this afternoon and work through my methods.


To sum this thing up so far: there is something in how the 7500 is configured that creates major per-transaction delays when brokering a client HTTPS connection to an HTTPS-enabled server.  The problem does not appear when SSL is used on only one side; that is, HTTPS only from the client, or HTTPS only to the server, does not create the issue.


Stay tuned.

Thursday, April 17, 2014

Tracking the clickahertz and jiggarams - Part 1

I assembled this list some time ago with every intention to turn it into actual Collections and Reports in PerfMon on all my IIS boxes. This has yet to happen, mainly because there were a lot of ongoing discussions about systems monitoring, Epic's SystemPulse product, BMHCC's enterprise SCOM implementation, and others.  And also because my mind - and, thus, sense of time and prioritization - is a hot mess.

Cobbled together from various sources, this list represents a best first attempt at a universally applicable list of key Windows metrics, regardless of a server's stated purpose.
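When I do finally get around to building the collections, my working assumption is that something like logman will do the trick on each box.  A minimal sketch using a handful of the counters below (the name, interval, and output location are just placeholders):

logman create counter BaselinePerf -si 00:00:15 -f bincirc -max 512 -o C:\PerfLogs\BaselinePerf ^
  -c "\Memory\Available MBytes" "\Memory\Pages/sec" "\Processor(_Total)\% Processor Time" ^
     "\System\Processor Queue Length" "\PhysicalDisk(_Total)\Avg. Disk sec/Read" ^
     "\Network Interface(*)\Bytes Total/sec"
logman start BaselinePerf

The bincirc format keeps the collector from eating the disk, and the 15-second sample interval is a starting point, not gospel.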


My plan for "Part 2" involves a similar collection of metrics specific to IIS and ASP.NET applications. Those waters are considerably harder to navigate for me as a non-developer, but they're no less critical to answering questions about concurrent client connections, number of unique requests to an application, tracking memory leaks or other weirdness in worker processes, etc. I'll post that once I have it.  

Without further ado:

Memory

Memory\Available Mbytes
  Description: Available system memory in megabytes (MB)
  Notes: <10% considered low; <5% considered critically low; a 10 MB negative delta per hour indicates a likely memory leak

Memory\Committed Bytes
  Description: Amount of committed virtual memory; also called the "commit charge" in Task Manager

Memory\System Cache Resident Bytes
  Description: Amount of memory consumed by the system file cache; shows as "Metafile" in a memory explorer
  Notes: on 64-bit systems, memory addressing allows the system file cache to consume almost all physical RAM if left unchecked

Memory\Pages Input/sec
  Description: Rate at which pages are read from disk to resolve hard page faults
  Notes: >10 considered high; compare to Memory\Page Reads/sec to determine the number of pages read into memory per read operation; will be >= Page Reads/sec, and a large delta between them may indicate a need for more RAM or a smaller disk cache

Memory\Pages/sec
  Description: Rate at which pages are read from and written to disk to resolve hard page faults
  Notes: >1000 considered moderate, as memory may be getting low; >2000 considered critical, as the system is likely experiencing delays due to heavy reliance on slow disk resources for paging; equals the sum of Pages Input/sec and Pages Output/sec

Memory\Page Faults/sec
  Description: Total number of hard and soft page faults per second
  Notes: Pages/sec can be calculated as a percentage of Page Faults/sec to determine what portion of total faults are hard faults

Process(*)\Handle Count
  Description: Total count of concurrent handles in use by the specified process
  Notes: because this number constantly fluctuates, the delta between high and low values is most important; Max - Min should not exceed roughly 1000; a consistently large handle count or an aggressive upward trend commonly points to a memory leak

  
Processor
System\Processor Queue Length
  Description: Total number of queued threads waiting to be processed, across all processors
  Notes: >=10 considered high for a multi-processor system; PQL/n, where n = number of logical processors, gives the per-core queue length; if % Processor Time is high (>=90%) and per-core PQL is >=2, there is a performance bottleneck; it is not uncommon to see low % Processor Time with a per-core PQL >=2, depending on the efficiency of the requesting application's threading logic

Processor\% Processor Time
  Description: Percentage of time the specified CPU is executing non-idle threads
  Notes: >75% considered moderate and should be closely monitored; >90% considered high and may begin to cause performance delays; 95%-100% considered critically high and will cause major performance delays; track this value per logical processor; if % Processor Time is high (>=75%) while disk and network utilization are low, consider upgrading or adding processors

Processor\% Interrupt Time
  Description: Percentage of time the specified CPU spends receiving and servicing hardware interrupts from network adapters, hard disks, and other system hardware
  Notes: interrupt rates of 30%-50% or higher may indicate a driver or hardware problem


Disk I/O

PhysicalDisk\% Disk Time
  Description: Percentage of time the selected disk spends servicing read or write requests
  Notes: if this value is high relative to nominal CPU and network utilization figures, disk performance is likely the problem

PhysicalDisk\Disk Writes/sec
  Description: Average number of disk writes per second
  Notes: used in conjunction with Disk Reads/sec as a general indicator of disk I/O activity

LogicalDisk(*)\Avg. Disk sec/Write
  Description: Average time in seconds a specified disk takes to process a write request
  Notes: >15ms considered slow and worth close evaluation; >25ms considered very slow and likely to negatively impact system performance

PhysicalDisk\Avg. Disk Write Queue Length
  Description: Average number of write requests waiting to be processed
  Notes: used in conjunction with Avg. Disk Read Queue Length, this gives an idea of disk access latency; AWQL/n should be <=4, where n is the number of disks in the RAID set

PhysicalDisk\Disk Reads/sec
  Description: Average number of disk reads per second
  Notes: used in conjunction with Disk Writes/sec as a general indicator of disk I/O activity

LogicalDisk(*)\Avg. Disk sec/Read
  Description: Average time in seconds a specified disk takes to process a read request
  Notes: >15ms considered slow and worth close evaluation; >25ms considered very slow and likely to negatively impact system performance

PhysicalDisk\Avg. Disk Read Queue Length
  Description: Average number of read requests waiting to be processed
  Notes: used in conjunction with Avg. Disk Write Queue Length, this gives an idea of disk access latency; ARQL/n should be <=4, where n is the number of disks in use


Network I/O

Network Interface(*)\Bytes Total/sec
  Description: Total bytes sent and received per second for the specified network adapter
  Notes: if this exceeds 50% of the Current Bandwidth value under typical load, problems during peak times are likely

Network Interface(*)\Current Bandwidth
  Description: Estimate of the current bandwidth, in bits per second (bps), available to the specified NIC; treated as "nominal bandwidth" where accurate estimation is impossible or where bandwidth doesn't vary
  Notes: to estimate current NIC utilization, use NIC Utilization = ((Max Bytes Total/sec * 8) / Current Bandwidth) * 100; >30% NIC utilization on a shared network is considered high

Network Interface(*)\Output Queue Length
  Description: Number of packets waiting for outbound processing by the specified NIC
  Notes: >1 sustained considered high; >2 sustained considered critically high

Tuesday, April 15, 2014

From (F)ailure to (A-)wesome on SSLLabs

I'm once again shamelessly copy/pasting a new post on here.  I don't feel too ashamed of it, since I did actually write the original post.  It turns out we have MySites at work, and that the Blog feature is enabled!  I'll very likely be posting in both places as the life of this Blog thing draws onward.
 ***

There's been a lot of chatter about the Heartbleed SSL vulnerability in the last couple of weeks, and rightfully so. One place folks seem to love going is over to SSLLabs, since they have a server tester you can run to determine what kind of safety grade – A to F – you get.
At the outset, my tests of the BOC Link and MyChart sites generated giant, terrifyingly red "F" results. This was not due to Heartbleed, thank goodness, since the NetScalers do not use an affected version of OSSL, and none of my web servers use OSSL at all. What failed me instead was another, slightly older vulnerability: SSL renegotiation.
Both BOC Link and MyChart run behind a NetScaler virtual VPX appliance running v10.0.x of the software. Out of the box, NetScalers are configured to allow all SSL renegotiation in all forms, whether initiated from the client connection or the server. A quick check at the console will tell you the current status of the parameter:

> show ssl parameter
Advanced SSL Parameters
-----------------------
SSL quantum size: 8 kB
Max CRL memory size: 256 MB
Strict CA checks: NO
Encryption trigger timeout 100 mS
Send Close-Notify YES
Encryption trigger packet count: 45
Deny SSL Renegotiation NO
Subject/Issuer Name Insertion Format: Unicode
OCSP cache size: 10 MB
Push flag: 0x0 (Auto)
Strict Host Header check for SNI enabled SSL sessions: NO
PUSH encryption trigger timeout: 1 ms
Global undef action for control policies: CLIENTAUTH
Global Undef action for data policies: NOOP 


Citrix has a pretty handy article on what exactly the -denySSLReneg parameter is, what its options are, and how to change it. See it here.
Here's the command:

> set ssl parameter -denySSLReneg NONSECURE
Done 

By setting the Deny SSL Renegotiation option to NONSECURE, I've corrected the renegotiation vulnerability without (hopefully) creating any compatibility issues for our Link and MyChart users. This setting appears to be global, so affecting this change raised the scores of both sites from "F" to "A-" (RC4 ciphers, indeed!) simultaneously.

> show ssl parameter
Advanced SSL Parameters
-----------------------
SSL quantum size: 8 kB
Max CRL memory size: 256 MB
Strict CA checks: NO
Encryption trigger timeout 100 mS
Send Close-Notify YES
Encryption trigger packet count: 45
Deny SSL Renegotiation NONSECURE
Subject/Issuer Name Insertion Format: Unicode
OCSP cache size: 10 MB
Push flag: 0x0 (Auto)
Strict Host Header check for SNI enabled SSL sessions: NO
PUSH encryption trigger timeout: 1 ms
Global undef action for control policies: CLIENTAUTH
Global Undef action for data policies: NOOP

Monday, January 6, 2014

I hate dirty Application Logs - PerfMon counters and IIS Advanced Logging

I posted this one on Epic's UserWeb entity portal and, in my guilt for not posting in a while, have blatantly copied and pasted it here.  Methods aside, it's good info, especially if you're OCD about what shows up in your server's error logs like I apparently am.  Source link at the bottom.

Enjoy!

-------------
I have been configuring IIS Advanced Logging on all the IIS servers I've been building ahead of our go-live in January. It works swimmingly and solves a couple of problems I'd always hated about standard IIS logging:
1. The logging happens in real-time, instead of on a 3 minute delay.
2. You can add custom fields, like "Client-IP," that work a lot more smoothly with ADC's and load balancers that might otherwise mask information about a logged session.
3. You can include basic performance counters in your logs, like the W3WP CPU and memory utilization.

That #3 is why I'm posting here tonight. Even though those fields were disabled in my default log definition, I'd still get the following error for each metric in my Windows Application log:
____
Log Name: Application
Source: IIS Advanced Logging Module
Date: 12/17/2013 10:22:29 AM
Event ID: 1008
Task Category: None
Level: Error
Keywords: Classic
User: N/A
Computer: ECEPPRINTC01.ad.bmhcc.org
Description:
Failed to initialize performance counter \Process(w3wp)\Private Bytes. Data for this performance counter data will not be recorded until the counter is available. PdhCollectQueryData: 0x0X800007D5.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">;
<System>
<Provider Name="IIS Advanced Logging Module" />
<EventID Qualifiers="0">1008</EventID>
<Level>2</Level>
<Task>0</Task>
<Keywords>0x80000000000000</Keywords>
<TimeCreated SystemTime="2013-12-17T16:22:29.000000000Z" />
<EventRecordID>2513</EventRecordID>
<Channel>Application</Channel>
<Computer>MYSERVERNAME</Computer>
<Security />
</System>
<EventData>
<Data>\Process(w3wp)\Private Bytes</Data>
<Data>0X800007D5</Data>
</EventData>
</Event>
____
The short of it: this error showed up in the logs of those servers whose application pools I'd configured to use ApplicationPoolIdentity to authenticate instead of the old standby NetworkService. It occurs because ApplicationPoolIdentity has no rights to Performance Monitor, and so no access to log using Performance Monitor counters.

The fix is to add your application pool's identity ("IIS APPPOOL\APPPOOLNAME") to the built-in Performance Monitor Users group. Doing so eliminates the errors in Windows Application log, and makes the metrics actually show up correctly in the log (instead of "-" like they were originally).
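From an elevated prompt, the group membership change is a one-liner; the app pool name here is a placeholder, and the worker process will likely need a recycle before the change actually takes hold:

net localgroup "Performance Monitor Users" "IIS APPPOOL\MyAppPool" /add

You can accomplish the same thing through Computer Management if you prefer clicking to typing.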

Here's where I found the info:
http://blogs.microsoft.co.il/idof/2013/08/20/fixing-iis-advanced-logging-performance-counters-errors/

Sunday, November 3, 2013

The Tits: IIS Advanced Logging Module

The first time I thought "Wow, NetScalers sure fill my IIS server logs with a bunch of shit data" was about twenty minutes after my first encounter with the technology.  I had checked every setting I could think of on the IIS Logging feature and found nothing of use.  It's an all or nothing proposition: enabled, or off completely.

Tonight, the clouds parted, and the moon shone brightly.

I came in tonight to try to get some things ready for some server build audits we have coming up.  My two main tasks were as follows:
  1. Get three of the core web application servers built on the NetScaler (that I finally have access to) and load balanced in a basic capacity.
  2. Get the current version of the application deployed to those same three servers.
The first task I accomplished pretty handily, despite my apprehension logging into a production NetScaler for the first time.  I've been through the training, and I've been reading through their eDocs, of course, but I went very slowly for the first few minutes. 

I had not long completed #1 and successfully tested it when my thoughts again returned to my servers' IIS logs.  They were filled with 0-byte HEAD requests from the NetScaler.  By default, the NetScaler's built-in http monitor executes a HEAD method request against the target system every five seconds.  I knew this from earlier study and my general understanding of monitoring systems.  What I had not considered before the last month or so was what this method of monitoring looks like on the receiving end.
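For context, the built-in http monitor comes predefined on the appliance; a custom HTTP monitor, had I wanted one, would look something like this (the names and the request path are hypothetical):

> add lb monitor mon-webapp-http HTTP -httpRequest "HEAD /app/health.aspx" -respCode 200 -interval 5
> bind serviceGroup sg-webapp -monitorName mon-webapp-http

Either way, every probe still lands in the IIS logs on the receiving end, which is exactly the problem I was staring at.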

I turned to the Intergoogle for aid. There had to be an answer to this problem.  The legions of IIS administrators that I still think might be out there somewhere surely could not have, all this time, merely tolerated this situation.

Right?

Enter, the solution

There is, indeed, a solution to this problem, and it is called IIS Advanced Logging.  I've only just begun to explore its capabilities; however, I determined almost immediately that I must have this module on my servers.

All of them.

The reason is simple: it lets me filter out unwanted log traffic.  Further, should I determine a sound need to (which I'm still researching), I can separate out different log data into separate files based upon what I want to actually see.

If this module did nothing else of value, this one feature would justify its existence.  From the reading I've already done, though, this thing is all kinds of flexible.  There are some limitations, but at least for now they do not affect what I'm doing.

I have installed, enabled, configured, and tested AdvL on all three of tonight's target servers, and it's running like a champ.  I did find it odd that you have to manually disable standard IIS Logging, else it logs data in both places simultaneously.  I didn't initially notice that IIS Logging still existed as a separate feature and worried about the resource impact of such an arrangement.  Fortunately, that is now a non-issue.


~Fin~