Thursday, April 17, 2014

Tracking the clickahertz and jiggarams - Part 1

I assembled this list some time ago with every intention of turning it into actual Data Collector Sets and Reports in PerfMon on all my IIS boxes. This has yet to happen, mainly because there were a lot of ongoing discussions about systems monitoring, Epic's SystemPulse product, BMHCC's enterprise SCOM implementation, and others.  And also because my mind - and, thus, sense of time and prioritization - is a hot mess.

Cobbled together from various sources, this list represents a best first attempt at a universally applicable set of key Windows metrics, regardless of a server's stated purpose.


My plan for "Part 2" involves a similar collection of metrics specific to IIS and ASP.NET applications. Those waters are considerably harder to navigate for me as a non-developer, but they're no less critical to answering questions about concurrent client connections, number of unique requests to an application, tracking memory leaks or other weirdness in worker processes, etc. I'll post that once I have it.  

Without further ado:

Memory

Memory\Available MBytes
Description: Available system memory in megabytes (MB).
Usage notes:
- <10% of physical RAM is considered low; <5% is considered critically low.
- A sustained negative delta of 10 MB per hour indicates a likely memory leak.

Memory\Committed Bytes
Description: Amount of committed virtual memory. Also called the "commit charge" in Task Manager.

Memory\System Cache Resident Bytes
Description: Amount of memory consumed by the system file cache. Shows as "Metafile" in a memory explorer.
Usage notes:
- On 64-bit systems, memory addressing allows the system file cache to consume almost all physical RAM if left unchecked.

Memory\Pages Input/sec
Description: Rate at which pages are read from disk to resolve hard page faults.
Usage notes:
- >10 is considered high.
- Compare to Memory\Page Reads/sec to determine the number of pages read into memory per read operation.
- Will always be >= Page Reads/sec; a large delta between them may indicate a need for more RAM or a smaller disk cache.

Memory\Pages/sec
Description: Rate at which pages are read from and written to disk to resolve hard page faults. The sum of Pages Input/sec and Pages Output/sec.
Usage notes:
- >1000 is considered moderate, as memory may be getting low.
- >2000 is considered critical, as the system is likely experiencing delays due to heavy reliance on slow disk resources for paging.

Memory\Page Faults/sec
Description: Rate of hard and soft page faults combined.
Usage notes:
- Pages/sec can be calculated as a percentage of Page Faults/sec to determine what share of total faults are hard faults.

Process(*)\Handle Count
Description: Total count of concurrent handles in use by the specified process.
Usage notes:
- Since this number constantly fluctuates, the delta between high and low values is most important; Max - Min should not exceed 1000.
- A consistently large handle count, or an aggressive upward trend, commonly indicates a handle leak and the memory leak that comes with it.
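The hard-fault math above is easy to get backwards, so here's a minimal sketch of the Pages/sec-as-a-share-of-Page Faults/sec calculation. The function name and sample values are mine, not PerfMon's:

```python
def hard_fault_percentage(pages_per_sec, page_faults_per_sec):
    """Return Pages/sec as a percentage of Page Faults/sec.

    A high percentage means most faults are hard faults hitting disk,
    which is where the real performance pain comes from.
    """
    if page_faults_per_sec == 0:
        return 0.0
    return (pages_per_sec / page_faults_per_sec) * 100

# Example: 50 hard-fault pages/sec out of 2000 total faults/sec
print(hard_fault_percentage(50, 2000))  # 2.5
```

A low percentage like this is normal; soft faults are cheap and dominate on a healthy box.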

  
Processor
System\Processor Queue Length
Description: Total number of queued threads waiting to be processed, across all processors.
Usage notes:
- >=10 is considered high for a multi-processor system.
- PQL/n, where n is the number of logical processors, gives the per-core queue length.
- If % Processor Time is high (>=90%) and per-core PQL is >=2, there is a performance bottleneck.
- It is not uncommon to see low % Processor Time alongside a per-core PQL >=2, depending on the efficiency of the requesting application's threading logic.

Processor(*)\% Processor Time
Description: Percentage of time the specified CPU spends executing non-idle threads.
Usage notes:
- >75% is considered moderate and should be closely monitored.
- >90% is considered high and may begin to cause delays in performance.
- 95%-100% is considered critically high and will cause major delays in performance.
- This value should be tracked per logical processor.
- If % Processor Time is high (>=75%) while disk and network utilization are low, consider upgrading or adding processors.

Processor(*)\% Interrupt Time
Description: Percentage of time the specified CPU spends receiving and servicing hardware interrupts from network adapters, hard disks, and other system hardware.
Usage notes:
- Interrupt rates of 30%-50% or higher may indicate a driver or hardware problem.
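The per-core queue length rule of thumb above is simple enough to express as a check. This is just my sketch of the heuristic, not anything PerfMon gives you directly:

```python
def cpu_bottleneck(processor_queue_length, logical_processors, pct_processor_time):
    """Apply the rule of thumb: a bottleneck is likely when % Processor Time
    is >= 90 and the per-core queue length (PQL / n) is >= 2."""
    per_core_pql = processor_queue_length / logical_processors
    return pct_processor_time >= 90 and per_core_pql >= 2

# 16 queued threads on 8 cores at 95% busy -> per-core PQL of 2.0
print(cpu_bottleneck(16, 8, 95))  # True
# Same queue but an idle CPU: look at the app's threading logic instead
print(cpu_bottleneck(16, 8, 40))  # False
```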


Disk I/O

PhysicalDisk\% Disk Time
Description: Percentage of time the selected disk spends servicing read or write requests.
Usage notes:
- If this value is high relative to nominal CPU and network utilization figures, disk performance is likely a problem.

PhysicalDisk\Disk Writes/sec
Description: Average number of disk writes per second.
Usage notes:
- Used in conjunction with Disk Reads/sec as a general indicator of disk I/O activity.

LogicalDisk(*)\Avg. Disk sec/Write
Description: Average time in seconds the specified disk takes to service a write request.
Usage notes:
- >15 ms is considered slow and worth close evaluation.
- >25 ms is considered very slow and likely to negatively impact system performance.

PhysicalDisk\Avg. Disk Write Queue Length
Description: Average number of write requests waiting to be processed.
Usage notes:
- Used in conjunction with Avg. Disk Read Queue Length, this gives an idea of disk access latency.
- AWQL/n should be <=4, where n is the number of disks in the RAID set.

PhysicalDisk\Disk Reads/sec
Description: Average number of disk reads per second.
Usage notes:
- Used in conjunction with Disk Writes/sec as a general indicator of disk I/O activity.

LogicalDisk(*)\Avg. Disk sec/Read
Description: Average time in seconds the specified disk takes to service a read request.
Usage notes:
- >15 ms is considered slow and worth close evaluation.
- >25 ms is considered very slow and likely to negatively impact system performance.

PhysicalDisk\Avg. Disk Read Queue Length
Description: Average number of read requests waiting to be processed.
Usage notes:
- Used in conjunction with Avg. Disk Write Queue Length, this gives an idea of disk access latency.
- ARQL/n should be <=4, where n is the number of disks in use.
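Putting the latency thresholds and the queue-per-spindle rule together, a quick classifier might look like this. The function and thresholds are just my restatement of the table above (the sec/Read and sec/Write counters report in seconds, so convert to milliseconds first):

```python
def disk_health(avg_sec_per_io, avg_queue_length, spindle_count):
    """Classify disk latency and check queue depth per spindle.

    avg_sec_per_io: Avg. Disk sec/Read or sec/Write value, in seconds.
    avg_queue_length: Avg. Disk Read/Write Queue Length.
    spindle_count: number of disks backing the volume (n in AQL/n <= 4).
    """
    latency_ms = avg_sec_per_io * 1000
    if latency_ms > 25:
        latency = "very slow"
    elif latency_ms > 15:
        latency = "slow"
    else:
        latency = "ok"
    queue_ok = (avg_queue_length / spindle_count) <= 4
    return latency, queue_ok

# 30 ms per I/O with 10 queued requests across a 4-disk RAID set
print(disk_health(0.030, 10, 4))  # ('very slow', True)
```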


Network I/O

Network Interface(*)\Bytes Total/sec
Description: Total bytes sent and received per second on the specified network adapter.
Usage notes:
- If this exceeds 50% of the Current Bandwidth value under typical load, problems during peak times are likely.

Network Interface(*)\Current Bandwidth
Description: Estimate of the current bandwidth in bits per second (bps) available to the specified NIC. Considered "nominal bandwidth" where accurate estimation is impossible or where bandwidth doesn't vary.
Usage notes:
- To estimate current NIC utilization, use the following formula: NIC Utilization = ((Max Bytes Total/sec * 8) / Current Bandwidth) * 100
- >30% NIC utilization on a shared network is considered high.

Network Interface(*)\Output Queue Length
Description: Number of packets queued for outbound transmission on the specified NIC.
Usage notes:
- >1 sustained is considered high.
- >2 sustained is considered critically high.
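The NIC utilization formula above, spelled out (the *8 converts bytes/sec to bits/sec to match Current Bandwidth's units; sample figures are illustrative):

```python
def nic_utilization(max_bytes_total_per_sec, current_bandwidth_bps):
    """NIC Utilization = ((Max Bytes Total/sec * 8) / Current Bandwidth) * 100"""
    return (max_bytes_total_per_sec * 8 / current_bandwidth_bps) * 100

# A 50 MB/s peak on a gigabit link
util = nic_utilization(50_000_000, 1_000_000_000)
print(util)        # 40.0
print(util > 30)   # True -> considered high on a shared network
```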

Tuesday, April 15, 2014

From (F)ailure to (A-)wesome on SSLLabs

I'm once again shamelessly copy/pasting a new post on here.  I don't feel too ashamed of it, since I did actually write the original post.  It turns out we have MySites at work, and that the Blog feature is enabled!  I'll very likely be posting in both places as the life of this Blog thing draws onward.
 ***

There's been a lot of chatter about the Heartbleed SSL vulnerability in the last couple of weeks, and rightfully so. One place folks seem to love going is over to SSLLabs, since they have a server tester you can run to determine what kind of safety grade – A to F – you get.
At the outset, my tests of the BOC Link and MyChart sites generated giant, terrifyingly red "F" results. This was not due to Heartbleed, thank goodness, since the NetScalers do not use an affected version of OpenSSL, and none of my web servers use OpenSSL at all. What failed me instead was another, slightly older vulnerability: insecure SSL renegotiation.
Both BOC Link and MyChart run behind a NetScaler VPX virtual appliance running v10.0.x of the software. Out of the box, NetScalers are configured to allow SSL renegotiation in all forms, whether initiated from the client connection or the server. A quick check at the console will tell you the current status of the parameter:

> show ssl parameter
Advanced SSL Parameters
-----------------------
SSL quantum size: 8 kB
Max CRL memory size: 256 MB
Strict CA checks: NO
Encryption trigger timeout 100 mS
Send Close-Notify YES
Encryption trigger packet count: 45
Deny SSL Renegotiation NO
Subject/Issuer Name Insertion Format: Unicode
OCSP cache size: 10 MB
Push flag: 0x0 (Auto)
Strict Host Header check for SNI enabled SSL sessions: NO
PUSH encryption trigger timeout: 1 ms
Global undef action for control policies: CLIENTAUTH
Global Undef action for data policies: NOOP 


Citrix has a pretty handy article on what exactly the -denySSLReneg parameter is, what its options are, and how to change it. See it here.
Here's the command:

> set ssl parameter -denySSLReneg NONSECURE
Done 

By setting the Deny SSL Renegotiation option to NONSECURE, I've corrected the renegotiation vulnerability without (hopefully) creating any compatibility issues for our Link and MyChart users. This setting appears to be global, so effecting this change raised the scores of both sites from "F" to "A-" (RC4 ciphers, indeed!) simultaneously.

> show ssl parameter
Advanced SSL Parameters
-----------------------
SSL quantum size: 8 kB
Max CRL memory size: 256 MB
Strict CA checks: NO
Encryption trigger timeout 100 mS
Send Close-Notify YES
Encryption trigger packet count: 45
Deny SSL Renegotiation NONSECURE
Subject/Issuer Name Insertion Format: Unicode
OCSP cache size: 10 MB
Push flag: 0x0 (Auto)
Strict Host Header check for SNI enabled SSL sessions: NO
PUSH encryption trigger timeout: 1 ms
Global undef action for control policies: CLIENTAUTH
Global Undef action for data policies: NOOP