Don't you just hate it when your server suddenly stops responding?
To start off, this is an issue that we experienced more or less out of nowhere ("we" being my place of work).
A lot of services complained that they couldn't connect to external services, or even internal ones (which baffled us quite a bit), and the only clue we had was the torrent of exceptions just falling into our Exceptional instance.
You can check whether you have the same issue by looking for an exception like this: SocketException: Only one usage of each socket address (protocol/network address/port)
There might be other exceptions as well, but this is the one that we had tons of, and they just kept coming.
Anyway, we had to do something about it, and the only thing we could do was take the affected server and cycle its services one by one to see which one was the culprit. That's when it turned out that we'd been using HttpClient the wrong way, even though we'd had talks at work about it and shared articles about using singletons/statics (even if that's bad for DNS resolution, which is why you use IHttpClientFactory). So we had, quite frankly, a lot of sockets in TIME_WAIT, just waiting to be reused by an HttpClient that was long gone and disposed.
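To make the mistake concrete, here's a minimal sketch of the anti-pattern (the class and endpoint URL are made up for illustration):

```csharp
using System.Net.Http;
using System.Threading.Tasks;

public class OrderApi
{
    // The anti-pattern: a brand new HttpClient for every request.
    public async Task<string> GetOrderAsync(int id)
    {
        // Disposing the client closes the connection, but the local socket
        // then lingers in TIME_WAIT (a couple of minutes by default on
        // Windows), holding an ephemeral port that nothing else can use yet.
        using (var client = new HttpClient())
        {
            return await client.GetStringAsync($"https://internal-api.example/orders/{id}");
        }
    }
}
```

Under load, every call like this burns another ephemeral port, and the ports run out faster than TIME_WAIT releases them.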
Our first try to fix it, was to increase the amount of ports available to use from the normal 16k ports, to 32k ports.. which worked.. for a while, until we hit the next flood of exceptions.
So we're currently (while almost every developer at work is away on vacation) trying to rewrite all usages of our HttpClient instances to either use singletons/statics or, worst case, add .DefaultRequestHeaders.ConnectionClose = true; to make the faulty HttpClient at least try to close the connection. But we're aiming to replace everything we can with IHttpClientFactory, so that socket reuse is handled automatically through a pool of message handlers.
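For reference, a rough sketch of the IHttpClientFactory route (the client name and base address are made-up examples):

```csharp
using System;
using System.Net.Http;
using Microsoft.Extensions.DependencyInjection;

var services = new ServiceCollection();

// Register a named client; AddHttpClient comes from Microsoft.Extensions.Http.
services.AddHttpClient("orders", client =>
{
    client.BaseAddress = new Uri("https://internal-api.example/");
});

var provider = services.BuildServiceProvider();
var factory = provider.GetRequiredService<IHttpClientFactory>();

// Each CreateClient call hands out a cheap HttpClient wrapper; the
// underlying handlers are pooled and rotated, so sockets get reused
// and DNS changes are picked up over time.
var client = factory.CreateClient("orders");
```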
So, how do we debug this in a good way?
I swear, I tried to find something good to monitor socket states, but came up empty-handed (if you know of something, please tell me, I'm all ears).
After looking both here and there, I found some helpful articles on StackOverflow that used something called GetExtendedTcpTable, so I tried it out, modified it a bit to get the info I wanted, and tested it on the servers and my own computer through LINQPad. I even used HttpClient both the good way and the bad way, just to confirm that I could recreate the problem locally, and I could. But that extended TCP table wasn't enough for me; I wanted to know which process (and process owner) was responsible for the different socket states.
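In case you want to roll your own, here's a trimmed-down sketch of the P/Invoke plumbing (IPv4 only, minimal error handling; struct and constant names follow the Win32 definitions):

```csharp
using System;
using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Sequential)]
struct MIB_TCPROW_OWNER_PID
{
    public uint State;       // MIB_TCP_STATE: 5 = ESTABLISHED, 11 = TIME_WAIT
    public uint LocalAddr;
    public uint LocalPort;   // network byte order
    public uint RemoteAddr;
    public uint RemotePort;  // network byte order
    public uint OwningPid;   // 0 for TIME_WAIT sockets owned by the system
}

static class NetStat
{
    const int AF_INET = 2;
    const int TCP_TABLE_OWNER_PID_ALL = 5;

    [DllImport("iphlpapi.dll", SetLastError = true)]
    static extern uint GetExtendedTcpTable(IntPtr pTcpTable, ref int pdwSize,
        bool bOrder, int ulAf, int tableClass, uint reserved);

    public static MIB_TCPROW_OWNER_PID[] GetTcpRows()
    {
        int size = 0;
        // First call with a null buffer just reports the required size.
        GetExtendedTcpTable(IntPtr.Zero, ref size, true, AF_INET,
            TCP_TABLE_OWNER_PID_ALL, 0);

        IntPtr buffer = Marshal.AllocHGlobal(size);
        try
        {
            if (GetExtendedTcpTable(buffer, ref size, true, AF_INET,
                    TCP_TABLE_OWNER_PID_ALL, 0) != 0)
                return Array.Empty<MIB_TCPROW_OWNER_PID>();

            int count = Marshal.ReadInt32(buffer); // dwNumEntries
            var rows = new MIB_TCPROW_OWNER_PID[count];
            IntPtr rowPtr = IntPtr.Add(buffer, 4); // rows start after the count
            int rowSize = Marshal.SizeOf<MIB_TCPROW_OWNER_PID>();
            for (int i = 0; i < count; i++)
            {
                rows[i] = Marshal.PtrToStructure<MIB_TCPROW_OWNER_PID>(rowPtr);
                rowPtr = IntPtr.Add(rowPtr, rowSize);
            }
            return rows;
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}
```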
Spoiler: TIME_WAIT sockets will often have PID 0, as it's just the system, chilling and waiting for the socket to become reusable by an HttpClient that.. wait, I already explained this. But another good metric is to look at the ESTABLISHED sockets as well, because those usually have an owner, so we can see which application/service is creating and maintaining a lot of connections and could potentially be the culprit.
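Building on the sketch above (and assuming that hypothetical NetStat helper), something like this gives you both the counts per state and the top ESTABLISHED owners:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

var rows = NetStat.GetTcpRows();

// How many sockets are in each state? (5 = ESTABLISHED, 11 = TIME_WAIT)
foreach (var g in rows.GroupBy(r => r.State).OrderByDescending(g => g.Count()))
    Console.WriteLine($"state {g.Key}: {g.Count()}");

// Which processes own the most ESTABLISHED sockets?
var topOwners = rows.Where(r => r.State == 5)
                    .GroupBy(r => r.OwningPid)
                    .OrderByDescending(g => g.Count())
                    .Take(10);

foreach (var g in topOwners)
    Console.WriteLine($"pid {g.Key} ({TryGetProcessName(g.Key)}): {g.Count()} connections");

static string TryGetProcessName(uint pid)
{
    try { return Process.GetProcessById((int)pid).ProcessName; }
    catch { return "unknown"; } // process exited, or access denied
}
```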
TL;DR: I made a NuGet package with the methods I'm going to use to try to fix our port exhaustion at work, and to monitor/alert when we're getting close to port exhaustion, before it happens. You can find it by searching for ItsSimple.NetStatData (Windows-only as of now, as I'm using lots of P/Invoke), and the source is available on GitHub
So, what happened after the monitoring started?
Well, first off, I was surprised by the amount of TIME_WAIT and ESTABLISHED sockets we had across the entire cluster.
I'm thinking that I've configured our HAProxy (amongst other things) wrong.
But after reading through even more documentation on troubleshooting port exhaustion, I added some things to the registry on each server. The things I added were two DWORD values, namely TcpTimedWaitDelay (value of 30) and StrictTimeWaitSeqCheck (value of 1).
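For the record, the same change expressed in C# (as far as I can tell, both values go under the Tcpip\Parameters key; this needs elevation, plus the reboot mentioned below):

```csharp
using Microsoft.Win32;

const string tcpipParams =
    @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters";

// Shorten how long a closed socket lingers in TIME_WAIT, in seconds
// (Windows defaults to a couple of minutes).
Registry.SetValue(tcpipParams, "TcpTimedWaitDelay", 30, RegistryValueKind.DWord);

// Let the stack hand out a TIME_WAIT socket again once the TCP
// sequence numbers make reuse safe.
Registry.SetValue(tcpipParams, "StrictTimeWaitSeqCheck", 1, RegistryValueKind.DWord);
```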
Then I rebooted the machines for the changes to take effect, and, well, initially it looked good.. I mean, since I rebooted them, the connection counts dropped significantly. But we still have issues with the continued increase of established connections, unless it caps out at some point.
Above is a screenshot of the data I get from my NuGet package, which I push directly into a SQL Server until I find something better.
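The "push into SQL Server" part is nothing fancy; a hypothetical version (the table, columns, and connection string are all made up) looks roughly like this:

```csharp
using Microsoft.Data.SqlClient;

using var conn = new SqlConnection(
    "Server=monitoring-db;Database=NetStat;Integrated Security=true;TrustServerCertificate=true");
conn.Open();

// One row per socket in the snapshot; TcpSnapshot is an invented table.
foreach (var row in NetStat.GetTcpRows())
{
    using var cmd = new SqlCommand(
        "INSERT INTO TcpSnapshot (State, OwningPid, CapturedAt) " +
        "VALUES (@state, @pid, SYSUTCDATETIME())", conn);
    cmd.Parameters.AddWithValue("@state", (int)row.State);
    cmd.Parameters.AddWithValue("@pid", (int)row.OwningPid);
    cmd.ExecuteNonQuery();
}
```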
But sadly, when I started monitoring which application was keeping all those ESTABLISHED connections, it ended up being a system service in Windows, specifically Service Host: Remote Procedure Call, which handles all dynamic ports.
This is why I think I might have configured HAProxy wrong, since the number of connections just keeps increasing. But that just means I have more job security, as I have to find out what is wrong with this setup.