Approximately 0.0014 of requests to their server farm fail. This is normal — there are plenty of sources of malformed, malicious, or just plain broken requests on the Network. But it would be nice if it were lower, and Uþenor doesn't have any urgent reliability fires to put out at the moment.
He pulls up their aggregate fleet statistics. At any given time, 0.01 of the servers have debug logging turned on for any given component. High enough to get useful statistical aggregates, low enough not to hurt throughput or latency much.
He runs a query — which debug trace events are most strongly correlated with failures anywhere in the protocol stack?
It's a query he runs about 31 times a year, but the answers are always changing as their fleet and the software running on it evolve.
Today, the answer comes from their authentication proxy — a tracepoint called "client sent more headers than there is available memory". That seems straightforward enough.