Attacking System Performance Issues – When to Stop Asking “Why?” and Start Asking “What?”
We have all been there – the dreaded “all-hands-on-deck-the-sky-is-falling” performance issue call. You can predict the play-by-play of what happens next.
The support lead asks why performance is slow. The DBA says the database looks normal, it must be an application issue. The developer says there are no errors in the log, it must be a network issue. The network admin, well, we are still trying to track down who is on-call… but they will eventually let us know it is not a network issue. All the while, people have stopped answering “what exactly is the issue?”
UNDERSTANDING THE CHALLENGE
Failure to guide your resources to answer the right questions does not just slow progress, it will often prevent it! As the minutes turn to hours and management involvement rises, theories abound as to why user performance has dropped off a cliff. Invariably, people look for facts to support their theories rather than let the facts guide them to a solution. It is easy to unintentionally represent what could be a root cause as something that is likely the root cause and, consequently, worth immediate investigation.
The added cost comes as a result of having to evaluate each proposed theory, and methodically proving each one to be false before moving on to the next one. And, sure, if you allow 10 people to provide their opinions, one of them may be right. But you will have taken the longest route possible to get there. Furthermore, even a correct theory has no value if you cannot prove it is true. Your support manager is not going to deploy a hot fix to a production server without convincing evidence.
In the event of a true and severe performance problem which is not immediately diagnosable, the root cause and corresponding solution are often a complex puzzle (even if the resolution is simple). Typically, only a few people in your organization have the knowledge and skillset to diagnose and resolve such issues. Unfortunately, it is all too common to end up with many people tossing around theories (Why) and few people focusing on the facts (What).
It is easy to look from the outside and recognize this is the exact opposite of how your resources should be focused. Front-line resources should work to provide actionable data points to help subject-matter experts (SMEs) put the pieces together. In a storm of user expletives and update requests, it is easy to find yourself with too many Sherlocks and not enough Watsons!
DEFINING AN APPROACH
In the absence of a neon sign reading “your application has run out of memory” or “the database is locked on this query”, it is crucial to continually work to describe the issue even as root causes are being evaluated. To streamline team efforts, keep the following themes at the forefront:
- Keep the discussion fact-based – Document and circulate the facts as they develop. Clearly define the problem with solid evidence. This valuable input can instill confidence in your users.
- Test the negative of your theories – Use process of elimination to narrow your focus. The second-best thing to proving a component is the culprit is proving it is not the culprit.
- Link your issue to your resolution – Once you’ve clearly described the issue, you can give your technical SMEs a measurable aiming point, and give your support team a defined test case to validate the solution.
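As a rough illustration of the three themes above, the sketch below keeps a running, fact-based record of an incident and uses process of elimination to narrow the list of suspect components. It is a minimal sketch, not a prescribed tool: the names `Fact` and `IssueLog`, the component labels, and every observation in the usage example are hypothetical and invented purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    """One observed, verifiable data point about the issue (a "What")."""
    component: str    # hypothetical label, e.g. "database" or "network"
    observation: str  # what was actually measured or seen
    rules_out: bool   # True if this evidence clears the component


@dataclass
class IssueLog:
    """Hypothetical running record of the incident, circulated to the team."""
    problem_statement: str
    suspects: set[str]
    facts: list[Fact] = field(default_factory=list)

    def record(self, fact: Fact) -> None:
        # Keep the discussion fact-based: every entry is an observation.
        self.facts.append(fact)

    def remaining_suspects(self) -> set[str]:
        # Test the negative: drop every component cleared by evidence.
        cleared = {f.component for f in self.facts if f.rules_out}
        return self.suspects - cleared


# Illustrative values only; none of these figures come from a real incident.
log = IssueLog(
    problem_statement="Saving a record takes 30 s (normally 2 s)",
    suspects={"database", "application", "network"},
)
log.record(Fact("network", "Latency between app and DB hosts unchanged", True))
log.record(Fact("database", "Save query is fast when run directly", True))
print(sorted(log.remaining_suspects()))
```

The problem statement doubles as the measurable aiming point and test case: in this made-up example, a fix would be validated by confirming the save is back near its normal time.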
Some of this may sound like common sense, but logic has a way of becoming elusive during these performance fire drills. MRE has worked countless performance issues across nearly every application profile. We understand there is a global set of questions and diagnostic techniques applicable to almost every one of them. While the answers may change, the approach is absolutely repeatable.
To keep your team focused, it is critical to have a Performance Issue Plan in place and to educate your support team on that plan. Define your own reusable set of questions and triage actions to help capture the issue. You will find that this approach goes a long way in supporting your technical SMEs rather than distracting them. It has the added benefit of letting everyone involved know they have contributed to the solution rather than just guessing at it.
David Heidt, Director, MRE Consulting, Ltd.
David provides nearly 20 years of experience in designing, implementing, and managing custom enterprise-class trading and risk systems. His experience has exposed him to a wide range of technologies, software development models, team structures, and user communities. This allows him to adapt process and design to match a given environment, whether it supports thousands of end-users or zero end-users, measures in days or milliseconds, executes in isolation or through extensive integration.
Contact me if you have questions or would like to discuss your network’s performance.