When it comes to monitoring, some of the most important things can also be the simplest. Whether it be small analyst activities like basic triage or larger automated ones like whitelisting, actions speak louder than words. When deciding what the most helpful actions are, though, it’s important to first think of ways to measure the effects that those actions have on your network.
This is where metrics and statistical analysis come in. With a few simple measurements, any organization can make an efficient plan for detection, triage, and response, even on limited resources. The biggest problems become easier to prioritize and the smaller ones are easier to plan for later.
The following is a brief examination of the top five metrics for effective triage, case creation, and response, as compiled from both an operational and managerial perspective.
Coverage and Visibility
Before an analyst even opens a case, it’s important to make sure that the cases they receive cover everything the organization wants covered. Operationally, this scope confirms that detections aren’t missing any key points in the network, that responses are full and complete, and that analysts have enough data to make a proper judgement. From a managerial standpoint, visibility metrics (such as a list or count of visible systems, applications, and/or users) create a better roadmap for integrating the rest of the environment and can provide valuable data on network blind spots.
As actions are taken on cases and tuning is performed, visibility should increase. Keeping this metric healthy requires constant awareness and explicit maintenance goals.
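One rough way to put a number on visibility is to compare an asset inventory against the set of assets that actually report into monitoring. The sketch below assumes both lists are available as plain hostnames; the function name `coverage_report` and the sample hostnames are hypothetical and used only for illustration.

```python
def coverage_report(inventory, monitored):
    """Return (coverage ratio, sorted blind spots) for a list of assets.

    inventory: every asset the organization expects to see (assumed input).
    monitored: assets that currently send data to the monitoring stack.
    """
    inventory = set(inventory)
    monitored = set(monitored)
    # Assets that are inventoried but never show up in monitoring.
    blind_spots = sorted(inventory - monitored)
    covered = inventory & monitored
    ratio = len(covered) / len(inventory) if inventory else 0.0
    return ratio, blind_spots


# Hypothetical sample data.
inventory = ["web-01", "web-02", "db-01", "hr-laptop-07"]
monitored = ["web-01", "db-01", "web-02"]

ratio, blind_spots = coverage_report(inventory, monitored)
print(f"Coverage: {ratio:.0%}; blind spots: {blind_spots}")
```

Tracking this ratio over time, rather than as a one-off count, is what shows whether tuning and case work are actually widening visibility.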
Time-to-Decision
Sometimes, an analyst’s ability to research a case is hindered. Perhaps they don’t have all the data they need at hand, or the method for finding that data is unclear. Both issues increase response time and are likely to make metrics based on escalations, visibility, and overall effectiveness that much harder to interpret.
To avoid problems with Time-to-Decision metrics, it’s important to keep an eye on documentation accuracy and on how easily analysts can reach the resources they need. When either slips, time is lost on what could be a crucial detection.
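As a minimal sketch, Time-to-Decision can be approximated as the time between a case being opened and an analyst reaching a verdict. The field names `opened_at` and `decided_at` are assumptions about how a case system might export timestamps, and the median is used here simply because it is less sensitive to a few very slow cases than the mean.

```python
from datetime import datetime
from statistics import median


def median_time_to_decision(cases):
    """Median minutes from case creation to analyst decision, or None if no data."""
    durations = [
        (c["decided_at"] - c["opened_at"]).total_seconds() / 60
        for c in cases
        if c.get("decided_at") is not None  # skip cases still awaiting a decision
    ]
    return median(durations) if durations else None


# Hypothetical sample cases.
cases = [
    {"opened_at": datetime(2024, 1, 1, 9, 0), "decided_at": datetime(2024, 1, 1, 9, 20)},
    {"opened_at": datetime(2024, 1, 1, 9, 5), "decided_at": datetime(2024, 1, 1, 10, 5)},
    {"opened_at": datetime(2024, 1, 1, 9, 10), "decided_at": datetime(2024, 1, 1, 9, 40)},
    {"opened_at": datetime(2024, 1, 1, 9, 30), "decided_at": None},
]

print(median_time_to_decision(cases))  # 30.0 (minutes)
```

A sudden rise in this number after a tooling or documentation change is a hint that analysts are spending longer hunting for the data they need.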
Response Rate
We’ve all had our overwhelming moments: too many projects, too little sleep, too little coffee. For an analyst, that ‘burnout’ when facing a mountain of cases can result in a poor response rate: the rate at which an analyst makes useful responses (or any response at all). The underlying causes of this burnout vary, from an excess of volume to a lack of detail in a case, or even an excess of detail. Whatever the cause, an operation suffers from a lack of useful responses in more ways than one.
In fact, every other metric suffers in some way from a poor response rate. Other analysts looking for more information on a case may be left disappointed and have to repeat the research, the overall time to complete a case may grow because more cases need to be redone, and visibility stops mattering when no useful product comes from it.
In order to avoid a poor response rate, operations teams should make a special effort to increase automation, decrease false positives, decrease overall case count, and decrease complexity of threat investigation tools.
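Because the response rate covers both useful responses and any response at all, it can help to track the two side by side. The sketch below assumes each case record carries a free-text `response` field and a `useful` flag set by a reviewer; both field names are hypothetical, and what counts as "useful" is a judgement each team has to define for itself.

```python
def response_rates(cases):
    """Return (any-response rate, useful-response rate), or None for no cases."""
    total = len(cases)
    if total == 0:
        return None
    # A case counts as responded if the analyst wrote anything at all.
    responded = sum(1 for c in cases if c.get("response"))
    # A useful response must exist and be flagged useful by a reviewer.
    useful = sum(1 for c in cases if c.get("response") and c.get("useful", False))
    return responded / total, useful / total


# Hypothetical sample cases.
cases = [
    {"id": 1, "response": "Confirmed phishing, mailbox purged", "useful": True},
    {"id": 2, "response": "Closed", "useful": False},
    {"id": 3, "response": "", "useful": False},
    {"id": 4, "response": "Benign admin script, rule tuned", "useful": True},
]

any_rate, useful_rate = response_rates(cases)
print(f"Any response: {any_rate:.0%}, useful response: {useful_rate:.0%}")
```

A wide gap between the two rates suggests analysts are closing cases without leaving anything the next person can build on.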
Time-to-Escalation
Unlike Time-to-Decision, the Time-to-Escalation metric focuses on how quickly cases move on once research is complete. Are methods of remediation ready at hand? Is interoperability with departments outside of the SOC sufficient for remediation the SOC can’t perform? Are the tools to escalate a problem fast and easy to use?
Each of these questions is an important consideration when thinking about the time it takes to escalate an incident. Without the tools to escalate quickly, cases are left hanging in limbo, neither actively worked nor queued for remediation.
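One way to spot cases stuck in that limbo is to list those where a decision has been made but no escalation has followed within some acceptable window. This is only a sketch: the `decided_at` and `escalated_at` fields and the four-hour default threshold are assumptions for illustration, not recommended values.

```python
from datetime import datetime, timedelta


def stalled_cases(cases, now, max_wait=timedelta(hours=4)):
    """Return IDs of decided cases that have waited longer than max_wait to escalate."""
    return [
        c["id"]
        for c in cases
        if c["escalated_at"] is None and now - c["decided_at"] > max_wait
    ]


# Hypothetical sample cases.
cases = [
    {"id": "A", "decided_at": datetime(2024, 1, 1, 8, 0), "escalated_at": datetime(2024, 1, 1, 8, 30)},
    {"id": "B", "decided_at": datetime(2024, 1, 1, 9, 0), "escalated_at": None},
    {"id": "C", "decided_at": datetime(2024, 1, 1, 13, 0), "escalated_at": None},
]

now = datetime(2024, 1, 1, 14, 0)
print(stalled_cases(cases, now))  # ['B']
```

Case B has waited five hours and is flagged, while case C is still inside the window.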
Ratio of Success
This metric may have been one of the first to come to mind when considering measurements of effectiveness. When a case arrives, the primary goal is to resolve the problem that led to it. If a SOC’s Ratio of Success is high, a case is more likely to end in remediation, and each remediation should reduce the number of future cases stemming from the same problem. This success rate works hand-in-hand with the prior metrics to quickly cut the number of cases, and the time spent on them, down to only the most necessary interactions.
A higher ratio of success will also result in better visibility over time. As cases are resolved and issues patched, more of the network becomes familiar.