If you’re a CISO who has invested in a SOAR (Security Orchestration, Automation and Response) platform, you might be wondering if you’ve actually made your organization safer. Sure, you’ve deployed the SOAR platform and integrated it with your key security tools like your firewall and your IDS system. The platform is running playbooks (scripts of commands to collect and act on alert data), and it’s providing the security analysts in your Security Operations Center (SOC) with information for alert triage and threat detection.
But is your SOAR platform really helping your SOC team detect and resolve threats more quickly? If so, how can you tell? Can you measure the improvement? If someone asks, can you provide hard numbers to demonstrate the platform’s effectiveness?
What I’d like to offer you here is a methodology for measuring the outcomes of your SOAR platform. If the outcomes are great, this methodology will make that clear. And if they aren’t great yet, it will give you guidelines for making incremental improvements that yield better results over time.
If you adopt this iterative methodology and use it to focus your SOAR platform on the right things, you should see improvements in Mean Time to Resolution (MTTR) and overall SOC productivity—goals that should be appreciated by almost any CISO.
Measuring the Results of Your SOAR Platform
This methodology requires a SOAR platform, which is going to receive alerts from security tools such as a SIEM system or IDS system, possibly enrich the data associated with alerts (for example, by checking the reputation of IP addresses mentioned in alerts), and signal to the SOC team that a case or trouble ticket should be opened.
The methodology also requires a case management system, which might be a popular ticketing platform like Atlassian Jira or a system built into your SOAR platform. When the SOC team discovers an issue to be investigated, they open a ticket and track the case in the case management system. When the problem is resolved, they note that in the case management system and close the ticket.
The key measurement here is MTTR, which we can measure by the length of time a ticket for a particular alert remains open.
Now, it’s true that some complex threats may involve multiple alerts, multiple incidents being tracked, and hence multiple tickets. But it’s also true that security analysts can look at these alerts and tickets individually and identify the ticket that is taking the longest to resolve.
This recognition leads to the first step in our methodology.
Pick a metric to optimize. What gets measured, improves. Let's pick MTTR. Pull metrics on tickets in your case management system. MTTR is the span of time between when a ticket was created and when it was resolved. Take a look at your MTTR results across all your tickets. Which tickets fall at the 50th percentile (the median)? Which at the 80th? Which at the 95th? Start with the 50th-percentile tickets. Is that length of time acceptable? Does it need to improve? MTTR across all tickets may still be too broad a metric, so break it down by ticket type; focus on malware alert triage or phishing triage, for example. You may run into issues with this part of the process. For example, you might discover that you can’t easily pull creationTime and resolutionTime from your case management system, that you can’t create a simple dashboard showing metrics by percentile, or that it’s difficult to categorize cases at all. If you run into obstacles like these, simply focus on common threats, such as malware alerts and phishing. Are you satisfied with the MTTRs you see for those tickets?
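If your case management system exposes creation and resolution timestamps, the percentile view can be computed with a few lines of code. The sketch below assumes a simple list of (created, resolved) datetime pairs; the function name and input shape are illustrative, not a specific SOAR or ticketing API.

```python
from datetime import datetime, timedelta

def mttr_percentiles(tickets, percentiles=(50, 80, 95)):
    """tickets: (creationTime, resolutionTime) datetime pairs.
    Returns MTTR in hours at each requested percentile,
    using the nearest-rank method."""
    durations = sorted(
        (resolved - created).total_seconds() / 3600
        for created, resolved in tickets
    )
    n = len(durations)
    results = {}
    for p in percentiles:
        rank = max(1, -(-p * n // 100))  # ceil(p * n / 100)
        results[p] = durations[rank - 1]
    return results

# Ten tickets resolved in 1..10 hours:
base = datetime(2024, 1, 1)
sample = [(base, base + timedelta(hours=h)) for h in range(1, 11)]
print(mttr_percentiles(sample))  # {50: 5.0, 80: 8.0, 95: 10.0}
```

Running the same computation per ticket type (malware, phishing, and so on) gives the breakdown the step above recommends.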
Pick a target for 30 days. Make sure that the delta is statistically significant. If you are shooting for 3% improvement, and in the last 90 days the variance in this metric is 17%, you cannot be sure if you have really improved. So note the variance in tickets you’re tracking and set a goal for improvement that is both meaningful and measurable.
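One simple way to sanity-check a target against the noise floor is to compare it with the metric's week-to-week relative spread. The sketch below is one possible heuristic, not a formal significance test; the function name and inputs are illustrative.

```python
import statistics

def target_is_measurable(weekly_mttrs, target_pct):
    """Compare a proposed improvement target (in percent) against the
    metric's natural week-to-week spread (relative standard deviation).
    A target smaller than the noise floor cannot be confirmed."""
    mean = statistics.mean(weekly_mttrs)
    rel_spread_pct = 100 * statistics.stdev(weekly_mttrs) / mean
    return target_pct > rel_spread_pct

# Weekly median MTTRs (hours) swinging by roughly 23%: a 3% target
# would be lost in the noise, while a 30% target is measurable.
print(target_is_measurable([10, 14, 8, 12], 3))   # False
print(target_is_measurable([10, 14, 8, 12], 30))  # True
```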
Pick someone to own and drive that metric. Without clear ownership, this will get done “someday,” which means NEVER. Pick someone who has the right organizational pull to drive changes. We will call this person the “Tech Lead.”
Now, we are going to dive one level deeper. This is a recommended technique for the Tech Lead to deliver the target efficiencies.
Pick one alert type to optimize. Based on a very simple analysis of case types, and the median time it takes to resolve a case of each type, you can determine where to focus. Assume that automation can shave off up to 90% of the handling time; you will then want to focus on the case types that occur most frequently. Both factors matter, so score every case type as (current median MTTR) X (case frequency in the last 90 days). That product is the upper limit on how much time you can reclaim with automation. Order the list of case types by this score to see which ones you should automate first.
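The scoring above is a one-liner once the per-type statistics are in hand. In this sketch the input dictionary and its numbers are made up for illustration:

```python
def rank_case_types(case_stats):
    """case_stats: {case_type: (median_mttr_hours, cases_last_90_days)}.
    Orders case types by potential savings, highest first:
    median MTTR x frequency is the ceiling on time reclaimable."""
    return sorted(
        case_stats,
        key=lambda ct: case_stats[ct][0] * case_stats[ct][1],
        reverse=True,
    )

stats = {
    "phishing": (2, 300),   # 600 analyst-hours at stake
    "malware": (8, 50),     # 400
    "dlp": (24, 5),         # 120
}
print(rank_case_types(stats))  # ['phishing', 'malware', 'dlp']
```

Note how phishing wins here despite its short median MTTR: frequency dominates, which is exactly why both factors belong in the score.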
Build a playbook for the first case. Take a case, and build a playbook for it. Take another case of a very similar type. Does it fit the same playbook? If not, adapt the playbook so that it’s flexible enough to cover both cases. Once you have gone through three or four examples, you have probably captured most of the steps for a playbook for handling that type of case.
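Conceptually, a playbook is just an ordered list of steps run against an alert. The toy sketch below shows that shape for a phishing case; the field names and step functions are hypothetical and stand in for whatever your SOAR platform's playbook builder provides.

```python
def extract_indicators(alert):
    """Pull the indicators a phishing playbook typically needs.
    Field names here are illustrative, not a fixed alert schema."""
    return {
        "sender": alert.get("sender"),
        "urls": alert.get("urls", []),
        "hashes": alert.get("attachment_hashes", []),
    }

def run_playbook(alert, steps):
    """Run each step in order, recording results under the step name
    so later steps (or an analyst) can inspect them."""
    context = {}
    for step in steps:
        context[step.__name__] = step(alert)
    return context

alert = {"sender": "spoof@example.com", "urls": ["http://bad.example"]}
result = run_playbook(alert, [extract_indicators])
print(result["extract_indicators"]["sender"])  # spoof@example.com
```

Adapting the playbook to a second, similar case usually means making a step like extract_indicators tolerant of missing fields, as the .get() defaults do here.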
Automate the tasks that will save the most time. Determine which tasks in the playbook are the most time-consuming; the answer will often be obvious to a security analyst. Automate those tasks, using your SOAR or SOAR+ platform.
Measure the new MTTR 7 days after automation. Assuming that you picked a case type that occurs frequently enough, you should start to see reductions in MTTR. If not, look deeper. One common reason we have seen is that even though a step has been automated, teams are still performing that step manually; old habits sometimes take a while to change. This is where leadership and management can help drive a behavioral and cultural change.
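The before/after comparison itself is simple arithmetic on the two ticket populations. A minimal sketch, assuming you can pull MTTR values in hours for the periods before and after automation:

```python
import statistics

def mttr_change_pct(before_hours, after_hours):
    """Percentage change in median MTTR after automation;
    a negative number means MTTR went down (an improvement)."""
    before = statistics.median(before_hours)
    after = statistics.median(after_hours)
    return 100 * (after - before) / before

# Median dropped from 5 hours to 2 hours: a 60% reduction.
print(mttr_change_pct([4, 5, 6], [1, 2, 3]))  # -60.0
```

Compare this delta against the variance you noted when setting the target: only a change larger than that spread counts as real improvement.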
Pick the next case type to target. Now that the biggest bottleneck has been optimized, find the new bottleneck and target it for the next sprint (with a two- or four-week duration).
Run automation in parallel for a while. Treat the automation as a junior (virtual) analyst: it’s fast, but it won’t be as good as your skilled, experienced analysts. You want it to be an assistant first; as you become confident in its effectiveness, you can rely on it more and more. Measure that effectiveness over a period of time. If confidence is high, reduce the percentage of cases that have to be reviewed manually. If confidence drops, increase the sample of cases you review manually, and invest in updating the automation until confidence recovers.
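One way to make this feedback loop concrete is to adjust the manual-review rate from the observed agreement between the automation and your analysts. The thresholds and step size below are illustrative defaults, not recommendations:

```python
def next_review_rate(current_rate, agreement_rate,
                     target=0.95, step=0.10,
                     floor=0.05, ceiling=1.0):
    """Adjust the fraction of automated cases analysts re-review.
    agreement_rate is the share of sampled cases where the automation's
    disposition matched the analyst's; above target, review less."""
    if agreement_rate >= target:
        return max(floor, current_rate - step)
    return min(ceiling, current_rate + step)

print(next_review_rate(0.50, 0.98))  # high confidence: review ~40%
print(next_review_rate(0.50, 0.80))  # confidence slipped: review ~60%
```

The floor keeps a small sample under human review even at high confidence, so drift in the automation's quality is still caught.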
Calculate your ROI. How much time did you save in manual work? How much of an improvement in MTTR did you get? What about consistency of triage and response, which should also improve? What did it cost to implement the automation? Did the returns justify the investment?
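The time-savings portion of ROI reduces to simple arithmetic. The sketch below captures only analyst time saved (consistency and MTTR gains are harder to price), and every number in the example is invented for illustration:

```python
def automation_roi(cases_per_month, minutes_saved_per_case,
                   analyst_hourly_cost, build_cost, months=12):
    """Rough ROI multiple: value of analyst time saved over `months`
    divided by what the automation cost to build."""
    hours_saved = cases_per_month * months * minutes_saved_per_case / 60
    return hours_saved * analyst_hourly_cost / build_cost

# 200 phishing cases/month, 30 minutes saved each, $60/hour analyst
# cost, $20,000 to build: the automation returns 3.6x in a year.
print(automation_roi(200, 30, 60, 20_000))  # 3.6
```

A multiple above 1.0 means the automation paid for itself within the period; anything well above that leaves room for the costs this simple model ignores.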
Conclusion: Measurements Yield Results for SOAR
Much about the methodology above is common sense. But far too often, companies fail to collect metrics and focus on improving them, in spite of investing hundreds of thousands of dollars in a security automation initiative. Or they proceed in a more haphazard way that takes many months before showing any improvements. By following the steps listed above, a security team can leverage its SOAR platform to realize measurable improvements in MTTR, strengthening the security of the organization overall.