Ask anyone who has worked with a security operations center (SOC) and they will tell you that noisy detections (false positives) are one of the biggest challenges. Many companies have tried to solve this problem, but virtually all attempts have come up short. This article promotes a better solution using artificial intelligence (AI) and machine learning (ML) while remaining easy to understand.
First, to understand the challenge facing blue teams (the defenders charged with identifying and responding to attacks), recognize that almost any detection or indicator fits into one of two buckets: signature-based or anomaly-based.
Signature-based detections are manifested with things like:
Look for a running process named “mimikatz.exe”
Look for 50 failed logins in less than 60 minutes
Signature-based detections are trivial for attackers to circumvent in most cases. Using the two examples above, an attacker might rename the malicious mimikatz.exe executable to notepad.exe to avoid detection. Similarly, if they generate only 30 failed logins per hour, they remain under the radar because the threshold of concern is 50.
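The two signature examples and their evasions can be sketched in a few lines of Python. This is a minimal illustration, not any product's detection logic; the function names and constants are hypothetical.

```python
# Hypothetical signature rules mirroring the two examples above.
SUSPICIOUS_PROCESS_NAMES = {"mimikatz.exe"}
FAILED_LOGIN_THRESHOLD = 50  # failed logins within a 60-minute window


def process_signature_hit(process_name: str) -> bool:
    """Fire when a running process name matches a known-bad name."""
    return process_name.lower() in SUSPICIOUS_PROCESS_NAMES


def failed_login_signature_hit(failed_logins_last_hour: int) -> bool:
    """Fire when failed logins in the last hour reach the threshold."""
    return failed_logins_last_hour >= FAILED_LOGIN_THRESHOLD


# The original tool and the noisy attack are caught...
print(process_signature_hit("mimikatz.exe"))
print(failed_login_signature_hit(50))

# ...but a renamed binary and a slower attack are not.
print(process_signature_hit("notepad.exe"))
print(failed_login_signature_hit(30))
```

Both evasions succeed without the attacker changing what they actually do, only how it looks to the rule.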
The effectiveness of signature-based detections depends highly on the breadth of detections and maintaining the secrecy of what is being monitored for. A non-technical analogy would be laying a field with tripwires and landmines; if the attacker knows the locations of your defenses, they can successfully navigate through them.
The second bucket is anomaly-based detections. Anomaly-based detections don’t rely on signatures but instead look for things that aren’t normal. Using the two examples above, anomaly detections might be something like:
Look for uncommon running process names
Look for statistically interesting volumes of failed logins
These anomaly detections are more difficult for attackers to circumvent but have challenges of their own. Specifically, just because something is anomalous doesn’t make it malicious.
Actions like quarterly backups appear statistically similar to data exfiltration, as an example. If a defender makes these anomaly detections too sensitive, then they are bombarded with noise. If they make the thresholds too high, they risk missing attacks.
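The sensitivity trade-off can be seen in a small z-score sketch. The z-score is only one of many ways to define “statistically interesting,” and the hourly counts and thresholds below are invented for illustration.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, z_threshold=3.0):
    """Flag a value that sits more than z_threshold standard deviations from the mean."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return observed != mu
    return abs(observed - mu) / sigma > z_threshold


# Hypothetical failed-login counts for the previous eight hours.
hourly_failed_logins = [2, 4, 3, 5, 1, 3, 4, 2]

print(is_anomalous(hourly_failed_logins, 30))  # a clear spike is flagged
print(is_anomalous(hourly_failed_logins, 4))   # an ordinary hour is not

# An overly sensitive threshold turns the same ordinary hour into noise.
print(is_anomalous(hourly_failed_logins, 4, z_threshold=0.5))
```

Note that nothing in this check knows whether the spike is an attack or a misconfigured service account; it only knows the value is unusual.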
Over the years, companies have tried to solve this problem by aggregating these indicators. Examples include:
A vendor that aggregates first-time events such as “the first time a user logged on from a foreign country,” “the first time a user set up a scheduled task,” and “the first time a user sent 1GB of data.”
Assigning points to indicators and looking at those that accumulate the most points.
Mapping indicators to an industry standard (e.g., MITRE) and identifying actors that are exploiting multiple tactics/techniques.
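As a rough sketch of the points-based approach, the snippet below sums points per user and ranks the results. The indicator names and point values are hypothetical and do not reflect any vendor's scoring scheme.

```python
from collections import Counter

# Hypothetical point values for three first-time indicators.
INDICATOR_POINTS = {
    "first_foreign_login": 30,
    "new_scheduled_task": 20,
    "sent_1gb": 40,
}


def score_users(events):
    """Sum indicator points per user and return users ranked by total score."""
    scores = Counter()
    for user, indicator in events:
        scores[user] += INDICATOR_POINTS.get(indicator, 0)
    return scores.most_common()


events = [
    ("alice", "first_foreign_login"),
    ("bob", "new_scheduled_task"),
    ("alice", "sent_1gb"),
    ("alice", "new_scheduled_task"),
]

print(score_users(events))
```

The ranking surfaces the user with the most accumulated indicators, but the point values themselves are still hand-picked.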
But advances in computer technology have allowed us to develop a better way. Artificial intelligence and machine learning solutions are well within reach and less complicated than you might believe. To demonstrate this, we’ll pivot to an example that isn’t a cyber security issue.
Ask the question, “Will my spouse get home from work before 6:00pm?” Assume my spouse gets off work at 5:00pm and it takes 30 minutes to drive home. To answer this question, several other questions must be considered, such as “Did they leave work on time?” or “Was there traffic on the way home?” These questions are known as FEATURES.
The result of comparing features to outcome is rather intuitive:
Programmatically, this can be expressed like:
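SELECT F1, F2, COLLECT_SET(Actual_Outcome) AS Outcomes FROM TRIPS GROUP BY F1, F2;
As long as the set of previous outcomes for a given combination of features contains a single outcome, we can (in theory) accurately predict that outcome. For example, if every past trip where my spouse left on time and met no traffic ended with an on-time arrival, we predict that my spouse will arrive home on time (Outcome=Yes).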
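The same grouping idea can be sketched in plain Python. The trip history below is invented for illustration, and `collect_outcomes` and `predict` are hypothetical helper names.

```python
from collections import defaultdict

# Hypothetical history of trips: (left_on_time, hit_traffic, home_before_6pm).
TRIPS = [
    (True, False, True),
    (True, True, False),
    (False, False, False),
    (False, True, False),
]


def collect_outcomes(trips):
    """Group trips by (F1, F2) and collect the distinct outcomes seen for each."""
    outcomes = defaultdict(set)
    for f1, f2, outcome in trips:
        outcomes[(f1, f2)].add(outcome)
    return dict(outcomes)


def predict(outcomes, f1, f2):
    """Return the outcome only when history agrees on exactly one; else None."""
    seen = outcomes.get((f1, f2), set())
    return next(iter(seen)) if len(seen) == 1 else None


history = collect_outcomes(TRIPS)
print(predict(history, True, False))
```

Returning `None` when a feature combination has zero or several recorded outcomes makes the model's blind spots explicit rather than hiding them behind a guess.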
However, the problem starts to grow in complexity when the outcome doesn’t match. Consider this scenario: my spouse did NOT leave work at 5:00pm, but traffic was good, and my spouse still made it home by 6:00pm. In this scenario, we have the same values in Feature 1 (F1) and Feature 2 (F2) but the actual outcome is different.
Said another way, the predicted outcome and the actual outcome are different. One explanation for this difference is that the question allows one hour for the trip, while without delays it is in fact a 30-minute trip. Technically, we have 30 minutes of “cushion.”
In this case, the model would be more accurate if we express features as numeric values like, “How many minutes after 5:00pm did my spouse leave?” (F1) or “How many minutes was my spouse detained in traffic?” (F2).
In our scenario, because my spouse left only 15 minutes after 5:00pm, there is enough cushion to predict he or she will still arrive before 6:00pm. Consequently, our model can be improved if we replace yes/no values with numerical values. Now we get a model that works:
LESSON # 1 - How you define features impacts the accuracy of the outcome.
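One way to see Lesson 1 is to encode the same trips twice, once as yes/no answers and once as minutes, and check which encoding produces conflicting outcomes. The trips below are hypothetical, chosen to be consistent with a 30-minute drive and a 6:00pm deadline.

```python
from collections import defaultdict

# Hypothetical trips: (minutes_late_leaving, minutes_in_traffic, home_before_6pm).
TRIPS = [
    (0, 0, True),
    (15, 0, True),
    (45, 0, False),
    (0, 40, False),
]


def as_yes_no(f1, f2):
    """Encode features as 'left on time?' and 'no traffic?'."""
    return (f1 == 0, f2 == 0)


def as_minutes(f1, f2):
    """Keep features as raw minute counts."""
    return (f1, f2)


def conflicts(trips, encode):
    """Return the encoded feature keys that map to more than one outcome."""
    grouped = defaultdict(set)
    for f1, f2, outcome in trips:
        grouped[encode(f1, f2)].add(outcome)
    return [key for key, seen in grouped.items() if len(seen) > 1]


print(conflicts(TRIPS, as_yes_no))   # the 15- and 45-minute-late trips collide
print(conflicts(TRIPS, as_minutes))  # no collisions
```

Under the yes/no encoding, leaving 15 minutes late and leaving 45 minutes late look identical even though only one of them makes it home in time.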
More powerful yet, I can now create additional features by combining F1 & F2. I will add a new feature (F99) called “Total Delay” that is the sum of F1 & F2, and use it to determine the outcome. This new feature (F99) allows the system to “guess” the answer for scenarios it has not seen before.
Suppose that my spouse was 15 minutes late leaving (F1) and then delayed 20 minutes in traffic (F2). Even though this is a scenario not previously observed, the system accurately predicts the outcome based on similarity of F99 values:
LESSON # 2 - Features may be combined to create additional features that improve predictions for previously unseen scenarios.
There is one more consideration when building an AI/ML model. Suppose my spouse stopped at the grocery store for 35 minutes on the way home. Even though they left on time and met no traffic, the resulting table has a conflict. Notice that when F99 is matched, the actual outcome and the predicted outcome are different.
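A minimal sketch of the derived feature, assuming a simple lookup table keyed by F99. The history rows are hypothetical; the 15-minute and 20-minute values come from the scenario above.

```python
def total_delay(f1, f2):
    """F99: minutes late leaving work plus minutes lost in traffic."""
    return f1 + f2


# Hypothetical history: (F1, F2, home_before_6pm).
HISTORY = [
    (0, 0, True),
    (15, 0, True),
    (0, 35, False),
]

# Map each observed F99 value to the outcome that followed it.
outcome_by_f99 = {total_delay(f1, f2): outcome for f1, f2, outcome in HISTORY}

# The pair (15, 20) never appears in HISTORY, but its F99 value does.
new_trip = (15, 20)
print((15, 20, True) in HISTORY or (15, 20, False) in HISTORY)
print(outcome_by_f99.get(total_delay(*new_trip)))
```

The lookup succeeds because 15 + 20 produces the same total delay as a previously recorded trip, even though the individual feature values are new.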
This is because there is additional information that we must consider that was not reflected in our original model. We need to add a third feature, “How many minutes did they stop before coming home?” (F3) and modify our F99 formula to be F1+F2+F3. The resulting table becomes:
With the new feature added, our F99 values are mapped and once again, the model works.
LESSON # 3 - When outcomes are not accurate, the most common explanation is that a necessary additional feature was not considered in the model.
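The grocery-store conflict and its fix can be reproduced with two rows. This is a sketch under the article's assumptions (30-minute drive, 6:00pm deadline); the dictionary layout and function name are hypothetical.

```python
# Hypothetical trips in minutes: f1 = late leaving, f2 = traffic delay,
# f3 = stop on the way home.
TRIPS = [
    {"f1": 0, "f2": 0, "f3": 0, "home_before_6pm": True},
    {"f1": 0, "f2": 0, "f3": 35, "home_before_6pm": False},  # grocery stop
]


def outcomes_by_total_delay(trips, features):
    """Map each F99 value (sum of the chosen features) to the outcomes seen."""
    table = {}
    for trip in trips:
        f99 = sum(trip[name] for name in features)
        table.setdefault(f99, set()).add(trip["home_before_6pm"])
    return table


# With F99 = F1 + F2, both trips land on F99 = 0 with different outcomes.
print(outcomes_by_total_delay(TRIPS, ["f1", "f2"]))

# With F99 = F1 + F2 + F3, the grocery trip moves to F99 = 35.
print(outcomes_by_total_delay(TRIPS, ["f1", "f2", "f3"]))
```

A key that maps to more than one outcome is a practical signal that the model is missing a feature.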
Finally, even when numbers don’t exactly match, we can still perform predictions based on the closest match, a principle called “nearest neighbor.” Now we have added two more scenarios.
Notice the nearest neighbor to 37 is 35, so we predict an outcome of “No.” In contrast, the nearest neighbor to 14 is 15, so we predict an outcome of “Yes.” In both scenarios, we were correct. When our estimates based on nearest neighbor are incorrect, we can simply enlarge our training data to get more accurate predictions.
LESSON # 4 - Increasing the size of the training data is another way to increase the accuracy of predictions.
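The nearest-neighbor step described above can be sketched as a one-dimensional, single-neighbor lookup on F99. The training rows are hypothetical apart from the values 15 and 35 used in the text, and `predict_nearest` is an illustrative name rather than a function from any particular platform.

```python
def predict_nearest(training, f99):
    """Return the outcome of the training row whose F99 is closest to f99."""
    _, outcome = min(training, key=lambda row: abs(row[0] - f99))
    return outcome


# Hypothetical training data: (F99 total delay in minutes, home_before_6pm).
TRAINING = [
    (0, True),
    (15, True),
    (35, False),
]

print(predict_nearest(TRAINING, 37))  # nearest neighbor is 35
print(predict_nearest(TRAINING, 14))  # nearest neighbor is 15
```

With only three rows, a total delay of 26 minutes would be matched to 35 and predicted as late, even though it still fits inside the 30-minute cushion. Adding more rows between 15 and 35 is what corrects that kind of miss.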
It is the position of this author and LogicHub that the industry could significantly advance detection quality if we take additional steps beyond the initial signature/anomaly detection.
Rather than simply aggregating the indicators or attempting to directly respond to individual indicators, we would benefit from building a knowledge base of the features associated with indicators. By using these features in machine learning and artificial intelligence systems we can better predict what is actionable for the SOC.
LogicHub offers a platform that allows users to create the detections, determine the features, and leverage pre-written machine learning functions like nearest neighbor. The platform includes integrations to hundreds of security tools for enrichment and actionable response.