5 Truths about AI – Truth #1: Anomalies Aren’t (Necessarily) Threats

Introduction

Artificial Intelligence (AI) can be a tremendous asset when it comes to cybersecurity.

This is true for several reasons. AI can search a large amount of information to find interesting signals and similarities among the data. It can recognize previously unknown threats and can find hidden connections among related events. For example, consider having AI look at pictures of cats. You can show it a number of photos so it recognizes roughly what a cat looks like. And then you can show it a picture of a cat that it has never seen before. And it will still recognize it as a cat.

Similarly, we can show the AI examples of threats and say, “If you see anything that’s similar to this, then alert me.” It can generalize from there to recognize previously unseen attacks even if we did not show it explicitly what an attack looks like. Of course, AI also can filter and sift automatically through a lot of data, reducing the amount of work that you might need to do manually. And because it’s doing it fast, and with a good degree of precision, you can enable automatic response.

This is excellent, but it’s not as simple as it sounds.

The question is, how well does this work? And here we have potential problems. When considering using AI for cybersecurity it’s essential to understand what it can and cannot do, and what’s needed for it to be effective. To help you with that, I would like to explain 5 truths about AI and cybersecurity. When you understand these, you will have a much better appreciation of what’s necessary to use AI for detecting attacks.

Truth #1: Anomalies Aren’t (Necessarily) Threats

Just looking for anomalies is often not good enough, because that approach can create a lot of false positives. An anomaly is simply something different from what you expect, and that does not necessarily make it bad.

Typically, when you use machine learning (the most popular type of AI), you work with either labeled data or unlabeled data. Let me start by describing unsupervised machine learning (ML), which is used to identify structure among unlabeled data.

Unsupervised Machine Learning

An unsupervised ML system is basically given some data, and you let the machine learning find structure in that data. It finds groups of things that are similar, and it identifies things that are different from these groups. However, it can’t tell you what these groups and outliers represent, because the data is unlabeled. It doesn’t know if something is good or bad because it has no knowledge of that; it can only tell you that this object is similar to these other things, or different.

It can’t answer, for example, “Is this a cat or a dog?” An unsupervised machine learning system can find structure and group similar images, such as more cats and more dogs, but it can’t tell you what they are because it has no knowledge of what a cat is or what a dog is.
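To make this concrete, here is a minimal sketch of unsupervised learning, assuming scikit-learn and some synthetic two-dimensional points (the data and group names are made up for illustration). The algorithm separates the data into groups, but its output is just anonymous cluster numbers.

```python
# Minimal sketch of unsupervised learning on unlabeled 2-D points (synthetic data).
# The model only sees coordinates, so it can group similar points but cannot name them.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
group_a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))  # e.g., "cat"-like examples
group_b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(50, 2))  # e.g., "dog"-like examples
points = np.vstack([group_a, group_b])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(model.labels_[:3], model.labels_[-3:])  # cluster ids such as [0 0 0] [1 1 1]
# The two groups are separated, but the output is only "cluster 0" and "cluster 1";
# nothing in the data tells the model which cluster is cats, dogs, good, or bad.
```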

Supervised Machine Learning

In order to inject this knowledge, you need something that supplies labeled data to a supervised machine learning algorithm. Why is it called supervised? Because it has this labeled data. It has been given labels that are attached to the input that it uses for learning (or training).

This approach allows you to classify data. The system now knows what cats look like and what dogs look like, so it can say about a picture that this is a cat, and this is a dog. The same applies to threats. After being trained with labeled threat data, the system knows what threats look like. It knows what good behavior looks like. It can now distinguish between good and bad, not just between what is common and what is different.
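As an illustration, here is a similar sketch with labels attached, again assuming scikit-learn and synthetic points; the class names “benign” and “threat” are invented for the example. Because each training example carries a label, predictions come back as meaningful classes rather than anonymous clusters.

```python
# Sketch of supervised learning: the same kind of points, but each training
# example now carries a label, which is the knowledge we inject.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
benign = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(30, 2))
threats = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(30, 2))
X = np.vstack([benign, threats])
y = ["benign"] * 30 + ["threat"] * 30          # labels attached to the training input

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[0.1, 0.2], [3.8, 4.1]]))   # -> ['benign' 'threat']
```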

That’s a big step. If you only use unsupervised machine learning and just look for anomalies, you will have problems, because the system only knows “different,” not good vs. bad.

Here’s an example (Figure 1). All these blue dots are data points in a two-dimensional space. If you apply unsupervised machine learning, what will you learn? The system will figure out that there are a bunch of dots (at the bottom) that are somewhat close, and it will put them together into a cluster. And there’s one dot (on top) that’s an outlier, that’s removed from the others.

The unsupervised machine learning system would group the many similar dots into a cluster and call them good. This is indicated by the green grouping in Figure 2. It would also find that outlier, that anomaly, and call it bad. That’s what you can do with unsupervised machine learning: if you just do clustering and grouping, you are limited to finding things that are common and calling them good, and things that are different and calling them bad.

Of course, that’s not always helpful. Let’s add the true nature of these dots, shown in Figure 3 as green and red. Green dots are good, red are bad. What we see here is that there are red dots in that green cluster that the system assumed was all good.

We treated something that is bad, i.e. red, as good just because it looks similar enough to the nearby dots. These are false negatives. And that outlier is actually green. It’s good, it’s benign. However, we assumed it was bad because it is an outlier. This is a false positive.
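Here is a small sketch of the situation in Figures 1–3, using synthetic points and scikit-learn’s IsolationForest as a stand-in anomaly detector. Labeling purely by “outlier vs. not” tends to flag the lone benign point while missing the bad points hiding inside the cluster.

```python
# Sketch of Figures 1-3: call the dense cluster "good" and outliers "bad" purely
# from structure, then compare with the true labels (synthetic, illustrative data).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
cluster = rng.normal(loc=[0.0, 0.0], scale=0.4, size=(30, 2))   # mostly benign activity
hidden_bad = np.array([[0.2, 0.1], [-0.1, 0.3]])                # red dots inside the cluster
benign_outlier = np.array([[4.0, 4.0]])                         # green dot far away
points = np.vstack([cluster, hidden_bad, benign_outlier])
truth = np.array([0] * 30 + [1, 1, 0])                          # 0 = good, 1 = bad

detector = IsolationForest(contamination=0.05, random_state=0).fit(points)
flagged = (detector.predict(points) == -1).astype(int)          # 1 = flagged as anomaly

false_positives = int(np.sum((flagged == 1) & (truth == 0)))    # benign but flagged
false_negatives = int(np.sum((flagged == 0) & (truth == 1)))    # bad but not flagged
print("false positives:", false_positives, "false negatives:", false_negatives)
```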

Unsupervised machine learning creates these false positives and false negatives because it has a fairly simplistic view of the world.

Training a (Linear) Classifier

Now compare this to supervised machine learning, where we can do better. We are given the color of the dots because we labeled the data as either good or bad to begin with (see Figure 4), and we are asking the supervised ML system to learn the difference. We are asking the system to learn a classifier that distinguishes the green from the red.

In this particular case, we can build a very simple classifier – a linear classifier – which here is a line (in a higher-dimensional space it would be a plane). This classifier can separate the green from the red (see Figure 5).

Specifically, the line separates all of the dots into a green side and a red side. Any subsequent dots that fall on the red side would be considered bad, and new dots on the green side would be considered good. This avoids those false positives and false negatives, because we have given the system information that an unsupervised machine learning algorithm did not have. The supervised machine learning algorithm got labeled data, and as a result, it can make better, tighter decisions. This is great; a definite improvement.
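For illustration, here is a minimal sketch of training such a linear classifier on labeled dots like those in Figures 4 and 5, assuming scikit-learn and synthetic data; the learned coefficients describe the separating line.

```python
# Sketch of a linear classifier on labeled 2-D dots: the model learns a line
# (w·x + b = 0) that separates the green (good) side from the red (bad) side.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
good = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2))
bad = rng.normal(loc=[3.0, 3.0], scale=0.5, size=(40, 2))
X = np.vstack([good, bad])
y = np.array([0] * 40 + [1] * 40)                  # 0 = good (green), 1 = bad (red)

clf = LogisticRegression().fit(X, y)
print(clf.coef_, clf.intercept_)                   # orientation and offset of the line
print(clf.predict([[0.2, -0.1], [3.1, 2.8]]))      # new dots fall on one side or the other
```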

Let’s look at an example closer to the security world – detecting command‑and‑control traffic. If you have a particular piece of malware that creates command‑and‑control traffic, and you only look at the number of bytes it transfers, then a command‑and‑control connection that doesn’t transfer more bytes than your normal web browser looks legitimate. If, on the other hand, someone on your network legitimately uploads five gigabytes of data as a backup, it will look like an outlier because it’s just so much more data than normal. You might classify that as potential data exfiltration, when in reality it’s not.
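Here is a sketch of that byte-count example with made-up flow records and a made-up threshold (no real product’s logic is implied). Judging each flow only by how far its volume deviates from a baseline misses the quiet command‑and‑control channel and flags the legitimate backup.

```python
# Illustrative flow records and threshold: judging only by byte volume misses the
# low-and-slow C2 flow (false negative) and flags the large backup (false positive).
flows = [
    {"dst": "cdn.example.com",   "bytes": 1_200_000,     "truth": "browsing"},
    {"dst": "evil.example.net",  "bytes": 800_000,       "truth": "C2"},      # low and slow
    {"dst": "backup.corp.local", "bytes": 5_000_000_000, "truth": "backup"},  # legitimate 5 GB upload
]

baseline_bytes = 2_000_000      # "normal" per-flow volume learned from past traffic
for flow in flows:
    anomalous = flow["bytes"] > 10 * baseline_bytes
    print(flow["truth"], "-> flagged as a threat" if anomalous else "-> looks normal")
```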

Unfortunately, many network security solutions today only use unsupervised machine learning to try to understand what’s happening on your network. They look at what’s normal in your network – these blue dots – and build a baseline, which is presumed to be safe or benign. They find what happens frequently in your network, and then they look at all the outliers and label them as threats. The result is false positives and missed threats.

Again, the first truth is that anomalies are not necessarily threats. It’s important to understand how a security vendor recognizes good and bad – malicious vs. benign anomalies – and not just assume that whatever is different from a baseline is a threat.

The problem with this example is that using a linear classifier is too simplistic for the real world. In my next post, I’ll explain Truth #2, which is that the world is too complex for linear classifiers.  

 

Please see the other posts in this series:

5 Truths about AI in Cybersecurity – Truth #2: The World is Too Complex For Linear Classifiers
5 Truths about AI in Cybersecurity – Truth #3: Good Training Data Can Be Hard To Get

Coming Soon:

5 Truths about AI in Cybersecurity – Truth #4: We Need a Signal in the Data to Train AI
5 Truths about AI in Cybersecurity – Truth #5: AI Can Be Attacked
Dr. Christopher Kruegel

Currently on leave from his position as Professor of Computer Science at UC Santa Barbara, Christopher Kruegel’s research interests focus on computer and communications security, with an emphasis on malware analysis and detection, web security, and intrusion detection. Christopher previously served on the faculty of the Technical University Vienna, Austria. He has published more than 100 peer-reviewed papers in top computer security conferences and has been the recipient of the NSF CAREER Award, MIT Technology Review TR35 Award for young innovators, IBM Faculty Award, and several best paper awards. He regularly serves on program committees of leading computer security conferences. Christopher was the Program Committee Chair of the Usenix Workshop on Large Scale Exploits and Emergent Threats (LEET, 2011), the International Symposium on Recent Advances in Intrusion Detection (RAID, 2007), and the ACM Workshop on Recurring Malcode (WORM, 2007). He was also the head of a working group that advised the European Commission (EC) on defenses to mitigate future threats against the Internet and Europe's cyber-infrastructure.