Machine Learning, Artificial Intelligence, and How the Two Fit into Information Security

Everywhere I look, someone’s talking about machine learning (ML) or artificial intelligence (AI). These two technologies are shaping important conversations in multiple sectors, especially marketing and sales, and are at risk of becoming overused and misunderstood buzzwords. The technologies are also drawing the attention of security professionals, with some believing that AI is poised to transform information security.

Despite this hype, there's still a lot of confusion around ML, AI, and their utility for information security. In this blog post, I would like to correct these misperceptions. Let's start by differentiating machine learning and artificial intelligence in general.

Machine Learning vs. Artificial Intelligence: Understanding the Difference

Artificial intelligence is the science of trying to replicate intelligent, human-like behavior. There are multiple ways of achieving this – machine learning is one of them. For example, a type of AI system that does not involve machine learning is an expert system, in which the skills and decision process of an expert are captured through a series of rules and heuristics.

Machine learning is a specific type of AI. An ML system analyzes a large data set in order to categorize the data and derive rules about which data points belong in which category. For example, machine learning can be used to analyze network behavior data and categorize it as normal or anomalous.

Given these definitions, every ML system is also an AI system, but not every AI system uses machine learning. It's similar to how every human is a mammal, but not every mammal is a human. Today's trend, however, is that few AI techniques other than machine learning are actually being used. If we reach a point where the only AI systems in use are those based on ML, then the two terms will have become effectively synonymous, just as "human" and "mammal" would be synonymous if humans were the only mammals left.

There are two main branches of machine learning: supervised and unsupervised. Supervised ML learns a mapping from input variables to output labels so that it can make accurate predictions about new data. In terms of threat detection, an ML algorithm could use known suspicious behaviors, together with their "malicious" category assignments, as the ground truth for training a threat classifier. It can then use that classifier to analyze new samples.
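To make the supervised case concrete, here is a minimal sketch using scikit-learn. The behavioral features (registry writes, outbound connections, writes to system directories) and the tiny labeled data set are hypothetical, invented purely for illustration; a real classifier would be trained on hundreds of thousands of samples.

```python
# A toy supervised threat classifier. Feature vectors and labels are
# hypothetical: each row is [registry writes, outbound connections,
# system-directory writes], and 1 = malicious, 0 = benign.
from sklearn.tree import DecisionTreeClassifier

known_samples = [
    [12, 40, 5],   # known malicious behavior
    [15, 55, 7],   # known malicious behavior
    [1,  2,  0],   # known benign behavior
    [0,  3,  1],   # known benign behavior
]
labels = [1, 1, 0, 0]  # ground-truth category assignments

clf = DecisionTreeClassifier(random_state=0)
clf.fit(known_samples, labels)  # learn rules from the labeled ground truth

# Use the trained classifier on a previously unseen sample.
new_sample = [[14, 48, 6]]
prediction = clf.predict(new_sample)[0]
print("malicious" if prediction == 1 else "benign")
```

The key property of the supervised setting is visible here: the analyst supplies both the behaviors and the category assignments, and the algorithm generalizes from them.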

In unsupervised ML, the second branch of machine learning, a system clusters data into groups based on the data's features. The result is the identification of groups of similar elements, which allows an analyst to handle a large number of similar samples with a single decision (e.g., all of these emails have similar attachments, and all of those attachments are malicious).

There's also deep learning, a specific type of machine learning that uses multi-layered neural networks rather than traditional statistical techniques to analyze data. Deep learning is particularly good at finding classifications in large amounts of data. Its main drawback is limited explanatory power: it is often hard to say why something belongs in a particular grouping, such as why an executable is dangerous.

What Are the Challenges of AI and ML in Information Security?

Machine learning faces a unique challenge in information security: when algorithms try to extract knowledge from data sets that represent malicious behavior, the data fights back. This setting is known as adversarial learning: malicious samples deliberately try to avoid being classified as such. Malware authors learn what the algorithms are looking for, then tweak their samples, or attempt to re-educate the model, until it produces the wrong classification, allowing their malware to evade detection and infect more users. In doing so, they turn what the algorithms have learned against security professionals and, ultimately, against users.

To account for this adversarial setting, security professionals need to develop machine learning techniques that look for outliers and deliberately planted false flags. They must be extra cautious about the process they use to source and characterize data. Otherwise, the resulting models can be badly misled.

Take the packing of an executable, for example. A great deal of malware uses packing as a way to look different and avoid detection by antivirus software, while benign code uses packing far less often (e.g., when authors want to protect their intellectual property, as happens with video games). If you apply machine learning to programs without first unpacking them, the algorithm will learn that packing is bad and flag everything that's packed as malicious, leading to a large number of false positives.
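This bias is easy to reproduce. The toy data below is entirely hypothetical: every malicious training sample happens to be packed and every benign one is not, so the classifier learns the shortcut "packed means malicious" and misflags a packed but benign program.

```python
# Toy illustration of the packing bias. Feature vector (hypothetical):
# [is_packed, makes_network_connections]; label 1 = malicious, 0 = benign.
from sklearn.tree import DecisionTreeClassifier

train = [
    [1, 1], [1, 1], [1, 0],  # malware samples, all packed
    [0, 0], [0, 1], [0, 0],  # benign samples, none packed
]
labels = [1, 1, 1, 0, 0, 0]

clf = DecisionTreeClassifier(random_state=0)
clf.fit(train, labels)  # the tree splits on is_packed alone

# A packed but benign program (e.g., a game protecting its code)
packed_game = [[1, 1]]
print(clf.predict(packed_game)[0])  # classified malicious: a false positive
```

Unpacking the executables before feature extraction, as the text suggests, removes the spurious feature and forces the model to learn from actual behavior.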

Such a development highlights the reality that AI and ML aren’t silver bullets. There are a lot of unrealistic expectations that AI and ML can do anything. But that’s not the case. As illustrated above, these technologies can’t automatically detect outliers and false positives without some form of human input, guidance, decisions, or intervention.

Even more importantly, there is an ongoing tension between precision and recall for machine learning and artificial intelligence in information security. Recall, as it relates to information security, is the ability to identify all of the malicious programs, whereas precision is the ability to flag only the samples that are truly dangerous. A highly precise algorithm usually ends up letting a lot of malware through because of its programmed desire to avoid mistakes. The alternative, high recall with low precision, generates lots of false positives in the attempt to protect against every threat. These errors are inherent to the imprecise, statistical means of analysis found in machine learning algorithms; there will always be errors of both types. It's an unsolvable dilemma.
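A quick worked example makes the trade-off concrete. The counts below are hypothetical: a detector examines 1,000 samples, of which 100 are truly malicious.

```python
# Hypothetical confusion counts for a cautious (precision-oriented) detector.
true_positives  = 60   # malware correctly flagged
false_positives = 5    # benign programs wrongly flagged
false_negatives = 40   # malware missed entirely

precision = true_positives / (true_positives + false_positives)
recall    = true_positives / (true_positives + false_negatives)

print(f"precision = {precision:.2f}")  # high: few false alarms...
print(f"recall    = {recall:.2f}")     # low: ...but much malware slips through
```

Tuning the detector to catch the 40 missed samples would inevitably flag more benign programs, pushing precision down as recall rises; that is the dilemma in numbers.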

Other limitations exist for AI and ML in information security. Overall, it’s impossible to encapsulate all the information of a human malware analyst and distill it into an AI system. There are just too many variables (i.e., types of data) in the way. At the same time, the world is always changing, so machine learning algorithms need constant retraining and relearning in order to stay current with the latest threat developments, changes, trends, and capabilities.

How to Address the Challenges of Security-Related AI and ML

The adversarial setting of AI and ML in information security, combined with the technological limitations of security-related algorithms discussed above, shows that artificial intelligence and machine learning aren't enough on their own to keep organizations safe. To train these technologies, security professionals need to supply them with hundreds of thousands of known samples that non-AI tools, such as signature-based detection technologies and heuristics utilities, have deemed malicious, and they must keep doing so to keep the models up to date. It therefore makes sense to partner AI and ML with these other methodologies.

Lastline couldn't agree more, which is why our products use a combination of technologies to detect threats and network breaches. In addition to machine learning (our preferred artificial intelligence technology), we draw upon the input of anomaly detection and expert systems to analyze millions of samples a day. Through this synthesis of information, Lastline can provide a user with a complete picture of a breach that's not distorted by false positives. More than that, our technology can tell them how severe each incident is by bringing seemingly disparate events together for greater context about an attack as it occurs. This is a crucial benefit for security professionals who don't have time to deal with everything at once and who need to triage security alerts in order to focus on the highest-risk threats. Lastline's implementation of AI can therefore save companies time and money, allowing security professionals to remediate each threat more quickly and completely.

Learn more about Lastline’s AI-powered malware detection solutions.

Giovanni Vigna


Giovanni Vigna is one of the founders and CTO of Lastline as well as a Professor in the Department of Computer Science at the University of California in Santa Barbara. His current research interests include malware analysis, web security, vulnerability assessment, and mobile phone security. He also edited a book on Security and Mobile Agents and authored one on Intrusion Correlation. He has been the Program Chair of the International Symposium on Recent Advances in Intrusion Detection (RAID 2003), of the ISOC Symposium on Network and Distributed Systems Security (NDSS 2009), and of the IEEE Symposium on Security and Privacy in 2011. He is known for organizing and running an inter-university Capture The Flag hacking contest, called iCTF, that every year involves dozens of institutions around the world. Giovanni Vigna received his M.S. with honors and Ph.D. from Politecnico di Milano, Italy, in 1994 and 1998, respectively. He is a member of IEEE and ACM.