Machine Learning for Cybersecurity: Good, but Imperfect

Machine learning has found application in a breadth of industries, from cancer detection to voice recognition to identity theft prevention. Perhaps unsurprisingly, cybersecurity is another area where machine learning can increase the efficiency and accuracy of operations.

Machine learning, and the opportunities it presents for cybersecurity, is generating a lot of discussion at the RSA Conference. While the promise is real, there are notes of caution that sometimes get overlooked. With this post, we’d like to share the opportunities but also, perhaps more importantly, the dark underside of machine learning that can undermine its effectiveness.

Machine Learning Basics

Conceptions of artificial intelligence promise broad problem-solving potential, such as curing diseases and garnering profound insight into the universe around us. While these possibilities might come sooner than we think, machine learning’s current capabilities remain very task-specific. Image recognition, speech recognition, and news feed curation are just a few of these function-specific uses. Compared with a traditional algorithm, though, the underlying processes are far more adaptive.

So, how does it work?

Generally speaking, we give “training” data to a machine learning model; we teach it to notice patterns among and across that data, and then we test and refine those classification processes.

There are many ways this can be achieved, from a variety of code libraries and languages to supervised and unsupervised training styles. The main takeaway, though, is that machine learning can enable much faster decision-making, though not necessarily in more “intelligent” ways than a human would use.

How does this relate to security? Well, just as we can teach machine learning to distinguish between images of cats and dogs, we can also teach a machine learning model to distinguish between malicious and non-malicious cyber behavior. It is, in some senses, another classification problem.

Intrusion Detection

Using old network logs, for instance, we can teach a machine learning model to study IP addresses, timestamps, and connection IDs to detect anomalous behavior. We “label” certain blocks of network activity as acceptable or normal; we then label other blocks of network activity as unacceptable or unusual. All of this data gets fed into the machine learning model, which then examines the data for patterns – after which we test its understanding with fresh information.
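The label, train, and test loop described above can be sketched in a few lines. This is a minimal illustration rather than a production detector: the two features (connections per minute and distinct ports contacted), the sample values, and the nearest-centroid classifier are all assumptions chosen to keep the example small.

```python
from statistics import mean

# Hypothetical labeled sessions: (connections_per_minute, distinct_ports, label).
TRAIN = [
    (12, 3, "normal"), (15, 4, "normal"), (9, 2, "normal"), (14, 3, "normal"),
    (220, 60, "unusual"), (180, 45, "unusual"), (250, 80, "unusual"),
]
# "Fresh" sessions the model never saw during training.
TEST = [
    (11, 3, "normal"), (16, 5, "normal"),
    (200, 50, "unusual"), (240, 70, "unusual"),
]


def fit_centroids(rows):
    """Training step: average the feature vectors of each label."""
    by_label = {}
    for *features, label in rows:
        by_label.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, rows):
    """Fraction of rows whose predicted label matches the given label."""
    hits = sum(predict(centroids, features) == label for *features, label in rows)
    return hits / len(rows)


centroids = fit_centroids(TRAIN)
test_accuracy = accuracy(centroids, TEST)
print(f"accuracy on fresh sessions: {test_accuracy:.0%}")
```

Real network data is far messier than this toy set, so a perfect score here says nothing about how such a model would behave in practice.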

As time goes on and the model examines more of this data, it should (hopefully) become more robust. Thus, once some baseline of accuracy and precision is achieved, we can integrate the ML model into an existing security system to make faster, real-time decisions about intrusion detection – increasing the difficulty for attackers to remotely enter and move within a system.

This general process can be applied to many other cybersecurity decisions.

Insider Threats and Fraud Detection

We can apply machine learning to insider threats. By studying excessive printing, excessive downloads, and strange remote connections (e.g., late-night over VPN), ML algorithms can use information from previous incidents to guide future security decisions.

We can apply machine learning to fraud detection. If we’re looking at a user login, for instance, we can examine data such as IP address, geographic location, previous logins, cookies, and typing speed. The model can then compare the real-time behavior of a user to previous knowledge about fraudulent activity and alert us when something is unusual, and therefore suspicious.
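To make the login comparison concrete, here is a rough sketch of scoring one login attempt against a user’s history. It is a hand-written, rule-based stand-in for what a trained model would learn from data; the `Login` fields, the sample addresses, and the three-standard-deviation typing threshold are all assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass
class Login:
    ip: str
    country: str
    has_cookie: bool
    typing_cps: float  # typing speed in characters per second


def risk_score(history, attempt):
    """Count how many signals deviate from this user's past logins (0 to 4)."""
    score = 0
    if attempt.ip not in {h.ip for h in history}:
        score += 1  # IP address never seen for this user
    if attempt.country not in {h.country for h in history}:
        score += 1  # new geographic location
    if not attempt.has_cookie:
        score += 1  # no cookie from a previous session
    speeds = [h.typing_cps for h in history]
    # Flag typing speed more than 3 standard deviations from the user's mean.
    if abs(attempt.typing_cps - mean(speeds)) > 3 * stdev(speeds):
        score += 1
    return score


history = [
    Login("203.0.113.7", "US", True, 5.1),
    Login("203.0.113.7", "US", True, 4.8),
    Login("203.0.113.9", "US", True, 5.4),
]
familiar = Login("203.0.113.7", "US", True, 5.0)
suspicious = Login("198.51.100.23", "DE", False, 11.0)

print(risk_score(history, familiar), risk_score(history, suspicious))
```

A score above some chosen cutoff would trigger the alert; where to set that cutoff is exactly the kind of decision labeled fraud data can inform.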

Other Applications

Malware analysis, network analysis, data manipulation, and spam filtering are just some of the other applications that we can imagine. The pairing of machine learning with behavioral analysis is an up-and-coming area that could soon top the list. To the extent that cybersecurity is data-driven, the use of machine learning can allow for scalable, efficient solutions not quite possible with standard security technologies (e.g., packet filters).

At the same time, correctly trained machine learning models can reduce false positive rates, meaning we get better at detecting cyber attacks while reducing inconvenience for employees and customers. For obvious reasons, this can benefit both public- and private-sector organizations that purchase cybersecurity technologies or build them internally.
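Since the false positive rate carries much of the argument here, a short worked example may help. The sketch below tallies a confusion matrix and derives the false positive rate (benign events wrongly flagged) and the detection rate (attacks correctly flagged); the ten sample events are made up for illustration.

```python
def confusion_counts(actual, predicted):
    """Tally true/false positives and negatives (1 means "attack")."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for is_attack, flagged in zip(actual, predicted):
        if flagged:
            counts["tp" if is_attack else "fp"] += 1
        else:
            counts["fn" if is_attack else "tn"] += 1
    return counts


def false_positive_rate(counts):
    """Share of benign events that were flagged anyway."""
    return counts["fp"] / (counts["fp"] + counts["tn"])


def detection_rate(counts):
    """Share of real attacks that were flagged (also called recall)."""
    return counts["tp"] / (counts["tp"] + counts["fn"])


# Hypothetical ground truth and model output for ten events.
actual = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 0, 0, 0]

counts = confusion_counts(actual, predicted)
print(counts)
print(f"false positive rate: {false_positive_rate(counts):.2f}")
print(f"detection rate: {detection_rate(counts):.2f}")
```

Tracking both numbers together matters: a model can drive false positives to zero simply by never flagging anything.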

Bolstering the Workforce

These capabilities should by no means (and currently cannot) replace humans, just as they shouldn’t replace firewalls, antivirus products, or other “traditional” IT security technologies. However, machine learning can certainly bolster the cybersecurity workforce.

Machine learning can perform data analysis much faster than humans, which allows analysts to use ML for real-time threat intelligence and incident analysis. In an environment where time is of the essence, it’s a powerful tool for accelerating the decision-making process and taking the “heavy lifting” away from humans. As a result, we can aim to reduce the decision fatigue that comes with manually sorting through data.

Machine learning can also provide unique insights. By processing data millions of times over, these models may discover patterns that humans have yet to identify. Once again, humans can focus their time on analyzing alerts rather than reading every incoming IP address.

“Junior” cybersecurity professionals can also use ML models built by “senior” professionals to guide their threat analysis, effectively punching above their weight class. It might not be the company’s network expert at the keyboard, but if a machine learning model can provide some assistance, organizations can better mitigate security staffing shortages.

Drawbacks: Labeling and Bias

Of course, in order to maximize the use of any technology, we must understand its weaknesses as much as its strengths. And when it comes to machine learning for cybersecurity, there are definitely some drawbacks.

For one, part of the machine learning “process” is labeling data. Sometimes, however, this cannot be done automatically, meaning data has to be manually labeled by a human being to complete the initial training for the ML system. This is often the case with cybersecurity, which means we’re always introducing some degree of human error into the baseline model. Hence, relying entirely on machine learning can be dangerous when a model scores high accuracy against labels that are themselves inaccurate.
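A toy example shows how mislabeled training data can hide behind a good-looking score. Here we assume, purely for illustration, a model that reproduces its human-provided labels exactly, and that two real attacks were mislabeled as benign.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that match the given labels."""
    return sum(p == l for p, l in zip(predictions, labels)) / len(labels)


# Hypothetical ground truth for ten events.
true_labels = ["ok", "ok", "bad", "ok", "bad", "ok", "ok", "bad", "ok", "ok"]

# The human labeler missed two of the three attacks.
human_labels = list(true_labels)
human_labels[2] = "ok"
human_labels[7] = "ok"

# Assumption: the model fits its training labels perfectly, mistakes included.
model_predictions = list(human_labels)

measured = accuracy(model_predictions, human_labels)  # what the dashboard shows
actual = accuracy(model_predictions, true_labels)     # what is really happening

print(f"accuracy against human labels: {measured:.0%}")
print(f"accuracy against ground truth: {actual:.0%}")
```

The measured score looks perfect because the model and the labels share the same blind spots, which is why the evaluation data deserves as much scrutiny as the model.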

Machine learning models, as with any algorithm, only know what we teach them. “Algorithmic bias” comes from this exact fact; if we only feed a computer biased information, it will make biased decisions without any knowledge or understanding of the repercussions. Something similar happens in cybersecurity, where we can’t possibly give a model the “full picture” of threats.

We don’t have data on every possible phishing email, just as we don’t have data on every type of network attack. Consequently, not only are there areas in cybersecurity that struggle to use machine learning, but the machine learning algorithms themselves will never be perfect. As with human analysts, there is always going to be some rate of error; threats will slip through the automatic cracks and need to be caught via human intervention.

Drawbacks: Exploitability and Reverse-Engineering

Machine learning models can also be tricked into making incorrect decisions. By “injecting” malicious data into the algorithm, attackers can confuse and disrupt the model’s decision-making process – and it’s often not too difficult. Researchers have achieved such disruption with facial recognition software, for instance, through only slight tweaks on input images.
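The “slight tweaks” idea can be illustrated with a deliberately tiny linear classifier. The weights, bias, and sample below are hypothetical, and real attacks on image models are more involved, but the principle is the same: nudge every input feature a little in the direction that lowers the model’s score.

```python
# Hypothetical weights of an already-trained linear detector.
WEIGHTS = (0.9, -0.4, 0.7)
BIAS = -0.5


def score(x):
    """Linear decision score: positive means "malicious"."""
    return sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS


def is_malicious(x):
    return score(x) > 0


def perturb(x, epsilon):
    """Shift each feature by epsilon in the direction that lowers the score."""
    return tuple(
        xi - epsilon * (1 if w > 0 else -1)
        for w, xi in zip(WEIGHTS, x)
    )


sample = (0.6, 0.2, 0.3)
tweaked = perturb(sample, epsilon=0.1)

print(is_malicious(sample), round(score(sample), 2))
print(is_malicious(tweaked), round(score(tweaked), 2))
```

Each feature moved by only 0.1, yet the verdict flips from malicious to benign. This sketch assumes the attacker knows the weights; the next section covers how that knowledge can be obtained.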

Further, machine learning models can be reverse-engineered. Researchers demonstrated in 2016 that they could replicate an Amazon ML model with near-perfect accuracy, simply by logging its responses to a few thousand queries. This has obvious implications: by probing publicly available algorithms, attackers can expose training data and threat information, and then design attacks that appear normal to the ML-driven analysis.
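To see why logging responses can be enough, consider the simplest possible case: a linear model behind an API that returns its raw score. This is a toy sketch, not the method from the research mentioned above; the `query` function, its secret parameters, and the assumption that the API exposes a numeric score are all hypothetical.

```python
# Parameters the attacker cannot see directly (hypothetical values).
_SECRET_WEIGHTS = (1.5, -2.0, 0.25)
_SECRET_BIAS = 0.75


def query(x):
    """Stand-in for a remote prediction API that returns a raw score."""
    return sum(w * xi for w, xi in zip(_SECRET_WEIGHTS, x)) + _SECRET_BIAS


def extract_linear_model(query_fn, n_features):
    """Recover weights and bias using only n_features + 1 queries."""
    origin = (0.0,) * n_features
    bias = query_fn(origin)  # all-zero input reveals the bias
    weights = []
    for i in range(n_features):
        unit = tuple(1.0 if j == i else 0.0 for j in range(n_features))
        weights.append(query_fn(unit) - bias)  # isolates weight i
    return tuple(weights), bias


stolen_weights, stolen_bias = extract_linear_model(query, n_features=3)
print(stolen_weights, stolen_bias)
```

Four queries are enough to copy this three-feature model. More complex models need many more queries and cleverer inputs, but the attack surface is the same: every answer leaks a little about what the model has learned.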


So, is machine learning useful for cybersecurity? As many RSA Conference discussions have highlighted, absolutely. Pairing human analysis with machine learning can be quite powerful in the cybersecurity domain. That said, we must be aware that ML models are still imperfect, from bias to exploitability to a misunderstanding of the “complete picture.” If we’re going to rely on this technology to keep organizations secure, accepting its limits is a must.