Detecting Malware Without Feature Engineering Using Deep Learning
Nowadays, machine learning is routinely used in the detection of network attacks and the identification of malicious programs. In most ML-based approaches, each analysis sample (such as an executable program, an office document, or a network request) is analyzed and a number of features are extracted. For example, in the case of a binary program, one might extract the names of the library functions being invoked, the length of the sections of the executable, and so forth.
Then, a machine learning algorithm is given as input a set of known benign and known malicious samples (called “the ground truth”). Based on the feature values of the samples in the ground-truth dataset, the algorithm creates a model that classifies these known samples correctly. If the dataset from which the algorithm has learned is representative of the real-world domain, and if the features are relevant for discriminating between benign and malicious programs, chances are that the learned model will generalize and allow for the detection of previously unseen malicious samples.
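To make the two steps above concrete, here is a minimal sketch of a feature-engineered pipeline. Everything in it is an assumption made for illustration: the sample layout (`imports`, `sections`), the list of "suspicious" imports, the three hand-picked features, and the toy ground truth. A nearest-centroid classifier stands in for whatever learning algorithm a real product would use.

```python
# Hypothetical list of imports a data scientist might consider suspicious.
SUSPICIOUS_IMPORTS = {"VirtualAllocEx", "WriteProcessMemory", "CreateRemoteThread"}

def extract_features(sample):
    """Turn one sample into a fixed-length vector of hand-picked features."""
    imports = sample["imports"]
    section_sizes = sample["sections"].values()
    return [
        float(len(imports)),                                         # number of imported functions
        float(sum(name in SUSPICIOUS_IMPORTS for name in imports)),  # suspicious imports
        max(section_sizes, default=0) / 1024.0,                      # largest section, in KiB
    ]

def train_centroids(ground_truth):
    """Learn one mean feature vector (centroid) per label."""
    sums, counts = {}, {}
    for sample, label in ground_truth:
        vec = extract_features(sample)
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(model, sample):
    """Return the label whose centroid is closest to the sample's features."""
    vec = extract_features(sample)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(vec, centroid))

    return min(model, key=lambda label: squared_distance(model[label]))

# Toy ground truth, invented for this example.
GROUND_TRUTH = [
    ({"imports": ["CreateFileW", "ReadFile"],
      "sections": {".text": 4096, ".data": 1024}}, "benign"),
    ({"imports": ["MessageBoxW", "CreateFileW", "CloseHandle"],
      "sections": {".text": 8192}}, "benign"),
    ({"imports": ["VirtualAllocEx", "WriteProcessMemory"],
      "sections": {".text": 512, ".rsrc": 65536}}, "malicious"),
    ({"imports": ["CreateRemoteThread", "VirtualAllocEx", "WriteProcessMemory"],
      "sections": {".text": 1024, ".rsrc": 131072}}, "malicious"),
]

model = train_centroids(GROUND_TRUTH)

# A sample that was not part of the ground truth.
unseen = {"imports": ["WriteProcessMemory", "CreateRemoteThread", "ReadFile"],
          "sections": {".text": 2048, ".rsrc": 98304}}
print(classify(model, unseen))
```

Note that the model can only ever see what `extract_features` chooses to show it. Any behavior that is not captured by one of the three numbers is invisible to the classifier, which is exactly the limitation discussed next.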
The Role of Feature Engineering
Even though the description above is an oversimplification of the actual process, the basic issue is that some “feature engineering” is necessary. Namely, (human, expensive) data scientists have to decide which features need to be extracted from each sample, and this decision is guided by their domain knowledge, or, more prosaically, by a gut feeling of what features are really useful for detection. However, what if they didn’t get it “right”?
For example, in a recent experiment, security experts were able to evade the Cylance detection system by embedding strings associated with a benign video game in malicious files.
A New Approach
In our research, Lastline has collaborated with UCSB to explore a novel approach to malware detection that does not require feature engineering. The approach relies on an information-rich representation of programs: the report produced by sandboxing technology. These reports detail the actions performed by a program when executed in a controlled environment (called, aptly, “the sandbox”).
Not all sandboxes are created equal, but they share a common feature: instead of focusing on the static aspects of a program (that is, its code, or the way in which its data is packaged), sandboxes focus on the dynamic aspects related to the program’s actual execution (e.g., which files were accessed, which processes were created, which network connections were established). In the end, these reports can be seen as lengthy, detailed documents about the actions performed by programs.
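As a rough illustration of what "a report seen as a document" can mean, the sketch below flattens a nested report into a plain sequence of words. The report structure, its field names, and the tokenization rule are all assumptions for this example. Real sandbox reports are far larger, and this is not necessarily the preprocessing used in our research.

```python
import re

# Hypothetical sandbox report; structure and field names are invented.
REPORT = {
    "files": [{"action": "write", "path": "C:\\Users\\bob\\AppData\\run.exe"}],
    "processes": [{"action": "create", "cmdline": "cmd.exe /c run.exe"}],
    "network": [{"action": "connect", "host": "198.51.100.7", "port": 443}],
}

def report_to_words(node):
    """Walk the report depth-first and return its keys and values as lowercase words."""
    if isinstance(node, dict):
        words = []
        for key, value in node.items():
            words.append(key.lower())
            words.extend(report_to_words(value))
        return words
    if isinstance(node, list):
        return [word for item in node for word in report_to_words(item)]
    # Leaf value: split on anything that is not a letter, a digit, or a dot,
    # so that file names and IP addresses survive as single words.
    return [tok for tok in re.split(r"[^a-z0-9.]+", str(node).lower()) if tok]

words = report_to_words(REPORT)
print(" ".join(words))
```

No decision is made here about which parts of the report matter. Every key and every value ends up in the word sequence, and it is left to the learning algorithm to work out which words are informative.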
Our approach, called Neurlux, uses these documents as input and applies deep learning techniques to create a classifier that discriminates between malicious and benign programs. More precisely, Neurlux treats each report as it would treat any other document: as a series of words. These words are transformed into vectors in a process called “embedding.” Finally, these vectors are given as input to a neural network that combines several techniques: a Convolutional Neural Network (CNN), a Bidirectional Long Short-Term Memory network (BiLSTM), and an attention network.
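The sketch below illustrates only two of the ideas mentioned above, in plain Python: mapping words to vectors through an embedding table, and using attention to collapse a variable-length sequence of vectors into a single fixed-size vector. It is not the Neurlux architecture. The CNN and BiLSTM layers are omitted entirely, the embedding table and the `query` vector are random or hard-coded here whereas a real model learns them during training, and all names are hypothetical.

```python
import math
import random

def build_vocab(documents):
    """Map each distinct word to an integer id; id 0 is reserved for unknown words."""
    vocab = {"<unk>": 0}
    for doc in documents:
        for word in doc:
            vocab.setdefault(word, len(vocab))
    return vocab

def make_embeddings(vocab_size, dim, seed=0):
    """Randomly initialized embedding table (one vector per word id)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(doc, vocab, table):
    """Replace every word by its vector; unseen words fall back to id 0."""
    return [table[vocab.get(word, 0)] for word in doc]

def attention_pool(vectors, query):
    """Softmax-weighted average of the word vectors, scored against a query vector."""
    scores = [sum(v_i * q_i for v_i, q_i in zip(vec, query)) for vec in vectors]
    peak = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * vec[i] for w, vec in zip(weights, vectors))
              for i in range(len(query))]
    return pooled, weights

# Toy word sequences standing in for two sandbox reports.
docs = [["files", "write", "run.exe"], ["network", "connect", "443"]]
vocab = build_vocab(docs)
table = make_embeddings(len(vocab), dim=4)

vectors = embed(docs[0] + ["never_seen"], vocab, table)
query = [0.5, 0.5, 0.5, 0.5]                # hypothetical; learned in a real model
pooled, weights = attention_pool(vectors, query)
print(pooled)
print(weights)
```

The attention weights sum to one and indicate how much each word contributes to the pooled vector. In a trained network, that pooled vector would be passed to a final classification layer.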
To determine whether this approach is indeed effective at detecting malware, we compared Neurlux to a state-of-the-art approach that relies on feature engineering (see the technical paper for the details). Neurlux achieved an accuracy of 96.8%, compared with 89.2% for the state-of-the-art approach.
These results show that an approach that does not rely on feature engineering can be extremely effective in real-world settings.
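For readers less familiar with the metric, accuracy is simply the fraction of samples that a classifier labels correctly. The snippet below shows the computation on made-up labels. These toy values have nothing to do with the datasets or results of the evaluation described above.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Made-up labels for five samples; one benign sample is misclassified.
truth = ["malicious", "benign", "benign", "malicious", "benign"]
predicted = ["malicious", "benign", "malicious", "malicious", "benign"]
print(accuracy(truth, predicted))
```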
In addition, the fact that no human was involved in determining which specific features had to be extracted and encoded allows the approach to be “future-proof”: if a new aspect of the execution suddenly becomes relevant to the detection of malicious programs, the system does not need to be modified, but simply re-trained.
Given that these systems are operating in continuous training mode to address the ever-changing threat landscape, a feature-less approach provides great effectiveness without requiring human experts to continually tweak the system.
And we know how overwhelmed our data analysts already are.
You can find all the details in the following paper:
Neurlux: Dynamic Malware Analysis Without Feature Engineering
Authors: Chani Jindal, Christopher Salls, Hojjat Aghakhani, Keith Long, Christopher Kruegel, Giovanni Vigna
Proceedings of the Annual Computer Security Applications Conference (ACSAC)
San Juan, Puerto Rico, December 2019
Posted by Giovanni Vigna on February 26, 2020