Detecting Malware Without Feature Engineering Using Deep Learning

Nowadays, machine learning is routinely used in the detection of network attacks and the identification of malicious programs. In most ML-based approaches, each analysis sample (such as an executable program, an office document, or a network request) is analyzed and a number of features are extracted. For example, in the case of a binary program, one might extract the names of the library functions being invoked, the length of the sections of the executable, and so forth.
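As a rough illustration of what such hand-picked features might look like, the sketch below turns a description of a binary into a fixed-length numeric vector. The input format, the list of "suspicious" imports, and the function name `extract_features` are assumptions made for this example; they are not taken from any specific product.

```python
import math

# Hypothetical list of library functions an analyst has decided to look for.
SUSPICIOUS_IMPORTS = ("CreateRemoteThread", "WriteProcessMemory", "VirtualAllocEx")


def extract_features(sample):
    """Turn a parsed description of an executable into a fixed-length vector."""
    imports = set(sample["imports"])
    section_sizes = list(sample["sections"].values())
    # One 0/1 flag per hand-picked import name.
    features = [1.0 if name in imports else 0.0 for name in SUSPICIOUS_IMPORTS]
    features.append(float(len(imports)))            # number of distinct imports
    features.append(float(len(section_sizes)))      # number of sections
    features.append(math.log1p(sum(section_sizes))) # log-scaled total section size
    return features


# Toy sample; a real pipeline would obtain these fields from an executable parser.
sample = {
    "imports": ["CreateFileW", "WriteProcessMemory", "CreateRemoteThread"],
    "sections": {".text": 4096, ".data": 1024, ".rsrc": 512},
}
vector = extract_features(sample)
print(vector)
```

Every entry in this vector exists only because someone decided in advance that it might matter, which is exactly the design decision discussed in the rest of this post.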

Then, a machine learning algorithm is given as input a set of known benign and known malicious samples (called “the ground truth”). The algorithm creates a model that, based on the values of the features of the samples in the ground-truth dataset, is able to classify known samples correctly. If the dataset from which the algorithm has learned is representative of the real-world domain, and if the features are relevant for discriminating between benign and malicious programs, chances are that the learned model will generalize and allow for the detection of previously unseen malicious samples.
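The learning step can be sketched with a deliberately tiny example. The code below trains a perceptron, chosen here only because it fits in a few lines, on a made-up ground truth with two invented features (a count of suspicious imports and a flag for writes to a system directory). Real systems use far larger datasets and more capable models.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Learn weights separating benign (0) from malicious (1) feature vectors."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            score = sum(w * v for w, v in zip(weights, x)) + bias
            delta = y - (1 if score > 0 else 0)
            if delta != 0:
                # Nudge the decision boundary toward the misclassified sample.
                errors += 1
                weights = [w + lr * delta * v for w, v in zip(weights, x)]
                bias += lr * delta
        if errors == 0:  # every ground-truth sample is classified correctly
            break
    return weights, bias


def classify(model, x):
    """Return 1 (malicious) or 0 (benign) for a feature vector."""
    weights, bias = model
    return 1 if sum(w * v for w, v in zip(weights, x)) + bias > 0 else 0


# Toy ground truth: [suspicious_import_count, writes_to_system_dir]
samples = [[0, 0], [1, 0], [0, 1], [2, 1], [3, 1], [3, 0]]
labels = [0, 0, 0, 1, 1, 1]

model = train_perceptron(samples, labels)
print([classify(model, x) for x in samples])  # predictions on the known samples
print(classify(model, [4, 1]))                # prediction on an unseen sample
```

Whether the prediction on the unseen sample is trustworthy depends entirely on the two conditions stated above: a representative dataset and relevant features.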

The Role of Feature Engineering

Even though the description above is an oversimplification of the actual process, the basic issue is that some “feature engineering” is necessary. Namely, (human, expensive) data scientists have to decide which features need to be extracted from each sample, and this decision is guided by their domain knowledge, or, more prosaically, by a gut feeling of what features are really useful for detection. However, what if they didn’t get it “right”?

For example, in a recent experiment some security experts were able to evade the Cylance detection system by embedding strings associated with a benign video game.

A New Approach

In our research, Lastline has collaborated with UCSB to explore a novel approach to malware detection that does not require feature engineering. The approach relies on an information-rich representation of programs: the report produced by sandboxing technology. These reports detail the actions performed by a program when executed in a controlled environment (called, aptly, “the sandbox”).

Not all sandboxes are created equal, but they share a common feature: instead of focusing on the static aspects of a program (that is, its code, or the way in which its data is packaged), sandboxes focus on the dynamic aspects related to the program’s actual execution (e.g., which files were accessed, which processes were created, which network connections were established). In the end, these reports can be seen as lengthy, detailed documents about the actions performed by programs.
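To make the "report as a document" idea concrete, here is a minimal sketch that flattens a nested report into a flat sequence of words. The report layout and field names are invented for illustration; actual sandbox reports are much larger and differ from vendor to vendor.

```python
def report_to_words(node):
    """Recursively flatten a nested report into a flat list of word tokens."""
    words = []
    if isinstance(node, dict):
        for key, value in node.items():
            words.append(str(key))  # field names become words too
            words.extend(report_to_words(value))
    elif isinstance(node, (list, tuple)):
        for item in node:
            words.extend(report_to_words(item))
    else:
        # Leaf values (paths, command lines, ports) are lowercased and
        # split on whitespace.
        words.extend(str(node).lower().split())
    return words


# Hypothetical report fragment with made-up field names.
report = {
    "files_accessed": [r"C:\Windows\notepad.exe"],
    "processes_created": ["cmd.exe /c whoami"],
    "network_connections": [{"host": "example.com", "port": 443}],
}
words = report_to_words(report)
print(words)
```

Note that nothing in this step decides which parts of the report are important; every field name and value simply becomes part of the text.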

Introducing Neurlux

Our approach, called Neurlux, uses these documents as input, and applies deep learning techniques to create a classifier that is able to discriminate between malicious programs and benign ones. More precisely, Neurlux treats these reports like it would treat any other document: a series of words. These words are transformed into vectors in a process called “embedding.” Finally, these vectors are given as input to a neural network that combines several techniques: technically speaking, a Convolutional Neural Network (CNN), a Bidirectional Long Short-Term Memory Network (BiLSTM), and an Attention Network.
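The sketch below illustrates only two of the ideas mentioned above, embedding and attention, in plain Python. It is not the Neurlux implementation: the CNN and BiLSTM layers are omitted, the vectors are random rather than learned, and the attention shown is a simple dot-product variant chosen for brevity. All names and values are assumptions for this example.

```python
import math
import random


def build_vocabulary(documents):
    """Map each distinct word to an integer ID; 0 is reserved for unknown words."""
    vocabulary = {}
    for words in documents:
        for word in words:
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary


def embed(words, vocabulary, table):
    """Look up one vector per word (the embedding step)."""
    return [table[vocabulary.get(word, 0)] for word in words]


def attention_pool(vectors, query):
    """Combine word vectors, weighting each by a softmax over its score."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, vectors))
        for i in range(len(query))
    ]
    return pooled, weights


rng = random.Random(0)
DIM = 4  # illustrative embedding size
documents = [["file_created", "payload.exe", "process_created", "cmd.exe"]]
vocabulary = build_vocabulary(documents)
# Random values stand in for parameters that a trained network would learn.
table = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(len(vocabulary) + 1)]
query = [rng.uniform(-1, 1) for _ in range(DIM)]

vectors = embed(documents[0], vocabulary, table)
pooled, weights = attention_pool(vectors, query)
print(weights)  # how much each word contributes to the pooled vector
```

In a trained model, the attention weights indicate which words of the report the network relies on most, and the pooled vector is what a final classification layer would consume.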

To understand whether this approach is indeed effective at detecting malware, we compared Neurlux to a state-of-the-art approach that relies on feature engineering (see the technical paper for the details). Neurlux achieved an accuracy of 96.8%, compared with 89.2% for the state-of-the-art approach.

These results show that an approach that does not rely on feature engineering can be extremely effective in real-world settings.

In addition, the fact that no human was involved in determining which specific features had to be extracted and encoded allows the approach to be “future-proof”: if a new aspect of the execution suddenly becomes relevant to the detection of malicious programs, the system does not need to be modified, but simply re-trained.

Given that these systems are operating in continuous training mode to address the ever-changing threat landscape, a feature-less approach provides great effectiveness without requiring human experts to continually tweak the system.

And we know how overwhelmed our data analysts already are.

You can find all the details in the following paper:

Neurlux: Dynamic Malware Analysis Without Feature Engineering
Authors: Chani Jindal, Christopher Salls, Hojjat Aghakhani, Keith Long, Christopher Kruegel, Giovanni Vigna
Proceedings of the Annual Computer Security Applications Conference (ACSAC)
San Juan, Puerto Rico, December 2019
https://www.acsac.org/2019/program/final/s221.html

Giovanni Vigna

Giovanni Vigna is one of the founders and CTO of Lastline as well as a Professor in the Department of Computer Science at the University of California in Santa Barbara. His current research interests include malware analysis, web security, vulnerability assessment, and mobile phone security. He also edited a book on Security and Mobile Agents and authored one on Intrusion Correlation. He has been the Program Chair of the International Symposium on Recent Advances in Intrusion Detection (RAID 2003), of the ISOC Symposium on Network and Distributed Systems Security (NDSS 2009), and of the IEEE Symposium on Security and Privacy in 2011. He is known for organizing and running an inter-university Capture The Flag hacking contest, called iCTF, that every year involves dozens of institutions around the world. Giovanni Vigna received his M.S. with honors and Ph.D. from Politecnico di Milano, Italy, in 1994 and 1998, respectively. He is a member of IEEE and ACM.