Signature Addiction. Artificial Intelligence has a dirty little secret
Venture capitalists' (VCs') goal in life is to find cybersecurity unicorns. These are rare creatures. Much like a Cyndaquil Pokemon, unicorns have common traits, and in order to attract VCs, you must exhibit the commonalities of said unicorns.
1. Your company’s product must reference the use of Artificial Intelligence (AI) and Machine Learning (ML), preferably deep learning. And for bonus points, require lots of data scientists. This is one of the two prerequisites that venture capitalists use to gauge Unicornness.
2. The other stringent requirement is that under no circumstances must you ever, ever, ever say you use . . . Signatures . . . As we all know, signatures are the reason traditional security products have failed, paving the way for the charge of the AI and ML unicorn cavalry who will save the day, without a single signature in sight. Additionally, the “S” word must be banned from all forms of communication. If your product does use “content updates,” you must get product marketing to invent a new, cool, next-generation-sounding name that cannot contain any reference to signatures.
If you fail to meet either requirement, the VCs might simply ride their skateboards and BMXs around the corner to the next AI and ML cybersecurity startup.
The Flaw in AI & ML-based Tools
The real world is quickly uncovering that the latest AI and ML-based cybersecurity controls are far from a panacea to the malicious threat actor problem. It is being reported that flaws in the building of intelligence and the structuring of the learning process are creating an environment where old school “signatures,” albeit disguised with a next generation name, are being used to fill the gaps in AI and ML protections.
Humans, at the end of the day, feed data into the AI/ML systems. Decisions have to be made, by humans, about the initial training and classification of the data. Often the human decides if the training data goes into the good corpus or the bad corpus for the machine to learn about.
Listen to a real University Professor explain AI and ML.
A source of data for many wannabe Unicorns is, amongst others, VirusTotal. In this example, the human may set a threshold: files flagged as malicious by 30 or more AV engines are treated as malware and added to the bad training corpus. By contrast, files that have been classified as known good become the opposing training corpus. The system can then compare and contrast files being analyzed against data classified “bad corpus” and data classified “good corpus.” The intended output of the process is the AI/ML system learning to identify strong probability characteristics of both good and bad files.
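The human-chosen threshold described above can be sketched in a few lines of Python. This is an illustration only, not how any particular vendor does it: the function names are hypothetical, the input is assumed to be a plain mapping of file identifier to detection count, and treating zero detections as "known good" is an assumption made to keep the sketch short.

```python
from typing import Dict, List, Optional

# The article's example threshold for the bad corpus.
BAD_THRESHOLD = 30
# Assumption for this sketch: zero detections stands in for "known good".
GOOD_THRESHOLD = 0


def assign_corpus(detections: int) -> Optional[str]:
    """Return 'bad', 'good', or None when the file is too ambiguous to use."""
    if detections >= BAD_THRESHOLD:
        return "bad"
    if detections <= GOOD_THRESHOLD:
        return "good"
    return None  # single-digit and mid-range counts go into neither corpus


def build_corpora(scan_results: Dict[str, int]) -> Dict[str, List[str]]:
    """Split file identifiers into good and bad training corpora."""
    corpora: Dict[str, List[str]] = {"good": [], "bad": []}
    for file_id, detections in scan_results.items():
        label = assign_corpus(detections)
        if label is not None:
            corpora[label].append(file_id)
    return corpora
```

Every constant here is a human decision, which is the point: move either threshold and the machine learns from a different picture of the world.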
One strong indicator in known-good corpus datasets is that files carry a code-signing certificate. Code signing is present in a high percentage of good files, and missing in a high percentage of files that have 30+ detections of maliciousness. The Artificial Intelligence learns that it is highly probable that signed files are not malicious.
This machine-learned bias was subverted in a recent report, where a research team took known bad malware, gave the malware a certificate value, and managed to change the verdict from bad to good on many ML-based antivirus engines. This is an example of real-world corruption in machine learning.
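To see why a learned bias like this is so easy to exploit, consider a deliberately crude caricature: a "model" with one feature, whether the file is signed. The training rows below are made up for illustration and the function names are hypothetical; real engines weigh many features, but the failure mode has the same shape.

```python
from collections import Counter

# Made-up training rows for illustration: (is_signed, label).
# Most good files are signed, most bad files are not.
TRAINING = (
    [(True, "good")] * 9 + [(False, "good")] * 1
    + [(True, "bad")] * 1 + [(False, "bad")] * 9
)


def train(rows):
    """Count how often each label appears for each value of is_signed."""
    counts = {True: Counter(), False: Counter()}
    for is_signed, label in rows:
        counts[is_signed][label] += 1
    return counts


def verdict(counts, is_signed):
    """Return the most frequent training label for this feature value."""
    return counts[is_signed].most_common(1)[0][0]


model = train(TRAINING)
# The same malicious payload, before and after it is given a signature.
verdict_unsigned = verdict(model, is_signed=False)
verdict_signed = verdict(model, is_signed=True)
```

Nothing about the payload changes between the two calls. Only the feature the model has learned to trust changes, and the verdict follows it.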
Fixing Polluted AI/ML Tools Requires – yes, the “S” word and (gasp) Humans
This exploitation of AI/ML feature selection is not easily fixed. Given the immediacy of malware, the only timely solution is increasingly the much-maligned signature update: a simple fix until the ML can be retrained. It appears that when data is biased, polluted or abused, AI/ML systems are, ironically, signature-dependent too.
At the other end of the scale, you’re not going to catch a lot of sophisticated malware in a timely manner if you’re waiting for 30+ detections before you train your systems on it. So, what is the right number? Is a single detection the right number in order to include a new file in training datasets? You risk introducing high levels of false positives. Files that have single digit detections on VirusTotal often represent both heuristic detections of bad files and false positives of good files. The heuristic detection can be based on the identification of packers being used, rather than an actual feature of the contained file, further risking the pollution of either data set.
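What that stopgap looks like in practice can be sketched as a lookup that runs ahead of the model. This is a minimal sketch under assumptions: the "signature" here is simply a SHA-256 digest of the file, the blocklist is a hypothetical set shipped as a "content update," and the function names are invented for illustration. Real signatures are often pattern-based rather than whole-file hashes.

```python
import hashlib
from typing import Set


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the file contents."""
    return hashlib.sha256(data).hexdigest()


def final_verdict(data: bytes, ml_verdict: str, blocklist: Set[str]) -> str:
    """Let a known-bad signature override whatever the model decided."""
    if sha256_hex(data) in blocklist:
        return "bad"  # signature hit wins, regardless of the ML verdict
    return ml_verdict  # otherwise fall back to the model


# Usage: the model was fooled into "good", the signature update corrects it.
sample = b"example payload"
blocklist = {sha256_hex(sample)}
corrected = final_verdict(sample, "good", blocklist)
untouched = final_verdict(b"some other file", "good", blocklist)
```

The override buys time, nothing more. The model stays fooled until someone retrains it.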
Listen to a real University Professor explain packers.
These two examples, a heuristic detection of a packer causing a good file to be classified as bad, and malware with a code-signing value being classified as good, result in AI/ML systems being trained from day one on a polluted dataset, thereby eroding accuracy and confidence in the probability of verdicts from ML features.
How do we fix this?
Humans and signatures. Just don’t tell anyone.
The appeal of Artificial Intelligence and machine learning solutions, of course, is that they potentially solve the availability problem of the shrinking pool of human talent in the security process, and are attractive to companies because the cost of running a security operation is reduced. However, without better classification and learning processes, AI and ML-based security solutions are bound to the fate of many previous silver bullets. They sound great, but when you try to use them, you find out the reality is eroded by false positives and false negatives.