Machine learning is one of the year’s hottest technology trends, driving innovation and making waves across both the enterprise and consumer technology landscape. Within the cybersecurity industry, many companies legitimately claim to do some machine learning, though it’s often not clear what that means, how it works, or even why it is important.
In this post, we’ll share more insight into Symantec’s investments in machine learning and how they drove important innovations in Symantec Endpoint Protection 14.
Announced last week, the new software uses state-of-the-art machine learning technologies to block more attacks than the competition and significantly raise the bar for attackers. To achieve this, we combine a multi-layered approach with a massive amount of data, advanced algorithms and techniques, and an automation system to stay ahead of the attackers.
The machine learning work was led by our Center for Advanced Machine Learning, which we established in 2014. The team now includes 20+ experts who conduct high-impact R&D in machine learning architectures, algorithms and applications to address security and information management challenges. This includes leading-edge research in deep learning, probabilistic programming, reinforcement learning and Bayesian nonparametrics.
For Symantec Endpoint Protection 14, the group worked with Symantec’s security experts to develop a set of machine learning technologies that work together to examine three major dimensions of attacks. The three dimensions collectively provide a multi-layered threat assessment by analyzing what a file is (static), how it behaves (dynamic) and – via the cloud – what relationships it has with other files, machines and URLs (provenance):
- Static attributes: We start by inspecting thousands of static characteristics of a file – things like file name, function calls, entropy, etc.
- Dynamic behaviors: We then dig deeper to understand a program’s dynamic behaviors. We watch for combinations of thousands of behaviors – for example, does the program connect to the network, does it launch another process, does it access registry keys, etc.
- Relationships and reputation: To complete the picture, we examine the file’s relationships with other files, machines and URLs to generate a file “reputation.” Inspired by “the wisdom of the crowd,” this reputation analysis runs on big data at scale in our cloud, and enables us to understand if a program seen on only one or a few machines around the world is likely malicious.
The beauty of these three dimensions is that they are complementary to each other, so each can be aggressive in stopping threats because the other two dimensions serve as a “check” on its conclusions.
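To make one of the static attributes above concrete, here is a minimal sketch of how file entropy can be computed. It is an illustration only, not Symantec's implementation: the function name `byte_entropy` is hypothetical, and the idea that high entropy hints at packed or encrypted content is a general rule of thumb rather than a detection rule from the product.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # Sum p * log2(1/p) over every byte value that occurs at least once.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


# Highly repetitive content has low entropy; evenly spread bytes have high entropy.
repetitive = byte_entropy(b"A" * 1024)
uniform = byte_entropy(bytes(range(256)) * 4)
print(f"repetitive: {repetitive:.2f} bits/byte, uniform: {uniform:.2f} bits/byte")
```

A real static model would feed a value like this into a classifier alongside thousands of other attributes rather than act on it alone.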
Big Data + Predictive Models = Smarter Protection
Big data is at the heart of Symantec’s approach to machine learning. Thanks to our broad footprint across endpoint, network and cloud security, we monitor threat and attack data from over 175 million endpoints and 57 million attack sensors in real time, around the clock. That translates into billions of files and nearly four trillion relationships. That’s an enormous and rich dataset for training our classifier systems on “good,” “bad” and everything in between.
That’s important because data is the fuel for machine learning. You want lots of it. The more data you have, the “farther” you can go in building precise and effective detection technologies. You also want rich data. The more diverse and rich the data inputs, the more likely you are to uncover important hidden relationships. Ultimately, machine learning systems are only as good as the quality, diversity and reach of the datasets used to train them – and ours benefit from the world’s largest civilian threat intelligence network.
If data is the fuel, then algorithms are the engine of machine learning. Algorithms take data and produce models that give us predictions, such as whether a file is malicious. Companies make a lot of noise about algorithms and models because they are trendy, and new ones appear all the time. The real trick, and the secret sauce for machine learning practitioners, is knowing how to match the right algorithm to the task and data at hand.
One of the key techniques we use is “ensembling,” which is a fancy way of saying “use many models and combine them in a good way.” It’s key to getting the best models possible – and was famously used in the $1M Netflix Prize. We add some “magic” through proprietary ensembling techniques that allow our systems to learn how best to combine predictions from many different models, even when we don’t know during training what the correct predictions are.
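For readers new to the idea, the sketch below shows a very simple form of ensembling: a weighted average in which each model's weight comes from its accuracy on labeled validation data. This is a generic textbook approach, not the proprietary technique described above (which, as noted, also works when the correct predictions are unknown during training). The three toy models, their thresholds and the feature names are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Sample = Dict[str, float]
Model = Callable[[Sample], float]  # returns a maliciousness score in [0, 1]


def accuracy(model: Model, samples: Sequence[Sample], labels: Sequence[int]) -> float:
    """Fraction of samples where the thresholded score matches the label."""
    hits = sum((model(x) >= 0.5) == bool(y) for x, y in zip(samples, labels))
    return hits / len(samples)


def fit_weights(models: Sequence[Model], samples, labels) -> List[float]:
    """Weight each model in proportion to its validation accuracy."""
    accs = [accuracy(m, samples, labels) for m in models]
    total = sum(accs)
    if total == 0:
        return [1.0 / len(models)] * len(models)
    return [a / total for a in accs]


def ensemble_score(models: Sequence[Model], weights: Sequence[float], x: Sample) -> float:
    """Weighted average of the individual model scores."""
    return sum(w * m(x) for m, w in zip(models, weights))


# Hypothetical stand-ins for a static, a behavioral and a reputation model.
def static_model(x): return 0.9 if x["entropy"] > 7.0 else 0.2
def behavior_model(x): return 0.8 if x["spawns_process"] else 0.1
def reputation_model(x): return 0.7 if x["machines_seen"] < 5 else 0.05


models = [static_model, behavior_model, reputation_model]
validation = [
    {"entropy": 7.6, "spawns_process": 1, "machines_seen": 2},
    {"entropy": 4.1, "spawns_process": 0, "machines_seen": 90000},
    {"entropy": 7.4, "spawns_process": 0, "machines_seen": 50000},  # packed but benign
    {"entropy": 5.0, "spawns_process": 1, "machines_seen": 1},
]
labels = [1, 0, 0, 1]  # 1 = malicious, 0 = benign

weights = fit_weights(models, validation, labels)
print([round(w, 2) for w in weights])
print(round(ensemble_score(models, weights, validation[2]), 2))
```

In this toy data the static model alone would flag the packed but benign file, while the combined score stays below 0.5 because the other two models disagree and carry more weight.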
Another key technique we use is “adaptation.” Our security models must be continually tuned to track adversaries, changes in the software and network landscape and changes in user behavior. These are significant hurdles for traditional machine learning. For Symantec Endpoint Protection, we use a “meta-algorithm” called boosting, which operates by iteratively improving a model – each time focusing on the mistakes the model has previously made and correcting them without “unlearning” the things that were correct.
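The boosting idea described above can be sketched with AdaBoost, one well-known member of the boosting family. The post does not say which boosting variant the product uses, so treat this as an assumption chosen for illustration. The weak learners here are one-feature threshold rules ("stumps"), and the single-feature toy dataset is hypothetical.

```python
import math


def train_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                preds = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, preds, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best


def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign


def adaboost(X, y, rounds):
    """Train `rounds` stumps, each focusing on the previous rounds' mistakes."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified samples and lower the rest.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, row):
    score = sum(alpha * stump_predict(s, row) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy data: one numeric feature, label +1 = malicious, -1 = benign.
X = [[v] for v in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 3):
    model = adaboost(X, y, rounds)
    errors = sum(predict(model, row) != t for row, t in zip(X, y))
    print(f"rounds={rounds}: {errors} training errors")
```

No single threshold can separate this data, but each new stump is trained on reweighted samples that emphasize earlier mistakes, and earlier stumps stay in the ensemble, so what was already learned is not discarded.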
Last but not least, automation is essential for us to scale machine learning. We built automation for the entire machine learning process – from ingesting, cleaning and processing our telemetry data to optimizing and exploring different models. Without automation (and, of course, sufficient computing power) it simply would not be possible to “crunch all the numbers” and produce the best models.
What’s the end result? Simply put, Symantec has the most advanced machine learning available for endpoint security. A leading independent testing organization (AV-Test) recently tested Symantec Endpoint Protection 14, which beat all our competitors in detection and performance with minimal false positives. Even in artificial “scan” tests, the new software detected nearly 100% of threats at a near-zero false positive rate. (Importantly, false positive performance in Symantec Endpoint Protection 14 can be tuned to meet customer policy requirements.)
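To show what tuning false positive performance can mean in general terms, the sketch below computes detection rate and false positive rate at a score threshold and then picks the lowest threshold that stays within a false positive budget. The scores, labels and function names are hypothetical and do not reflect the product's actual tuning controls or test results.

```python
def rates(scores, labels, threshold):
    """Return (detection rate, false positive rate) at a score threshold.

    Assumes labels contain at least one malicious (1) and one benign (0) sample.
    """
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def threshold_for_fp_budget(scores, labels, max_fpr):
    """Lowest threshold whose false positive rate is within the budget.

    The false positive rate never rises as the threshold rises, so the first
    qualifying threshold also gives the best detection rate within the budget.
    """
    for thr in sorted(set(scores)):
        _, fpr = rates(scores, labels, thr)
        if fpr <= max_fpr:
            return thr
    return float("inf")  # no threshold meets the budget: block nothing


# Hypothetical model scores; label 1 = malicious, 0 = benign.
scores = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30, 0.10, 0.05]
labels = [1, 1, 1, 0, 1, 0, 0, 0]

for budget in (0.0, 0.25):
    thr = threshold_for_fp_budget(scores, labels, budget)
    detection, fpr = rates(scores, labels, thr)
    print(f"budget={budget}: threshold={thr}, detection={detection}, fpr={fpr}")
```

A stricter false positive budget pushes the threshold up and costs some detection, which is the trade-off a policy setting of this kind lets an administrator choose.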
We are excited about the new frontiers in threat detection made possible via machine learning and artificial intelligence. Used correctly, and with massive amounts of rich, diverse data being analyzed across endpoints and the cloud, these technologies are true game-changers in how we can fight attackers.
Please join us December 6 for a special webinar on the features and benefits of machine learning within Symantec Endpoint Protection 14. Learn more about the new product here and watch this space for weekly blog posts that drill into key capabilities with insights from Symantec and third-party experts.