Machine Learning and Security
Clarence Chio and David Freeman
Copyright 2017 Clarence Chio and David Freeman
All rights reserved.
Published by O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
ISBN-13: 9781491979907
10/4/17
Chapter 1: Why Machine Learning and Security?
In the beginning, there was spam.
As soon as academics and scientists had hooked enough computers together via the Internet to create a communications network that provided value, other people realized that this medium of free transmission and broad distribution was a perfect way to advertise sketchy products, steal account credentials, and spread computer viruses.
In the intervening forty years, the field of computer and network security has come to encompass an enormous range of threats and domains: intrusion detection, web application security, malware analysis, social network security, advanced persistent threats, and applied cryptography, just to name a few. But even today spam remains a major focus for those in the email or messaging space, and for the general public spam is probably the aspect of computer security that most directly touches their own lives.
Machine learning was not invented by spam fighters, but it was quickly adopted by statistically inclined technologists who saw its potential in dealing with a constantly evolving source of abuse. Email providers and Internet service providers (ISPs) have access to a wealth of email content, metadata, and user behavior. From email content, models can be built that offer a generalizable approach to recognizing spam. Metadata and entity reputations can be extracted from email to predict the likelihood that a message is spam without even looking at its content. By instantiating a user behavior feedback loop, the system can build a collective intelligence and improve over time with the help of its users.
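To make the content-based approach and the feedback loop concrete, here is a minimal sketch in Python. It assumes a naive Bayes model over word counts, which is one common choice for content-based filtering and not necessarily what any particular email provider uses. The class name TinySpamFilter, its methods, and the sample messages are hypothetical and exist only for illustration.

```python
import math
from collections import Counter


class TinySpamFilter:
    """Hypothetical multinomial naive Bayes filter with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.message_counts = {"spam": 0, "ham": 0}

    def report(self, text, label):
        """Feedback loop: a user marks a message as 'spam' or 'ham'."""
        self.word_counts[label].update(text.lower().split())
        self.message_counts[label] += 1

    def spam_score(self, text):
        """Return the log-odds of spam versus ham; positive leans spam."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_messages = sum(self.message_counts.values())
        scores = {}
        for label in ("spam", "ham"):
            total_words = sum(self.word_counts[label].values())
            # Smoothed class prior, so an untrained filter scores zero.
            score = math.log(
                (self.message_counts[label] + 1) / (total_messages + 2)
            )
            for word in text.lower().split():
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log(
                    (count + 1) / (total_words + len(vocab) + 1)
                )
            scores[label] = score
        return scores["spam"] - scores["ham"]


# Illustrative training data (made up for this sketch).
spam_filter = TinySpamFilter()
spam_filter.report("cheap pills buy now", "spam")
spam_filter.report("win cheap prize now", "spam")
spam_filter.report("meeting agenda for monday", "ham")
spam_filter.report("lunch on monday", "ham")

likely_spam = spam_filter.spam_score("buy cheap pills")
likely_ham = spam_filter.spam_score("monday meeting")
```

In this sketch, every call to report() plays the role of a user clicking a "mark as spam" or "not spam" button, so the scores shift as more users weigh in. A production system would also need the metadata and reputation signals described above, which this example leaves out.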
Email filters have thus gradually evolved to deal with the growing diversity of circumvention methods that spammers have thrown at them. Even though 86% of all emails sent today are spam (according to one study),
The fundamental lesson that both researchers and practitioners have taken away from this battle is the importance of using data to defeat malicious adversaries and improve the quality of our interactions with technology. Indeed, the story of spam fighting serves as a representative example for the use of data and machine learning in any field of computer security. Today almost all organizations have a critical reliance on technology, and almost every piece of technology has security vulnerabilities. Driven by the same core motivations as the spammers from the 1980s (unregulated, cost-free access to an audience with disposable income and private information to offer), malicious actors can pose security risks to almost all aspects of modern life. Indeed, the fundamental nature of the battle between attacker and defender is the same in all fields of computer security as it is in spam fighting: a motivated adversary is constantly trying to misuse a computer system, and each side takes turns at fixing the flaws in design or technique that the other has uncovered. The problem statement has not changed one bit.
Computer systems and web services have become increasingly centralized, and many applications have evolved to serve millions or even billions of users. Entities that become arbiters of information are bigger targets for exploitation, but they are also in the perfect position to make use of their data and user base to achieve better security. With the advent of powerful data-crunching hardware and the development of more powerful data analysis and machine learning algorithms, there has never been a better time to exploit the potential of machine learning in security.
In this book, we will demonstrate applications of machine learning and data analysis techniques to various problem domains in security and abuse. We will explore methods for evaluating the suitability of different machine learning techniques in different scenarios, and focus on guiding principles that will help you use data to achieve better security. Our goal is not to leave you with the answer to every security problem you might face, but to give you a framework for thinking about data and security, and a toolkit from which you can pick the right method for the problem at hand.
The remainder of this chapter sets up context for the rest of the book: we discuss what threats face modern computer and network systems, what machine learning is, and how machine learning applies to those threats. We conclude with a detailed examination of approaches to spam fighting, which, as noted above, gives a concrete example of applying machine learning to security that can be generalized to nearly any domain.
Cyber threat landscape
The landscape of adversaries and miscreants in computer security has evolved over time, but the general categories of threats have remained the same. Security research exists to stymie the goals of attackers, and it is always important to have a good understanding of the different types of attacks that exist in the wild. As you can see from the Cyber Threat Taxonomy tree (Figure 1), the relationships between threat entities and categories can be complex in some cases.
We begin by defining the principal threats that we will explore in future chapters.
Malware (or Virus)
Short for malicious software, any software designed to cause harm or gain unauthorized access to computer systems.
Worm
Standalone malware that replicates itself in order to spread to other computer systems.
Trojan
Malware disguised as legitimate software to avoid detection.
Spyware
Malware installed on a computer system without the operator's permission or knowledge, for the purposes of espionage and information collection. Keyloggers fall into this category.
Adware
Malware that injects unsolicited advertising material (e.g. pop-ups, banners, videos) into a user interface, often when a user is browsing the web.
Ransomware
Malware designed to restrict the availability of computer systems until a sum of money (ransom) is paid.
Rootkit
A collection of (often low-level) software designed to enable access to or gain control of a computer system. (Root denotes the most powerful level of access to a system.)
Backdoor
An intentional hole placed in the system perimeter to allow future access that bypasses perimeter protections.
Bot
A variant of malware that allows attackers to remotely take over and control computer systems, making them zombies.
Botnet
A large network of bots.
Exploit
A piece of code or software that exploits specific vulnerabilities in other software applications or frameworks.
Scanning
Attacks that send a variety of requests to computer systems, often in a brute-force manner, with the goal of finding weak points and vulnerabilities, as well as gathering information.
Sniffing
Silently observing and recording network and in-server traffic and processes without the knowledge of operators.
Keylogger
A piece of hardware or software that (often covertly) records the keys pressed on a keyboard or similar computer input device.
Spam
Unsolicited bulk messaging, usually for the purposes of advertising. Typically sent by email, but also by SMS or through a messaging provider (e.g., WhatsApp).
Login attack
Multiple, usually automated, attempts at guessing credentials for authentication systems, either in a brute-force manner or with stolen/purchased credentials.