MALWARE DATA SCIENCE
Attack Detection and Attribution
by Joshua Saxe with Hillary Sanders
San Francisco
MALWARE DATA SCIENCE. Copyright 2018 by Joshua Saxe with Hillary Sanders.
All rights reserved. No part of this work may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval system, without the prior written permission of the copyright owner and the publisher.
ISBN-10: 1-59327-859-4
ISBN-13: 978-1-59327-859-5
Publisher: William Pollock
Production Editor: Laurel Chun
Cover Illustration: Jonny Thomas
Interior Design: Octopod Studios
Developmental Editors: Annie Choi and William Pollock
Technical Reviewer: Gabor Szappanos
Copyeditor: Barton Reed
Compositor: Laurel Chun
Proofreader: James Fraleigh
Indexer: BIM Creatives, LLC
For information on distribution, translations, or bulk sales, please contact No Starch Press, Inc. directly:
No Starch Press, Inc.
245 8th Street, San Francisco, CA 94103
phone: 1.415.863.9900
www.nostarch.com
Library of Congress Control Number: 2018949204
No Starch Press and the No Starch Press logo are registered trademarks of No Starch Press, Inc. Other product and company names mentioned herein may be the trademarks of their respective owners. Rather than use a trademark symbol with every occurrence of a trademarked name, we are using the names only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.
The information in this book is distributed on an "As Is" basis, without warranty. While every precaution has been taken in the preparation of this work, neither the authors nor No Starch Press, Inc. shall have any liability to any person or entity with respect to any loss or damage caused or alleged to be caused directly or indirectly by the information contained in it.
To Alen Capalik, for bringing me back to computers after a long hiatus
About the Authors
Joshua Saxe is Chief Data Scientist at the major security vendor Sophos, where he leads a security data science research team. He's also a principal inventor of Sophos' neural network-based malware detector, which defends tens of millions of Sophos customers from malware infections. Before joining Sophos, Joshua spent five years leading DARPA-funded security data research projects for the US government.
Hillary Sanders is a senior software engineer and data scientist at Sophos, where she has played a key role in inventing and productizing neural network, machine learning, and malware similarity analysis technologies. Before joining Sophos, Hillary was a data scientist at Premise Data Corporation. She is a regular speaker at security conferences, having given security data science talks at Blackhat USA and BSides Las Vegas. She studied Statistics at UC Berkeley.
About the Technical Reviewer
Gabor Szappanos graduated from the Eötvös Loránd University of Budapest with a degree in physics. His first job was developing diagnostic software and hardware for nuclear power plants at the Computer and Automation Research Institute. Gabor started antivirus work in 1995 and joined VirusBuster in 2001, where he was responsible for handling macro viruses and script malware; in 2002, he became head of the virus lab. Between 2008 and 2016, he was a member of the board of directors of the Anti-Malware Testing Standards Organization (AMTSO), and, in 2012, he joined Sophos as a Principal Malware Researcher.
BRIEF CONTENTS
CONTENTS IN DETAIL
1: BASIC STATIC MALWARE ANALYSIS
2: BEYOND BASIC STATIC ANALYSIS: X86 DISASSEMBLY
3: A BRIEF INTRODUCTION TO DYNAMIC ANALYSIS
4: IDENTIFYING ATTACK CAMPAIGNS USING MALWARE NETWORKS
5: SHARED CODE ANALYSIS
6: UNDERSTANDING MACHINE LEARNING-BASED MALWARE DETECTORS
7: EVALUATING MALWARE DETECTION SYSTEMS
8: BUILDING MACHINE LEARNING DETECTORS
9: VISUALIZING MALWARE TRENDS
10: DEEP LEARNING BASICS
11: BUILDING A NEURAL NETWORK MALWARE DETECTOR WITH KERAS
12: BECOMING A DATA SCIENTIST
APPENDIX: AN OVERVIEW OF DATASETS AND TOOLS
FOREWORD
Congratulations on picking up Malware Data Science. You're on your way to equipping yourself with the skills necessary to become a cybersecurity professional. In this book, you'll find a wonderful introduction to data science as applied to malware analysis, as well as the requisite skills and tools you need to be proficient at it.
There are far more jobs in cybersecurity than there are qualified candidates, so the good news is that cybersecurity is a great field to get into. The bad news is that the skills required to stay current are changing rapidly. As is often the case, necessity is the mother of invention. With far more demand for skilled cybersecurity professionals than there is supply, data science algorithms are filling the gap by providing new insights and predictions about threats against networks. The traditional model of watchmen monitoring network data is rapidly becoming obsolete as data science is increasingly being used to find threat patterns in terabytes of data. And thank goodness for that, because monitoring a screen of alerts is about as exciting as monitoring a video camera surveillance system of a parking lot.
So what exactly is data science, and how does it apply to security? As you'll see in the Introduction, data science applied to security is the art and science of using machine learning, data mining, and visualization to detect threats against networks. While you'll find a lot of hyperbole around machine learning and artificial intelligence driven by marketing, there are, in fact, very good use cases for these technologies that are in production today.
For instance, when it comes to malware detection, both the volume of malware production and the low cost to the adversary of changing malware signatures have rendered signature-only approaches to malware detection obsolete. Instead, antivirus companies are now training neural networks or other types of machine learning algorithms over very large datasets of malware to learn their characteristics, so that new variants of malware can be detected without having to update the model daily. The combination of signature-based and machine learning-based detection provides coverage for both known and unknown malware. This is a topic in which both Josh and Hillary are experts and about which they speak from deep experience.
But malware detection is only one use case for data science. In fact, when it comes to finding threats on the network, today's sophisticated adversaries often will not drop executable programs. Instead, they will exploit existing software for initial access and then leverage system tools to pivot from one machine to the next using the user privileges obtained through exploitation. From an adversarial point of view, this approach doesn't leave behind artifacts such as malware that antivirus software will detect. However, a good endpoint logging system or an endpoint detection and response (EDR) system will capture system-level activities and send this telemetry to the cloud, from where analysts can attempt to piece together the digital footprints of an intruder. This process of combing through massive streams of data and continuously looking for patterns of intrusion is a problem well suited to data science, specifically data mining with statistical algorithms and data visualization. You can expect more and more Security Operations Centers (SOCs) to adopt data mining and artificial intelligence technologies. It's really the only way to sift through massive datasets of system events to identify actual attacks.