About the Author
Pascal Bugnion is a data engineer at the ASI, a consultancy offering bespoke data science services. Previously, he was the head of data engineering at SCL Elections. He holds a PhD in computational physics from Cambridge University.
Besides Scala, Pascal is a keen Python developer. He has contributed to NumPy, matplotlib and IPython. He also maintains scikit-monaco, an open source library for Monte Carlo integration. He currently lives in London, UK.
I owe a huge debt of gratitude to my parents and my partner for supporting me in this, as well as my employer for encouraging me to pursue this project. I also thank the reviewers, Umanga Bista, Yuanhang Wang, and Radek Ostrowski for their tireless efforts, as well as the entire team at Packt for their support, advice, and hard work carrying this book to completion.
About the Reviewers
Umanga Bista is machine learning and real-time analytics enthusiast from Kathmandu. He completed his bachelors in computer engineering in September, 2013. Since then, he has been working at LogPoint, a SEIM product and company. He primarily works on building statistical plugins and real time, scalable, and fault tolerant architecture to process multiterabyte scale log data streams for security analytics, intelligence, and compliance.
Radek Ostrowski is a freelance big data engineer with an educational background in high-performance computing. He specializes in building scalable real-time data collection and predictive analytics platforms. He has worked at EPCC, University of Edinburgh in data-related projects for many years. Additionally, he has contributed to the success of a game's startupdeltaDNA, co-built super-scalable backend for PlayStation 4 at Sony, helped to improve data processes at Expedia, and started a Docker revolution at Tesco Bank. He is currently working with Spark and Scala for Max2 Inc, an NYC-based startup that is building a community-powered venue discovery platform, offering personalized recommendations, curated and real-time information.
Yuanhang Wang is a data scientist with primary focus on DSL design. He has dabbled in several functional programming languages. He is particularly interested in machine learning and programming language theory. He is currently a data scientist at China Mobile Research Center, working on typed data processing engine and optimizer that is built on top of several big-data platforms.
Yuanhang Wang describes himself as an enthusiast of purely functional programming and neural networks. He obtained his master's degrees both in Harbin Institute of Technology, China and University of Pavia, Italy.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at > for more details.
At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters and receive exclusive discounts and offers on Packt books and eBooks.
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
- Fully searchable across every book published by Packt
- Copy and paste, print, and bookmark content
- On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access PacktLib today and view 9 entirely free books. Simply use your login credentials for immediate access.
To my parents.
To Jessica and to my friends.
Preface
Data science is fashionable. Data science startups are sprouting across the globe and established companies are scrambling to assemble data science teams. The ability to analyze large datasets is also becoming increasingly important in the academic and research world.
Why this explosion in demand for data scientists? Our view is that the emergence of data science can be viewed as the serendipitous collusion of several interlinked factors. The first is data availability. Over the last fifteen years, the amount of data collected by companies has exploded. In the world of research, cheap gene sequencing techniques have drastically increased the amount of genomic data available. Social and professional networking sites have built huge graphs interlinking a significant fraction of the people living on the planet. At the same time, the development of the World Wide Web makes accessing this wealth of data possible from almost anywhere in the world.
The increased availability of data has resulted in an increase in data awareness. It is no longer acceptable for decision makers to trust their experience and "gut feeling" alone. Increasingly, one expects business decisions to be driven by data.
Finally, the tools for efficiently making sense of and extracting insights from huge data sets are starting to mature: one doesn't need to be an expert in distributed computing to analyze a large data set any more. Apache Spark, for instance, greatly eases writing distributed data analysis applications. The explosion of cloud infrastructure facilitates scaling computing needs to cope with variable data amounts.