Indrajit A. Das
Lawrence A. Herman
About the Authors
Garry Turkington has over 15 years of industry experience, most of which has been focused on the design and implementation of large-scale distributed systems. In his current role as the CTO at Improve Digital, he is primarily responsible for the realization of systems that store, process, and extract value from the company's large data volumes. Before joining Improve Digital, he spent time at Amazon.co.uk, where he led several software development teams, building systems that process the Amazon catalog data for every item worldwide. Prior to this, he spent a decade in various government positions in both the UK and the USA.
He has BSc and PhD degrees in Computer Science from Queen's University Belfast in Northern Ireland, and a Master of Engineering degree in Systems Engineering from Stevens Institute of Technology in the USA. He is the author of Hadoop Beginner's Guide, published by Packt Publishing in 2013, and is a committer on the Apache Samza project.
I would like to thank my wife Lea and mother Sarah for their support and patience through the writing of another book and my daughter Maya for frequently cheering me up and asking me hard questions. I would also like to thank Gabriele for being such an amazing co-author on this project.
Gabriele Modena is a data scientist at Improve Digital. In his current position, he uses Hadoop to manage, process, and analyze behavioral and machine-generated data. Gabriele enjoys using statistical and computational methods to look for patterns in large amounts of data. Prior to his current job in ad tech, he held a number of positions in academia and industry, where he did research in machine learning and artificial intelligence.
He holds a BSc degree in Computer Science from the University of Trento, Italy and a Research MSc degree in Artificial Intelligence: Learning Systems, from the University of Amsterdam in the Netherlands.
First and foremost, I want to thank Laura for her support, constant encouragement and endless patience putting up with far too many "can't do, I'm working on the Hadoop book". She is my rock and I dedicate this book to her.
A special thank you goes to Amit, Atdhe, Davide, Jakob, James and Valerie, whose invaluable feedback and commentary made this work possible.
Finally, I'd like to thank my co-author, Garry, for bringing me on board with this project; it has been a pleasure working together.
About the Reviewers
Atdhe Buja is a certified ethical hacker, DBA (MCITP, OCA11g), and developer with good management skills. He is a DBA at the Agency for Information Society / Ministry of Public Administration, where he also manages several e-governance projects, and has more than 10 years' experience working with SQL Server.
Atdhe is a regular columnist for UBT News. Currently, he holds an MSc degree in computer science and engineering and has a bachelor's degree in management and information. He specializes in and is certified in many technologies, such as SQL Server (all versions), Oracle 11g, CEH, Windows Server, MS Project, SCOM 2012 R2, BizTalk, and integration business processes.
He was the reviewer of the book Microsoft SQL Server 2012 with Hadoop, published by Packt Publishing. His capabilities go beyond the aforementioned knowledge!
I thank Donika and my family for all the encouragement and support.
Amit Gurdasani is a software engineer at Amazon. He architects distributed systems to process product catalogue data. Prior to building high-throughput systems at Amazon, he worked across the entire software stack, both as a systems-level developer at Ericsson and IBM and as an application developer at Manhattan Associates. He maintains a strong interest in bulk data processing, data streaming, and service-oriented software architectures.
Jakob Homan has been involved with big data and the Apache Hadoop ecosystem for more than 5 years. He is a Hadoop committer as well as a committer for the Apache Giraph, Spark, Kafka, and Tajo projects, and is a PMC member. He has worked in bringing all these systems to scale at Yahoo! and LinkedIn.
James Lampton is a seasoned practitioner of all things data (big or small) with 10 years of hands-on experience in building and using large-scale data storage and processing platforms. He is a believer in holistic approaches to solving problems using the right tool for the right job. His favorite tools include Python, Java, Hadoop, Pig, Storm, and SQL (which he sometimes likes and sometimes doesn't). He recently completed his PhD at the University of Maryland with the release of Pig Squeal: a mechanism for running Pig scripts on Storm.
I would like to thank my spouse, Andrea, and my son, Henry, for giving me time to read work-related things at home. I would also like to thank Garry, Gabriele, and the folks at Packt Publishing for the opportunity to review this manuscript and for their patience and understanding, as my free time was consumed by writing my dissertation.
Davide Setti, after graduating in physics from the University of Trento, joined the SoNet research unit at the Fondazione Bruno Kessler in Trento, where he applied large-scale data analysis techniques to understand people's behaviors in social networks and large collaborative projects such as Wikipedia.