BIG DATA & HADOOP
Learn by Example
by
Mayank Bhushan
FIRST EDITION 2020
Copyright BPB Publications, INDIA
ISBN: 978-93-8655-199-3
All Rights Reserved. No part of this publication can be stored in a retrieval system or reproduced in any form or by any means without the prior written permission of the publishers.
LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY
The Author and Publisher of this book have tried their best to ensure that the programmes, procedures and functions described in the book are correct. However, the author and the publishers make no warranty of any kind, expressed or implied, with regard to these programmes or the documentation contained in the book. The author and publisher shall not be liable in any event for any damages, incidental or consequential, in connection with, or arising out of the furnishing, performance or use of these programmes, procedures and functions. Product names mentioned are used for identification purposes only and may be trademarks of their respective companies.
All trademarks referred to in the book are acknowledged as properties of their respective owners.
Distributors:
BPB PUBLICATIONS
20, Ansari Road, Darya Ganj
New Delhi-110002
Ph: 23254990/23254991
BPB BOOK CENTRE
Old Lajpat Rai Market,
Delhi-110006
Ph: 23861747
MICRO MEDIA
Shop No. 5, Mahendra Chambers,
DN Rd. Next to Capital Cinema,
V.T. (C.S.T.) Station, MUMBAI-400
Ph: 22078296/22078297
DECCAN AGENCIES
4-3-329, Bank Street,
Hyderabad-500195
Ph: 24756967/24756400
Published by Manish Jain for BPB Publications, 20, Ansari Road, Darya Ganj, New Delhi-110002 and Printed by Repro India Pvt Ltd, Mumbai
Dedicated To
My beloved Family
Mrs. Neelam Sharma/Mr. Gopal Krishna Sharma
Mrs. Aarti/Mr. Shashank
Mrs. Apoorva
Most loving-Anjika
Preface
I am very confident that the present work will come as a relief to students looking for a comprehensive work that explains difficult concepts in layman's language, offers a variety of practical approaches and conceptual problems along with their systematically worked-out solutions, and covers the entire syllabus prescribed at various levels in universities.
This book promises to be a very good starting point for beginners and an asset to advanced users too.
This book is written in line with the syllabi and learning patterns of various universities, and it aims to teach through examples. Difficult concepts of Big Data-Hadoop are presented in an easy and practical way so that students can understand them efficiently. The book also provides screenshots of practical approaches, which can be helpful for students.
It is said, "To err is human, to forgive divine." In this light, I hope that the shortcomings of the book will be forgiven. At the same time, I am open to any kind of constructive criticism and suggestions for further improvement. All intelligent suggestions are welcome, and I will try my best to incorporate such invaluable suggestions in subsequent editions of this book.
23rd March 2018
Mayank Bhushan
Acknowledgement
I would like to express my gratitude to all those who provided support, talked things over, read, wrote, offered comments, allowed me to quote their remarks and assisted in the editing, proofreading and design.
I have relied on many people to guide me directly and indirectly in writing this book. I am very thankful to the Hadoop community, from whom I have continuously learned, and I also owe a debt of gratitude to ABES College for providing me with all the facilities for the Big Data-Hadoop lab.
There is always a sense of gratitude that everyone expresses to others for the helpful and much-needed services they render during difficult phases of life and in achieving the goals already set.
It is impossible to thank everyone individually, but I am hereby making a humble effort to thank some of them. At the outset, I am thankful to the Almighty, who is constantly and invisibly guiding everybody and has also helped me to work on the right path.
I am very thankful to Prof. (Dr.) Shailesh Tiwari, H.O.D. (CSE), ABES Engineering College, Ghaziabad (U.P.), for guiding and supporting me. He is my main source of inspiration. I would also like to thank Dr. Munesh Chandra Trivedi, Dean (REC-Azamgarh), Dr. Pratibha Singh (Prof., ABES Engineering College) and Dr. Shaswati Banerjea, Asst. Prof. (MNNIT Allahabad), who have always supported me. Without their help this book would not have been possible. I am indebted to my dearest friend and colleague, Mr. Omesh Kumar, who guided me through every technical problem.
I extend my thanks to all my gurus, friends and colleagues who helped and kept me motivated while writing this text. Special thanks to:
Dr. K.K. Mishra, MNNIT Allahabad
Dr. Mayank Pandey, MNNIT Allahabad
Dr. Shashank Srivastava, MNNIT Allahabad
Mr. Nitin Shukla, MNNIT Allahabad
Mr. Suraj Deb Barma, Govt. Polytechnic College, Agartala
Dr. A.L.N. Rao, GL Bajaj, Greater Noida
Mr. Ankit Yadav, Mr. Desh Deepak Pathak, ABES EC Ghaziabad
Dr. Sumit Yadav, IP University
Mr. Aatif Jamshed, Galgotia College, Greater Noida
I also thank the Publisher and the whole staff at BPB Publications, especially Mr. Manish Jain, for bringing out this text in a nicely presentable form.
Finally, I want to thank everyone who has directly or indirectly contributed to the completion of this work.
Mayank Bhushan
Table of Contents
Chapter 1: Big Data-Introduction and Demand
1.1 Big Data
1.1.1 Characteristics of Big Data
1.1.2 Why Big Data
1.2 Hadoop
1.2.1 History of Hadoop
1.2.2 Name of Hadoop
1.2.3 Hadoop Ecosystem
1.3 Convergence of Key Trends
1.3.1 Convergence of Big Data into Business
1.3.2 Big Data vs Other Techniques
1.4 Unstructured Data
1.5 Industry examples of Big Data
1.5.1 Use of Big Data-Hadoop at Yahoo
1.5.2 In Rackspace for log processing
1.5.3 Hadoop at Facebook
1.6 Usages of Big Data
1.6.1 Web analytics
1.6.2 Big Data and marketing
1.6.3 Big Data and fraud
1.6.4 Risk management in Big Data with Credit card
1.6.5 Big Data and algorithm trading
1.6.6 Big Data in Healthcare
Chapter 2: NoSQL Data Management
2.1 Introduction to NoSQL database
2.1.1 Terminology used in NoSQL and RDBMS
2.1.2 Database use in NoSQL
2.2 SQL Vs NoSQL
2.2.1 Denormalization
2.2.2 Data distribution
2.2.3 Data Durability
2.3 Consistency in NoSQL
2.3.1 ACID Vs BASE
2.3.2 Relaxing Consistency
2.4 HBase
2.4.1 Installation
2.4.2 History
2.4.3 HBase Data Structure
2.4.4 Physical Storage
2.4.5 Components
2.4.6 HBase Shell Commands
2.4.7 The different usages of scan command
2.4.8 Terminologies