Bikramaditya Singhal - Spark for Data Science

Year: 2016. Publisher: Packt Publishing. Genre: Computer.

Analyze your data and delve deep into the world of machine learning with the latest Spark version, 2.0

About This Book

  • Perform data analysis and build predictive models on huge datasets that leverage Apache Spark
  • Learn to integrate data science algorithms and techniques with the fast and scalable computing features of Spark to address big data challenges
  • Work through practical examples on real-world problems with sample code snippets

Who This Book Is For

This book is for anyone who wants to leverage Apache Spark for data science and machine learning. If you are a technologist who wants to expand your knowledge to perform data science operations in Spark, or a data scientist who wants to understand how algorithms are implemented in Spark, or a newbie with minimal development experience who wants to learn about Big Data Analytics, this book is for you!

What You Will Learn

  • Consolidate, clean, and transform your data acquired from various data sources
  • Perform statistical analysis of data to find hidden insights
  • Explore graphical techniques to see what your data looks like
  • Use machine learning techniques to build predictive models
  • Build scalable data products and solutions
  • Start programming using the RDD, DataFrame and Dataset APIs
  • Become an expert by improving your data analytical skills

In Detail

This is the era of Big Data. The words Big Data imply big innovation and a competitive advantage for businesses. Apache Spark was designed to perform Big Data analytics at scale, so it is equipped with the necessary algorithms and supports multiple programming languages.

Whether you are a technologist, a data scientist, or a beginner in Big Data analytics, this book will provide you with all the skills necessary to perform statistical data analysis, data visualization, and predictive modeling, and to build scalable data products or solutions using Python, Scala, and R.

With ample case studies and real-world examples, Spark for Data Science will help you ensure the successful execution of your data science projects.

Style and approach

This book takes a step-by-step approach to statistical analysis and machine learning, and is written in a conversational, easy-to-follow style. Each topic is explained sequentially, with a focus on both the fundamentals and the advanced concepts of algorithms and techniques. Real-world examples with sample code snippets are also included.

Downloading the example code for this book. You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the code file.

Chapter 1. Big Data and Data Science: An Introduction

Big data is definitely a big deal! It promises a wealth of opportunities by deriving hidden insights in huge data silos and by opening new avenues to excel in business. Leveraging big data through advanced analytics techniques has become a no-brainer for organizations to create and maintain their competitive advantage.

This chapter explains what big data is all about, the various challenges of big data analysis, and how Apache Spark has become the de facto standard for addressing the computational challenges while also serving as a data science platform.

The topics covered in this chapter are as follows:

  • Big data overview - what is all the fuss about?
  • Challenges with big data analytics - why was it so difficult?
  • Evolution of big data analytics - the data analytics trend
  • Spark for data analytics - the solution to big data challenges
  • The Spark stack - all that makes it up for a complete big data solution

Big data overview

Much has already been said and written about what big data is, but there is no specific standard that clearly defines it. To some extent, it is a relative term. Whether small or big, your data can be leveraged only if you can analyze it properly. To make sense of your data, you need the right set of analysis techniques, and selecting the right tools and techniques is of utmost importance in data analytics. However, when the data itself becomes part of the problem, and the computational challenges need to be addressed before any data analysis can be performed, it becomes a big data problem.

A revolution took place in the World Wide Web, also referred to as Web 2.0, which changed the way people used the Internet. Static web pages became interactive websites and started collecting more and more data. Technological advancements in cloud computing, social media, and mobile computing created an explosion of data. Every digital device started emitting data and many other sources started driving the data deluge. The dataflow from every nook and corner generated varieties of voluminous data, at speed! The formation of big data in this fashion was a natural phenomenon, because this is how the World Wide Web had evolved; no explicit effort was involved. That is the past! If you consider the change that is happening now, and is going to happen in the future, the volume and speed of data generation are beyond what one can anticipate. I am compelled to make such a statement because every device is getting smarter these days, thanks to the Internet of Things (IoT).

The IT trend was such that technological advancements also facilitated the data explosion. Data storage experienced a paradigm shift with the advent of cheaper clusters of online storage pools and the availability of commodity hardware at bare-minimum prices. Storing data from disparate sources in its native form in a single data lake was rapidly gaining ground over carefully designed data marts and data warehouses. Usage patterns also shifted from rigid, schema-driven, RDBMS-based approaches to schema-less, continuously available solutions built on NoSQL data stores. As a result, the rate of data creation, whether structured, semi-structured, or unstructured, started accelerating like never before.

Organizations are convinced that leveraging big data can not only answer specific business questions but also bring opportunities to uncover untapped possibilities in the business and address the associated uncertainties. So, apart from the natural data influx, organizations started devising strategies to generate more and more data to maintain their competitive advantage and to be future ready. An example helps here. Imagine that sensors installed on the machines of a manufacturing plant constantly emit data about the status of the machine parts, so that the company is able to predict when a machine is going to fail. This lets the company prevent a failure or damage and avoid unplanned downtime, saving a lot of money.

Challenges with big data analytics

There are broadly two types of formidable challenges in the analysis of big data. The first challenge is the requirement for a massive computation platform, and once it is in place, the second challenge is to analyze and make sense out of huge data at scale.

Computational challenges

With the increase in data, the storage requirement for big data also grew. Data management became a cumbersome task. The latency involved in accessing disk storage, caused by seek time, became the major bottleneck, even though processor speed and RAM frequency were up to the mark.

Fetching structured and unstructured data from across the gamut of business applications and data silos, consolidating it, and processing it to find useful business insights was challenging. Only a few applications could address any one area, or just a few areas, of the diverse business requirements. Integrating those applications to address most of the business requirements in a unified way only increased the complexity.

To address these challenges, people turned to distributed computing frameworks with distributed file systems, for example, Hadoop and the Hadoop Distributed File System (HDFS). This could greatly reduce the latency due to disk I/O, as the data could be read in parallel across the cluster of machines.
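The idea of reading and processing blocks of data in parallel and then combining the partial results can be sketched on a single machine. The following is only an illustration under stated assumptions: it uses Python's standard `concurrent.futures` module rather than Hadoop or HDFS, and the names `count_words`, `parallel_word_count`, and the sample `blocks` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def count_words(block):
    """Process one block independently, as a single worker might."""
    return len(block.split())


def parallel_word_count(blocks, workers=4):
    """Map each block to a partial count in parallel, then combine the partials."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, blocks))
    return sum(partial_counts)


# Hypothetical blocks standing in for pieces of one large file
# that a distributed file system would spread across machines.
blocks = [
    "big data is a big deal",
    "spark runs on a cluster",
    "data is read in parallel",
]
total_words = parallel_word_count(blocks)
print(total_words)
```

In a real cluster each block would live on a different machine and be read from a different disk, which is where the I/O benefit comes from; the threads here only mimic that structure.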

Distributed computing technologies had existed for decades before, but gained more prominence only after the importance of big data was realized in the industry. So, technology platforms such as Hadoop and HDFS or Amazon S3 became the industry standard. On top of Hadoop, many other solutions such as Pig, Hive, Sqoop, and others were developed to address different kinds of industry requirements such as storage, Extract, Transform, and Load ( ETL ), and data integration to make Hadoop a unified platform.

Analytical challenges

Analyzing data to find hidden insights has always been challenging, and huge datasets add further intricacies. The traditional BI and OLAP solutions could not address most of the challenges that arose due to big data. For example, if a dataset had many dimensions, say 100, it became really difficult to compare these variables with one another to draw a conclusion, because there would be 100C2 = 4,950 pairwise combinations to examine. Such cases required statistical techniques such as correlation to find the hidden patterns.
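As a small illustration of the arithmetic above, the sketch below counts the variable pairs for 100 dimensions and computes a Pearson correlation for every pair of a tiny dataset. It uses only the Python standard library; the `pearson` helper and the sample `columns` are hypothetical and are not taken from the book.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Number of distinct variable pairs among 100 dimensions ("100C2").
pair_count = math.comb(100, 2)

# Hypothetical three-column dataset; a real one would have far more columns.
columns = {
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],
    "c": [4, 3, 2, 1],
}

# One correlation per pair of columns.
correlations = {
    (left, right): pearson(columns[left], columns[right])
    for left, right in combinations(columns, 2)
}
```

With three columns there are only three pairs to inspect; the pair count grows quadratically with the number of columns, which is why manual comparison quickly becomes impractical.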

Though there were statistical solutions to many problems, it was really difficult for data scientists or analytics professionals to slice and dice the data to find intelligent insights unless they loaded the entire dataset into a DataFrame in memory. The major roadblock was that most of the general-purpose algorithms for statistical analysis and machine learning were single-threaded and written at a time when datasets were usually not so huge and could fit in the RAM of a single computer. Those algorithms, written in R or Python, could not be deployed in their native form on a distributed computing environment because they assumed in-memory computation on a single machine.
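To make the limitation concrete, consider something as simple as a mean. A single-machine version needs every value in one place, whereas a distributed version can reduce each partition to a small summary and merge the summaries. The following is a minimal sketch of that general idea, not the method of any particular library; `summarize`, `merge`, `distributed_mean`, and the sample `partitions` are hypothetical names.

```python
from functools import reduce


def summarize(partition):
    """Reduce one partition to a small (count, total) pair.

    Only this pair, not the raw data, would need to leave the node.
    """
    return (len(partition), sum(partition))


def merge(left, right):
    """Combine two partial summaries; the order of merging does not matter."""
    return (left[0] + right[0], left[1] + right[1])


def distributed_mean(partitions):
    """Compute the mean from per-partition summaries instead of the full dataset."""
    count, total = reduce(merge, map(summarize, partitions), (0, 0.0))
    if count == 0:
        raise ValueError("cannot compute the mean of an empty dataset")
    return total / count


# Hypothetical partitions standing in for data spread across machines.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
mean_value = distributed_mean(partitions)
print(mean_value)
```

Many statistics can be restructured this way, but algorithms that assume random access to the whole dataset need a more substantial redesign.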

To address this challenge, statisticians and computer scientists had to work together to rewrite most of the algorithms that would work well in a distributed computing environment. Consequently, a library called Mahout
