
Frank Kane's Taming Big Data with Apache Spark and Python


Contents

  • 1: Getting Started with Spark
    • Getting set up - installing Python, a JDK, and Spark and its dependencies
    • Installing the MovieLens movie rating dataset
    • Run your first Spark program - the ratings histogram example
    • Summary
  • 2: Spark Basics and Spark Examples
    • What is Spark?
    • The Resilient Distributed Dataset (RDD)
    • Ratings histogram walk-through
    • Key/value RDDs and the average friends by age example
    • Running the average friends by age example
    • Filtering RDDs and the minimum temperature by location example
    • Running the minimum temperature example and modifying it for maximums
    • Running the maximum temperature by location example
    • Counting word occurrences using flatmap()
    • Improving the word-count script with regular expressions
    • Sorting the word count results
    • Find the total amount spent by customer
    • Check your results and sort them by the total amount spent
    • Check your sorted implementation and results against mine
    • Summary
  • 3: Advanced Examples of Spark Programs
    • Finding the most popular movie
    • Using broadcast variables to display movie names instead of ID numbers
    • Finding the most popular superhero in a social graph
    • Running the script - discover who the most popular superhero is
    • Superhero degrees of separation - introducing the breadth-first search algorithm
    • Accumulators and implementing BFS in Spark
    • Superhero degrees of separation - review the code and run it
    • Item-based collaborative filtering in Spark, cache(), and persist()
    • Running the similar-movies script using Spark's cluster manager
    • Improving the quality of the similar movies example
    • Summary
  • 4: Running Spark on a Cluster
    • Introducing Elastic MapReduce
    • Setting up our Amazon Web Services / Elastic MapReduce account and PuTTY
    • Partitioning
    • Creating similar movies from one million ratings - part 1
    • Creating similar movies from one million ratings - part 2
    • Creating similar movies from one million ratings - part 3
    • Troubleshooting Spark on a cluster
    • More troubleshooting and managing dependencies
    • Summary
  • 5: SparkSQL, DataFrames, and DataSets
    • Introducing SparkSQL
    • Executing SQL commands and SQL-style functions on a DataFrame
    • Using DataFrames instead of RDDs
    • Summary
  • 6: Other Spark Technologies and Libraries
    • Introducing MLlib
    • Using MLlib to produce movie recommendations
    • Analyzing the ALS recommendations results
    • Using DataFrames with MLlib
    • Spark Streaming and GraphX
    • Summary
  • 7: Where to Go From Here? Learning More About Spark and Data Science
Chapter 1. Getting Started with Spark

Spark is one of the hottest technologies in big data analysis right now, and with good reason. If you work for, or hope to work for, a company that has massive amounts of data to analyze, Spark offers a fast and easy way to spread that analysis across an entire cluster of computers. This is a very valuable skill to have right now.

My approach in this book is to start with some simple examples and work our way up to more complex ones. We'll have some fun along the way too. We will use movie ratings data and play around with similar movies and movie recommendations. I also found a social network of superheroes, if you can believe it; we can use this data to figure out things such as who the most popular superhero is in the fictional superhero universe. Have you heard of the Kevin Bacon number, the idea that everyone in Hollywood is supposedly connected to Kevin Bacon within a few steps? We can do the same thing with our superhero data and figure out the degrees of separation between any two superheroes in their fictional universe. So, we'll take some real examples and turn them into Spark problems. Using Apache Spark is easier than you might think and, with all the exercises and activities in this book, you'll get plenty of practice as we go along. I'll guide you through every line of code and every concept you need. So let's get started and learn Apache Spark.

Getting set up - installing Python, a JDK, and Spark and its dependencies

Let's get you started. There is a lot of software we need to set up. Running Spark on Windows involves a lot of moving pieces, so make sure you follow along carefully, or else you'll have some trouble. I'll try to walk you through it as easily as I can. This chapter is written for Windows users, but that doesn't mean you're out of luck if you're on Mac or Linux. If you open up the download package for the book or go to http://media.sundog-soft.com/spark-python-install.pdf , you will find written instructions on getting everything set up on Windows, macOS, and Linux. I will call out the steps that are specific to Windows, so the chapter is useful on other platforms as well. Either refer to that spark-python-install.pdf file or follow the instructions here on Windows, and let's dive in and get it done.

Installing Enthought Canopy

This book uses Python as its programming language, so the first thing you need is a Python development environment installed on your PC. If you don't have one already, open up a web browser, head to https://www.enthought.com/ , and we'll install Enthought Canopy.

Enthought Canopy is just my development environment of choice; if you already have a different one, that's probably okay. As long as it's a Python 3 environment, you should be covered, but if you need to install a new Python environment or you just want to minimize confusion, I'd recommend that you install Canopy. So, click the big, friendly Download Canopy button on the site and select your operating system and architecture.

For me, the operating system is going to be Windows (64-bit). Make sure you choose Python 3.5 or a newer version of the package. I can't guarantee the scripts in this book will work with Python 2.7; they are built for Python 3, so select Python 3.5 or newer for your OS and download the installer.
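If you already have a Python environment and are not sure whether it meets the requirement above, a quick check like the following can tell you. This is only a minimal sketch: the `MIN_VERSION` constant and the `is_supported` helper are names made up for illustration, and the 3.5 threshold is simply the version mentioned in the text.

```python
import sys

# Assumption: 3.5 is the minimum version, as stated in the text above.
MIN_VERSION = (3, 5)


def is_supported(version_info=sys.version_info, minimum=MIN_VERSION):
    """Return True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    # Build a readable version string such as "3.8.10" from the running interpreter.
    current = ".".join(str(part) for part in sys.version_info[:3])
    if is_supported():
        print("Python " + current + " should be fine for the scripts in this book.")
    else:
        print("Python " + current + " is older than 3.5; consider installing a newer Python 3.")
```

Save it under any file name you like and run it with the interpreter you plan to use, or paste it into that interpreter's interactive prompt.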
