Databricks Spark Reference Applications

Batch Data Import

This section covers batch importing of data into Apache Spark, as seen in the non-streaming examples from Chapter 1. Those examples load data from files all at once into one RDD, process that RDD, and exit once the job completes. In a production system, you could set up a cron job to kick off a batch job each night that processes the last day's worth of log files and then publishes statistics for that day. A minimal sketch of this batch pattern appears after the list below.

  • The first subsection covers caveats when importing data from files.
  • The second links to examples of reading data from databases.
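
Below is a minimal sketch of that batch pattern, assuming a hypothetical input path and a simple space-delimited log format; it is an illustration, not the book's exact example.

import org.apache.spark.{SparkConf, SparkContext}

object DailyBatchJob {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("Daily Log Batch"))
    // Load yesterday's log files all at once into one RDD (hypothetical path).
    val logLines = sc.textFile("/var/logs/access-yesterday.log")
    // Process the RDD: count response codes, assuming the code is the last field.
    val responseCodeCounts = logLines
      .map(line => line.split(" ").last)
      .countByValue()
    responseCodeCounts.foreach { case (code, count) => println(s"$code: $count") }
    // The job completes and the program exits.
    sc.stop()
  }
}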
Built-In Methods for Streaming Import

The StreamingContext has many built-in methods for importing data into a stream. socketTextStream was introduced in the previous chapter, and textFileStream is introduced here. The textFileStream method monitors any Hadoop-compatible filesystem directory for new files and, when it detects a new file, reads it into Spark Streaming. Just replace the call to socketTextStream with textFileStream, and pass in the directory to monitor for log files.

// This method monitors a directory for new files
// to read in for streaming.
JavaDStream<String> logData = jssc.textFileStream(directory);

Try running the example by specifying a directory. You'll also need to drop or copy some new log files into that directory while the program is running to see the calculated values update.

There are more built-in input methods for streaming - check them out in the reference API documents for the StreamingContext.
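
As a rough illustration, here is a minimal Scala sketch (the book's snippet above is Java) that wires a StreamingContext to either of the two inputs discussed so far; the host, port, directory, and batch interval are placeholder choices.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("Streaming Import Sketch")
val ssc = new StreamingContext(conf, Seconds(10))

// Either read text lines from a TCP socket...
val socketLines = ssc.socketTextStream("localhost", 9999)
// ...or monitor a directory for newly created files.
val fileLines = ssc.textFileStream("/tmp/logs")  // hypothetical directory to watch

fileLines.count().print()
ssc.start()
ssc.awaitTermination()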

Part 1: Collect a Dataset of Tweets

Spark Streaming is used to collect tweets as the dataset. The tweets are written out in JSON format, one tweet per line. A file of tweets is written every time interval until at least the desired number of tweets is collected.

See Collect.scala for the full code. We'll walk through some of the interesting bits now.

Collect.scala takes in the following argument list:

  1. outputDirectory - the output directory for writing the tweets. The files will be named 'part-%05d'.
  2. numTweetsToCollect - the minimum number of tweets to collect before the program exits.
  3. intervalInSeconds - write out a new set of tweets every interval.
  4. partitionsEachInterval - used to control the number of output files written for each interval.

Collect.scala will also require Twitter API credentials. If you have never signed up for Twitter API credentials, you can do so on Twitter's developer site. The Twitter credentials are passed in through command-line flags.
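
The book does not show the flag parsing itself here; as one hedged sketch, the parsed flags could be turned into a twitter4j OAuthAuthorization like this (the helper name and wiring are assumptions, not necessarily what Utils.getAuth does).

import twitter4j.auth.OAuthAuthorization
import twitter4j.conf.ConfigurationBuilder

// Hypothetical helper: build OAuth credentials from the parsed command-line flags.
def authFromFlags(consumerKey: String, consumerSecret: String,
                  accessToken: String, accessTokenSecret: String): OAuthAuthorization = {
  val conf = new ConfigurationBuilder()
    .setOAuthConsumerKey(consumerKey)
    .setOAuthConsumerSecret(consumerSecret)
    .setOAuthAccessToken(accessToken)
    .setOAuthAccessTokenSecret(accessTokenSecret)
    .build()
  new OAuthAuthorization(conf)
}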

Below is a snippet of the actual code in Collect.scala. The code calls TwitterUtils in the Spark Streaming Twitter library to get a DStream of tweets. Then, map is called to convert the tweets to JSON format. Finally, foreachRDD is called on the DStream. This example repartitions the RDD before writing it out so that you can control the number of output files.

val tweetStream = TwitterUtils.createStream(ssc, Utils.getAuth)
  .map(gson.toJson(_))

tweetStream.foreachRDD((rdd, time) => {
  val count = rdd.count()
  if (count > 0) {
    val outputRDD = rdd.repartition(partitionsEachInterval)
    outputRDD.saveAsTextFile(
      outputDirectory + "/tweets_" + time.milliseconds.toString)
    numTweetsCollected += count
    if (numTweetsCollected > numTweetsToCollect) {
      System.exit(0)
    }
  }
})

Run Collect.scala yourself to collect a dataset of tweets:

% ${YOUR_SPARK_HOME}/bin/spark-submit \
    --class "com.databricks.apps.twitter_classifier.Collect" \
    --master ${YOUR_SPARK_MASTER:-local[4]} \
    target/scala-2.10/spark-twitter-lang-classifier-assembly-1.0.jar \
    ${YOUR_OUTPUT_DIR:-/tmp/tweets} \
    ${NUM_TWEETS_TO_COLLECT:-10000} \
    ${OUTPUT_FILE_INTERVAL_IN_SECS:-10} \
    ${OUTPUT_FILE_PARTITIONS_EACH_INTERVAL:-1} \
    --consumerKey ${YOUR_TWITTER_CONSUMER_KEY} \
    --consumerSecret ${YOUR_TWITTER_CONSUMER_SECRET} \
    --accessToken ${YOUR_TWITTER_ACCESS_TOKEN} \
    --accessTokenSecret ${YOUR_TWITTER_ACCESS_SECRET}
Examine with Spark SQL

Spark SQL can be used to examine the collected tweets. Below are some relevant code snippets.

First, here is code to pretty-print 5 sample tweets so that they are more human-readable.

val tweets = sc.textFile(tweetInput)
for (tweet <- tweets.take(5)) {
  println(gson.toJson(jsonParser.parse(tweet)))
}

Spark SQL can load JSON files and infer the schema from the data. Here is the code to load the JSON files, register the data in a temp table called "tweetTable", and print out the inferred schema.

val tweetTable = sqlContext.jsonFile(tweetInput)
tweetTable.registerTempTable("tweetTable")
tweetTable.printSchema()

Now, look at the text of 10 sample tweets.

sqlContext.sql( "SELECT text FROM tweetTable LIMIT 10" ) .collect().foreach(println)

View the user language, user name, and text for 10 sample tweets.

sqlContext.sql( "SELECT user.lang, user.name, text FROM tweetTable LIMIT 10" ) .collect().foreach(println)

Finally, show the count of tweets by user language. This can help determine an appropriate number of clusters for this dataset of tweets.

sqlContext.sql( "SELECT user.lang, COUNT(*) as cnt FROM tweetTable " + "GROUP BY user.lang ORDER BY cnt DESC limit 1000" ) .collect.foreach(println)
Part 2: Examine Tweets and Train a Model

The second program examines the data found in tweets and trains a language classifier using K-Means clustering on the tweets:

  • Spark SQL is used to gather data about the tweets: to look at a few of them, and to count the total number of tweets for the most common user languages.
  • Spark MLlib is used to apply the K-Means algorithm for clustering the tweets. The number of clusters and the number of iterations of the algorithm are configurable. After training the model, some sample tweets from the different clusters are shown. A hedged sketch of this clustering step follows the list.
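
Here is a hedged sketch of that clustering step, assuming character-bigram features hashed with HashingTF; the feature size and featurization are illustrative assumptions, not necessarily the book's exact settings.

import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Train a K-Means model on tweet text. numClusters and numIterations are the
// configurable values mentioned above.
def trainLanguageClusters(texts: RDD[String],
                          numClusters: Int,
                          numIterations: Int): KMeansModel = {
  val tf = new HashingTF(1000)  // hash features into a 1000-dimensional vector (assumed size)
  // Character bigrams as features (an assumption about the featurization).
  val vectors: RDD[Vector] = texts.map(t => tf.transform(t.toLowerCase.sliding(2).toSeq))
  vectors.cache()
  KMeans.train(vectors, numClusters, numIterations)
}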

See for the command to run part 2.

HDFS

HDFS is a file system designed for storing large data sets and for fault tolerance. In a production system, your Spark cluster should ideally be on the same machines as your Hadoop cluster to make it easy to read files. The Spark binary you run on your clusters must be compiled with the same HDFS version as the one you wish to use.

There are many ways to install HDFS; heading to the Hadoop homepage is one way to get started and run HDFS locally on your machine.

Run the batch import example on any file pattern in your HDFS directory, as sketched below.
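
As a hedged sketch, pointing a batch job at HDFS only requires an hdfs:// URI; the namenode host, port, and path below are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("HDFS Batch Import"))
// Glob patterns work with textFile, so a whole directory of logs can be read at once.
val logs = sc.textFile("hdfs://namenode:8020/logs/*.log")
println(s"Read ${logs.count()} log lines from HDFS")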

Reading from Databases

Most likely, you aren't going to be storing your log data in a database (that is likely too expensive), but there may be other data you want to feed into Spark that is stored in a database. Perhaps that data can be joined with the logs to provide more information.

Just as file systems have evolved over time to scale, so have databases.

A simple place to begin is a single database instance; SQL databases are quite common. When that fills up, one option is to buy a larger machine for the database. These larger machines get increasingly expensive (even per unit of storage), and eventually it is no longer possible to buy a machine big enough. A common choice at that point is to switch to sharded databases, where application-level code determines which database shard a piece of data should be read from or written to. A hedged sketch of reading rows from a single database into Spark follows.
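
This sketch uses JdbcRDD, which matches the Spark 1.x-era APIs used elsewhere in this chapter and partitions a query over a numeric key range; the JDBC URL, credentials, table, and column names below are hypothetical.

import java.sql.DriverManager
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.JdbcRDD

val sc = new SparkContext(new SparkConf().setAppName("DB Import Sketch"))

// The query must contain exactly two '?' placeholders, which JdbcRDD binds to
// the lower and upper bounds of each partition's key range.
val users = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost:3306/app", "user", "secret"),
  "SELECT id, country FROM users WHERE id >= ? AND id <= ?",
  1L, 1000000L,  // overall key range to split across partitions
  10,            // number of partitions
  rs => (rs.getLong("id"), rs.getString("country"))
)
println(s"Loaded ${users.count()} user rows to join with the logs")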
