
Alexey Grigorev - Java: Data Science Made Easy

Published in 2017 by Packt Publishing.


Data collection, processing, analysis, and more

About This Book

  • Your entry ticket to the world of data science with the stability and power of Java
  • Explore, analyse, and visualize your data effectively using easy-to-follow examples
  • A highly practical course covering a broad set of topics - from the basics of Machine Learning to Deep Learning and Big Data frameworks.

Who This Book Is For

This course is meant for Java developers who are comfortable developing applications in Java and now want to enter the world of data science or build intelligent applications. Aspiring data scientists with some understanding of the Java programming language will also find this book very helpful. If you want to build efficient data science applications and bring them into the enterprise environment without changing your existing Java stack, this book is for you!

What You Will Learn

  • Understand the key concepts of data science
  • Explore the data science ecosystem available in Java
  • Work with the Java APIs and techniques used to perform efficient data analysis
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images, and create your own search
  • Learn how to build deep neural networks with DeepLearning4j
  • Build data science applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail

Data science is concerned with extracting knowledge and insights from a wide variety of data sources in order to analyse patterns or predict future behaviour. It draws on a wide array of disciplines, including statistics, computer science, mathematics, machine learning, and data mining. In this course, we cover basic as well as advanced data science concepts and how they are implemented using popular Java tools and libraries.

The course starts with an introduction to data science, followed by the basic data science tasks of data collection, data cleaning, data analysis, and data visualization. This is followed by a discussion of statistical techniques and more advanced topics, including machine learning, neural networks, and deep learning. You will examine the major categories of data analysis, including text, visual, and audio data, followed by a discussion of resources that support parallel implementation.

Throughout this course, the chapters illustrate a challenging data science problem and then present a comprehensive, Java-based solution to tackle it. You will cover a wide range of topics, from classification and regression to dimensionality reduction, clustering, deep learning, and working with Big Data. Finally, you will see the different ways to deploy a model and evaluate it in production settings.

By the end of this course, you will be up and running with various facets of data science using Java, in no time at all.

This course contains premium content from two of our recently published popular titles:

  • Java for Data Science
  • Mastering Java for Data Science

Style and approach

This course follows a tutorial approach, providing examples of each of the concepts covered. With a step-by-step instructional style, this book covers various facets of data science and will get you up and running quickly.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

Evaluation

We have covered many machine learning libraries, and many of them implement the same algorithms, such as random forest or logistic regression. In addition, each individual model can have many different parameters: a logistic regression has a regularization coefficient, while an SVM is configured by setting the kernel and its parameters.

How do we select the best single model out of so many possible variants?

For that, we first define an evaluation metric and then select the model that achieves the best performance with respect to this metric. For binary classification, there are many metrics we can use for comparison; the most commonly used ones are as follows:

  • Accuracy and error
  • Precision, recall, and F1
  • AUC (AU ROC)
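To make these metrics concrete, here is a minimal plain-Java sketch (our own illustration, not code from the book) that computes accuracy, precision, recall, and F1 from arrays of true and predicted binary labels:

```java
// Minimal sketch: binary classification metrics from label arrays (1 = positive).
public class Metrics {

    // Accuracy: the fraction of predictions that match the true labels.
    public static double accuracy(int[] actual, int[] predicted) {
        int correct = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == predicted[i]) correct++;
        }
        return (double) correct / actual.length;
    }

    // Precision: TP / (TP + FP) -- of everything we flagged positive, how much was right.
    public static double precision(int[] actual, int[] predicted) {
        int tp = 0, fp = 0;
        for (int i = 0; i < actual.length; i++) {
            if (predicted[i] == 1) {
                if (actual[i] == 1) tp++; else fp++;
            }
        }
        return (double) tp / (tp + fp);
    }

    // Recall: TP / (TP + FN) -- of all true positives, how many we found.
    public static double recall(int[] actual, int[] predicted) {
        int tp = 0, fn = 0;
        for (int i = 0; i < actual.length; i++) {
            if (actual[i] == 1) {
                if (predicted[i] == 1) tp++; else fn++;
            }
        }
        return (double) tp / (tp + fn);
    }

    // F1: the harmonic mean of precision and recall.
    public static double f1(int[] actual, int[] predicted) {
        double p = precision(actual, predicted);
        double r = recall(actual, predicted);
        return 2 * p * r / (p + r);
    }
}
```

Accuracy alone can be misleading on imbalanced data, which is why precision, recall, and F1 are usually reported alongside it.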

We use these metrics to see how well the model will generalize to new, unseen data. It is therefore important to simulate this situation by keeping some data unseen by the model, which is typically done by splitting the data into several parts. So, we will also cover the following:

  • Result evaluation
  • K-fold cross-validation
  • Training, validation, and testing
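The splitting idea behind k-fold cross-validation can be sketched in a few lines of plain Java (a hand-rolled illustration of our own, not the book's code): the row indices are shuffled and dealt into k folds, and each fold then serves once as the held-out set while the remaining folds are used for training:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFold {

    // Splits the indices 0..n-1 into k folds of (nearly) equal size.
    public static List<List<Integer>> split(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) indices.add(i);
        // Shuffle with a fixed seed so the split is reproducible.
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        for (int i = 0; i < n; i++) {
            // Round-robin assignment keeps the fold sizes balanced.
            folds.get(i % k).add(indices.get(i));
        }
        return folds;
    }
}
```

In practice, machine learning libraries provide cross-validation out of the box; the sketch only shows the mechanics of the split.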

Let us start with the most intuitive evaluation metric, accuracy.

Supervised learning for texts

Supervised machine learning methods are also quite useful for text data. As in the usual setting, we have label information, which we can use to understand the information contained in texts.

A very common application of supervised learning to texts is spam detection: every time you hit the spam button in your e-mail client, this data is collected and later used to train a classifier that learns to tell spam apart from non-spam e-mails.

In this section, we will look at how to use supervised methods for text through two examples: first, we will build a model for sentiment analysis, and then we will use a ranking classifier to rerank search results.
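Before diving into the examples, it is worth recalling how raw text typically becomes model input. A common first step is a bag-of-words representation, where each document is reduced to token counts. The following is a deliberately simplified sketch of our own, not code from the book:

```java
import java.util.HashMap;
import java.util.Map;

public class BagOfWords {

    // Lower-cases the text, splits on runs of non-letter characters,
    // and counts how often each token occurs in the document.
    public static Map<String, Integer> vectorize(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty()) continue;
            counts.merge(token, 1, Integer::sum);
        }
        return counts;
    }
}
```

A real spam or sentiment model would add proper tokenization, stop-word removal, and TF-IDF weighting on top of these raw counts before feeding them to a classifier.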

Getting Started with Data Science

Data science is not a single science so much as it is a collection of various scientific disciplines integrated for the purpose of analyzing data. Alongside statistical and mathematical techniques, these disciplines include:

  • Computer science
  • Data engineering
  • Visualization
  • Domain-specific knowledge and approaches

With the advent of cheaper storage technology, more and more data has been collected and stored, permitting previously unfeasible processing and analysis. With this analysis came the need for various techniques to make sense of the data. Such large data sets, when analyzed to identify trends and patterns, have become known as big data.

This in turn gave rise to cloud computing and concurrent techniques such as map-reduce, which distribute the analysis process across a large number of processors, taking advantage of the power of parallel processing.
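The map-reduce idea can be illustrated in miniature with Java's parallel streams (a toy, single-machine sketch of our own, not a distributed implementation): the map step turns each input line into words, and the reduce step merges the counts per word:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCount {

    // Map: split each line into words. Reduce: sum the count per word.
    // The parallel stream spreads the work across local CPU cores,
    // mimicking on one machine what map-reduce does across a cluster.
    public static Map<String, Long> count(List<String> lines) {
        return lines.parallelStream()
                .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\s+")))
                .filter(w -> !w.isEmpty())
                .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }
}
```

In a real map-reduce framework, the same two steps run on many machines, with the framework handling data partitioning and the shuffle between the map and reduce phases.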

The process of analyzing big data is not simple, and it has led to the emergence of specialized developers known as data scientists. Drawing upon a myriad of technologies and expertise, they are able to analyze data to solve problems that previously were either not envisioned or were too difficult to solve.

Early big data applications were typified by the emergence of search engines capable of more powerful and accurate searches than their predecessors. For example, AltaVista was an early popular search engine that was eventually superseded by Google. While big data applications were not limited to these search engine functionalities, these applications laid the groundwork for future work in big data.

The term, data science, has been used since 1974 and evolved over time to include statistical analysis of data. The concepts of data mining and data analytics have been associated with data science. Around 2008, the term data scientist appeared and was used to describe a person who performs data analysis. A more in-depth discussion of the history of data science can be found at http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/#3d9ea08369fd .

This book aims to take a broad look at data science using Java and will briefly touch on many topics. It is likely that the reader may find topics of interest and pursue these at greater depth independently. The purpose of this book, however, is simply to introduce the reader to the significant data science topics and to illustrate how they can be addressed using Java.

There are many algorithms used in data science. In this book, we do not attempt to explain how they work except at an introductory level. Rather, we are more interested in explaining how they can be used to solve problems. Specifically, we are interested in knowing how they can be used with Java.

Reading the DBLP graph

To start with this project, we first need to read the graph data, and for this we will use Apache Spark and a few of its libraries. The first of these is Spark DataFrames, which are similar to R data frames, pandas, or joinery, except that they are distributed and based on RDDs.

Let's read this dataset. The first step is to create a special class Edge for storing the data:

public class Edge implements Serializable {
    private final String node1;
    private final String node2;
    private final int year;
    // constructor and getters omitted; the fields are final, so there are no setters
}

Now, let's read the data:

SparkConf conf = new SparkConf().setAppName("graph").setMaster("local[*]");
JavaSparkContext sc = new JavaSparkContext(conf);
JavaRDD<String> edgeFile = sc.textFile("/data/dblp/dblp_coauthorship.json.gz");
JavaRDD<Edge> edges = edgeFile.filter(s -> s.length() > 1).map(s -> {
    Object[] array = JSON.std.arrayFrom(s);
    String node1 = (String) array[0];
    String node2 = (String) array[1];
    Integer year = (Integer) array[2];
    if (year == null) {
        return new Edge(node1, node2, -1);
    }
    return new Edge(node1, node2, year);
});

After setting up the context, we read the data from a text file and then apply a map function to each line to convert it to an Edge. For parsing JSON, we use the Jackson Jr library as before, so make sure you add it to the pom file.

Note that we also include a filter here: the first and last lines of the file contain [ and ], respectively, so we need to skip them.

To check whether we managed to parse the data successfully, we can use the take method: it gets the head of the RDD and puts it into a List, which we can print to the console:

edges.take(5).forEach(System.out::println);

This should produce the following output:

Edge [node1=Alin Deutsch, node2=Mary F. Fernandez, year=1998]
Edge [node1=Alin Deutsch, node2=Daniela Florescu, year=1998]
Edge [node1=Alin Deutsch, node2=Alon Y. Levy, year=1998]
Edge [node1=Alin Deutsch, node2=Dan Suciu, year=1998]
Edge [node1=Mary F. Fernandez, node2=Daniela Florescu, year=1998]

After successfully converting the data, we will put it into a DataFrame, which is part of the Spark SQL package. We can include it with the following dependency:


<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.1.0</version>
</dependency>

To create a DataFrame from our RDD, we first create a SQL session, and then use its createDataFrame method:

SparkSession sql = new SparkSession(sc.sc());
Dataset<Row> df = sql.createDataFrame(edges, Edge.class);

There are quite a lot of papers in the dataset. We can make it smaller by restricting it to papers published in 1990 or later. For this, we can use the filter method:

df = df.filter("year >= 1990");

Next, many pairs of authors can have multiple papers together, and we are interested in the earliest one. We can get it using the min function:

