Alexey Grigorev - Mastering Java for Data Science

Packt Publishing, 2017

Mastering Java for Data Science: summary and description

Use Java to create a diverse range of Data Science applications and bring Data Science into production

About This Book

  • An overview of modern Data Science and Machine Learning libraries available in Java
  • Coverage of a broad set of topics, going from the basics of Machine Learning to Deep Learning and Big Data frameworks
  • Easy-to-follow illustrations and the running example of building a search engine

Who This Book Is For

This book is intended for software engineers who are comfortable with developing Java applications and are familiar with the basic concepts of data science. Additionally, it will also be useful for data scientists who do not yet know Java but want or need to learn it.

If you want to build efficient data science applications and bring them into the enterprise environment without changing the existing stack, this book is for you!

What You Will Learn

  • Get a solid understanding of the data processing toolbox available in Java
  • Explore the data science ecosystem available in Java
  • Find out how to approach different machine learning problems with Java
  • Process unstructured information such as natural language text or images
  • Create your own search engine
  • Get state-of-the-art performance with XGBoost
  • Learn how to build deep neural networks with DeepLearning4j
  • Build applications that scale and process large amounts of data
  • Deploy data science models to production and evaluate their performance

In Detail

Java is the most popular programming language, according to the TIOBE index, and it is a typical choice for running production systems in many companies, both in the startup world and among large enterprises.

Not surprisingly, it is also a common choice for creating data science applications: it is fast and has a great set of data processing tools, both built-in and external. What is more, choosing Java for data science allows you to easily integrate solutions with existing software, and bring data science into production with less effort.

This book will teach you how to create data science applications with Java. First, we will review the most important considerations when starting a data science application, and then brush up on the basics of Java and machine learning before diving into more advanced topics. We start by going over the existing libraries for data processing and the libraries with machine learning algorithms. After that, we cover topics such as classification and regression, dimensionality reduction and clustering, information retrieval and natural language processing, and deep learning and big data.

Finally, we finish the book by talking about the ways to deploy the model and evaluate it in production settings.

Style and approach

This is a practical guide where all the important concepts such as classification, regression, and dimensionality reduction are explained with the help of examples.

Downloading the example code for this book: you can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files emailed directly to you.

Collections

Data is the most important part of data science. When dealing with data, it needs to be efficiently stored and processed, and for this we use data structures. A data structure describes a way to store data efficiently to solve a specific problem, and the Java Collection API is the standard Java API for data structures. This API offers a wide variety of implementations that are useful in practical data science applications.

We will not describe the Collection API in full detail, but concentrate on the most useful and important parts--the List, Set, and Map interfaces.

Lists are collections where each element can be accessed by its index. The go-to implementation of the List interface is ArrayList, which should be used in 99% of cases. It can be used as follows:

List<String> list = new ArrayList<>();
list.add("alpha");
list.add("beta");
list.add("beta");
list.add("gamma");
System.out.println(list);

There are other implementations of the List interface, such as LinkedList and CopyOnWriteArrayList, but they are rarely needed.

Set is another interface in the Collections API, and it describes a collection which allows no duplicates. The go-to implementation is HashSet, if the order in which we insert elements does not matter, or LinkedHashSet, if the order matters. We can use it as follows:

Set<String> set = new HashSet<>();
set.add("alpha");
set.add("beta");
set.add("beta");
set.add("gamma");
System.out.println(set);

List and Set both implement the Iterable interface, which makes it possible to use the for-each loop with them:

for (String el : set) {
    System.out.println(el);
}

The Map interface allows mapping keys to values, and is sometimes called a dictionary or an associative array in other languages. The go-to implementation is HashMap:

Map<String, String> map = new HashMap<>();
map.put("alpha", "");
map.put("beta", "");
map.put("gamma", "");
System.out.println(map);

If you need to keep the insertion order, you can use LinkedHashMap; if you know that the map will be accessed from multiple threads, use ConcurrentHashMap.
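To illustrate the difference, here is a small self-contained sketch (not from the book; the class name is illustrative) showing that LinkedHashMap iterates over its entries in insertion order, which plain HashMap does not guarantee:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MapOrderExample {
    public static void main(String[] args) {
        // LinkedHashMap iterates over entries in insertion order
        Map<String, Integer> ordered = new LinkedHashMap<>();
        ordered.put("gamma", 3);
        ordered.put("alpha", 1);
        ordered.put("beta", 2);
        System.out.println(ordered); // prints {gamma=3, alpha=1, beta=2}
    }
}
```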

The Collections class provides several helper methods for dealing with collections, such as sorting or extracting the max or min elements:

String min = Collections.min(list);
String max = Collections.max(list);
System.out.println("min: " + min + ", max: " + max);
Collections.sort(list);
Collections.shuffle(list);
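The sort call above uses the natural (alphabetical) ordering of strings. Although the book does not show it at this point, since Java 8 a list can also be sorted with a custom Comparator; a minimal sketch (the class name is illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortExample {
    public static void main(String[] args) {
        List<String> list = new ArrayList<>(Arrays.asList("gamma", "alpha", "beta"));
        Collections.sort(list); // natural ordering
        System.out.println(list); // prints [alpha, beta, gamma]
        // custom ordering: longest strings first (stable for equal lengths)
        list.sort(Comparator.comparingInt(String::length).reversed());
        System.out.println(list); // prints [alpha, gamma, beta]
    }
}
```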

There are other collections, such as Queue, Deque, Stack, and the thread-safe collections. They are less frequently used and not very important for data science.
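For completeness, here is a small sketch (not from the book; the class name is illustrative) showing ArrayDeque, the standard implementation of the Deque interface, used both as a stack and as a queue:

```java
import java.util.ArrayDeque;
import java.util.Deque;

public class DequeExample {
    public static void main(String[] args) {
        Deque<String> deque = new ArrayDeque<>();
        // used as a stack: push and pop operate on the head
        deque.push("first");
        deque.push("second");
        System.out.println(deque.pop()); // prints second
        // used as a queue: offer appends to the tail, poll removes from the head
        deque.offer("third");
        System.out.println(deque.poll()); // prints first
    }
}
```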

JSAT

Java Statistical Analysis Tool (JSAT) is another Java library that contains implementations of many commonly used machine learning algorithms. You can check the full list of implemented models at https://github.com/EdwardRaff/JSAT/wiki/Algorithms.

To include JSAT in a Java project, add the following snippet to pom.xml:

<dependency>
  <groupId>com.edwardraff</groupId>
  <artifactId>JSAT</artifactId>
  <version>0.0.5</version>
</dependency>
Unlike Smile models, which require just an array of doubles with the feature information, JSAT requires a special wrapper class for data. If we have an array, it is converted to the JSAT representation like this:

double[][] X = ... // data
int[] y = ... // labels
// 2 classes here; use a larger number for multiclass classification
CategoricalData binary = new CategoricalData(2);
List<DataPointPair<Integer>> data = new ArrayList<>(X.length);
for (int i = 0; i < X.length; i++) {
    int target = y[i];
    DataPoint row = new DataPoint(new DenseVector(X[i]));
    data.add(new DataPointPair<Integer>(row, target));
}
ClassificationDataSet dataset = new ClassificationDataSet(data, binary);

Once we have prepared the dataset, we can train a model. Let's consider the Random Forest classifier again:

RandomForest model = new RandomForest();
model.setFeatureSamples(4);
model.setMaxForestSize(150);
model.trainC(dataset);

First, we set some parameters for the model, and then, in the end, we call the trainC method (which means train a classifier).

In the JSAT implementation, Random Forest has fewer options for tuning than in Smile: only the number of features to select and the number of trees to grow.

Also, JSAT contains several implementations of Logistic Regression. The usual Logistic Regression model does not have any parameters, and it is trained like this:

LogisticRegression model = new LogisticRegression();
model.trainC(dataset);

If we want to have a regularized model, then we need to use the LogisticRegressionDCD class. Dual Coordinate Descent (DCD) is the optimization method used to train this logistic regression. We train it like this:

LogisticRegressionDCD model = new LogisticRegressionDCD();
model.setMaxIterations(maxIterations);
model.setC(C);
model.trainC(fold.toJsatDataset());

In this code, C is the regularization parameter; smaller values of C correspond to a stronger regularization effect.

Finally, for outputting probabilities, we can do the following:

double[] row = ... // data
DenseVector vector = new DenseVector(row);
DataPoint point = new DataPoint(vector);
CategoricalResults out = model.classify(point);
double probability = out.getProb(1);

The CategoricalResults class contains a lot of information, including probabilities for each class and the most likely label.

What this book covers

Chapter 1, Data Science Using Java, provides an overview of the existing tools available in Java and introduces the methodology for approaching data science projects, CRISP-DM. In this chapter, we also introduce our running example, building a search engine.

Chapter 2, Data Processing Toolbox, reviews the standard Java library: the Collection API for storing the data in memory, the IO API for reading and writing the data, and the Streaming API for a convenient way of organizing data processing pipelines. We will look at extensions to the standard libraries such as Apache Commons Lang, Apache Commons IO, Google Guava, and AOL Cyclops React. Then, we will cover the most common ways of storing the data--text and CSV files, HTML, JSON, and SQL databases--and discuss how we can get the data from these data sources. We finish this chapter by talking about the ways we can collect the data for the running example--the search engine--and how we prepare the data for it.

Chapter 3, Exploratory Data Analysis, performs the initial analysis of data with Java: we look at how to calculate common statistics such as the minimal and maximal values, the average value, and the standard deviation. We also talk a bit about interactive analysis and see which tools allow us to visually inspect the data before building models. For the illustrations in this chapter, we use the data we collect for the search engine.

Chapter 4, Supervised Learning - Classification and Regression, starts with machine learning and then looks at the models for performing supervised learning in Java. Among others, we look at how to use the following libraries--Smile, JSAT, LIBSVM, LIBLINEAR, and Encog--and we see how we can use them to solve classification and regression problems. We use two examples here: first, we use the search engine data for predicting whether a URL will appear on the first page of results, which illustrates the classification problem; second, we predict how much time it takes to multiply two matrices on certain hardware given its characteristics, which illustrates the regression problem.

Chapter 5, Unsupervised Learning - Clustering and Dimensionality Reduction, explores the methods for dimensionality reduction available in Java, and we will learn how to apply PCA and Random Projection to reduce the dimensionality of the data. This is illustrated with the hardware performance dataset from the previous chapter. We also look at different ways to cluster data, including Agglomerative Clustering, K-Means, and DBSCAN, and we use a dataset with customer complaints as an example.

