INTRODUCTION
THE GROWTH OF USER-DRIVEN CONTENT has fueled a rapid increase in the volume and type of data that is generated, manipulated, analyzed, and archived. In addition, a varied set of newer sources, including sensors, Global Positioning Systems (GPS), automated trackers, and monitoring systems, is generating large amounts of data. These large data sets, often termed big data, pose new challenges and opportunities around storage, analysis, and archival.
In parallel with this fast data growth, data is also becoming increasingly semi-structured and sparse. This means that traditional data management techniques built around upfront schema definition and relational references are also being questioned.
The quest to solve the problems related to large-volume and semi-structured data has led to the emergence of a class of newer types of database products. This new class of database products consists of column-oriented data stores, key/value pair databases, and document databases. Collectively, these are identified as NoSQL.
The products that fall under the NoSQL umbrella are quite varied, each with its own set of features and value propositions. Given this, it often becomes difficult to decide which product to use for the case at hand. This book prepares you to understand the entire NoSQL landscape. It provides the essential concepts that act as the building blocks for many of the NoSQL products. Instead of covering a single product exhaustively, it provides fair coverage of a number of different NoSQL products. The emphasis is on breadth and underlying concepts rather than full coverage of every product API. Because a number of NoSQL products are covered, a good bit of comparative analysis is also included.
If you are unsure where to start with NoSQL and how to learn to manage and analyze big data, then you will find this book to be a good introduction and a useful reference to the topic.
WHO THIS BOOK IS FOR
Developers, architects, database administrators, and technical project managers are the primary audience of this book. However, anyone savvy enough to understand database technologies is likely to find it useful.
The subject of big data and NoSQL is of interest to a number of computer science students and researchers as well. Such students and researchers could benefit from reading this book.
Anyone starting out with big data analysis and NoSQL will gain from reading this book.
WHAT THIS BOOK COVERS
This book starts with the essentials of NoSQL and graduates to advanced concepts around performance tuning and architectural guidelines. The book focuses all along on the fundamental concepts that relate to NoSQL and explains those in the context of a number of different NoSQL products. The book includes illustrations and examples that relate to MongoDB, CouchDB, HBase, Hypertable, Cassandra, Redis, and Berkeley DB. A few other NoSQL products, besides these, are also referenced.
An important part of NoSQL is the way large data sets are manipulated. This book covers all the essentials of MapReduce-based scalable processing. It illustrates a few examples using Hadoop. Higher-level abstractions like Hive and Pig are also illustrated.
Chapter 10, which is entirely devoted to NoSQL in the cloud, brings to light the facilities offered by Amazon Web Services and the Google App Engine.
The book includes a number of examples and illustrations of use cases. Scalable data architectures at Google, Amazon, Facebook, Twitter, and LinkedIn are also discussed.
Towards the end of the book, the comparison of NoSQL products and the use of polyglot persistence in an application stack are discussed.
HOW THIS BOOK IS STRUCTURED
This book is divided into four parts:
- Part I: Getting Started
- Part II: Learning the NoSQL Basics
- Part III: Gaining Proficiency with NoSQL
- Part IV: Mastering NoSQL
Topics in each part are built on top of what is covered in the preceding parts.
Part I of the book gently introduces NoSQL. It defines the types of NoSQL products and introduces the very first examples of storing data in and accessing data from NoSQL:
- Chapter 1 defines NoSQL.
- Starting with the quintessential Hello World, Chapter 2 presents the first few examples of using NoSQL.
- Chapter 3 includes ways of interacting and interfacing with NoSQL products.
Part II of the book is where a number of the essential concepts of a variety of NoSQL products are covered:
- Chapter 4 starts by explaining the storage architecture.
- Chapters 5 and 6 cover the essentials of data management by demonstrating the CRUD operations and the querying mechanisms.
- Data sets evolve with time and usage. Chapter 7 addresses the questions around data evolution.
- The world of relational databases focuses a lot on query optimization by leveraging indexes. Chapter 8 covers indexes in the context of NoSQL products.
- NoSQL products are often disproportionately criticized for their lack of transaction support. Chapter 9 demystifies the concepts around transactions and the transactional-integrity challenges that distributed systems face.
Parts III and IV of the book are where a select few advanced topics are covered:
- Chapter 10 covers the Google App Engine data store and Amazon SimpleDB.
- Much of big data processing rests on the shoulders of the MapReduce style of processing. Learn all the essentials of MapReduce in Chapter 11.
- Chapter 12 extends the MapReduce coverage to demonstrate how Hive provides a SQL-like abstraction for Hadoop MapReduce tasks.
- Chapter 13 revisits the topic of database architecture and internals.
Part IV, the last part of the book, starts with Chapter 14, where NoSQL products are compared. Chapter 15 promotes the idea of polyglot persistence and the use of the right database, which should depend on the use case. Chapter 16 segues into tuning scalable applications. Chapter 17 is a presentation of a select few tools and utilities that you are likely to leverage with your own NoSQL deployment. Although seemingly eclectic, the topics in Part IV prepare you for practical usage of NoSQL.
WHAT YOU NEED TO USE THIS BOOK
Please install the required pieces of software to follow along with the code examples. Refer to Appendix A for install and setup instructions.
CONVENTIONS
To help you get the most from the text and keep track of what's happening, we've used a number of conventions throughout the book.
The pencil icon indicates notes, tips, hints, tricks, and asides to the current discussion.
As for styles in the text:
- We italicize new terms and important words when we introduce them.
- We show file names, URLs, and code within the text like so: persistence.properties.
- We present code in two different ways:
We use a monofont type with no highlighting for most code examples. We use bold to emphasize code that is particularly important in the present context or to show changes from a previous code snippet.
SOURCE CODE
As you work through the examples in this book, you may choose either to type in all the code manually, or to use the source code files that accompany the book. All the source code used in this book is available for download at www.wrox.com. When at the site, simply locate the book's title (use the Search box or one of the title lists) and click the Download Code link on the book's detail page to obtain all the source code for the book. Code that is included on the website is highlighted by the following icon: