Foreword
Aaron Kimball
San Francisco, CA
It's been four years since, via a post to the Apache JIRA, the first version of Sqoop was released to the world as an addition to Hadoop. Since then, the project has taken several turns, most recently landing as a top-level Apache project. I've been amazed at how many people use this small tool for a variety of large tasks. Sqoop users have imported everything from humble test data sets to mammoth enterprise data warehouses into the Hadoop Distributed Filesystem, HDFS. Sqoop is a core member of the Hadoop ecosystem, and plug-ins are provided and supported by several major SQL and ETL vendors. And Sqoop is now part of integral ETL and processing pipelines run by some of the largest users of Hadoop.
The software industry moves in cycles. At the time of Sqoop's origin, a major concern was unlocking data stored in an organization's RDBMS and transferring it to Hadoop. Sqoop enabled users with vast troves of information stored in existing SQL tables to use new analytic tools like MapReduce and Apache Pig. As Sqoop matures, a renewed focus on SQL-oriented analytics continues to make it relevant: systems like Cloudera Impala and Dremel-style analytic engines offer powerful distributed analytics with SQL-based languages, using the common data substrate offered by HDFS.
The variety of data sources and analytic targets presents a challenge in setting up effective data transfer pipelines. Data sources can have a variety of subtle inconsistencies: different DBMS providers may use different dialects of SQL, treat data types differently, or use distinct techniques to offer optimal transfer speeds. Depending on whether you're importing to Hive, Pig, Impala, or your own MapReduce pipeline, you may want to use a different file format or compression algorithm when writing data to HDFS. Sqoop helps the data engineer tasked with scripting such transfers by providing a compact but powerful tool that flexibly negotiates the boundaries between these systems and their data layouts.
The internals of Sqoop are described in its online user guide, and Hadoop: The Definitive Guide (O'Reilly) includes a chapter covering its fundamentals. But for most users who want to apply Sqoop to accomplish specific imports and exports, The Apache Sqoop Cookbook offers guided lessons and clear instructions that address particular, common data management tasks. Informed by the multitude of times they have helped individuals with a variety of Sqoop use cases, Kathleen and Jarcec put together a comprehensive list of ways you may need to move or transform data, followed by both the commands you should run and a thorough explanation of what's taking place under the hood. The incremental structure of this book's chapters will have you moving from a table full of "Hello, world!" strings to managing recurring imports between large-scale systems in no time.
It has been a pleasure to work with Kathleen, Jarcec, and the countless others who made Sqoop into the tool it is today. I would like to thank them for all their hard work so far, and for continuing to develop and advocate for this critical piece of the total big data management puzzle.
Preface
Whether moving a small collection of personal vacation photos between applications or moving petabytes of data between corporate warehouse systems, integrating data from multiple sources remains a struggle. Data storage is more accessible thanks to the availability of a number of widely used storage systems and accompanying tools. Core to that are relational databases (e.g., Oracle, MySQL, SQL Server, Teradata, and Netezza) that have been used for decades to serve and store huge amounts of data across all industries.
Relational database systems often store valuable data in a company. If made available, that data can be managed and processed by Apache Hadoop, which is fast becoming the standard for big data processing. Several relational database vendors championed developing integration with Hadoop within one or more of their products.
Transferring data to and from relational databases is challenging and laborious. Because data transfer requires careful handling, Apache Sqoop, short for SQL to Hadoop, was created to perform bidirectional data transfer between Hadoop and almost any external structured datastore. Taking advantage of MapReduce, Hadoop's execution engine, Sqoop performs the transfers in a parallel manner.
If you're reading this book, you may have some prior exposure to Sqoop, especially from Aaron Kimball's Sqoop section in Hadoop: The Definitive Guide by Tom White (O'Reilly) or from Hadoop Operations by Eric Sammer (O'Reilly).
From that exposure, you've seen how Sqoop optimizes data transfers between Hadoop and databases. Clearly it's a tool optimized for power users. A command-line interface providing 60 parameters is both powerful and bewildering. In this book, we'll focus on applying the parameters in common use cases to help you deploy and use Sqoop in your environment.
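As a first taste, a typical import needs only a handful of those parameters. The following sketch is illustrative; the connection string, credentials, and table name are placeholders for your own environment:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --num-mappers 4

The --num-mappers parameter sets how many map tasks Sqoop runs in parallel for the transfer.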
The opening chapter guides you through the basic prerequisites of using Sqoop. You will learn how to download, install, and configure the Sqoop tool on any node of your Hadoop cluster.
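As a quick preview, and assuming the distribution archive has already been downloaded (the archive name and paths below are illustrative), verifying a fresh installation can be as simple as:

# Unpack the Sqoop distribution (archive name is illustrative)
tar -zxf sqoop-<version>.tar.gz
cd sqoop-<version>

# Confirm that Sqoop can locate Hadoop and report its own version
bin/sqoop version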
The chapters that follow cover the core use cases in detail: importing data from a relational database into Hadoop and exporting it back out again.
Later in the book, we focus on integrating Sqoop with the rest of the Hadoop ecosystem. We will show you how to run Sqoop from within a specialized Hadoop scheduler called Apache Oozie and how to load your data into Hadoop's data warehouse system, Apache Hive, and Hadoop's database, Apache HBase.
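As a taste of those recipes, here are hedged sketches of both targets; the connection string, credentials, table names, and column family are placeholders:

# Import a table and create a matching table in Hive
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --hive-import

# Import the same table into an HBase table and column family
sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --hbase-table cities \
  --column-family world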
For even greater performance, Sqoop supports database-specific connectors that use native features of the particular DBMS. Sqoop includes native connectors for MySQL and PostgreSQL. Available for download are connectors for Teradata, Netezza, Couchbase, and Oracle (from Dell). A dedicated chapter walks you through using them.
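For the bundled MySQL and PostgreSQL connectors, the native fast path is requested with the --direct parameter. A minimal sketch follows; the connection details are placeholders, and the database's native utilities (such as mysqldump for MySQL) must be installed on the cluster nodes:

sqoop import \
  --connect jdbc:mysql://mysql.example.com/sqoop \
  --username sqoop \
  --password sqoop \
  --table cities \
  --direct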
Sqoop 2
The motivation behind Sqoop 2 was to make Sqoop easier to use by having a web application run Sqoop. This allows you to install Sqoop and use it from anywhere. In addition, having a REST API for operation and management enables Sqoop to integrate better with external systems such as Apache Oozie. As further discussion of Sqoop 2 is beyond the scope of this book, we encourage you to download the bits and docs from the Apache Sqoop website and then try it out!
Conventions Used in This Book
The following typographical conventions are used in this book:
Italic
Indicates new terms, URLs, email addresses, filenames, and file extensions.
Constant width
Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords.