Apache Sqoop Cookbook
Jarek Jarcec Cecho
Format: PDF / Kindle (mobi) / ePub
Integrating data from multiple sources is essential in the age of big data, but it can be a challenging and time-consuming task. This handy cookbook provides dozens of ready-to-use recipes for using Apache Sqoop, the command-line interface application that optimizes data transfers between relational databases and Hadoop.
Sqoop is both powerful and bewildering, but with this cookbook’s problem-solution-discussion format, you’ll quickly learn how to deploy and then apply Sqoop in your environment. The authors provide MySQL, Oracle, and PostgreSQL database examples on GitHub that you can easily adapt for SQL Server, Netezza, Teradata, or other relational systems.
- Transfer data from a single database table into your Hadoop ecosystem
- Keep table data and Hadoop in sync by importing data incrementally
- Import data from more than one database table
- Customize transferred data by calling various database functions
- Export generated, processed, or backed-up data from Hadoop to your database
- Run Sqoop within Oozie, Hadoop’s specialized workflow scheduler
- Load data into Hadoop’s data warehouse (Hive) or database (HBase)
- Handle installation, connection, and syntax issues common to specific database vendors
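As a taste of the recipes above, an incremental import keeps a Hadoop copy of a table in sync by transferring only new rows. A minimal sketch (the connection string, credentials, table, and column names are illustrative placeholders, not examples from the book):

```shell
# Import only rows whose id exceeds the last value imported previously.
# All connection details below are hypothetical placeholders.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/shop \
  --username sqoop_user \
  --password-file /user/sqoop/.db_password \
  --table orders \
  --incremental append \
  --check-column id \
  --last-value 4000
```

In practice, a saved Sqoop job (`sqoop job --create …`) can remember the `--last-value` between runs so you don't have to track it by hand.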
Sqoop transfers data between Hadoop and databases, and it is clearly a tool optimized for power users: a command-line interface providing 60 parameters is both powerful and bewildering. In this book, we focus on applying those parameters to common use cases to help you deploy and use Sqoop in your environment. Chapter 1 guides you through the basic prerequisites of using Sqoop: you will learn how to download, install, and configure Sqoop on any node of your Hadoop cluster. Chapters 2, 3, and 4 are devoted to the…
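The basic transfer the early chapters build on is a single-table import into HDFS. A minimal sketch, assuming a MySQL source (the host, database, and table names are placeholders):

```shell
# Import one table from MySQL into HDFS using 4 parallel map tasks.
# -P prompts for the password interactively; names are placeholders.
sqoop import \
  --connect jdbc:mysql://mysql.example.com/shop \
  --username sqoop_user \
  -P \
  --table cities \
  --target-dir /data/cities \
  --num-mappers 4
```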
Exporting through a staging table will be slower than exporting directly to the final table. Sqoop requires that the structure of the staging table match that of the target table: the number of columns and their types must be the same, or the export operation will fail. Other characteristics are not enforced, which leaves you free to take advantage of advanced database features; for example, you can store the staging table in a different logical database (on the same physical machine) or in a different file group. Some…
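A staged export of this kind might look like the following sketch (table names and the Oracle connection string are assumptions for illustration):

```shell
# Export into staging_cities first; rows reach the final cities table
# only if the whole export succeeds. --clear-staging-table empties any
# leftovers from a previously failed run. Names are placeholders.
sqoop export \
  --connect jdbc:oracle:thin:@oracle.example.com:1521/ORCL \
  --username sqoop_user \
  -P \
  --table cities \
  --staging-table staging_cities \
  --clear-staging-table \
  --export-dir /data/cities
```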
…with the --columns parameter. Discussion: By default, Sqoop assumes that your HDFS data contains the same number and ordering of columns as the table you're exporting into. The --columns parameter specifies either a reordering of the columns or that only a subset of the table's columns is present in the input files. It accepts a comma-separated list of column names and is particularly helpful if you're exporting data to different tables or your table has changed between…
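A subset export along these lines might be sketched as follows (the table and column names are hypothetical):

```shell
# The HDFS files contain only country and city, in that order, so tell
# Sqoop which target columns they map to. Columns omitted from the list
# must be nullable or have default values. Names are placeholders.
sqoop export \
  --connect jdbc:mysql://mysql.example.com/shop \
  --username sqoop_user \
  -P \
  --table cities \
  --columns country,city \
  --export-dir /data/cities
```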
Problem: You are transferring data between Oracle and Hadoop. Is there a faster, more optimal way of exchanging data with Oracle? Solution: Consider using OraOop, a specialized connector for Oracle developed and maintained by Quest Software, now a division of Dell. You can download the connector from the Cloudera website. Discussion: OraOop is a highly specialized connector for the Oracle database. Instead of splitting data into equal ranges using one column (usually the table's…
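Once the OraOop connector is installed alongside Sqoop, it is typically selected with the `--direct` flag; a sketch under that assumption (connection details are placeholders):

```shell
# With OraOop installed, --direct routes the job through it instead of
# the generic Oracle code path, avoiding single-column range splitting.
# The connection string, user, and table are hypothetical.
sqoop import \
  --direct \
  --connect jdbc:oracle:thin:@oracle.example.com:1521/ORCL \
  --username SQOOP \
  -P \
  --table CITIES
```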
Problem: You use a Teradata appliance as your enterprise data warehouse and need to move data between it and Hadoop. You have been using Sqoop with the Generic JDBC Connector. Is there a more optimal solution? Solution: Download, install, and use the Cloudera Teradata Connector, which is available for free on the Cloudera website. Discussion: The Cloudera Teradata Connector is a specialized connector for Teradata that is not part of the Sqoop distribution. You need…
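After the connector is downloaded and registered with your Sqoop installation (per its install guide), an import can be sketched as below; the Teradata URL format and all names here are assumptions for illustration:

```shell
# With the connector registered, a teradata JDBC URL lets Sqoop route
# the job through it rather than the Generic JDBC Connector.
# Host, database, credentials, and table are placeholders.
sqoop import \
  --connect jdbc:teradata://teradata.example.com/DATABASE=shop \
  --username sqoop_user \
  -P \
  --table cities \
  --target-dir /data/cities
```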