Tutorial: Load from Hadoop

In this tutorial, you'll load files into Druid using Hadoop in local standalone mode. Then, you'll learn how to connect Druid to a remote Hadoop cluster to automatically parallelize ingestion.


You will need:

  • Java 8 or later
  • Node.js 4.5.x or later
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • At least 4GB of RAM
  • A Hadoop cluster

On Mac OS X, you can use Oracle's JDK 8 to install Java and Homebrew to install Node.js.

On Linux, your OS package manager should be able to provide both Java and Node.js. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers packages for it. If your Debian, Ubuntu, or Enterprise Linux OS does not have a recent enough version of Node.js, NodeSource offers packages for those OSes.

Start Imply

If you've already installed and started Imply using the quickstart, you can skip this step.

First, download Imply 2.3.4 from imply.io/get-started and unpack the release archive.

tar -xzf imply-2.3.4.tar.gz
cd imply-2.3.4

Next, you'll need to start up Imply, which includes Druid, Imply Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.

Later on, if you'd like to stop the services, press CTRL-C in the terminal running the supervise program. If you want a clean start after stopping the services, remove the var/ directory and then start up again.

Load data

We've included a sample of Wikipedia edits from June 27, 2016 to get you started with batch ingestion, located in the file quickstart/wikiticker-2016-06-27-sampled.json. Open the quickstart/wikiticker-index-hadoop.json ingestion task file to see how Druid can be configured to load this data using Hadoop.
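For a quick overview of the spec without reading the whole file, a short script can pull out a few headline fields. This is a minimal sketch that assumes the file follows the usual Druid Hadoop task layout (a top-level "type" plus a "spec" object containing "dataSchema" and "ioConfig"). The helper name summarize_spec is hypothetical, and any field that is missing comes back as None.

```python
import json
import os

SPEC_PATH = "quickstart/wikiticker-index-hadoop.json"


def summarize_spec(spec_doc):
    """Pull a few headline fields out of a parsed Hadoop ingestion task.

    The field layout used here is an assumption about the spec file,
    so every lookup tolerates missing keys.
    """
    spec = spec_doc.get("spec", {})
    data_schema = spec.get("dataSchema", {})
    io_config = spec.get("ioConfig", {})
    return {
        "task_type": spec_doc.get("type"),
        "datasource": data_schema.get("dataSource"),
        "input_paths": io_config.get("inputSpec", {}).get("paths"),
        "intervals": data_schema.get("granularitySpec", {}).get("intervals"),
    }


# Run from your Imply directory so the relative path resolves.
if __name__ == "__main__" and os.path.exists(SPEC_PATH):
    with open(SPEC_PATH) as f:
        print(json.dumps(summarize_spec(json.load(f)), indent=2))
```

The datasource name reported here is the name to look for later in the coordinator console.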

To load this data into Druid, submit the ingestion spec that you just opened by running the following command from your Imply directory:

bin/post-index-task --file quickstart/wikiticker-index-hadoop.json

This command starts a Druid Hadoop ingestion task. Because you have not yet configured Druid to use a remote Hadoop cluster for ingestion tasks, the task runs in-process (inside the Druid task JVM) using Hadoop in local standalone mode. If you had configured Druid to use a remote Hadoop cluster, the Druid task would instead submit a Hadoop job, which is parallelized automatically on YARN, and then wait for that job to finish.
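If you would rather script the submission than use bin/post-index-task, the same steps can be sketched against Druid's indexing service HTTP API. The example below is a sketch under stated assumptions: it assumes the overlord listens on http://localhost:8090 (not stated in this tutorial), that the submit response carries the task ID under "task", and that the status response nests the state under "status". The function names are illustrative only.

```python
import json
import time
import urllib.parse
import urllib.request

OVERLORD_URL = "http://localhost:8090"  # assumption: quickstart overlord address


def build_submit_request(spec_bytes, overlord_url=OVERLORD_URL):
    """Build (but do not send) the POST request that submits a task spec."""
    return urllib.request.Request(
        overlord_url.rstrip("/") + "/druid/indexer/v1/task",
        data=spec_bytes,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def task_state(status_doc):
    """Extract the state string (e.g. RUNNING, SUCCESS, FAILED), or None."""
    return status_doc.get("status", {}).get("status")


def submit_and_wait(spec_path, overlord_url=OVERLORD_URL, poll_seconds=10):
    """Submit the spec file, then poll until the task succeeds or fails."""
    with open(spec_path, "rb") as f:
        request = build_submit_request(f.read(), overlord_url)
    with urllib.request.urlopen(request) as response:
        task_id = json.load(response)["task"]

    status_url = "%s/druid/indexer/v1/task/%s/status" % (
        overlord_url.rstrip("/"),
        urllib.parse.quote(task_id, safe=""),
    )
    while True:
        with urllib.request.urlopen(status_url) as response:
            state = task_state(json.load(response))
        if state in ("SUCCESS", "FAILED"):
            return task_id, state
        time.sleep(poll_seconds)


# Example, with the Imply services running:
# print(submit_and_wait("quickstart/wikiticker-index-hadoop.json"))
```

Polling looks the same whether the task runs Hadoop in-process or hands the work to a remote cluster, because in both cases the Druid task only finishes once the Hadoop job does.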

After your ingestion task finishes, the data will be loaded by historical nodes and become available for querying within a minute or two. You can monitor loading progress in the coordinator console at http://localhost:8081/#/ by checking whether the datasource "wikiticker" is listed with a blue circle indicating "fully available".
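The same check can be made without the console. The sketch below assumes the coordinator's load status endpoint (/druid/coordinator/v1/loadstatus) returns a JSON object mapping each datasource name to the percentage of its segments that are loaded. The helper names are hypothetical.

```python
import json
import urllib.request

COORDINATOR_URL = "http://localhost:8081"  # coordinator address used by the console


def is_fully_available(load_status, datasource):
    """True when the coordinator reports the datasource as 100% loaded."""
    return load_status.get(datasource, 0.0) >= 100.0


def fetch_load_status(coordinator_url=COORDINATOR_URL):
    """Fetch the datasource -> percent-loaded map from the coordinator."""
    url = coordinator_url.rstrip("/") + "/druid/coordinator/v1/loadstatus"
    with urllib.request.urlopen(url) as response:
        return json.load(response)


# Example, with the Imply services running:
# print(is_fully_available(fetch_load_status(), "wikiticker"))
```

A datasource that is missing from the response is treated as not yet available, which matches the state right after the task finishes but before historical nodes have picked up the segments.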

Query data

Once the data is loaded, you can query it using any of the supported query methods. To start off, try a SQL query:

$ bin/dsql
dsql> SELECT page, SUM("count") AS Edits FROM wikiticker WHERE TIMESTAMP '2016-06-27 00:00:00' <= __time AND __time < TIMESTAMP '2016-06-28 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 5;

│ page                                                     │ Edits │
│ Copa América Centenario                                  │    29 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │    16 │
│ Wikipedia:Administrators' noticeboard/Incidents          │    16 │
│ 2016 Wimbledon Championships – Men's Singles             │    15 │
│ Wikipedia:Administrator intervention against vandalism   │    15 │
Retrieved 5 rows in 0.04s.
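The same query can also be sent over HTTP from a script. This is a minimal sketch, assuming the broker listens on http://localhost:8082 (not stated in this tutorial) and exposes Druid SQL at /druid/v2/sql/, accepting a JSON body with a "query" field and returning a JSON array of rows. The function names are illustrative.

```python
import json
import urllib.request

BROKER_URL = "http://localhost:8082"  # assumption: quickstart broker address

QUERY = (
    'SELECT page, SUM("count") AS Edits FROM wikiticker '
    "WHERE TIMESTAMP '2016-06-27 00:00:00' <= __time "
    "AND __time < TIMESTAMP '2016-06-28 00:00:00' "
    "GROUP BY page ORDER BY Edits DESC LIMIT 5"
)


def build_sql_request(sql, broker_url=BROKER_URL):
    """Build (but do not send) the POST request for a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        broker_url.rstrip("/") + "/druid/v2/sql/",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, broker_url=BROKER_URL):
    """Send the query and return the parsed JSON result rows."""
    with urllib.request.urlopen(build_sql_request(sql, broker_url)) as response:
        return json.load(response)


# Example, with the Imply services running:
# for row in run_sql(QUERY):
#     print(row)
```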

Next, try configuring a data cube in Pivot:

  1. Navigate to Pivot at http://localhost:9095.
  2. Click on the Plus icon in the top right of the header bar and select "New data cube".
  3. Select the source "druid: wikiticker-hadoop" and ensure "Auto-fill dimensions and measures" is checked.
  4. Click "Next: configure data cube".
  5. Click "Create cube". You should see the confirmation message "Data cube created".
  6. View your new data cube by clicking the Home icon in the top right and selecting the "Wikiticker Hadoop" cube you just created.

Further reading

So far, you've loaded data using an ingestion spec that we've included in the distribution, using Hadoop in standalone mode.