Load from Hadoop

In this tutorial, you will load files into Druid using Hadoop in local standalone mode. Then, you will learn how to connect Druid to a remote Hadoop cluster to automatically parallelize ingestion.


You will need:

Imply builds and certifies its releases using OpenJDK. We suggest selecting a distribution that provides long-term support and open-source licensing. Amazon Corretto and Azul Zulu are two good options.

Start Imply

If you've already installed and started Imply using the quickstart, you can skip this step.

First, download Imply 3.1.19 from imply.io/get-started and unpack the release archive.

tar -xzf imply-3.1.19.tar.gz
cd imply-3.1.19

Next, you will need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory and then start up again.

Load data

We've included a sample of Wikipedia edits from June 27, 2016 to get you started with batch ingestion, located in the file quickstart/wikipedia-2016-06-27-sampled.json. Open the quickstart/wikipedia-index-hadoop.json ingestion task file to see how Druid can be configured to load this data using Hadoop.

To load this data into Druid, you can submit the ingestion spec that you opened earlier. To do this, run the following command from your Imply directory:

bin/post-index-task --file quickstart/wikipedia-index-hadoop.json

This command will start a Druid Hadoop ingestion task. Since you haven't yet configured Druid to use a remote Hadoop cluster for ingestion tasks, this will run in-process (inside the Druid task JVM) using Hadoop in local standalone mode. If you had configured Druid to use a remote Hadoop cluster, the Druid task would submit a Hadoop job and automatically parallelize on YARN, and then just wait for the Hadoop job to finish.

After your ingestion task finishes, the data will be loaded by historical nodes and will be available for querying within a minute or two. You can monitor the progress of loading your data in the coordinator console, by checking whether there is a datasource "wikipedia" with a blue circle indicating "fully available": http://localhost:8081/#/.

Query data

After sending data, you can immediately query it using any of the supported query methods including visualization, SQL, and API. To start off, try a SQL query:

$ bin/dsql
dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 5;

│ page                                                     │ Edits │
│ Copa América Centenario                                  │    29 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │    16 │
│ Wikipedia:Administrators' noticeboard/Incidents          │    16 │
│ 2016 Wimbledon Championships – Men's Singles             │    15 │
│ Wikipedia:Administrator intervention against vandalism   │    15 │
Retrieved 5 rows in 0.04s.

Next, try configuring a data cube within Imply:

  1. Navigate to Pivot at http://localhost:9095/.
  2. Click on the Plus icon in the top right of the header bar and select "New data cube".
  3. Select the source "druid: wikipedia" and ensure "Auto-fill dimensions and measures" is checked.
  4. Click "Next: configure data cube".
  5. Click "Create cube". You should see the confirmation message "Data cube created".
  6. View your new data cube by clicking the Home icon in the top-right and selecting the "Wikipedia" cube you just created.

Further reading

So far, you've loaded data using an ingestion spec that we've included in the distribution, using Hadoop in standalone mode.




Manage Data

Query Data



Special UI Features

Imply Manager