In this tutorial, you will load files into Druid using Hadoop in local standalone mode. Then, you will learn how to connect Druid to a remote Hadoop cluster to automatically parallelize ingestion.
You will need:
Imply builds and certifies its releases using OpenJDK. We suggest selecting a distribution that provides long-term support and open-source licensing. Amazon Corretto and Azul Zulu are two good options.
If you've already installed and started Imply using the quickstart, you can skip this step.
First, download Imply 3.0.2 from imply.io/get-started and unpack the release archive.
tar -xzf imply-3.0.2.tar.gz
cd imply-3.0.2
bin/supervise -c conf/supervise/quickstart.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
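As a small convenience for exploring those logs, the sketch below lists every file under var/sv/. The function name list_service_logs is hypothetical. The layout inside var/sv/ is not described here, so the sketch makes no assumption about file names and simply prints whatever it finds.

```shell
# List all files under the supervise log directory (default: var/sv).
# Run from the Imply directory, in a terminal other than the one running supervise.
list_service_logs() {
  sv_dir="${1:-var/sv}"
  if [ ! -d "$sv_dir" ]; then
    echo "no log directory at $sv_dir (have the services been started?)" >&2
    return 1
  fi
  # Sorted output keeps each service's files grouped together.
  find "$sv_dir" -type f | sort
}
```

Calling list_service_logs with no argument inspects var/sv/ relative to the current directory. You can then follow any listed file with tail -f.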
Later on, if you'd like to stop the services, press CTRL-C in the terminal running the supervise program. If you want a clean start after stopping the services, remove the var/ directory and then start up again.
We've included a sample of Wikipedia edits from June 27, 2016 to get you started with batch ingestion, located at quickstart/wikipedia-2016-06-27-sampled.json. Open the quickstart/wikipedia-index-hadoop.json ingestion task file to see how Druid can be configured to load this data using Hadoop.
To load this data into Druid, you can submit the ingestion spec that you opened earlier. To do this, run the following command from your Imply directory:
bin/post-index-task --file quickstart/wikipedia-index-hadoop.json
This command will start a Druid Hadoop ingestion task. Since you haven't yet configured Druid to use a remote Hadoop cluster for ingestion tasks, this will run in-process (inside the Druid task JVM) using Hadoop in local standalone mode. If you had configured Druid to use a remote Hadoop cluster, the Druid task would submit a Hadoop job and automatically parallelize on YARN, and then just wait for the Hadoop job to finish.
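If you prefer to see the underlying HTTP call, the sketch below posts the same spec file to the Druid Overlord task endpoint (/druid/indexer/v1/task) with curl. The function name submit_task is hypothetical, and the default Overlord address http://localhost:8090 is an assumption about the quickstart configuration. Pass a different base URL as the second argument if your Overlord listens elsewhere.

```shell
# Submit an ingestion spec to the Overlord task endpoint.
# $1: path to the spec file
# $2: Overlord base URL (assumed quickstart default below; adjust as needed)
submit_task() {
  spec_file="$1"
  overlord_url="${2:-http://localhost:8090}"
  if [ ! -f "$spec_file" ]; then
    echo "spec file not found: $spec_file" >&2
    return 1
  fi
  # --data-binary sends the file unmodified as the request body.
  curl -sS -X POST -H 'Content-Type: application/json' \
    --data-binary @"$spec_file" \
    "$overlord_url/druid/indexer/v1/task"
}
```

From the Imply directory, submit_task quickstart/wikipedia-index-hadoop.json would submit the tutorial spec. Unlike this sketch, bin/post-index-task may also wait for the task to finish, so treat the function as an illustration of the request rather than a replacement.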
After your ingestion task finishes, the data will be loaded by historical nodes and will be available for querying within a minute or two. You can monitor the progress of loading your data in the coordinator console, by checking whether there is a datasource "wikipedia" with a blue circle indicating "fully available": http://localhost:8081/#/.
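The same check can be scripted instead of watching the console. The sketch below polls the Coordinator load status endpoint (/druid/coordinator/v1/loadstatus), which reports the percentage of segments loaded per datasource, and returns once "wikipedia" reaches 100. The function names, the 5-second interval, and the 60-attempt limit are illustrative choices, and the JSON is matched with a simple pattern rather than a full parser.

```shell
# Succeed if the loadstatus JSON in $1 reports datasource $2 as 100% loaded.
is_fully_loaded() {
  printf '%s' "$1" |
    grep -Eq "\"$2\"[[:space:]]*:[[:space:]]*100(\.0+)?([,}[:space:]]|\$)"
}

# Poll the Coordinator until the datasource is fully loaded or attempts run out.
# $1: datasource name; $2: Coordinator base URL (defaults to the address used above)
wait_for_datasource() {
  datasource="$1"
  coordinator_url="${2:-http://localhost:8081}"
  attempts=0
  while [ "$attempts" -lt 60 ]; do
    status=$(curl -sS "$coordinator_url/druid/coordinator/v1/loadstatus") || return 1
    if is_fully_loaded "$status" "$datasource"; then
      echo "$datasource is fully available"
      return 0
    fi
    attempts=$((attempts + 1))
    sleep 5
  done
  echo "timed out waiting for $datasource" >&2
  return 1
}
```

Running wait_for_datasource wikipedia after the task finishes blocks until the data is queryable or the attempt limit is reached.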
$ bin/dsql
dsql> SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 5;
┌──────────────────────────────────────────────────────────┬───────┐
│ page                                                     │ Edits │
├──────────────────────────────────────────────────────────┼───────┤
│ Copa América Centenario                                  │    29 │
│ User:Cyde/List of candidates for speedy deletion/Subpage │    16 │
│ Wikipedia:Administrators' noticeboard/Incidents          │    16 │
│ 2016 Wimbledon Championships – Men's Singles             │    15 │
│ Wikipedia:Administrator intervention against vandalism   │    15 │
└──────────────────────────────────────────────────────────┴───────┘
Retrieved 5 rows in 0.04s.
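The same query can also be sent over HTTP to the Druid SQL endpoint (/druid/v2/sql) as a JSON document with a "query" field. The sketch below is one way to do that with curl. The function names are hypothetical, and the Broker address http://localhost:8082 is an assumption about the quickstart configuration.

```shell
# Print the JSON request body for the top-5 pages query.
# The quoted heredoc keeps the backslash-escaped double quotes intact.
sql_payload() {
  cat <<'EOF'
{"query": "SELECT page, COUNT(*) AS Edits FROM wikipedia WHERE \"__time\" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00' GROUP BY page ORDER BY Edits DESC LIMIT 5"}
EOF
}

# POST the query to the Broker.
# $1: Broker base URL (assumed quickstart default below)
run_sql() {
  broker_url="${1:-http://localhost:8082}"
  sql_payload | curl -sS -X POST -H 'Content-Type: application/json' \
    --data-binary @- "$broker_url/druid/v2/sql"
}
```

Calling run_sql with the services running should return the rows as JSON rather than as the formatted table that dsql prints.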
Next, try configuring a data cube within Imply:
So far, you've loaded data using an ingestion spec that we've included in the distribution, using Hadoop in standalone mode.