In this tutorial, you'll edit an ingestion spec to load a new data file. This tutorial uses batch ingestion, but the same principles apply to any form of ingestion supported by Druid.
You will need Java 8 to run Imply.
On Mac OS X, you can use Oracle's JDK 8 to install Java.
On Linux, your OS package manager should be able to help install Java. If your Ubuntu-based OS does not have a recent enough version of Java, Azul offers Zulu, an open source OpenJDK-based package with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
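For example, on a recent Ubuntu release, installing and verifying OpenJDK 8 looks roughly like this (package names vary by distribution, so treat this as a sketch):

sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version    # should report a 1.8.x version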
If you've already installed and started Imply using the quickstart, you can skip this step.
First, download Imply 3.4.9 from imply.io/get-started and unpack the release archive.
tar -xzf imply-3.4.9.tar.gz
cd imply-3.4.9
Next, you will need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:
bin/supervise -c conf/supervise/quickstart.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal window.
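For example, from a second terminal in the distribution directory you could run something like the following; the exact file names under var/sv/ depend on which services you started, so adjust as needed:

ls var/sv/                # see which services have log output
tail -f var/sv/broker*    # hypothetical name; substitute the service you want to watch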
Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory and then start up again.
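In other words, a clean restart might look like this, run from the distribution root after stopping supervise with CTRL-C (note that removing var/ deletes any data you have loaded):

rm -rf var
bin/supervise -c conf/supervise/quickstart.conf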
When loading files into Druid, you will use Druid's batch loading process.
There is an example batch ingestion spec in quickstart/wikipedia-index.json, which you can modify for your own needs.
The most important questions are:
- What should the dataset be called? This is the "dataSource" field of the "dataSchema".
- Where is the dataset located? The file path belongs in the "firehose" of the "ioConfig".
- Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
- Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
- Which fields should be treated as measures? This belongs in the "metricsSpec".
- Do you want to roll up the data as it is loaded? This corresponds to "rollup": true in the "granularitySpec".

If your data does not have a natural sense of time, you can tag each row with the current time. You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
Let's use a small pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box. Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file containing flattened objects.
{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
If you save this to a file called "pageviews.json", then for this dataset:
- The dataset will be called "pageviews".
- The data is located in "pageviews.json".
- The timestamp is the "time" field.
- Good choices for dimensions are the string fields "url" and "user".
- Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that sum at load time lets you compute an average latency at query time as well.
- Rollup makes sense here, since we only need aggregated views and latency, not individual page view records.
You can copy the existing quickstart/wikipedia-index.json indexing task to a new file:
cp quickstart/wikipedia-index.json my-index-task.json
Then modify it by editing the sections that correspond to the questions above. After editing, it should look like this:
{
  "type": "index",
  "spec": {
    "dataSchema": {
      "dataSource": "pageviews",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "json",
          "dimensionsSpec": {
            "dimensions": ["url", "user"]
          },
          "timestampSpec": {
            "format": "auto",
            "column": "time"
          }
        }
      },
      "metricsSpec": [
        { "name": "views", "type": "count" },
        { "name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs" }
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2015-09-01/2015-09-02"],
        "rollup": true
      }
    },
    "ioConfig": {
      "type": "index",
      "firehose": {
        "type": "local",
        "baseDir": ".",
        "filter": "pageviews.json"
      },
      "appendToExisting": false
    },
    "tuningConfig": {
      "type": "index",
      "targetPartitionSize": 5000000,
      "maxRowsInMemory": 25000,
      "forceExtendableShardSpecs": true
    }
  }
}
First, make sure that the indexing task can read pageviews.json by placing it in the root of the Imply distribution.
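For example, if you created pageviews.json somewhere else, copy it into the distribution root and confirm it is there (adjust the hypothetical source path to wherever you saved the file, and run this from the imply-3.4.9 directory):

cp /path/to/pageviews.json .    # hypothetical source path; adjust to match your system
ls pageviews.json               # confirm the file is where the "local" firehose expects it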
Next, to kick off the indexing process, POST your indexing task to the Druid Overlord. In a single-machine Imply install, the URL is http://localhost:8090/druid/indexer/v1/task. You can also use the post-index-task command included in the Imply distribution:
bin/post-index-task --file my-index-task.json
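If you prefer to submit the task without the helper script, a plain POST to the Overlord URL above should also work; this is a sketch assuming the default single-machine ports:

curl -X POST -H 'Content-Type: application/json' \
  -d @my-index-task.json \
  http://localhost:8090/druid/indexer/v1/task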
If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot by visiting the "Task log" URL given by the post-index-task program.
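The response to the POST includes a task ID, which you can also use to poll the Overlord directly; for example, replacing <taskId> with the ID you received:

curl http://localhost:8090/druid/indexer/v1/task/<taskId>/status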
Once the indexing task finishes and the data is loaded, you can query it using any of the supported query methods, including visualization, SQL, and API. To start off, try a SQL query:
$ bin/dsql
dsql> SELECT "pageviews"."user", SUM(views) FROM pageviews GROUP BY "pageviews"."user";
┌───────┬────────┐
│ user  │ EXPR$1 │
├───────┼────────┤
│ alice │ 1      │
│ bob   │ 2      │
└───────┴────────┘
Retrieved 2 rows in 0.02s.
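Because the spec also ingested latencyMs as a doubleSum metric, you can compute things like average latency per URL at query time; for example, a query along these lines should work from the same dsql prompt:

dsql> SELECT "url", SUM("latencyMs") / SUM("views") AS avg_latency_ms FROM pageviews GROUP BY "url";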
Next, try configuring a data cube within Imply: