Edit ingestion specs

Getting started

In this tutorial, you'll edit an ingestion spec to load a new data file. This tutorial uses batch ingestion, but the same principles apply to any form of ingestion supported by Druid.

Prerequisites

You will need:

  - Java 8 or better
  - A Linux, Mac OS X, or other Unix-like operating system

On Mac OS X, you can use Oracle's JDK 8 to install Java.

On Linux, your OS package manager should be able to help install Java. If your Ubuntu-based OS does not have a recent enough version of Java, Azul offers Zulu, an open source OpenJDK-based package with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
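For example, on Ubuntu or Debian you can typically install a suitable JDK straight from the package manager (a sketch; package names vary by distribution):

sudo apt-get update
sudo apt-get install openjdk-8-jdk
java -version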

Start Imply

If you've already installed and started Imply using the quickstart, you can skip this step.

First, download Imply 2.6.8 from imply.io/get-started and unpack the release archive.

tar -xzf imply-2.6.8.tar.gz
cd imply-2.6.8

Next, you'll need to start up Imply, which includes Druid, Imply UI, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
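For example, from another terminal you can list the per-service logs and follow the one you are interested in (the exact file names depend on the service names in quickstart.conf, so check what ls shows you):

ls var/sv/
tail -f var/sv/broker.log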

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory and then start up again.
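For example, after stopping the services, a clean restart from the imply-2.6.8 directory looks like this (note that removing var/ deletes any data you have already loaded):

rm -rf var
bin/supervise -c conf/supervise/quickstart.conf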

Writing an ingestion spec

When loading files into Druid, you will use Druid's batch loading process. There's an example batch ingestion spec in quickstart/wikipedia-index.json that you can modify for your own needs.

The most important questions are:

  - What should the dataset be called? This is the "dataSource" field of the "dataSchema".
  - Where is the dataset located? The file paths belong in the "firehose" of the "ioConfig".
  - Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
  - Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
  - Which fields should be treated as metrics? This belongs in the "metricsSpec".

If your data does not have a natural sense of time, you can tag each row with the current time. You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
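For example, one way to tag all rows with a fixed timestamp is to point the timestampSpec at a column that does not exist in your data and supply a missingValue (a sketch using the timestampSpec's missingValue setting; "nonexistent_column" is a stand-in name):

"timestampSpec" : {
  "column" : "nonexistent_column",
  "format" : "auto",
  "missingValue" : "2000-01-01T00:00:00.000Z"
}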

Let's use a small pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box. Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file containing flattened objects.

{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}

If you save this to a file called "pageviews.json", then for this dataset:

  - The dataset should be called "pageviews".
  - The data is located in "pageviews.json".
  - The timestamp is the "time" field.
  - Good candidates for dimensions are the string fields "url" and "user".
  - Good candidates for metrics are a count of pageviews, and the sum of "latencyMs". Collecting that sum when we load the data will also allow us to compute an average latency at query time.

You can copy the existing quickstart/wikipedia-index.json indexing task to a new file:

cp quickstart/wikipedia-index.json my-index-task.json

And modify it by altering the sections identified by the questions above. After editing, it should look like this:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "pageviews",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "url",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        { "name": "views", "type": "count" },
        { "name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-01/2015-09-02"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : ".",
        "filter" : "pageviews.json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}

Running the task

First, make sure the indexing task can read pageviews.json by placing the file in the root of the Imply distribution (the imply-2.6.8 directory).
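For example, if you created pageviews.json somewhere else (the source path below is a placeholder; run this from the imply-2.6.8 directory):

cp /path/to/pageviews.json .
ls pageviews.json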

Next, to kick off the indexing process, POST your indexing task to the Druid Overlord. In a single-machine Imply install, the URL is http://localhost:8090/druid/indexer/v1/task. You can also use the post-index-task command included in the Imply distribution:

bin/post-index-task --file my-index-task.json
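Equivalently, you can POST the spec yourself with curl (a sketch assuming the default single-machine Overlord URL mentioned above):

curl -X POST -H 'Content-Type: application/json' \
  -d @my-index-task.json \
  http://localhost:8090/druid/indexer/v1/task

A successful submission returns a small JSON object containing the new task's ID.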

If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot by visiting the "Task log" URL given by the post-index-task program.
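You can also ask the Overlord for a task's status directly (substitute the task ID returned when you submitted it):

curl http://localhost:8090/druid/indexer/v1/task/<task-id>/status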

Query data

After sending data, you can immediately query it using any of the supported query methods including visualization, SQL, and API. To start off, try a SQL query:

$ bin/dsql
dsql> SELECT "pageviews"."user", SUM(views) FROM pageviews GROUP BY "pageviews"."user";
┌───────┬────────┐
│ user  │ EXPR$1 │
├───────┼────────┤
│ alice │      1 │
│ bob   │      2 │
└───────┴────────┘
Retrieved 2 rows in 0.02s.
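The same query can also be sent over HTTP to the Druid SQL endpoint (a sketch assuming the Broker is listening on the default port 8082):

curl -X POST -H 'Content-Type: application/json' \
  -d '{"query":"SELECT \"user\", SUM(views) FROM pageviews GROUP BY \"user\""}' \
  http://localhost:8082/druid/v2/sql/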

Next, try configuring a data cube within Imply:

  1. Navigate to Imply UI at http://localhost:9095/.
  2. Click on the Plus icon in the top right of the header bar and select "New data cube".
  3. Select the source "druid: pageviews" and ensure "Auto-fill dimensions and measures" is checked.
  4. Click "Next: configure data cube".
  5. Click "Create cube". You should see the confirmation message "Data cube created".
  6. View your new data cube by clicking the Home icon in the top-right and selecting the "Pageviews" cube you just created.
