Tutorial: Edit ingestion specs

Getting started

In this tutorial, you'll edit an ingestion spec to load a new data file. This tutorial uses batch ingestion, but the same principles apply to any form of ingestion supported by Druid.

Prerequisites

You will need:

  • Java 8 or better
  • Node.js 4.5.x or better
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • At least 4GB of RAM
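
You can verify the installed versions with:

java -version
node --version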

On Mac OS X, you can install Java using Oracle's JDK 8 and Node.js using Homebrew.

On Linux, your OS's package manager should be able to install both Java and Node.js. If your Ubuntu-based OS does not have a recent enough version of Java, WebUpd8 offers packages for those OSes. If your Debian, Ubuntu, or Enterprise Linux OS does not have a recent enough version of Node.js, NodeSource offers packages for those OSes.

Start Imply

If you've already installed and started Imply using the quickstart, you can skip this step.

First, download Imply 2.3.4 from imply.io/get-started and unpack the release archive.

tar -xzf imply-2.3.4.tar.gz
cd imply-2.3.4

Next, you'll need to start up Imply, which includes Druid, Imply Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
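
For example, you could follow the Broker's log like this (a sketch only; the exact log file names under var/sv/ depend on the supervise configuration, so list the directory first to see what is there):

ls var/sv/
tail -f var/sv/broker.log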

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory and then start up again.
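
For example, a clean restart from the root of the distribution looks like this:

rm -rf var
bin/supervise -c conf/supervise/quickstart.conf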

Writing an ingestion spec

When loading files into Druid, you will use Druid's batch loading process. There's an example batch ingestion spec in quickstart/wikiticker-index.json that you can modify for your own needs.

The most important questions are:

  • What should the dataset be called? This is the "dataSource" field of the "dataSchema".
  • Where is the dataset located? This configuration belongs in the "ioConfig". The specific configuration is different for different ingestion methods (local files, Hadoop, Kafka, etc.). For the local file method we're using, the file locations go in the "firehose".
  • Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec". Druid always requires a timestamp column.
  • Do you want to roll up your data as an OLAP cube or not? Druid supports an OLAP data model, where you organize your columns into dimensions (attributes you can filter and split on) and metrics (aggregated values; also called "measures"). OLAP data models are designed to allow fast slice-and-dice analysis of data. This belongs in the "rollup" flag of the "granularitySpec".
  • If you are using an OLAP data model, your dimensions belong in the "dimensions" field of the "dimensionsSpec" and your metrics belong in the "metricsSpec".
  • If you are not using an OLAP data model, your columns should all go in the "dimensions" field of the "dimensionsSpec", and the "metricsSpec" should be empty.
  • For batch ingestion only: What time ranges (intervals) are being loaded? This belongs in the "intervals" of the "granularitySpec".

If your data does not have a natural sense of time, you can tag each row with the current time. You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
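
For example, to tag every row with a fixed timestamp, one approach is to point the "timestampSpec" at a column that doesn't exist in your data and supply a constant fallback. This is only a sketch, and it assumes your Druid version's timestampSpec supports the "missingValue" parameter:

"timestampSpec" : {
  "column" : "nonexistent",
  "format" : "auto",
  "missingValue" : "2000-01-01T00:00:00.000Z"
}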

Let's use a small pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box. Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file containing flattened objects.

{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}

If you save this to a file called "pageviews.json", then for this dataset:

  • Let's call the dataset "pageviews".
  • The data is located in "pageviews.json" in the root of the Imply distribution.
  • The timestamp is the "time" field.
  • Let's use an OLAP data model, so set "rollup" to true.
  • Good choices for dimensions are the string fields "url" and "user".
  • Good choices for measures are a count of pageviews, and the sum of "latencyMs". Collecting that sum when we load the data will allow us to compute an average at query time as well.
  • The data covers the time range 2015-09-01 (inclusive) through 2015-09-02 (exclusive).

You can copy the existing quickstart/wikiticker-index.json indexing task to a new file:

cp quickstart/wikiticker-index.json my-index-task.json

Then modify it by editing the sections discussed above. After editing, it should look like this:

{
  "type" : "index",
  "spec" : {
    "dataSchema" : {
      "dataSource" : "pageviews",
      "parser" : {
        "type" : "string",
        "parseSpec" : {
          "format" : "json",
          "dimensionsSpec" : {
            "dimensions" : [
              "url",
              "user"
            ]
          },
          "timestampSpec" : {
            "format" : "auto",
            "column" : "time"
          }
        }
      },
      "metricsSpec" : [
        { "name": "views", "type": "count" },
        { "name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "day",
        "queryGranularity" : "none",
        "intervals" : ["2015-09-01/2015-09-02"],
        "rollup" : true
      }
    },
    "ioConfig" : {
      "type" : "index",
      "firehose" : {
        "type" : "local",
        "baseDir" : ".",
        "filter" : "pageviews.json"
      },
      "appendToExisting" : false
    },
    "tuningConfig" : {
      "type" : "index",
      "targetPartitionSize" : 5000000,
      "maxRowsInMemory" : 25000,
      "forceExtendableShardSpecs" : true
    }
  }
}

Running the task

First, make sure the indexing task can read pageviews.json by placing the file in the root of the Imply distribution.
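
For example, you could create the file directly in the root of the distribution:

cat > pageviews.json <<EOF
{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
EOF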

Next, to kick off the indexing process, POST your indexing task to the Druid Overlord. In a single-machine Imply install, the URL is http://localhost:8090/druid/indexer/v1/task. You can also use the post-index-task command included in the Imply distribution:

bin/post-index-task --file my-index-task.json
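
Alternatively, you can POST the task file directly to the Overlord URL above with curl:

curl -X POST -H 'Content-Type: application/json' -d @my-index-task.json http://localhost:8090/druid/indexer/v1/task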

If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot by visiting the "Task log" URL given by the post-index-task program.

Query data

After sending data, you can immediately query it using any of the supported query methods. To start off, try a SQL query:

$ bin/dsql
dsql> SELECT "pageviews"."user", SUM(views) FROM pageviews GROUP BY "pageviews"."user";
┌───────┬────────┐
│ user  │ EXPR$1 │
├───────┼────────┤
│ alice │      1 │
│ bob   │      2 │
└───────┴────────┘
Retrieved 2 rows in 0.02s.
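
As noted earlier, storing the sum of "latencyMs" alongside the view count lets you compute an average latency at query time. For this dataset, a query like the following should return an average of 11 for "/" and 38.5 for "/foo/bar":

dsql> SELECT "url", SUM("latencyMs") / SUM("views") AS "avgLatencyMs" FROM pageviews GROUP BY "url";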

Next, try configuring a datacube in Pivot:

  1. Navigate to Pivot at http://localhost:9095.
  2. Click on the Plus icon in the top right of the header bar and select "New data cube".
  3. Select the source "druid: pageviews" and ensure "Auto-fill dimensions and measures" is checked.
  4. Click "Next: configure data cube".
  5. Click "Create cube". You should see the confirmation message "Data cube created".
  6. View your new datacube by clicking the Home icon in the top-right and selecting the "Pageviews" cube you just created.
