Kafka ingestion tutorial

The Kafka indexing service enables you to ingest data into Imply from Apache Kafka. This service offers exactly-once ingestion guarantees as well as the ability to ingest historical data.

You can load data from Kafka in the Druid Console using the Apache Kafka data loader:

[Screenshot: Apache Kafka data loader in the Druid console]

This tutorial guides you through the steps to:

  • Set up an instance of Kafka and create a sample topic called "wikipedia".
  • Configure the Druid Kafka indexing service to load data from the Kafka event stream.
  • Load data into the Kafka "wikipedia" topic.
  • Create a data cube in Imply Pivot.

The steps assume you have access to a running instance of Imply. If you don't, see the Quickstart for information on getting started.

The Druid Kafka indexing service requires access to read from an Apache Kafka topic. If you are running Imply Cloud, consider installing Kafka in the same VPC as your Druid cluster. For more information, see Imply Cloud Security.

Step 1: Get Kafka

  1. In a terminal window, download Kafka as follows:

    curl -O https://archive.apache.org/dist/kafka/2.5.1/kafka_2.13-2.5.1.tgz
    tar -xzf kafka_2.13-2.5.1.tgz
    cd kafka_2.13-2.5.1
    

    This directory is referred to as the Kafka home for the rest of this tutorial.

Imply and Kafka both rely on Zookeeper. If you are running this tutorial on the same machine where you are running an Imply single machine instance, such as the Quickstart on-prem installation, modify the default configuration to avoid port conflicts. Change the default port for the Kafka Zookeeper from 2181 to 2180 where it appears in the following files:

  • <kafka_home>/config/zookeeper.properties
  • <kafka_home>/config/server.properties
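
If you prefer to make this change from the command line, one quick way is a single sed substitution across both files. This is just a convenience sketch, assuming GNU sed; on macOS (BSD sed), use sed -i '' instead of sed -i:

    # From the Kafka home directory, replace the default Zookeeper port 2181 with 2180
    sed -i 's/2181/2180/g' config/zookeeper.properties config/server.properties
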
  2. Start Kafka's Zookeeper as follows:

    ./bin/zookeeper-server-start.sh config/zookeeper.properties
    
  3. Open a new terminal window and navigate to the Kafka home directory.

  4. Start a Kafka broker as follows:

    ./bin/kafka-server-start.sh config/server.properties
    
  5. From another terminal window, run this command to create a Kafka topic called wikipedia, the topic to which you will send data. If you changed the Kafka Zookeeper port to 2180 as described above, use localhost:2180 in the --zookeeper argument:

    ./bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic wikipedia
    

    Kafka returns a message when it successfully adds the topic: Created topic wikipedia.
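
Optionally, confirm that the topic exists by listing the topics Kafka has registered in Zookeeper; the output should include wikipedia. Use localhost:2180 here if you changed the Zookeeper port earlier:

    # List all topics known to this Kafka cluster
    ./bin/kafka-topics.sh --list --zookeeper localhost:2181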

Step 2: Enable Druid Kafka ingestion

You can use Druid's Kafka indexing service to ingest messages from your newly created wikipedia topic. To start the service, navigate to the Imply directory and submit a supervisor spec to the Druid overlord as follows:

curl -XPOST -H'Content-Type: application/json' -d @quickstart/wikipedia-kafka-supervisor.json http://localhost:8090/druid/indexer/v1/supervisor

If you are not using a locally running Imply instance:

  • Copy the contents of the following listing to a file.
  • Modify the value for bootstrap.servers to the address and port where Druid can access the Kafka broker.
  • Post the updated file to the URL where your Druid Overlord process is running (see the example after the listing).

{
  "type": "kafka",
  "dataSchema": {
    "dataSource": "wikipedia-kafka",
    "parser": {
      "type": "string",
      "parseSpec": {
        "format": "json",
        "timestampSpec": {
          "column": "timestamp",
          "format": "iso"
        },
        "dimensionsSpec": {
          "dimensions": [
            "isRobot",
            "channel",
            "flags",
            "isUnpatrolled",
            "page",
            "diffUrl",
            "comment",
            "isNew",
            "isMinor",
            "user",
            "namespace",
            { "name" : "commentLength", "type" : "long" },
            { "name" : "deltaBucket", "type" : "long" },
            "cityName",
            "countryIsoCode",
            "countryName",
            "isAnonymous",
            "metroCode",
            "regionIsoCode",
            "regionName",
            { "name": "added", "type": "long" },
            { "name": "deleted", "type": "long" },
            { "name": "delta", "type": "long" }
          ]
        }
      }
    },
    "metricsSpec" : [],
    "granularitySpec": {
      "type": "uniform",
      "segmentGranularity": "HOUR",
      "queryGranularity": "NONE",
      "rollup": true
    }
  },
  "ioConfig": {
    "topic": "wikipedia",
    "consumerProperties": {
      "bootstrap.servers": "localhost:9092"
    }
  }
}
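
For example, if you saved the edited spec to a file named my-kafka-supervisor.json (a name used here only for illustration) and your Overlord runs on another machine, the POST might look like the following, with OVERLORD_HOST standing in for your own address:

    # Submit the supervisor spec to a remote Druid Overlord (default port 8090)
    curl -XPOST -H'Content-Type: application/json' \
      -d @my-kafka-supervisor.json \
      http://OVERLORD_HOST:8090/druid/indexer/v1/supervisor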


When the Overlord successfully creates the supervisor, it returns a response containing the ID of the supervisor. In this case: {"id":"wikipedia-kafka"}.

For more details about what's going on here, check out the Druid Kafka indexing service documentation.
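
To confirm that the supervisor is running and connected to Kafka, you can also query the Overlord's supervisor status API. A minimal check, assuming the Overlord is listening on localhost:8090 as in the command above:

    # Retrieve the current state of the wikipedia-kafka supervisor
    curl http://localhost:8090/druid/indexer/v1/supervisor/wikipedia-kafka/status

The response should report a state such as RUNNING along with offset and lag information for the Kafka partitions.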

Step 3: Load historical data

Now it's time to launch a console producer for your topic and send some data!

  1. Navigate to your Kafka directory.

  2. Modify the following command to replace {PATH_TO_IMPLY} with the path to your Imply directory:

    export KAFKA_OPTS="-Dfile.encoding=UTF-8"
    ./bin/kafka-console-producer.sh --broker-list localhost:9092 --topic wikipedia < {PATH_TO_IMPLY}/quickstart/wikipedia-2016-06-27-sampled.json
    
  3. Run the command to post sample events to the wikipedia Kafka topic.

  4. The Kafka indexing service reads the events from the topic and ingests them into Druid.

  5. Within the Druid console, navigate to the Datasources page to verify that the new wikipedia-kafka datasource appears:

    [Screenshot: Datasources view showing the wikipedia-kafka datasource]

  6. Click the segments link to see the list of segments generated by ingestion. Because the ingestion spec sets segmentGranularity to HOUR, there is one segment for each hour of data, around 22 segments for this sample dataset.

    [Screenshot: segments generated for the wikipedia-kafka datasource]
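
If the datasource does not appear, one way to narrow down the problem is to confirm that the sample events actually reached the Kafka topic. The following optional check reads a few messages back with Kafka's console consumer (run it from the Kafka home directory):

    # Print the first five messages from the wikipedia topic
    ./bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 \
      --topic wikipedia --from-beginning --max-messages 5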

That's it! You can now query the historical data in the console, but first try ingesting real-time data, as follows.

Step 4: Load real-time data

Exploring historical Wikipedia edits is useful, but it is even more interesting to explore trends on Wikipedia happening right now.

To do this, you can download a helper application that parses events from Wikimedia's IRC feeds and posts them to the wikipedia Kafka topic you created earlier, as follows:

  1. From a terminal, run the following commands to download and extract the helper application:

    curl -O https://static.imply.io/quickstart/wikiticker-0.8.tar.gz
    tar -xzf wikiticker-0.8.tar.gz
    cd wikiticker-0.8-SNAPSHOT
    
  2. Now run wikiticker, passing "wikipedia" as the -topic parameter:

    bin/wikiticker -J-Dfile.encoding=UTF-8 -out kafka -topic wikipedia
    
  3. After a few moments, look for an additional segment of real-time data in the wikipedia-kafka datasource.

  4. As Kafka data is sent to Druid, you can immediately query it. To see the latest data sent by wikiticker, set a time floor for the latest hour, as shown (a command-line equivalent follows these steps):

    [Screenshot: query results filtered to the latest hour]
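
If you also want to check this from the command line, you can express the same idea as a Druid SQL query over HTTP. This is only a sketch, assuming the Druid router serves the SQL endpoint at localhost:8888, as in the Imply single-machine quickstart; note the escaped quotes around the hyphenated datasource name:

    # Count edits per channel for the most recent hour of data
    curl -XPOST -H'Content-Type: application/json' http://localhost:8888/druid/v2/sql \
      -d "{\"query\": \"SELECT channel, COUNT(*) AS edits FROM \\\"wikipedia-kafka\\\" WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' HOUR GROUP BY channel ORDER BY edits DESC LIMIT 10\"}"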

Step 5: Build a data cube

Next, try configuring a data cube in Pivot:

  1. Navigate to Pivot at http://localhost:9095. You should see your wikipedia-kafka datasource: [Screenshot: wikipedia-kafka datasource listed in Pivot]

  2. Click on the data source and then click Create a data cube, and confirm when prompted.

  3. Click Go to data cube.

  4. You can now slice and dice and explore the data as you would any other data cube: [Screenshot: data cube view in Pivot]

Next steps

So far, you've loaded data using an ingestion spec included in the Imply distribution. Each ingestion spec is designed for a particular dataset. You can load your own datasets by writing a custom ingestion spec.

To write your own ingestion spec, copy the contents of the quickstart/wikipedia-kafka-supervisor.json file (or the listing above) into a new file as a starting point and edit it as needed.

Alternatively, use the Druid data loader UI to generate the ingestion spec by clicking Apache Kafka from the Load Data page.

The steps for configuring Kafka ingestion in the data loader are similar to those for batch file ingestion, as described in the Quickstart. However, you will use the Kafka bootstrap server as the source, as shown:

[Screenshot: Kafka bootstrap server specified as the source in the data loader]

As a starting point, you can keep most settings at their default values. At the Tune step, however, you must choose whether to retrieve the earliest or latest offsets in Kafka by selecting False or True for the input tuning.

When you load your own Kafka topics, Druid creates at least one segment for every Kafka partition for every segmentGranularity period. If your segmentGranularity period is "HOUR" and you have three Kafka partitions, then Druid creates at least three segments per hour.

For best performance, keep your number of Kafka partitions appropriately sized to avoid creating too many segments.

For more details, see Druid's Kafka indexing service documentation.
