Imply Quickstart

This guide introduces you to Imply by taking you through the steps to install Imply, load sample data, and then query and create visualizations of your dataset.

There are several ways to get started with Imply. The easiest way is to sign up for an Imply Cloud (AWS) Free Trial. Imply Cloud is a managed service that deploys and manages scalable Imply clusters directly in your AWS account.

Alternatively, you can install Imply as an evaluation instance on a single machine using the quickstart configuration. This type of installation is not managed by the Imply Manager.

For a quickstart installation of a managed Imply instance, see the Kubernetes quickstart.

Get and start Imply

To get started with one of these methods, continue with the relevant section:

  • Start Imply Cloud
  • Start Unmanaged Imply

Start Imply Cloud

To follow these steps, you will need an Imply Cloud account. Sign up for a free account if you do not have one.

  1. When you log into Imply Cloud, you will start at the Clusters view:

    Clusters View

  2. In this view, click the New cluster button in the top right-hand corner.

  3. Choose a name for your cluster, and use the default values for the remainder of the settings.

  4. Click Create cluster to launch a cluster in your AWS VPC. Note that clusters can take 20–30 minutes to launch.

    The cluster you'll create in this quickstart is not highly available. For high availability, consider as a starting point the recommended topology for the Imply Cloud trial: three m5.large instances for master servers, two c5.large instances for query servers, and three i3.xlarge instances for data servers.

Congratulations! Now it's time to load data.

Start Unmanaged Imply

This section describes how to install and start Imply on a single machine using the quickstart configuration.

The configuration used in this quickstart is intended to minimize resource usage and is not meant for load testing large production data sets. For production-ready installations, see Production-ready installation instructions.

To run a single-machine Imply instance with the quickstart configuration, you will need:

  • Java 8 (8u92 or higher)
  • Linux, Mac OS X, or other Unix-like OS (Windows is not supported)
  • At least 4GB of RAM

Imply builds and certifies its releases using OpenJDK. We suggest selecting a distribution that provides long-term support and open-source licensing. Amazon Corretto and Azul Zulu are two good options.
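
To confirm that a compatible Java runtime is installed, you can check the version before proceeding (the exact build string in your output will differ):

java -version
# Expect a 1.8 release of at least 8u92, for example:
# openjdk version "1.8.0_282"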

Download Imply

Download Imply 2020.12 from imply.io/get-started and unpack the release archive.

tar -xzf imply-2020.12.tar.gz
cd imply-2020.12

In this package, you'll find:

  • bin/* - run scripts for included software.
  • conf/* - template configurations for a clustered setup.
  • conf-quickstart/* - configurations for this quickstart.
  • dist/* - all included software.
  • quickstart/* - files useful for this quickstart.

Start Imply

Next, you'll need to start the Imply services, which include Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
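
For example, a minimal sketch of checking service logs from another terminal (the exact file names under var/sv/ depend on the release, so adjust to whatever ls shows):

ls var/sv/
# Follow the log for one service, for example Pivot (hypothetical file name):
tail -f var/sv/pivot.log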

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory.

Congratulations! Now it's time to load data.

Load data file

In this tutorial, we will fetch and load data representing Wikipedia edits from June 27, 2016, from a public web server. To complete this tutorial, your cluster must be able to reach static.imply.io.
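
Before starting the data loader, you can optionally confirm that the sample file is reachable and preview the first record from a terminal (this assumes curl and gunzip are available on your machine):

curl -s https://static.imply.io/data/wikipedia.json.gz -o /tmp/wikipedia.json.gz
gunzip -c /tmp/wikipedia.json.gz | head -n 1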

1. Open Pivot. To access Pivot:

  • In Imply Cloud: Click the Open button from the cluster list or cluster overview page.
  • In a self-hosted installation: Once Imply is up and running, go to http://localhost:9095. You should see a page similar to the following.

If you get a connection refused error, your Imply cluster may not yet be online. Try waiting a few seconds and refreshing the page.

quickstart 1
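
If you do hit a connection refused error with the unmanaged quickstart, a quick check from the same machine (assuming Pivot's default port 9095) is:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:9095
# 200 means Pivot is up; 000 (or a curl connection error) means the services are still starting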

2. Open the Druid console. Click Load data to open the data loader.

This data loader allows you to ingest from a number of static and streaming sources:

quickstart 2

3. Start the data loader. Select the HTTP(s) option, since we'll be loading data from an online location, then click Connect data to start the flow.

4. Sample the data. Enter https://static.imply.io/data/wikipedia.json.gz in the URIs input and click Apply to preview the data. Once you see the screen below, you are ready to move to the next step by clicking Next: Parse data.

quickstart 3

5. Configure the parser. The data loader automatically detects the parser type for the data and presents a preview of the parsed output. In this case, it should have suggested the json parser, as is appropriate for this dataset. Proceed to the next step by clicking Next: Parse time.

quickstart 4

6. Configure the time column parsing. Druid uses a timestamp column to partition your data. This page allows you to identify which column should be used as the primary time column and how the timestamp is formatted. In this case, the loader should have automatically detected the timestamp column and chosen the iso format. Click Next: Transform to continue.

quickstart 5

7. Skip the transform, filter, and configure schema steps. Accept the defaults for the next three steps, until you reach the partition settings. In short, a transform lets you modify columns at ingestion time and create new derived columns. Filters allow you to exclude unwanted rows from the ingested data. Roll-up causes similar events to be aggregated during indexing, which reduces disk usage and speeds up queries for certain types of data. Click Next: Filter, Next: Configure schema, and then Next: Partition.

8. Configure the partition. Data in Imply is always primarily partitioned by time. In the partition settings, you can choose the granularity of the time intervals. Choose DAY as the Segment granularity. Also choose dynamic as the Partitioning type option, which results in secondary partitioning based on the number of rows in a segment. (See partitionsSpec for details.) Click Next: Tune.

quickstart 8

9. Skip the tune and publish steps. The next sections of the data loader allow you to modify tuning and publishing parameters for the ingestion job. The defaults here are appropriate, so click Next: Publish, and then Next: Edit spec. For subsequent jobs, note that the Publish section is where you specify the name of the datasource which is used when managing or querying your data.

10. Examine the final spec. The last page of the data loader provides an overview of the ingestion spec that will be submitted. Here, advanced users can make manual adjustments to the spec to configure functionality not available through the data loader. When you are ready, click Submit to begin the ingestion task.

quickstart 7

11. Wait for the data to finish loading. You will be taken to the task screen, and should see your task begin to run. Once the task status changes to SUCCESS, you can move on to the next section to define a data cube and begin visualizing the data.

quickstart 9
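
If you prefer the command line, you can also poll task status through the Druid API. A minimal sketch, assuming the quickstart's default router port 8888:

curl -s http://localhost:8888/druid/indexer/v1/tasks
# Each entry includes a statusCode field; wait for "SUCCESS" on your ingestion task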

Create a data cube

Go back to Pivot and make sure that your newly ingested datasource appears in the list (it might take a few seconds for it to show up).

quickstart 10
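
You can also confirm from a terminal that the new datasource is queryable (again assuming the default router port 8888):

curl -s http://localhost:8888/druid/v2/datasources
# The response should include "wikipedia" once its segments are loaded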

Switch to the Visualize section of Pivot by clicking on the Visuals button on the top bar. From here, you can create data cubes to model your data, explore these cubes, and organize views into dashboards. Start by clicking + Create new data cube.

quickstart 11

In the dialog that comes up, make sure that wikipedia is the selected Source and that Auto-fill dimensions and measures is selected. Continue by clicking Next: Create data cube.

From here you can configure the various aspects of your data cube, including defining and customizing the cube's dimensions and measures. The data cube creation flow can intelligently inspect the columns in your data source and determine possible dimensions and measures automatically. We enabled this when we selected Auto-fill dimensions and measures on the previous screen, and you can see that the cube's settings have been largely pre-populated. In our case, the suggestions are appropriate, so we can continue by clicking the Save button in the top-right corner.

Pivot's data cubes are highly configurable and give you the flexibility to represent your dataset, as well as derived and custom columns, in many different ways. The documentation on dimensions and measures is a good starting point for learning how to configure a data cube.

Visualize a data cube

After clicking Save, the data cube view for this new data cube is automatically loaded. In the future, this view can also be loaded by clicking on the name of the data cube (in this example, 'Wikipedia') from the Visuals screen.

quickstart 12

Here, you can explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, you will see the aggregate value of your selected measures. For example, on the wikipedia dataset, you can see the most frequently edited pages by splitting on Page. Drag Page to the Show bar, and keep the default sort, by Number of Events. You should see a screen like the following:

quickstart 13

The data cube view suggests different visualizations based on how you split your data. If you split on a string column, your data is initially presented as a table. If you split on time, the data cube view recommends a time series plot, and if you split on a numeric column you will get a bar chart. Try replacing the Page dimension with Time in the Show bar. Your visualization switches to a time series chart, like the following:

quickstart 14

You can also change the visualization manually by choosing your preferred visualization from the dropdown. If the shown dimensions are not appropriate for a particular visualization, the data cube view will recommend alternative dimensions.

For more information on visualizing data, refer to the Data cubes section.

Run SQL

Imply includes an easy-to-use interface for issuing Druid SQL queries. To access the SQL editor, go to the SQL section. If you are in the visualization view, you can navigate to this screen by selecting SQL from the hamburger menu in the top-left corner of the page. Once there, try running the following query, which will return the most edited Wikipedia pages:

SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5

You should see results like the following:

quickstart 15
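
The same query can also be issued programmatically against Druid's SQL HTTP API. A minimal sketch, assuming the quickstart's default router port 8888:

curl -s -X POST http://localhost:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page ORDER BY Edits DESC LIMIT 5"}'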

For more details on making SQL queries with Druid, see the Druid SQL documentation.

Next steps

Congratulations! You have now deployed a simple Imply cluster, loaded a sample dataset into Imply, defined a data cube, explored some simple visualizations, and executed queries using Druid SQL.

Learn more

Next, you can:

  • Configure a data cube to customize dimensions and measures for your data cube.
  • Create a dashboard with your favorite views and share it.
  • Read more about supported query methods, including visualization or SQL.

Production-ready installation instructions

As previously mentioned, the configuration described in this quickstart is intended for investigatory or learning scenarios. To learn more about production-ready installations, refer to the following guides:

  • For a distributed cluster that uses Kubernetes as the orchestration layer, see:
    • Install Imply on Kubernetes.
    • Install Imply on Azure Kubernetes Service.
    • Install Imply on Google Kubernetes Engine.
  • For a distributed cluster environment without Kubernetes, see Install Imply without Kubernetes.