Quickstart

The easiest way to evaluate Imply is to install it on a single machine. In this quickstart, we'll set up the platform locally, load some example data, and visualize the data.

Prerequisites

You will need a Java runtime (JDK) installed.

Imply builds and certifies its releases using OpenJDK. We suggest selecting a distribution that provides long-term support and open-source licensing; Amazon Corretto and Azul Zulu are two good options.
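
To confirm which Java runtime is on your PATH before continuing, you can run the following (the version string will vary by distribution):

java -version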

Please note that the configurations used for this quickstart are tuned to be light on resource usage and are not meant for load testing. Optimal performance on a given dataset or hardware requires some tuning; see our clustering documentation for details.

Getting started

First, download Imply 3.1.7.1 from imply.io/get-started and unpack the release archive.

tar -xzf imply-3.1.7.1.tar.gz
cd imply-3.1.7.1

In this package, you'll find the pieces used throughout this tutorial:

- bin/* - run scripts, including the supervise and post-index-task programs used below
- conf/* - configuration files, including conf/supervise/quickstart.conf
- quickstart/* - sample Wikipedia data and a ready-made ingestion spec

A var/* directory is created at runtime to hold service logs and state.

Start up services

Next, you'll need to start up Imply, which includes Druid, Pivot, and ZooKeeper. You can use the included supervise program to start everything with a single command:

bin/supervise -c conf/supervise/quickstart.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
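
For example, to confirm that the services are coming up, you can list and follow the log files and check that Pivot is answering on port 9095 (the port used later in this tutorial). The exact log file names depend on the services defined in quickstart.conf:

ls var/sv/                       # one log per supervised service
tail -f var/sv/<service log>     # replace with an actual file name from the listing
curl -I http://localhost:9095    # Pivot responds here once it has started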

Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you want a clean start after stopping the services, remove the var/ directory.

Congratulations, now it's time to load data!

Load data file

Imply 3.1.7.1 allows you to load and visualize your data through a web-based interface. In this tutorial we will fetch and load a sample of Wikipedia edits from June 27, 2016, hosted on a public web server. If firewall or connectivity restrictions prevent you from making an outbound request to fetch the file, you can load the sample manually using the instructions in the addendum below.
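
If you are not sure whether your environment can make that outbound request, a quick way to check is to fetch just the file's headers with curl; a successful (200) response means the data loader should be able to download it as well:

curl -I https://static.imply.io/data/wikipedia.json.gz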

The data loader cannot load every type of data supported by Druid. You may find that for your particular dataset, you need to submit an ingestion task directly rather than going through the data loader. For an example of how to do this, see the section Loading sample Wikipedia data offline.

1. Open Pivot. To access Pivot, go to http://localhost:9095. You should see a page similar to the following screenshot. If you see a connection refused error, it may mean your Druid cluster is not yet online; try waiting a few seconds and refreshing the page.

quickstart 1

2. Navigate to the Druid console by clicking the "Druid console" button in the top right of the screen. Then click Load data to reach the Druid data loader. This data loader allows you to ingest from a number of static and streaming sources. You should now see the following screen:

quickstart 2

3. Sample the data. For this tutorial, we will be using a dataset of Wikipedia edits hosted online. Select the HTTP(s) option to start the flow, enter https://static.imply.io/data/wikipedia.json.gz in the URIs input, and preview the data. Once you see the screen below, you are ready to move on to the next step.

quickstart 3

4. Configure the parser. In this step, the data loader automatically guesses the parser type for the data and previews the parsed output. The auto-detected json parser is suitable for this dataset. Go to the next step by clicking Next: Parse time.

quickstart 4

5. Configure the time column parsing. Druid uses a timestamp column to partition your data. This page allows you to identify which column should be used as the primary time column and how the timestamp is formatted. In this case, the loader should have automatically detected the timestamp column and chosen the iso format. Click Next to move on.

quickstart 5
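
If you'd like to see the raw records behind the parser and timestamp previews, you can optionally fetch a few lines of the sample yourself; the timestamps appear in ISO 8601 format, which is why the loader selects the iso format automatically:

curl -s https://static.imply.io/data/wikipedia.json.gz | gunzip -c | head -n 3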

6. Configure the schema. Click Next a few more times to skip the Transform and Filter steps (out of scope for this tutorial) and get to the Configure schema step. Here you will see a preview of how the data will look in Druid once ingested.

Druid can index data using an ingestion-time, first-level aggregation known as "roll-up". Roll-up aggregates similar events during indexing, which can reduce disk usage and speed up queries for certain types of data; for example, two edits with identical dimension values in the same time bucket can be stored as a single row with a count of 2. The Druid Concepts page provides an introduction to how roll-up works. For this quickstart, click the Rollup toggle to turn roll-up off.

quickstart 6

7. Examine the final spec. Click Next until you get to the Edit JSON spec step, skipping the Partition, Tune, and Publish steps (the defaults will do). You have now constructed an ingestion spec; you can examine it and edit it if needed before submitting. Click Submit to submit the ingestion task.

quickstart 7

8. Wait for the data to finish loading. You will be taken to the task screen, with your newly submitted task selected. Once the loader has indicated that the data has been indexed, you can move on to the next section to define a data cube and begin visualizing the data.
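
If you prefer to watch progress from the command line, you can poll Druid's HTTP API directly. The sketch below assumes the default quickstart ports (Overlord on 8090, as shown in the task log URLs later on this page, and Coordinator on 8081); replace <task_id> with the ID shown on the task screen:

# Check the status of the ingestion task
curl http://localhost:8090/druid/indexer/v1/task/<task_id>/status

# List datasources known to the cluster; wikipedia appears once its segments are loaded
curl http://localhost:8081/druid/coordinator/v1/datasources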

This section showed you how to load data from files, but Druid also supports streaming ingestion. Druid's streaming ingestion can load data with virtually no delay between events occurring and being available for queries. For more information, see Loading data.

quickstart 9

Create a data cube

Go back to Pivot and make sure that your newly ingested datasource appears in the list (it might take a few seconds for it to show up).

quickstart 10

Switch to the Visualize section of Pivot by clicking on the Visuals button on the top bar. From here, you can create data cubes to model your data, explore these cubes, and organize views into dashboards. Start by clicking + Create new data cube.

quickstart 11

In the dialog that comes up, make sure that wikipedia is the selected Source and that Auto-fill dimensions and measures is selected. Continue by clicking Next: Create data cube.

From here you can configure the various aspects of your data cube, including defining and customizing the cube's dimensions and measures. The data cube creation flow can intelligently inspect the columns in your data source and determine possible dimensions and measures automatically. We enabled this when we selected Auto-fill dimensions and measures on the previous screen, which is why the cube's settings are largely pre-populated. In this case the suggestions are appropriate, so continue by clicking the Save button in the top-right corner.

Pivot's data cubes are highly configurable and give you the flexibility to represent your dataset, as well as derived and custom columns, in many different ways. The documentation on dimensions and measures is a good starting point for learning how to configure a data cube.

Visualize a data cube

After clicking Save, the data cube view for this new data cube is automatically loaded. In the future, this view can also be loaded by clicking on the name of the data cube (in this example, 'Wikipedia') from the Visualize screen.

quickstart 12

Here, you can explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, you will see the aggregate value of your selected measures. For example, on the wikipedia dataset, you can see the most frequently edited pages by splitting on Page (drag Page to the Show bar) and sorting by Number of Events (this is the default sort; you can also click on any column to sort by it). You should see a screen like the following:

quickstart 13

The data cube view suggests different visualizations based on how you split your data. If you split on a string column, your data will initially be presented as a table. If you split on time, the data cube view will recommend a timeseries plot, and if you split on a numeric column you will get a bar chart. Try replacing the Page dimension with Time in the Show bar, which will switch your visualization to a timeseries chart like the following:

quickstart 14

You can also change the visualization manually by choosing your preferred visualization from the dropdown. If the shown dimensions are not appropriate for a particular visualization, the data cube view will recommend alternative dimensions you can show.

If you would like more information on visualizing data, please refer to the Data cubes section.

Run SQL

Imply includes an easy-to-use interface for issuing Druid SQL queries. To access the SQL editor, go to the Run SQL section. If you are in the visualization view, you can navigate to this screen by selecting SQL from the hamburger menu in the top-left corner of the page. Once there, try running the following query, which will return the most edited Wikipedia pages:

SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5

You should see results like the following:

quickstart 15

For more details on making SQL queries with Druid, see the Druid SQL documentation.
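
The same query can also be issued over HTTP, which is handy for scripting. The sketch below assumes the Druid broker is running on its default port (8082) and uses the standard /druid/v2/sql endpoint; adjust the host and port if your configuration differs:

curl -X POST http://localhost:8082/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT page, COUNT(*) AS Edits FROM wikipedia GROUP BY page ORDER BY Edits DESC LIMIT 5"}'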

Next steps

Congratulations! You have now installed and run Imply on a single machine, loaded a sample dataset into Druid, defined a data cube, explored some simple visualizations, and executed queries using Druid SQL.

Next, you can:

- Load your own data, in batch or streaming form; see Loading data.
- Move beyond a single machine; see the clustering documentation for deploying Imply across multiple servers.

Addendum: Loading sample Wikipedia data offline

If you are unable to access the public web server, you can load the same dataset from a local file bundled in this distribution.

Simply select Local disk from the initial data source screen.

quickstart extra 1

Then enter quickstart/ as the base directory and wikipedia-2016-06-27-sampled.json as the file filter, and follow the steps outlined above.

quickstart extra 2

Alternatively, the quickstart directory includes a sample dataset and an ingestion spec to process it, named wikipedia-2016-06-27-sampled.json and wikipedia-index.json respectively.

To submit an indexing job to Druid for this ingestion spec, run the following command from your Imply directory:

bin/post-index-task --file quickstart/wikipedia-index.json

A successful run will generate logs similar to the following:

Beginning indexing data for wikipedia
Task started: index_wikipedia_2017-12-05T03:22:28.612Z
Task log:     http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/log
Task status:  http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/status
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia loading complete! You may now query your data

After the dataset has been created, you can move on to the next step to create a data cube.
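
If you'd rather not use the helper script, you can submit the same file directly to the Overlord's task endpoint (the same endpoint that appears in the task log URLs above), assuming wikipedia-index.json contains a complete native ingestion task spec, as post-index-task expects:

curl -X POST http://localhost:8090/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @quickstart/wikipedia-index.json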
