This guide introduces you to Imply by taking you through the steps to install Imply, load sample data, and create data visualizations.
There are several ways to get an Imply cluster or single-server instance up and running in a self-hosted environment. Depending on how you plan to use Imply, you may want to follow the guide for that deployment mode instead:
For a distributed cluster using Kubernetes as the orchestration layer, see the Kubernetes quickstart.
For a distributed cluster environment without Kubernetes, see the Docker deployment guide. In this mode, you use Docker to create Imply cluster agents, and the Imply Manager to manage the cluster.
If you do not yet have an Imply license, you can start with a single-machine installation using the following instructions. This deployment mode does not include the Imply Manager.
In any case, after installing and starting up Imply in the mode of your choosing, you can return to this quickstart and start with loading data.
The configuration described in this quickstart can be installed on a single machine. The machine should have:
Imply builds and certifies its releases using OpenJDK. We suggest selecting a distribution that provides long-term support and open-source licensing. Amazon Corretto and Azul Zulu are two good options.
The configuration used here is designed for minimal resource usage and is not meant for load testing of large production datasets. Optimizing performance for a given dataset, use case, or hardware specification requires tuning. See planning documentation for more information.
After confirming the requirements, follow these steps:
tar -xzf imply-22.214.171.124.tar.gz
cd imply-126.96.36.199
bin/supervise -c conf/supervise/quickstart.conf
A log message appears for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory from another terminal.
To stop the services, use Ctrl-C in the terminal in which the services are running.
To perform a clean start after stopping the services, remove the var/ directory in the Imply home directory.
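If you start the services from a script, it can help to wait until a service accepts connections before moving on. The following is a minimal sketch, not part of the Imply distribution: the helper name `wait_for_port` and its timeout values are illustrative assumptions, and an open port only shows that a process is listening, not that the service is fully initialized.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection could be made before timeout_s elapsed.
    This is a hypothetical helper for illustration only.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (for example, connection
            # refused) while nothing is listening on the port yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 9095)` waits for the Pivot port used later in this quickstart.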
Imply is now up and running. Now try loading some data.
In this tutorial we will fetch and load data representing Wikipedia edits from June 27, 2016 from a public web server. If a firewall or connectivity restrictions prevent you from making outbound requests to fetch the file, you can load the sample manually using the instructions below.
1. Open Pivot. To access Pivot, go to http://localhost:9095. You should see a page similar to the following screenshot. If you see a connection refused error, it may mean your Druid cluster is not yet online. Try waiting a few seconds and refreshing the page.
2. Open the Druid Console. Click the Load data button at the top right to open the data loader in the Druid console.
This data loader allows you to ingest from a number of static and streaming sources, as follows:
3. Start the data loader. Because we'll be loading data from an online location, select the HTTP(s) option, then click Connect data to start the flow.
You can get much more information for any of the settings or steps mentioned in the following instructions by clicking the information icon next to the setting in the UI, or by clicking the link to the Druid documentation in the page description in the top-right panel.
4. Sample the data. Enter https://static.imply.io/data/wikipedia.json.gz in the URIs input and click Apply to preview the data.
Once you see the screen below you are ready to move to the next step by clicking Next: Parse data.
The data loader cannot load every type of data supported by Druid. You may find that for your particular dataset, you need to load data by submitting a task directly, rather than going through the data loader. For an example of how to do this, see the section Loading sample Wikipedia data offline.
5. Configure the parser. The data loader automatically detects the parser type for the data and presents a preview of the
parsed output. In this case, it should have suggested the
json parser, as is appropriate for this dataset. Proceed to the next step by clicking Next: Parse time.
6. Configure the time column parsing. Druid uses a timestamp column to partition your data. This page allows you to identify which column should be used as the primary time column and how the timestamp is formatted. In this case, the loader should have automatically detected the timestamp column and chosen a matching timestamp format.
Click Next: Transform to continue.
7. Configure the schema. Click Next a few more times to skip the Transform and Filter steps. The Transform step lets you modify columns at ingestion time or create new derived columns, while the Filter step lets you exclude unwanted rows.
The Configure schema step presents a preview of how the data will look in Druid after ingestion. Druid can index data using an ingestion-time, first-level aggregation known as "roll-up". Roll-up causes similar events to be aggregated during indexing, which can result in reduced disk usage and faster queries for certain types of data. The Druid Concepts page provides an introduction to how roll-up works. For this quickstart, click the Rollup toggle to turn roll-up off, then click Next: Partition.
8. Configure the partition. The partition defines the time chunk granularity of the ingested data. A time chunk has
one or more data segments, with the data timestamped to that time chunk. Choose
DAY as the Segment granularity for our
data and click Next: Tune.
9. Examine the final spec. Click Next until you get to the Edit JSON spec step, accepting the defaults in the Tune and Publish panes. Note that the Publish step is where you can specify a name for the datasource. This name identifies the datasource in the Datasources list, among other places, so a descriptive name may be helpful.
You have constructed an ingestion spec. You can edit it if needed prior to submitting it. Click Submit to submit the ingestion task.
10. Wait for the data to finish loading. You will be taken to the task screen, with your newly submitted task selected. Once the loader has indicated that the data has been indexed, you can move on to the next section to define a data cube and begin visualizing the data.
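For orientation, the choices made in the steps above (HTTP source, JSON parser, timestamp column, DAY segment granularity, roll-up off) map onto fields of the ingestion spec. The sketch below builds a comparable spec as a Python dictionary. It is an approximation under stated assumptions, not the exact spec the data loader generates: the field names follow the native batch (`index_parallel`) format used by recent Druid releases, the timestamp format `auto` and the empty dimension list are placeholders, and the function name is hypothetical. Compare it with what you see in the Edit JSON spec step.

```python
import json


def build_wikipedia_spec(datasource: str = "wikipedia") -> dict:
    """Build an illustrative native batch ingestion spec (assumed layout)."""
    return {
        "type": "index_parallel",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                # Step 6: primary time column. "auto" is an assumption here;
                # the data loader picks a concrete format for you.
                "timestampSpec": {"column": "timestamp", "format": "auto"},
                # Left empty for brevity (assumption); the data loader
                # lists the detected columns explicitly.
                "dimensionsSpec": {"dimensions": []},
                "granularitySpec": {
                    "segmentGranularity": "day",  # step 8
                    "queryGranularity": "none",
                    "rollup": False,              # step 7: roll-up off
                },
            },
            "ioConfig": {
                "type": "index_parallel",
                # Steps 3 and 4: HTTP(s) source and sample URI.
                "inputSource": {
                    "type": "http",
                    "uris": ["https://static.imply.io/data/wikipedia.json.gz"],
                },
                # Step 5: json parser.
                "inputFormat": {"type": "json"},
            },
            "tuningConfig": {"type": "index_parallel"},
        },
    }


print(json.dumps(build_wikipedia_spec(), indent=2))
```

Keeping the spec as data makes it easy to review in version control or to adapt for another datasource name.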
This section showed you how to load data from files, but Druid also supports streaming ingestion. Druid's streaming ingestion can load data with virtually no delay between events occurring and being available for queries. For more information, see Loading data.
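The roll-up option mentioned in step 7 can be easier to picture with a small worked example. The sketch below is a conceptual model in plain Python, not Druid's implementation: it truncates each timestamp to the hour and combines events that share the same hour and dimension values. The sample events, the `added` metric, and the hourly granularity are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, metric):
    """Aggregate events that share the same hour and dimension values."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        # Truncate to the hour; this plays the role of a query granularity.
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        key = (bucket.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


# Hypothetical sample events, for illustration only.
events = [
    {"timestamp": "2016-06-27T00:00:11", "page": "A", "added": 10},
    {"timestamp": "2016-06-27T00:30:00", "page": "A", "added": 5},
    {"timestamp": "2016-06-27T01:05:00", "page": "A", "added": 7},
    {"timestamp": "2016-06-27T00:45:00", "page": "B", "added": 1},
]

for key, agg in roll_up(events, ["page"], "added").items():
    print(key, agg)
```

Here four input events collapse into three stored rows, because the first two share both the hour and the page. With roll-up turned off, as in this quickstart, every event is stored as its own row.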
Go back to Pivot and make sure that your newly ingested datasource appears in the list (it might take a few seconds for it to show up).
Switch to the Visualize section of Pivot by clicking on the Visuals button on the top bar. From here, you can create data cubes to model your data, explore these cubes, and organize views into dashboards. Start by clicking + Create new data cube.
In the dialog that comes up, make sure that
wikipedia is the selected Source and that Auto-fill dimensions and measures is selected.
Continue by clicking Next: Create data cube.
From here you can configure the various aspects of your data cube, including defining and customizing the cube's dimensions and measures. The data cube creation flow can intelligently inspect the columns in your data source and determine possible dimensions and measures automatically. We enabled this when we selected Auto-fill dimensions and measures on the previous screen and you can see that the cube's settings have been largely pre-populated. In our case, the suggestions are appropriate so we can continue by clicking on the Save button in the top-right corner.
Pivot's data cubes are highly configurable and give you the flexibility to represent your dataset, as well as derived and custom columns, in many different ways. The documentation on dimensions and measures is a good starting point for learning how to configure a data cube.
After clicking Save, the data cube view for this new data cube is automatically loaded. In the future, this view can also be loaded by clicking on the name of the data cube (in this example, 'Wikipedia') from the Visualize screen.
Here, you can explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, you will see the aggregate value of your selected measures. For example, on the wikipedia dataset, you can see the most frequently edited pages by splitting on Page. Drag Page to the Show bar, and keep the default sort, by Number of Events. You should see a screen like the following:
The data cube view suggests different visualizations based on how you split your data. If you split on a string column, your data is initially presented as a table. If you split on time, the data cube view recommends a time series plot, and if you split on a numeric column you will get a bar chart. Try replacing the Page dimension with Time in the Show bar. Your visualization switches to a time series chart, like the following:
You can also change the visualization manually by choosing your preferred visualization from the dropdown. If the shown dimensions are not appropriate for a particular visualization, the data cube view will recommend alternative dimensions.
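Conceptually, splitting on a dimension and sorting by Number of Events amounts to grouping rows by that dimension's value and counting each group, with an optional filter applied first. The sketch below illustrates the idea in plain Python. It does not reflect how Pivot or Druid compute results, and the function name and sample rows (including the `user` column) are hypothetical.

```python
from collections import Counter


def top_by_split(rows, dimension, limit=5, where=None):
    """Count rows per value of `dimension`, most frequent first.

    `where` is an optional predicate acting as a filter on the rows.
    """
    matching = (row for row in rows if where is None or where(row))
    return Counter(row[dimension] for row in matching).most_common(limit)


# Hypothetical rows, for illustration only.
rows = [
    {"page": "A", "user": "x"},
    {"page": "B", "user": "y"},
    {"page": "A", "user": "y"},
    {"page": "C", "user": "x"},
    {"page": "A", "user": "x"},
    {"page": "B", "user": "x"},
]

# Split on page, sorted by number of events.
print(top_by_split(rows, "page"))
# The same split, filtered to a single user.
print(top_by_split(rows, "page", limit=1, where=lambda r: r["user"] == "x"))
```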
For more information on visualizing data, refer to the Data cubes section.
Imply includes an easy-to-use interface for issuing Druid SQL queries. To access the SQL editor, go to the Run SQL section. If you are in the visualization view, you can navigate to this screen by selecting SQL from the hamburger menu in the top-left corner of the page. Once there, try running the following query, which will return the most edited Wikipedia pages:
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5
You should see results like the following:
For more details on making SQL queries with Druid, see the Druid SQL documentation.
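The same query can also be sent programmatically, since Druid accepts SQL over HTTP as a JSON body posted to the `/druid/v2/sql` path. The sketch below uses only the Python standard library. The host and port (`localhost:8082`, Druid's default Broker port) are assumptions about your deployment, and the function names are illustrative, so adjust them to match your setup.

```python
import json
import urllib.request

# Assumption: a Broker is reachable on Druid's default Broker port.
SQL_ENDPOINT = "http://localhost:8082/druid/v2/sql"

TOP_PAGES_SQL = """
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00'
                   AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5
"""


def build_sql_request(sql, endpoint=SQL_ENDPOINT):
    """Build (but do not send) the POST request for a Druid SQL query."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(sql, endpoint=SQL_ENDPOINT):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(sql, endpoint)) as resp:
        return json.load(resp)
```

With the services running, `run_sql(TOP_PAGES_SQL)` should return one JSON object per result row. Building the request separately from sending it keeps the payload easy to inspect.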
Congratulations! You have now installed and run Imply on a single machine, loaded a sample dataset into Druid, defined a data cube, explored some simple visualizations, and executed queries using Druid SQL.
Next, you can:
If you are unable to access the public web server, you can load the same dataset from a local file bundled in this distribution.
In the data loader, select Local disk from the initial data source screen. Enter the quickstart directory as the base directory and wikipedia-2016-06-27-sampled.json as the file filter, then follow the steps outlined above.
Alternatively, you can submit an ingestion task directly. The quickstart directory includes a sample dataset and an ingestion spec to process the data, named wikipedia-2016-06-27-sampled.json and wikipedia-index.json, respectively.
To submit an indexing job to Druid for this ingestion spec, run the following command from your Imply directory:
bin/post-index-task --file quickstart/wikipedia-index.json
A successful run will generate logs similar to the following:
Beginning indexing data for wikipedia
Task started: index_wikipedia_2017-12-05T03:22:28.612Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/status
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia loading complete! You may now query your data
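The Task status URL printed in the log can also be checked from a script. The sketch below builds that URL from a task ID and reads the task state from the JSON response. The base address is taken from the sample log above. The response layout (a `status` object containing a `statusCode` or `status` field) is an assumption that may vary between Druid versions, and the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Base address as it appears in the sample log output.
OVERLORD = "http://localhost:8090"


def task_status_url(task_id, base=OVERLORD):
    """Build the status URL for a task, in the form shown in the log."""
    # Colons are kept as-is to match the log; other reserved characters,
    # including slashes, are percent-encoded.
    encoded = urllib.parse.quote(task_id, safe=":")
    return f"{base}/druid/indexer/v1/task/{encoded}/status"


def extract_state(payload):
    """Pull the task state out of a status response (assumed layout)."""
    status = payload.get("status") or {}
    return status.get("statusCode") or status.get("status")


def fetch_task_state(task_id, base=OVERLORD):
    """Fetch the status document and return the task state, if present."""
    with urllib.request.urlopen(task_status_url(task_id, base)) as resp:
        return extract_state(json.load(resp))
```

For the run shown above, `fetch_task_state("index_wikipedia_2017-12-05T03:22:28.612Z")` would be expected to report a successful state once the task has finished.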
After the dataset has been created, you can move on to the next step to create a data cube.