The easiest way to evaluate Imply is to install it on a single machine. In this quickstart, we'll set up the platform locally, load some example data, and visualize the data.
You will need:
On Mac OS X, you can use Oracle's JDK 8 to install Java.
On Linux, your OS package manager should be able to help with installing Java. If your distribution does not provide a recent enough version of Java, Azul offers Zulu, an open source OpenJDK-based distribution with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
Please note that the configurations used for this quickstart are tuned to be light on resource usage and are not meant for load testing. Optimal performance on a given dataset or hardware requires some tuning; see our clustering documentation for details.
First, download Imply 2.7.8 from imply.io/get-started and unpack the release archive.
tar -xzf imply-2.7.8.tar.gz
cd imply-2.7.8
In this package, you'll find:
bin/* - run scripts for included software.
conf/* - template configurations for a clustered setup.
conf-quickstart/* - configurations for this quickstart.
dist/* - all included software.
quickstart/* - files useful for this quickstart.
Next, start the Imply services by running the supervise program from the imply-2.7.8 directory:
bin/supervise -c conf/supervise/quickstart.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
Later on, if you'd like to stop the services, CTRL-C the supervise program in your terminal. If you
want a clean start after stopping the services, remove the
Congratulations, now it's time to load data!
Imply 2.7.8 includes a web-based interface for loading, visualizing, and running queries on your data. Starting with Imply 2.4, this interface includes a visual data loader, which we will use for this quickstart. We will fetch and load a sample of Wikipedia edits from June 27, 2016 from a public web server. If firewall or connectivity restrictions prevent you from making an outbound request to fetch the file, you can load the sample manually using the instructions in the section Loading sample Wikipedia data offline.
The visual data loader in Imply 2.7.8 is a product preview and is not yet suitable for loading all types of datasets. You may find that, for your particular dataset, it is necessary to load data using the Druid APIs rather than the data loader. As of this version, using the Druid API instead of the data loader is often necessary for streaming datasets and for large batch datasets. For an example of how to do this, see the section Loading sample Wikipedia data offline.
1. Open Imply. To access Imply, go to http://localhost:9095. You should see a page similar to the following screenshot. If you see a connection refused error, it may mean your Druid cluster is not yet online; try waiting a few seconds and refreshing the page.
2. Start connecting to the sample Wikipedia dataset. In the top-right corner, click + Add dataset. This data loader allows you to ingest from a number of static and streaming sources such as Apache Kafka, Amazon S3, and over HTTP. For this quickstart, we will be using the Wikipedia Edits dataset listed under Examples. Select this option to connect to the data source. You should now see the following screen:
3. Load sample data. The Wikipedia sample uses the HTTP data loader to read a file from the path defined under URI(s). This file is JSON formatted, so JSON should be selected as the Format. The default Dataset name of wikipedia is appropriate, so continue by clicking Sample and continue. The data loader will sample the first few lines of the input file to ensure that it is parseable and contains the correct data to be ingested. Inspect this dataset and then click Yes, this is the data I wanted to continue. You should now see the following screen:
4. Configure roll-up. Druid can index data using an ingestion-time, first-level aggregation known as "roll-up". Roll-up causes similar events to be aggregated during indexing which can result in reduced disk usage and faster queries for certain types of data. The Druid Concepts page provides an introduction on how roll-up works. For this quickstart, choose Don't use roll-up and then click Next to continue. You should now see the following screen:
5. Configure timestamp and partitioning. Druid uses a timestamp column to partition your data. This page allows you to identify which column should be used as the primary time column and how the timestamp is formatted. In this case, the loader should have automatically detected the timestamp column and chosen the
Here, you can also choose the granularity at which the data is partitioned. Choose it based on the quantity and time range of your data so that each partition contains a reasonable amount of data. For example, if your dataset contains 10,000 events uniformly distributed over a year, an 'Hour' segment granularity would be a poor choice, since it would result in 8,760 hourly partitions (365 * 24), each containing only one or two events. Conversely, if your dataset contains 10 million events per hour, a 'Year' segment granularity would be a poor choice, since the dataset would suffer from ineffective partitioning.
The sample dataset contains roughly 24,000 events distributed over a single day, so the Day segment granularity is appropriate. Choose Day and click Next to continue. You should now see the following screen:
6. Configure columns to load. The Configure Columns page allows you to map columns from your input data to the columns that will be loaded into Druid. Columns can be added, removed, and renamed. Here, you also specify a data type for each column (one of string, long, or float) which will help Druid to index data efficiently.
The data loader automatically discovers and attempts to detect each column's data type. In the case of our sample data, it correctly identifies the added, delta, deltaBucket, deleted, and commentLength columns as being long (64-bit integer) types and the other non-time columns as being string. Click Next to continue. You should now see the following screen:
7. Confirm and start ingestion! This final page provides a summary of the ingestion task and allows you to make final changes to the indexing spec. When you are ready, click Start loading data to submit the job. The load status page will indicate that indexing is in progress and will update once the job completes. You should now see the following screen:
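To make the roll-up option from step 4 more concrete, here is a minimal conceptual sketch in Python. It is not Druid's implementation: it simply groups events by an hour-truncated timestamp plus their dimension values and keeps a count and a sum. The `roll_up` helper and the four sample events are hypothetical; only the `page` and `added` column names come from the sample dataset.

```python
from collections import defaultdict
from datetime import datetime


def roll_up(events, dimensions, metric):
    """Conceptual roll-up: one output row per (hour bucket, dimension values).

    Illustration only; this is not how Druid implements roll-up internally.
    """
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        # Truncate the timestamp to the hour so nearby events share a bucket.
        bucket = ts.replace(minute=0, second=0).isoformat()
        key = (bucket, tuple(event[d] for d in dimensions))
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


# Hypothetical events, invented for this illustration.
events = [
    {"timestamp": "2016-06-27T00:01:10Z", "page": "A", "added": 10},
    {"timestamp": "2016-06-27T00:45:00Z", "page": "A", "added": 5},
    {"timestamp": "2016-06-27T01:02:00Z", "page": "A", "added": 7},
    {"timestamp": "2016-06-27T00:30:00Z", "page": "B", "added": 1},
]

rolled = roll_up(events, ["page"], "added")
for key, aggregate in sorted(rolled.items()):
    print(key, aggregate)
```

In this sketch the four input events collapse into three stored rows, because the two edits to page A in the same hour are merged. The individual events can no longer be recovered after aggregation, which is why this quickstart leaves roll-up off.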
Once the loader has indicated that the data has been indexed, you can move on to the next section to define a data cube and begin visualizing the data.
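The segment granularity guidance from step 5 can be checked with a quick back-of-the-envelope calculation. The sketch below assumes events are spread uniformly and treats each time bucket as a single partition, which is a simplification; the `events_per_partition` helper is hypothetical and exists only for this illustration.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def events_per_partition(total_events, partitions):
    """Average number of events per time partition, assuming a uniform spread."""
    return total_events / partitions


# Case 1: 10,000 events over one year with Hour granularity.
hourly_partitions = DAYS_PER_YEAR * HOURS_PER_DAY
sparse = events_per_partition(10_000, hourly_partitions)

# Case 2: 10 million events per hour with Year granularity.
# Simplifying assumption: the whole year lands in one time partition.
dense = events_per_partition(10_000_000 * hourly_partitions, 1)

# Case 3: the sample dataset, roughly 24,000 events in one day, Day granularity.
sample = events_per_partition(24_000, 1)

print(f"Hour granularity: {hourly_partitions} partitions, {sparse:.2f} events each")
print(f"Year granularity: 1 partition, {dense:,.0f} events")
print(f"Day granularity (sample): 1 partition, {sample:,.0f} events")
```

The first case leaves each partition nearly empty, the second concentrates a year of heavy traffic in one time bucket, and the sample dataset sits comfortably in a single daily partition.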
This section showed you how to load data from files, but Druid also supports streaming ingestion. Druid's streaming ingestion can load data with virtually no delay between events occurring and being available for queries. For more information, see Loading data.
Switch to the Visualize section of Imply by clicking on the corresponding button on the top bar. From here, you can create data cubes to model your data, explore these cubes, and organize views into dashboards. Start by clicking + Create new data cube.
In the dialog that comes up, make sure that wikipedia is the selected Source and that Auto-fill dimensions and measures is selected. Continue by clicking Next: Create data cube.
From here you can configure the various aspects of your data cube including defining and customizing the cube's dimensions and measures. The data cube creation flow can intelligently inspect the columns in your data source and determine possible dimensions and measures automatically. We enabled this when we selected Auto-fill dimensions and measures on the previous screen and you can see that the cube's settings have been largely pre-populated. In our case, the suggestions are appropriate so we can continue by clicking on the Save button in the top-right corner.
Imply's data cubes are highly configurable and give you the flexibility to represent your dataset, as well as derived and custom columns, in many different ways. The documentation on dimensions and measures is a good starting point for learning how to configure a data cube.
After clicking Save, the data cube view for this new data cube is automatically loaded. In the future, this view can also be loaded by clicking on the name of the data cube (in this example 'Wikipedia') from the Visualize screen.
Here, you can explore a dataset by filtering and splitting it across any dimension. For each filtered split of your data, you will see the aggregate value of your selected measures. For example, on the wikipedia dataset, you can see the most frequently edited pages by splitting on Page (drag Page to the Show bar) and sorting by Number of Events (this is the default sort; you can also click on any column to sort by it). You should see a screen like the following:
The data cube view suggests different visualizations based on how you split your data. If you split on a string column, your data will initially be presented as a table. If you split on time, the data cube view will recommend a timeseries plot, and if you split on a numeric column you will get a bar chart. Try replacing the Page dimension with Time in the Show bar, which will switch your visualization to a timeseries chart like the following:
You can also change the visualization manually by choosing your preferred visualization from the dropdown. If the shown dimensions are not appropriate for a particular visualization, the data cube view will recommend alternative dimensions you can show.
If you would like more information on visualizing data, please refer to the Data cubes section.
Imply includes an easy-to-use interface for issuing Druid SQL queries. To access the SQL editor, go to the Run SQL section. If you are in the visualization view, you can navigate to this screen by selecting Run SQL from the hamburger menu in the top-left corner of the page. Once there, try running the following query, which will return the most edited Wikipedia pages:
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5
You should see results like the following:
For more details on making SQL queries with Druid, see the Druid SQL documentation.
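The same query can also be sent to Druid over HTTP instead of through the SQL editor. The following is a minimal sketch, not part of the Imply distribution. It assumes the Druid broker is reachable at localhost:8082 (the usual default broker port, but an assumption for this quickstart) and that Druid SQL is enabled there; the helper names `build_sql_request` and `run_sql` are hypothetical.

```python
import json
import urllib.request

# Assumption: the Druid broker listens on localhost:8082 and serves SQL
# at /druid/v2/sql/. Adjust the URL if your setup differs.
DRUID_SQL_URL = "http://localhost:8082/druid/v2/sql/"

QUERY = """
SELECT page, COUNT(*) AS Edits
FROM wikipedia
WHERE "__time" BETWEEN TIMESTAMP '2016-06-27 00:00:00' AND TIMESTAMP '2016-06-28 00:00:00'
GROUP BY page
ORDER BY Edits DESC
LIMIT 5
"""


def build_sql_request(query, url=DRUID_SQL_URL):
    """Build a POST request whose JSON body carries the SQL text under "query"."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(query, url=DRUID_SQL_URL):
    """Send the query and return the decoded JSON response."""
    with urllib.request.urlopen(build_sql_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the quickstart services running, `run_sql(QUERY)` should return a list with one JSON object per result row, keyed by column name (`page` and `Edits` here). Nothing is sent until `run_sql` is called.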
Congratulations! You have now installed and run Imply on a single machine, loaded a sample dataset into Druid, defined a data cube, explored some simple visualizations, and executed queries using Druid SQL.
Next, you can:
If you are unable to access the public web server, you can load the same dataset from a local file bundled in this distribution. The quickstart directory includes a sample dataset and an ingestion spec to process the data, named
To submit an indexing job to Druid for this ingestion spec, run the following command from your Imply directory:
bin/post-index-task --file quickstart/wikipedia-index.json
A successful run will generate logs similar to the following:
Beginning indexing data for wikipedia
Task started: index_wikipedia_2017-12-05T03:22:28.612Z
Task log: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/log
Task status: http://localhost:8090/druid/indexer/v1/task/index_wikipedia_2017-12-05T03:22:28.612Z/status
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task index_wikipedia_2017-12-05T03:22:28.612Z still running...
Task finished with status: SUCCESS
Completed indexing data for wikipedia. Now loading indexed data onto the cluster...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia is 0.0% finished loading...
wikipedia loading complete! You may now query your data
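The "Task status" URL printed in the log can also be polled from your own script. The sketch below is a rough illustration, not the logic of bin/post-index-task. It assumes the status endpoint returns JSON shaped like {"status": {"status": "RUNNING"}}, with SUCCESS or FAILED as terminal states; verify this against your Druid version. The helper names (`fetch_json`, `task_state`, `wait_for_task`) are hypothetical.

```python
import json
import time
import urllib.request

# Host and port taken from the "Task status" URL in the sample log above.
OVERLORD_URL = "http://localhost:8090"


def fetch_json(url):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


def task_state(payload):
    """Extract the task state from a status response.

    Assumption: the payload looks like {"status": {"status": "SUCCESS"}}.
    """
    return payload["status"]["status"]


def wait_for_task(task_id, fetch=fetch_json, poll_seconds=2.0, max_polls=150):
    """Poll the status URL until the task reports SUCCESS or FAILED."""
    url = f"{OVERLORD_URL}/druid/indexer/v1/task/{task_id}/status"
    for _ in range(max_polls):
        state = task_state(fetch(url))
        if state in ("SUCCESS", "FAILED"):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError(f"Task {task_id} did not finish after {max_polls} polls")
```

For example, `wait_for_task("index_wikipedia_2017-12-05T03:22:28.612Z")` would poll the same status URL shown in the log. The `fetch` parameter is injectable so the polling loop can be exercised without a running cluster.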
After the dataset has been created, you can move on to the next step to create a data cube.