Imply is designed to be deployed as a horizontally scalable, fault-tolerant cluster.

In this document, we'll set up a simple cluster and discuss how it can be further configured to meet your needs. This simple cluster will feature scalable, fault-tolerant Data servers for ingesting and storing data, a single Query server, and a single Master server. Later, we'll discuss how this simple cluster can be configured for high availability and to scale out all server types.

Select hardware

It's possible to run the entire Imply stack on a single machine with only a few GB of RAM, as you've done in the quickstart. For a clustered deployment, typically larger machines are used in order to handle more data. In this section we'll discuss recommended hardware for a moderately sized cluster.

For this simple cluster, you will need one Master server, one Query server, and as many Data servers as necessary to index and store your data.

Data servers store and ingest data. They run Druid Historical processes for storage and processing of large amounts of immutable data, and Druid MiddleManagers for ingestion and processing of data. These servers benefit greatly from CPU, RAM, and SSDs. The equivalent of an AWS r4.2xlarge, which offers 8 vCPUs and 61 GiB of RAM, is a good starting point.

For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, as well as eliminate the possibility of resource contention between historical workloads and real-time workloads.

Query servers are the endpoints that users and client applications interact with. They run a Druid Broker, which routes queries to the appropriate data nodes, and a Druid Router, which acts as a thin reverse proxy layer and a unified query and API endpoint. They also include Pivot, a tool for directly exploring and visualizing your data, alongside Druid's native SQL and JSON-over-HTTP query support. These servers benefit greatly from CPU and RAM, and can also be deployed on the equivalent of an AWS r4.2xlarge, which offers 8 vCPUs and 61 GiB of RAM.

Master servers coordinate data ingestion and storage in your Druid cluster. They are not involved in queries. They are responsible for coordinating ingestion jobs and for handling failover of the Druid Historical and Druid MiddleManager processes running on your Data servers. The equivalent of an AWS m5.xlarge, with 4 vCPUs and 16 GiB of RAM, is sufficient for most clusters.

Select OS

We recommend running your favorite Linux distribution. You will also need Java 8.

Your OS package manager should be able to help with installing Java. If your OS does not provide a recent enough version of Java 8, Azul offers Zulu, an open source OpenJDK-based build with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
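A quick way to confirm the right Java is on a server before installing anything: this POSIX sh sketch parses the `java -version` output, which reports Java 8 as "1.8.0_x" (Java 9 and later report "9", "11.0.x", and so on):

```shell
#!/bin/sh
# Confirm the default JVM on this machine is Java 8 before installing Imply.
# Java 8 reports its version as "1.8.0_x"; Java 9+ reports "9", "11.0.x", etc.
version=$(java -version 2>&1 | awk -F '"' '/version/ {print $2}')
major=${version%%.*}
if [ "$major" = "1" ]; then
  # Old-style "1.8.0_x" versioning: the second field is the real major version.
  major=$(echo "$version" | cut -d. -f2)
fi
if [ "$major" = "8" ]; then
  echo "OK: Java 8 detected ($version)"
else
  echo "WARNING: expected Java 8, found: ${version:-none}" >&2
fi
```

Run this on each server before copying the distribution out, so you catch a missing or mismatched JVM early.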

Download the distribution

First, download Imply 3.0.19 and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.

tar -xzf imply-3.0.19.tar.gz
cd imply-3.0.19

In this package, you'll find:

  - bin/ - run scripts
  - conf/ - configurations for a clustered setup
  - conf-quickstart/ - configurations for the single-machine quickstart
  - dist/ - all included software
  - quickstart/ - sample data and specs used in the quickstart

We'll be editing the files in conf/ in order to get things running.

Configure Master server address

In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.

In conf/druid/_common/, update these properties by replacing "" with the IP address of the machine that you will use as your Master server:
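For example, if your Master server's IP address were 10.0.0.2 (an assumed address for illustration), the two relevant properties are Druid's ZooKeeper host and metadata store connect URI, and would end up looking like:

```properties
druid.zk.service.host=10.0.0.2
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.0.2:1527/var/druid/metadata.db;create=true
```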

Configure deep storage

Druid relies on a distributed filesystem or binary object store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment).


For S3, in conf/druid/_common/common.runtime.properties, add "druid-s3-extensions" to druid.extensions.loadList and configure S3 as your deep storage.

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]
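A sketch of the corresponding deep storage properties; the bucket name, base key, and credentials below are placeholders you must fill in with your own values:

```properties
druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=YOUR_ACCESS_KEY
druid.s3.secretKey=YOUR_SECRET_KEY
```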




For HDFS, in conf/druid/_common/common.runtime.properties, add "druid-hdfs-storage" to druid.extensions.loadList and configure HDFS as your deep storage. You will also need to place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, and so on) in conf/druid/_common/ so Druid can reach your HDFS cluster.

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]
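A sketch of the corresponding deep storage properties; the storage directory is a placeholder path within your HDFS cluster:

```properties
druid.storage.type=hdfs
druid.storage.storageDirectory=/druid/segments
```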




Configure Hadoop connection (optional)

If you want to use Hadoop for data ingestion, you can configure that now. See Connecting to Hadoop.

Configuration tuning

Druid benefits greatly from being tuned to the hardware that it runs on. If you are using r4.2xlarge EC2 instances or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted configurations are:

  - JVM heap sizes (-Xms and -Xmx) and -XX:MaxDirectMemorySize in each service's jvm.config
  - druid.processing.numThreads and druid.processing.buffer.sizeBytes, which control query processing parallelism and memory
  - druid.worker.capacity, the number of concurrent ingestion tasks each MiddleManager can run
  - druid.server.maxSize and druid.segmentCache.locations, which control how much data each Historical can serve
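As an illustrative sketch only (the sizes below are assumptions for a machine with roughly 61 GiB of RAM, not tuned recommendations), the Historical configuration under conf/druid/historical/ might be adjusted like this:

```properties
# conf/druid/historical/jvm.config (excerpt)
-Xms8g
-Xmx8g
-XX:MaxDirectMemorySize=16g

# conf/druid/historical/runtime.properties (excerpt)
druid.processing.numThreads=7
druid.processing.buffer.sizeBytes=1000000000
```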

Please see the Druid configuration documentation for a full description of all possible configuration options.

Open ports (if using a firewall)

If you're using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the following:

Master Server

  - 1527 (Derby metadata store; only if using the embedded Derby instance)
  - 2181 (ZooKeeper; only if using the bundled ZooKeeper)
  - 8081 (Druid Coordinator)
  - 8090 (Druid Overlord)

Query Server

  - 8082 (Druid Broker)
  - 8888 (Druid Router)
  - 9095 (Pivot)

Data Server

  - 8083 (Druid Historical)
  - 8091 (Druid MiddleManager)
  - 8100-8199 (Druid MiddleManager task ports)

These are the default ports; if you have changed any ports in your configuration, open those instead.
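As one way to apply this, here is a sketch for a firewalld-based system that opens the Master server ports; the port list is an assumption based on the Druid and ZooKeeper defaults, so adjust it to your configuration:

```shell
#!/bin/sh
# Sketch: open the default Master server ports with firewalld.
# Ports: Derby (1527), ZooKeeper (2181), Coordinator (8081), Overlord (8090).
MASTER_PORTS="1527 2181 8081 8090"
if command -v firewall-cmd >/dev/null 2>&1; then
  for port in $MASTER_PORTS; do
    firewall-cmd --permanent --add-port="${port}/tcp"
  done
  firewall-cmd --reload
else
  # firewalld is not installed on this machine; just report what we would open.
  echo "firewalld not found; ports to open: $MASTER_PORTS"
fi
```

The same pattern applies to the Query and Data server port lists.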

Start Master server

Copy the Imply distribution and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:

rsync -az imply-3.0.19/ MASTER_SERVER:imply-3.0.19/

On your Master server, cd into the distribution and run this command to start a Master:

bin/supervise -c conf/supervise/master-with-zk.conf

You should see a log message printed for each service that starts up. From another terminal, you can view detailed logs for any service by looking in the var/sv/ directory.

Start Query server

Copy the Imply distribution and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:

bin/supervise -c conf/supervise/query.conf

The default Query server configuration launches a Druid Router, Druid Broker, and Pivot.

Start Data servers

Copy the Imply distribution and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:

bin/supervise -c conf/supervise/data.conf

The default Data server configuration launches a Druid Historical and Druid MiddleManager process. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.

Production setup

High availability

The cluster you just created runs a single Master server and a single Query server, and can run multiple Data servers. Data servers are scalable and fault tolerant out of the box, but some more configuration is necessary to achieve the same for the Master and Query servers.

For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord:

  - Replace the embedded Derby metadata store with a replicated MySQL or PostgreSQL installation.
  - Run ZooKeeper as an external ensemble of at least three servers.
  - Run two or more Master servers. The Druid Coordinator and Overlord each elect a leader, so standby instances take over automatically on failure.

For the Query server:

  - Run two or more Query servers and distribute traffic across them with a load balancer. Query servers are stateless, so no further coordination is needed.

Data servers can be scaled out without any additional configuration.

Geographically distributed deployment

Deployments across geographically distributed datacenters typically involve independent active clusters. For example, for a deployment across two datacenters, you can set up a separate Imply cluster in each datacenter. To ensure that each cluster loads the same data, there are two possible approaches:

  1. Have each cluster load data from the same source (HDFS cluster, S3 bucket, Kafka cluster, etc). In this case, data loading will happen over a long-distance link.
  2. Set up replication at the data input system level. For example, using a tool like DistCp for HDFS, or MirrorMaker for Kafka, to replicate the same input data into every datacenter.
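For example, the HDFS variant of the second approach could be scripted with DistCp roughly as follows; the namenode hosts and paths are placeholders for your own clusters:

```shell
#!/bin/sh
# Sketch: mirror an HDFS input directory from datacenter 1 to datacenter 2.
# -update copies only files that are missing or changed at the destination.
SRC="hdfs://dc1-namenode:8020/data/events"
DST="hdfs://dc2-namenode:8020/data/events"
if command -v hadoop >/dev/null 2>&1; then
  hadoop distcp -update "$SRC" "$DST"
else
  # No Hadoop CLI here; just show the command a scheduled job would run.
  echo "would run: hadoop distcp -update $SRC $DST"
fi
```

A script like this would typically run on a schedule (for example, from cron) so both datacenters stay in sync.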

Imply does not currently provide tools to simplify multi-datacenter deployments, so end users typically develop scripts and procedures for keeping the multiple clusters synchronized.


Backups

Druid's critical data is all stored in deep storage (e.g. S3 or HDFS) and in its metadata store (e.g. PostgreSQL or MySQL). It is important to back up your metadata store. Deep storage is often infeasible to back up due to its size, but it is important to ensure you have sufficient procedures in place to avoid losing data from deep storage. This could involve backups or could involve replication and proactive operations procedures.
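As a sketch of a metadata store backup, assuming a MySQL metadata store (the hostname and database name are placeholders, credentials are assumed to live in ~/.my.cnf, and PostgreSQL users would use pg_dump instead):

```shell
#!/bin/sh
# Sketch: dump a MySQL Druid metadata store to a dated, compressed file.
DB_HOST="metadata-db.internal"
DB_NAME="druid"
BACKUP_FILE="druid-metadata-$(date +%Y%m%d).sql.gz"
if command -v mysqldump >/dev/null 2>&1; then
  # Credentials are expected in ~/.my.cnf rather than on the command line.
  mysqldump -h "$DB_HOST" "$DB_NAME" | gzip > "$BACKUP_FILE"
else
  echo "mysqldump not found; would write $BACKUP_FILE"
fi
```

Store the resulting dump somewhere durable (for example, the same S3 bucket or HDFS cluster you use for deep storage).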

Druid does not store any critical data in ZooKeeper, and does not store any critical data on disk if you have an independent metadata store and deep storage configured.

Next steps

Congratulations, you now have an Imply cluster! The next step is to load your data.



