Imply is designed to be deployed as a horizontally scalable, fault-tolerant cluster.
In this document, we'll set up a simple cluster and discuss how it can be further configured to meet your needs. This simple cluster will feature scalable, fault-tolerant Data servers for ingesting and storing data, a single Query server, and a single Master server. Later, we'll discuss how this simple cluster can be configured for high availability and to scale out all server types.
It's possible to run the entire Imply stack on a single machine with only a few GB of RAM, as you've done in the quickstart. For a clustered deployment, typically larger machines are used in order to handle more data. In this section we'll discuss recommended hardware for a moderately sized cluster.
For this simple cluster, you will need one Master server, one Query server, and as many Data servers as necessary to index and store your data.
Data servers store and ingest data. They run Druid Historical Nodes for storage and processing of large amounts of immutable data, Druid MiddleManagers for ingestion and processing of data, and optional Tranquility components to assist in streaming data ingestion. These servers benefit greatly from CPU, RAM, and SSDs. The equivalent of an AWS r4.2xlarge (8 vCPUs, 61 GiB RAM) is a good starting point.
For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, as well as eliminate the possibility of resource contention between historical workloads and real-time workloads.
Query servers are the endpoints that users and client applications interact with. They run a Druid Broker that routes queries to the appropriate data nodes, and they include Imply UI, a way to directly explore and visualize your data, as well as Druid's native SQL and JSON-over-HTTP query support. These servers benefit greatly from CPU and RAM, and can also be deployed on the equivalent of an AWS r4.2xlarge (8 vCPUs, 61 GiB RAM).
Master servers coordinate data ingestion and storage in your Druid cluster. They are not involved in queries. They coordinate ingestion jobs and handle failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers. The equivalent of an AWS m5.xlarge (4 vCPUs, 16 GiB RAM) is sufficient for most clusters.
We recommend running your favorite Linux distribution. You will also need Java 8.
Your OS package manager should be able to help with installing Java. If your distribution does not offer a recent enough version of Java, Azul offers Zulu, an open source OpenJDK-based build with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
First, download Imply 2.9.20 from imply.io/get-started and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.
tar -xzf imply-2.9.20.tar.gz
cd imply-2.9.20
In this package, you'll find:

bin/ - run scripts for included software.
conf/ - template configurations for a clustered setup.
conf-quickstart/ - configurations for the single-machine quickstart.
dist/ - all included software.
quickstart/ - files related to the single-machine quickstart.

We'll be editing the files in conf/ in order to get things running.
In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.
In conf/druid/_common/common.runtime.properties, update these properties by replacing "master.example.com" with the IP address of the machine that you will use as your Master server:

druid.zk.service.host
druid.metadata.storage.connector.connectURI
druid.metadata.storage.connector.host
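For example, assuming your Master server's IP address were 10.0.1.100 (a placeholder; substitute your own), and using the distribution's default Derby connectURI template, the edited lines might look like:

```properties
druid.zk.service.host=10.0.1.100
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.1.100:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=10.0.1.100
```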
In conf/tranquility/server.json and conf/tranquility/kafka.json, if using those components, also replace "master.example.com" with the IP address of your Master server in this property:

zookeeper.connect
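For instance, the relevant entry inside the "properties" object of either file might become the following (10.0.1.100 is a placeholder for your Master server's IP; the surrounding JSON is omitted):

```json
"properties" : {
  "zookeeper.connect" : "10.0.1.100"
}
```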
Druid relies on a distributed filesystem or binary object store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment).
In conf/druid/_common/common.runtime.properties, add "druid-s3-extensions" to druid.extensions.loadList. If, for example, the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"].
Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".
Uncomment and configure appropriate values in the "For S3" sections of "Deep Storage" and "Indexing service logs".
After this, you should have made the following changes:
druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
In conf/druid/_common/common.runtime.properties, add "druid-hdfs-storage" to druid.extensions.loadList. If, for example, the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"].
Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".
Uncomment and configure appropriate values in the "For HDFS" sections of "Deep Storage" and "Indexing service logs".
After this, you should have made the following changes:
druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode.example.com:9000/druid/indexing-logs
Also, if you are using HDFS, place your Hadoop configuration files (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid processes by copying them into conf/druid/_common/.

If you want to use additional data ingestion mechanisms, such as streaming ingestion from Apache Kafka or Hadoop-based batch ingestion, you can configure them now.
Druid benefits greatly from being tuned to the hardware that it runs on. If you are using r4.2xlarge EC2 instances or similar hardware, the configuration in the distribution is a reasonable starting point.
If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted settings are JVM heap sizes, processing thread counts (druid.processing.numThreads), and processing buffer sizes (druid.processing.buffer.sizeBytes).
Please see the Druid configuration documentation for a full description of all possible configuration options.
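As a purely illustrative sketch (the values below are placeholders, not tuned recommendations), scaling a Historical Node down for smaller hardware might involve edits along these lines in conf/druid/historical/runtime.properties:

```properties
# Fewer processing threads and smaller per-thread merge buffers for smaller machines
druid.processing.numThreads=3
druid.processing.buffer.sizeBytes=536870912
```

JVM heap sizes in the corresponding jvm.config files are typically adjusted in tandem with these settings.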
If you're using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the ports your cluster uses. With the default configurations, these include 2181 (ZooKeeper) and 1527 (Derby) on the Master server, 8081 (Coordinator) and 8090 (Overlord) on the Master server, 8082 (Broker) and 9095 (Imply UI) on Query servers, and 8083 (Historical), 8091 (MiddleManager), and 8100-8199 (ingestion task peons) on Data servers.
Copy the Imply distribution and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:
rsync -az imply-2.9.20/ MASTER_SERVER:imply-2.9.20/
On your Master server, cd into the distribution and run this command to start a Master:
bin/supervise -c conf/supervise/master-with-zk.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
Copy the Imply distribution and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:
bin/supervise -c conf/supervise/query.conf
The default Query server configuration launches a Druid Broker and Imply UI.
Copy the Imply distribution and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:
bin/supervise -c conf/supervise/data.conf
The default Data server configuration launches a Druid Historical Node, a Druid MiddleManager, and optionally Tranquility components. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.
The cluster you just created runs a single Master server, a single Query server, and can run multiple Data servers. Data servers are scalable and fault tolerant out of the box, but some more configuration is necessary to achieve the same for Master and Query servers.
For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord: replace Derby with a replicated MySQL or PostgreSQL metadata store, run a ZooKeeper ensemble of at least three servers, and run multiple Master servers. The Coordinator and Overlord automatically elect leaders, so additional Master servers act as standbys.

For the Query server: run multiple Query servers behind a load balancer; any of them can serve queries.
Data servers can be scaled out without any additional configuration.
Deployments across geographically distributed datacenters typically involve independent active clusters. For example, for a deployment across two datacenters, you can set up a separate Imply cluster in each datacenter. To ensure that each cluster loads the same data, there are two possible approaches: send every incoming batch or stream of data to both clusters, or load data into one cluster and periodically copy its segments and metadata to the other.
Imply does not currently provide tools to simplify multi-datacenter deployments, so end users typically develop scripts and procedures for keeping the multiple clusters synchronized.
Druid's critical data is all stored in deep storage (e.g. S3 or HDFS) and in its metadata store (e.g. PostgreSQL or MySQL). It is important to back up your metadata store. Deep storage is often infeasible to back up due to its size, but it is important to ensure you have sufficient procedures in place to avoid losing data from deep storage. This could involve backups or could involve replication and proactive operations procedures.
Druid does not store any critical data in ZooKeeper, and does not store any critical data on disk if you have an independent metadata store and deep storage configured.
Congratulations, you now have an Imply cluster! The next step is to load your data.