Deploy manually

The Imply Manager gives you a point-and-click interface for deploying and administering an Imply cluster. Alternatively, you can start Imply with supervise, a helper script that comes with sample configuration files for quickstart and clustered deployments. After cluster startup, supervise monitors the Druid processes and restarts them if they stop unexpectedly.

In this document, we'll set up a simple cluster made up of a single data, query, and master server. Later, we'll discuss how you can configure high availability and scale up this cluster.

Before starting, see planning information for machine sizing, OS requirements, network configuration (e.g., ports to open), and other considerations.

Download the distribution

First, download Imply 3.4.4 from imply.io/get-started and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.

tar -xzf imply-3.4.4.tar.gz
cd imply-3.4.4

In this package, you'll find the scripts (bin/), configuration files (conf/), and the rest of the distribution. We'll be editing the files in conf/ in order to get things running.

Configure Master server address

In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.

In conf/druid/_common/common.runtime.properties, update every property that references "master.example.com", replacing it with the IP address of the machine that you will use as your Master server.
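For reference, once edited, the master-address properties typically look like the lines below. The property names come from the standard Druid common configuration; verify them against the file in your distribution, and note that 10.0.0.1 is a placeholder for your Master server's IP address:

```
# conf/druid/_common/common.runtime.properties (illustrative excerpt)

# ZooKeeper connection (runs on the Master server in this deployment)
druid.zk.service.host=10.0.0.1

# Metadata storage (embedded Derby, hosted on the Master server)
druid.metadata.storage.type=derby
druid.metadata.storage.connector.connectURI=jdbc:derby://10.0.0.1:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=10.0.0.1
druid.metadata.storage.connector.port=1527
```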

Configure deep storage

Druid relies on a distributed filesystem or binary object store for data storage. The backing deep storage systems commonly used with Druid include S3 (popular for those on AWS), HDFS (popular if you already have a Hadoop deployment), Azure, and GCS.

S3

In conf/druid/_common/common.runtime.properties, add "druid-s3-extensions" to druid.extensions.loadList, comment out the local-storage settings for deep storage and indexing service logs, and configure their S3 equivalents.

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs

HDFS

In conf/druid/_common/common.runtime.properties, add "druid-hdfs-storage" to druid.extensions.loadList, comment out the local-storage settings for deep storage and indexing service logs, and configure their HDFS equivalents.

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode.example.com:9000/druid/indexing-logs

Also, place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid processes, for example by copying them into conf/druid/_common/.

Configure Hadoop connection (optional)

If you want to use Hadoop for data ingestion, you can configure that now. See Connecting to Hadoop.

Configuration tuning

Druid benefits greatly from being tuned to the hardware that it runs on. If you are using r4.2xlarge EC2 instances or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting the configurations for your specific setup. The most commonly adjusted configurations are the JVM heap and direct memory sizes, the processing thread and buffer settings (druid.processing.numThreads, druid.processing.buffer.sizeBytes), and the Historical segment cache locations (druid.segmentCache.locations).

Please see the Druid configuration documentation for a full description of all possible configuration options.
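As an illustration, on a machine with less memory than an r4.2xlarge you might scale the Historical process down along these lines. The file paths use the layout of this distribution, but the specific values are assumptions for a hypothetical 16 GB machine, not tested settings:

```
# conf/druid/historical/jvm.config (illustrative values)
-Xms4g
-Xmx4g
-XX:MaxDirectMemorySize=6g

# conf/druid/historical/runtime.properties (illustrative values)
druid.processing.numThreads=7
druid.processing.buffer.sizeBytes=500000000
druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":130000000000}]
```

As a rule of thumb, direct memory must be large enough to hold one processing buffer per processing thread plus merge buffers, so these settings should be adjusted together.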

Start Master server

Copy the Imply distribution and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:

rsync -az imply-3.4.4/ MASTER_SERVER:imply-3.4.4/

On your Master server, cd into the distribution and run this command to start a Master:

bin/supervise -c conf/supervise/master-with-zk.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.

Start Query server

Copy the Imply distribution and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:

bin/supervise -c conf/supervise/query.conf

The default Query server configuration launches a Druid Router, Druid Broker, and Pivot.

Start Data servers

Copy the Imply distribution and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:

bin/supervise -c conf/supervise/data.conf

The default Data server configuration launches a Druid Historical and Druid MiddleManager process. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.
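If you are pushing the same edited distribution to several Data servers, a small loop from the machine where you edited the configurations saves repetition. The hostnames here are placeholders:

```
# data1/data2.example.com are placeholder hostnames for your Data servers.
for host in data1.example.com data2.example.com; do
  rsync -az imply-3.4.4/ "$host:imply-3.4.4/"
  ssh "$host" 'cd imply-3.4.4 && bin/supervise -c conf/supervise/data.conf --daemon'
done
```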

Service supervision and logging

You can use the supervise command to manage Imply service lifecycles and access console logs. The command is configured through a single configuration file per machine. Each machine can potentially start many services. For example, when you run the command:

bin/supervise -c conf/supervise/master-with-zk.conf

This tells supervise to use the file conf/supervise/master-with-zk.conf to select which services to run.

By default, the supervise program runs in the foreground. You can run supervision in the background, if you want, by adding the --daemon argument:

bin/supervise -c conf/supervise/master-with-zk.conf --daemon

You can restart an individual service using its name. For example, to restart the zk service, run bin/service --restart zk from the distribution.

To shut down all services on a machine, kill the supervise process (CTRL-C or kill SUPERVISE_PID both work) or run the command bin/service --down from the distribution.

Logging

By default, logs are written to var/sv/<service>/current in the distribution. You can write these files to any location you want by passing the -d <directory> argument to bin/supervise.
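For example, to write logs under /var/log/imply instead of the default var/sv/ (the directory path here is just an example):

```
bin/supervise -c conf/supervise/master-with-zk.conf -d /var/log/imply
```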

For added convenience, you can also tail log files by running bin/service --tail <service>.

On macOS and Linux, logs are automatically rotated using the included logger program. On other platforms, to prevent log files from growing without bound, you can periodically truncate the logs using truncate -s 0 <logfile>.
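Truncating in place is safe here because the writing process keeps its open file descriptor and simply continues appending to the now-empty file. A self-contained sketch, using the default var/sv layout with a simulated log file:

```shell
# Simulate a growing supervise log.
mkdir -p var/sv/zk
printf 'log line 1\nlog line 2\n' > var/sv/zk/current

# Empty the file in place; a process writing to it would keep its
# file handle and continue appending uninterrupted.
truncate -s 0 var/sv/zk/current
```

A cron entry such as `0 3 * * * truncate -s 0 /path/to/imply/var/sv/*/current` would do this nightly; the path and schedule are examples.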

Customizing supervision

You can modify the provided supervision files or create new files of your own. There are two kinds of lines in supervision files: settings, which begin with a colon (such as :verify), and services, which consist of a service name followed by the command used to run it.
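For illustration, a master supervision file might look like the following. This is a sketch based on the bundled examples; the exact contents vary by version, so check the files shipped in conf/supervise/ rather than copying this verbatim:

```
:verify bin/verify-java

zk bin/run-zk conf
coordinator bin/run-druid coordinator conf
overlord bin/run-druid overlord conf
```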

Updating a cluster

When updating a supervise-managed cluster to a newer version, you should follow the procedure for a Druid Rolling Update.

If you have deployed your cluster with the Master, Query, and Data server configuration, update the servers one tier at a time, keeping the rolling-update order from the Druid documentation in mind.

Druid operations

Please see the Druid operations documentation for tips on best practices, extension usage, monitoring suggestions, multitenancy information, performance optimization, and many more topics.
