Unmanaged Imply

The installation mode described here is intended for trial and learning scenarios only; it is not intended for use in production. See the deployment options in Deployment planning for more information about production deployments.

Imply includes the supervise script, which lets you install and start an Imply cluster quickly. This deployment mode suits single-machine deployments and is the mode used in the Imply quickstart. If you're not familiar with Imply, start with the Quickstart.

In this document, we'll set up a simple cluster made up of one Master server, one Query server, and one Data server. Later, we'll discuss how you can configure high availability and scale up this cluster.

Before starting, see the planning information for machine sizing, OS requirements, network configuration (for example, ports to open), and other considerations.

Download the distribution

First, download Imply from imply.io/get-started and unpack the release archive (the commands below use version 2024.10.1). It's best to do this on a single machine at first, since you will edit the configurations and then copy the modified distribution to all of your servers.

tar -xzf imply-2024.10.1.tar.gz
cd imply-2024.10.1

In this package, you'll find:

  • bin/ - run scripts for included software.
  • conf/ - template configurations for a clustered setup.
  • conf-quickstart/ - configurations for the single-machine quickstart.
  • dist/ - all included software.
  • quickstart/ - files related to the single-machine quickstart.

We'll be editing the files in conf/ in order to get things running.

Configure Master server address

In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.

In conf/druid/_common/common.runtime.properties, update these properties by replacing master.example.com with the IP address of the machine that you will use as your Master server:

  • druid.zk.service.host
  • druid.metadata.storage.connector.connectURI
  • druid.metadata.storage.connector.host
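
If you prefer to script the substitution, the following is a minimal sketch. It assumes the template still contains the placeholder master.example.com; the helper name set_master_host and the address 10.0.0.12 are illustrative and not part of the distribution.

```shell
# Replace the master.example.com placeholder in a properties file.
# set_master_host is a hypothetical helper, not an Imply tool.
set_master_host() {
  file="$1"
  master_ip="$2"
  tmp="${file}.tmp"
  # Escape the dots so that only the literal hostname matches.
  sed "s/master\.example\.com/${master_ip}/g" "$file" > "$tmp" && mv "$tmp" "$file"
}
```

For example, run set_master_host conf/druid/_common/common.runtime.properties 10.0.0.12, then confirm with grep that no occurrence of master.example.com remains in the file.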

Configure deep storage

Druid relies on a distributed filesystem or binary object store for data storage. The backing deep storage systems commonly used with Druid include S3 (popular for those on AWS), HDFS (popular if you already have a Hadoop deployment), Azure, and GCS.

S3

In conf/druid/_common/common.runtime.properties,

  • Add "druid-s3-extensions" to druid.extensions.loadList. If for example the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"].

  • Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".

  • Uncomment and configure appropriate values in the "For S3" sections of "Deep Storage" and "Indexing service logs".

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs

HDFS

In conf/druid/_common/common.runtime.properties,

  • Add "druid-hdfs-storage" to druid.extensions.loadList. If for example the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"].

  • Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".

  • Uncomment and configure appropriate values in the "For HDFS" sections of "Deep Storage" and "Indexing service logs".

After this, you should have made the following changes:

druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]

#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments

druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments

#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs

druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode.example.com:9000/druid/indexing-logs

Also, place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into conf/druid/_common/.
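
The copy step might look like the sketch below. The helper name copy_hadoop_conf and the source directory (for example /etc/hadoop/conf) are assumptions; use wherever your Hadoop client configuration actually lives.

```shell
# Copy Hadoop client configuration files into Druid's common config directory.
# copy_hadoop_conf is a hypothetical helper; the source directory is an assumption.
copy_hadoop_conf() {
  src="$1"   # e.g. /etc/hadoop/conf (assumed location)
  dest="$2"  # e.g. conf/druid/_common
  for name in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml; do
    if [ -f "$src/$name" ]; then
      cp "$src/$name" "$dest/"
    else
      # Report missing files rather than failing, since not every setup has all four.
      echo "warning: $src/$name not found, skipping" >&2
    fi
  done
}
```

Run it from the distribution directory as copy_hadoop_conf /etc/hadoop/conf conf/druid/_common before copying the distribution out to your servers.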

Configure Hadoop connection (optional)

If you want to use Hadoop for data ingestion, you can configure that now. See Hadoop.

Configuration tuning

Druid benefits greatly from being tuned to the hardware that it runs on. If you are using r4.2xlarge EC2 instances or similar hardware, the configuration in the distribution is a reasonable starting point.

If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted configurations are:

  • -Xmx and -Xms
  • druid.server.http.numThreads
  • druid.cache.sizeInBytes
  • druid.processing.buffer.sizeBytes
  • druid.processing.numMergeBuffers
  • druid.processing.numThreads
  • druid.query.groupBy.maxIntermediateRows
  • druid.query.groupBy.maxResults
  • druid.server.maxSize and druid.segmentCache.locations on Historical Nodes
  • druid.worker.capacity on MiddleManagers

Please see the Druid configuration documentation for a full description of all possible configuration options.
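
Several of these settings interact. As one illustration, Druid allocates its processing buffers from direct memory, and the Druid tuning guidance gives the amount needed as druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1). The sketch below only does that arithmetic; the helper name required_direct_memory is hypothetical, and the result is a lower bound to compare against your JVM's direct memory limit, not a complete sizing method.

```shell
# Estimate the direct memory a Druid process needs for its processing buffers:
#   buffer.sizeBytes * (numThreads + numMergeBuffers + 1)
# required_direct_memory is a hypothetical helper name.
required_direct_memory() {
  buffer_bytes="$1"
  num_threads="$2"
  num_merge_buffers="$3"
  echo $(( $buffer_bytes * ($num_threads + $num_merge_buffers + 1) ))
}
```

For example, required_direct_memory 500000000 7 2 prints 5000000000, that is, about 5 GB for a 500 MB buffer with 7 processing threads and 2 merge buffers.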

Start Master server

Copy Imply's distribution of Apache Druid® and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:

rsync -az imply-2024.10.1/ MASTER_SERVER:imply-2024.10.1/

On your Master server, cd into the distribution and run this command to start a Master:

bin/supervise -c conf/supervise/master-with-zk.conf

You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
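
To confirm from a script that a service has come up, you can poll the /status/health endpoint that Druid processes expose. The following is a sketch only: wait_for_druid is a hypothetical helper, and the host and port you pass in (for instance 8081, assumed here to be the Coordinator port) depend on your configuration.

```shell
# Poll a Druid process's /status/health endpoint until it responds.
# wait_for_druid is a hypothetical helper; host and port are assumptions.
wait_for_druid() {
  host="$1"
  port="$2"
  attempts="${3:-30}"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # -f makes curl fail on HTTP errors; -s silences progress output.
    if curl -fs "http://${host}:${port}/status/health" >/dev/null 2>&1; then
      echo "up"
      return 0
    fi
    i=$((i + 1))
    sleep 2
  done
  echo "down"
  return 1
}
```

For example, wait_for_druid MASTER_SERVER 8081 retries every two seconds, up to 30 times by default, and returns a nonzero exit status if the service never answers.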

Start Query server

Copy Imply's distribution of Apache Druid and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:

bin/supervise -c conf/supervise/query.conf

The default Query server configuration launches a Druid Router, Druid Broker, and Pivot.

Start Data servers

Copy Imply's distribution of Apache Druid and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:

bin/supervise -c conf/supervise/data.conf

The default Data server configuration launches a Druid Historical and Druid MiddleManager process. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.
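
When you add several Data servers at once, the copy and start steps can be looped. This sketch reuses the rsync and supervise commands shown above and assumes passwordless SSH access to each host. The run wrapper, the DRY_RUN variable, and the host names are illustrative assumptions.

```shell
# Copy the distribution to each Data server and start it in the background.
# The run wrapper, DRY_RUN, and the host names are assumptions for illustration.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of executing it.
    echo "$*"
  else
    "$@"
  fi
}

start_data_servers() {
  dist="imply-2024.10.1"
  for host in "$@"; do
    run rsync -az "${dist}/" "${host}:${dist}/"
    run ssh "$host" "cd ${dist} && bin/supervise -c conf/supervise/data.conf --daemon"
  done
}
```

Set DRY_RUN=1 first to review the commands, then run start_data_servers data1.example.com data2.example.com with your own host names.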

Service supervision and logging

You can use the supervise command to manage Imply service lifecycles and access console logs. The command is configured through a single configuration file per machine, and each machine can start many services. For example, consider this command:

bin/supervise -c conf/supervise/master-with-zk.conf

This tells supervise to use the file conf/supervise/master-with-zk.conf to select which services to run.

By default, the supervise program runs in the foreground. You can run supervision in the background, if you want, by adding the --daemon argument:

bin/supervise -c conf/supervise/master-with-zk.conf --daemon

You can restart an individual service using its name. For example, to restart the zk service, run bin/service --restart zk from the distribution.

To shut down all services on a machine, kill the supervise process (CTRL-C or kill SUPERVISE_PID both work) or run the command bin/service --down from the distribution.

Logging

By default, logs are written to var/sv/<service>/current in the distribution. You can write these files to any location you want by passing the -d <directory> argument to bin/supervise.

For added convenience, you can also tail log files by running bin/service --tail <service>.

On macOS and Linux, logs are automatically rotated using the included logger program. On other platforms, to prevent log files from growing forever, you can periodically truncate the logs using truncate -s 0 <logfile>.
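
Periodic truncation can be automated along these lines. This is a sketch that assumes the find and truncate utilities are available on the host. The helper name truncate_large_logs and the size threshold are illustrative.

```shell
# Truncate service logs larger than a size threshold (in KiB).
# truncate_large_logs is a hypothetical helper; the threshold is an assumption.
truncate_large_logs() {
  log_root="$1"   # e.g. var/sv
  max_kib="$2"    # e.g. 102400 for roughly 100 MiB
  # Service logs live at <log_root>/<service>/current.
  find "$log_root" -type f -name current -size +"${max_kib}"k \
    -exec truncate -s 0 {} \;
}
```

For example, truncate_large_logs var/sv 102400 empties any service log above roughly 100 MiB. You could call it from a scheduled job such as cron.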

Customizing supervision

You can modify the provided supervision files or create new files of your own. There are two kinds of lines in supervision files:

  • :verify some-program will run some-program on startup. If the program exits successfully, supervise will continue. Otherwise, supervise will exit.

  • foo some-program will supervise a service named foo by running the program some-program. If the program exits, supervise will start it back up. Its console logs will be logged to a file named var/sv/foo/current.
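
As a small illustration of the two line types, the sketch below writes a two-line supervision file. The programs check-java.sh and run-heartbeat.sh are hypothetical placeholders for your own scripts, and write_custom_supervise_conf is likewise an invented helper name.

```shell
# Write a small custom supervision file.
# check-java.sh and run-heartbeat.sh are hypothetical programs used only for illustration.
write_custom_supervise_conf() {
  cat > "$1" <<'EOF'
:verify ./check-java.sh
heartbeat ./run-heartbeat.sh
EOF
}
```

After write_custom_supervise_conf conf/supervise/custom.conf, running bin/supervise -c conf/supervise/custom.conf would first run the verify program, then supervise a service named heartbeat and log its console output to var/sv/heartbeat/current.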

Updating a cluster

When updating an unmanaged cluster to a newer version, you should follow the procedure for a Druid Rolling Update.

If you have deployed your cluster with the Master, Query, and Data server configuration, take note of the following:

  • Update Data servers first, then Query servers, then Master servers.
  • Your Data servers run Druid MiddleManagers as part of Druid's Indexing Service. If you have indexing tasks that you do not want to be interrupted by a rolling update, you can use the Rolling restart (graceful-termination-based) method to prepare the MiddleManagers for clean restart.
  • Your Data servers run Druid Historical Nodes, so you should wait for each server to fully come back online before restarting the next.
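
The ordering above can be captured in a small planning helper. This sketch only prints the hosts in the recommended sequence and does not perform the update itself. The name rolling_update_order and the host names are assumptions.

```shell
# Print servers in the order recommended above: Data, then Query, then Master.
# The host lists are placeholders (assumptions).
rolling_update_order() {
  data_servers="$1"
  query_servers="$2"
  master_servers="$3"
  # Word splitting is intentional: each argument is a space-separated host list.
  for host in $data_servers $query_servers $master_servers; do
    echo "$host"
  done
}
```

For example, rolling_update_order "data1 data2" "query1" "master1" lists data1, data2, query1, and master1, one per line. Update one host at a time, and confirm each Data server is fully back online before moving to the next.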