Unmanaged Imply
The installation mode described here is intended for trial and learning scenarios only; it is not intended for use in production. See the deployment options in Deployment planning for more information about production deployments.
Imply includes a script, supervise, that lets you install and start an Imply cluster quickly. This deployment mode suits single-machine deployments and is the mode used in the Imply quickstart; it is not recommended for production environments. If you're not familiar with Imply, start with the Quickstart.
In this document, we'll set up a simple cluster made up of one Data server, one Query server, and one Master server. Later, we'll discuss how you can configure high availability and scale up this cluster.
Before starting, see the planning information for machine sizing, OS requirements, network configuration (for example, ports to open), and other considerations.
Download the distribution
First, download Imply 2024.10.2 from imply.io/get-started and unpack the release archive. It's best to do this on a single machine at first, since you will be editing the configurations and then copying the modified distribution out to all of your servers.
tar -xzf imply-2024.10.2.tar.gz
cd imply-2024.10.2
In this package, you'll find:
- bin/ - run scripts for included software.
- conf/ - template configurations for a clustered setup.
- conf-quickstart/* - configurations for the single-machine quickstart.
- dist/ - all included software.
- quickstart/ - files related to the single-machine quickstart.
We'll be editing the files in conf/ to get things running.
Configure Master server address
In this simple cluster, you will deploy a single Master server running a Druid Coordinator, a Druid Overlord, a ZooKeeper server, and an embedded Derby metadata store.
In conf/druid/_common/common.runtime.properties, update these properties by replacing master.example.com with the IP address of the machine that you will use as your Master server:
- druid.zk.service.host
- druid.metadata.storage.connector.connectURI
- druid.metadata.storage.connector.host
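The replacement can also be scripted. The following is a minimal sketch, not part of the Imply distribution: the function name set_master_address, the address 10.0.0.12, and the property values in the scratch file are assumptions chosen for illustration. It runs against a temporary copy so that you can inspect the result before applying the same function to conf/druid/_common/common.runtime.properties.

```shell
#!/bin/sh
# Replace the placeholder host master.example.com in a properties file.
# Usage: set_master_address <properties-file> <master-ip>
# (hypothetical helper, not shipped with Imply)
set_master_address() {
  props_file=$1
  master_ip=$2
  tmp_file=$(mktemp) || return 1
  # The dots are escaped so that only the literal placeholder host matches.
  # Writing back with cat keeps the original file's permissions.
  sed "s/master\.example\.com/${master_ip}/g" "$props_file" > "$tmp_file" &&
    cat "$tmp_file" > "$props_file"
  rm -f "$tmp_file"
}

# Demonstration on a scratch file; these property values are illustrative only.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
druid.zk.service.host=master.example.com
druid.metadata.storage.connector.connectURI=jdbc:derby://master.example.com:1527/var/druid/metadata.db;create=true
druid.metadata.storage.connector.host=master.example.com
EOF

set_master_address "$demo_file" 10.0.0.12
cat "$demo_file"
```

To apply it to the real file, call set_master_address with the path to common.runtime.properties and the address of your Master server.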
Configure deep storage
Druid relies on a distributed filesystem or binary object store for data storage. The backing deep storage systems commonly used with Druid include S3 (popular for those on AWS), HDFS (popular if you already have a Hadoop deployment), Azure, and GCS.
S3
In conf/druid/_common/common.runtime.properties:
- Add "druid-s3-extensions" to druid.extensions.loadList. If, for example, the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"].
- Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".
- Uncomment and configure appropriate values in the "For S3" sections of "Deep Storage" and "Indexing service logs".
After this, you should have made the following changes:
druid.extensions.loadList=["druid-parser-route", "druid-s3-extensions"]
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=s3
druid.storage.bucket=your-bucket
druid.storage.baseKey=druid/segments
druid.s3.accessKey=...
druid.s3.secretKey=...
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=your-bucket
druid.indexer.logs.s3Prefix=druid/indexing-logs
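After commenting and uncommenting several lines, it can help to confirm which values are actually active. The sketch below is one way to do that; the helper name active_values is an assumption, and the scratch file holds only an excerpt of the settings shown above. It prints the uncommented value for a key, so a leftover local or file setting would show up as a second line of output.

```shell
#!/bin/sh
# Print every uncommented value assigned to a key in a properties file.
# Usage: active_values <key> <properties-file>
# (hypothetical helper, not shipped with Imply)
active_values() {
  # Split each line on '='. A commented line has '#key' in the first field,
  # so the exact string comparison skips it.
  awk -F= -v key="$1" '$1 == key { sub(/^[^=]*=/, ""); print }' "$2"
}

# Scratch file with an excerpt of the S3 settings from this section.
demo_file=$(mktemp)
cat > "$demo_file" <<'EOF'
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=s3
druid.storage.bucket=your-bucket
#druid.indexer.logs.type=file
druid.indexer.logs.type=s3
EOF

storage_type=$(active_values druid.storage.type "$demo_file")
logs_type=$(active_values druid.indexer.logs.type "$demo_file")
echo "deep storage: $storage_type, task logs: $logs_type"
```

The same check works for the HDFS configuration; the active value would then be hdfs for both keys.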
HDFS
In conf/druid/_common/common.runtime.properties:
- Add "druid-hdfs-storage" to druid.extensions.loadList. If, for example, the list already contains "druid-parser-route", the final property should look like: druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"].
- Comment out the configurations for local storage under "Deep Storage" and "Indexing service logs".
- Uncomment and configure appropriate values in the "For HDFS" sections of "Deep Storage" and "Indexing service logs".
After this, you should have made the following changes:
druid.extensions.loadList=["druid-parser-route", "druid-hdfs-storage"]
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=hdfs
druid.storage.storageDirectory=hdfs://namenode.example.com:9000/druid/segments
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
druid.indexer.logs.type=hdfs
druid.indexer.logs.directory=hdfs://namenode.example.com:9000/druid/indexing-logs
Also, place your Hadoop configuration XMLs (core-site.xml, hdfs-site.xml, yarn-site.xml, mapred-site.xml) on the classpath of your Druid nodes. You can do this by copying them into conf/druid/_common/.
Configure Hadoop connection (optional)
If you want to use Hadoop for data ingestion, you can configure that now. See Hadoop.
Configuration tuning
Druid benefits greatly from being tuned to the hardware that it runs on. If you are using r4.2xlarge EC2 instances or similar hardware, the configuration in the distribution is a reasonable starting point.
If you are using different hardware, we recommend adjusting configurations for your specific hardware. The most commonly adjusted configurations are:
- -Xmx and -Xms
- druid.server.http.numThreads
- druid.cache.sizeInBytes
- druid.processing.buffer.sizeBytes
- druid.processing.numMergeBuffers
- druid.processing.numThreads
- druid.query.groupBy.maxIntermediateRows
- druid.query.groupBy.maxResults
- druid.server.maxSize and druid.segmentCache.locations on Historical Nodes
- druid.worker.capacity on MiddleManagers
Please see the Druid configuration documentation for a full description of all possible configuration options.
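Several of these settings interact. As an illustration, a commonly cited rule of thumb from Druid's tuning guidance is that a process needs direct memory of at least druid.processing.buffer.sizeBytes * (druid.processing.numThreads + druid.processing.numMergeBuffers + 1). The sketch below works through that arithmetic. The three values are hypothetical and are not taken from the distribution's defaults; treat the result as a starting estimate, not a definitive sizing.

```shell
#!/bin/sh
# Hypothetical example values; substitute the settings from your own
# runtime.properties files.
buffer_size_bytes=536870912   # druid.processing.buffer.sizeBytes (512 MiB)
num_threads=7                 # druid.processing.numThreads
num_merge_buffers=2           # druid.processing.numMergeBuffers

# One buffer per processing thread, one per merge buffer, plus one extra.
direct_memory_bytes=$(( buffer_size_bytes * (num_threads + num_merge_buffers + 1) ))
direct_memory_mib=$(( direct_memory_bytes / 1024 / 1024 ))

echo "Direct memory needed: ${direct_memory_bytes} bytes (${direct_memory_mib} MiB)"
```

With these values the estimate is 5120 MiB, which would then inform the JVM's -XX:MaxDirectMemorySize setting alongside -Xmx and -Xms.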
Start Master server
Copy Imply's distribution of Apache Druid® and your edited configurations to your new Master server. If you have been editing the configurations on your local machine, you can use rsync to copy them:
rsync -az imply-2024.10.2/ MASTER_SERVER:imply-2024.10.2/
On your Master server, cd into the distribution and run this command to start a Master:
bin/supervise -c conf/supervise/master-with-zk.conf
You should see a log message printed out for each service that starts up. You can view detailed logs for any service by looking in the var/sv/ directory using another terminal.
Start Query server
Copy Imply's distribution of Apache Druid and your edited configurations to your Query servers. On each one, cd into the distribution and run this command to start a Query server:
bin/supervise -c conf/supervise/query.conf
The default Query server configuration launches a Druid Router, Druid Broker, and Pivot.
Start Data servers
Copy Imply's distribution of Apache Druid and your edited configurations to your Data servers. On each one, cd into the distribution and run this command to start a Data server:
bin/supervise -c conf/supervise/data.conf
The default Data server configuration launches a Druid Historical and Druid MiddleManager process. New Data servers will automatically join the existing cluster. These services can be scaled out as much as necessary simply by starting more Data servers.
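When there are several servers to provision, the copy step can be wrapped in a loop. The following is a sketch under stated assumptions: the function name deploy_dist, the DRY_RUN switch, and the host names are hypothetical, and the rsync invocation mirrors the one used for the Master server. By default it only prints the commands so that you can review them first.

```shell
#!/bin/sh
# Copy the edited distribution to each listed host with rsync.
# With DRY_RUN=1 (the default in this sketch) the commands are only printed.
DRY_RUN=${DRY_RUN:-1}

# Usage: deploy_dist <distribution-dir> <host>...
# (hypothetical helper, not shipped with Imply)
deploy_dist() {
  dist_dir=$1
  shift
  for host in "$@"; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "rsync -az ${dist_dir}/ ${host}:${dist_dir}/"
    else
      rsync -az "${dist_dir}/" "${host}:${dist_dir}/" || return 1
    fi
  done
}

# Hypothetical host names for two Data servers.
deploy_dist imply-2024.10.2 data1.example.com data2.example.com
```

Setting DRY_RUN=0 in the environment runs the copies for real. Starting the services with bin/supervise on each host remains a separate step.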
Service supervision and logging
You can use the supervise command to manage Imply service lifecycles and access console logs. The command is configured through a single configuration file per machine, and each machine can start many services. For example, consider the command:
bin/supervise -c conf/supervise/master-with-zk.conf
This tells supervise to use the file conf/supervise/master-with-zk.conf to select which services to run.
By default, the supervise program runs in the foreground. You can run supervision in the background, if you want, by adding the --daemon argument:
bin/supervise -c conf/supervise/master-with-zk.conf --daemon
You can restart an individual service using its name. For example, to restart the zk service, run bin/service --restart zk from the distribution.
To shut down all services on a machine, kill the supervise process (CTRL-C or kill SUPERVISE_PID both work) or run the command bin/service --down from the distribution.
Logging
By default, logs are written to var/sv/<service>/current in the distribution. You can write these files to any location you want by passing the -d <directory> argument to bin/supervise.
For added convenience, you can also tail log files by running bin/service --tail <service>.
On macOS and Linux, logs are automatically rotated using the included logger program. On other platforms, to prevent log files from growing forever, you can periodically truncate the logs using truncate -s 0 <logfile>.
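If you do need to truncate logs yourself, the command can be applied to every service at once. This is a minimal sketch: the function name truncate_logs is an assumption, the service names in the demonstration are examples, and it expects the truncate utility to be available on the machine. It runs against a scratch directory laid out like var/sv/<service>/current.

```shell
#!/bin/sh
# Empty every console log under a supervise log directory.
# Usage: truncate_logs <sv-directory>   (for example var/sv)
# (hypothetical helper, not shipped with Imply)
truncate_logs() {
  for log_file in "$1"/*/current; do
    # Skip the unexpanded pattern when no service directories exist.
    [ -f "$log_file" ] || continue
    truncate -s 0 "$log_file"
  done
}

# Demonstration against a scratch directory that mimics var/sv/<service>/current.
sv_dir=$(mktemp -d)
mkdir -p "$sv_dir/broker" "$sv_dir/historical"
echo "old log line" > "$sv_dir/broker/current"
echo "old log line" > "$sv_dir/historical/current"

truncate_logs "$sv_dir"
wc -c "$sv_dir"/*/current
```

Running truncate_logs var/sv from the distribution on a schedule would apply the same step to the real logs.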
Customizing supervision
You can modify the provided supervision files or create new files of your own. There are two kinds of lines in supervision files:
- :verify some-program will run some-program on startup. If the program exits successfully, supervise will continue. Otherwise, supervise will exit.
- foo some-program will supervise a service named foo by running the program some-program. If the program exits, supervise will start it back up. Its console logs will be logged to a file named var/sv/foo/current.
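To make the two line types concrete, the sketch below writes a small supervision file. Both program paths (bin/check-disk-space and bin/run-heartbeat) and the service name heartbeat are hypothetical and are not part of the distribution; the file is written to a temporary location so the example can be run safely.

```shell
#!/bin/sh
# Write an example supervision file with one line of each kind.
# The programs named here are hypothetical placeholders.
conf_file=$(mktemp)
cat > "$conf_file" <<'EOF'
:verify bin/check-disk-space
heartbeat bin/run-heartbeat
EOF

# Show the resulting file.
cat "$conf_file"
```

With a file like this, supervise would run the check once at startup, then keep the heartbeat service running and log its console output to var/sv/heartbeat/current. You would pass the file to bin/supervise with the -c argument, as with the provided configurations.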
Updating a cluster
When updating an unmanaged cluster to a newer version, you should follow the procedure for a Druid Rolling Update.
If you have deployed your cluster with the Master, Query, and Data server configuration, take note of the following:
- Update Data servers first, then Query servers, then Master servers.
- Your Data servers run Druid MiddleManagers as part of Druid's Indexing Service. If you have indexing tasks that you do not want to be interrupted by a rolling update, you can use the Rolling restart (graceful-termination-based) method to prepare the MiddleManagers for clean restart.
- Your Data servers run Druid Historical Nodes, so you should wait for each server to fully come back online before restarting the next.