The following considerations apply to self-hosted, Imply Private deployments.
Open ports (if using a firewall)
If you're using a firewall or another system that only allows traffic on specific ports, allow inbound connections on the following ports:
- 1527 (Derby; not needed if you are using a separate metadata store like MySQL or PostgreSQL)
- 2181 (ZooKeeper; not needed if you are using a separate ZooKeeper cluster)
- 8081 (Druid Coordinator)
- 8090 (Druid Overlord)
- 8082 (Druid Broker)
- 8888 (Druid Router)
- 9095 (Pivot/Clarity)
- 8083 (Druid Historical)
- 8091 (Druid Middle Manager)
- 8100–8199 (Druid Task JVMs, spawned by Middle Managers)
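After adjusting firewall rules, it can help to confirm that the ports are actually reachable from another machine. The following is a minimal sketch using only the Python standard library. The port table mirrors the list above; the function names and the hostname in the usage note are illustrative assumptions, not part of the Imply distribution. The task port range 8100–8199 is left out because task JVMs only listen while tasks are running.

```python
import socket

# Ports taken from the list above; the labels are for display only.
IMPLY_PORTS = {
    1527: "Derby",
    2181: "ZooKeeper",
    8081: "Druid Coordinator",
    8090: "Druid Overlord",
    8082: "Druid Broker",
    8888: "Druid Router",
    9095: "Pivot/Clarity",
    8083: "Druid Historical",
    8091: "Druid Middle Manager",
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=IMPLY_PORTS, timeout=2.0):
    """Map each port to True (reachable) or False (blocked or not listening)."""
    return {port: is_port_open(host, port, timeout) for port in ports}
```

For example, `check_ports("master.example.internal")` (a hypothetical hostname) returns a dictionary keyed by port number. A `False` result does not necessarily indicate a firewall problem: the corresponding service may simply not run on that host.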
In its out-of-the-box configuration, Imply is not intended to be exposed to untrusted users or to an untrusted network, such as the Internet. It is possible to expose Pivot in a limited fashion with a secure, custom configuration. However, neither Druid nor its APIs should ever be exposed in this manner. For details on configuring Pivot for secure access on untrusted networks, contact your Imply representative.
Note that for Imply Private on Kubernetes, the Helm chart provides many of the following settings as defaults. For the easiest installation, we recommend using the Helm chart.
Achieving scalability and fault tolerance for Master and Query servers requires some additional configuration steps.
For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord:
- For highly available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes. We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 Master servers and configuring ZooKeeper on them appropriately. See the ZooKeeper admin guide for more details.
- For highly available metadata storage, we recommend PostgreSQL or MySQL with replication and failover enabled. Sample, commented-out Druid configurations for both are included in common.runtime.properties in the Imply distribution.
- Configuring highly available Druid Coordinators and Overlords is simple: just start up multiple servers. If they are all configured to use the same ZooKeeper cluster and metadata storage, they will automatically fail over between each other as necessary. Only one will be active at a time, but inactive servers will redirect to the currently active server.
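Because only one Coordinator and one Overlord is active at a time, it is sometimes useful to check which server currently holds leadership, for instance when verifying a failover. The sketch below assumes Druid's `isLeader` endpoints, which answer HTTP 200 on the leader and HTTP 404 on the others. The function names, the injectable `is_leader` parameter, and the hostnames in the usage note are illustrative assumptions.

```python
import json
import urllib.error
import urllib.request

# Leader-status paths for each service.
IS_LEADER_PATHS = {
    "coordinator": "/druid/coordinator/v1/isLeader",
    "overlord": "/druid/indexer/v1/isLeader",
}


def http_is_leader(url, timeout=5.0):
    """Return True if the server at url reports itself as the leader."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return bool(json.load(resp).get("leader"))
    except urllib.error.HTTPError as err:
        if err.code == 404:  # non-leaders answer 404
            return False
        raise


def find_leader(base_urls, service, is_leader=http_is_leader):
    """Return the base URLs that currently claim leadership for service."""
    path = IS_LEADER_PATHS[service]
    return [base for base in base_urls if is_leader(base.rstrip("/") + path)]
```

With two hypothetical Master servers, `find_leader(["http://master1:8081", "http://master2:8081"], "coordinator")` should return a single-element list in a healthy cluster. An empty list, or more than one entry, suggests the servers are not sharing the same ZooKeeper cluster or that an election is in progress.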
For the Query server:
- Druid Brokers and Routers can be scaled out, and all running servers will be active and queryable. We recommend placing Routers behind a load balancer and using them as a unified query and API endpoint.
- Pivot should be configured to use a database for settings before scaling out. Once you have set up highly available metadata storage for Druid, you can configure Pivot to use the same server. Refer to the commented-out "Database-backed settings" section of Pivot's config.yaml for sample configurations. Once this is done, all running Pivot servers will be active and queryable. We recommend placing them behind a load balancer.
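To illustrate what using Routers as a unified query endpoint looks like from a client, here is a minimal sketch that posts a Druid SQL query to `/druid/v2/sql` through a load balancer address. The load balancer hostname and the function names are assumptions; authentication and TLS, which a real deployment may require, are omitted.

```python
import json
import urllib.request


def build_sql_request(router_url, sql):
    """Build a POST request for Druid's SQL endpoint behind the given URL."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        router_url.rstrip("/") + "/druid/v2/sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sql(router_url, sql, timeout=30.0):
    """Send the query and return the decoded JSON rows."""
    request = build_sql_request(router_url, sql)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.load(resp)
```

A client would call, for example, `run_sql("http://druid-lb.example.internal:8888", "SELECT 1")`, where the hostname is hypothetical. Because the client only knows the load balancer address, Routers and Brokers can be added or removed without changing client configuration.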
Data servers can be scaled out without any additional configuration.
Geographically distributed deployment
Deployments across geographically distributed datacenters typically involve independent active clusters. For example, for a deployment across two datacenters, you can set up a separate Imply cluster in each datacenter. To ensure that each cluster loads the same data, there are two possible approaches:
- Have each cluster load data from the same source (HDFS cluster, S3 bucket, Kafka cluster, etc.). In this case, data loading for at least one cluster will happen over a long-distance link.
- Set up replication at the data input system level. For example, use a tool like DistCp for HDFS or MirrorMaker for Kafka to replicate the same input data into every datacenter.
Imply does not currently provide tools to simplify multi-datacenter deployments; users typically develop site-specific scripts and procedures for keeping multiple clusters synchronized.
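As one example of what such a site-specific script might contain, the sketch below compares per-datasource row counts collected from two clusters and reports any differences. It is purely illustrative: the function name is hypothetical, and how the counts are gathered (for instance, by querying each cluster) is left open.

```python
def diff_datasources(rows_a, rows_b):
    """Compare per-datasource row counts from two clusters.

    Each argument maps datasource name -> total row count. Returns a dict of
    datasource -> (count_in_a, count_in_b) for every datasource that differs;
    a datasource missing from one cluster is reported with None on that side.
    """
    mismatches = {}
    for name in sorted(set(rows_a) | set(rows_b)):
        a, b = rows_a.get(name), rows_b.get(name)
        if a != b:
            mismatches[name] = (a, b)
    return mismatches
```

For example, `diff_datasources({"events": 100}, {"events": 100, "clicks": 5})` reports only `clicks`, which is missing from the first cluster. With streaming ingestion, the clusters may differ briefly while one catches up, so a check like this is probably most useful over time ranges that are no longer receiving data.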
Druid's critical data is all stored in deep storage (e.g., Azure, S3, or HDFS) and in its metadata store (e.g., PostgreSQL or MySQL). It is important to back up your metadata store.
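A metadata store backup can be as simple as a scheduled database dump. The following sketch assumes a PostgreSQL metadata store and that the standard `pg_dump` client is installed on the machine running the script. The host, database name, user, and output directory are placeholders, and the function names are hypothetical. Credentials are expected to come from the usual PostgreSQL mechanisms, such as a `.pgpass` file or the `PGPASSWORD` environment variable.

```python
import subprocess
from datetime import datetime, timezone


def build_pg_dump_command(host, dbname, user, out_dir, now=None, port=5432):
    """Build a pg_dump command line that writes a timestamped custom-format dump."""
    now = now or datetime.now(timezone.utc)
    out_file = f"{out_dir.rstrip('/')}/{dbname}-{now:%Y%m%dT%H%M%SZ}.dump"
    return [
        "pg_dump",
        f"--host={host}",
        f"--port={port}",
        f"--username={user}",
        f"--dbname={dbname}",
        "--format=custom",
        f"--file={out_file}",
    ]


def backup_metadata(host, dbname, user, out_dir):
    """Run pg_dump; raises CalledProcessError if the dump fails."""
    subprocess.run(build_pg_dump_command(host, dbname, user, out_dir), check=True)
```

Building the command separately from running it makes the script easy to inspect and test before it is scheduled. A dump in custom format can later be restored with `pg_restore`.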
Deep storage is often infeasible to back up due to its size, but it is important to ensure you have sufficient procedures in place to avoid losing data from deep storage. This could involve backups or could involve replication and proactive operations procedures.
Druid does not store any critical data in ZooKeeper, and does not store any critical data on disk if you have an independent metadata store and deep storage configured.