Planning

Deploying an Imply cluster requires planning and preparing its intended operating environment. You will need to choose the hardware, decide how to provision users, prepare the network, and more. This document outlines some of these considerations.

This document describes general Imply cluster requirements and considerations. For specific Imply Manager requirements, see Deploy with Docker or Deploy with Kubernetes.

In its out-of-the-box configuration, Imply is not intended to be exposed to untrusted users or to an untrusted network, such as the Internet. It is possible to expose Pivot in a limited fashion with a secure, custom configuration. However, Druid or its APIs should never be exposed in this manner. For details on configuring Pivot for secure access on untrusted networks, contact your Imply representative.

Select OS

We recommend running Imply on your favorite Linux distribution. You will also need Java 8.

Your OS package manager should be able to install Java for you. If your distribution's repositories do not include a recent enough version of Java 8, Azul offers Zulu, an open source build of OpenJDK with packages for Red Hat, Ubuntu, Debian, and other popular Linux distributions.
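
For example, on a Debian-based host an installation might look like the following sketch; the package name is an assumption, and other distributions use their own equivalents.

```
# Install Java 8 via the OS package manager (Ubuntu/Debian shown; the
# openjdk-8-jdk package name varies by distribution).
sudo apt-get update
sudo apt-get install -y openjdk-8-jdk

# Confirm the version Imply will run under.
java -version
```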

Select hardware

It's possible to run the entire Imply stack on a single machine with only a few GB of RAM, as you've done in the quickstart. For a clustered deployment, however, larger machines are typically required.

A simple, medium-sized cluster—often the starting point for an Imply deployment—needs one Master server, one Query server, and as many Data servers as necessary to index and store data.

For clusters with complex resource allocation needs, you can break apart the data server components and deploy them individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, and eliminates the possibility of resource contention between historical workloads and real-time workloads.

The following recommendations outline machine requirements for a medium-sized cluster.

Data servers

Data servers run Druid Historical services (for storage and processing of large amounts of immutable data) and Druid MiddleManagers (for data ingestion and processing). These servers benefit greatly from CPU, RAM, and SSDs. The equivalent of an AWS r4.2xlarge, which offers 8 vCPUs and 61 GB of RAM, is a good starting point.
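
As a rough illustration, the following sketch maps those resources onto a Historical service's runtime.properties. The values are assumptions meant to show the shape of the tuning, not tuned recommendations.

```
# conf/druid/historical/runtime.properties -- illustrative sizing for an
# 8 vCPU / 61 GB Data server; tune for your workload.
druid.processing.numThreads=7                 # commonly set to vCPUs - 1
druid.processing.buffer.sizeBytes=500000000   # per-thread processing buffer
druid.server.maxSize=300000000000             # bytes of segments held locally
druid.segmentCache.locations=[{"path":"/var/druid/segment-cache","maxSize":300000000000}]
```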

Query servers

Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker, which routes queries to the appropriate data nodes, and a Druid Router, which acts as a thin reverse proxy and unified query and API endpoint. They also include Pivot, a way to directly explore and visualize your data, alongside Druid's native SQL and JSON-over-HTTP query support. These servers benefit greatly from CPU and RAM and can likewise be deployed on the equivalent of an AWS r4.2xlarge, which offers 8 vCPUs and 61 GB of RAM.
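
For instance, a client can issue a Druid SQL query through the Router's HTTP endpoint. The hostname and table name below are placeholders, and the Router port is assumed to be 8088, matching the port list later in this document.

```
# Druid SQL over HTTP, via the Router on a Query server.
curl -X POST http://query-server.example.com:8088/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT COUNT(*) AS row_count FROM my_table"}'
```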

Master servers

Master servers coordinate data ingestion and storage in your Druid cluster; they are not involved in queries. They are responsible for coordinating ingestion jobs and for handling failover of the Druid Historical and MiddleManager processes running on your Data servers. The equivalent of an AWS m5.xlarge, which offers 4 vCPUs and 16 GB of RAM, is sufficient for most clusters.

Deep storage

Druid relies on a distributed filesystem or binary object store for data storage. The deep storage systems commonly used with Druid include Amazon S3 (popular for those on AWS), HDFS (popular for those who already have Hadoop in their environment), Microsoft Azure Storage, and Google Cloud Storage (GCS).

How you configure deep storage depends on how you deploy Imply. For Imply Manager deployments, see Deploy with Docker or Deploy with Kubernetes; for manual deployments, you configure deep storage directly in Druid's common.runtime.properties file.

For more information, see Deep Storage in the Druid documentation.
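
As one example, pointing a manually deployed cluster at S3 deep storage takes only a few lines in common.runtime.properties. The bucket name and key prefix below are placeholders.

```
# conf/druid/_common/common.runtime.properties -- S3 deep storage sketch.
# Credentials can also come from instance roles rather than hardcoded keys.
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=example-druid-deep-storage
druid.storage.baseKey=druid/segments
```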

Open ports (if using a firewall)

If you're using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the following:

Master Server

  - 1527 (Derby metadata store; not needed if you use a separate metadata store such as MySQL or PostgreSQL)
  - 2181 (ZooKeeper; not needed if you use a separate ZooKeeper cluster)
  - 8081 (Druid Coordinator)
  - 8090 (Druid Overlord)

Query Server

  - 8082 (Druid Broker)
  - 8088 (Druid Router)
  - 9095 (Pivot)

Data Server

  - 8083 (Druid Historical)
  - 8091 (Druid MiddleManager)
  - 8100–8199 (Druid task JVMs spawned by the MiddleManager)
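
For example, with firewalld on a Data server, the rules might look like the following sketch; adapt it for your firewall tooling (ufw, iptables, or cloud security groups).

```
# Open the Data server ports with firewalld.
sudo firewall-cmd --permanent --add-port=8083/tcp       # Druid Historical
sudo firewall-cmd --permanent --add-port=8091/tcp       # Druid MiddleManager
sudo firewall-cmd --permanent --add-port=8100-8199/tcp  # ingestion task JVMs
sudo firewall-cmd --reload
```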

Production setup

High availability

A quickstart cluster created in the Imply Manager runs a single Master server, a single Query server, and one or more Data servers. In this scenario, Data servers are scalable and fault tolerant, but achieving the same for Master and Query servers requires some additional configuration steps.

For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord:

  - Replace Derby with a replicated MySQL or PostgreSQL metadata store; Derby is not suitable for production use.
  - Run ZooKeeper as an ensemble of at least three nodes, ideally on hardware separate from the Master servers.
  - Run two or more Master servers. The Coordinator and Overlord elect a leader, and the remaining instances act as warm standbys.
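
A configuration sketch for the first two points, assuming MySQL and an external three-node ZooKeeper ensemble; hostnames and credentials are placeholders.

```
# conf/druid/_common/common.runtime.properties -- high-availability sketch.

# Replace Derby with a replicated MySQL metadata store
# (PostgreSQL is configured analogously).
druid.extensions.loadList=["mysql-metadata-storage"]
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://metadata.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>

# Point all services at the external ZooKeeper ensemble.
druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com
```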

For the Query server:

  - Run two or more Query servers behind a load balancer. Druid Brokers are stateless, so queries can be routed to any of them.

Data servers can be scaled out without any additional configuration.

Geographically distributed deployment

Deployments across geographically distributed datacenters typically involve independent active clusters. For example, for a deployment across two datacenters, you can set up a separate Imply cluster in each datacenter. To ensure that each cluster loads the same data, there are two possible approaches:

  1. Have each cluster load data from the same source (HDFS cluster, S3 bucket, Kafka cluster, etc.). In this case, data loading happens over a long-distance link.
  2. Set up replication at the data input system level, using a tool like DistCp for HDFS or MirrorMaker for Kafka to replicate the same input data into every datacenter (see the sketch after this list).
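
Hedged examples of the second approach, with placeholder hosts, paths, and topic names:

```
# Copy HDFS input between datacenters with DistCp.
hadoop distcp hdfs://dc1-namenode:8020/input/events \
  hdfs://dc2-namenode:8020/input/events

# Mirror a Kafka topic into the second datacenter with MirrorMaker.
kafka-mirror-maker.sh --consumer.config dc1-consumer.properties \
  --producer.config dc2-producer.properties --whitelist 'events'
```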

Imply does not currently provide tools to simplify multi-datacenter deployments; users typically develop site-specific scripts and procedures for keeping multiple clusters synchronized.

Backup

Druid's critical data is all stored in deep storage (e.g., Azure, S3, or HDFS) and in its metadata store (e.g., PostgreSQL or MySQL). It is important to back up your metadata store.
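
For a MySQL metadata store, a periodic logical dump is usually sufficient; the host and database name below are placeholders (use pg_dump for PostgreSQL).

```
# Dump the Druid metadata database; -p prompts for the password and
# "druid" is the database name.
mysqldump --single-transaction -h metadata.example.com -u druid -p druid \
  > druid-metadata-$(date +%F).sql
```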

Deep storage is often infeasible to back up due to its size, but it is important to ensure you have sufficient procedures in place to avoid losing data from deep storage. This could involve backups or could involve replication and proactive operations procedures.

Druid does not store any critical data in ZooKeeper, and does not store any critical data on disk if you have an independent metadata store and deep storage configured.
