2021.02
Imply Private overview

The following considerations apply to self-hosted Imply Private deployments.

As described in Deployment planning, Imply Private is an alternative deployment approach to Imply Cloud. It lets you maintain private control of the entire Imply deployment, including the Imply Manager. Imply Private provides distributions designed for easy installation and operation on Kubernetes, along with a binary distribution tar archive.

This topic provides general information regarding Imply Private deployments. For more specific information, see these topics:

  • Imply Private on Kubernetes
  • Imply Private on Google Kubernetes Engine
  • Imply Private on Azure Kubernetes Service
  • Imply Private on Linux

If you are exploring Imply in general, especially Imply Private on Kubernetes, a good place to start is Imply Private on Minikube.

Open ports (if using a firewall)

If you're using a firewall or some other system that only allows traffic on specific ports, allow inbound connections on the following:

Master Server

  • 1527 (Derby; not needed if you are using a separate metadata store like MySQL or PostgreSQL)
  • 2181 (ZooKeeper; not needed if you are using a separate ZooKeeper cluster)
  • 8081 (Druid Coordinator)
  • 8090 (Druid Overlord)

Query Server

  • 8082 (Druid Broker)
  • 8888 (Druid Router)
  • 9095 (Pivot/Clarity)

Data Server

  • 8083 (Druid Historical)
  • 8091 (Druid Middle Manager)
  • 8100–8199 (Druid Task JVMs, spawned by Middle Managers)
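If you want to verify that the required ports are reachable from another machine, a quick TCP connectivity check can help. The sketch below is illustrative only: the hostnames you pass in are placeholders, and it tests basic TCP reachability, not whether the Druid service behind the port is healthy.

```python
import socket

# Ports from the lists above, grouped by server role.
MASTER_PORTS = [1527, 2181, 8081, 8090]
QUERY_PORTS = [8082, 8888, 9095]
DATA_PORTS = [8083, 8091] + list(range(8100, 8200))

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_ports(host, ports):
    """Return the subset of ports that are NOT reachable on the given host."""
    return [p for p in ports if not port_open(host, p)]
```

For example, `check_ports("master.example.com", MASTER_PORTS)` (hostname is a placeholder) returns any Master server ports your firewall is still blocking.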

Production setup

In its out-of-the-box configuration, Imply is not intended to be exposed to untrusted users or to an untrusted network, such as the Internet. It is possible to expose Pivot in a limited fashion with a secure, custom configuration. However, Druid or its APIs should never be exposed in this manner. For details on configuring Pivot for secure access on untrusted networks, contact your Imply representative.

Note that for Imply Private on Kubernetes, the Helm chart provides many of the following as default settings. For the easiest installation, we recommend using a Helm chart-assisted installation.

High availability

Achieving scalability and fault tolerance for Master and Query servers requires some additional configuration steps.

For the Master server, which runs Derby (metadata storage), ZooKeeper, and the Druid Coordinator and Overlord:

  • For highly available ZooKeeper, you will need a cluster of 3 or 5 ZooKeeper nodes. We recommend either installing ZooKeeper on its own hardware, or running 3 or 5 Master servers and configuring ZooKeeper on them appropriately. See the ZooKeeper admin guide for more details.
  • For highly available metadata storage, we recommend PostgreSQL or MySQL with replication and failover enabled. Commented-out sample Druid configurations for both are included in common.runtime.properties in the Imply distribution.
  • Configuring highly available Druid Coordinators and Overlords is simple: just start multiple servers. If they are all configured to use the same ZooKeeper cluster and metadata storage, they will automatically fail over between each other as necessary. Only one will be active at a time, but inactive servers will redirect to the currently active one.
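As a reference point, the commented-out PostgreSQL settings in common.runtime.properties look roughly like the following once enabled. The hostname, database name, and credentials below are placeholders, and your extensions load list will typically include other extensions as well:

```properties
druid.extensions.loadList=["postgresql-metadata-storage"]

druid.metadata.storage.type=postgresql
druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=<password>
```

The MySQL variant is analogous, using the mysql-metadata-storage extension and a jdbc:mysql:// connect URI.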

For the Query server:

  • Druid Brokers and Routers can be scaled out and all running servers will be active and queryable. We recommend placing Routers behind a load balancer and using them as a unified query and API endpoint.
  • Pivot should be configured to use a database for settings before scaling out. Once you have set up highly available metadata storage for Druid, you can point Pivot at the same server; refer to the commented-out "Database-backed settings" section of Pivot's config.yaml for sample configurations. After that, all running Pivot servers will be active and queryable, and we recommend placing them behind a load balancer.
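As an illustration of the load-balancer setup described above, an nginx configuration that balances across two Routers might look like the following. The hostnames are placeholders, and any HTTP load balancer can fill the same role:

```nginx
# Round-robin load balancing across Druid Routers on their default port.
upstream druid_routers {
    server query1.example.com:8888;
    server query2.example.com:8888;
}

server {
    listen 80;
    location / {
        proxy_pass http://druid_routers;
    }
}
```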

Data servers can be scaled out without any additional configuration.

Geographically distributed deployment

Deployments across geographically distributed datacenters typically involve independent active clusters. For example, for a deployment across two datacenters, you can set up a separate Imply cluster in each datacenter. To ensure that each cluster loads the same data, there are two possible approaches:

  1. Have each cluster load data from the same source (HDFS cluster, S3 bucket, Kafka cluster, etc.). In this case, data loading happens over a long-distance link.
  2. Set up replication at the data input system level. For example, using a tool like DistCp for HDFS, or MirrorMaker for Kafka, to replicate the same input data into every datacenter.
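For example, the second approach might use commands along these lines. Cluster addresses, paths, and topic names are placeholders, and the exact tooling and flags vary by Hadoop and Kafka version:

```shell
# Copy Druid input data from one HDFS cluster to another.
hadoop distcp hdfs://dc1-namenode:8020/data/input hdfs://dc2-namenode:8020/data/input

# Mirror a Kafka topic into the second datacenter (legacy MirrorMaker syntax).
kafka-mirror-maker.sh --consumer.config dc1-consumer.properties \
  --producer.config dc2-producer.properties --whitelist 'events'
```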

Imply does not currently provide tools to simplify multi-datacenter deployments; users typically develop site-specific scripts and procedures for keeping multiple clusters synchronized.

Backup

Druid's critical data is all stored in deep storage (e.g., Azure, S3, or HDFS) and in its metadata store (e.g., PostgreSQL or MySQL). It is important to back up your metadata store.
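For example, if your metadata store is PostgreSQL, a periodic dump is usually sufficient. This is a sketch; the hostname, user, and database name are placeholders:

```shell
# Dump the Druid metadata database; restore with pg_restore if needed.
pg_dump --host metadata-db.example.com --username druid \
  --format=custom --file druid-metadata-$(date +%F).dump druid
```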

Deep storage is often infeasible to back up due to its size, but it is important to have procedures in place to avoid losing data from it. These could involve backups, or replication combined with proactive operational procedures.

Druid does not store any critical data in ZooKeeper, and does not store any critical data on disk if you have an independent metadata store and deep storage configured.
