Deployment planning

Imply is available in a range of deployment options and lets you combine self-hosted and Imply-managed components in a variety of ways.

Architecture

Before planning your deployment, it's helpful to understand the overall architecture.

Imply components are organized by control plane and data plane. The control plane contains the Imply Manager, while the data plane contains the Druid Cluster.

(Diagram: the Imply control plane, containing the Imply Manager, and the data plane, containing the Druid cluster.)

Deployment options

The following table outlines deployment modes.

Imply product                 Cloud: AWS   Cloud: GCP   Cloud: Azure   On-prem
Imply Cloud                   Available    —            —              —
Imply Private on Kubernetes   Basic        Enhanced     Basic          Basic
Imply Private on Linux        Available    Available    Available      Available

The options, including the basic and enhanced modes, are further described in the following sections.

Imply Cloud

With Imply Cloud, the control plane (which includes the Imply Cloud Manager and other control components) resides in Imply's VPC, while the data plane resides in your own VPC.

Your users access the Imply UIs through the control plane, which serves as a proxy for Pivot. Administrators perform administration functions, create clusters, and apply updates from the Imply Manager in the control plane.

Ingested data resides in the data plane, in this case Amazon S3. The terms of the Amazon S3 Service Level Agreement apply to ingested data.

Imply Private

If you want to maintain private control of both the data plane and the control plane, you can use the self-hosted deployment option, Imply Private. Imply Private provides distributions designed for easy installation and operation on Kubernetes, along with a binary tar archive distribution that you can install and orchestrate with tools other than Kubernetes.

There are several ways to deploy a self-hosted Imply cluster. In general, however, the installation is optimized for Kubernetes, with tools specifically designed to facilitate Kubernetes-based deployment. These tools fall into two categories:

  • Basic: Imply provides a Helm chart for deployment. See Kubernetes for more information.
  • Enhanced: Scripts further ease deployment by providing an interactive workflow for deploying Imply over Kubernetes and tighter integration with Kubernetes from the Manager UI.

Imply Private for Kubernetes

If you want the Imply control plane and data plane to reside entirely on your own managed VPC or network, you can use the Imply Private for Kubernetes deployment option.

Imply offers a Helm chart that you can use to install Imply on Kubernetes in Azure or AWS. For the easiest installation, Imply provides an enhanced installation experience for Google Cloud Platform (GCP), in which most of the cloud setup is automated.
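
As an illustration, a minimal sketch of the basic Helm-based installation, assuming a repository URL and release name like the ones below (both are placeholders; use the values provided with your Imply distribution):

    # Add the Imply Helm repository (URL is a placeholder; use the one
    # provided with your Imply distribution).
    helm repo add imply https://static.imply.io/onprem/helm
    helm repo update

    # Install the chart; "imply" is an arbitrary release name.
    helm install imply imply/imply --namespace imply --create-namespace

Cluster-specific settings, such as node counts and storage, are then supplied as chart values rather than command-line flags.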

Imply Private for Linux

Helm and GCP ease the process of deploying Imply across machines. However, if you use another orchestration framework, or run Imply without orchestration, you can install Imply from a set of binary tar archives.
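
As a sketch of the tar-based flow, assuming a distribution archive and supervise script like those in the Imply quickstart (exact file and configuration names vary by version):

    # Unpack the Imply distribution (archive name is illustrative).
    tar -xzf imply-2020.12.tar.gz
    cd imply-2020.12

    # Start all services on one machine with the bundled supervise script;
    # clustered deployments use per-server configurations instead.
    bin/supervise -c conf/supervise/quickstart.conf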

Architecture

Before deploying an on-prem Imply cluster, it's helpful to understand the servers involved and the functions they perform.

The components are:

  • Query servers running Druid Routers, Druid Brokers, and Imply Pivot.
  • Data servers running Druid Historical Nodes and Druid MiddleManagers.
  • Master server(s) running a Druid Coordinator and Druid Overlord.

Query server

Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate data nodes, and a Druid Router that acts as a unified query and API endpoint. They also include an Imply Pivot server as a way to directly explore and visualize your data.
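
For example, a Druid SQL query can be sent through the Router's unified endpoint. The host below is a placeholder, 8888 is Druid's default Router port, and wikipedia is the datasource used in the tutorials:

    # POST a SQL query to the Router, which forwards it to a Broker.
    curl -X POST http://ROUTER_HOST:8888/druid/v2/sql \
      -H 'Content-Type: application/json' \
      -d '{"query": "SELECT COUNT(*) AS row_count FROM wikipedia"}'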

Data server

Data servers store and ingest data. Data servers run Druid Historical Nodes for storage and processing of large amounts of immutable data, and Druid MiddleManagers for ingestion and processing of data.

For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, as well as eliminate the possibility of resource contention between historical workloads and real-time workloads.

Master server

The Master server coordinates data ingestion and storage in your Druid cluster. It is not involved in queries. It is responsible for starting new ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers.

Imply Manager

The Imply Manager lets you perform these tasks from an easy-to-use, point-and-click interface:

  • Set up Druid and Pivot.
  • Start, stop, and shut down clusters.
  • Apply version updates to a cluster, either by rolling update or all at once.
  • Monitor the cluster and access server logs for troubleshooting.

Select hardware

The following describes general Imply deployment and machine guidelines. Your unique environment may have more specific requirements.

A simple, medium-sized cluster—often the starting point for an Imply deployment—needs one Master server, one Query server, and as many Data servers as necessary to index and store data.

For clusters with complex resource allocation needs, you can deploy the data server components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, and eliminates the possibility of resource contention between historical workloads and real-time workloads.

The following recommendations outline machine requirements for a medium-sized cluster.

Data servers

Data servers run Druid Historical services (for storage and processing of large amounts of immutable data) and Druid MiddleManagers (for data ingestion and processing). These servers benefit greatly from CPU, RAM, and SSDs. Recommended machine specifications, as a starting point, are:

Bare metal:
  • 8 vCPUs
  • 61 GB RAM
  • variable EBS storage (recommended minimum 160 GB SSD)

AWS: r4.2xlarge
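
As a rough sketch of how such a machine maps onto Druid's standard service properties (the values below are illustrative starting points, not tuned recommendations):

    # conf/druid/historical/runtime.properties (illustrative values for
    # an 8 vCPU / 61 GB RAM data server)
    druid.processing.numThreads=7
    druid.processing.buffer.sizeBytes=500000000
    druid.server.maxSize=130000000000
    druid.segmentCache.locations=[{"path":"var/druid/segment-cache","maxSize":130000000000}]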

Query servers

Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate data nodes, and a Druid Router that acts as a thin reverse proxy layer and a unified query and API endpoint. They also include Pivot as a way to directly explore and visualize your data, alongside Druid's native SQL and JSON-over-HTTP query support. These servers benefit greatly from CPU and RAM. Recommended machine specifications, as a starting point, are:

Bare metal:
  • 8 vCPUs
  • 61 GB RAM
  • variable EBS storage (recommended minimum 20 GB SSD)

AWS: r4.2xlarge

Master servers

Master servers coordinate data ingestion and storage in your Druid cluster. They are not involved in queries. They are responsible for coordinating ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers.

Master servers can be deployed standalone or in a highly available configuration with failover. For failover-based configurations, we recommend running ZooKeeper and the metadata store on their own hardware.

Recommended machine specifications, as a starting point, are:

Bare metal:
  • 4 vCPUs
  • 16 GB RAM
  • variable EBS storage (recommended minimum 20 GB SSD)

AWS: m5.xlarge
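
For a failover configuration with an external ZooKeeper ensemble and metadata store, the relevant settings live in Druid's common properties. A sketch, with placeholder hosts and credentials:

    # conf/druid/_common/common.runtime.properties (hosts are placeholders)
    druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com
    # MySQL metadata storage requires the mysql-metadata-storage extension
    # in druid.extensions.loadList.
    druid.metadata.storage.type=mysql
    druid.metadata.storage.connector.connectURI=jdbc:mysql://metadata.example.com:3306/druid
    druid.metadata.storage.connector.user=druid
    druid.metadata.storage.connector.password=diurd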

Deep storage

Druid relies on a distributed filesystem or binary object store for data storage. The backing deep storage systems commonly used with Druid include Amazon S3 (popular for those on AWS), HDFS (popular for those who already have Hadoop in their environment), Microsoft Azure, and Google Cloud Storage (GCS).

For more information, see the Kubernetes Deep Storage Reference for the Helm chart and Deep Storage in the Druid documentation.
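
As an example, configuring Amazon S3 as deep storage uses Druid's standard S3 settings (the bucket name is a placeholder):

    # conf/druid/_common/common.runtime.properties
    # Append druid-s3-extensions to any existing loadList entries.
    druid.extensions.loadList=["druid-s3-extensions"]
    druid.storage.type=s3
    druid.storage.bucket=your-deep-storage-bucket
    druid.storage.baseKey=druid/segments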
