Deployment planning
Imply is available in a range of deployment options, letting you combine self-hosted and Imply-managed components in a variety of ways.
Architecture
Before planning your deployment, it's helpful to understand the overall architecture.
Imply components are organized by control plane and data plane. The control plane contains the Imply Manager, while the data plane contains the Druid Cluster.
Deployment options
The following table outlines deployment modes.
| Imply product | Cloud: AWS | Cloud: GCP | Cloud: Azure | On-prem |
|---|---|---|---|---|
| Imply Cloud | Available | — | — | — |
| Imply Private on Kubernetes | Basic | Enhanced | Basic | Basic |
| Imply Private on Linux | Available | Available | Available | Available |
The options, including the Basic and Enhanced modes, are further described in the following sections.
Imply Cloud
With Imply Cloud, the control plane (which includes the Imply Cloud Manager and other control components) resides in Imply's VPC, while the data plane resides in your own VPC.
Your users access the Imply UIs through the control plane, which serves as a proxy for Pivot. Administrators perform administration functions, create clusters, and apply updates from the Imply Manager in the control plane.
Ingested data resides in the data plane, in this case, Amazon S3. Terms of the Amazon S3 Service Level Agreement apply to ingested data.
Imply Private
If you want to maintain private control of both the data plane and the control plane, you can use one of the self-hosted deployment options, collectively known as Imply Private. Imply Private provides distributions designed for easy installation and operation on Kubernetes, along with a binary tar archive distribution that you can install and orchestrate using tools other than Kubernetes.
There are several ways to deploy a self-hosted Imply cluster. In general, however, the installation is optimized for Kubernetes, with tools designed specifically to facilitate Kubernetes-based deployment. The tools fall into two categories:
- Basic: Imply provides a Helm chart for deployment. See Kubernetes for more information.
- Enhanced: Scripts further ease deployment by providing an interactive workflow for deploying Imply over Kubernetes and tighter integration with Kubernetes from the Manager UI.
Imply Private for Kubernetes
If you want the Imply control plane and data plane to reside entirely on your own managed VPC or network, you can use the Imply Private for Kubernetes deployment option.
Imply offers a Helm chart that you can use to install Imply on Kubernetes in AWS or Azure. For the easiest installation, Imply provides an enhanced installation experience for Google Cloud Platform (GCP), in which most of the cloud setup is automated.
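Installation with the Helm chart follows the standard Helm workflow. The following is a minimal sketch; the repository URL, chart name, and values file are placeholders, so use the coordinates from the Kubernetes documentation for your Imply version.

```sh
# Register the Imply chart repository (URL is a placeholder) and
# install a release into its own namespace.
helm repo add imply https://charts.example.com/imply   # hypothetical URL
helm repo update

# my-values.yaml holds your license, storage, and sizing overrides.
helm install imply imply/imply \
  --namespace imply --create-namespace \
  -f my-values.yaml
```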
Imply Private for Linux
Helm and GCP ease the process of deploying Imply across machines. However, if you use another orchestration framework, or run Imply without orchestration, you can install Imply in binary form from a set of tar archive files.
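As a sketch, a tar-based install amounts to unpacking the archive on each machine and starting the services for that machine's role. The archive name below is illustrative, and the supervise config paths follow the layout of recent Imply distributions.

```sh
# Unpack the Imply distribution (archive name is illustrative).
tar -xzf imply-2021.01.tar.gz
cd imply-2021.01

# Start the processes for this machine's role; the distribution ships
# per-role supervise configs (master, query, data).
bin/supervise -c conf/supervise/master-with-zk.conf
```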
Architecture
Before deploying an on-prem Imply cluster, it's helpful to understand the servers and the functions they perform for the Imply cluster.
The components are:
- Query servers running Druid Routers, Druid Brokers, and Imply Pivot.
- Data servers running Druid Historical Nodes and Druid MiddleManagers.
- Master server(s) running a Druid Coordinator and Druid Overlord.
Query server
Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate data nodes, and a Druid Router that acts as a unified query and API endpoint. They also include an Imply Pivot server as a way to directly explore and visualize your data.
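For example, a client can submit a Druid SQL query through the Router's unified endpoint. The hostname and datasource below are placeholders; port 8888 is the Router's default, and /druid/v2/sql is Druid's standard SQL endpoint.

```sh
# POST a Druid SQL query to the Router, which forwards it to a Broker.
curl -X POST http://query-server.example.com:8888/druid/v2/sql \
  -H 'Content-Type: application/json' \
  -d '{"query": "SELECT COUNT(*) AS row_count FROM wikipedia"}'
```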
Data server
Data servers store and ingest data. Data servers run Druid Historical Nodes for storage and processing of large amounts of immutable data, and Druid MiddleManagers for ingestion and processing of data.
For clusters with complex resource allocation needs, you can break apart the pre-packaged Data server and scale the components individually. This lets you scale Druid Historical Nodes independently of Druid MiddleManagers and eliminates the possibility of resource contention between historical workloads and real-time workloads.
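On Kubernetes, for instance, that separation maps onto scaling the corresponding workloads independently. The resource names below are hypothetical and depend on how your chart labels them.

```sh
# Scale Historicals without touching MiddleManagers (names are
# hypothetical; check `kubectl get statefulsets` for yours).
kubectl scale statefulset imply-druid-historical --replicas=6 -n imply
kubectl scale statefulset imply-druid-middlemanager --replicas=3 -n imply
```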
Master server
The Master server coordinates data ingestion and storage in your Druid cluster. It is not involved in queries. It is responsible for starting new ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers.
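For example, a new batch ingestion job is started by posting a task spec to the Overlord's task API. The hostname is a placeholder; port 8090 is the Overlord's default, and the Router can also proxy this endpoint.

```sh
# Submit a batch ingestion task spec to the Overlord on the Master server.
curl -X POST http://master-server.example.com:8090/druid/indexer/v1/task \
  -H 'Content-Type: application/json' \
  -d @ingestion-spec.json   # your Druid ingestion task spec
```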
Imply Manager
The Imply Manager lets you perform these tasks from an easy-to-use, point-and-click interface:
- Set up Druid and Pivot.
- Start, stop and shut down clusters.
- Apply version updates to a cluster, by rolling update or all at once.
- Monitor the cluster and access server logs for troubleshooting.
Select hardware
The following describes general Imply deployment and machine guidelines. Your unique environment may have more specific requirements.
A simple, medium-sized cluster—often the starting point for an Imply deployment—needs one Master server, one Query server, and as many Data servers as necessary to index and store data.
For clusters with complex resource allocation needs, you can deploy the data server components individually. This allows you to scale Druid Historical Nodes independently of Druid MiddleManagers, and eliminates the possibility of resource contention between historical workloads and real-time workloads.
The following recommendations outline machine requirements for a medium-sized cluster.
Data servers
Data servers run Druid Historical services (for storage and processing of large amounts of immutable data) and Druid MiddleManagers (for data ingestion and processing). These servers benefit greatly from CPU, RAM, and SSDs. Recommended machine specifications, as a starting point, are:
| Bare metal | AWS |
|---|---|
| 8 vCPUs, 61 GB RAM, variable EBS storage (recommended minimum 160 GB SSD) | r4.2xlarge |
Query servers
Query servers are the endpoints that users and client applications interact with. Query servers run a Druid Broker that routes queries to the appropriate data nodes, and a Druid Router that acts as a thin reverse proxy layer and unified query and API endpoint. They include Pivot as a way to directly explore and visualize your data, as well as Druid's native SQL and JSON-over-HTTP query support. These servers benefit greatly from CPU and RAM. Recommended machine specifications, as a starting point, are:
| Bare metal | AWS |
|---|---|
| 8 vCPUs, 61 GB RAM, variable EBS storage (recommended minimum 20 GB SSD) | r4.2xlarge |
Master servers
Master servers coordinate data ingestion and storage in your Druid cluster. They are not involved in queries. They are responsible for coordinating ingestion jobs and for handling failover of the Druid Historical Node and Druid MiddleManager processes running on your Data servers.
Master servers can be deployed standalone, or in a highly available configuration with failover. For failover configurations, we recommend running ZooKeeper and the metadata store on their own hardware.
Recommended machine specifications, as a starting point, are:
| Bare metal | AWS |
|---|---|
| 4 vCPUs, 16 GB RAM, variable EBS storage (recommended minimum 20 GB SSD) | m5.xlarge |
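In a failover configuration, the Druid services find the external ZooKeeper ensemble and metadata store through their common properties. The following is a sketch assuming MySQL as the metadata store; hostnames are placeholders, and the file path follows the Imply distribution layout.

```sh
# Excerpt of the common Druid properties; MySQL also requires the
# mysql-metadata-storage extension to be loaded.
cat >> conf/druid/_common/common.runtime.properties <<'EOF'
druid.zk.service.host=zk1.example.com,zk2.example.com,zk3.example.com
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://metadata.example.com:3306/druid
druid.metadata.storage.connector.user=druid
druid.metadata.storage.connector.password=changeme
EOF
```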
Deep storage
Druid relies on a distributed filesystem or binary object store for data storage. The backing deep storage systems commonly used with Druid include Amazon S3 (popular for those on AWS), HDFS (popular for those who already have Hadoop in their environment), Microsoft Azure, or Google Cloud Storage (GCS).
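For example, pointing Druid at S3 deep storage means loading the S3 extension and setting the storage properties. The bucket and prefix below are placeholders.

```sh
# Excerpt of the common Druid properties for S3 deep storage.
cat >> conf/druid/_common/common.runtime.properties <<'EOF'
druid.extensions.loadList=["druid-s3-extensions"]
druid.storage.type=s3
druid.storage.bucket=my-druid-bucket
druid.storage.baseKey=druid/segments
EOF
```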
For more information, see the deep storage guidance for the Helm chart and the Deep Storage section of the Druid documentation.