Disaster recovery for Imply Enterprise Hybrid
Disaster Recovery (DR) refers to an organization's ability to regain access to its infrastructure and maintain continuity in the event of a natural or human-induced disaster. It comprises the processes, tools, and policies designed to maintain production during an outage and to restore system operations to their original state after the incident.
This topic covers key concepts of developing and implementing a DR strategy for Imply Hybrid (formerly Imply Cloud).
Architecture
An Imply deployment consists of two main parts: a control plane and a data plane. In the Imply Hybrid deployment model, Imply manages the control plane while you manage the data plane in your VPC.
The control plane runs in AWS and has the following components:
- Imply Manager, which manages Imply deployments
- Clarity for troubleshooting performance issues
The data plane has the following components:
- Master, Query, and Data servers that are responsible for ingesting data, querying data, and serving Pivot
- ZooKeeper, which acts as the coordination service for Imply
- Metadata storage in Amazon RDS for segments and Pivot assets
- Deep storage in Amazon S3 for ingested data segments
When designing your DR strategy, consider the processes running in both the control plane and the data plane. Note that the different parts of each plane may have multiple components or services associated with them.
High availability
The control plane of Imply Hybrid is deployed across multiple Availability Zones within a region. It is highly available by design.
Your data plane is considered highly available if you have at least three master nodes, more than one query node, more than one data node, and your data sources are replicated. For more information about the requirements for nodes, see Imply Cloud instance types.
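As an illustration of datasource replication, the following sketch uses the Apache Druid Coordinator API to set a load rule that keeps two replicas of every segment for a datasource. The Coordinator host, datasource name, and tier are placeholders, and in practice you may manage retention rules through the Druid console instead.

```bash
# Sketch: require two replicas of every segment for a hypothetical datasource.
# Assumes direct access to the Druid Coordinator API; "wikipedia" and the
# default tier are placeholders.
curl -X POST "http://COORDINATOR_HOST:8081/druid/coordinator/v1/rules/wikipedia" \
  -H "Content-Type: application/json" \
  -d '[{"type": "loadForever", "tieredReplicants": {"_default_tier": 2}}]'
```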
Craft your DR strategy
You can use either an active-active or an active-passive DR strategy with Imply Hybrid:
- The active-active strategy offers better availability, Recovery Point Objectives (RPO), and Recovery Time Objectives (RTO), but it costs more because you are running two active deployments concurrently. In addition, you need to configure ingest pipelines and associated processors in the primary and secondary deployments.
- The active-passive strategy doesn't offer the same availability, RPO, and RTO as active-active, but the associated costs are lower because you don't spin up the backup instance until it's required.
For an effective DR strategy, you need to have procedures in place before an incident occurs. The planning can be broken down into the following categories:
Metrics
When planning a DR strategy, there are three main metrics to consider:
Recovery Point Objective (RPO) is the maximum acceptable period of data loss when a major incident occurs. You need to decide on two RPO numbers: one for the data stored in your storage service, such as S3, and one for Imply objects such as data cubes, dashboards, and segments.
Recovery Time Objective (RTO) is the amount of downtime that's acceptable before the backup instance is available, whether it's an active or passive instance.
Cost needs to be considered from two angles: the cost of downtime and the cost of running the DR strategy.
Evaluating RPO, RTO, and cost metrics can help you determine which DR strategy is best suited for your business needs, budget, and recovery goals.
People
Determine which teams need to be available to execute the DR plan. Since Imply Hybrid runs in your AWS VPC, users tasked with implementing your DR strategy must have permissions for both Imply Hybrid and your AWS infrastructure.
In addition, develop a plan to communicate with affected users.
Tools
Identify the tools you need for setting up and deploying clusters as well as for modifying network connectivity. Ideally, you should be able to replicate all of your configurations in a different region. In addition, you'll need to replicate metadata manually, so familiarize yourself with database backup tools like mysqldump.
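For example, a metadata backup with mysqldump might look like the following sketch. The hostname, user, and database name are placeholders; adjust them to match your RDS configuration.

```bash
# Sketch: dump the Druid metadata database from the primary RDS instance.
# Hostname, user, and database name ("druid") are assumptions; use the values
# configured for your deployment.
mysqldump -h primary-rds.example.com -u admin -p \
  --single-transaction druid > druid_metadata.sql
```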
DR strategies
Implementing either an active-active or active-passive DR strategy requires access to both Imply Hybrid and AWS. Verify that you can make changes to both environments, including spinning up new resources.
Active-active
Active-active DR involves using a primary Imply deployment with a secondary deployment running alongside it. When an incident occurs, the impact on users should be minimal since the secondary deployment is identical to the primary and is already running.
The following image shows the architecture of an active-active setup:
To implement an active-active strategy, perform the following steps:
- In Imply Hybrid, create an account in a region different from the region hosting your primary Imply account.
- Identify the cluster sizing requirements for the secondary data plane deployment. Imply recommends using identical primary and secondary clusters.
- Create and start the secondary data plane cluster.
- Make sure that the S3 bucket you use is different from the S3 bucket for your primary data plane. Imply Manager creates a new RDS database and uses the provided S3 bucket.
- Use the Imply Manager UI to copy any advanced configurations from the primary region to the secondary region. You need to implement a process to keep the configuration in the secondary region in sync with any updates you make to the primary region.
- Export Druid metadata from your primary RDS instance to the secondary instance with a tool like mysqldump (a sketch of the import and segment copy follows this list):
  - Connect the primary RDS instance to the secondary region.
  - Import the records.
  - As part of the import process, update the records in the `druid_segments` table to point to the secondary region's S3 bucket. Specifically, update the `loadSpec.path` property.
- Copy the S3 bucket with the segment data from the primary region to the secondary region.
- Configure your upstream data sources so that the same data flows to both the primary and secondary instances at the same time. This also means that you need to create your batch and real-time ingestion pipelines in both clusters (a sketch of submitting the same supervisor spec to both clusters follows this list).
- Optionally, consider replicating raw data storage across regions. The diagram preceding these steps includes replicated data sources.
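The following sketch illustrates the metadata import and segment copy steps above. It assumes MySQL-compatible RDS instances, a metadata database named druid, and bucket names imply-primary-segments and imply-secondary-segments; all of these are placeholders. The string replacement over the segment payload is one possible way to repoint the load spec at the secondary bucket; verify the resulting records before relying on them.

```bash
# Sketch: load a metadata dump (see the Tools section) into the secondary
# region's RDS instance, repoint segment load specs at the secondary bucket,
# and copy segment data across regions. All names are placeholders.

# 1. Import the dumped metadata into the secondary RDS instance.
mysql -h secondary-rds.example.com -u admin -p druid < druid_metadata.sql

# 2. Rewrite the bucket reference inside each segment's load spec.
#    druid_segments.payload holds the segment descriptor JSON, including loadSpec.
mysql -h secondary-rds.example.com -u admin -p druid -e \
  "UPDATE druid_segments
   SET payload = REPLACE(payload, 'imply-primary-segments', 'imply-secondary-segments');"

# 3. Copy the segment data itself to the secondary region's bucket.
aws s3 sync s3://imply-primary-segments s3://imply-secondary-segments \
  --source-region us-east-1 --region us-west-2
```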
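For streaming ingestion, one way to keep both clusters ingesting the same data is to submit the same supervisor spec to each cluster's Druid API, as in the sketch below. The router hostnames and the kafka-ingestion.json spec file are assumptions.

```bash
# Sketch: submit an identical Kafka supervisor spec to the primary and
# secondary clusters so both ingest the same stream. Hostnames and the
# spec file are placeholders.
for HOST in primary-router.example.com secondary-router.example.com; do
  curl -X POST "https://${HOST}:8888/druid/indexer/v1/supervisor" \
    -H "Content-Type: application/json" \
    -d @kafka-ingestion.json
done
```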
Active-passive
Active-passive DR involves using a primary Imply deployment with a secondary deployment that's ready but doesn't run concurrently with the primary deployment. When an incident occurs, the secondary deployment is spun up.
This approach relies on AWS cross-region replication capabilities to keep the replica in the secondary region updated, which requires replicating the metadata and deep storage components.
The following image shows the architecture of an active-passive setup:
To implement an active-passive strategy, perform the following steps:
- In Imply Hybrid, create an account in a region different from the region hosting your primary Imply account.
- Identify the cluster sizing requirements for the secondary data plane deployment. Imply recommends using identical primary and secondary clusters.
- Create the secondary data plane cluster. Make sure to stop the secondary cluster after it starts. With an active-passive strategy, the secondary cluster doesn't run until needed.
- Use the Imply Manager UI to copy any advanced configurations from the primary region to the secondary region. You need to implement a process to keep the configuration in the secondary region in sync with any updates you make to the primary region.
- Create an Amazon RDS read replica of your database instance in the secondary region (see the read replica sketch after this list). You can enable RDS cross-region replicas using one of the following methods:
  - through the AWS Management Console
  - by running the `create-db-instance-read-replica` AWS CLI command
  - by calling the `CreateDBInstanceReadReplica` API operation
  For details on creating a cross-region read replica, see Creating a read replica in a different AWS region.
- Enable Cross-Region Replication (CRR) on the S3 bucket in your primary region (see the replication sketch after this list). By default, S3 replicates only new objects, so make sure to enable replication of existing objects as well. For more information, see Replicating objects.
- Use the Manager UI in the secondary region to point deep storage to the S3 bucket in the secondary region that has the replicated data.
- Enable connectivity between the primary and the secondary regions. Imply recommends setting up a transit gateway system to peer VPCs across the primary and secondary regions. If you access Pivot using direct access or query the Druid API directly, make sure to configure your network to handle region failover.
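As a sketch of the read replica step, the following AWS CLI call creates a cross-region replica. The instance identifiers, account ID, and regions are placeholders; run the command in the secondary (destination) region and pass the source instance as an ARN.

```bash
# Sketch: create a cross-region read replica of the primary metadata database.
# Identifiers, account ID, and regions are placeholders.
aws rds create-db-instance-read-replica \
  --db-instance-identifier imply-metadata-replica \
  --source-db-instance-identifier arn:aws:rds:us-east-1:123456789012:db:imply-metadata \
  --region us-west-2
```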
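The next sketch outlines enabling Cross-Region Replication with the AWS CLI. It assumes an IAM role with S3 replication permissions already exists; the bucket names, role ARN, and rule are placeholders. Note that replicating objects that already exist in the bucket requires S3 Batch Replication, which is not shown here.

```bash
# Sketch: replicate new objects from the primary bucket to the secondary bucket.
# Bucket names and the IAM role ARN are placeholders.

# CRR requires versioning on both buckets.
aws s3api put-bucket-versioning --bucket imply-primary-segments \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket imply-secondary-segments \
  --versioning-configuration Status=Enabled

# Attach a replication rule to the primary (source) bucket.
aws s3api put-bucket-replication --bucket imply-primary-segments \
  --replication-configuration '{
    "Role": "arn:aws:iam::123456789012:role/s3-crr-role",
    "Rules": [{
      "ID": "replicate-segments",
      "Status": "Enabled",
      "Prefix": "",
      "Destination": {"Bucket": "arn:aws:s3:::imply-secondary-segments"}
    }]
  }'
```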
Recover from an incident
The following sections describe failover and failback procedures for an active-passive solution.
In a DR situation, it's important to restore business operations and workflows from the standby system to the primary one as quickly and efficiently as possible while accounting for disaster-related issues and potential security risks.
With an active-passive setup, safeguarding information in a disaster situation is a two-stage process:
- In the initial stage, called failover, you direct your new data to a secondary system not affected by the outage.
- During the second stage, called failback, you copy the data recorded by the secondary system during the outage to the primary system.
Failover
Failover is the process of switching to a secondary system or region when the primary system or region encounters a serious outage such as an extended power failure or a natural disaster.
The following image shows the architecture of a failover setup in Imply Hybrid:
To implement failover, follow these steps:
- Configure the secondary region as described in the active-passive implementation.
- Start the cluster in the secondary region.
- Copy all the tables from the RDS replica to the RDS database in the secondary region (see the sketch after this list). Because the RDS read replica is read-only, you cannot use it as your main RDS instance. You need to implement a process to keep the RDS database current to meet your RPO.
- Export Druid metadata from your primary RDS instance to the secondary instance with a tool like mysqldump:
  - Connect the primary RDS instance to the secondary region.
  - Import the records.
  - As part of the import process, update the records in the `druid_segments` table to point to the secondary region's S3 bucket. Specifically, update the `loadSpec.path` property.
- Access the S3 bucket in the secondary region and break the replication so that it becomes the primary bucket and is no longer updated once the primary region comes online.
- Implement a failover strategy for your upstream data sources. Verify that upstream data sources can connect to the Imply cluster in the secondary region.
- Based on how you query the data, do the following:
- If you use Pivot proxied via Imply Manager, use the vanity domain in the secondary region to query data.
- If you use direct-access Pivot or query Druid directly, then route traffic to the new region.
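The following sketch illustrates the failover-specific commands: dumping metadata from the read replica into the secondary region's RDS instance and removing the replication rule from the primary bucket. Hostnames, bucket names, and the database name are placeholders; the `loadSpec` update is the same as shown for the active-active setup.

```bash
# Sketch: copy metadata from the read-only replica into the secondary region's
# RDS instance, then stop replication on the primary bucket. All names are
# placeholders.

# 1. Dump from the read replica (read-only, but it can be dumped).
mysqldump -h imply-metadata-replica.example.com -u admin -p \
  --single-transaction druid > failover_metadata.sql

# 2. Import into the RDS instance created by Imply Manager in the secondary region.
mysql -h secondary-rds.example.com -u admin -p druid < failover_metadata.sql

# 3. Remove the replication rule so the secondary bucket is no longer overwritten
#    when the primary region comes back online (requires the primary region's
#    S3 API to be reachable).
aws s3api delete-bucket-replication --bucket imply-primary-segments
```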
Failback
Failback is the process of restoring a system, component, or service that was previously in a state of failure back to its original, working state, and returning the standby system to standby mode.
Consider incorporating the following steps into your Imply failback strategy:
- Once network connectivity to the primary region is restored, replay the data from the time the incident occurred. This is necessary to update Druid segments with the new data.
- If you are streaming data from Kafka or Kinesis and assuming you are within the configured retention window, Druid automatically resumes from the last ingested offset upon coming back online.
- If you are ingesting batch data, you must resubmit any ingestion jobs that failed during the outage (see the sketch after this list).
- Wait for the data replenishment process to complete before routing users to the primary region.
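As a sketch of resubmitting failed batch work, the following call posts an ingestion task spec to the Druid task API in the primary region. The router hostname and the batch-ingestion.json spec file are placeholders; in practice, resubmit whichever task specs failed during the outage.

```bash
# Sketch: resubmit a batch ingestion task to the primary cluster after failback.
# Hostname and spec file are placeholders.
curl -X POST "https://primary-router.example.com:8888/druid/indexer/v1/task" \
  -H "Content-Type: application/json" \
  -d @batch-ingestion.json
```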