2022.07

2022.07

  • Imply
  • Ingest
  • Query
  • Visualize
  • Administer
  • Deploy

›Multi-stage query

Ingestion

  • Ingestion overview
  • Supported file formats
  • Data model
  • Data rollup
  • Partitioning
  • Ingestion spec
  • Schema design tips
  • Data management
  • Compaction
  • Automatic compaction
  • Troubleshooting FAQ

Multi-stage query

  • Overview
  • Concepts
  • Setup
  • UI walkthrough
  • Tutorial - Connect external data
  • Tutorial - Convert ingestion spec
  • Queries
  • API
  • Security
  • Advanced configs
  • Reference
  • Release notes

Stream ingestion

  • Apache Kafka ingestion
  • Apache Kafka supervisor
  • Apache Kafka operations
  • Amazon Kinesis
  • Tranquility
  • Realtime Process

Batch ingestion

  • Native batch
  • Simple task indexing
  • Input sources
  • Firehose
  • Hadoop-based
  • Load Hadoop data via Amazon EMR

Ingestion reference

  • Ingestion
  • Data formats
  • Task reference
  • Nested columns

Advanced configs

The Multi-Stage Query (MSQ) Framework is a preview feature available starting in Imply 2022.06. Preview features enable early adopters to benefit from new functionality while providing ongoing feedback to help shape and evolve the feature. All functionality documented on this page is subject to change or removal in future releases. Preview features are provided "as is" and are not subject to Imply SLAs.

Durable storage for mesh shuffle

By default, the Multi-Stage Query (MSQ) Framework uses the local storage of a node to store data from intermediate steps when executing a query. Although this method provides MSQ with better speed when executing a query, the data is lost if the node encounters an issue. When you enable durable storage, MSQ stores intermediate data in S3 instead. In essence, you trade some performance for better reliability. This is especially useful for long running queries.

To use durable storage for mesh shuffles:

  • Enable durable storage for mesh shuffle
  • Use the appropriate security settings for S3
  • Include the following context variable when you submit a query:

UI

--:context msqDurableShuffleStorage: true

API

"context": {
    "msqDurableShuffleStorage": true
}

The following table describes the properties used to configure durable storage for MSQ:

ConfigDescriptionRequiredDefault
druid.msq.intermediate.storage.enableSet to true to enable this feature.YesNone
druid.msq.intermediate.storage.typeMust be set to s3.YesNone
druid.msq.intermediate.storage.bucketS3 bucket to store intermediate stage. results.YesNone
druid.msq.intermediate.storage.prefixS3 prefix to store intermediate stage results. Provide a unique value for the prefix. Clusters should not share the same prefix.YesNone
druid.msq.intermediate.storage.tempDirDirectory path on the local disk to store intermediate stage results. temporarily.YesNone
druid.msq.intermediate.storage.maxResultsSizeMax size of each partition file per stage. It should be between 5MiB and 5TiB. Supports a human-readable format. For eg if a stage has 50 partitions we can effectively use s3 up to 250TIB of stage output assuming each partition file <=5TiB.No100MiB
druid.msq.intermediate.storage.chunkSizeImply recommends using the default value for most cases. This property defines the size of each chunk to temporarily store in druid.msq.intermediate.storage.tempDir. Druid computes the chunk size automatically if this property is not set. The chunk size must be between 5MiB and 5GiB.NoNone
druid.msq.intermediate.storage.maxTriesOnTransientErrorsImply recommends using the default value for most cases. This property defines the max number times to attempt S3 API calls to avoid failures due to transient errors.no10
Last updated on 7/21/2022
← SecurityReference →
  • Durable storage for mesh shuffle
2022.07
Key links
Try ImplyApache Druid siteImply GitHub
Get help
Stack OverflowSupportContact us
Learn more
Apache Druid forumsBlog
Copyright © 2022 Imply Data, Inc