Cascading reindexing

Cascading reindexing is an experimental feature introduced in Druid 37. Its API may change in future releases. This feature is only for automatic compaction using compaction supervisors with the MSQ task engine.

Cascading reindexing is a compaction supervisor template that lets you define age-based rules to automatically apply different compaction configurations as data ages. Instead of a single flat compaction configuration for an entire datasource, you define rules of the form "for data older than X, apply configuration Y." Reindexing is a more general term than compaction: it can not only merge segments that share a schema and partitioning, but also change the segment schema, partitioning, and encoding. Cascading reindexing gives you fine-grained control over how your data evolves over time.

For example, you might want to:

  • Keep recent data in hourly segments, but coarsen to daily segments after 90 days to help reduce segment count and storage footprint.
  • Delete some unwanted rows from data older than 30 days.
  • Change compression settings for older data.
  • Roll up older data to a coarser query granularity.

Cascading reindexing handles all of this automatically by generating a timeline of compaction intervals and applying the appropriate rules to each interval.

Prerequisites

Before using cascading reindexing, ensure your cluster meets the following requirements:

  • Compaction supervisors enabled: Set useSupervisors to true in the compaction dynamic config.
  • MSQ compaction engine: Set engine to msq in the compaction dynamic config or in the supervisor spec.
  • Incremental segment metadata caching: Set druid.manager.segments.useIncrementalCache to always or ifSynced in your Overlord and Coordinator runtime properties. See Segment metadata caching.
  • At least two compaction task slots: The MSQ task engine requires at least two tasks (one controller, one worker).

How cascading reindexing works

Rule-based configuration

Cascading reindexing uses a rule-based system where each rule controls a specific aspect of compaction and specifies an age threshold for when it applies. There are four rule types, each controlling an orthogonal aspect of the compaction config output for an interval:

| Rule type | What it controls | Additive? |
| --- | --- | --- |
| Partitioning | Segment granularity, partitions spec, optional virtual columns for range partitioning | No |
| Deletion | Rows to remove from segments | Yes |
| Index spec | Compression and encoding settings | No |
| Data schema | Dimensions, metrics, query granularity, rollup, projections | No |

Every rule has an olderThan field — an ISO 8601 period that defines the age threshold. A rule with "olderThan": "P30D" applies to data whose interval ends before the current time minus 30 days.
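
The threshold arithmetic can be illustrated with a short Python sketch. The helper names `cutoff` and `rule_applies` are hypothetical, and the sketch handles only day-based periods such as `P30D`; Druid's own period handling is more general.

```python
import re
from datetime import datetime, timedelta, timezone

def cutoff(now: datetime, older_than: str) -> datetime:
    """Return `now` minus a day-based ISO 8601 period such as 'P30D'."""
    match = re.fullmatch(r"P(\d+)D", older_than)
    if match is None:
        raise ValueError(f"this sketch only handles day-based periods: {older_than}")
    return now - timedelta(days=int(match.group(1)))

def rule_applies(interval_end: datetime, now: datetime, older_than: str) -> bool:
    """Interval ends are exclusive, so an interval ending exactly at the cutoff qualifies."""
    return interval_end <= cutoff(now, older_than)

now = datetime(2026, 3, 26, tzinfo=timezone.utc)
print(cutoff(now, "P30D").date())
```

With the current time set to 2026-03-26, a `P30D` rule covers intervals that end on or before 2026-02-24.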

Additive vs non-additive rules

Non-additive rules (partitioning, index spec, data schema): Only one rule of each type can apply per interval. When multiple rules of the same type match, the rule with the oldest threshold (largest period) takes precedence.

Additive rules (deletion): Multiple deletion rules can apply to the same interval. When they do, they combine as NOT(A OR B OR C), where A, B, and C are the deleteWhere filters from each rule. In other words, the compacted data retains only the rows that don't match any of the deletion filters.
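
As a minimal sketch of these precedence rules, the following Python function resolves which rules apply to data of a given age. The rule dictionaries and the `older_than_days` key are simplifications invented for illustration; they are not Druid's internal representation.

```python
def resolve(rules, age_days):
    """Return (non-additive rule id per type, list of deletion rule ids)."""
    matching = [r for r in rules if age_days >= r["older_than_days"]]
    winners = {}
    deletions = []
    for rule in matching:
        if rule["type"] == "deletion":
            # Additive: every matching deletion rule is kept.
            deletions.append(rule["id"])
            continue
        # Non-additive: the rule with the largest period wins.
        current = winners.get(rule["type"])
        if current is None or rule["older_than_days"] > current["older_than_days"]:
            winners[rule["type"]] = rule
    return {rule_type: r["id"] for rule_type, r in winners.items()}, deletions

rules = [
    {"id": "hourly-7d", "type": "partitioning", "older_than_days": 7},
    {"id": "daily-90d", "type": "partitioning", "older_than_days": 90},
    {"id": "drop-robots-30d", "type": "deletion", "older_than_days": 30},
    {"id": "drop-nulls-90d", "type": "deletion", "older_than_days": 90},
]
print(resolve(rules, age_days=120))
```

For data that is 120 days old, both partitioning rules match but only `daily-90d` is selected, while both deletion rules are retained.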

Timeline generation

The cascading reindexing template generates a timeline of non-overlapping search intervals, each with its own set of applicable rules. Here is how the timeline is constructed:

  1. Build a base timeline from partitioning rules. Each partitioning rule defines a segment granularity and an age threshold. The template sorts rules by threshold (oldest first) and creates intervals with boundaries aligned to each rule's segment granularity.

  2. Split at non-partitioning rule thresholds. If deletion, index spec, or data schema rules have thresholds that fall inside a base interval, the template splits that interval at the threshold (aligned to the interval's segment granularity). This ensures rules are applied as precisely as possible.

  3. Validate granularity ordering. The template validates that segment granularity stays the same or becomes finer as you move from past to present. For example, DAY for old data and HOUR for recent data is valid, but HOUR for old data and DAY for recent data is not.
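
The ordering check in step 3 can be sketched as follows. The `RANK` table and the function name are assumptions made for illustration; the sketch covers only the granularities listed in this document.

```python
# Coarseness rank: a higher number means a coarser granularity.
RANK = {
    "MINUTE": 0, "FIFTEEN_MINUTE": 1, "HOUR": 2, "DAY": 3,
    "MONTH": 4, "QUARTER": 5, "YEAR": 6,
}

def validate_granularity_order(granularities_oldest_first):
    """Raise ValueError if granularity becomes coarser toward the present."""
    pairs = zip(granularities_oldest_first, granularities_oldest_first[1:])
    for older, newer in pairs:
        if RANK[newer] > RANK[older]:
            raise ValueError(f"{newer} (newer) is coarser than {older} (older)")

validate_granularity_order(["DAY", "HOUR"])  # valid: finer toward the present
```

Passing `["HOUR", "DAY"]` raises `ValueError`, which mirrors the invalid case described above.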

Timeline generation example

Suppose the current time is 2026-03-26T00:00:00Z and you configure the following rules:

  • Partitioning rule A: olderThan: P7D, segmentGranularity: HOUR
  • Partitioning rule B: olderThan: P90D, segmentGranularity: DAY
  • Deletion rule C: olderThan: P30D, deleteWhere: isRobot = true

The template generates these search intervals:

| Search interval | Segment granularity | Source | Active rules |
| --- | --- | --- | --- |
| [-inf, 2025-12-26) | DAY | Partitioning rule B | B, C |
| [2025-12-26, 2026-02-24) | HOUR | Partitioning rule A | A, C |
| [2026-02-24, 2026-03-19) | HOUR | Partitioning rule A | A |

How this works step by step:

  1. Partitioning rule B (olderThan: P90D) creates the interval [-inf, 2025-12-26) with DAY granularity. Partitioning rule A (olderThan: P7D) creates [2025-12-26, 2026-03-19) with HOUR granularity.
  2. Deletion rule C (olderThan: P30D) has a threshold of 2026-02-24. This falls inside rule A's interval, so that interval is split at 2026-02-24 (which is already DAY-aligned). The older sub-interval [2025-12-26, 2026-02-24) picks up deletion rule C; the newer sub-interval [2026-02-24, 2026-03-19) does not.
  3. Granularity validation passes because DAY (older) to HOUR (newer) is valid — granularity becomes finer toward the present.
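
The walkthrough above can be reproduced with a simplified Python sketch. It is not Druid's implementation: the function name and tuple layout are hypothetical, only deletion rules are handled as non-partitioning rules, and boundary alignment is omitted because every threshold in this example already falls at midnight.

```python
from datetime import datetime, timedelta, timezone

def build_timeline(now, partitioning, deletions):
    """partitioning: [(id, days, granularity)]; deletions: [(id, days)]."""
    # Step 1: base timeline, one interval per partitioning rule, oldest first.
    intervals, start = [], None  # None stands for -inf
    for rule_id, days, granularity in sorted(partitioning, key=lambda r: r[1], reverse=True):
        end = now - timedelta(days=days)
        intervals.append({"start": start, "end": end,
                          "granularity": granularity, "rules": [rule_id]})
        start = end
    # Step 2: split at deletion thresholds that fall strictly inside an interval.
    for _, days in deletions:
        threshold = now - timedelta(days=days)
        for i, iv in enumerate(intervals):
            after_start = iv["start"] is None or iv["start"] < threshold
            if after_start and threshold < iv["end"]:
                newer = dict(iv, start=threshold, rules=list(iv["rules"]))
                iv["end"] = threshold
                intervals.insert(i + 1, newer)
                break
    # Attach each deletion rule to the intervals that end at or before its threshold.
    for iv in intervals:
        for rule_id, days in deletions:
            if iv["end"] <= now - timedelta(days=days):
                iv["rules"].append(rule_id)
    return intervals

now = datetime(2026, 3, 26, tzinfo=timezone.utc)
timeline = build_timeline(now, [("A", 7, "HOUR"), ("B", 90, "DAY")], [("C", 30)])
for iv in timeline:
    start = iv["start"].date() if iv["start"] else "-inf"
    print(start, iv["end"].date(), iv["granularity"], iv["rules"])
```

The sketch yields three intervals with boundaries at 2025-12-26, 2026-02-24, and 2026-03-19, and attaches deletion rule C only to the two older intervals.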

How defaults work

The template requires defaultSegmentGranularity and defaultPartitionsSpec. These are used for any interval where no partitioning rule matches. This happens in two scenarios:

  1. No partitioning rules defined at all. If you only define deletion, index spec, or data schema rules, all intervals use the default granularity and partitions spec.
  2. Non-partitioning rules have a more recent threshold than the newest partitioning rule. For example, if your only partitioning rule is olderThan: P90D but you have a deletion rule with olderThan: P30D, intervals between 30 and 90 days old will use the defaults.
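
The fallback behavior can be sketched in a few lines of Python. The function name and the `older_than_days` key are hypothetical simplifications; the `default*` keys mirror the template properties named above.

```python
def partitioning_for(age_days, partitioning_rules, template):
    """Return (segment granularity, partitions spec, source) for data of a given age."""
    matching = [r for r in partitioning_rules if age_days >= r["older_than_days"]]
    if not matching:
        # No partitioning rule matches, so fall back to the template defaults.
        return (template["defaultSegmentGranularity"],
                template["defaultPartitionsSpec"], "default")
    rule = max(matching, key=lambda r: r["older_than_days"])
    return rule["segmentGranularity"], rule["partitionsSpec"], rule["id"]

template = {
    "defaultSegmentGranularity": "HOUR",
    "defaultPartitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000},
}
rules = [{"id": "daily-90d", "older_than_days": 90,
          "segmentGranularity": "DAY", "partitionsSpec": {"type": "dynamic"}}]
print(partitioning_for(45, rules, template))   # falls back to the defaults
print(partitioning_for(120, rules, template))  # uses the P90D rule
```

Data that is 45 days old sits between a 30-day deletion threshold and the 90-day partitioning threshold, so it takes the default granularity and partitions spec.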

Supervisor spec reference

To submit a cascading reindexing supervisor, wrap the template spec inside a compaction supervisor spec:

{
  "type": "autocompact",
  "spec": {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "ruleProvider": { ... },
    "defaultSegmentGranularity": "DAY",
    "defaultPartitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    }
  }
}

Template properties

The following table describes the properties of the reindexCascade template:

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| type | Must be reindexCascade. | Yes | |
| dataSource | The datasource to compact. | Yes | |
| ruleProvider | Rule provider configuration that supplies reindexing rules. | Yes | |
| defaultSegmentGranularity | Segment granularity used for intervals where no partitioning rule matches. Supported values: MINUTE, FIFTEEN_MINUTE, HOUR, DAY, MONTH, QUARTER, YEAR. | Yes | |
| defaultPartitionsSpec | Partitions spec used for intervals where no partitioning rule matches. See MSQ task engine limitations for supported partitioning types. | Yes | |
| defaultPartitioningVirtualColumns | Virtual columns to use when the range partitioning definition in defaultPartitionsSpec references virtual columns. | No | |
| taskPriority | Priority of compaction tasks. | No | 25 |
| inputSegmentSizeBytes | Maximum total input segment size in bytes per compaction task. | No | 100000000000000 |
| taskContext | Context map passed to compaction tasks. Use this to set MSQ context parameters such as maxNumTasks. | No | |
| skipOffsetFromLatest | ISO 8601 period. Skips data newer than this offset from the end of the latest segment. Mutually exclusive with skipOffsetFromNow. | No | |
| skipOffsetFromNow | ISO 8601 period. Skips data newer than this offset from the current time. Mutually exclusive with skipOffsetFromLatest. | No | |
| tuningConfig | Tuning config for compaction tasks. You cannot set partitionsSpec inside tuningConfig, because partitioning is controlled by the rules and the template defaults. | No | |
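
Before submitting a spec, it can help to check these constraints on the client side. The following Python sketch is a hypothetical pre-check based only on the properties in this table; it is not Druid's own validation and does not cover every rule Druid enforces.

```python
REQUIRED = ("dataSource", "ruleProvider",
            "defaultSegmentGranularity", "defaultPartitionsSpec")

def validate_template(spec: dict) -> list:
    """Return a list of problems found in a reindexCascade template spec."""
    errors = []
    if spec.get("type") != "reindexCascade":
        errors.append("type must be reindexCascade")
    for field in REQUIRED:
        if field not in spec:
            errors.append(f"missing required field: {field}")
    if "skipOffsetFromLatest" in spec and "skipOffsetFromNow" in spec:
        errors.append("skipOffsetFromLatest and skipOffsetFromNow are mutually exclusive")
    if "partitionsSpec" in spec.get("tuningConfig", {}):
        errors.append("partitionsSpec must not be set inside tuningConfig")
    return errors

spec = {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "ruleProvider": {"type": "inline", "deletionRules": []},
    "defaultSegmentGranularity": "DAY",
    "defaultPartitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000},
}
print(validate_template(spec))
```

An empty list means the spec passed these particular checks; Druid may still reject it for reasons the sketch does not model.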

Rule provider types

A rule provider supplies the reindexing rules to the template. Druid supports two provider types.

Inline provider

The inline provider (type: inline) defines rules directly in the supervisor spec. This is currently the only concrete implementation.

{
  "type": "inline",
  "partitioningRules": [ ... ],
  "deletionRules": [ ... ],
  "indexSpecRules": [ ... ],
  "dataSchemaRules": [ ... ]
}

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| type | Must be inline. | Yes | |
| partitioningRules | List of partitioning rules. | No | [] |
| deletionRules | List of deletion rules. | No | [] |
| indexSpecRules | List of index spec rules. | No | [] |
| dataSchemaRules | List of data schema rules. | No | [] |

At least one rule must be defined across all rule lists.

Composing provider

The composing provider (type: composing) chains multiple rule providers together with first-wins semantics. For each rule type, Druid uses the rules from the first provider that has non-empty rules of that type.

This rule provider exists in anticipation of future community-contributed providers, such as a provider that sources rules from the Druid Catalog.

{
  "type": "composing",
  "providers": [
    { "type": "inline", "partitioningRules": [ ... ] },
    ...
  ]
}

| Field | Description | Required | Default |
| --- | --- | --- | --- |
| type | Must be composing. | Yes | |
| providers | Ordered list of rule providers. Provider order determines precedence. | Yes | |

The composing provider is ready only when all child providers are ready.
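
The first-wins semantics can be sketched in Python as follows. The `compose` function is hypothetical and treats each provider as a plain dictionary shaped like the inline provider; it illustrates the selection logic only, not readiness handling.

```python
RULE_TYPES = ("partitioningRules", "deletionRules",
              "indexSpecRules", "dataSchemaRules")

def compose(providers):
    """For each rule type, take the rules from the first provider that has any."""
    composed = {}
    for rule_type in RULE_TYPES:
        composed[rule_type] = next(
            (p[rule_type] for p in providers if p.get(rule_type)), []
        )
    return composed

providers = [
    {"type": "inline", "partitioningRules": [{"id": "daily-30d"}]},
    {"type": "inline",
     "partitioningRules": [{"id": "monthly-365d"}],
     "deletionRules": [{"id": "remove-robots-90d"}]},
]
print(compose(providers))
```

The partitioning rules come from the first provider only, while the deletion rules come from the second provider because the first provider defines none.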

Reindexing rule types

All rule types share the following common fields:

| Field | Description | Required |
| --- | --- | --- |
| id | Unique identifier for the rule. | Yes |
| description | Human-readable description. | No |
| olderThan | ISO 8601 period defining the age threshold. The rule applies to data older than the current time minus this period. Must be non-negative. | Yes |

Partitioning rules

Partitioning rules control how data is physically laid out into segments. This includes the time bucketing (segment granularity) and how data within a time bucket is split (partitions spec).

This is a non-additive rule — only one partitioning rule applies per interval.

| Field | Description | Required |
| --- | --- | --- |
| id | Rule identifier. | Yes |
| description | Human-readable description. | No |
| olderThan | ISO 8601 period. | Yes |
| segmentGranularity | Time granularity for segment buckets. Supported values: MINUTE, FIFTEEN_MINUTE, HOUR, DAY, MONTH, QUARTER, YEAR. | Yes |
| partitionsSpec | Defines how data within each time bucket is split into segments. Supports dynamic and range types. | Yes |
| virtualColumns | Virtual columns for partitioning by nested or derived fields. | No |

Example:

{
  "id": "daily-range-30d",
  "olderThan": "P30D",
  "segmentGranularity": "DAY",
  "partitionsSpec": {
    "type": "range",
    "targetRowsPerSegment": 5000000,
    "partitionDimensions": ["channel", "countryName"]
  },
  "description": "Compact to daily segments with range partitioning for data older than 30 days"
}

Deletion rules

Deletion rules specify rows to remove during compaction. The deleteWhere field defines a Druid filter that matches rows to delete. During processing, Druid wraps these filters in NOT logic — the compacted data retains rows that do not match the filter.

This is an additive rule — multiple deletion rules can apply to the same interval.

| Field | Description | Required |
| --- | --- | --- |
| id | Rule identifier. | Yes |
| description | Human-readable description. | No |
| olderThan | ISO 8601 period. | Yes |
| deleteWhere | A Druid filter matching rows to remove. | Yes |
| virtualColumns | Virtual columns for filtering on nested or derived fields. Virtual column names must be unique and consistent across rule evaluations. | No |

What you write vs what happens:

If you define two deletion rules:

  • Rule 1: deleteWhere: isRobot = true
  • Rule 2: deleteWhere: countryName = null

Druid applies them as: NOT(isRobot = true OR countryName = null). The compacted segments retain only rows where isRobot is not true and countryName is not null.
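
The effect on individual rows can be illustrated with a small Python sketch. The predicates stand in for the two deleteWhere filters, and the sample rows are made up for illustration; this models the logic only, not how Druid evaluates filters.

```python
def retained(rows, delete_predicates):
    """Keep rows matching none of the predicates: NOT(A OR B OR ...)."""
    return [row for row in rows if not any(p(row) for p in delete_predicates)]

# Stand-ins for the deleteWhere filters of rule 1 and rule 2.
delete_predicates = [
    lambda row: row["isRobot"] == "true",
    lambda row: row["countryName"] is None,
]

rows = [
    {"page": "A", "isRobot": "true", "countryName": "US"},   # removed by rule 1
    {"page": "B", "isRobot": "false", "countryName": None},  # removed by rule 2
    {"page": "C", "isRobot": "false", "countryName": "FR"},  # retained
]
print(retained(rows, delete_predicates))
```

Only the row that matches neither filter survives compaction in this model.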

Example:

{
  "id": "remove-robots-90d",
  "olderThan": "P90D",
  "deleteWhere": {
    "type": "equals",
    "column": "isRobot",
    "matchValueType": "STRING",
    "matchValue": "true"
  },
  "description": "Remove robot traffic from data older than 90 days"
}

Index spec rules

Index spec rules control compression and encoding settings for compacted segments, independently of partitioning.

This is a non-additive rule — only one index spec rule applies per interval.

| Field | Description | Required |
| --- | --- | --- |
| id | Rule identifier. | Yes |
| description | Human-readable description. | No |
| olderThan | ISO 8601 period. | Yes |
| indexSpec | An IndexSpec object defining bitmap type, metric compression, and other encoding settings. | Yes |

Example:

{
  "id": "compressed-90d",
  "olderThan": "P90D",
  "indexSpec": {
    "bitmap": { "type": "roaring" },
    "metricCompression": "lz4"
  },
  "description": "Use roaring bitmaps and lz4 compression for data older than 90 days"
}

Data schema rules

Data schema rules control the schema of compacted segments, including dimensions, metrics, query granularity, rollup, and projections.

This is a non-additive rule — only one data schema rule applies per interval. At least one of the optional fields must be non-null.

| Field | Description | Required |
| --- | --- | --- |
| id | Rule identifier. | Yes |
| description | Human-readable description. | No |
| olderThan | ISO 8601 period. | Yes |
| dimensionsSpec | Dimensions config for the compacted segments. | No |
| metricsSpec | Array of aggregator factories for rollup metrics. | No |
| queryGranularity | Query granularity for the compacted segments. | No |
| rollup | Whether to enable rollup. Set to true only when metricsSpec is defined. | No |
| projections | List of aggregate projections. | No |

Example:

{
  "id": "rollup-30d",
  "olderThan": "P30D",
  "queryGranularity": "HOUR",
  "rollup": true,
  "metricsSpec": [
    { "type": "longSum", "name": "added", "fieldName": "added" },
    { "type": "longSum", "name": "deleted", "fieldName": "deleted" }
  ],
  "description": "Roll up to hourly query granularity for data older than 30 days"
}

Example

The following example submits a cascading reindexing supervisor for the wikipedia datasource with one partitioning rule and one deletion rule. The supervisor does the following:

  • Data older than 30 days is compacted into daily range-partitioned segments.
  • Rows that have an isRobot column with a true value are deleted from data older than 90 days.
  • The skipOffsetFromLatest setting skips the most recent day of data.

curl --location --request POST 'http://localhost:8081/druid/indexer/v1/supervisor' \
--header 'Content-Type: application/json' \
--data-raw '{
  "type": "autocompact",
  "spec": {
    "type": "reindexCascade",
    "dataSource": "wikipedia",
    "defaultSegmentGranularity": "HOUR",
    "defaultPartitionsSpec": {
      "type": "dynamic",
      "maxRowsPerSegment": 5000000
    },
    "skipOffsetFromLatest": "P1D",
    "ruleProvider": {
      "type": "inline",
      "partitioningRules": [
        {
          "id": "daily-30d",
          "olderThan": "P30D",
          "segmentGranularity": "DAY",
          "partitionsSpec": {
            "type": "range",
            "targetRowsPerSegment": 5000000,
            "partitionDimensions": ["channel", "countryName"]
          },
          "description": "Compact to daily range-partitioned segments after 30 days"
        }
      ],
      "deletionRules": [
        {
          "id": "remove-bots-90d",
          "olderThan": "P90D",
          "deleteWhere": {
            "type": "equals",
            "column": "isRobot",
            "matchValueType": "STRING",
            "matchValue": "true"
          },
          "description": "Remove robot edits from data older than 90 days"
        }
      ]
    },
    "taskContext": {
      "maxNumTasks": 3
    }
  }
}'

This creates three timeline intervals:

  • [-inf, now - 90D): DAY granularity, bot edits deleted.
  • [now - 90D, now - 30D): DAY granularity, no deletions.
  • [now - 30D, now - 1D): HOUR granularity (defaults), no deletions. Because skipOffsetFromLatest is P1D, the most recent day of data, measured from the end of the latest segment, is skipped.
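
If you prefer to submit the spec from a script, the same request can be built with Python's standard library. This is a minimal sketch: the `build_request` helper is hypothetical, and the base URL matches the curl command above, which you may need to adjust for your cluster.

```python
import json
import urllib.request

def build_request(supervisor_spec: dict,
                  base_url: str = "http://localhost:8081") -> urllib.request.Request:
    """Build (but do not send) the POST request that submits a supervisor spec."""
    return urllib.request.Request(
        url=f"{base_url}/druid/indexer/v1/supervisor",
        data=json.dumps(supervisor_spec).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To send the request against a running cluster:
#   with urllib.request.urlopen(build_request(supervisor_spec)) as response:
#       print(response.status, response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it possible to inspect or log the payload before it reaches the cluster.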

Limitations

  • MSQ task engine only. Cascading reindexing requires the MSQ task engine. The native engine is not supported.
  • Compaction supervisors only. This feature is not available for auto-compaction using Coordinator duties.
  • No partitionsSpec in tuningConfig. Partitioning is controlled exclusively by rules and defaults. Setting partitionsSpec inside tuningConfig causes a validation error.
  • Granularity must not coarsen toward the present. Segment granularity must stay the same or become finer as you move from older to newer data. For example, DAY to HOUR is valid; HOUR to DAY is not.
  • skipOffsetFromLatest and skipOffsetFromNow are mutually exclusive. You can set one or the other, not both.
  • ALL segment granularity is not supported. This is the same limitation as standard auto-compaction.

Learn more

See the following topics for more information: