Compaction

Optimizing segments

A Druid datasource is partitioned into segments, and Historical processes are responsible for serving those segments for query processing. As a result, it's very important to optimize segment sizes to achieve both efficient disk utilization and efficient query processing. Merging small segments is especially important for data ingested from streams, because the ingestion rate usually varies over time and some data may even arrive late.

Druid supports compaction which merges small segments or splits large ones to optimize their size. You can automatically schedule compaction whenever it is needed. The auto compaction feature works by analyzing each time chunk for sub-optimally-sized segments, and kicking off tasks to merge them into optimally sized segments. Please check Coordinator's automatic segment compaction for details.

You can find the compaction config in the Druid console.

The tables below describe what each configuration is for, along with its corresponding Druid configuration name. See the Coordinator's dynamic configuration documentation for more details.

Mandatory configurations

| Name | Druid configuration name | Description |
| --- | --- | --- |
| Target size of compacted segment (bytes) | `targetCompactionSizeBytes` | The target size of each segment after compaction. The actual sizes of compacted segments might be slightly larger or smaller than this value. Each compaction task may generate more than one output segment, and it will try to keep each output segment close to this configured size. |
| Skip offset from latest | `skipOffsetFromLatest` | The offset for searching for segments to be compacted. Strongly recommended for realtime datasources. See the note below for details. |
| Input segment size (bytes) | `inputSegmentSizeBytes` | Maximum number of total segment bytes processed per compaction task. Since a time chunk must be processed in its entirety, if the segments for a particular time chunk have a total size in bytes greater than this parameter, compaction will not run for that time chunk. Because each compaction task runs with a single thread, setting this value too far above 1–2 GB will result in compaction tasks taking an excessive amount of time. |
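As a sketch, the mandatory settings above might appear together in a datasource's auto compaction configuration like this. The `dataSource` field and the specific values are illustrative assumptions, not defaults:

```json
{
  "dataSource": "wikipedia",
  "targetCompactionSizeBytes": 419430400,
  "inputSegmentSizeBytes": 2147483648,
  "skipOffsetFromLatest": "P1D"
}
```

Here `targetCompactionSizeBytes` is roughly 400 MB, `inputSegmentSizeBytes` caps a single task's input at 2 GB (in line with the 1–2 GB guidance above), and `skipOffsetFromLatest` is an ISO 8601 period that excludes the most recent day of data from compaction.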

Advanced configurations

| Name | Druid configuration name | Description |
| --- | --- | --- |
| Max number of segments to compact | `maxNumSegmentsToCompact` | Maximum number of segments to compact together per compaction task. Since a time chunk must be processed in its entirety, if a time chunk has a total number of segments greater than this parameter, compaction will not run for that time chunk. |
| Compaction task priority | `taskPriority` | Priority of compaction tasks. |
| Tuning config | `tuningConfig` | Tuning config for compaction tasks. Note that compaction tasks share the same tuning config as Druid's native local index task. See TuningConfig for details. |
| Task context | `taskContext` | Task context for compaction tasks. |
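For completeness, a hypothetical configuration that also sets the advanced options might look like the following. The values, and the contents of `tuningConfig`, are assumptions for illustration; check them against the native index task's TuningConfig documentation before use:

```json
{
  "dataSource": "wikipedia",
  "targetCompactionSizeBytes": 419430400,
  "maxNumSegmentsToCompact": 150,
  "taskPriority": 25,
  "tuningConfig": {
    "maxRowsInMemory": 1000000
  },
  "taskContext": {}
}
```

Because compaction tasks reuse the native index task's tuning config, any field valid there (such as `maxRowsInMemory` above) can be supplied in `tuningConfig`.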

Automatic compaction cannot run for time chunks that are currently receiving data (when new data comes in, any compaction in progress will be canceled). This means that you’ll want to use the “skip offset from latest” option to avoid recent time chunks where late data might still be coming in.
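To make the note above concrete: if late data can arrive for up to a day, a skip offset of one day keeps auto compaction away from the time chunks that are still receiving data. A minimal sketch, where the value is an illustrative ISO 8601 period rather than a recommended setting:

```json
{
  "skipOffsetFromLatest": "P1D"
}
```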
