Merging small data segments into larger, optimally sized ones is important for achieving good Druid performance on data ingested from streams, especially when late-arriving data is in the mix. You can schedule compactions to run automatically whenever they are needed. The auto-compaction feature works by analyzing each time chunk for suboptimally sized segments and kicking off tasks to merge them into optimally sized segments.
There is a potential drawback to be aware of: an automatic compaction for a given time chunk runs in a single-threaded task, so for very large datasources this can be slow, and you may want to look into other ways of compacting your data.
You can find the compaction config in the Data tab of a datasource.
Automatic compaction cannot run for time chunks that are currently receiving data: when new data arrives, any compaction in progress for that time chunk is canceled. This means you'll want to use the "skip offset from latest" option to exclude recent time chunks where late data might still be coming in.
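As a rough illustration, an auto-compaction config submitted to the Coordinator might look like the following sketch. This assumes Druid's compaction config fields (`dataSource`, `skipOffsetFromLatest`, and a `maxRowsPerSegment` target); check your Druid version's documentation for the exact schema, and note that the datasource name and values here are hypothetical:

```json
{
  "dataSource": "my-streaming-datasource",
  "skipOffsetFromLatest": "P1D",
  "maxRowsPerSegment": 5000000
}
```

Here `skipOffsetFromLatest` is an ISO 8601 period; `"P1D"` tells auto-compaction to leave the most recent day of time chunks alone, so late-arriving data can settle before those chunks are compacted.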