Retaining or automatically dropping data

In Apache Druid, Coordinator processes use rules to determine what data should be loaded to or dropped from the cluster. Rules are used for data retention and query execution, and are set on the Coordinator console (http://coordinator_ip:port).

There are three types of rules: load rules, drop rules, and broadcast rules. Load rules indicate how segments should be assigned to different Historical process tiers and how many replicas of a segment should exist in each tier. Drop rules indicate when segments should be dropped entirely from the cluster. Finally, broadcast rules indicate how segments of different datasources should be co-located in Historical processes.

The Coordinator loads a set of rules from the metadata storage. Rules can be specific to a datasource, and a default set of rules can also be configured. Rules are read in order, so the ordering of rules is important. The Coordinator cycles through all used segments and matches each segment with the first rule that applies. Each segment matches only a single rule.
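The first-match behavior described above can be illustrated with a minimal sketch. This is not Druid's implementation: the helper names are hypothetical, only the interval and forever rule types are modeled, and an interval rule is assumed to match when it contains the segment's interval (this page states that condition only for drop rules).

```python
from datetime import date

def parse_interval(text):
    """Split an ISO-8601 'start/end' interval of plain dates into two date objects."""
    start, end = text.split("/")
    return date.fromisoformat(start), date.fromisoformat(end)

def rule_matches(rule, segment_interval):
    """Return True if this simplified rule applies to the segment interval."""
    if rule["type"].endswith("Forever"):
        return True  # forever rules match every segment
    rule_start, rule_end = parse_interval(rule["interval"])
    seg_start, seg_end = parse_interval(segment_interval)
    # Assumption: the rule interval must contain the segment interval.
    return rule_start <= seg_start and seg_end <= rule_end

def first_matching_rule(rules, segment_interval):
    """Walk the rules in order and return the first one that applies, or None."""
    for rule in rules:
        if rule_matches(rule, segment_interval):
            return rule
    return None

rules = [
    {"type": "loadByInterval", "interval": "2012-01-01/2013-01-01",
     "tieredReplicants": {"_default_tier": 1}},
    {"type": "dropForever"},
]

print(first_matching_rule(rules, "2012-06-01/2012-06-02")["type"])  # loadByInterval
print(first_matching_rule(rules, "2014-06-01/2014-06-02")["type"])  # dropForever
# Reversing the order makes dropForever win for every segment.
print(first_matching_rule(rules[::-1], "2012-06-01/2012-06-02")["type"])  # dropForever
```

Because evaluation stops at the first match, a catch-all rule placed at the top of the list hides every rule below it.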

Note: It is recommended that the Coordinator console is used to configure rules. However, the Coordinator process does have HTTP endpoints to programmatically configure rules.
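As a rough sketch of the programmatic route, the snippet below builds (but does not send) a request that posts a rule list for one datasource. The endpoint path `/druid/coordinator/v1/rules/<datasource>`, the port `8081`, and the datasource name `wikipedia` are assumptions that do not appear on this page; check the API reference for your Druid version before relying on them.

```python
import json
from urllib.request import Request

def build_rules_request(coordinator_url, datasource, rules):
    """Build a POST request carrying the rule list as a JSON array."""
    # Assumed endpoint path; verify it against the API reference.
    url = f"{coordinator_url}/druid/coordinator/v1/rules/{datasource}"
    body = json.dumps(rules).encode("utf-8")
    return Request(url, data=body, method="POST",
                   headers={"Content-Type": "application/json"})

rules = [
    {"type": "loadByPeriod", "period": "P1M", "includeFuture": True,
     "tieredReplicants": {"_default_tier": 1}},
    {"type": "dropForever"},
]
request = build_rules_request("http://coordinator_ip:8081", "wikipedia", rules)
print(request.get_method(), request.full_url)
# To submit it, pass the request to urllib.request.urlopen.
```

Building the request separately from sending it makes it easy to inspect the JSON body before it reaches the Coordinator.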

Load rules

Load rules indicate how many replicas of a segment should exist in a server tier. Please note: If a Load rule is used to retain only data from a certain interval or period, it must be accompanied by a Drop rule. If a Drop rule is not included, data not within the specified interval or period will be retained by the default rule (loadForever).
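The pitfall above can be made concrete with a small check. The `has_catch_all` helper is hypothetical and not part of Druid; it only tests whether a datasource's rule chain ends with a rule that matches every segment. A chain that ends with an interval or period load rule leaves all unmatched segments to the default rules.

```python
def has_catch_all(rules):
    """Return True if the last rule in the chain matches every segment."""
    return bool(rules) and rules[-1]["type"] in (
        "loadForever", "dropForever", "broadcastForever")

load_last_month = {
    "type": "loadByPeriod",
    "period": "P1M",
    "tieredReplicants": {"_default_tier": 1},
}

# Without a trailing drop rule, older segments fall through to the default rule.
print(has_catch_all([load_last_month]))                           # False
# With dropForever at the end, everything older than one month is dropped.
print(has_catch_all([load_last_month, {"type": "dropForever"}]))  # True
```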

Forever Load Rule

Forever load rules are of the form:

{
  "type" : "loadForever",
  "tieredReplicants": {
    "hot": 1,
    "_default_tier" : 1
  }
}
  • type - this should always be "loadForever"
  • tieredReplicants - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.

Interval Load Rule

Interval load rules are of the form:

{
  "type" : "loadByInterval",
  "interval": "2012-01-01/2013-01-01",
  "tieredReplicants": {
    "hot": 1,
    "_default_tier" : 1
  }
}
  • type - this should always be "loadByInterval"
  • interval - A JSON String representing an ISO-8601 interval
  • tieredReplicants - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.

Period Load Rule

Period load rules are of the form:

{
  "type" : "loadByPeriod",
  "period" : "P1M",
  "includeFuture" : true,
  "tieredReplicants": {
      "hot": 1,
      "_default_tier" : 1
  }
}
  • type - this should always be "loadByPeriod"
  • period - A JSON String representing an ISO-8601 period
  • includeFuture - A JSON Boolean indicating whether the load period should include the future. This property is optional; the default is true.
  • tieredReplicants - A JSON Object where the keys are the tier names and values are the number of replicas for that tier.

The interval of a segment is compared against the specified period. The period runs from some time in the past either to the future or to the current time, depending on whether includeFuture is true or false. The rule matches if the period overlaps the interval.
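The overlap test can be sketched as follows. This is an illustration under stated assumptions rather than Druid's code: the function names are hypothetical, a fixed 30-day `timedelta` stands in for `P1M` (real months vary in length), and "the future" is modeled as the largest representable datetime.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

def period_window(now, period, include_future=True):
    """Return the (start, end) window that a period rule covers at time `now`."""
    start = now - period
    # Assumption: "the future" is modeled as the largest representable datetime.
    end = datetime.max.replace(tzinfo=UTC) if include_future else now
    return start, end

def overlaps(window, segment):
    """True if two half-open [start, end) ranges share any instant."""
    return window[0] < segment[1] and segment[0] < window[1]

now = datetime(2020, 3, 1, tzinfo=UTC)
one_month = timedelta(days=30)  # stand-in for "P1M"

recent = (datetime(2020, 2, 20, tzinfo=UTC), datetime(2020, 2, 21, tzinfo=UTC))
old = (datetime(2019, 12, 1, tzinfo=UTC), datetime(2019, 12, 2, tzinfo=UTC))
future = (datetime(2020, 3, 5, tzinfo=UTC), datetime(2020, 3, 6, tzinfo=UTC))

print(overlaps(period_window(now, one_month), recent))   # True
print(overlaps(period_window(now, one_month), old))      # False
print(overlaps(period_window(now, one_month), future))   # True
print(overlaps(period_window(now, one_month, include_future=False), future))  # False
```

The last two lines show the effect of includeFuture: a segment with a future timestamp is matched only when the window extends past the current time.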

Drop Rules

Drop rules indicate when segments should be dropped from the cluster.

Forever Drop Rule

Forever drop rules are of the form:

{
  "type" : "dropForever"
}
  • type - this should always be "dropForever"

All segments that match this rule are dropped from the cluster.

Interval Drop Rule

Interval drop rules are of the form:

{
  "type" : "dropByInterval",
  "interval" : "2012-01-01/2013-01-01"
}
  • type - this should always be "dropByInterval"
  • interval - A JSON String representing an ISO-8601 interval

A segment is dropped if the interval contains the interval of the segment.

Period Drop Rule

Period drop rules are of the form:

{
  "type" : "dropByPeriod",
  "period" : "P1M",
  "includeFuture" : true
}
  • type - this should always be "dropByPeriod"
  • period - A JSON String representing an ISO-8601 period
  • includeFuture - A JSON Boolean indicating whether the drop period should include the future. This property is optional; the default is true.

The interval of a segment is compared against the specified period. The period runs from some time in the past either to the future or to the current time, depending on whether includeFuture is true or false. The rule matches if the period contains the interval. This drop rule always drops recent data.
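Drop rules use containment rather than overlap, which matters for segments that straddle the start of the window. The sketch below is illustrative only: `contains` is a hypothetical helper, a fixed 30-day `timedelta` stands in for `P1M`, and the window shown corresponds to includeFuture set to false.

```python
from datetime import datetime, timedelta

def contains(outer, inner):
    """True if the (start, end) range `outer` fully covers `inner`."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

now = datetime(2020, 3, 1)
# Window for dropByPeriod with includeFuture=false; 30 days stands in for P1M.
window = (now - timedelta(days=30), now)

inside = (datetime(2020, 2, 10), datetime(2020, 2, 11))
straddling = (datetime(2020, 1, 25), datetime(2020, 2, 5))
older = (datetime(2019, 12, 1), datetime(2019, 12, 2))

print(contains(window, inside))      # True: recent segment, dropped
print(contains(window, straddling))  # False: only partly inside the window
print(contains(window, older))       # False: older data is left to later rules
```

Only the segment that lies entirely within the most recent period matches, which is why this rule removes recent data and leaves older segments for the rules that follow.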

Period Drop Before Rule

Period drop before rules are of the form:

{
  "type" : "dropBeforeByPeriod",
  "period" : "P1M"
}
  • type - this should always be "dropBeforeByPeriod"
  • period - A JSON String representing an ISO-8601 period

The interval of a segment is compared against the specified period. The period runs from some time in the past to the current time. The rule matches if the interval is before the period. If you only want to retain recent data, you can use this rule to drop data older than the specified period, and add a loadForever rule after it. Note that dropBeforeByPeriod + loadForever is equivalent to loadByPeriod(includeFuture = true) + dropForever.
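The stated equivalence can be checked with a small simulation. The function names are hypothetical, a 30-day `timedelta` stands in for `P1M`, and "before the period" is assumed to mean that the segment ends at or before the start of the period; that boundary choice is an assumption, not something this page specifies.

```python
from datetime import datetime, timedelta

def drop_before_matches(now, period, segment):
    """dropBeforeByPeriod sketch: the segment ends at or before (now - period)."""
    return segment[1] <= now - period

def load_by_period_matches(now, period, segment):
    """loadByPeriod(includeFuture=true) sketch: the segment overlaps [now - period, future)."""
    return segment[1] > now - period

def chain_a(now, period, segment):
    """dropBeforeByPeriod followed by loadForever."""
    return "drop" if drop_before_matches(now, period, segment) else "load"

def chain_b(now, period, segment):
    """loadByPeriod(includeFuture=true) followed by dropForever."""
    return "load" if load_by_period_matches(now, period, segment) else "drop"

now = datetime(2020, 3, 1)
period = timedelta(days=30)
segments = [
    (datetime(2019, 12, 1), datetime(2019, 12, 2)),  # well before the period
    (datetime(2020, 1, 25), datetime(2020, 2, 5)),   # straddles the period start
    (datetime(2020, 2, 20), datetime(2020, 2, 21)),  # inside the period
    (datetime(2020, 3, 5), datetime(2020, 3, 6)),    # in the future
]

for segment in segments:
    print(segment[0].date(), chain_a(now, period, segment), chain_b(now, period, segment))
```

Under these assumptions both chains drop only the first segment and load the other three.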

Broadcast Rules

Broadcast rules indicate that segments of a datasource should be loaded by all servers of the following types in a cluster: historicals, brokers, tasks, and indexers.

Note that the broadcast segments are only directly queryable through the historicals, but they are currently loaded on other server types to support join queries.

Forever Broadcast Rule

Forever broadcast rules are of the form:

{
  "type" : "broadcastForever"
}
  • type - this should always be "broadcastForever"

This rule applies to all segments of a datasource, covering all intervals.

Interval Broadcast Rule

Interval broadcast rules are of the form:

{
  "type" : "broadcastByInterval",
  "interval" : "2012-01-01/2013-01-01"
}
  • type - this should always be "broadcastByInterval"
  • interval - A JSON String representing an ISO-8601 interval. Only segments within the interval are broadcast.

Period Broadcast Rule

Period broadcast rules are of the form:

{
  "type" : "broadcastByPeriod",
  "period" : "P1M",
  "includeFuture" : true
}
  • type - this should always be "broadcastByPeriod"
  • period - A JSON String representing an ISO-8601 period
  • includeFuture - A JSON Boolean indicating whether the broadcast period should include the future. This property is optional; the default is true.

The interval of a segment is compared against the specified period. The period runs from some time in the past either to the future or to the current time, depending on whether includeFuture is true or false. The rule matches if the period overlaps the interval.

Permanently deleting data

Druid can fully drop data from the cluster, wipe the metadata store entry, and remove the data from deep storage for any segments that are marked as unused (segments dropped from the cluster via rules are always marked as unused). You can submit a kill task to the Overlord to do this.
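A kill task is submitted as a JSON payload. The sketch below only builds that payload. The field names (`type`, `dataSource`, `interval`), the datasource name `wikipedia`, and the Overlord endpoint mentioned in the comment are assumptions not stated on this page; confirm them against the task reference for your Druid version.

```python
import json

def kill_task_spec(datasource, interval):
    """Build a kill task payload for unused segments in the given interval."""
    return {"type": "kill", "dataSource": datasource, "interval": interval}

payload = json.dumps(kill_task_spec("wikipedia", "2012-01-01/2013-01-01"), indent=2)
print(payload)
# Assumed submission route: POST this body to the Overlord's task endpoint,
# /druid/indexer/v1/task, with Content-Type: application/json.
```

Because a kill task removes data from deep storage, it is worth reviewing the interval in the printed payload before submitting it.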

Reloading dropped data

Data that has been dropped from a Druid cluster cannot be reloaded using only rules. To reload dropped data in Druid, you must first set your retention period (e.g., by changing the retention period from 1 month to 2 months), and then mark as used all segments belonging to the datasource, either in the Druid Coordinator console or through the Druid Coordinator endpoints.

Copyright © 2019 Apache Software Foundation.
Except where otherwise noted, licensed under CC BY-SA 4.0.
Apache Druid, Druid, and the Druid logo are either registered trademarks or trademarks of The Apache Software Foundation in the United States and other countries.