Data ingestion
Before you can use Pivot to visualize and analyze data, you need to load the data into Druid by defining datasources. The Druid console offers a visual interface for identifying data sources, defining the schema for the data, and setting its lifecycle properties.
Imply can ingest data in batch or streaming mode, depending on the nature of the data source. Batch sources are usually files retrieved from HDFS, S3, or the local filesystem. Examples of streaming data sources include Kafka and Kinesis.
Imply supports Druid's real-time and batch ingestion methods.
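As an illustration, a batch file source is described by the `inputSource` inside a Druid ingestion spec's `ioConfig`. The fragment below is a sketch only, with a placeholder directory and file pattern; an S3 source would instead use `"type": "s3"` with a list of `uris`.

```json
{
  "inputSource": {
    "type": "local",
    "baseDir": "/data/events",
    "filter": "events-*.json.gz"
  },
  "inputFormat": { "type": "json" }
}
```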
Getting started
The Quickstart takes you through the steps for loading data from files on your local disk. For information on other sources, see the Druid ingestion documentation.
Modeling data
Each datasource consists of the data itself and an ingestion spec. The ingestion spec describes the data and how it is organized and managed. The Druid console data loader generates the ingestion spec for you, but you can also write specs by hand or modify the generated ones.
By configuring the datasource appropriately, you can optimize query performance against the data. Configurable aspects include compaction tasks, how the data is partitioned into segments, and more. Data Ingestion and related pages cover many of these aspects in detail.
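To make these pieces concrete, here is a minimal sketch of a native batch (`index_parallel`) ingestion spec, roughly the shape of spec the data loader generates. The datasource name, file location, and column names are illustrative assumptions only.

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "local",
        "baseDir": "quickstart/tutorial",
        "filter": "wikiticker-2015-09-12-sampled.json.gz"
      },
      "inputFormat": { "type": "json" }
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page", "user"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```

The `dataSchema` describes the data (its timestamp and dimensions), while `segmentGranularity` controls how the data is partitioned into time-based segments; settings like these, together with compaction, are the main levers for query performance.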
Hybrid batch/streaming
You can combine batch (file-based) and streaming methods in a hybrid batch/streaming architecture, sometimes called a "lambda architecture". In a hybrid architecture, you use a streaming method to do initial ingestion, and then periodically re-ingest finalized data in batch mode (typically every few hours or nightly).
Hybrid architectures are simple with Druid, since batch-loaded data for a particular time range automatically replaces streaming-loaded data for that same time range. All Druid queries seamlessly access historical data together with real-time data. We recommend this kind of architecture if you need real-time analytics but also need the ability to reprocess historical data. Common reasons for reprocessing historical data include the following (a sketch of a batch re-ingestion spec appears after the list):
- Most streaming ingestion methods currently supported by Druid introduce the possibility of dropped or duplicated messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for historical data.
- You may need to re-ingest data in batch mode, for example because some data was missed the first time around or because you need to revise it. Because Druid's batch ingestion operates on specific slices of time, you can run a historical batch load and a real-time streaming load simultaneously.
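Continuing the illustrative spec above, a batch re-ingestion task that rebuilds one day of a streaming datasource might look like the following sketch; the S3 bucket, interval, and column names are placeholders. Because the `granularitySpec` lists an explicit interval and `appendToExisting` is false, the segments produced for that interval replace whatever the streaming job loaded there.

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": {
        "type": "s3",
        "uris": ["s3://example-bucket/wikipedia/2015-09-12.json.gz"]
      },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page", "user"] },
      "granularitySpec": {
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "rollup": false,
        "intervals": ["2015-09-12/2015-09-13"]
      }
    },
    "tuningConfig": { "type": "index_parallel" }
  }
}
```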
With the Kafka indexing service, you can also reprocess historical data in a pure streaming architecture by migrating to a new stream-based datasource whenever reprocessing is needed. This is sometimes called a "kappa architecture".
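For the streaming side, Kafka ingestion is driven by a supervisor spec rather than a one-off task. A minimal sketch, again with placeholder topic, broker, and column names, might look like this:

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "wikipedia",
      "consumerProperties": { "bootstrap.servers": "localhost:9092" },
      "inputFormat": { "type": "json" },
      "useEarliestOffset": true
    },
    "dataSchema": {
      "dataSource": "wikipedia",
      "timestampSpec": { "column": "time", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["channel", "page", "user"] },
      "granularitySpec": {
        "segmentGranularity": "hour",
        "queryGranularity": "none",
        "rollup": false
      }
    },
    "tuningConfig": { "type": "kafka" }
  }
}
```

In a kappa-style setup, reprocessing means pointing a supervisor like this at a new or replayed topic and writing into a new datasource, rather than running a separate batch job.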