Imply Cloud supports the following real-time and batch ingestion methods:
From Files — Batch ingestion from HDFS, S3, local files, and other filesystems. We recommend this method if your dataset is already in flat files.
The easiest way to get started with loading data is to follow the included tutorials.
You can combine batch (file-based) and streaming methods in a hybrid batch/streaming architecture, sometimes called a "lambda architecture". In a hybrid architecture, you use a streaming method to do initial ingestion, and then periodically re-ingest 'finalized' data in batch mode (typically every few hours or nightly).
Hybrid architectures are simple with Druid, since batch-loaded data for a particular time range automatically replaces stream-loaded data for that same time range. All Druid queries seamlessly access historical data together with real-time data. We recommend this kind of architecture if you need real-time analytics but also need the ability to reprocess historical data. Common reasons for reprocessing historical data include:
Most streaming ingestion methods currently supported by Druid introduce the possibility of dropped or duplicated messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for historical data.
You may need to re-ingest your data in batch mode, either because you missed some data the first time around or because you need to revise it. Because Druid's batch ingestion operates on specific slices of time, you can run a historical batch load and a real-time streaming load simultaneously.
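To make the replacement behavior described above more concrete, here is a small toy model in Python. It is not Druid's implementation. It only illustrates the idea that, for a given time range, data written by a later batch load takes precedence over data written earlier by a streaming load. The `Segment` class, the `visible_segments` function, the integer versions, the day-aligned intervals, and the row counts are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    interval: str  # hypothetical day-aligned time range, "start/end"
    version: int   # stand-in for a load version; higher means loaded later
    source: str    # "streaming" or "batch"
    rows: int      # made-up row count, for illustration only


def visible_segments(segments):
    """Group segments by interval, keeping only those with the highest version."""
    latest = {}
    for seg in segments:
        current = latest.get(seg.interval)
        if current is None or seg.version > current[0].version:
            # First segment seen for this interval, or a newer load replaces it.
            latest[seg.interval] = [seg]
        elif seg.version == current[0].version:
            # Same load, additional partition of the same interval.
            current.append(seg)
        # Older versions are ignored: they are shadowed by the newer load.
    return latest


segments = [
    # Initial streaming ingestion for two days.
    Segment("2024-01-01/2024-01-02", 1, "streaming", 998),
    Segment("2024-01-02/2024-01-03", 1, "streaming", 1204),
    # Later batch re-ingestion of the first day only.
    Segment("2024-01-01/2024-01-02", 2, "batch", 1000),
]

visible = visible_segments(segments)
for interval in sorted(visible):
    for seg in visible[interval]:
        print(interval, seg.source, seg.rows)
```

In this sketch, a query over both days would read the batch-loaded data for the first day and the stream-loaded data for the second day, because only the first day has been re-ingested. The model assumes that batch and streaming loads use identically aligned intervals; it does not attempt to handle partially overlapping ranges.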
Note that with the Kafka indexing service, you can also reprocess historical data in a pure streaming architecture by migrating to a new stream-based datasource whenever reprocessing is needed. This is sometimes called a "kappa architecture".