To load batch data, you'll need:
There are two supported methods for loading files into Druid:
Built-in ingestion with the "index" task. This performs the ingestion work on your Druid nodes. Each task processes data in a single thread, but you can parallelize ingestion by submitting multiple tasks.
EMR Hadoop-based ingestion with the "index_hadoop" task. This performs the ingestion work on an Amazon EMR cluster using Hadoop Map/Reduce, where it is automatically parallelized.
If you've never loaded data files into Druid before, we recommend working through the example on the Getting Started page first and then coming back to this page.
Druid can load files using built-in ingestion with the "index" task. Each indexing task you submit will run single-threaded. To parallelize the data loading process, you can partition your data by time (e.g. hour, day, or some other time bucketing) and then submit an indexing task for each time partition. Indexing tasks for different intervals can run simultaneously.
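As a rough illustration of the time-partitioning approach, the sketch below splits a date range into one-day intervals and prepares one task description per interval. The helper names (`day_intervals`, `tasks_for_range`) are hypothetical, and each dictionary is only a stand-in for a real task spec: apart from the "index" task type, its fields are assumptions made for this example, not Druid's actual spec layout.

```python
from datetime import datetime, timedelta


def day_intervals(start, end):
    """Split [start, end) into ISO-8601 'start/end' strings, one per day."""
    intervals = []
    cursor = start
    while cursor < end:
        # Clamp the last bucket so it never runs past the requested end.
        nxt = min(cursor + timedelta(days=1), end)
        intervals.append(f"{cursor.isoformat()}/{nxt.isoformat()}")
        cursor = nxt
    return intervals


def tasks_for_range(start, end):
    """Build one placeholder task description per day interval.

    Only "type": "index" comes from this page; "interval" is an
    illustrative field, not the real task spec structure.
    """
    return [{"type": "index", "interval": iv} for iv in day_intervals(start, end)]


if __name__ == "__main__":
    for task in tasks_for_range(datetime(2015, 9, 1), datetime(2015, 9, 3)):
        print(task)
```

Because each task covers a distinct interval, the resulting tasks could be submitted together and run simultaneously, as described above.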
Druid can leverage Hadoop Map/Reduce using Amazon EMR to scale out ingestion, allowing it to load data from files on S3 via parallelized YARN jobs. These jobs will scan through your raw data and produce optimized Druid data segments in S3. The data will then be loaded by Druid Historical Nodes. Once loading is complete, EMR is not involved in the query path of Druid in any way.
The main advantages of loading data using EMR are that it automatically parallelizes the batch data loading process, and that it uses EMR resources instead of your Druid machines (leaving your Druid machines free to handle queries).
Please see EMR Setup for instructions on how to configure EMR-based batch ingestion.
When you load additional data into Druid using subsequent indexing tasks, the behavior depends on the intervals of the subsequent tasks. Batch loads in Druid act in a replace-by-interval manner, so if you submit two tasks for the same interval, only data from the later task will be visible. If you submit two tasks for different intervals, both sets of data will be visible.
This behavior makes it easy to reload data that you have corrected or amended in some way: just resubmit an indexing task for the same interval, but pointing at the new data. The replacement occurs atomically.
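The replace-by-interval rule can be pictured with a small toy model. This is only a sketch of the visible outcome, not how Druid implements it: the function name `visible_data` is hypothetical, intervals are treated as opaque strings, and only tasks with exactly matching intervals are modeled (partially overlapping intervals are out of scope here).

```python
def visible_data(tasks):
    """Return the data visible after applying tasks in submission order.

    `tasks` is a list of (interval, data) pairs. A later task for the
    same interval replaces the earlier one; tasks for different
    intervals coexist.
    """
    visible = {}
    for interval, data in tasks:
        # Assigning by interval key models "only the later task is visible".
        visible[interval] = data
    return visible


if __name__ == "__main__":
    submitted = [
        ("2015-09-01/2015-09-02", "original file"),
        ("2015-09-02/2015-09-03", "second day"),
        ("2015-09-01/2015-09-02", "corrected file"),  # reload of day one
    ]
    print(visible_data(submitted))
```

In this model, resubmitting day one with corrected data leaves day two untouched and swaps out day one as a whole.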
If you want to append to existing data for a given interval rather than replace it, you can do this in one of two ways: