The Imply Distribution of Apache Druid (Druid) powers Imply’s ability to deliver highly performant real-time analytics at scale. Within Druid, your data resides in table datasources which are similar to tables in relational database management systems (RDBMS).
Loading data into a datasource
The goal when loading data into a Druid datasource is to optimize the schema and layout so that Druid can deliver fast ad-hoc analytics for your end users.
- Connect to the system hosting your original data.
- Parse the original data.
- Perform ingestion-time data transformation including filtering, concatenation, string processing, and other data manipulation functions.
- Create the schema for a datasource if it doesn’t already exist.
- Add data to the datasource, including appending new segments to existing segment sets.
- Organize the data layout in segments on disk.
Ingestion tasks replace
SELECT... INTO, and similar commands used to load data into a relational database.
During ingestion, Druid transforms your original data into time-chunked files called segments. Segments reside on disks in deep storage. Druid data retrieval services called Historicals load the segment files from deep storage to make them available for querying.
Efficient organization of your data into segments on disk can improve query performance within Druid.
Data layout, schema design, and query performance
Queries perform best when Druid can distribute the compute load to retrieve data for responses across many Historicals. In general, queries run more quickly when Historicals:
- Can exclude segment files from processing based upon a query filter when building a query response.
- Have less data to process when responding to queries that target only a few rows and columns.
Before you ingest data, identify your performance goals and use the goals to define your ingestion strategy for Druid to optimize your schema and the layout of your segments.
To learn more about Historicals and other Druid services and their role in data processing, see Design.
Differences between Druid and other database systems
Even though Druid shares conceptual similarities with traditional relational databases and data warehousing tools, there are key differences in terms of loading data. For example:
In some data warehouses, you load all your data up front and figure out performance later at query time. Druid relies on the data layout on disk to deliver exceptional performance. Therefore, queries perform better if you plan your Schema design first and define your ingestion tasks accordingly.
Relational databases can be highly normalized, joining many tables together to return the results for a single query. Druid performs better with a flat data model where all the results from a query come from a single datasource.
To learn more about the differences between Druid and other database models see Schema design tips.
If you want to load data from files, such as csv or parquet, you should use native batch ingestion. Alternatively, if you have existing Hadoop infrastructure, you can use Hadoop-based ingestion for batch ingestion of file data.
You can combine batch and streaming methods in a hybrid batch/streaming architecture, sometimes called a "lambda architecture". In a hybrid architecture, you use streaming ingestion initially, and then periodically re-ingest finalized data in batch mode (typically every few hours or nightly).
Hybrid architectures are simple with Druid, since batch loaded data for a particular time range automatically replaces streaming loaded data for that same time range. All Druid queries seamlessly access historical data together with real-time data. We recommend this kind of architecture if you need real-time analytics but also need the ability to reprocess historical data. Common reasons for reprocessing historical data include:
Most streaming ingestion methods currently supported by Druid introduce the possibility of dropped or duplicated messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for historical data.
You get the option to re-ingest your data if necessary in batch mode. This could occur if you missed some data the first time around, or because you need to revise your data. Because Druid's batch ingestion operates on specific slices of time, it is possible to simultaneously do a historical batch load and real-time streaming load.
With the Kafka indexing service, it is possible to reprocess historical data in a pure streaming architecture by migrating to a new stream-based datasource whenever you want to reprocess historical data. This is sometimes called a "kappa architecture".
See the following topics for more information: