The Imply Distribution of Apache Druid (Druid) powers Imply's ability to deliver highly performant real-time analytics at scale. Within Druid, your data resides in table datasources, which are similar to tables in relational database management systems (RDBMS).
Loading data into a datasource
The goal when loading data into a Druid datasource is to optimize the schema and data layout so that Druid can deliver fast ad-hoc analytics for your end users. A typical ingestion task performs the following steps:
- Connect to the system hosting your original data.
- Parse the original data.
- Perform ingestion-time data transformation including filtering, concatenation, string processing, and other data manipulation functions.
- Create the schema for a datasource if it doesn’t already exist.
- Add data to the datasource, including appending new segments to existing segment sets.
- Organize the data layout in segments on disk.
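The steps above map directly onto the sections of a native batch ingestion spec. The following is a minimal sketch, not a complete spec; the file path, column names, and transform expression are hypothetical placeholders:

```json
{
  "type": "index_parallel",
  "spec": {
    "ioConfig": {
      "type": "index_parallel",
      "inputSource": { "type": "local", "baseDir": "/data", "filter": "*.json" },
      "inputFormat": { "type": "json" },
      "appendToExisting": false
    },
    "dataSchema": {
      "dataSource": "example_datasource",
      "timestampSpec": { "column": "ts", "format": "iso" },
      "dimensionsSpec": { "dimensions": ["country", "product"] },
      "transformSpec": {
        "transforms": [
          { "type": "expression", "name": "full_name", "expression": "concat(first_name, ' ', last_name)" }
        ],
        "filter": { "type": "selector", "dimension": "status", "value": "active" }
      }
    }
  }
}
```

Here `ioConfig` connects to and parses the original data, `transformSpec` performs ingestion-time filtering and string processing, `dimensionsSpec` and `timestampSpec` define the schema, and `appendToExisting` controls whether new segments are appended to existing segment sets.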
Ingestion tasks replace `SELECT ... INTO` and similar commands used to load data into a relational database.
During ingestion, Druid transforms your original data into time-chunked files called segments. Segments reside on disk in deep storage. Druid data retrieval services called Historicals load the segment files from deep storage to make them available for querying.
Efficient organization of your data into segments on disk can improve query performance within Druid.
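The time chunking described above is controlled by the `granularitySpec` inside the `dataSchema` of an ingestion spec. As a sketch, the following fragment (values chosen for illustration) creates one time chunk per day and truncates row timestamps to the hour:

```json
{
  "granularitySpec": {
    "type": "uniform",
    "segmentGranularity": "day",
    "queryGranularity": "hour",
    "rollup": true
  }
}
```

Coarser `segmentGranularity` values produce fewer, larger segments; finer values let Druid skip more data when queries filter on time.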
Data layout, schema design, and query performance
Queries perform best when Druid can distribute the compute load to retrieve data for responses across many Historicals. In general, queries run more quickly when Historicals:
- Can exclude segment files from processing based upon a query filter when building a query response.
- Have less data to process when responding to queries that target only a few rows and columns.
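One way to help Historicals exclude segments based on a query filter is secondary partitioning on a frequently filtered dimension. A minimal sketch, assuming a hypothetical `country` dimension, using the `partitionsSpec` of a native batch task's `tuningConfig` (range-style `single_dim` partitioning requires `forceGuaranteedRollup`):

```json
{
  "tuningConfig": {
    "type": "index_parallel",
    "forceGuaranteedRollup": true,
    "partitionsSpec": {
      "type": "single_dim",
      "partitionDimension": "country",
      "targetRowsPerSegment": 5000000
    }
  }
}
```

With this layout, a query filtered on `country` can prune segments whose value ranges don't match the filter, so fewer segment files are processed.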
Before you ingest data, identify your performance goals, then use those goals to define an ingestion strategy so that Druid can optimize your schema and the layout of your segments.
To learn more about Historicals and other Druid services and their role in data processing, see Design.
Differences between Druid and other database systems
Even though Druid shares conceptual similarities with traditional relational databases and data warehousing tools, there are key differences in terms of loading data. For example:
- In some data warehouses, you load all your data up front and address performance later, at query time. Druid relies on the data layout on disk to deliver exceptional performance, so queries perform better when you plan your schema design first and define your ingestion tasks accordingly.
- Relational databases can be highly normalized, joining many tables together to return the results for a single query. Druid performs better with a flat data model, where all the results for a query come from a single datasource.
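For example, instead of storing orders, customers, and products in separate tables and joining them at query time, you would denormalize at ingestion time so each input row already carries the joined attributes. The row below is a hypothetical illustration; the column names are not from any real schema:

```json
{
  "ts": "2023-01-15T08:30:00Z",
  "order_id": "o-1001",
  "customer_name": "Alice",
  "customer_country": "US",
  "product_name": "Widget",
  "price": 9.99
}
```

This trades some storage for query speed: every query against the datasource can be answered without joins.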
To learn more about the differences between Druid and other database models, see Schema design tips.
If you want to load data from files, such as CSV or Parquet, you should use native batch ingestion. Alternatively, if you have existing Hadoop infrastructure, you can use Hadoop-based ingestion for batch ingestion of file data.
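In a native batch spec, the file format is selected by the `inputFormat` in the `ioConfig`. A sketch for local Parquet files (the directory path is a placeholder, and Parquet support requires loading the `druid-parquet-extensions` extension):

```json
{
  "ioConfig": {
    "type": "index_parallel",
    "inputSource": { "type": "local", "baseDir": "/data/events", "filter": "*.parquet" },
    "inputFormat": { "type": "parquet" }
  }
}
```

For CSV, you would instead use `"inputFormat": { "type": "csv", "findColumnsFromHeader": true }` or supply the column list explicitly.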
See the following topics for more information: