SQL-based ingestion known issues

This page describes SQL-based batch ingestion using the druid-multi-stage-query extension, new in Druid 24.0. Refer to the ingestion methods table to determine which ingestion method is right for you.

Multi-stage query task runtime

  • Fault tolerance is only partially implemented. Workers are relaunched if they are killed unexpectedly, but the controller is not.

  • Worker task stage outputs are stored in the working directory given by druid.indexer.task.baseDir. Stages that generate a large amount of output data may exhaust all available disk space. In this case, the query fails with an UnknownError with a message including "No space left on device".

  • Starting in 2023.06, multi-stage query (MSQ) tasks for SQL-based ingestion honor the size you set for task directories. This change allows the MSQ task engine to sort more data at the cost of performance. If a task requires more storage than the size you set, data spills over to S3, which can slow the query down. To mitigate the performance impact, either increase the number of tasks or increase the size you set for druid.worker.baseTaskDirs.
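Because a stage that fills the working directory fails only once the disk is already full, it can help to check free space before submitting a large query. The following is a minimal sketch of such a pre-flight check, not a Druid feature: the function name has_free_space and the 50 GiB threshold are hypothetical, "." stands in for whatever directory druid.indexer.task.baseDir points to, and the check has to run on the host where the worker tasks run.

```python
import shutil


def has_free_space(task_dir: str, required_bytes: int) -> bool:
    """Return True if the filesystem holding task_dir has at least required_bytes free."""
    usage = shutil.disk_usage(task_dir)  # named tuple: total, used, free (all in bytes)
    return usage.free >= required_bytes


if __name__ == "__main__":
    # "." is a placeholder for the directory configured as druid.indexer.task.baseDir.
    # The 50 GiB threshold is an arbitrary illustration, not a recommended value.
    print(has_free_space(".", 50 * 1024**3))
```

A check like this only reports free space at the moment it runs; it cannot predict how much output a stage will generate.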

SELECT Statement

  • GROUPING SETS are not implemented. Queries that use this feature return a QueryNotSupported error.

INSERT and REPLACE Statements

  • INSERT and REPLACE statements with column lists, like INSERT INTO tbl (a, b, c) SELECT ..., are not implemented.

  • INSERT ... SELECT and REPLACE ... SELECT insert columns from the SELECT statement based on column name. This differs from SQL standard behavior, where columns are inserted based on position.

  • INSERT and REPLACE do not support all options available in ingestion specs. Unsupported options include the createBitmapIndex and multiValueHandling dimension properties, and the indexSpec tuningConfig property.
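Since column lists are unavailable and columns are matched by name, the usual workaround is to alias every expression in the SELECT to the name of its target column. The sketch below illustrates that pattern by generating the statement text. The helper build_insert, the table "tbl", the source "src", and the column names are all hypothetical, and the helper does no escaping, so it is only suitable for trusted identifiers.

```python
def build_insert(table: str, columns: dict, source: str, partitioned_by: str = "DAY") -> str:
    """Build an INSERT ... SELECT whose SELECT aliases match the target column names.

    columns maps target column name -> SQL expression that produces it.
    """
    select_list = ",\n  ".join(f'{expr} AS "{name}"' for name, expr in columns.items())
    return (
        f'INSERT INTO "{table}"\n'
        f"SELECT\n  {select_list}\n"
        f"FROM {source}\n"
        f"PARTITIONED BY {partitioned_by}"
    )


# Instead of INSERT INTO tbl (a, b) SELECT x, y ..., alias x and y to a and b.
sql = build_insert(
    "tbl",
    {"__time": 'TIME_PARSE("ts")', "a": '"x"', "b": '"y"'},
    '"src"',
)
print(sql)
```

The order of the aliased expressions does not affect which target column each one lands in; only the alias does.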

EXTERN Function

  • The schemaless dimensions feature is not available. All columns and their types must be specified explicitly using the signature parameter of the EXTERN function.

  • EXTERN with input sources that match large numbers of files may exhaust available memory on the controller task.

  • EXTERN refers to external files. Use FROM to access Druid input sources.
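Because every column and its type must be declared up front, it can be convenient to generate the EXTERN arguments from a single column list. The sketch below assumes the three-argument form of EXTERN (input source, input format, and signature, each passed as a JSON string). The helper extern_clause, the example URL, and the column names are hypothetical.

```python
import json


def extern_clause(input_source: dict, input_format: dict, columns: list) -> str:
    """Build a TABLE(EXTERN(...)) clause with an explicit signature.

    columns is a list of (name, type) pairs, for example ("added", "long").
    """
    signature = [{"name": name, "type": col_type} for name, col_type in columns]
    args = [json.dumps(input_source), json.dumps(input_format), json.dumps(signature)]
    # Each JSON document becomes a SQL string literal; double any single quotes inside it.
    quoted = ",\n    ".join("'" + arg.replace("'", "''") + "'" for arg in args)
    return f"TABLE(\n  EXTERN(\n    {quoted}\n  )\n)"


clause = extern_clause(
    {"type": "http", "uris": ["https://example.com/data.json"]},  # placeholder URL
    {"type": "json"},
    [("ts", "string"), ("page", "string"), ("added", "long")],
)
print(f"SELECT * FROM {clause}")
```

Keeping the column list in one place makes it harder for the signature to drift out of sync with the columns the query actually selects.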