Skip to main content

SQL queries using the Dart query profile

Experimental

Dart is experimental. Use it in situations where it fits your use case better than the native query engine. But be aware that Dart has not received as much testing as the other query engines.

Dart is a profile of the MSQ engine that runs SELECT queries on Brokers and Historicals instead of on tasks. The Brokers act as controllers and the Historicals act as workers.

Use Dart as an alternative to the native query engine since it offers better parallelism, excelling at queries that involve:

  • large joins, which Dart performs using parallel sort-merges
  • high-cardinality exact groupBys
  • high-cardinality exact count distinct

When processing these kinds of queries, Dart can parallelize through the entire query, leading to better performance.

By default, Dart queries include results form published segments and realtime tasks.

Enable Dart

To enable Dart, add the following line to your _common/common.runtime.properties files:

druid.msq.dart.enabled = true

Configure resource consumption

You can configure the Broker and the Historical to tune Dart's resource consumption. Since Brokers only act as controllers, they don't require substantial resources. Historicals, on the other hand, are processing the queries. More resources for Historicals can result in faster query processing.

For Brokers, you can set the following configs:

Property nameDescriptionDefault
druid.msq.dart.controller.concurrentQueriesMaximum number of query controllers that can run concurrently on that Broker. Additional controllers are queued. Queries can get stuck waiting for each other if the total value on Brokers exceeds the setting on a single Historical (druid.msq.dart.worker.concurrentQueries ).1
druid.msq.dart.query.context.targetPartitionsPerWorkerNumber of available threads on workers (druid.processing.numThreads)1 (Multithreading is turned off on Historicals)

For Historicals, you can set the following configs:

Property nameDescriptionDefault Value
druid.msq.dart.worker.concurrentQueriesMaximum number of query workers that can run concurrently on that Historical. Set this to a value equal to or larger than druid.msq.dart.controller.concurrentQueries on your Brokers. If you don't, queries can get stuck waiting for each other.Equal to the number of merge buffers
druid.msq.dart.worker.heapFractionMaximum amount of heap available for use across all Dart queries as a decimal.0.35 (35% of heap)

Run a Dart query

Once enabled, you can use Dart in the Druid console or the SQL query API to issue queries.

Druid console

In the Query view, select Engine: SQL (Dart) from the engine selector menu.

API

Dart uses the SQL endpoint /druid/v2/sql. To use Dart, include the query context parameter engine and set it to msq-dart:

As part of your query using SET engine = 'msq-dart':

"query":"SET \"engine\"='msq-dart';\nSELECT\n  user,\n  commentLength,\n  COUNT(*) AS \"COUNT\"\nFROM \"wikipedia\"\nGROUP BY 1, 2\nORDER BY 2 DESC"

Query context parameters

You can use query context parameters to control Dart's behavior. The following table lists the supported query context parameters:

ParameterDescriptionDefault value
finalizeAggregationsDetermines the type of aggregation to return. If true, Druid finalizes the results of complex aggregations that directly appear in query results. If false, Druid returns the aggregation's intermediate type rather than finalized type. This parameter is useful during ingestion, where it enables storing sketches directly in Druid tables. For more information about aggregations, see SQL aggregation functions.true
includeSegmentSourceControls the sources that are queried for results in addition to the segments present on deep storage. Can be NONE or REALTIME. If this value is NONE, only non-realtime (published and used) segments will be downloaded from deep storage. If this value is REALTIME, results will also be included from realtime tasks.REALTIME
removeNullBytesThe MSQ engine cannot process null bytes in strings and throws InvalidNullByteFault if it encounters them in the source data. If the parameter is set to true, The MSQ engine will remove the null bytes in string fields when reading the data.false
maxConcurrentStagesNumber of stages that can run concurrently for a query. A higher number can potentially improve pipelining but results in less memory available for each stage.2
maxNonLeafWorkersNumber of workers to use for stages beyond the leaf stage1 (Scatter-gather style)
sqlJoinAlgorithmAlgorithm to use for JOIN. Use broadcast (the default) for broadcast hash join or sortMerge for sort-merge join. Affects all JOIN operations in the query. This is a hint to the MSQ engine and the actual joins in the query may proceed in a different way than specified.broadcast
targetPartitionsPerWorkerNumber of partitions Druid generates for each worker. This number controls how much parallelism can be maintained throughout a query.1

Known issues and limitations

  • Dart doesn't do the following:
    • verify that druid.msq.dart.controller.concurrentQueries is set properly. If set too high, queries can get stuck on each other.
    • use the query cache.
    • perform query prioritization or laning
  • Dart doesn't support JDBC connections. The engine context parameter gets ignored.
  • Realtime scans from the MSQ engine cannot reliably read complex types. This can happen in situations such as if your data includes HLL Sketches for realtime data. Dart returns a NullPointerException. For more information, see #18340.
  • The NilStageOutputReader can sometimes lead to a NoClassDefFoundError. For more information, see #18336.