Segment size optimization

In Apache Druid, it's important to optimize segment size for two reasons:

  1. Druid stores data in segments. If you're using best-effort rollup mode, increasing the segment size might allow further aggregation, which reduces the dataSource size.
  2. When a query is submitted, it is distributed to all Historicals and realtime tasks that hold the query's input segments. Each process and task picks a thread from its own processing thread pool to process a single segment. If segments are too large, data might not be well distributed across data servers, decreasing the degree of parallelism possible during query processing. At the other extreme, if segments are too small, the scheduling overhead of processing a larger number of segments per query can reduce performance, as the threads that process each segment compete for the fixed slots of the processing pool.

Ideally, you would optimize segment size at ingestion time, but this isn't always easy, especially for stream ingestion, because the amount of incoming data can vary over time. In that case, you can create segments with a sub-optimal size first and optimize them later using compaction.

Consider the following factors when optimizing your segments:

  • Number of rows per segment: each segment should generally contain around 5 million rows. This setting is usually more important than segment byte size (below). Druid uses a single thread to process each segment, so the number of rows per segment directly controls how many rows each thread processes and, in turn, how well query execution is parallelized.
  • Segment byte size: 300-700 MB per segment is recommended. If this target conflicts with the number of rows per segment, prioritize the number of rows per segment.

These recommendations work in general, but the optimal settings can vary with your workload. For example, if most of your queries are heavy and spend a long time processing each row, you may want smaller segments so that query processing is more parallelized. If you still see performance issues after applying the recommendations above, experiment with segment sizes to find the settings that work best for your workload.

There are several ways to check whether compaction is necessary. One way is to use the System Schema, which provides several tables describing the current system status, including the segments table. The following query returns the average number of rows and the average size of published segments, grouped by interval and version:

-- Per-interval, per-version statistics for published segments of a single dataSource
SELECT
  "start",
  "end",
  version,
  COUNT(*) AS num_segments,
  AVG("num_rows") AS avg_num_rows,
  SUM("num_rows") AS total_num_rows,
  AVG("size") AS avg_size,
  SUM("size") AS total_size
FROM
  sys.segments
WHERE
  datasource = 'your_dataSource' AND
  is_published = 1
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3 DESC;

Note that the query result might include overshadowed segments. In that case, you may want to keep only the rows with the maximum version for each interval (pair of start and end), as in the sketch below.
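
One way to do this is to filter on the is_overshadowed column of sys.segments, which is 1 for published segments that are fully overshadowed by newer versions. This is a minimal sketch, assuming your Druid version exposes that column and that 'your_dataSource' is replaced with your dataSource name:

SELECT
  "start",
  "end",
  version,
  COUNT(*) AS num_segments,
  AVG("num_rows") AS avg_num_rows,
  AVG("size") AS avg_size
FROM
  sys.segments
WHERE
  datasource = 'your_dataSource' AND
  is_published = 1 AND
  is_overshadowed = 0  -- exclude segments overshadowed by a newer version
GROUP BY 1, 2, 3
ORDER BY 1, 2, 3 DESC;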

Once you find that your segments need compaction, consider the following two options:

  • Turn on automatic compaction on the Coordinator. The Coordinator periodically submits compaction tasks to re-index small segments. To enable automatic compaction, configure it for each dataSource via the Coordinator's dynamic configuration. See Automatic compaction configuration API and Automatic compaction dynamic configuration for details. You can verify that compaction tasks are being submitted with a query like the one after this list.
  • Run periodic Hadoop batch ingestion jobs, using a dataSource inputSpec to read from the segments generated by the Kafka indexing tasks. This can be helpful if you want to compact many segments in parallel. Details on how to do this can be found in the Updating existing data section of the data management page.
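
To check that compaction tasks are being submitted, you can query the sys.tasks system table. This is a minimal sketch, assuming compaction tasks appear with type 'compact' and that 'your_dataSource' is replaced with your dataSource name:

SELECT
  task_id,
  "type",
  datasource,
  status,
  created_time,
  duration
FROM
  sys.tasks
WHERE
  datasource = 'your_dataSource' AND
  "type" = 'compact'  -- compaction tasks, whether automatic or manual
ORDER BY created_time DESC;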

Learn more

For an overview of compaction and how to submit a manual compaction task, see Compaction.
