Data partitioning

Partitioning is a method of organizing a large dataset into smaller, more manageable pieces, called partitions, to aid data management and improve query performance in Imply Polaris.

By distributing data across multiple partitions, you decrease the amount of data that needs to be scanned at query time, which reduces the overall query response time.

For example, if you always filter your data by country, you can use the country dimension to partition your data. This improves query performance because Polaris only needs to scan the rows that match the country filter.
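As a rough illustration, the Python sketch below issues such a filtered query over HTTP. The base URL, project ID, table name, endpoint path, and authorization header format are placeholder assumptions for the example, not the exact Polaris contract; see the Polaris query and authentication documentation for the real details.

```python
import requests

# Hypothetical values: substitute your organization's Polaris API base
# URL, project ID, and an API key with query permissions.
BASE_URL = "https://example.api.imply.io/v1/projects/PROJECT_ID"
API_KEY = "POLARIS_API_KEY"

# Because the table is partitioned (clustered) on "country", Polaris can
# skip partitions whose rows cannot match this filter instead of
# scanning the whole table.
sql = """
SELECT COUNT(*) AS row_count
FROM "events"
WHERE "country" = 'France'
"""

response = requests.post(
    f"{BASE_URL}/query/sql",                        # assumed endpoint path
    headers={"Authorization": f"Basic {API_KEY}"},  # assumed auth scheme
    json={"query": sql},
)
response.raise_for_status()
print(response.json())
```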

Time partitioning

Polaris partitions datasets by timestamp based on the selected time partitioning granularity.

By default, time partitioning is set to day, which is sufficient for most applications. Depending on the use case and the size of your dataset, you may benefit from a finer or a coarser setting. For example:

  • For highly aggregated datasets, where a single day contains less than one million rows, a coarser time partitioning may be appropriate.
  • For datasets with finer granularity timestamps, where queries often run on smaller intervals within a single day, a finer time partitioning may be more suitable.

When using partitioning with rollup, partitioning time granularity must be coarser than or equal to the rollup granularity.
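To make this constraint concrete, the following standalone Python sketch (illustrative only, not part of Polaris) checks whether a partitioning and rollup combination satisfies the rule:

```python
# Granularities ordered from finest to coarsest.
GRANULARITIES = ["second", "minute", "hour", "day", "week", "month", "year"]

def is_valid(partitioning: str, rollup: str) -> bool:
    """Time partitioning must be coarser than or equal to the rollup granularity."""
    return GRANULARITIES.index(partitioning) >= GRANULARITIES.index(rollup)

assert is_valid("day", "hour")       # OK: day partitions with hourly rollup
assert is_valid("day", "day")        # OK: equal granularities
assert not is_valid("hour", "day")   # invalid: partitioning finer than rollup
```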

To change the time partitioning, in the Edit schema view, click Partitioning and choose from the available time partitioning settings. You can partition your data by hour, day, week, month, or year.

Generally, fine-tuning clustering and rollup is more impactful on performance than using time partitioning alone.

Clustering

In addition to partitioning by time, you can partition further using other columns. This is often referred to as clustering or secondary partitioning.

To achieve the best performance and the smallest overall memory footprint, we recommend choosing the columns you most frequently filter on. Doing so decreases access time and improves data locality, the practice of storing similar data together.

Sort order

When configuring clustering, select the column you filter on the most as your first dimension. This signals Polaris to sort the rows within each partition by that column, which often improves data compression.

Polaris always sorts the rows within a partition by timestamp first.
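In other words, the effective sort key is the timestamp followed by the clustering columns in their configured order. The following Python sketch models that ordering on a few sample rows; it is a conceptual illustration, not Polaris internals:

```python
# Sample rows with a timestamp and two clustering columns.
rows = [
    {"__time": "2023-06-01T00:00:00Z", "continent": "Europe", "country": "France"},
    {"__time": "2023-06-01T00:00:00Z", "continent": "Asia", "country": "Japan"},
    {"__time": "2023-05-31T23:00:00Z", "continent": "Europe", "country": "Spain"},
]

# Timestamp first, then the clustering columns in their configured order.
rows.sort(key=lambda r: (r["__time"], r["continent"], r["country"]))

for row in rows:
    print(row)
```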

You can drag and drop the columns to change the order in which they appear for clustering.

The following screenshot shows a table before ingestion with time partitioning set to week and clustering configured on the continent, country, and language columns in that order.

[Screenshot: Polaris clustering columns]
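If you define tables through the Polaris API rather than the console, the same configuration would live in the table definition. The sketch below mirrors the screenshot's settings; the endpoint path and the field names (partitioningGranularity, clusteringColumns) are assumptions for illustration, so consult the Tables API reference for the exact schema.

```python
import requests

BASE_URL = "https://example.api.imply.io/v1/projects/PROJECT_ID"  # hypothetical
API_KEY = "POLARIS_API_KEY"                                       # hypothetical

# Assumed shape of a table definition: week time partitioning with
# clustering on continent, country, then language, matching the
# screenshot above. Field names are illustrative only.
table_definition = {
    "name": "events",
    "partitioningGranularity": "week",
    "clusteringColumns": ["continent", "country", "language"],
}

response = requests.put(
    f"{BASE_URL}/tables/events",                    # assumed endpoint
    headers={"Authorization": f"Basic {API_KEY}"},  # assumed auth scheme
    json=table_definition,
)
response.raise_for_status()
```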

Limitations

Time partitioning granularity must be coarser than or equal to the rollup granularity.
