Data partitioning

Partitioning is a method of organizing a large dataset into partitions to aid in data management and improve query performance in Imply Polaris.

By distributing data across multiple partitions, you decrease the amount of data that needs to be scanned at query time, which reduces the overall query response time.

For example, if you always filter your data by country, you can use the country dimension to partition your data. This improves query performance because Polaris only needs to scan the rows that match the country filter.
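
As a minimal conceptual sketch (not how Polaris stores data internally), the following Python snippet shows why pruning by a partitioned column reduces the rows a query touches:

```python
# Conceptual sketch: with data partitioned by country, a filter on
# country only scans the matching partition, not the full dataset.
from collections import defaultdict

rows = [
    {"country": "US", "value": 10},
    {"country": "FR", "value": 20},
    {"country": "US", "value": 30},
]

# Build partitions keyed by country.
partitions = defaultdict(list)
for row in rows:
    partitions[row["country"]].append(row)

# A query filtering on country = 'US' touches one partition.
us_rows = partitions["US"]
print(len(us_rows))  # 2
```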

Time partitioning

Polaris partitions datasets by timestamp based on the time partitioning granularity you select.

You set the partitioning in the Map source to table step of an ingestion job. To change the time partitioning for an existing table, go to the table view and click Manage > Edit table. Click Partitioning from the menu bar to update the table's partitioning settings.
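
If you manage tables programmatically, you can also update partitioning through the Polaris API. The sketch below is illustrative only: the endpoint path, HTTP method, authentication scheme, and the partitioningGranularity field name are assumptions, so confirm the exact request shape against the Polaris API reference.

```python
# Hedged sketch of setting time partitioning programmatically. The
# endpoint path, HTTP method, auth header, and the
# "partitioningGranularity" field name are assumptions for
# illustration; check the Polaris API reference for the exact shape.
import requests

ORG = "example-org"         # hypothetical organization name
TABLE_ID = "00000000-aaaa"  # hypothetical table ID
API_KEY = "pol_example"     # hypothetical API key

response = requests.patch(
    f"https://{ORG}.api.imply.io/v1/tables/{TABLE_ID}",
    headers={"Authorization": f"Basic {API_KEY}"},
    json={"partitioningGranularity": "day"},  # hour, day, month, year, or all
)
response.raise_for_status()
```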

You can partition your data by the following time periods:

  • Hour
  • Day
  • Month
  • Year
  • All (group all data into a single bucket)

By default, time partitioning is set to day, which is sufficient for most applications. Depending on the use case and the size of your dataset, you may benefit from a finer or a coarser setting. For example:

  • For highly aggregated datasets, where a single day contains less than one million rows, a coarser time partitioning may be appropriate.
  • For datasets with finer granularity timestamps, where queries often run on smaller intervals within a single day, a finer time partitioning may be more suitable (see the sketch below).
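
To make the granularity options concrete, the following sketch shows how a timestamp maps to the start of its time partition at each setting. The function is illustrative; Polaris performs this bucketing internally.

```python
# Sketch: how a timestamp maps to the start of its time partition at
# each granularity. Illustrative only; Polaris does this bucketing
# internally when it partitions your data.
from datetime import datetime

def partition_start(ts: datetime, granularity: str) -> datetime:
    """Floor a timestamp to the start of its time partition."""
    if granularity == "hour":
        return ts.replace(minute=0, second=0, microsecond=0)
    if granularity == "day":
        return ts.replace(hour=0, minute=0, second=0, microsecond=0)
    if granularity == "month":
        return ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    if granularity == "year":
        return ts.replace(month=1, day=1, hour=0, minute=0, second=0, microsecond=0)
    if granularity == "all":
        return datetime.min  # every row falls into a single bucket
    raise ValueError(f"unsupported granularity: {granularity}")

ts = datetime(2024, 5, 17, 14, 30)
print(partition_start(ts, "hour"))   # 2024-05-17 14:00:00
print(partition_start(ts, "day"))    # 2024-05-17 00:00:00
print(partition_start(ts, "month"))  # 2024-05-01 00:00:00
```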

Relation to rollup

When you use time partitioning with rollup, the time partitioning granularity must be coarser than or equal to the rollup granularity. For example, you can combine day partitioning with hourly rollup, but not hour partitioning with daily rollup.
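
A quick way to reason about valid combinations is to rank granularities from finest to coarsest and compare positions, as in this sketch. The ordering is illustrative and omits the finer rollup granularities Polaris also supports, such as minute.

```python
# Sketch: checking that partitioning granularity is coarser than or
# equal to rollup granularity. Illustrative ordering only; finer
# rollup granularities (such as minute or second) are omitted.
GRANULARITIES = ["hour", "day", "month", "year", "all"]  # finest to coarsest

def valid_with_rollup(partitioning: str, rollup: str) -> bool:
    return GRANULARITIES.index(partitioning) >= GRANULARITIES.index(rollup)

print(valid_with_rollup("day", "hour"))  # True: day partitions, hourly rollup
print(valid_with_rollup("hour", "day"))  # False: partitions finer than rollup
```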

Generally, fine-tuning clustering and rollup is more impactful on performance than using time partitioning alone.

Relation to replacing data

The table's time partitioning determines the granularity of data replacement. The replacement time interval must be coarser than or equal to the time partitioning. If you set the time partitioning to all, any data replacement job must replace all data within the table.
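
For example, with day partitioning you can replace one whole day or several whole days, but not a six-hour window. The following sketch, assuming day partitioning, checks that a replacement interval covers whole day partitions:

```python
# Sketch, assuming day partitioning: a valid replacement interval must
# start and end on partition (day) boundaries.
from datetime import datetime

def is_midnight(t: datetime) -> bool:
    return (t.hour, t.minute, t.second, t.microsecond) == (0, 0, 0, 0)

def aligns_with_day_partitioning(start: datetime, end: datetime) -> bool:
    """Check that [start, end) covers whole day partitions."""
    return is_midnight(start) and is_midnight(end) and end > start

print(aligns_with_day_partitioning(
    datetime(2024, 5, 17), datetime(2024, 5, 18)))         # True: one whole day
print(aligns_with_day_partitioning(
    datetime(2024, 5, 17, 6), datetime(2024, 5, 17, 12)))  # False: partial day
```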

Clustering

In addition to partitioning by time, you can partition further using other columns. This is often referred to as clustering or secondary partitioning. You must declare a column in the table's schema before you can use it as a clustering column.

To achieve the best performance and the smallest overall memory footprint, we recommend choosing the columns you most frequently filter on. Doing so decreases access time and improves data locality, the practice of storing similar data together.

Sort order

When configuring clustering, select the column you filter on the most as your first dimension. This signals Polaris to sort the rows within each partition by that column, which often improves data compression.

info

Polaris always sorts the rows within a partition by timestamp first.

You can drag and drop the columns to change the order in which they appear for clustering.

Example

The following screenshot shows a table with time partitioning set to day and clustering configured on continent and country, in that order.

[Screenshot: Polaris clustering columns]
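
Conceptually, rows within each one-day partition of this table end up ordered by timestamp first, then by the clustering columns. A minimal sketch of that ordering:

```python
# Conceptual sketch of the resulting sort order: within each one-day
# partition, rows are ordered by timestamp first, then by the
# clustering columns continent and country.
rows = [
    {"__time": "2024-05-17T09:00:00Z", "continent": "Europe", "country": "FR"},
    {"__time": "2024-05-17T09:00:00Z", "continent": "Asia",   "country": "JP"},
    {"__time": "2024-05-17T08:00:00Z", "continent": "Europe", "country": "DE"},
]

rows.sort(key=lambda r: (r["__time"], r["continent"], r["country"]))
for r in rows:
    print(r["__time"], r["continent"], r["country"])
# 2024-05-17T08:00:00Z Europe DE
# 2024-05-17T09:00:00Z Asia JP
# 2024-05-17T09:00:00Z Europe FR
```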

Segment generation

Partitioning controls how data is stored in files known as segments. The partitioning granularity translates to the interval by which segments are generated and stored. For example, if your data spans one week, and your partitioning granularity is set to day, there are seven one-day intervals that can each contain segments.

An interval has zero segments if no data exists for that time period. An interval can also hold more than one segment: when it contains a large amount of data, Polaris creates multiple segments to achieve segment file sizes that deliver efficient performance.
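
As an illustration, the sketch below buckets one week of events into one-day intervals and estimates segment counts. The rows-per-segment target is a made-up number; Polaris chooses real segment sizes internally.

```python
# Sketch: one week of events bucketed into one-day intervals. Intervals
# with no rows get no segments; a busy interval may be split into
# several segments. The rows-per-segment target here is a made-up
# number; Polaris chooses real segment sizes internally.
from collections import Counter
from datetime import date

event_days = [date(2024, 5, 13), date(2024, 5, 13), date(2024, 5, 15)]
ROWS_PER_SEGMENT = 2  # illustrative target only

rows_per_interval = Counter(event_days)
for day, n_rows in sorted(rows_per_interval.items()):
    n_segments = -(-n_rows // ROWS_PER_SEGMENT)  # ceiling division
    print(day, "->", n_segments, "segment(s)")
# Days with no events (such as 2024-05-14) simply have zero segments.
```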

In most cases, you do not directly interact with segment files in Polaris. To recover deleted data, you restore the corresponding interval and version that contain the segments.

For more information, see Recover or permanently delete data.