Introduction to data rollup

Modern applications emit millions of streaming events per day. As data accumulates, the storage footprint grows, often leading to higher storage costs and slower queries. Imply Polaris uses the Apache Druid data rollup feature to aggregate raw data at predefined intervals during ingestion. By decreasing row counts, rollup can dramatically reduce the size of stored data and improve query performance.

This topic provides an overview of data rollup in Polaris.

Data rollup

Rollup is a form of time-based data aggregation. It combines multiple rows that have the same timestamp and dimension values into a single row, resulting in a condensed data set.

You enable rollup by specifying the aggregate table type during table creation. You can then configure the table's time granularity before ingesting data to maximize performance.

When you select the detail table type, Polaris stores each record as it is ingested, without performing any form of aggregation.
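The contrast between the two table types can be sketched in plain Python (hypothetical data; the dimension name `country` and measure name `clicks` are illustrative, not from the Polaris sample): a detail table keeps every ingested row, while an aggregate table collapses rows that share a bucketed timestamp and dimension values.

```python
from collections import defaultdict

# Two events that fall into the same minute bucket with the same dimension value
events = [
    {"timestamp": "2024-01-01T00:00:10Z", "country": "US", "clicks": 3},
    {"timestamp": "2024-01-01T00:00:45Z", "country": "US", "clicks": 5},
]

def truncate_to_minute(ts: str) -> str:
    # "2024-01-01T00:00:10Z" -> "2024-01-01T00:00:00Z"
    return ts[:16] + ":00Z"

# Detail table: one stored row per ingested event
detail_rows = list(events)

# Aggregate table: rows sharing (bucketed timestamp, dimensions) roll up
rollup = defaultdict(int)
for e in events:
    rollup[(truncate_to_minute(e["timestamp"]), e["country"])] += e["clicks"]

print(len(detail_rows))  # 2 rows stored
print(len(rollup))       # 1 rolled-up row
```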

The following are optimal scenarios to create an aggregate table with rollup:

  • You want optimal performance or you have strict space constraints.
  • You don't need raw values from high-cardinality dimensions.

Conversely, create a detail table without rollup when any of the following conditions hold:

  • You want to preserve results for individual rows.
  • You don't have any measures that you want to aggregate during the ingestion process.
  • You have many high-cardinality dimensions.

Time granularity

Time granularity determines how to bucket data across the timestamp dimension using UTC time. Days start at 00:00 UTC.

Polaris supports the following time granularity options to bucket input data:

Time granularity | ISO 8601 notation | Example
-----------------|-------------------|---------------------
15 minute        | PT15M             | 2016-04-01T01:15:00Z
30 minute        | PT30M             | 2016-04-01T01:30:00Z

You can also set the rollup granularity to all, which allows you to group data by dimensions regardless of the timestamp. This option is primarily useful for non-time-series data.

By default, Polaris buckets input data at millisecond granularity.
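Bucketing at a given granularity can be sketched with the standard library, assuming timestamps truncate down to the start of their bucket in UTC (consistent with days starting at 00:00 UTC). The `bucket` helper below is illustrative, not a Polaris API.

```python
from datetime import datetime, timedelta, timezone

def bucket(ts: str, granularity: timedelta) -> str:
    """Truncate a UTC timestamp down to the start of its bucket."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    n = (dt - epoch) // granularity          # whole buckets elapsed since the epoch
    start = epoch + n * granularity
    return start.strftime("%Y-%m-%dT%H:%M:%SZ")

# PT15M and PT30M, matching the granularity table above
print(bucket("2016-04-01T01:22:37Z", timedelta(minutes=15)))  # 2016-04-01T01:15:00Z
print(bucket("2016-04-01T01:37:02Z", timedelta(minutes=30)))  # 2016-04-01T01:30:00Z
```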


The following example shows how to create an aggregate table and specify its rollup time granularity. The dataset is a sample of network flow event data, representing packet and byte counts for IP traffic that occurred within a particular second.

To create an aggregate table and specify its rollup granularity, follow these steps:

  1. Download this JSON file containing the sample input data.
  2. Click Tables from the left navigation menu of the Polaris UI.
  3. Click Create table.
  4. Enter a unique name for your table, select the Aggregate table type, and select the Strict schema mode. Click Next.
  5. From the table view, click Load data > Insert data and select the file you downloaded, rollup-data.json.
  6. Click Next > Continue.
  7. On the Map source to table step, click on the timestamp dimension, then click Edit.
  8. In the timestamp dialog, select Minute from the Rollup granularity drop-down. This tells Polaris to bucket the timestamps of the original input data by minute.
  9. Click Start ingestion.

Your table should look similar to the following:

Polaris rollup ingestion

The input data has nine rows, but with rollup applied, the table stores five rows. All aggregate tables automatically include a __count measure. This measure counts the number of source data rows that were rolled up into a given row. For more information, see Schema measures.

The following events were aggregated:

  • Events that occurred during 2018-01-01 01:01:

    {"timestamp":"2018-01-01T01:01:35Z","srcIP":"", "dstIP":"","packets":20,"bytes":9024}
    {"timestamp":"2018-01-01T01:01:51Z","srcIP":"", "dstIP":"","packets":255,"bytes":21133}
    {"timestamp":"2018-01-01T01:01:59Z","srcIP":"", "dstIP":"","packets":11,"bytes":5780}
  • Events that occurred during 2018-01-01 01:02:

    {"timestamp":"2018-01-01T01:02:14Z","srcIP":"", "dstIP":"","packets":38,"bytes":6289}
    {"timestamp":"2018-01-01T01:02:29Z","srcIP":"", "dstIP":"","packets":377,"bytes":359971}
  • Events that occurred during 2018-01-02 21:33:

    {"timestamp":"2018-01-02T21:33:14Z","srcIP":"", "dstIP":"","packets":38,"bytes":6289}
    {"timestamp":"2018-01-02T21:33:45Z","srcIP":"", "dstIP":"","packets":123,"bytes":93999}
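The rollup applied to the seven events listed above can be simulated in plain Python: group by the minute-truncated timestamp plus the dimension values, sum the measures, and count the source rows to produce the __count measure. (This is a conceptual sketch of what Polaris does at ingestion, not Polaris code; the srcIP and dstIP values are empty strings in the sample data.)

```python
from collections import defaultdict

events = [
    {"timestamp": "2018-01-01T01:01:35Z", "srcIP": "", "dstIP": "", "packets": 20,  "bytes": 9024},
    {"timestamp": "2018-01-01T01:01:51Z", "srcIP": "", "dstIP": "", "packets": 255, "bytes": 21133},
    {"timestamp": "2018-01-01T01:01:59Z", "srcIP": "", "dstIP": "", "packets": 11,  "bytes": 5780},
    {"timestamp": "2018-01-01T01:02:14Z", "srcIP": "", "dstIP": "", "packets": 38,  "bytes": 6289},
    {"timestamp": "2018-01-01T01:02:29Z", "srcIP": "", "dstIP": "", "packets": 377, "bytes": 359971},
    {"timestamp": "2018-01-02T21:33:14Z", "srcIP": "", "dstIP": "", "packets": 38,  "bytes": 6289},
    {"timestamp": "2018-01-02T21:33:45Z", "srcIP": "", "dstIP": "", "packets": 123, "bytes": 93999},
]

rolled = defaultdict(lambda: {"packets": 0, "bytes": 0, "__count": 0})
for e in events:
    minute = e["timestamp"][:16] + ":00Z"        # truncate to minute granularity
    key = (minute, e["srcIP"], e["dstIP"])       # same timestamp bucket + dimensions
    row = rolled[key]
    row["packets"] += e["packets"]
    row["bytes"] += e["bytes"]
    row["__count"] += 1                          # source rows rolled into this row

for key, row in sorted(rolled.items()):
    print(key[0], row)
# The seven source events collapse into three stored rows.
```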

If your table uses the flexible schema mode, Polaris treats all columns as dimensions. Since rollup only applies when rows have the same timestamp and dimension values, no rows would be rolled up in this example.

To get the same behavior in flexible mode, while on the Map source to table step, edit the packets column as follows:

  • Declare it in the table schema
  • Select the Measure column type
  • Define the input expression SUM("packets")

Apply the same actions to the bytes column, using the input expression SUM("bytes"). Before you start ingestion, your table should look like the following:

Polaris rollup ingestion with flexible mode


The following restrictions apply to aggregate tables:

  • Rollup is set for aggregate tables only. Tables are either aggregate or detail at creation. Once a table is created, you cannot change its type.
  • Once you add data to an aggregate table and specify its rollup granularity, you can only make the granularity coarser, for example, from Minute to Hour. Polaris makes the granularity change during compaction. If an aggregate table does not contain data and there is not an active ingestion job associated with the table, you can change the rollup granularity to a finer granularity, for example, from Hour to Minute.
  • Polaris does not support rollup for nested data. You can only create JSON columns in detail tables. To ingest the data into an aggregate table, either flatten the nested data into dimensions using JSON_VALUE or ingest the data into a string-typed column. Note that rollup is most effective when the data has low cardinality. If the data has high cardinality, meaning there are more distinct values in the dimension, fewer rows will be rolled up.
  • Polaris does not support rollup for the ipAddress, ipPrefix, or json data types. These complex data types can only be ingested into detail tables.
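Flattening nested data into top-level dimensions, as the restriction above suggests via JSON_VALUE, can be sketched in Python. The nested `net` field and IP values below are hypothetical; in Druid SQL the equivalent extraction would be something like JSON_VALUE("net", '$.srcIP').

```python
import json

# Hypothetical nested event: rollup cannot apply to the nested "net" object
nested = {
    "timestamp": "2018-01-01T01:01:35Z",
    "net": {"srcIP": "10.0.0.1", "dstIP": "10.0.0.2"},
    "packets": 20,
}

# Pull nested values up into flat dimension columns so rollup can group on them
flat = {
    "timestamp": nested["timestamp"],
    "srcIP": nested["net"]["srcIP"],
    "dstIP": nested["net"]["dstIP"],
    "packets": nested["packets"],
}
print(json.dumps(flat))
```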

Learn more

See the following topics for more information: