# Compute results with approximation algorithms

Analyzing large datasets to produce exact counts consumes considerable compute resources.
Operations that don't require exact counts and can tolerate a certain error rate may benefit from a class of specialized approximation algorithms, called *sketches*.
Sketches are approximate, mergeable, and limited-size representations of potentially large amounts of input data that trade off accuracy for reduced storage and improved performance.
They use various randomization techniques for approximate distinct counting and produce results with mathematically proven error bounds.

Suppose you want to know the number of distinct users for your app without capturing the user ID in your query results. You can use sketches to produce an approximate aggregate for the number of unique users, which is orders of magnitude faster and more cost effective than computing raw data.

This topic provides an overview of sketches supported by Imply Polaris. All examples are based on the sample Wikipedia dataset.

## Types of sketches

Polaris relies on the Apache DataSketches library for implementation of approximation algorithms. The DataSketches library includes various types of sketches, each one designed for a different sort of problem. Polaris supports HyperLogLog (HLL) and Theta sketch modules.

### HyperLogLog

The HyperLogLog (HLL) sketch is a widely used algorithm for approximate distinct counting. During ingestion, the HLL aggregator creates HLL sketch objects. At query time, Polaris reads and merges the HLL sketch objects together returning approximate count of distinct values. For more information on HLL sketches, see DataSketches HLL Sketch module and Druid SQL HLL sketch functions.

The following example demonstrates how enabling the HLL sketch for the user column (in your schema) can save you 9,868 rows in storage after ingestion.

This query, which does not use sketches, returns 11,666 rows:

```
SELECT
APPROX_COUNT_DISTINCT(
CONCAT(
FLOOR(__time to hour),
COALESCE(countryName, ''),
COALESCE(cityName, ''),
COALESCE(user, '')
))
FROM wikipedia
```

This query, which uses the HLL distinct count sketch for the same data, returns 1,798 rows:

```
SELECT
APPROX_COUNT_DISTINCT_DS_HLL(
CONCAT(
FLOOR(__time to hour),
COALESCE(countryName, ''),
COALESCE(cityName, '')
))
FROM wikipedia
```

### Theta

The Theta sketch is an algorithm for approximate distinct counting with set intersection and set difference operations. Theta sketches let you count the overlap between sets. They are useful for retention analysis and order-independent funnel analysis, where the order of the stages is not part of the analysis. During ingestion, the Theta aggregator creates Theta sketch objects. At query time, Polaris reads and merges the Theta sketch objects together returning approximate count of unique entries. For more information on Theta sketches, see DataSketches Theta Sketch module and Druid SQL Theta sketch functions.

## When to use sketches

You can apply sketches during ingestion and at query time. Creating sketches at ingestion time summarizes input data, which improves rollup, reduces memory footprint, and increases query speed performance.

Although, you can query a sketch on a regular or sketch column, it is recommended that you apply sketches both during ingestion and at query time to improve query responsiveness and speed.

### Apply sketches at ingestion

The Polaris table schema only accepts sketch columns as measures in an aggregate table. If you plan to apply sketches at ingestion time, make sure to specify the aggregate table type when defining the table schema.

To apply sketches to table columns during ingestion in the Polaris UI, do the following:

Create an aggregate table.

On the

**Add data**page, where you map input fields to table columns, click**Edit**next to the measure you want to apply a sketch to.In the

**Data type**field, select the sketch type.In the

**Input expression**field, enter the appropriate sketch aggregation function.- For HLL sketches, use the HLL aggregator
`DS_HLL(expr, [lgK, tgtHllType])`

—for example,`DS_HLL("inputField")`

. - For Theta sketches, use the Theta aggregator
`DS_THETA(expr, [size])`

—for example,`DS_THETA("inputField")`

.

You can tune the size and accuracy of a sketch through the optional parameters passed into the sketch aggregator. For more information, see DataSketches HLL Sketch module and DataSketches Theta Sketch module.

The following screenshot shows the

`user`

column with the`HLLSketch`

data type and the`DS_HLL("user")`

input expression:- For HLL sketches, use the HLL aggregator

Note that Polaris displays sketch columns as Base64-encoded strings.

The following screenshot shows the aggregate table with the HLL sketch `user`

column after ingestion. Using the HLL sketch on the `user`

column, decreased the data size from 11.4 MiB to 6.58 MB.

### Use sketch functions for querying

You can call a sketch aggregation function to query a regular column. Calling a sketch aggregation function on a regular column returns the approximate cardinality of the column.

To query a sketch column, you can call sketch aggregation functions and sketch scalar functions.

The following sample query uses Theta sketches to analyze the sample Wikipedia dataset. This query examines how many users from time period A also appear in time period B.

```
SELECT
countryName,
THETA_SKETCH_ESTIMATE(
THETA_SKETCH_INTERSECT(
DS_THETA(user)
FILTER(WHERE __time >= TIMESTAMP '2016-06-27 00:00:00'
AND __time < TIMESTAMP '2016-06-27 12:00:00'),
DS_THETA(user)
FILTER(WHERE __time >= TIMESTAMP '2016-06-27 12:00:00'
AND __time < TIMESTAMP '2016-06-28 00:00:00')
)
) user_set_retained
FROM wikipedia
GROUP BY countryName
```

## Decide which sketch type to use

HLL and Theta sketches both support approximate distinct counting; however, the HLL sketch produces more accurate results and consumes less storage space. Theta sketches are more flexible but require significantly more memory. When choosing the best approximation algorithm for your use case, consider the following:

- If your case entails count distinct queries and merging sketch objects, use the HLL sketch.
- If you need to perform set intersection and set difference operations, use the Theta sketch.

Note that you cannot merge HLL sketches with Theta sketches.

## Relative error of sketches

Sketches are stochastic processes and the estimates produced are random variables that follow a normal distribution given the central limit theorem. For scenarios in which the error distribution is not normally distributed, see The Error Distribution is Not Gaussian.

Because sketches are probabilistic, they don't have a "maximum error." Sketch accuracy is usually measured in terms of relative error or expected error.

You can calculate error bounds for distinct count estimates on sketches using the following sketch scalar functions:

`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(sketchColumn, numStdDev)`

`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(sketchColumn, numStdDev)`

Each function takes a parameter for the number of standard deviations, which determines the confidence intervals. Choose the number of standard deviations based on the desired relative error. You can use relative error at the 99.73% confidence level as a reference benchmark; most of the time your error will be less than this value.

### HLL sketch relative error

For HLL sketches created in Polaris using the default size and accuracy parameters, the relationship applies for the number of standard deviations, confidence intervals, and relative error:

Number of standard deviations | Confidence intervals | HLL relative error |
---|---|---|

1 | 68.27% | 1.63% |

2 | 95.45% | 3.25% |

3 | 99.73% | 4.88% |

If you query the sketch column with 3 standard deviations for the 99.73% confidence interval, the average relative error of the distinct count result is within 4.88% of the true value.

### Theta sketch relative error

For Theta sketches created in Polaris using the default size and accuracy parameters, the relationship applies for the number of standard deviations, confidence intervals, and relative error:

Number of standard deviations | Confidence intervals | Theta relative error |
---|---|---|

1 | 68.27% | 0.78% |

2 | 95.45% | 1.56% |

3 | 99.73% | 2.34% |

### Comparing sketch relative error

While the relative errors for Theta sketches appear lower than HLL sketches, these reflect differences in the default parameters that control size and accuracy when creating sketches. The following values describe the default tuning parameters when creating sketches:

- For HLL sketches,
`lgK`

is 12. - For Theta sketches,
`size`

is 16,384.

The analogous log base 2 value for Theta sketches is `log2(16384) = 14`

, which is greater than 12.
For equivalent log base 2 of size values, both sketches have similar accuracy although Theta sketches take thirty times more memory and is slower than HLL sketches.

When deciding what sketch type to use, do not use accuracy as the deciding factor. Prioritize your query use cases; you can increase the sketch size for more accurate estimates.

For more details on relative error of sketches, see the following resources:

## Learn more

See the following topics for more information:

- Create an ingestion job for mapping and transforming ingestion-time input data with input expressions.
- Ingest and query sketches by API for ingesting and querying sketches programmatically.