# Compute results with approximation algorithms

Analyzing large datasets to produce exact counts consumes considerable compute resources. Operations that don't require exact counts and can tolerate a certain error rate may benefit from a class of specialized approximation algorithms, called sketches.

Sketches are approximate, mergeable, and limited-size representations of potentially large amounts of input data that trade off accuracy for reduced storage and improved performance. Sketches use various randomization techniques to produce approximate aggregates of data and deliver results with mathematically proven error bounds. Using sketches at ingestion and query time can significantly reduce processing costs and improve query performance.

This topic provides an overview of approximate algorithms supported by Imply Polaris. All examples are based on the sample Wikipedia dataset.

## Types of sketches

Polaris relies on the Apache DataSketches software library of stochastic streaming algorithms for implementation of sketch modules. The DataSketches library includes various types of sketches, each one designed to solve a different sort of problem. Polaris supports HyperLogLog (HLL) and Theta sketch modules.

### HyperLogLog

The HyperLogLog (HLL) sketch aggregator is a very compact sketch algorithm for approximate distinct counting. During ingestion, the HLL aggregator creates HLL sketch objects. At query time, Polaris reads and merges the HLL sketch objects together returning approximate count of distinct values. You can also use the HLL aggregator to calculate a union of HLL sketches. For more information on HLL sketches, see DataSketches HLL Sketch module and Druid SQL HLL sketch functions.

### Theta

The Theta sketch is a sketch algorithm for approximate distinct counting with set operations. Theta sketches let you count the overlap between sets, so that you can compute the union, intersection, or set difference between sketch objects. During ingestion, the Theta aggregator creates Theta sketch objects. At query time, Polaris reads and merges the Theta sketch objects together returning approximate count of unique entries. For more information on Theta sketches, see DataSketches Theta Sketch module and Druid SQL Theta sketch functions.

### Choosing a sketch type

HLL and Theta sketches both support approximate distinct counting; however, the HLL sketch produces more accurate results and consumes less storage space. Theta sketches are more flexible but require significantly more memory.

When choosing an approximation algorithm for your use case, consider the following:

- If your use case entails distinct counting and merging sketch objects, use the HLL sketch.
- If you need to evaluate union, intersection, or difference set operations, use the Theta sketch.

You cannot merge HLL sketches with Theta sketches.

## Ingest and query sketches

Polaris supports sketches at ingestion and query time. Using sketches at ingestion summarizes input data, which improves rollup, reduces memory footprint, and increases query performance. Using sketch functions at query time improves query responsiveness and speed. See sketch aggregation and scalar functions for reference on the functions to ingest and query sketches.

### Apply sketches at ingestion

Polaris does not support ingesting pre-generated sketches.

The Polaris table schema only accepts sketch columns as measures in an aggregate table.

To apply sketches to table columns during ingestion in the Polaris UI, do the following:

- Create an aggregate table.
- On the
**Insert data**page, where you map input fields to table columns, click the plus sign to add a new measure. - In the
**Measure**dialog, enter the column information:**Name**: Enter the column name.**Data type**: Choose between Theta sketch or HLL sketch data type.**Input expression**: Enter the appropriate sketch aggregation function. You can tune the size and accuracy of a sketch through the optional parameters passed into the sketch aggregator.- For HLL sketches, use the HLL aggregator
`DS_HLL(expr, [lgK, tgtHllType])`

—for example,`DS_HLL("inputField")`

. For more information, see DataSketches HLL Sketch module. - For Theta sketches, use the Theta aggregator
`DS_THETA(expr, [size])`

—for example,`DS_THETA("inputField")`

. For more information, see DataSketches Theta Sketch module.

- For HLL sketches, use the HLL aggregator

The following screenshot shows the `Sketches-demo`

aggregate table with the HLL sketch `HLL_user`

column after ingestion.
Note that Polaris displays sketch columns as Base64-encoded strings.

### Use sketch functions for querying

To use sketches during query in the Polaris UI, do the following:

On the

**Table detail**page, click**Query**in the top-right corner.Select

**SQL console**.The following screenshot shows the

**Query**dropdown menu:Enter your query in the SQL query tab.

Click

**Run**.

#### Sample queries

These examples are based on the sample Wikipedia dataset.

The following sample query uses the `APPROX_COUNT_DISTINCT_DS_HLL`

aggregation function to count distinct values of a regular column named `user`

:

```
SELECT APPROX_COUNT_DISTINCT_DS_HLL(user) AS unique_users
FROM "Sketches-demo"
WHERE isRobot = 'false'
```

The following sample query uses the `HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS`

function to return the distinct count estimate with error bounds:

```
SELECT HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(
DS_HLL(user)
) AS error_bounds
FROM "Sketches-demo"
WHERE isRobot = 'false'
```

The following sample query uses the `HLL_SKETCH_UNION`

function to calculate the number of users who visited pages in the main namespace of the Spanish and Catalan editions of Wikipedia:

```
SELECT HLL_SKETCH_ESTIMATE(
HLL_SKETCH_UNION(
DS_HLL(user) FILTER(WHERE "channel" = '#en.wikipedia' AND "namespace" = 'Main'),
DS_HLL(user) FILTER(WHERE "channel" = '#ca.wikipedia' AND "namespace" = 'Main')
) AS users
FROM "Sketches-demo"
```

The following sample query examines how many users from one time period, `2016-06-27 00:00:00 to 2016-06-27 12:00:00`

, also appear in another time period, `2016-06-27 12:00:00 to 2016-06-28 00:00:00`

:

```
SELECT
THETA_SKETCH_ESTIMATE(
THETA_SKETCH_INTERSECT(
DS_THETA(user)
FILTER(WHERE __time >= TIMESTAMP '2016-06-27 00:00:00'
AND __time < TIMESTAMP '2016-06-27 12:00:00'),
DS_THETA(user)
FILTER(WHERE __time >= TIMESTAMP '2016-06-27 12:00:00'
AND __time < TIMESTAMP '2016-06-28 00:00:00')
)
)
FROM "Sketches-demo"
```

## Relative error of sketches

Sketches are stochastic processes and the estimates produced are random variables that follow a normal distribution given the central limit theorem. For scenarios in which the error distribution is not normally distributed, see The Error Distribution is Not Gaussian.

Because sketches are probabilistic, they don't have a "maximum error." Sketch accuracy is usually measured in terms of relative error or expected error.

You can calculate error bounds for distinct count estimates on sketches using the following sketch scalar functions:

`HLL_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(sketchColumn, numStdDev)`

`THETA_SKETCH_ESTIMATE_WITH_ERROR_BOUNDS(sketchColumn, numStdDev)`

Each function takes a parameter for the number of standard deviations, which determines the confidence intervals. Choose the number of standard deviations based on the desired relative error. You can use relative error at the 99.73% confidence level as a reference benchmark; most of the time your error will be less than this value.

### HLL sketch relative error

For HLL sketches created in Polaris using the default size and accuracy parameters, the relationship applies for the number of standard deviations, confidence intervals, and relative error:

Number of standard deviations | Confidence intervals | HLL relative error |
---|---|---|

1 | 68.27% | 1.63% |

2 | 95.45% | 3.25% |

3 | 99.73% | 4.88% |

If you query the sketch column with 3 standard deviations for the 99.73% confidence interval, the average relative error of the distinct count result is within 4.88% of the true value.

### Theta sketch relative error

For Theta sketches created in Polaris using the default size and accuracy parameters, the relationship applies for the number of standard deviations, confidence intervals, and relative error:

Number of standard deviations | Confidence intervals | Theta relative error |
---|---|---|

1 | 68.27% | 0.78% |

2 | 95.45% | 1.56% |

3 | 99.73% | 2.34% |

### Comparing sketch relative error

While the relative errors for Theta sketches appear lower than HLL sketches, these reflect differences in the default parameters that control size and accuracy when creating sketches. The following values describe the default tuning parameters when creating sketches:

- For HLL sketches,
`lgK`

is 12. - For Theta sketches,
`size`

is 16,384.

The analogous log base 2 value for Theta sketches is `log2(16384) = 14`

, which is greater than 12.
For equivalent log base 2 of size values, both sketches have similar accuracy although Theta sketches take thirty times more memory and is slower than HLL sketches.

When deciding what sketch type to use, do not use accuracy as the deciding factor. Prioritize your query use cases; you can increase the sketch size for more accurate estimates.

For more details on relative error of sketches, see the following resources:

## Learn more

See the following topics for more information:

- Create an ingestion job for mapping and transforming ingestion-time input data with input expressions.
- Ingest and query sketches by API for ingesting and querying sketches programmatically.