Skip to main content

Ingest and query sketches by API

Imply Polaris uses HyperLogLog (HLL), Theta, and Quantiles sketches to calculate fast approximations of distinct count queries and estimate the distribution of values.

A sketch provides an approximate summary of the data. Polaris uses this summary to perform operations with greater memory efficiency and faster query processing than it could achieve using the raw data. Polaris leverages Apache DataSketches to implement HLL, Theta, and Quantiles sketches.

When you select a sketch type, consider your target query. For example, Theta sketches offer set operations such as union and intersection; however, Theta sketches are more memory intensive than HLL sketches.

Keep the following in mind when building sketches from raw data in Polaris:

  • Sketches only apply to aggregate tables, not detail tables.
  • A sketch column is a measure column with the data type HLLSketch, thetaSketch, or quantilesDoublesSketch.
  • You specify the sketch creation function as a mapping in the ingestion job spec. For example, DS_HLL(\"input_field\"), DS_THETA(\"input_field\"), or DS_QUANTILES_SKETCH(\"input_field\", 128).

This topic explains how to ingest data as Theta and Quantiles sketches and query the sketched data. The examples use the following data:

date,uid,show,episode,minutes_watched
2022-05-19,alice,Game of Thrones,S1E1,54
2022-05-19,alice,Game of Thrones,S1E2,41
2022-05-19,alice,Game of Thrones,S1E1,2
2022-05-19,bob,Bridgerton,S1E1,61
2022-05-20,alice,Game of Thrones,S1E1,5
2022-05-20,carol,Bridgerton,S1E2,57
2022-05-20,dan,Bridgerton,S1E1,50
2022-05-21,alice,Game of Thrones,S1E1,10
2022-05-21,carol,Bridgerton,S1E1,55
2022-05-21,erin,Game of Thrones,S1E1,49
2022-05-21,alice,Bridgerton,S1E1,56
2022-05-22,bob,Game of Thrones,S1E1,51
2022-05-22,bob,Bridgerton,S1E1,6
2022-05-22,carol,Bridgerton,S1E2,19
2022-05-22,bob,Bridgerton,S1E1,15
2022-05-22,erin,Game of Thrones,S1E1,8
2022-05-22,erin,Bridgerton,S1E2,52
2022-05-23,erin,Game of Thrones,S1E1,13
2022-05-23,alice,Game of Thrones,S1E1,7

Prerequisites

To complete the steps in this topic, you need:

  • Data in the file shows.csv. Upload this file to your Polaris staging area. To upload files using the Files API, visit Upload files.

  • An API key with the ManageTables, ManageIngestionJobs, and AccessQueries permissions.
    In the examples below, Polaris reads the key value from the variable named POLARIS_API_KEY. For information about how to obtain an API key and assign permissions, see API key authentication. For more information on permissions, visit Permissions reference.

You do not have to create a table before starting an ingestion job. When you set createTableIfNotExists to true in the ingestion job spec, Polaris automatically determines the table attributes from the job spec. For details, see Automatically created tables.

Start an ingestion job

Send a POST request to the /v1/projects/PROJECT_ID/jobs endpoint to create an ingestion job. In the request payload, specify the data source and name of the table to store the ingested data.

The ingestion job spec in this example also directs Polaris to automatically create the table if it doesn't exist. Polaris creates the table with the following columns:

  • __time: the primary timestamp of the table
  • theta_uid: a Theta sketch column representing user IDs
  • show: a string column with TV show names
  • episode: a string column with season and episode identifiers
  • quantiles_minutes_watched: a Quantiles sketch column representing minutes watched
  • max_minutes_watched: a column of long types containing the maximum number of minutes watched for each rolled up row

See the Job v1 API documentation for a description of required parameters.

Sample request

The following example request creates an ingestion job for the Sketch demo table using the shows.csv file.

curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/jobs" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--header "Content-Type: application/json" \
--data-raw '{
"type": "batch",
"target": {
"type": "table",
"tableName": "Sketch demo"
},
"createTableIfNotExists": true,
"source": {
"type": "uploaded",
"fileList": [
"shows.csv"
],
"inputSchema": [
{
"name": "date",
"dataType": "string"
},
{
"name": "uid",
"dataType": "string"
},
{
"name": "show",
"dataType": "string"
},
{
"name": "episode",
"dataType": "string"
},
{
"name": "minutes_watched",
"dataType": "long"
}
],
"formatSettings": {
"format": "csv"
}
},
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", '\''T02:47:05.474Z'\''))"
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\""
},
{
"columnName": "episode",
"expression": "\"episode\""
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
}
]
}'

The TIME_PARSE expression in the job request combines a time value to the date field to provide a correctly formatted timestamp for Polaris. For more details on parsing and transforming timestamps, see Timestamp expressions.

Sample response

A successful request returns a 201 Created response and the ingestion job specification.

Click to view the response
{
"source": {
"fileList": [
"shows.csv"
],
"formatSettings": {
"columns": [],
"delimiter": null,
"listDelimiter": null,
"skipHeaderRows": 0,
"format": "csv"
},
"inputSchema": [
{
"dataType": "string",
"name": "date"
},
{
"dataType": "string",
"name": "uid"
},
{
"dataType": "string",
"name": "show"
},
{
"dataType": "string",
"name": "episode"
},
{
"dataType": "long",
"name": "minutes_watched"
}
],
"type": "uploaded"
},
"filterExpression": null,
"ingestionMode": "append",
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z'))",
"isAggregation": null
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\"",
"isAggregation": null
},
{
"columnName": "episode",
"expression": "\"episode\"",
"isAggregation": null
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
},
{
"columnName": "__count",
"expression": "COUNT(*)",
"isAggregation": true
}
],
"maxParseExceptions": 2147483647,
"query": "INSERT INTO \"Sketch demo\"\nSELECT\n TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z')) AS \"__time\",\n \"show\" AS \"show\",\n \"episode\" AS \"episode\",\n DS_THETA(\"uid\") AS \"theta_uid\",\n DS_QUANTILES_SKETCH(\"minutes_watched\", 128) AS \"quantiles_minutes_watched\",\n MAX(\"minutes_watched\") AS \"max_minutes_watched\",\n COUNT(*) AS \"__count\"\nFROM TABLE(\n POLARIS_SOURCE(\n '{\"fileList\":[\"shows.csv\"],\"formatSettings\":{\"columns\":[],\"skipHeaderRows\":0,\"format\":\"csv\"},\"inputSchema\":[{\"dataType\":\"string\",\"name\":\"date\"},{\"dataType\":\"string\",\"name\":\"uid\"},{\"dataType\":\"string\",\"name\":\"show\"},{\"dataType\":\"string\",\"name\":\"episode\"},{\"dataType\":\"long\",\"name\":\"minutes_watched\"}],\"type\":\"uploaded\"}'\n )\n)\n\nGROUP BY 1, 2, 3\nPARTITIONED BY DAY",
"createdBy": {
"username": "api-key-pok_vipgj...bjjvyo",
"userId": "a52cacf6-3ddc-48e5-8675-xxxxxxxxxxxx"
},
"createdTimestamp": "2023-09-12T00:09:24.842822735Z",
"desiredExecutionStatus": "running",
"executionStatus": "pending",
"health": {
"status": "ok"
},
"id": "018a86b8-ae6b-7535-a98d-1960492581dd",
"lastModifiedBy": {
"username": "api-key-pok_vipgj...bjjvyo",
"userId": "a52cacf6-3ddc-48e5-8675-xxxxxxxxxxxx"
},
"lastUpdatedTimestamp": "2023-09-12T00:09:24.842822735Z",
"spec": {
"source": {
"fileList": [
"shows.csv"
],
"formatSettings": {
"columns": [],
"delimiter": null,
"listDelimiter": null,
"skipHeaderRows": 0,
"format": "csv"
},
"inputSchema": [
{
"dataType": "string",
"name": "date"
},
{
"dataType": "string",
"name": "uid"
},
{
"dataType": "string",
"name": "show"
},
{
"dataType": "string",
"name": "episode"
},
{
"dataType": "long",
"name": "minutes_watched"
}
],
"type": "uploaded"
},
"target": {
"tableName": "Sketch demo",
"type": "table",
"intervals": []
},
"createTableIfNotExists": true,
"filterExpression": null,
"ingestionMode": "append",
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z'))",
"isAggregation": null
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\"",
"isAggregation": null
},
{
"columnName": "episode",
"expression": "\"episode\"",
"isAggregation": null
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
},
{
"columnName": "__count",
"expression": "COUNT(*)",
"isAggregation": true
}
],
"maxParseExceptions": 2147483647,
"type": "batch",
"desiredExecutionStatus": "running"
},
"target": {
"tableName": "Sketch demo",
"type": "table",
"intervals": []
},
"type": "batch",
"completedTimestamp": null,
"startedTimestamp": null
}

Query the Theta sketch

Use the Query API to query the data in your table.

The following example queries the number of users, represented by the theta_uid Theta sketch column, who have watched either episode "S1E1" or "S1E2" of Bridgerton:

SELECT THETA_SKETCH_ESTIMATE(
THETA_SKETCH_UNION(
DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton' AND "episode" = 'S1E1'),
DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton' AND "episode" = 'S1E2')
)
) AS users
FROM "Sketch demo"

Sample request

The following example queries the Sketch demo table:

curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql" \
--header "Content-Type: application/json" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--data-raw '{
"query": "SELECT THETA_SKETCH_ESTIMATE(THETA_SKETCH_UNION(DS_THETA(theta_uid) FILTER(WHERE \"show\" = '\''Bridgerton'\'' AND \"episode\" = '\''S1E1'\''), DS_THETA(theta_uid) FILTER(WHERE \"show\" = '\''Bridgerton'\'' AND \"episode\" = '\''S1E2'\''))) AS users FROM \"Sketch demo\""
}'

Sample response

A successful request returns a 200 OK response code and the query result, for example:

[
{
"users": 5.0
}
]

Query the Quantiles sketch

The following example computes the median (k=0.5) and the third quartile (k=0.75) of minutes watched:

SELECT
APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.5) AS median_m,
APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.75) AS quartile_m
FROM "Sketch demo"

Sample request

The following example queries the Sketch demo table:

curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql" \
--header "Content-Type: application/json" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--data-raw '{"query": "SELECT APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.5) AS median_m, APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.75) AS quartile_m FROM \"Sketch demo\""
}'

Sample response

A successful request returns a 200 OK response code and the query result:

[
{
"median_m": 41.0,
"quartile_m": 54.0
}
]

In this example, the median amount of time spent watching the shows was 41 minutes, and 75% of the sessions watched were below 54 minutes.

Learn more

See the following topics for more information: