Ingest and query sketches by API
Imply Polaris uses HyperLogLog (HLL), Theta, and Quantiles sketches to calculate fast approximations of distinct count queries and estimate the distribution of values.
A sketch provides an approximate summary of the data. Polaris uses this summary to perform operations with greater memory efficiency and faster query processing than it could achieve using the raw data. Polaris leverages Apache DataSketches to implement HLL, Theta, and Quantiles sketches.
When you select a sketch type, consider your target query. For example, Theta sketches offer set operations such as union and intersection; however, Theta sketches are more memory intensive than HLL sketches.
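To illustrate the distinct-count idea behind these sketches, here is a toy K-Minimum-Values sketch in pure Python. This is illustrative only; Polaris uses the Apache DataSketches implementations, not code like this:

```python
import hashlib

def kmv_estimate(items, k=64):
    """Approximate a distinct count with a K-Minimum-Values sketch:
    hash every item to [0, 1) and keep only the k smallest distinct
    hash values as the fixed-size summary."""
    hashes = set()
    for item in items:
        digest = hashlib.sha256(str(item).encode()).digest()
        hashes.add(int.from_bytes(digest[:8], "big") / 2**64)
    smallest = sorted(hashes)[:k]
    if len(smallest) < k:
        # Fewer than k distinct values seen: the count is exact.
        return len(smallest)
    # The k-th smallest hash estimates the density of distinct values.
    return (k - 1) / smallest[-1]

views = ["alice", "bob", "carol", "dan", "erin", "alice", "bob"]
print(kmv_estimate(views))  # 5 (exact, since fewer than k distinct values)
```

The summary stays a fixed size no matter how many rows stream through it, which is what makes sketch columns cheaper to store and query than raw values.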
Keep the following in mind when building sketches from raw data in Polaris:
- Sketches only apply to aggregate tables, not detail tables.
- A sketch column is a measure column with the data type `HLLSketch`, `thetaSketch`, or `quantilesDoublesSketch`.
- You specify the sketch creation function as a mapping in the ingestion job spec. For example, `DS_HLL(\"input_field\")`, `DS_THETA(\"input_field\")`, or `DS_QUANTILES_SKETCH(\"input_field\", 128)`.
This topic explains how to ingest data as Theta and Quantiles sketches and query the sketched data. The examples use the following data:
date,uid,show,episode,minutes_watched
2022-05-19,alice,Game of Thrones,S1E1,54
2022-05-19,alice,Game of Thrones,S1E2,41
2022-05-19,alice,Game of Thrones,S1E1,2
2022-05-19,bob,Bridgerton,S1E1,61
2022-05-20,alice,Game of Thrones,S1E1,5
2022-05-20,carol,Bridgerton,S1E2,57
2022-05-20,dan,Bridgerton,S1E1,50
2022-05-21,alice,Game of Thrones,S1E1,10
2022-05-21,carol,Bridgerton,S1E1,55
2022-05-21,erin,Game of Thrones,S1E1,49
2022-05-21,alice,Bridgerton,S1E1,56
2022-05-22,bob,Game of Thrones,S1E1,51
2022-05-22,bob,Bridgerton,S1E1,6
2022-05-22,carol,Bridgerton,S1E2,19
2022-05-22,bob,Bridgerton,S1E1,15
2022-05-22,erin,Game of Thrones,S1E1,8
2022-05-22,erin,Bridgerton,S1E2,52
2022-05-23,erin,Game of Thrones,S1E1,13
2022-05-23,alice,Game of Thrones,S1E1,7
Prerequisites
To complete the steps in this topic, you need:
- Data in the file `shows.csv`. Upload this file to your Polaris staging area. To upload files using the Files API, see Upload files.
- An API key with the `ManageTables`, `ManageIngestionJobs`, and `AccessQueries` permissions. In the examples below, Polaris reads the key value from the environment variable named `POLARIS_API_KEY`. For information about how to obtain an API key and assign permissions, see API key authentication. For more information on permissions, see Permissions reference.

You do not have to create a table before starting an ingestion job. When you set `createTableIfNotExists` to `true` in the ingestion job spec, Polaris automatically determines the table attributes from the job spec. For details, see Automatically created tables.
Start an ingestion job
Send a `POST` request to the `/v1/projects/PROJECT_ID/jobs` endpoint to create an ingestion job.
In the request payload, specify the data source and name of the table to store the ingested data.
The ingestion job spec in this example also directs Polaris to automatically create the table if it doesn't exist. Polaris creates the table with the following columns:
- `__time`: the primary timestamp of the table
- `theta_uid`: a Theta sketch column representing user IDs
- `show`: a string column with TV show names
- `episode`: a string column with season and episode identifiers
- `quantiles_minutes_watched`: a Quantiles sketch column representing minutes watched
- `max_minutes_watched`: a long column containing the maximum number of minutes watched for each rolled-up row
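To make the rollup concrete, here is a hypothetical sketch of the grouping Polaris performs (not its internal storage format): rows sharing the same day, show, and episode collapse into one aggregate row, and each measure column summarizes the collapsed rows:

```python
from collections import defaultdict

# A few rows from shows.csv: (date, uid, show, episode, minutes_watched)
rows = [
    ("2022-05-19", "alice", "Game of Thrones", "S1E1", 54),
    ("2022-05-19", "alice", "Game of Thrones", "S1E2", 41),
    ("2022-05-19", "alice", "Game of Thrones", "S1E1", 2),
    ("2022-05-19", "bob",   "Bridgerton",      "S1E1", 61),
]

# Group by the dimensions (day, show, episode); aggregate the measures.
grouped = defaultdict(list)
for date, uid, show, episode, minutes in rows:
    grouped[(date, show, episode)].append((uid, minutes))

for key, vals in sorted(grouped.items()):
    uids = {uid for uid, _ in vals}        # a Theta sketch would summarize these
    max_minutes = max(m for _, m in vals)  # stored in max_minutes_watched
    print(key, len(uids), max_minutes)
```

The two "Game of Thrones, S1E1" rows on 2022-05-19 collapse into a single row whose `max_minutes_watched` is 54.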
See the Job v1 API documentation for a description of required parameters.
Sample request
The following example request creates an ingestion job for the `Sketch demo` table using the `shows.csv` file.
cURL:
curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/jobs" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--header "Content-Type: application/json" \
--data-raw '{
"type": "batch",
"target": {
"type": "table",
"tableName": "Sketch demo"
},
"createTableIfNotExists": true,
"source": {
"type": "uploaded",
"fileList": [
"shows.csv"
],
"inputSchema": [
{
"name": "date",
"dataType": "string"
},
{
"name": "uid",
"dataType": "string"
},
{
"name": "show",
"dataType": "string"
},
{
"name": "episode",
"dataType": "string"
},
{
"name": "minutes_watched",
"dataType": "long"
}
],
"formatSettings": {
"format": "csv"
}
},
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", '\''T02:47:05.474Z'\''))"
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\""
},
{
"columnName": "episode",
"expression": "\"episode\""
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
}
]
}'
Python:

import os
import requests
import json
url = "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/jobs"
apikey = os.getenv("POLARIS_API_KEY")
payload = json.dumps({
"type": "batch",
"target": {
"type": "table",
"tableName": "Sketch demo"
},
"createTableIfNotExists": True,
"source": {
"type": "uploaded",
"fileList": [
"shows.csv"
],
"inputSchema": [
{
"name": "date",
"dataType": "string"
},
{
"name": "uid",
"dataType": "string"
},
{
"name": "show",
"dataType": "string"
},
{
"name": "episode",
"dataType": "string"
},
{
"name": "minutes_watched",
"dataType": "long"
}
],
"formatSettings": {
"format": "csv"
}
},
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z'))"
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": True
},
{
"columnName": "show",
"expression": "\"show\""
},
{
"columnName": "episode",
"expression": "\"episode\""
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": True
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": True
}
]
})
headers = {
'Authorization': f'Basic {apikey}',
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
The `TIME_PARSE` expression in the job request appends a fixed time value to the `date` field to produce a correctly formatted timestamp for Polaris.
For more details on parsing and transforming timestamps, see Timestamp expressions.
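In plain Python terms (illustrative only; Polaris evaluates the expression itself during ingestion), the expression concatenates the literal time suffix onto each `date` value before parsing:

```python
from datetime import datetime

# CONCAT("date", 'T02:47:05.474Z') turns a date-only string into a full
# ISO 8601 timestamp that TIME_PARSE can interpret.
date_field = "2022-05-19"                  # value of the "date" input field
timestamp = date_field + "T02:47:05.474Z"
print(timestamp)  # 2022-05-19T02:47:05.474Z

# Verify it parses as ISO 8601. Python 3.11+ accepts the trailing 'Z'
# directly; replace it with an explicit offset here for portability.
parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
print(parsed.year, parsed.hour)  # 2022 2
```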
Sample response
A successful request returns a `201 Created` response and the ingestion job specification.
{
"source": {
"fileList": [
"shows.csv"
],
"formatSettings": {
"columns": [],
"delimiter": null,
"listDelimiter": null,
"skipHeaderRows": 0,
"format": "csv"
},
"inputSchema": [
{
"dataType": "string",
"name": "date"
},
{
"dataType": "string",
"name": "uid"
},
{
"dataType": "string",
"name": "show"
},
{
"dataType": "string",
"name": "episode"
},
{
"dataType": "long",
"name": "minutes_watched"
}
],
"type": "uploaded"
},
"filterExpression": null,
"ingestionMode": "append",
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z'))",
"isAggregation": null
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\"",
"isAggregation": null
},
{
"columnName": "episode",
"expression": "\"episode\"",
"isAggregation": null
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
},
{
"columnName": "__count",
"expression": "COUNT(*)",
"isAggregation": true
}
],
"maxParseExceptions": 2147483647,
"query": "INSERT INTO \"Sketch demo\"\nSELECT\n TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z')) AS \"__time\",\n \"show\" AS \"show\",\n \"episode\" AS \"episode\",\n DS_THETA(\"uid\") AS \"theta_uid\",\n DS_QUANTILES_SKETCH(\"minutes_watched\", 128) AS \"quantiles_minutes_watched\",\n MAX(\"minutes_watched\") AS \"max_minutes_watched\",\n COUNT(*) AS \"__count\"\nFROM TABLE(\n POLARIS_SOURCE(\n '{\"fileList\":[\"shows.csv\"],\"formatSettings\":{\"columns\":[],\"skipHeaderRows\":0,\"format\":\"csv\"},\"inputSchema\":[{\"dataType\":\"string\",\"name\":\"date\"},{\"dataType\":\"string\",\"name\":\"uid\"},{\"dataType\":\"string\",\"name\":\"show\"},{\"dataType\":\"string\",\"name\":\"episode\"},{\"dataType\":\"long\",\"name\":\"minutes_watched\"}],\"type\":\"uploaded\"}'\n )\n)\n\nGROUP BY 1, 2, 3\nPARTITIONED BY DAY",
"createdBy": {
"username": "api-key-pok_vipgj...bjjvyo",
"userId": "a52cacf6-3ddc-48e5-8675-xxxxxxxxxxxx"
},
"createdTimestamp": "2023-09-12T00:09:24.842822735Z",
"desiredExecutionStatus": "running",
"executionStatus": "pending",
"health": {
"status": "ok"
},
"id": "018a86b8-ae6b-7535-a98d-1960492581dd",
"lastModifiedBy": {
"username": "api-key-pok_vipgj...bjjvyo",
"userId": "a52cacf6-3ddc-48e5-8675-xxxxxxxxxxxx"
},
"lastUpdatedTimestamp": "2023-09-12T00:09:24.842822735Z",
"spec": {
"source": {
"fileList": [
"shows.csv"
],
"formatSettings": {
"columns": [],
"delimiter": null,
"listDelimiter": null,
"skipHeaderRows": 0,
"format": "csv"
},
"inputSchema": [
{
"dataType": "string",
"name": "date"
},
{
"dataType": "string",
"name": "uid"
},
{
"dataType": "string",
"name": "show"
},
{
"dataType": "string",
"name": "episode"
},
{
"dataType": "long",
"name": "minutes_watched"
}
],
"type": "uploaded"
},
"target": {
"tableName": "Sketch demo",
"type": "table",
"intervals": []
},
"createTableIfNotExists": true,
"filterExpression": null,
"ingestionMode": "append",
"mappings": [
{
"columnName": "__time",
"expression": "TIME_PARSE(CONCAT(\"date\", 'T02:47:05.474Z'))",
"isAggregation": null
},
{
"columnName": "theta_uid",
"expression": "DS_THETA(\"uid\")",
"isAggregation": true
},
{
"columnName": "show",
"expression": "\"show\"",
"isAggregation": null
},
{
"columnName": "episode",
"expression": "\"episode\"",
"isAggregation": null
},
{
"columnName": "quantiles_minutes_watched",
"expression": "DS_QUANTILES_SKETCH(\"minutes_watched\", 128)",
"isAggregation": true
},
{
"columnName": "max_minutes_watched",
"expression": "MAX(\"minutes_watched\")",
"isAggregation": true
},
{
"columnName": "__count",
"expression": "COUNT(*)",
"isAggregation": true
}
],
"maxParseExceptions": 2147483647,
"type": "batch",
"desiredExecutionStatus": "running"
},
"target": {
"tableName": "Sketch demo",
"type": "table",
"intervals": []
},
"type": "batch",
"completedTimestamp": null,
"startedTimestamp": null
}
Query the Theta sketch
Use the Query API to query the data in your table.
The following example queries the number of users, represented by the `theta_uid` Theta sketch column, who watched either episode "S1E1" or "S1E2" of Bridgerton:
SELECT THETA_SKETCH_ESTIMATE(
THETA_SKETCH_UNION(
DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton' AND "episode" = 'S1E1'),
DS_THETA(theta_uid) FILTER(WHERE "show" = 'Bridgerton' AND "episode" = 'S1E2')
)
) AS users
FROM "Sketch demo"
Sample request
The following example queries the `Sketch demo` table:
cURL:
curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql" \
--header "Content-Type: application/json" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--data-raw '{
"query": "SELECT THETA_SKETCH_ESTIMATE(THETA_SKETCH_UNION(DS_THETA(theta_uid) FILTER(WHERE \"show\" = '\''Bridgerton'\'' AND \"episode\" = '\''S1E1'\''), DS_THETA(theta_uid) FILTER(WHERE \"show\" = '\''Bridgerton'\'' AND \"episode\" = '\''S1E2'\''))) AS users FROM \"Sketch demo\""
}'
Python:

import os
import requests
import json
url = "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql"
apikey = os.getenv("POLARIS_API_KEY")
payload = json.dumps({
"query": "SELECT THETA_SKETCH_ESTIMATE(THETA_SKETCH_UNION(DS_THETA(theta_uid) FILTER(WHERE \"show\" = 'Bridgerton' AND \"episode\" = 'S1E1'), DS_THETA(theta_uid) FILTER(WHERE \"show\" = 'Bridgerton' AND \"episode\" = 'S1E2'))) AS users FROM \"Sketch demo\""
})
headers = {
'Authorization': f'Basic {apikey}',
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Sample response
A successful request returns a 200 OK
response code and the query result, for example:
[
{
"users": 5.0
}
]
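For this small dataset you can confirm the estimate with an exact distinct count over the raw `shows.csv` rows. This is a sanity check only; Theta sketch estimates are exact until the number of distinct values exceeds the sketch's configured size:

```python
import csv
import io

# The shows.csv sample data from this topic.
raw = """date,uid,show,episode,minutes_watched
2022-05-19,alice,Game of Thrones,S1E1,54
2022-05-19,alice,Game of Thrones,S1E2,41
2022-05-19,alice,Game of Thrones,S1E1,2
2022-05-19,bob,Bridgerton,S1E1,61
2022-05-20,alice,Game of Thrones,S1E1,5
2022-05-20,carol,Bridgerton,S1E2,57
2022-05-20,dan,Bridgerton,S1E1,50
2022-05-21,alice,Game of Thrones,S1E1,10
2022-05-21,carol,Bridgerton,S1E1,55
2022-05-21,erin,Game of Thrones,S1E1,49
2022-05-21,alice,Bridgerton,S1E1,56
2022-05-22,bob,Game of Thrones,S1E1,51
2022-05-22,bob,Bridgerton,S1E1,6
2022-05-22,carol,Bridgerton,S1E2,19
2022-05-22,bob,Bridgerton,S1E1,15
2022-05-22,erin,Game of Thrones,S1E1,8
2022-05-22,erin,Bridgerton,S1E2,52
2022-05-23,erin,Game of Thrones,S1E1,13
2022-05-23,alice,Game of Thrones,S1E1,7
"""

# Exact equivalent of the Theta query: distinct uids who watched
# Bridgerton S1E1 or S1E2.
rows = csv.DictReader(io.StringIO(raw))
users = {row["uid"] for row in rows
         if row["show"] == "Bridgerton" and row["episode"] in ("S1E1", "S1E2")}
print(sorted(users))  # ['alice', 'bob', 'carol', 'dan', 'erin']
print(len(users))     # 5, matching the sketch estimate of 5.0
```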
Query the Quantiles sketch
The following example computes the median (quantile rank 0.5) and the third quartile (rank 0.75) of minutes watched:
SELECT
APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.5) AS median_m,
APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.75) AS quartile_m
FROM "Sketch demo"
Sample request
The following example queries the `Sketch demo` table:
cURL:
curl --location --request POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql" \
--header "Content-Type: application/json" \
--header "Authorization: Basic $POLARIS_API_KEY" \
--data-raw '{"query": "SELECT APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.5) AS median_m, APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.75) AS quartile_m FROM \"Sketch demo\""
}'
Python:

import os
import requests
import json
url = "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/query/sql"
apikey = os.getenv("POLARIS_API_KEY")
payload = json.dumps({
"query": "SELECT APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.5) AS median_m, APPROX_QUANTILE_DS(quantiles_minutes_watched, 0.75) AS quartile_m FROM \"Sketch demo\""
})
headers = {
'Authorization': f'Basic {apikey}',
'Content-Type': 'application/json'
}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
Sample response
A successful request returns a 200 OK
response code and the query result:
[
{
"median_m": 41.0,
"quartile_m": 54.0
}
]
In this example, the median time spent watching the shows was 41 minutes, and about 75% of viewing sessions lasted 54 minutes or less.
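You can sanity-check these results against the raw `minutes_watched` values. For this small dataset, the sketch output matches the value at index `floor(rank * n)` of the sorted list; the exact rank convention a sketch library uses can differ at boundaries, so treat this as an illustration rather than the library's definition:

```python
# All 19 minutes_watched values from shows.csv, in row order.
minutes = [54, 41, 2, 61, 5, 57, 50, 10, 55, 49, 56, 51, 6, 19, 15, 8, 52, 13, 7]
values = sorted(minutes)

def quantile(vals, rank):
    """Value at the given rank using the floor(rank * n) convention,
    clamped to the last index for rank 1.0."""
    return vals[min(int(rank * len(vals)), len(vals) - 1)]

print(quantile(values, 0.5))   # 41, the median
print(quantile(values, 0.75))  # 54, the third quartile
```

Both values agree with the `median_m` and `quartile_m` results returned by the sketch query.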
Learn more
See the following topics for more information:
- Compute results with cardinality sketches for information on sketches in Polaris.
- Tables v1 API for information on creating and managing tables.
- Files v1 API for reference on uploading files.
- Jobs v1 API for information on creating and managing ingestion jobs.