Input expressions reference
Ingestion jobs in Imply Polaris map input fields to table columns. An ingestion job mapping takes an input expression as well as the name of an output column. An input expression describes how to compute the value of a table's column from the input fields. An input expression can map an input field as is or apply one or more Apache Druid® SQL functions to transform the data.
Some use cases require complex data types, such as cardinality sketches or IP addresses. These data types may require specific transforms during ingestion to properly parse and store the data.
This topic lists the Druid SQL functions you can use in input expressions. It also references the specific input expressions required to ingest complex data types.
For information and examples on input expressions in ingestion jobs, see Map and transform data with input expressions.
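For example, in the SQL form of an ingestion job, each expression in the SELECT list is an input expression and its alias names the output column. The following minimal sketch assumes hypothetical input fields named "country" and "city"; it maps "country" as is and transforms "city" with a scalar function. The source clause is elided, as in the other examples in this topic.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
"country" AS "country",
UPPER("city") AS "city_upper"
FROM ...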
Input expressions for dimensions
The following resources list the operations supported in input expressions for dimensions:
- Druid SQL operators
- Druid SQL scalar functions
- Druid SQL JSON functions for nested data
- Druid SQL multi-value string functions for multi-value string dimensions
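For example, the following sketch combines a scalar function and a JSON function in dimension input expressions. The input fields "user_agent" and "attributes" are hypothetical, and the source clause is elided.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
LOWER("user_agent") AS "user_agent",
JSON_VALUE("attributes", '$.country') AS "country"
FROM ...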
Input expressions for measures
If the table column is a measure, the input expression must use one of the following aggregation functions:
- COUNT(*): Counts the number of rows.
- SUM(expr): Sums numbers.
- MIN(expr): Takes the minimum of numbers.
- MAX(expr): Takes the maximum of numbers.
- Functions to measure dispersion:
  - VARIANCE(expr) or VAR_SAMP(expr): Computes the sample variance of numbers.
  - VAR_POP(expr): Computes the population variance of numbers.
- Functions to estimate distinct counts:
  - COUNT(DISTINCT expr): Counts the number of distinct values of expr. In Polaris, this is the same as APPROX_COUNT_DISTINCT and APPROX_COUNT_DISTINCT_DS_HLL. Polaris uses the HLL approximation algorithm to estimate distinct values. To compute distinct count estimates from sketches, see Query sketched data.
  - APPROX_COUNT_DISTINCT_DS_THETA: Counts the number of distinct values using Theta sketches. Theta sketches require more memory than HLL sketches.
- Sketch creation functions. For more information, see Cardinality sketches and Quantiles sketches.
  - DS_HLL(expr, [lgK, tgtHllType]): Creates an HLL sketch on the values of the expression.
  - DS_THETA(expr, [size]): Creates a Theta sketch on the values of the expression.
  - DS_QUANTILES_SKETCH(expr, [k]): Creates a Quantiles sketch on the values of the expression.
- Functions to aggregate based on time. For more information, see Aggregate value on ingestion based on earliest or latest time.
  - EARLIEST_BY(expr, timestampExpr, maxBytesPerString): For the earliest time value in timestampExpr, returns the associated value of a string or numeric expression expr.
  - LATEST_BY(expr, timestampExpr, maxBytesPerString): For the latest time value in timestampExpr, returns the associated value of a string or numeric expression expr.
For more information on these functions, see Druid SQL aggregation functions.
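As a sketch, a hypothetical aggregate-table job might combine several of these functions. The input fields "country", "revenue", and "latency" are illustrative only; the source clause is elided, and a complete job would also include PARTITIONED BY, as shown in the multi-value dimension examples later in this topic.
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
"country" AS "country",
COUNT(*) AS "__count",
SUM("revenue") AS "revenue_sum",
MAX("latency") AS "latency_max"
FROM ...
GROUP BY 1, 2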
Handling incompatible data types
When input values are incompatible with a column's data type, Polaris does one of the following:
- Coerce the type: When the incompatible value can be cast. For example, a long-typed value in a string-typed column.
- Ingest a null value: When the incompatible value can't be cast. For example, a string-typed value in a long-typed column.
- Convert the value to null and apply the function: When an input type is incompatible with a function. For example, dividing a string by the number 2.
The job logs for streaming jobs display error messages for any parse exceptions Polaris encounters.
For more information on how data types are cast, see SQL data types.
Ingesting specific data types
By default, Polaris determines the appropriate data types for your table columns using schema detection, based on the input data as well as any transforms you apply. Specific transforms are required to compute aggregations, such as aggregating data into sketches, or to properly parse an input field, such as IP addresses.
If you want to ensure that a column has a specific type, you can explicitly declare it with the intended data type.
This section lists the simple data types supported in Polaris as well as more complex types that require specific transforms.
The data types are supported for both detail and aggregate tables unless otherwise documented.
For aggregate tables, the SQL examples aggregate the timestamp values using the TIME_FLOOR function to apply rollup at ingestion.
Simple data types
- bigint: Aliased to long in the Polaris API.
- double: A 64-bit floating point decimal number.
- float: A 32-bit floating point decimal number.
- long: A 64-bit integer. When ingesting into a flexible table without a declared schema, Polaris ingests Boolean values as long types.
- string: UTF-8 encoded text, including multi-value dimensions.
- varchar: Aliased to string in the Polaris API.
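If you want a column to use a specific simple type regardless of what schema detection infers, one option is to cast the value in the input expression. A minimal sketch, assuming a hypothetical string input field "views" that should be stored as a long:
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
CAST("views" AS BIGINT) AS "views"
FROM ...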
Earliest or latest
Supported for schema measures in aggregate tables only.
Ingest the earliest or latest value from a set of aggregated rows. You can apply the EARLIEST or LATEST functions on the aggregated values at query time. For more information, see Earliest or latest aggregation.
Earliest
Ingest the earliest value from aggregated rows.
- Table data type: longStringPair
- Input expression: EARLIEST_BY(expr, timestampExpr)
- Input data type: expr is a string or numeric input field
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
EARLIEST_BY("score", TIME_PARSE("event_time"))
FROM ...
Latest
Ingest the latest value from aggregated rows.
- Table data type: longStringPair
- Input expression: LATEST_BY(expr, timestampExpr)
- Input data type: expr is a string or numeric input field
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
LATEST_BY("score", TIME_PARSE("event_time"))
FROM ...
HLL sketch
Supported for schema measures in aggregate tables only.
Ingest data as HLL sketch objects for approximate distinct counting.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Compute results with cardinality sketches.
- Table data type: HLLSketch
- Input expression: DS_HLL(expr)
- Input data type: expr can be any type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_HLL("example_column")
FROM ...
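At query time, you can compute a distinct count estimate from the stored sketch. A minimal sketch of such a query, assuming the HLL sketch above was stored in a column named "example_column":
SELECT
APPROX_COUNT_DISTINCT_DS_HLL("example_column") AS "distinct_count"
FROM "example_table"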
IP types
Ingest IP addresses or IP prefixes from a string input field.
For examples of queries on IP type columns, see Enhanced IP support.
IP address
Ingest IP addresses stored in 128-bit binary format.
- Table data type: ipAddress
- Input expression: IP_PARSE(expr)
- Input data type: expr is a string input field

IP_PARSE raises an error if an input value is not a valid IP address. If you prefer to bypass the parsing errors, use the input expression IP_TRY_PARSE(expr), which returns a null value for invalid IP addresses.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_PARSE("example_column")
FROM ...
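If you prefer to ingest nulls for invalid addresses instead of failing the job, the same example with IP_TRY_PARSE looks like this:
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_TRY_PARSE("example_column")
FROM ...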
IP prefix
Ingest IP prefixes stored in 136-bit binary format, consisting of a 128-bit address and an 8-bit prefix value.
- Table data type: ipPrefix
- Input expression: IP_PREFIX_PARSE(expr)
- Input data type: expr is a string input field

IP_PREFIX_PARSE raises an error if an input value is not a valid IP prefix. If you prefer to bypass the parsing errors, use the input expression IP_PREFIX_TRY_PARSE(expr), which returns a null value for invalid IP prefixes.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_PREFIX_PARSE("example_column")
FROM ...
JSON
Ingest nested JSON data. For more information, see Ingest nested data.
- Table data type: json
- Input expression: None. Polaris detects the type from the data.
- Input data type: The input field is of type json or complex<json>.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
"example_column"
FROM ...
Multi-value dimensions
Ingest multi-value dimensions (MVDs). MVDs are string types (SQL VARCHAR).
- Table data type: string
- Input expression:
  - For a detail table, you don't need an input expression.
  - For an aggregate table, use MV_TO_ARRAY(inputField). The function coerces the MVD into an array for grouping (SQL GROUP BY), but the data is still stored as a string MVD.
- Input data type: The input field is of type string.
The following example shows how to ingest MVDs into a detail table:
INSERT INTO "simple_detail_mvdmode"
SELECT
TIME_PARSE("timestamp") AS "__time",
"tags" AS "tags"
FROM TABLE(POLARIS_SOURCE('{"fileList":["array-mvd-agg.json"],"inputSchema":[{"dataType":"string","name":"timestamp"},{"dataType":"string","name":"tags"}],"formatSettings":{"format":"nd-json"},"type":"uploaded"}'))
PARTITIONED BY DAY
The following example shows how to ingest MVDs into an aggregate table:
INSERT INTO "simple_agg_mvdmode"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'P1D', NULL, 'UTC') AS "__time",
MV_TO_ARRAY("tags") AS "tags",
COUNT(*) AS "__count"
FROM TABLE(POLARIS_SOURCE('{"fileList":["array-mvd-agg.json"],"inputSchema":[{"dataType":"string","name":"timestamp"},{"dataType":"string","name":"tags"}],"formatSettings":{"format":"nd-json"},"type":"uploaded"}'))
GROUP BY 1, 2
PARTITIONED BY DAY
Quantiles sketch
Supported for schema measures in aggregate tables only.
Ingest numeric data as Quantiles sketches.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Estimate distributions with quantile sketches.
- Table data type: quantilesDoublesSketch
- Input expression: DS_QUANTILES_SKETCH(expr)
- Input data type: expr is a numeric type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_QUANTILES_SKETCH("example_column")
FROM ...
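At query time, you can estimate quantiles from the stored sketch, for example the median. A minimal sketch of such a query, assuming the Quantiles sketch above was stored in a column named "example_column":
SELECT
APPROX_QUANTILE_DS("example_column", 0.5) AS "median"
FROM "example_table"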
Theta sketch
Supported for schema measures in aggregate tables only.
Ingest data as Theta sketch objects for approximate distinct counting with set operations.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Compute results with cardinality sketches.
- Table data type: thetaSketch
- Input expression: DS_THETA(expr)
- Input data type: expr can be any type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_THETA("example_column")
FROM ...
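At query time, you can compute a distinct count estimate from the stored Theta sketch. A minimal sketch of such a query, assuming the sketch above was stored in a column named "example_column":
SELECT
APPROX_COUNT_DISTINCT_DS_THETA("example_column") AS "distinct_count"
FROM "example_table"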
Timestamp
Ingest the primary timestamp.
Only use the timestamp data type for the table's __time column.
The input timestamp field must be parsed from ISO 8601 or milliseconds since epoch format.
For more information and examples, see Timestamp expressions.
- Table data type: timestamp
- Input expression: TIME_PARSE(timestamp) or MILLIS_TO_TIMESTAMP(timestamp) or __time
- Input data type: The input field is string for TIME_PARSE or long for MILLIS_TO_TIMESTAMP.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time"
FROM ...
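If the input field contains milliseconds since the epoch instead of an ISO 8601 string, the equivalent sketch uses MILLIS_TO_TIMESTAMP. The input field name "timestamp_millis" is hypothetical:
INSERT INTO "example_table"
SELECT
MILLIS_TO_TIMESTAMP("timestamp_millis") AS "__time"
FROM ...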
Variance
Supported for schema measures in aggregate tables only.
Ingest variance measurements on a numeric input field.
You can apply VARIANCE, STDDEV, and related functions on the aggregated values at query time.
For information on the statistics functions, see Stats aggregator.
- Table data type: variance
- Input expression: Any of VARIANCE(expr), VAR_SAMP(expr), or VAR_POP(expr)
- Input data type: expr is a numeric type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
VARIANCE("example_column")
FROM ...
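At query time, you can apply the statistics functions to the stored variance column. A minimal sketch of such a query, assuming the measure above was stored in a column named "example_column":
SELECT
VARIANCE("example_column") AS "variance",
STDDEV("example_column") AS "standard_deviation"
FROM "example_table"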
Learn more
See the following topics for more information: