Input expressions reference
Ingestion jobs in Imply Polaris map input fields to table columns. An ingestion job mapping takes an input expression as well as the name of an output column. An input expression describes how to compute the value of a table's column from the input fields. An input expression can map an input field as is or apply one or more Apache Druid® SQL functions to transform the data.
Some use cases require complex data types, such as cardinality sketches or IP addresses. These data types may require specific transforms during ingestion to properly parse and store the data.
This topic lists the Druid SQL functions you can use in input expressions. It also references the specific input expressions required to ingest complex data types.
For information and examples on input expressions in ingestion jobs, see Map and transform data with input expressions.
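For example, in the SQL form of an ingestion job, each expression in the SELECT list is an input expression and its alias names the output column. The following minimal sketch assumes hypothetical input fields named "country" and "city"; it maps "country" as is and transforms "city" with a scalar function. The source clause is elided, as in the other examples in this topic.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
"country" AS "country",
UPPER("city") AS "city_upper"
FROM ...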
Input expressions for dimensions
The following resources list the operations supported in input expressions for dimensions:
- Druid SQL operators
- Druid SQL scalar functions
- Druid SQL JSON functions for nested data
- Druid SQL multi-value string functions for multi-value string dimensions
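For example, the following sketch combines a scalar function and a JSON function in dimension input expressions. The input fields "user_agent" and "attributes" are hypothetical, and the source clause is elided.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
LOWER("user_agent") AS "user_agent",
JSON_VALUE("attributes", '$.country') AS "country"
FROM ...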
Input expressions for measures
If the table column is a measure, the input expression must use one of the following aggregation functions:
- COUNT(*): Counts the number of rows.
- SUM(expr): Sums numbers.
- MIN(expr): Takes the minimum of numbers.
- MAX(expr): Takes the maximum of numbers.
- Functions to measure dispersion:
  - VARIANCE(expr) or VAR_SAMP(expr): Computes the sample variance of numbers.
  - VAR_POP(expr): Computes the population variance of numbers.
- Functions to estimate distinct counts:
  - COUNT(DISTINCT expr): Counts the number of distinct values of expr. In Polaris, this is the same as APPROX_COUNT_DISTINCT and APPROX_COUNT_DISTINCT_DS_HLL. Polaris uses the HLL approximation algorithm to estimate distinct values. To compute distinct count estimates from sketches, see Query sketched data.
  - APPROX_COUNT_DISTINCT_DS_THETA: Counts the number of distinct values using Theta sketches. Theta sketches require more memory than HLL sketches.
- Sketch creation functions. For more information, see Cardinality sketches and Quantiles sketches.
  - DS_HLL(expr, [lgK, tgtHllType]): Creates an HLL sketch on the values of the expression.
  - DS_THETA(expr, [size]): Creates a Theta sketch on the values of the expression.
  - DS_QUANTILES_SKETCH(expr, [k]): Creates a Quantiles sketch on the values of the expression.
- Functions to aggregate based on time. For more information, see Aggregate value on ingestion based on earliest or latest time.
  - EARLIEST_BY(expr, timestampExpr, maxBytesPerString): For the earliest time value in timestampExpr, returns the associated value of a string or numeric expression expr.
  - LATEST_BY(expr, timestampExpr, maxBytesPerString): For the latest time value in timestampExpr, returns the associated value of a string or numeric expression expr.
For more information on these functions, see Druid SQL aggregation functions.
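As a sketch, a hypothetical aggregate-table job might combine several of these functions. The input fields "country", "revenue", and "latency" are illustrative only; the source clause is elided, and a complete job would also include PARTITIONED BY, as shown in the multi-value dimension examples later in this topic.
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
"country" AS "country",
COUNT(*) AS "__count",
SUM("revenue") AS "revenue_sum",
MAX("latency") AS "latency_max"
FROM ...
GROUP BY 1, 2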
Handling incompatible data types
When input values are incompatible with a column's data type, Polaris does one of the following:
- Coerce the type: When the incompatible value can be cast. For example, a long-typed value in a string-typed column.
- Ingest a null value: When the incompatible value can't be cast. For example, a string-typed value in a long-typed column.
- Convert the value to null and apply the function: When an input type is incompatible with a function. For example, dividing a string by the number 2.
The job logs for streaming jobs display error messages for any parse exceptions Polaris encounters.
For more information on how data types are cast, see SQL data types.
Ingesting specific data types
By default, Polaris determines the appropriate data types for your table columns using schema detection, based on the input data as well as any transforms you apply. Specific transforms are required to compute aggregations, such as aggregating data into sketches, or to properly parse an input field, such as IP addresses.
If you want to ensure that a column has a specific type, you can explicitly declare it with the intended data type.
This section lists the simple data types supported in Polaris as well as more complex types that require specific transforms.
The data types are supported for both detail and aggregate tables unless otherwise documented.
For aggregate tables, the SQL examples aggregate the timestamp values using the TIME_FLOOR function to apply rollup at ingestion.
Simple data types
- bigint: Aliased to long in the Polaris API.
- double: A 64-bit floating point decimal number.
- float: A 32-bit floating point decimal number.
- long: A 64-bit integer. When ingesting into a flexible table without a declared schema, Polaris ingests Boolean values as long types.
- string: UTF-8 encoded text, including multi-value dimensions.
- varchar: Aliased to string in the Polaris API.
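If you want a column to use a specific simple type regardless of what schema detection infers, one option is to cast the value in the input expression. A minimal sketch, assuming a hypothetical string input field "views" that should be stored as a long:
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
CAST("views" AS BIGINT) AS "views"
FROM ...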
Earliest or latest
Supported for schema measures in aggregate tables only.
Ingest the earliest or latest value from a set of aggregated rows. You can apply the EARLIEST or LATEST functions on the aggregated values at query time. For more information, see Earliest or latest aggregation.
Earliest
Ingest the earliest value from aggregated rows.
- Table data type: longStringPair
- Input expression: EARLIEST_BY(expr, timestampExpr)
- Input data type: expr is a string or numeric input field
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
EARLIEST_BY("score", TIME_PARSE("event_time"))
FROM ...
Latest
Ingest the latest value from aggregated rows.
- Table data type: longStringPair
- Input expression: LATEST_BY(expr, timestampExpr)
- Input data type: expr is a string or numeric input field
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
LATEST_BY("score", TIME_PARSE("event_time"))
FROM ...
HLL sketch
Supported for schema measures in aggregate tables only.
Ingest data as HLL sketch objects for approximate distinct counting.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Compute results with cardinality sketches.
- Table data type: HLLSketch
- Input expression: DS_HLL(expr)
- Input data type: expr can be any type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_HLL("example_column")
FROM ...
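At query time, you can compute a distinct count estimate from the stored sketch. A minimal sketch of such a query, assuming the HLL sketch above was stored in a column named "example_column":
SELECT
APPROX_COUNT_DISTINCT_DS_HLL("example_column") AS "distinct_count"
FROM "example_table"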
IP types
Ingest IP addresses or IP prefixes from a string input field.
For examples of queries on IP type columns, see Enhanced IP support.
IP address
Ingest IP addresses stored in 128-bit binary format.
- Table data type: ipAddress
- Input expression: IP_PARSE(expr)
- Input data type: expr is a string input field

IP_PARSE raises an error if an input value is not a valid IP address. If you prefer to bypass the parsing errors, use the input expression IP_TRY_PARSE(expr), which returns a null value for invalid IP addresses.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_PARSE("example_column")
FROM ...
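If you prefer to ingest nulls for invalid addresses instead of failing the job, the same example with IP_TRY_PARSE looks like this:
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_TRY_PARSE("example_column")
FROM ...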
IP prefix
Ingest IP prefixes stored in 136-bit binary format, consisting of a 128-bit address and an 8-bit prefix value.
- Table data type: ipPrefix
- Input expression: IP_PREFIX_PARSE(expr)
- Input data type: expr is a string input field

IP_PREFIX_PARSE raises an error if an input value is not a valid IP prefix. If you prefer to bypass the parsing errors, use the input expression IP_PREFIX_TRY_PARSE(expr), which returns a null value for invalid IP prefixes.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
IP_PREFIX_PARSE("example_column")
FROM ...
JSON
Ingest nested JSON data. For more information, see Ingest nested data.
- Table data type: json
- Input expression: None. Polaris detects the type from the data.
- Input data type: The input field is of type json or complex<json>.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time",
"example_column"
FROM ...
Multi-value dimensions
Ingest multi-value dimensions (MVDs). MVDs are string types (SQL VARCHAR).
- Table data type: string
- Input expression:
  - For a detail table, you don't need an input expression.
  - For an aggregate table, use MV_TO_ARRAY(inputField). The function coerces the MVD into an array for grouping (SQL GROUP BY), but the data is still stored as a string MVD.
- Input data type: The input field is of type string.
The following example shows how to ingest MVDs into a detail table:
INSERT INTO "simple_detail_mvdmode"
SELECT
TIME_PARSE("timestamp") AS "__time",
"tags" AS "tags"
FROM TABLE(POLARIS_SOURCE('{"fileList":["array-mvd-agg.json"],"inputSchema":[{"dataType":"string","name":"timestamp"},{"dataType":"string","name":"tags"}],"formatSettings":{"format":"nd-json"},"type":"uploaded"}'))
PARTITIONED BY DAY
The following example shows how to ingest MVDs into an aggregate table:
INSERT INTO "simple_agg_mvdmode"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'P1D', NULL, 'UTC') AS "__time",
MV_TO_ARRAY("tags") AS "tags",
COUNT(*) AS "__count"
FROM TABLE(POLARIS_SOURCE('{"fileList":["array-mvd-agg.json"],"inputSchema":[{"dataType":"string","name":"timestamp"},{"dataType":"string","name":"tags"}],"formatSettings":{"format":"nd-json"},"type":"uploaded"}'))
GROUP BY 1, 2
PARTITIONED BY DAY
Quantiles sketch
Supported for schema measures in aggregate tables only.
Ingest numeric data as Quantiles sketches.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Estimate distributions with quantile sketches.
- Table data type: quantilesDoublesSketch
- Input expression: DS_QUANTILES_SKETCH(expr)
- Input data type: expr is a numeric type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_QUANTILES_SKETCH("example_column")
FROM ...
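At query time, you can estimate quantiles from the stored sketch, for example the median. A minimal sketch of such a query, assuming the Quantiles sketch above was stored in a column named "example_column":
SELECT
APPROX_QUANTILE_DS("example_column", 0.5) AS "median"
FROM "example_table"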
Theta sketch
Supported for schema measures in aggregate tables only.
Ingest data as Theta sketch objects for approximate distinct counting with set operations.
You can provide an optional property in the input expression function to tune the size and accuracy of the created sketches.
For more information, see Compute results with cardinality sketches.
- Table data type: thetaSketch
- Input expression: DS_THETA(expr)
- Input data type: expr can be any type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
DS_THETA("example_column")
FROM ...
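At query time, you can compute a distinct count estimate from the stored Theta sketch. A minimal sketch of such a query, assuming the sketch above was stored in a column named "example_column":
SELECT
APPROX_COUNT_DISTINCT_DS_THETA("example_column") AS "distinct_count"
FROM "example_table"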
Timestamp
Ingest the primary timestamp.
Only use the timestamp data type for the table's __time column.
The input timestamp field must be parsed from ISO 8601 or milliseconds since epoch format.
For more information and examples, see Timestamp expressions.
- Table data type: timestamp
- Input expression: TIME_PARSE(timestamp) or MILLIS_TO_TIMESTAMP(timestamp) or __time
- Input data type: The input field is string for TIME_PARSE or long for MILLIS_TO_TIMESTAMP.
INSERT INTO "example_table"
SELECT
TIME_PARSE("timestamp") AS "__time"
FROM ...
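If the input field contains milliseconds since the epoch instead of an ISO 8601 string, the equivalent sketch uses MILLIS_TO_TIMESTAMP. The input field name "timestamp_millis" is hypothetical:
INSERT INTO "example_table"
SELECT
MILLIS_TO_TIMESTAMP("timestamp_millis") AS "__time"
FROM ...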
Variance
Supported for schema measures in aggregate tables only.
Ingest variance measurements on a numeric input field.
You can apply VARIANCE, STDDEV, and related functions on the aggregated values at query time.
For information on the statistics functions, see Stats aggregator.
- Table data type: variance
- Input expression: Any of VARIANCE(expr), VAR_SAMP(expr), or VAR_POP(expr)
- Input data type: expr is a numeric type
INSERT INTO "example_table"
SELECT
TIME_FLOOR(TIME_PARSE("timestamp"), 'PT1H') AS "__time",
VARIANCE("example_column")
FROM ...
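At query time, you can apply the statistics functions to the stored variance column. A minimal sketch of such a query, assuming the measure above was stored in a column named "example_column":
SELECT
VARIANCE("example_column") AS "variance",
STDDEV("example_column") AS "standard_deviation"
FROM "example_table"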
Learn more
See the following topics for more information: