Table schema
Imply Polaris stores data in tables. A table's schema determines how the data is organized. It defines the table's columns, data types, and other metadata about the table.
This topic covers the data types that Polaris columns support and the kinds of columns a schema contains: schema dimensions, schema measures, and the timestamp.
Schema dimensions and schema measures relate 1:N with data cube dimensions and measures. Data cubes can model additional dimensions and measures using expressions, and can also remove dimensions and measures as needed.
Prerequisites
To create and edit a schema in Polaris, you need the following:
- An existing table.
- The ManageTables permission assigned to your user profile. For more information on permissions, see Permissions reference.
Schema auto-detection in the UI
Polaris can infer the schema for a table for batch or streaming ingestion with the schema auto-detection feature. With schema auto-detection, Polaris scans the first 1000 entries of your source data to create an input schema. During this data mapping phase, Polaris infers a name and a data type for each column in your table based on the detected values from your input fields. This method is best suited for cases when you do not have a predefined schema and want to get started quickly. For an example, see the quickstart guide.
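The following Python sketch illustrates the general idea of sampling-based schema inference. It is not the Polaris implementation: the source only states that the first 1000 entries are scanned, so the function names (infer_type, infer_schema), the widening rules, and the handling of booleans are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, List

SAMPLE_SIZE = 1000  # the documentation says the first 1000 entries are scanned


def infer_type(values: Iterable[Any]) -> str:
    """Pick a type name that fits every non-null sampled value (assumed rules)."""
    seen = set()
    for value in values:
        if value is None:
            continue
        if isinstance(value, bool):
            seen.add("string")  # assumption: booleans are kept as text
        elif isinstance(value, int):
            seen.add("long")
        elif isinstance(value, float):
            seen.add("double")
        elif isinstance(value, (dict, list)):
            seen.add("json")
        else:
            seen.add("string")
    if not seen:
        return "string"  # assumption: all-null columns default to string
    if seen == {"long"}:
        return "long"
    if seen <= {"long", "double"}:
        return "double"  # mixed integers and floats widen to double
    if seen == {"json"}:
        return "json"
    return "string"  # anything mixed falls back to string


def infer_schema(rows: Iterable[Dict[str, Any]]) -> Dict[str, str]:
    """Infer a column-name-to-type mapping from the first SAMPLE_SIZE rows."""
    columns: Dict[str, List[Any]] = {}
    for index, row in enumerate(rows):
        if index >= SAMPLE_SIZE:
            break  # rows beyond the sample do not influence the result
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return {name: infer_type(values) for name, values in columns.items()}
```

Because only a sample is inspected, a value that first appears after the sample window cannot change the inferred type. This is one reason to define the schema up front when you already know it.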
Create a schema in the UI
You can use the Polaris UI to create a schema manually. You can add and remove columns, as described in the example. This method is best suited for cases when you know exactly what your schema should look like and want to define it before loading any data.
Create a schema with the API
You can use the Tables API to create tables and manage schemas programmatically. This method is best suited for automated workflows.
Data types
When you define a schema, you must specify a name and a data type for each column. Column definitions are immutable.
Polaris supports the following data types for a table column:
- string: UTF-8 encoded text
- long: a 64-bit integer
- float: a 32-bit floating-point number
- double: a 64-bit floating-point number
- json: nested data in JSON format
- timestamp: primary timestamp
For details on ingesting nested data, see Create an ingestion job.
The following data types are supported for schema measures in aggregate tables only:
- thetaSketch: a Theta sketch object
- HLLSketch: an HLL sketch object
Sketch objects are probabilistic data structures that improve the query performance of distinct count queries with known error distributions. For more information on sketches in Polaris, see Compute results with approximation algorithms.
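To make the idea of a probabilistic distinct count concrete, here is a minimal Python sketch of a K-minimum-values (KMV) estimator, a simpler relative of Theta sketches. It is an illustration only and not the sketch implementation that Polaris uses. The function names and the default k are assumptions.

```python
import hashlib
import heapq
from typing import Iterable, List, Set


def _hash_to_unit(value: str) -> float:
    """Map a string to a pseudo-random number in [0, 1] using SHA-256."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_distinct_estimate(values: Iterable[str], k: int = 1024) -> float:
    """Estimate the distinct count while retaining only the k smallest hashes."""
    heap: List[float] = []     # max-heap (negated values) of the k smallest hashes
    members: Set[float] = set()  # hashes currently retained, for duplicate checks
    for value in values:
        h = _hash_to_unit(value)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct hashes: count is exact
    kth_smallest = -heap[0]
    return (k - 1) / kth_smallest
```

The estimator keeps at most k hashes no matter how many rows it sees, and its relative error shrinks roughly with the square root of k. That trade of bounded memory for a known error distribution is what the sketch data types offer.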
The following restrictions apply to column names:
- Names must be unique and non-empty when trimmed of leading and trailing spaces.
- Names starting with two underscores, such as __count, are reserved for internal use.
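The naming restrictions above can be checked before you submit a schema. The following Python sketch is a hypothetical client-side helper, not part of Polaris. It assumes that uniqueness is compared after trimming spaces, which the source does not state explicitly.

```python
from typing import Iterable, List

RESERVED_PREFIX = "__"  # names with this prefix are reserved for internal use


def validate_column_names(names: Iterable[str]) -> List[str]:
    """Return a list of problems; an empty list means every name passes."""
    problems: List[str] = []
    seen = set()
    for raw in names:
        name = raw.strip(" ")  # trim leading and trailing spaces
        if not name:
            problems.append(f"{raw!r}: name is empty after trimming spaces")
            continue
        if name.startswith(RESERVED_PREFIX):
            problems.append(f"{raw!r}: names starting with '__' are reserved")
        if name in seen:  # assumption: uniqueness is checked on trimmed names
            problems.append(f"{raw!r}: duplicate column name")
        seen.add(name)
    return problems
```

Apply a check like this to user-defined columns only, since built-in columns such as __time legitimately use the reserved prefix.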
Schema dimensions
Schema dimensions are data columns that contain qualitative information. You can group by, filter, or apply aggregators to dimensions at query time.
Schema measures
Schema measures are quantitative data fields or probabilistic data structures derived from the original data source. A schema measure stores data in aggregated form based on an aggregation function you apply on your source data in your ingestion job.
Supported aggregation functions include:
- Count: Counts the number of rows for the dimension.
- Max: Returns the largest value for the dimension.
- Min: Returns the smallest value for the dimension.
- Sum: Returns the sum of all values for the dimension. Polaris assigns the Sum aggregation function to measures containing numeric data by default.
Schema measures are only available for aggregate tables.
All aggregate tables automatically include a __count measure that counts the number of source data rows that were rolled up into a given table row.
The __count measure is populated internally. Do not specify this measure in a table schema or ingestion job specification.
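As a rough illustration of how rollup produces aggregated measures and the __count value, the Python sketch below groups source rows by their dimension values and applies Count, Sum, Min, and Max to one numeric field. The function name roll_up and the output keys sum, min, and max are hypothetical. The sketch also assumes timestamps are already truncated to the rollup granularity, which Polaris would handle during ingestion.

```python
from typing import Any, Dict, Iterable, List, Tuple


def roll_up(
    rows: Iterable[Dict[str, Any]], dimensions: List[str], measure: str
) -> List[Dict[str, Any]]:
    """Combine rows that share the same dimension values into one row each."""
    groups: Dict[Tuple[Any, ...], Dict[str, Any]] = {}
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        value = row[measure]
        agg = groups.get(key)
        if agg is None:
            # First source row for this combination of dimension values.
            groups[key] = {"__count": 1, "sum": value, "min": value, "max": value}
        else:
            agg["__count"] += 1  # one more source row rolled into this table row
            agg["sum"] += value
            agg["min"] = min(agg["min"], value)
            agg["max"] = max(agg["max"], value)
    return [dict(zip(dimensions, key), **agg) for key, agg in groups.items()]
```

For example, two source rows with the same timestamp and city collapse into a single row whose __count is 2, while the original per-row values are no longer recoverable.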
Timestamp
Every schema has a timestamp column by default. Polaris uses the timestamp to partition and sort data, and to perform time-based data management operations, such as dropping time chunks.
When you create a table without a schema, Polaris automatically creates the primary timestamp column __time.
If you use the Polaris API to manually define your schema, include the __time column in the schema object of the request payload. Only the __time column takes the data type of timestamp.
If you do not include __time in the table schema, Polaris automatically creates this column for the table.
When creating an ingestion job, you can transform timestamps using input expressions in the ingestion job specification. For more information, see Timestamp expressions.
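The timestamp rules in this section can be sketched as a small Python helper that prepares a column list before you send it anywhere. This is a hypothetical illustration: the keys name and dataType and the function with_time_column are assumptions, not a confirmed Tables API payload shape.

```python
from typing import Dict, List

# Assumed representation of the primary timestamp column.
TIME_COLUMN = {"name": "__time", "dataType": "timestamp"}


def with_time_column(columns: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return the columns with __time present, enforcing the timestamp rules."""
    for column in columns:
        is_time = column["name"] == "__time"
        is_timestamp = column["dataType"] == "timestamp"
        if is_timestamp and not is_time:
            raise ValueError(f"only __time may use timestamp: {column['name']}")
        if is_time and not is_timestamp:
            raise ValueError("__time must use the timestamp data type")
    if any(column["name"] == "__time" for column in columns):
        return list(columns)
    # Mirror the documented default: __time is created when it is omitted.
    return [dict(TIME_COLUMN)] + list(columns)
```

Adding __time explicitly on the client side is optional, since Polaris creates the column when it is missing. Doing so simply keeps the schema definition self-describing.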
Example
The following example walks you through the steps to create an empty table with a schema definition.
- Click Tables from the left navigation menu.
- Click Create table.
- Enter a unique name for your table and select Aggregate for the table type.
For more information, see Types of tables.
- Click Create.
- Click Edit schema.
- Polaris displays an empty table with two columns automatically created, __time and __count. The __time dimension stores the primary timestamp for all Polaris tables. The __count measure holds the number of source data rows that were rolled up into a given table row for aggregate tables. Polaris splits the table view to show dimensions to the left and measures to the right. For more details on dimensions and measures, see Schema dimensions and Schema measures.
- On the Dimensions side of the split view, click the add icon.
- Enter the name and data type of the dimension.
- Click Save.
- On the Measures side of the split view, click the add icon.
- Enter the name and data type of the measure. Certain data types, such as theta and HLL sketches, are only available for measures. The data type of a measure determines any aggregation functions to apply during ingestion and querying.
When you finish editing your schema, click Save schema. Your table is now ready for ingestion.
Learn more
To define a table schema using the Polaris API, see Define table schemas by API.