Skip to main content

Lookups

In Imply Polaris, you can use lookups to enrich your data with additional information from a lookup source. Lookups contain mappings of key-value pairs. A lookup function returns mapped entries from the lookup source based upon the contents of a column you specify. Lookups in Polaris support one-to-many relationships: a single key can map to multiple values, such as a user ID that maps to both the user’s region and their display name.

Polaris only supports existing tables as a lookup source. The key and value column of the table that sources a lookup must be string-typed columns. In the table you query directly, the column containing the key values must also be a string-typed column.

This topic shows how to configure and use lookups in Polaris.

The following topics provide further information about lookups:

Prerequisites

You need the following permissions to create and use lookups:

  • ManageTables: Create a lookup table and the lookup itself.
  • ManageIngestionJobs: Ingest data into a lookup table.
  • AccessQueries: Run SQL queries.

For information on permissions, see Permissions reference.

Lookup strategies

You can apply lookups either at query time or ingestion time. Lookups at query time enrich query results using data from a lookup source. Lookups at ingestion time supplement data from the input source to ingest into the table. To determine which strategy is right for your use case, consider performance costs, your data model, and variability of the lookup data.

  • Performance costs: To execute on many queries quickly, take the performance cost upfront and use lookups at ingestion time. That way you don't have to constantly pull information from two tables in all your queries.

  • Data model: If you'd rather keep the data table separate from the lookup source, or if you don't want as much data stored in the table, use lookups at query time.

  • Data variability: For data that changes periodically, consider lookups at query time. You can use the lookup table as a dimension table that you regularly refresh. A dimension table typically contains descriptive information about business entities and is smaller in size than a fact table, which contains quantitative data such as for observations or events.

    Refreshing the lookup table provides a cost-effective way to get the latest query results from some fact table without having to reingest data into the larger fact table. Note that if your data is specific to a certain time range, you likely need to ingest the data into the fact table itself.

Comparison to JOIN

Polaris generally exhibits better performance when your data is denormalized, meaning that the data you query on is stored in a single table. This is more efficient than using JOIN operations at query time. In cases where full denormalization is not desirable, use lookups for partial normalization. With lookups, you use the Druid SQL LOOKUP function to join information from another table. Lookups are usually faster than JOIN operations, since JOIN evaluates conditions on each row, whereas LOOKUP defers evaluation until after applying any aggregations in the query.

Lookup sources

Polaris only supports existing tables as a lookup source. To designate a lookup table, first create and ingest data into a table, then define a lookup that points to the table. Like any table, lookups may take a few minutes to initialize after they're created. Any string column in the lookup table may serve as the key or value column.

Data in a lookup table counts towards your project limits in the same way as any other data. For details on project sizes, see Billing overview.

Lookups only support strings. Any column that will be referenced as either a key column or value column should have string values or be cast to strings in the ingestion job. For the data table you query, the key column must also have the string data type. You can explicitly cast the data type using an input expression for the column. For example, to ingest the column vid as strings: CAST("vid" AS VARCHAR).

You can't set a custom retention policy on a lookup table. The data in a lookup table exists forever until it's deleted, and all data in a lookup table is precached.

Requirements

The following requirements apply to lookup tables:

  • The table must have a partitioning granularity of all.
    For information on setting the partitioning for a table, see Time partitioning.
  • The table must only have a single segment.
    Polaris typically stores around three million rows in a segment and limits the size of each segment to 5 GB. For details on data storage in segments, see Segment generation.

Polaris prevents you from creating a lookup if the source table doesn't have a single segment with all granularity. If your table has more data than fits in a single segment, you need to reduce the amount of data in the table. Consider removing columns that won't be referenced in the lookup query or breaking up your source into multiple lookups.

You can reduce the amount of data in the source table through an ingestion job to replace data in the table. The replace job reindexes the data when you set the source and target as the same table. Polaris combines the data in the table to a single segment unless the table has more data than fits in a single segment.

Guardrails

Polaris enforces the following guardrails for lookups:

  • You can't change the partitioning of a table that's actively referenced by a lookup.
  • You can't delete a table that's used by a lookup.
  • An ingestion job that inserts data into an active lookup table fails if the job doesn't set all partitioning or if it creates more than one segment.
  • You can't set a custom retention policy on a lookup table. The data in a lookup table exists until it's deleted, and all data in a lookup table is precached.
  • Polaris disables automatic compaction for tables used by lookups.

Recommendations

The following additional guidelines are recommended for lookup tables:

  • Only string columns can be used as key-value pairs for lookups.
    If a table has numeric columns, you can still use the table as a lookup source, but you can't use the numeric columns in lookups.
  • The key column must have unique values. Otherwise, you can ingest data into the table and reference it in a lookup, but Polaris returns a runtime error for the LOOKUP query.

How to use lookups

This section describes the process of creating a lookup, refreshing the data used in the lookup source, and updating or deleting the lookup.

Create a lookup

Create a lookup in the Polaris UI as follows:

  1. Select Lookups from the left pane to access the lookups page.
  2. Click Create lookup.
    Polaris only displays tables that are eligible based on the source requirements.
  3. Provide a name for the lookup and select a table to source information for the lookup.
    The lookup name identifies the lookup for use in queries or when you want to update or delete the lookup.
  4. Click Create.

The following screenshot shows the Create lookup dialog on the lookups page:

Create lookup dialog

Refresh a lookup

To refresh the data used for a lookup, update the data in the lookup table. Ensure that your ingestion job adheres to the lookup source requirements. When the table has the updated data, the lookup is updated.

You can't use a table with active streaming jobs as the source of a lookup.

To create an ingestion job to replace data in a table, see Replace data.

Update or delete a lookup

You can update a lookup to change its source. You can't change the name of a lookup.

To update or delete a lookup, click the ... menu button for the lookup on the Lookups page. Select Edit or Delete, then confirm your choice in the following dialog.

Query syntax

To use a lookup, call the LOOKUP function using the following syntax:

LOOKUP("COLUMN_NAME", 'LOOKUP_NAME[KEY_COLUMN][VALUE_COLUMN]', ['OPTIONAL_DEFAULT_VALUE'])

Replace the following values in the LOOKUP function:

  • COLUMN_NAME: Column name in the table that you're querying.
  • LOOKUP_NAME: Name that identifies the lookup.
  • KEY_COLUMN: Name of the column in the lookup source with keys that are in COLUMN_NAME.
  • VALUE_COLUMN: Name of the column in the lookup source with values to return for the mapped keys.
  • OPTIONAL_DEFAULT_VALUE: Constant string to use as the default when the lookup source doesn't have a value for the provided key. If you don't provide the default value, Polaris returns NULL.

For example, let's say your table has the column vid containing vehicle IDs. You created a lookup called vid-makes. The source for the lookup is a small table that has the columns vehicle_id and make. To look up the make of each vehicle using its ID, call LOOKUP as follows:

LOOKUP("vid", 'vid-makes[vehicle_id][make]')

In the previous example, if a vehicle ID wasn't present in the lookup source, the mapped value is NULL. If instead you want to fill in a default value or a non-NULL value, such as NA, pass the value in the third parameter:

LOOKUP("vid", 'vid-makes[vehicle_id][make]', 'NA')

Limitations

The ingestion wizard in the Polaris UI doesn't sample data from lookups. You can still ingest the lookup data.

Learn more

See the following topics for more information: