Skip to main content

Lookups

In Imply Polaris, you can use lookups to enrich your data with additional information from a lookup source. Lookups contain mappings of key-value pairs. A lookup function returns mapped entries from the lookup source based upon the contents of a column you specify. Lookups in Polaris support one-to-many relationships: a single key can map to multiple values, such as a user ID that maps to both the user’s region and their display name.

Polaris only supports existing tables as a lookup source. The key and value column of the table that sources a lookup must be string-typed columns. In the table you query directly, the column containing the key values must also be a string-typed column.

This topic shows how to configure and use lookups in Polaris.

The following topics provide further information about lookups:

Prerequisites

You need the following permissions to create and use lookups:

  • ManageTables: Create a lookup table and the lookup itself.
  • ManageIngestionJobs: Ingest data into a lookup table.
  • AccessQueries: Run SQL queries.

For information on permissions, see Permissions reference.

How lookups work

This section provides a conceptual overview of lookups. It uses an example involving data collected from car trips as well as descriptive attributes about the cars.

Imagine that you have a table named car-trips that contains information about the trips taken by a fleet of cars. The data includes the identity of the car that made the trip as well as quantitative information on the length of the trip and the average speed of the car during the trip.

You also have another table called car-makes that describes the make and model of each car given its vehicle ID. Each car that's included in car-trips has associated information in car-makes. In car-trips, the vid column identifies the car associated with a given trip. In car-makes, you can look up car information using the IDs stored in the vehicle_id column.

A car may be represented multiple times in the car-trips table when it makes multiple trips. Each car is only included once in the car-makes table that qualitatively describes the cars. The following diagram illustrates both tables:

Lookups conceputal explanation intro

You want to query the data in car-trips to analyze the trips taken. In the query, you want to cross reference information from car-makes. For example, to analyze the trip information separated by make or model of the car.

In order to use lookups in queries, you first create a lookup object named vid-makes. The lookup object is an intermediary object that references a lookup source, in this case, the table named car-makes. The lookup name must be unique. The following diagram shows the lookup that references the car-makes table:

Lookups conceptual explanation create lookup

Finally, you can use the lookup in your queries to join information from both tables. You do this using the Druid SQL LOOKUP function. In the LOOKUP function, you first supply the column name of IDs from the table you SELECT data. Here, that's vid.

The second parameter takes the name of the lookup object. With the lookup object name, you use square bracket notation to designate the keys to look up and the values to return for the matched keys: lookup_object[key][value]. To return car make information with queries against car-trips, you'd call this with vid-makes[vehicle_id][make]. The key and value column names come from the table that's referenced by the lookup.

The following diagram shows a query that uses the lookup to return trip lengths from car-trips along with the make of the associated car from car-makes:

Lookups conceptual explanation query lookup

To follow along with this example interactively, see the Lookups tutorial.

Lookup strategies

You can apply lookups either at query time or ingestion time. Lookups at query time enrich query results using data from a lookup source. Lookups at ingestion time supplement data from the input source to ingest into the table. To determine which strategy is right for your use case, consider performance costs, your data model, and variability of the lookup data.

  • Performance costs: To execute on many queries quickly, take the performance cost upfront and use lookups at ingestion time. That way you don't have to constantly pull information from two tables in all your queries.

  • Data model: If you'd rather keep the data table separate from the lookup source, or if you don't want as much data stored in the table, use lookups at query time.

  • Data variability: For data that changes periodically, consider lookups at query time. You can use the lookup table as a dimension table that you regularly refresh. A dimension table typically contains descriptive information about business entities and is smaller in size than a fact table, which contains quantitative data such as for observations or events.

    Refreshing the lookup table provides a cost-effective way to get the latest query results from some fact table without having to reingest data into the larger fact table. Note that if your data is specific to a certain time range, you likely need to ingest the data into the fact table itself.

Comparison to JOIN

Polaris generally exhibits better performance when your data is denormalized, meaning that the data you query on is stored in a single table. This is more efficient than using JOIN operations at query time. In cases where full denormalization is not desirable, use lookups for partial normalization. With lookups, you use the Druid SQL LOOKUP function to join information from another table. Lookups are usually faster than JOIN operations, since JOIN evaluates conditions on each row, whereas LOOKUP defers evaluation until after applying any aggregations in the query.

Lookup sources

Polaris only supports existing tables as a lookup source. To designate a lookup table, first create and ingest data into a table, then define a lookup that points to the table. Like any table, lookups may take a few minutes to initialize after they're created. Any string column in the lookup table may serve as the key or value column.

Lookups only support strings. Any column that will be referenced as either a key column or value column should have string values or be cast to strings in the ingestion job. For the data table you query, the key column must also have the string data type. You can explicitly cast the data type using an input expression for the column. For example, to ingest the column vid as strings: CAST("vid" AS VARCHAR).

Data in a lookup table counts towards your project limits in the same way as any other data. For details on project sizes, see Billing overview.

Requirements

The following requirements apply to lookup tables:

  • The table must have a partitioning granularity of all.
    For information on setting the partitioning for a table, see Time partitioning.
  • The table must only have a single segment.
    Polaris typically stores around three million rows in a segment and limits the size of each segment to 5 GB. For details on data storage in segments, see Segment generation.

Polaris prevents you from creating a lookup if the source table doesn't have a single segment with all granularity. If your table has more data than fits in a single segment, you need to reduce the amount of data in the table. Consider removing columns that won't be referenced in the lookup query or breaking up your source into multiple lookups.

You can reduce the amount of data in the source table through an ingestion job to replace data in the table. The replace job reindexes the data when you set the source and target as the same table. Polaris combines the data in the table to a single segment unless the table has more data than fits in a single segment.

Guardrails

Polaris enforces the following guardrails for tables that are actively referenced by lookups:

  • You can't delete a lookup table.
  • You can't change storage policies on a lookup table.
    The data in a lookup table exists until it's deleted, and all of its data is precached.
  • You can't change the partitioning of a lookup table.
  • When defining an ingestion job to load or refresh data in a lookup table:
    • Use all partitioning and create no more than one segment. Otherwise, Polaris fails the ingestion job.
    • Configure the job to replace all data, even if the table is empty.
      • In the UI, from the table view, click Load data > Replace data > Replace all data.
      • SQL-based ingestion must use REPLACE INTO ... OVERWRITE ALL.
      • The JSON job spec must set the ingestion mode and designate to replace all:
        {
        ...,
        "ingestionMode": "replace",
        "replaceAll": true
        }
  • Polaris disables automatic compaction for lookup tables.

Recommendations

The following additional guidelines are recommended for lookup tables:

  • Only string columns can be used as key-value pairs for lookups.
    If a table has numeric columns, you can still use the table as a lookup source, but you can't use the numeric columns in lookups.
  • The key column must have unique values. Otherwise, you can ingest data into the table and reference it in a lookup, but Polaris returns a runtime error for the LOOKUP query.

How to use lookups

This section describes the process of creating a lookup, refreshing the data used in the lookup source, and updating or deleting the lookup.

Create a lookup

Create a lookup in the Polaris UI as follows:

  1. Select Lookups from the left pane to access the lookups page.
  2. Click Create lookup.
    Polaris only displays tables that are eligible based on the source requirements.
  3. Provide a unique name for the lookup and select a table to source information for the lookup.
    The lookup name identifies the lookup for use in queries or when you want to update or delete the lookup.
  4. Click Create.

The following screenshot shows the Create lookup dialog on the lookups page:

Create lookup dialog

Refresh a lookup

To refresh the data used for a lookup, replace the data in the lookup table. Ensure that your ingestion job adheres to the guardrails for ingesting into lookup tables. Specifically, the job must replace all data, use all partitioning, and create only a single segment. When the table has the updated data, the lookup is updated.

You can't refresh a lookup using a streaming source. If a table has an active streaming job, you can't use it as the source of a lookup.

To learn about replacing data in a table, see Replace data.

Update or delete a lookup

You can update a lookup to change its source. You can't change the name of a lookup.

To update or delete a lookup, click the ... menu button for the lookup on the Lookups page. Select Edit or Delete, then confirm your choice in the following dialog.

Query syntax

To use a lookup, call the LOOKUP function using the following syntax:

LOOKUP("COLUMN_NAME", 'LOOKUP_NAME[KEY_COLUMN][VALUE_COLUMN]', ['OPTIONAL_DEFAULT_VALUE'])

Replace the following values in the LOOKUP function:

  • COLUMN_NAME: Column name in the table that you're querying.
  • LOOKUP_NAME: Name that identifies the lookup.
  • KEY_COLUMN: Name of the column in the lookup source with keys that are in COLUMN_NAME.
  • VALUE_COLUMN: Name of the column in the lookup source with values to return for the mapped keys.
  • OPTIONAL_DEFAULT_VALUE: Constant string to use as the default when the lookup source doesn't have a value for the provided key. If you don't provide the default value, Polaris returns NULL.

For example, let's say your table has the column vid containing vehicle IDs. You created a lookup called vid-makes. The source for the lookup is a small table that has the columns vehicle_id and make. To look up the make of each vehicle using its ID, call LOOKUP as follows:

LOOKUP("vid", 'vid-makes[vehicle_id][make]')

In the previous example, if a vehicle ID wasn't present in the lookup source, the mapped value is NULL. If instead you want to fill in a default value or a non-NULL value, such as NA, pass the value in the third parameter:

LOOKUP("vid", 'vid-makes[vehicle_id][make]', 'NA')

Limitations

The ingestion wizard in the Polaris UI doesn't sample data from lookups. You can still ingest the lookup data.

Learn more

See the following topics for more information: