Configure streaming ingestion

Streaming ingestion jobs in Imply Polaris consume data from a constantly growing stream, such as a stream of one or more Kafka topics. In the ingestion job, you control the point in the stream where Polaris starts consuming data, as well as the acceptable time range for incoming events. This topic shows you how to select and reset where Polaris reads from the stream as well as how to restrict the time range for incoming data.

tip

Add data to your topic or stream before creating an ingestion job. The Polaris UI samples your data and detects the input schema, so you don't have to enter it manually.

To learn how to create an ingestion job, see Create an ingestion job.

Select offset for streaming ingestion

The offset controls where Polaris starts consuming data from a stream: from the earliest point or from the latest point. When you create a streaming job, you select the offset in the Map source to table step of the ingestion job. Select one of the following offset options:

  • Beginning: Ingest all existing data. Note that Polaris only ingests events whose timestamps are within the last 30 days.
    You can override this period by changing the late message rejection period.
  • End: (default) Ingest only events that arrive after you create the ingestion job.

The following screenshot shows the offset selection for a table:

[Screenshot: Ingestion job streaming offset]

The following diagrams show how the starting offset setting affects the ingestion of data added to a stream before and after starting an ingestion job:
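The same behavior can also be sketched in code. The following Python snippet is a minimal illustration, not Polaris code: the function name `events_ingested` and the idea of modeling a stream as a list are assumptions made for this example, and the sketch ignores the 30-day timestamp requirement described above.

```python
from typing import List


def events_ingested(stream: List[str], length_at_job_start: int, offset: str) -> List[str]:
    """Return the events a newly created job would read (illustrative model only).

    stream: every event that ends up in the stream, in arrival order.
    length_at_job_start: how many events were already in the stream
        when the ingestion job was created.
    offset: "beginning" or "end", mirroring the two options in the UI.
    """
    if offset == "beginning":
        start = 0                      # read everything already in the stream
    elif offset == "end":
        start = length_at_job_start    # read only what arrives after job creation
    else:
        raise ValueError(f"unknown offset: {offset!r}")
    return stream[start:]


# e1..e3 were in the stream before the job started; e4 and e5 arrived afterwards.
stream = ["e1", "e2", "e3", "e4", "e5"]
print(events_ingested(stream, 3, "beginning"))  # ['e1', 'e2', 'e3', 'e4', 'e5']
print(events_ingested(stream, 3, "end"))        # ['e4', 'e5']
```

With the beginning offset, the job reads data added both before and after it started. With the end offset, the job reads only data added after it started.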

Reset streaming job offset

For a given table and topic or stream, Polaris preserves the reading checkpoint of the topic or stream. If the same topic or stream is used in a new connection or new ingestion jobs, the reading checkpoint is still maintained for the table. Polaris only resets the reading checkpoint when the table has a new streaming ingestion job that reads from a different topic or stream.
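As a rough mental model of the checkpoint rule described above, the following Python sketch keeps one checkpoint per table and discards it only when a job reads from a different topic or stream. The class `CheckpointStore` and its methods are hypothetical names for illustration and do not reflect how Polaris stores checkpoints internally.

```python
from typing import Dict, Tuple


class CheckpointStore:
    """Toy model: one (topic, position) reading checkpoint per table."""

    def __init__(self) -> None:
        self._by_table: Dict[str, Tuple[str, int]] = {}

    def start_job(self, table: str, topic: str, initial_position: int) -> int:
        """Return the position where a new job for `table` starts reading."""
        saved = self._by_table.get(table)
        if saved is not None and saved[0] == topic:
            # Same table and same topic: the existing checkpoint is kept,
            # even if the job or connection is new.
            return saved[1]
        # First job for the table, or a different topic: start fresh.
        self._by_table[table] = (topic, initial_position)
        return initial_position

    def advance(self, table: str, position: int) -> None:
        """Record how far the running job has read."""
        topic, _ = self._by_table[table]
        self._by_table[table] = (topic, position)


store = CheckpointStore()
store.start_job("orders", "topic-a", 0)
store.advance("orders", 42)
print(store.start_job("orders", "topic-a", 0))  # 42: checkpoint preserved
print(store.start_job("orders", "topic-b", 7))  # 7: different topic, checkpoint reset
```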

You can reset the reading checkpoint to restart streaming ingestion from your selected offset. The reset only applies to streaming ingestion jobs that are currently running.

caution

When you reset the offset, the ingestion job may omit or duplicate data from the stream.

  • If you select the end offset and then reset the job, Polaris ingests data from the end of the stream, but may skip data that entered the stream before you applied the reset. This may result in missing data. Consider a batch ingestion job to backfill any missing data.
  • Conversely, if you select the beginning offset and then reset the job, Polaris ingests data from the start of the stream, but may ingest data that it has already ingested. This may result in duplicate data. Consider a replace data job to overwrite the affected time period with the correct data.
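To make the two risks concrete, the following Python sketch computes which stream positions could be skipped or re-read after a reset. It is a simplified worst-case model under stated assumptions: `reset_effect` is a hypothetical helper, positions are plain integers, and the stream is assumed to still retain all of its earlier events.

```python
from typing import Dict


def reset_effect(checkpoint: int, stream_length: int, offset: str) -> Dict[str, range]:
    """Worst-case positions skipped or duplicated by a reset (illustrative only).

    checkpoint: next position the job would have read before the reset.
    stream_length: number of events in the stream when the reset is applied.
    offset: "beginning" or "end", the offset declared in the job.
    """
    if offset == "end":
        # Reading resumes at the end, so unread events before it are never ingested.
        return {"skipped": range(checkpoint, stream_length), "duplicated": range(0)}
    if offset == "beginning":
        # Reading restarts at position 0, so already ingested events are read again.
        return {"skipped": range(0), "duplicated": range(0, checkpoint)}
    raise ValueError(f"unknown offset: {offset!r}")


effect = reset_effect(checkpoint=60, stream_length=100, offset="end")
print(len(effect["skipped"]), len(effect["duplicated"]))  # 40 0

effect = reset_effect(checkpoint=60, stream_length=100, offset="beginning")
print(len(effect["skipped"]), len(effect["duplicated"]))  # 0 60
```

The skipped range is the candidate for a backfill, and the duplicated range is the candidate for a replace data job.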

To reset the offset, do the following:

  1. Navigate to the Jobs page from the left navigation pane.
  2. Open the job you want to reset.
  3. Select Reset offset.
  4. Polaris displays a confirmation dialog and includes the offset declared in the job. Click Reset confirmation to confirm your choice.

After the reset, Polaris ingests streaming data from the earliest or latest point of the stream, depending on the offset declared in the job.

To use the API to reset the offset, see Reset streaming job offset by API.

Reject early or late data

You can set early or late message rejection periods to reject data whose timestamps fall too far in the future or too far in the past, relative to the current time. By default, Polaris does not ingest data that is older than 30 days or that is timestamped more than 2000 days in the future. Select Filter from the menu bar to set these periods.

For example, to avoid ingesting data older than 14 days, set the period as follows:

[Screenshot: Streaming late message rejection]

Note that the period for filtering out late data overrides the default event timestamp requirement.
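The rejection window can be pictured as a simple timestamp check. The Python sketch below is an approximation for illustration: the function `is_accepted` and its parameter names are assumptions, the defaults mirror the 30-day and 2000-day values stated above, and treating the window boundaries as inclusive is a guess rather than documented behavior.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LATE_PERIOD = timedelta(days=30)     # default limit for old data
DEFAULT_EARLY_PERIOD = timedelta(days=2000)  # default limit for future data


def is_accepted(
    event_time: datetime,
    now: datetime,
    late_period: timedelta = DEFAULT_LATE_PERIOD,
    early_period: timedelta = DEFAULT_EARLY_PERIOD,
) -> bool:
    """Return True if the event timestamp falls inside the acceptance window."""
    earliest_allowed = now - late_period
    latest_allowed = now + early_period
    return earliest_allowed <= event_time <= latest_allowed


now = datetime(2024, 1, 31, tzinfo=timezone.utc)
twenty_days_old = now - timedelta(days=20)

print(is_accepted(twenty_days_old, now))                                 # True
print(is_accepted(twenty_days_old, now, late_period=timedelta(days=14)))  # False
```

In this model, shortening the late message rejection period to 14 days rejects an event that the default 30-day window would have accepted.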

For additional details on filtering data, see Filter data to ingest.