Configure streaming ingestion

Streaming ingestion jobs in Imply Polaris consume data from a constantly growing stream, such as one or more Kafka topics. In the ingestion job, you control the point in the stream where Polaris starts consuming data, as well as the acceptable time range for incoming events. This topic describes configuration settings specific to streaming ingestion jobs: selecting the starting offset, resetting the offset, rejecting early or late data, and pausing and resuming jobs.

Tip: Add data to your topic or stream before creating an ingestion job. The Polaris UI samples your data and detects the input schema, so you don't have to enter it manually.

To learn how to create an ingestion job, see Create an ingestion job.

For details on streaming ingestion sources, see Ingestion sources overview.

Select offset for streaming ingestion

The offset controls where Polaris starts consuming data from a stream: either the earliest point or the latest point.

When creating a streaming job, you select the offset in the Map source to table step of the ingestion job. Select one of the following offset options:

  • Beginning: Ingest all data starting from the earliest offset. The data is subject to the message rejection period. By default, Polaris only ingests events with timestamps within the last 30 days. To change this period, see Reject early or late data.
  • End: (default) Ingest events starting from the latest offset. Polaris ingests only the data sent to the topic after the ingestion job begins.
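As a rough illustration of the two options, the following Python sketch models a single-partition stream as a list of messages with integer offsets. The `Message` class, the `messages_to_consume` function, and the `job_start_offset` parameter are hypothetical names made up for this example. They are not part of Polaris, and the sketch ignores the message rejection period.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    offset: int
    payload: str


def messages_to_consume(stream, start, job_start_offset):
    """Return the messages a job would read under this simplified model.

    stream: messages ordered by offset.
    start: "beginning" or "end".
    job_start_offset: hypothetical offset of the first message published
        after the job starts.
    """
    if start == "beginning":
        # Read everything still present in the stream.
        return list(stream)
    if start == "end":
        # Read only messages published after the job started.
        return [m for m in stream if m.offset >= job_start_offset]
    raise ValueError(f"unknown start option: {start!r}")


stream = [Message(i, f"event-{i}") for i in range(5)]

# Offsets 0-2 existed before the job started; offsets 3-4 arrived afterwards.
from_beginning = messages_to_consume(stream, "beginning", job_start_offset=3)
from_end = messages_to_consume(stream, "end", job_start_offset=3)

print([m.offset for m in from_beginning])  # [0, 1, 2, 3, 4]
print([m.offset for m in from_end])        # [3, 4]
```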

The following screenshot shows the offset selection for a table:

[Screenshot: Ingestion job streaming offset]

[Diagrams: how the starting offset setting affects the ingestion of data added to a stream before and after starting an ingestion job]

Reset streaming job offset

For a given combination of table and topic or stream, Polaris preserves the reading checkpoint, even if you use a new connection or a new ingestion job. If a new job has the same topic and table combination as a previous ingestion job, Polaris keeps the existing reading checkpoint, which may no longer be the earliest or latest offset. Polaris only resets the reading checkpoint when the table has a new streaming ingestion job that reads from a different topic or stream.
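The checkpoint behavior described above can be pictured with a small toy model. This is only a sketch of the rule as stated in this section, not how Polaris stores checkpoints. The `CheckpointStore` class, its methods, and the table and topic names are all hypothetical.

```python
class CheckpointStore:
    """Toy model: one reading checkpoint per table, tied to a topic."""

    def __init__(self):
        # table name -> (topic name, next offset to read)
        self._by_table = {}

    def start_job(self, table, topic, selected_offset):
        """Return the offset a new job on this table would start reading from."""
        previous = self._by_table.get(table)
        if previous is not None and previous[0] == topic:
            # Same table and topic: the preserved checkpoint wins,
            # regardless of the offset selected for the new job.
            return previous[1]
        # New table, or a different topic: the checkpoint is reset.
        self._by_table[table] = (topic, selected_offset)
        return selected_offset

    def advance(self, table, next_offset):
        """Record progress after the job has read up to next_offset."""
        topic, _ = self._by_table[table]
        self._by_table[table] = (topic, next_offset)


store = CheckpointStore()
first = store.start_job("events", "clicks", selected_offset=0)
store.advance("events", 120)

# A second job on the same table and topic keeps the old checkpoint.
second = store.start_job("events", "clicks", selected_offset=0)

# A job on the same table that reads a different topic resets it.
third = store.start_job("events", "purchases", selected_offset=0)

print(first, second, third)  # 0 120 0
```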

You may need to reset the offset when either of the following situations occurs:

  • The topic was deleted and recreated, so the consumer offset restarted.
  • Data that was previously read from the topic expired or was deleted, so the consumer offset expired.

Reset the reading checkpoint to ingest from your selected offset. The reset only applies to streaming ingestion jobs that are currently running.

Caution: When you reset the offset, the ingestion job may omit or duplicate data from the stream.

  • If you select the end offset and then reset the job, some data may have entered the stream before you applied the reset. This may result in missing data between the previously stored consumer offset and the latest offset. Consider a batch ingestion job to backfill any missing data.
  • If you select the beginning offset and then reset the job, Polaris may ingest data that was already ingested, leading to duplicate data. Consider a replace data job to overwrite the affected time period with the correct data.
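To make the two cases concrete, the following Python sketch computes which offsets would be skipped or read a second time after a reset. It assumes a single partition with integer offsets, where `latest` is the offset the next new message would receive. The `reset_effect` function and its parameters are hypothetical and only illustrate the reasoning above.

```python
def reset_effect(checkpoint, earliest, latest, selected):
    """Return (skipped, reread) offset ranges after a reset.

    checkpoint: next offset the job would have read before the reset.
    earliest: oldest offset still available in the stream.
    latest: offset the next new message will receive.
    selected: "beginning" or "end", the offset option chosen for the job.
    """
    if selected == "end":
        # The job jumps forward, so unread data before `latest` is missed.
        return range(checkpoint, latest), range(0)
    if selected == "beginning":
        # The job jumps back, so data before the checkpoint is read again.
        return range(0), range(earliest, checkpoint)
    raise ValueError(f"unknown offset option: {selected!r}")


skipped_end, reread_end = reset_effect(120, 100, 150, "end")
skipped_begin, reread_begin = reset_effect(120, 100, 150, "beginning")

print(len(skipped_end), len(reread_end))      # 30 0
print(len(skipped_begin), len(reread_begin))  # 0 20
```

In this model, the skipped range is what a batch ingestion job would need to backfill, and the reread range is what a replace data job would need to overwrite.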

To reset the offset, do the following:

  1. Navigate to the Jobs page from the left navigation pane.
  2. Open the streaming job you want to reset.
  3. Select Hard reset.
  4. Polaris displays a confirmation dialog. Click Hard reset to proceed.
    After the reset, Polaris ingests streaming data from your selected offset.

To use the API to reset the offset, see Reset streaming job offset by API.

Reject early or late data

You can set early or late message rejection periods to reject events whose timestamps fall too far in the future or too far in the past, relative to the current time. By default, Polaris does not ingest events with timestamps more than 30 days in the past or more than 2000 days in the future. Polaris evaluates event timestamps after applying any transforms.
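The default window can be expressed as a simple timestamp check. The following Python sketch is illustrative only: the `is_accepted` function and its parameter names are hypothetical, and treating the window boundaries as inclusive is an assumption rather than documented Polaris behavior.

```python
from datetime import datetime, timedelta, timezone

# Defaults stated in this section.
LATE_PERIOD = timedelta(days=30)     # how far in the past a timestamp may be
EARLY_PERIOD = timedelta(days=2000)  # how far in the future a timestamp may be


def is_accepted(event_time, now, late_period=LATE_PERIOD, early_period=EARLY_PERIOD):
    """Return True if event_time falls inside the acceptance window.

    Boundaries are treated as inclusive here, which is an assumption.
    """
    return now - late_period <= event_time <= now + early_period


now = datetime(2024, 6, 1, tzinfo=timezone.utc)

recent = is_accepted(now - timedelta(days=7), now)
stale = is_accepted(now - timedelta(days=45), now)
far_future = is_accepted(now + timedelta(days=2500), now)

# A stricter late period of fourteen days rejects a 20-day-old event.
stale_with_14_days = is_accepted(
    now - timedelta(days=20), now, late_period=timedelta(days=14)
)

print(recent, stale, far_future, stale_with_14_days)  # True False False False
```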

To configure the message rejection period, in the Map source to table step of the ingestion wizard, select Filter from the menu bar and select the option to filter out late data or future data.

The following screenshot shows a job configured to ingest data that has timestamps no older than the past fourteen days:

[Screenshot: Streaming late message rejection]

Pause and resume jobs

You can pause (suspend) a streaming ingestion job and later resume it.

When you pause a job, the job execution status changes to Suspended. You can view a job's status in the job details page.

For a streaming ingestion job, the Suspended job status is different from the Idle job status. A job becomes idle when it has no data to ingest, and it resumes automatically when new data arrives. A job is suspended only when a user explicitly requests it, and it stays suspended until a user resumes it.
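The difference between the two statuses can be sketched as a small state machine. This is a simplified model for illustration, not Polaris code. The `StreamingJob` class and its method names are hypothetical, and the `RUNNING` status name is an assumption, since this section names only Suspended and Idle.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "Running"      # assumed name for an actively ingesting job
    IDLE = "Idle"
    SUSPENDED = "Suspended"


class StreamingJob:
    def __init__(self):
        self.status = Status.RUNNING

    def stream_empty(self):
        # A running job with nothing to ingest becomes idle.
        if self.status is Status.RUNNING:
            self.status = Status.IDLE

    def data_arrived(self):
        # An idle job resumes on its own; a suspended job does not.
        if self.status is Status.IDLE:
            self.status = Status.RUNNING

    def suspend(self):
        # Only an explicit user request suspends a job.
        self.status = Status.SUSPENDED

    def resume(self):
        # Only an explicit user request resumes a suspended job.
        if self.status is Status.SUSPENDED:
            self.status = Status.RUNNING


job = StreamingJob()

job.stream_empty()
idle_status = job.status        # Status.IDLE

job.data_arrived()
after_data = job.status         # Status.RUNNING

job.suspend()
job.data_arrived()
still_suspended = job.status    # Status.SUSPENDED

job.resume()
after_resume = job.status       # Status.RUNNING
```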

To pause or resume a job, do the following:

  1. Go to the Jobs page from the left navigation pane.
  2. Select the job.
  3. Click Suspend job or Resume job, depending on the state of the job and the desired change.
  4. Confirm your choice.

To do this using the API, see Pause or resume a job by API.
