This topic covers the available sources for ingestion in Imply Polaris and strategies to determine which source to use.
Polaris supports the following sources for data ingestion:
- Files. Upload files to the Polaris staging area and ingest from uploaded files.
- Tables. Load data from one Polaris table into another table.
- Amazon S3. Ingest data from Amazon S3 buckets.
- Amazon Kinesis. Ingest data from Amazon Kinesis Data Streams.
- Confluent Cloud. Ingest data from Apache Kafka topics in Confluent Cloud.
- Apache Kafka using the Kafka Connector. Ingest data from Apache Kafka topics through Kafka Connect.
- Push streaming applications. Send events to Polaris directly from your own application using the Events API.
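As a rough sketch of the push streaming option, the following Python snippet sends one JSON event to an Events API endpoint over HTTPS. The endpoint URL, table name, environment variable, and authentication header are all placeholders; refer to the Events API documentation for the exact request format.

```python
import os
import requests  # third-party HTTP client: pip install requests

# Placeholders: substitute your organization's Events API endpoint and credentials.
EVENTS_URL = "https://ORGANIZATION.api.example.com/v1/events/demo-table"
API_KEY = os.environ["POLARIS_API_KEY"]

event = {
    "timestamp": "2024-01-15T12:34:56.000Z",  # every event needs a timestamp
    "user": "alice",
    "action": "login",
}

# Send a single event; the Bearer auth scheme shown here is an assumption.
response = requests.post(
    EVENTS_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=event,
)
response.raise_for_status()
```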
To determine which ingestion source you need, consider your data access and usage patterns. For example:
- How much data do you need to ingest?
- How quickly does the data need to be accessed?
- How much data needs to be stored and queried at one time?
- What queries require low response latency?
This section provides guidance on the preceding questions to help you shape your strategy for ingesting data into Polaris.
## Streaming ingestion

Streaming ingestion is ongoing ingestion in which Polaris collects and stores data as it arrives from a streaming data source. This is useful when you need low latency between ingestion and query. Consider your query patterns and the event payload requirements to determine whether streaming ingestion fits your use case.
### Event payload requirements
The following requirements apply to incoming events from all streaming sources:
- Events must contain a timestamp value. See Timestamp expressions for parsing and transforming timestamps from the source data.
- The event timestamp must be within 30 days of ingestion time. Polaris rejects events with timestamps older than 30 days. If you need to ingest older data, use batch ingestion.
- A single payload request must not exceed 1 MB in size.
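To make these rules concrete, here is a self-contained sketch that validates an event before sending it: the timestamp must be present and within 30 days of now, and the serialized payload must fit within the 1 MB limit (interpreted here as 10^6 bytes). The field name `timestamp` and the helper itself are illustrative, not part of the Polaris API.

```python
import json
from datetime import datetime, timedelta, timezone

MAX_EVENT_AGE = timedelta(days=30)  # Polaris rejects timestamps older than 30 days
MAX_PAYLOAD_BYTES = 1_000_000       # a single payload request must not exceed 1 MB

def validate_event(event: dict, timestamp_field: str = "timestamp") -> None:
    """Raise ValueError if the event violates the streaming payload rules."""
    # Rule 1: the event must contain a timestamp within 30 days of ingestion time.
    raw = event.get(timestamp_field)
    if raw is None:
        raise ValueError(f"event is missing required field {timestamp_field!r}")
    ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    if datetime.now(timezone.utc) - ts > MAX_EVENT_AGE:
        raise ValueError("timestamp is older than 30 days; use batch ingestion instead")

    # Rule 2: the serialized payload must not exceed the size limit.
    size = len(json.dumps(event).encode("utf-8"))
    if size > MAX_PAYLOAD_BYTES:
        raise ValueError(f"payload is {size} bytes, over the 1 MB limit")

validate_event({
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "user": "alice",
})
```

Running a check like this client-side surfaces rejections before events ever reach the stream.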
## Batch ingestion

The following are some common use cases for batch ingestion in Polaris:
- Loading data into a table for the first time, such as when you migrate data from another database.
- Appending new data to an existing table.
- Backfilling data after initializing streaming ingestion.
Polaris requires all data to have a timestamp. The timestamp is used to partition and sort data and to perform time-based data management operations, such as dropping time chunks. You can transform the timestamp or fill in missing timestamps in your ingestion job. For more details, see Timestamp expressions.
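For illustration, the following client-side sketch performs the same kinds of transformations before ingestion: parsing epoch-millisecond and ISO 8601 timestamps into one canonical form and filling in a default when the timestamp is missing. The field names (`ts`, `__time`) and the default value are assumptions for the example, not Polaris requirements.

```python
from datetime import datetime, timezone

DEFAULT_TIME = datetime(2024, 1, 1, tzinfo=timezone.utc)  # assumed fill-in value

def normalize_timestamp(record: dict) -> dict:
    """Parse the record's timestamp into ISO 8601 UTC, filling in a default if absent."""
    raw = record.get("ts")
    if raw is None:
        ts = DEFAULT_TIME  # fill in a missing timestamp
    elif isinstance(raw, (int, float)):
        ts = datetime.fromtimestamp(raw / 1000, tz=timezone.utc)  # epoch milliseconds
    else:
        ts = datetime.fromisoformat(raw.replace("Z", "+00:00"))   # ISO 8601 string
    record["__time"] = ts.isoformat()
    return record

print(normalize_timestamp({"ts": 1700000000000, "user": "alice"}))
print(normalize_timestamp({"user": "bob"}))  # missing ts gets the default
```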
You can query data once it has been ingested into a table. Consider your query patterns to determine how often to run batch ingestion and how much data to ingest in each job. Possible strategies include the following:
- To ensure data completeness, wait until you have all relevant data before batch ingestion.
- To prioritize faster data access, ingest the latest data as it arrives.
Ingesting one file per job is not recommended: it can create tables with many small segments, which degrades Polaris performance. When possible, load multiple files in each ingestion job, as in the sketch below. Consider streaming ingestion for data that arrives more frequently than hourly.
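This sketch groups files so that each job loads many files instead of one. The grouping logic is the point; `upload_file` and `start_batch_job` are hypothetical stand-ins for whatever upload and job-creation calls your setup uses, and the batch size is an assumption to tune for your data.

```python
from pathlib import Path
from typing import Iterable, List

BATCH_SIZE = 50  # assumed cap on files per ingestion job; tune for your data

def plan_batches(files: Iterable[Path], batch_size: int = BATCH_SIZE) -> List[List[Path]]:
    """Group files so that each ingestion job loads many files instead of one."""
    ordered = sorted(files)
    return [ordered[i : i + batch_size] for i in range(0, len(ordered), batch_size)]

# Hypothetical stand-ins for the real upload and job-creation calls.
def upload_file(path: Path) -> None:
    print(f"uploading {path} to the staging area")

def start_batch_job(filenames: List[str]) -> None:
    print(f"starting one ingestion job for {len(filenames)} files")

for batch in plan_batches(Path("data").glob("*.json")):
    for path in batch:
        upload_file(path)
    start_batch_job([p.name for p in batch])
```

Grouping files this way keeps segment counts manageable without delaying data availability much.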
## Ingestion process overview
To ingest data into Polaris, create the following components:
- A table with a defined schema to receive the ingested data.
- A connection to define the source of the data. Connections are not required for ingesting from uploaded files.
- An ingestion job to bring in data from the connection.
The ingestion job bridges connections and tables by importing data from the source defined in a connection to a Polaris table. In the ingestion job specification, you also define how the input data maps to the table schema.
You can create tables, connections, and ingestion jobs using the UI or the API.
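The following end-to-end sketch shows the order of operations over plain HTTP: create the table, create the connection, then submit an ingestion job that references both. Every URL path, payload field, and header below is a placeholder standing in for the real request shapes documented in the API reference.

```python
import os
import requests

# Placeholder base URL and credentials; substitute your organization's values.
BASE = "https://ORGANIZATION.api.example.com/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['POLARIS_API_KEY']}"}

# 1. A table with a defined schema to receive the ingested data (payload shape assumed).
requests.post(f"{BASE}/tables", headers=HEADERS, json={
    "name": "web-events",
    "schema": [
        {"name": "__time", "dataType": "timestamp"},
        {"name": "user", "dataType": "string"},
    ],
}).raise_for_status()

# 2. A connection defining the source of the data (skipped for uploaded files).
requests.post(f"{BASE}/connections", headers=HEADERS, json={
    "name": "my-s3-source",
    "type": "s3",
    "bucket": "example-bucket",
}).raise_for_status()

# 3. An ingestion job mapping the source's input data to the table schema.
requests.post(f"{BASE}/jobs", headers=HEADERS, json={
    "type": "batch",
    "target": {"type": "table", "tableName": "web-events"},
    "source": {"type": "connection", "connectionName": "my-s3-source"},
}).raise_for_status()
```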