This topic covers the available sources for ingestion in Imply Polaris and strategies to determine which source to use.
For a comparison of the features of ingestion sources, see the Ingestion source reference.
To determine what ingestion source best meets your needs, analyze your data access and patterns. For example:
- How much data do you need to ingest?
- How quickly do you need to access the data?
- How much data will you send, and how much will you query, at a single time?
- What queries require low response latency post-ingestion?
Polaris provides scalable ingestion for both batch and streaming sources. Billing for data ingestion only accounts for the raw bytes of data that Polaris processes during ingestion; you are not charged for vCPU usage for ingestion. Polaris does not limit scaling and capacity by project size. The following sections provide more information on ingestion scalability for batch and streaming ingestion.
Batch ingestion in Polaris refers to an ingestion job that reads a finite amount of data from your source and terminates when all rows have been loaded into Polaris.
For batch ingestion only, you can use SQL to define your ingestion job. For more information, see Ingest using SQL.
Batch use cases
The following are a few common use cases for batch ingestion in Polaris:
- Load data into a table for the first time, such as when you need to migrate data from another database.
- Append new data into an existing table.
- Backfill older data after initializing streaming ingestion.
Batch ingestion strategies
You can query data once it has been ingested into a table. Consider your query patterns to determine how often to run batch ingestion and how much data to ingest in each job. Possible strategies include the following:
- To ensure data completeness, wait until you have all relevant data before batch ingestion.
- To prioritize faster data access, ingest the latest data as it arrives.
Ingesting one file per job is not recommended since it can create datasources with many small Druid segments and degrade Polaris performance. When possible, load multiple files per ingestion job. Consider streaming ingestion for data that arrives more frequently than hourly.
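To make the batching strategy concrete, the following sketch groups source files into per-job batches rather than submitting one job per file. The file-count and byte thresholds here are illustrative assumptions for the example, not Polaris limits.

```python
# Sketch: group source files into batch ingestion jobs instead of one job per
# file. The batching thresholds below are illustrative, not Polaris limits.
MAX_FILES_PER_JOB = 500          # illustrative cap on files per job
MAX_BYTES_PER_JOB = 5 * 2**30    # illustrative cap: 5 GiB per job

def plan_batches(files):
    """Group (name, size_bytes) pairs into per-job batches."""
    batches, current, current_bytes = [], [], 0
    for name, size in files:
        if current and (len(current) >= MAX_FILES_PER_JOB
                        or current_bytes + size > MAX_BYTES_PER_JOB):
            batches.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += size
    if current:
        batches.append(current)
    return batches

# Example: 1,200 files of 10 MiB each become 3 jobs instead of 1,200 jobs.
files = [(f"events-{i}.json.gz", 10 * 2**20) for i in range(1200)]
print([len(b) for b in plan_batches(files)])  # [500, 500, 200]
```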
Batch ingestion scalability
Polaris determines the best method of parallelizing the ingestion job based on the job itself. For batch ingestion, Polaris heuristically evaluates the number and size of files with the goal of having an optimal number of rows per parallel worker process. A single worker can be assigned up to 10,000 files, subject to a byte limit that Polaris determines dynamically based on the format and compression of the files. Polaris currently uses at most 75 workers per ingestion job; however, only very large ingestion jobs approach this limit.
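The file-count bounds above imply a lower bound on worker parallelism, sketched below. This ignores the dynamic byte-based limit, which depends on file format and compression, so treat it as an approximation of the minimum worker count only.

```python
# Sketch of the parallelism bounds described above: at most 10,000 files per
# worker and at most 75 workers per job. The dynamic per-worker byte limit is
# omitted, so this is only the floor implied by file count.
MAX_FILES_PER_WORKER = 10_000
MAX_WORKERS_PER_JOB = 75

def workers_for(file_count: int) -> int:
    """Minimum workers implied by the file-count limit, capped at 75."""
    needed = -(-file_count // MAX_FILES_PER_WORKER)  # ceiling division
    return max(1, min(needed, MAX_WORKERS_PER_JOB))

print(workers_for(3_000))      # 1
print(workers_for(250_000))    # 25
print(workers_for(2_000_000))  # 75 (capped)
```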
Batch ingestion sources
Polaris supports several batch ingestion sources:
- Files: Upload files to the Polaris staging area and ingest from them using the UI or API.
- Tables: Load data from one Polaris table into another table using the UI or API.
- Amazon S3: Read files from Amazon S3 buckets to ingest data into Polaris using the UI or API.
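As a rough illustration of ingesting from one of these sources via the API, the snippet below builds a request body for a batch job that reads from an S3 connection. The field names, connection name, and bucket path are assumptions for this sketch, not the exact Polaris API schema; consult the Polaris API documentation for the real job spec.

```python
# Illustrative only: a request body for a batch ingestion job reading from an
# S3 connection. Field names and values below are assumptions for the sketch,
# not the exact Polaris API schema.
import json

job_spec = {
    "type": "batch",
    "target": {"type": "table", "tableName": "example_table"},
    "source": {
        "type": "connection",
        "connectionName": "my_s3_connection",  # assumed connection name
        "formatSettings": {"format": "nd-json"},
        "uris": ["s3://example-bucket/events/2024-01-01.json.gz"],
    },
}
print(json.dumps(job_spec, indent=2))
```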
Streaming ingestion in Polaris refers to an ongoing ingestion job that consumes data from your event stream.
Streaming use cases
When you need low latency between ingestion and query, streaming ingestion is a good choice. Take into account your query patterns and the event payload requirements to determine whether streaming ingestion fits your use case. Also note that Polaris only ingests streaming event data from within 30 days of ingestion time.
Streaming ingestion strategies
The options for streaming ingestion to Polaris include:
- Consume from an event stream, sometimes called "pull streaming ingestion." This is the best option for high data volume and high throughput.
- Publish event data to Polaris using the Events API. This is sometimes called "push streaming ingestion." The Events API is a good choice when you don't want to manage Kafka or a similar event streaming technology, such as in IoT applications. When sending event data to Polaris, the payload for a single request must not exceed 1 MB in size.
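For push streaming ingestion, the 1 MB per-request limit means a client typically needs to split a stream of events into multiple payloads. The sketch below chunks events into newline-delimited JSON payloads that stay under that limit; the actual HTTP send to the Events API is omitted, and the event shape is an assumption for the example.

```python
# Sketch: split events into newline-delimited JSON payloads that each stay
# under the Events API's 1 MB per-request limit. Sending the request itself
# (endpoint, auth) is omitted; the event fields are assumptions.
import json

MAX_PAYLOAD_BYTES = 1 * 1024 * 1024  # 1 MB limit per request

def chunk_events(events):
    """Yield NDJSON payload strings, each within the 1 MB request limit."""
    lines, size = [], 0
    for event in events:
        line = json.dumps(event)
        line_size = len(line.encode("utf-8")) + 1  # +1 for the newline
        if lines and size + line_size > MAX_PAYLOAD_BYTES:
            yield "\n".join(lines)
            lines, size = [], 0
        lines.append(line)
        size += line_size
    if lines:
        yield "\n".join(lines)

events = [{"ts": i, "value": "x" * 200} for i in range(10_000)]
payloads = list(chunk_events(events))
assert all(len(p.encode("utf-8")) <= MAX_PAYLOAD_BYTES for p in payloads)
print(len(payloads))  # the ~2.3 MB of events split into 3 requests
```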
Streaming ingestion scalability
For streaming ingestion jobs, Polaris scales the number of tasks to minimize ingestion lag and maintain near-zero latency. The maximum number of parallel tasks is determined by the configuration of the streaming source: the number of partitions in a Kafka topic or the number of shards in a Kinesis stream.
Consume from an event stream
Polaris supports consuming event streams from the following sources:
- Apache Kafka and MSK: Ingest streaming event data from a self-managed Apache Kafka cluster or Amazon MSK using the UI or API.
- Amazon Kinesis: Ingest data from Amazon Kinesis Data Streams using the UI or API.
- Confluent Cloud: Ingest streaming event data from Confluent Cloud using the UI or API.
Publish data from an event stream
You can publish events to Polaris from the following ingestion sources:
- Kafka Connector: Read data from a Kafka event stream and send it to Polaris using the Kafka Connector. Note that the Kafka Connector runs inside Kafka Connect.
- Events API: Send events to Polaris directly from your own application with the Events API.
For a feature comparison of ingestion sources, see the Ingestion source reference.
For information about supported data formats, see Supported data and file formats.