Batch ingestion in Imply Polaris is the process of importing data stored in files into Polaris tables. The following are some of the common use cases for batch ingestion in Polaris:
- Loading data into a table for the first time, such as when you migrate data from another database
- Backfilling data after initializing streaming ingestion
- Appending new data to an existing table
To run batch ingestion in Polaris, you first need to upload your files to the file staging area and create the table to host the data. This topic describes the process to upload and ingest data from files as well as strategies to consider for batch ingestion. For information on tables, see Introduction to tables.
Batch ingestion strategies
You can query data once it has been ingested into a table. Consider your query patterns to determine how often to run batch ingestion and how much data to ingest in each job. Possible strategies include the following:
- To ensure data completeness, wait until you have all relevant data before batch ingestion.
- To prioritize faster data access, ingest the latest data as it arrives.
The following examples illustrate each case.
Example 1: You have a new data file generated for each customer every day. To ensure your queries are accurate for each day, ingest all customer files for the same day together. If multiple batch ingestion jobs are required, such as to accommodate different file formats, confirm that the ingestion jobs succeeded before querying. This batch ingestion strategy makes data available atomically each day.
Example 2: Your data pipeline produces data files every 15 minutes, and you want users to be able to query the data as soon as it's produced. In this case, start a batch ingestion job for each file as it comes in, rather than collecting a set of data files to ingest together. This batch ingestion strategy promotes quick access to the latest data.
Certain user roles are required to run batch ingestion. See User roles reference for the required roles and their permissions.
Before initiating batch ingestion, upload files to the staging area. View the staging area from the Files tab of the left navigation tree. The following screenshot shows the file staging area:
The file staging area displays the name, size, upload date, and status of each file. Click the ellipsis menu for any file to view its MD5 checksum and verify the integrity of the file upload.
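To compare against the checksum Polaris displays, you can compute the MD5 digest of your local source file before or after uploading. A minimal sketch using only the Python standard library:

```python
import hashlib

def md5_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading in chunks
    so large files don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the value returned here matches the checksum shown in the file staging area, the upload completed without corruption.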
You can upload files in the UI from a table view or from the file staging area. See Upload files by API for a guide on uploading files using the Polaris API.
Polaris can load data from multiple files in a single batch ingestion job as long as the files have the same format. To ingest data from different formats into the same table, create a separate ingestion job for each format—for example, one job to ingest newline-delimited JSON data and a separate job to ingest CSV data. See Supported source data formats for the data formats supported by Polaris.
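Because each job accepts only one format, a simple pre-processing step is to group your staged files by format before creating jobs. The sketch below is illustrative client-side logic, not part of Polaris; the format labels are assumptions chosen to mirror common format names:

```python
from collections import defaultdict

# Illustrative mapping from file extension to a data format label.
FORMAT_BY_EXTENSION = {
    ".json": "nd-json",  # newline-delimited JSON
    ".csv": "csv",
    ".tsv": "tsv",
}

def group_files_by_format(filenames):
    """Group staged files so that each batch ingestion job
    contains files of a single format."""
    jobs = defaultdict(list)
    for name in filenames:
        for ext, fmt in FORMAT_BY_EXTENSION.items():
            if name.endswith(ext):
                jobs[fmt].append(name)
                break
    return dict(jobs)
```

Each key in the returned dictionary then corresponds to one ingestion job targeting the same table.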
You can load data into a table whose status is Setup incomplete, Ready for ingestion, or Ingested. When you click on a table whose status is Setup incomplete, the table view shows tiles to edit schema and to start batch ingestion:
Click Start batch ingestion to move to the Add data page:
After you select your files to batch ingest, Polaris automatically detects the file type and creates a table preview in the Map source to table page. This page displays the table schema and a sample of your data. You can refine the schema further with the following actions:
- Add or delete columns
- Update a column name
- Change a column data type
- Set a different source column
See Create a schema for more information on defining a table schema.
You can ingest additional data into a table when its status is Ready for ingestion or Ingested. To add data, click Add data in the top navigation panel of the table view. This takes you to the Add data page shown above.
You can also complete batch ingestion using the Polaris API. For more information, see Ingest batch data by API.
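As a rough sketch of what an API-driven workflow looks like, the following builds a batch ingestion job request body. The field names and structure here are assumptions for illustration only; refer to Ingest batch data by API for the actual job spec and endpoint:

```python
import json

def build_batch_ingestion_job(table_name, staged_files, data_format):
    """Build a hypothetical batch ingestion job spec.
    Field names are illustrative, not the authoritative Polaris schema."""
    return {
        "type": "batch",
        "target": {"type": "table", "tableName": table_name},
        "source": {
            "type": "uploaded",
            "fileList": staged_files,
            "formatSettings": {"format": data_format},
        },
    }

# Example: one job ingesting a single staged CSV file.
job = build_batch_ingestion_job("example_table", ["data.csv"], "csv")
print(json.dumps(job, indent=2))
```

You would submit a body like this to the Polaris jobs endpoint with your API key; consult the API reference for the exact path and authentication headers.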
View ingestion status
To monitor the status of ingestion jobs or to view past ingestion jobs, navigate to the Ingestion Jobs tab in the UI. There, you can also view and sort by ingestion job ID, table where the data was ingested, time the job started, and how long the job took. The following screenshot shows the job monitoring page:
In addition to newline-delimited JSON, Polaris supports data ingestion from files containing delimiter-separated values, specifically:
- CSV: Comma-separated values
- TSV: Tab-separated values
- Custom: Custom text delimiter
A custom delimiter of `,` is equivalent to CSV format. A value of `\t` is equivalent to TSV format.
When you supply a custom delimiter, Polaris parses each row using it. If the rows do not all contain the same number of delimiters, Polaris identifies columns based on the row with the fewest delimiters.
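To preview locally how a custom delimiter splits your data before ingesting, you can parse a sample with Python's `csv` module, which accepts any single-character delimiter. This is a local sanity check, not a reproduction of Polaris's internal parser:

```python
import csv
import io

# Sample pipe-delimited data with a header row.
raw = "id|name|score\n1|alice|9.5\n2|bob|8.0\n"

# Parse with a custom delimiter, mirroring the "Custom" option
# in the Polaris format settings.
reader = csv.reader(io.StringIO(raw), delimiter="|")
rows = list(reader)
header, data = rows[0], rows[1:]
print(header)  # ['id', 'name', 'score']
```

If the header and row widths don't line up in a preview like this, the same mismatch will surface in the Polaris data sample.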
When ingesting files with delimiter-separated values, you can specify the following:
- Number of header rows to skip. If you choose to skip any header rows, Polaris detects the column headers from the first non-skipped row.
- New column names. If your files do not contain headers or if you want to use your own column names, set Data has header? to No and enter your column headers.
Verify the column headers and any skipped rows in the data sample before you continue to schema editing.
The following screenshot shows an example of batch ingestion from a CSV file. Here, Polaris ignores the first line of the file and assigns new column names.
You can also configure the format settings for batch ingestion using the API. For more information, see the formatSettings field of the ingestion job spec.
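As a sketch of what such settings might look like for a custom-delimited file with one header row to skip, a fragment along these lines could appear in the job spec. The exact field names are assumptions; verify them against the formatSettings documentation:

```json
{
  "formatSettings": {
    "format": "custom",
    "delimiter": "|",
    "skipHeaderRows": 1,
    "columns": ["id", "name", "score"]
  }
}
```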
See the following topics for more information: