Supported data and file formats
This topic is a reference for data and file format support in Imply Polaris. In an ingestion job, Polaris automatically detects the data format or compression format from the file extension. If you specify a format that does not match the automatically detected type, Polaris attempts to ingest the data based on the user-specified value.
Supported source data formats
The following table describes the data formats that Polaris supports for batch and streaming ingestion.
Format | Batch ingestion | Streaming ingestion |
---|---|---|
Newline-delimited JSON | Yes | Yes |
Delimiter-separated values | Yes | Yes |
Regular expressions | Yes | Yes |
Apache Avro | Yes (Avro OCF) | Yes (not supported for push streaming) |
Apache ORC | Yes | No |
Apache Parquet | Yes | No |
Protocol Buffers (Protobuf) | No | Yes (not supported for push streaming) |
You can ingest nested data for all supported data formats.
For details on how to specify your input data schema for Avro and Protobuf formats, see Specify the data schema.
If your file uses the UTF-8 character encoding (the most common text encoding), ensure that the file does not start with a byte order mark. A byte order mark can interfere with parsing UTF-8 data.
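One way to check for and remove a byte order mark before uploading is sketched below. This is a minimal illustration, not part of Polaris: the function names are hypothetical, and the in-place helper reads the whole file into memory, so it assumes the file is small enough for that.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 byte order mark, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def remove_bom_in_place(path: str) -> bool:
    """Rewrite the file without its BOM. Return True if a BOM was removed.

    Assumption: the file fits in memory. For very large files, copy the
    content in chunks to a new file instead.
    """
    file_path = Path(path)
    original = file_path.read_bytes()
    cleaned = strip_utf8_bom(original)
    if cleaned != original:
        file_path.write_bytes(cleaned)
        return True
    return False
```

Calling `remove_bom_in_place("events.json")` on a file saved by an editor that writes a byte order mark leaves the JSON content unchanged and drops only the three leading bytes.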
Supported file compression formats
Polaris supports the following compression formats for uploaded files:
ZIP files and TAR files are not supported.
You can send gzipped data in push streaming ingestion with the HTTP header Content-Encoding: gzip.
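As an illustration of the header in use, the following sketch compresses a batch of events as newline-delimited JSON and builds a request that carries Content-Encoding: gzip. The URL is a hypothetical placeholder, the Content-Type value is an assumption, and authentication is omitted; take the real endpoint and credentials from your own push streaming connection.

```python
import gzip
import json
import urllib.request

# Hypothetical placeholder, not a real Polaris endpoint.
PUSH_URL = "https://example.invalid/events"


def build_gzip_request(events, url=PUSH_URL):
    """Build a POST request whose body is gzip-compressed newline-delimited JSON."""
    ndjson = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    body = gzip.compress(ndjson)
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",  # assumption for JSON events
            "Content-Encoding": "gzip",  # tells the server the body is gzipped
        },
    )
```

The request is only constructed here, not sent. Passing it to `urllib.request.urlopen` would send it once a real URL and authentication header are supplied.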
File size limit
Polaris supports individual files up to 10 GB. This limit refers to the size of the file transmitted by the browser or HTTP client.
You can upload a file that's larger than 10 GB on disk if your browser or client compresses the file to less than 10 GB in transit.
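To estimate whether a large file would fit under the limit once compressed, you could measure its gzipped size without writing the compressed copy to disk. The sketch below makes two assumptions that the text does not confirm: that the client compresses with gzip, and that 10 GB means 10 × 1000³ bytes. The function names are hypothetical.

```python
import zlib

# Assumption: decimal gigabytes. The source does not say whether the limit
# is 10 * 1000**3 or 10 * 1024**3 bytes.
LIMIT_BYTES = 10 * 1000**3


def gzipped_size(path, chunk_size=1 << 20):
    """Return the number of bytes the file occupies after gzip compression."""
    compressor = zlib.compressobj(wbits=31)  # wbits=31 selects the gzip container
    total = 0
    with open(path, "rb") as src:
        while chunk := src.read(chunk_size):
            total += len(compressor.compress(chunk))
    total += len(compressor.flush())  # remaining data plus the gzip trailer
    return total


def fits_transfer_limit(path, limit=LIMIT_BYTES):
    """Return True if the gzipped file is no larger than the limit."""
    return gzipped_size(path) <= limit
```

The file is streamed in 1 MiB chunks, so memory use stays flat regardless of file size. The result is an estimate: a client that uses a different compression level or algorithm transmits a different number of bytes.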
Timestamp requirements
Polaris requires all data to have a timestamp. The timestamp is used to partition and sort data and to perform time-based data management operations, such as dropping time chunks. You can transform timestamps or fill in missing timestamps in your ingestion job. For information about parsing and transforming timestamps from your source data, see Timestamp expressions.
Late arriving event data
For streaming ingestion, the event timestamp must be within 30 days of ingestion time. Polaris rejects events with timestamps older than 30 days. To override this period, set the late message rejection period in the ingestion job.
Otherwise, if you need to ingest older data, use batch ingestion.
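A client can apply the same rule before sending, so that events too old for streaming are set aside for a batch job instead of being rejected. The following is a minimal sketch under stated assumptions: each event carries an ISO 8601 timestamp with a UTC offset in a field named `timestamp` (a hypothetical name), the default 30-day period applies, and only the "older than" side of the window is checked.

```python
from datetime import datetime, timedelta, timezone

# Default period from the text. Adjust it if the ingestion job overrides
# the late message rejection period.
LATE_REJECTION_PERIOD = timedelta(days=30)


def is_within_rejection_period(event_time, now=None, period=LATE_REJECTION_PERIOD):
    """Return True if event_time is no more than `period` older than `now`."""
    now = now or datetime.now(timezone.utc)
    return now - event_time <= period


def route_events(events, now=None):
    """Split events into (streaming, batch) lists by timestamp age."""
    streaming, batch = [], []
    for event in events:
        # Assumption: offset-aware ISO 8601 strings such as
        # "2024-06-15T00:00:00+00:00".
        event_time = datetime.fromisoformat(event["timestamp"])
        if is_within_rejection_period(event_time, now=now):
            streaming.append(event)
        else:
            batch.append(event)
    return streaming, batch
```

With `now` fixed at 2024-06-30 UTC, an event stamped 2024-06-15 goes to the streaming list and one stamped 2024-01-01 goes to the batch list.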
Learn more
For information on supported timestamp formats in Polaris, see Timestamp expressions.