Troubleshoot data ingestion
This topic describes common issues and resolution strategies regarding data ingestion in Imply Polaris. For information on creating an ingestion job, see Create an ingestion job. Also see the known limitations of Polaris.
Missing data from ingestion job
Unexpected behavior: An ingestion job succeeds with an HTTP 200 OK status code but ingests no rows or fewer rows than expected.
Remediation action: Check the source data for unparseable rows, and review any filter expressions applied during ingestion.
More info: Polaris skips ingesting rows that are unparseable or filtered out by a given filter expression. Also note the following points:
- An aggregate table may show fewer rows than the input data due to rollup. Rollup combines rows that have the same timestamp and dimension values into a single row, as shown in the example following this list.
- SQL-based ingestion jobs fail when all rows are filtered out.
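For example, consider a hypothetical aggregate table with a country dimension, a clicks measure aggregated with SUM, and HOUR time granularity. The following two input rows share the same hour and dimension value:
{"ts": "2023-06-16 00:10:00", "country": "US", "clicks": 3}
{"ts": "2023-06-16 00:45:00", "country": "US", "clicks": 5}
Rollup stores them as a single row with __time 2023-06-16T00:00:00Z, country US, and clicks 8, so the table reports one row where the input had two.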
To view row ingestion metrics and any parse exceptions from your ingestion job, see Job health reference.
Data formats
Incorrectly formatted data can lead to errors when Polaris attempts to sample and ingest data. Supported data and file formats lists the supported source data formats as well as file compression formats.
Polaris automatically detects the data format or compression format based on the file extension. If you specify a format that differs from the automatically detected type, Polaris attempts to ingest the data using the format you specify.
JSON data
Error: Polaris throws the following error for incorrectly formatted data:
- In the UI, Polaris displays the error, "No rows could be sampled (there are unparseable rows)".
- In the API, Polaris creates the job successfully with an HTTP 201 Created status code, but the job fails with an error similar to the following example.
The following response from a request to get job logs shows an ingestion job that was created successfully but failed due to incorrectly formatted JSON:
{
"logs": [
{
"timestamp": "2023-07-19T21:10:42.635Z",
"healthStatus": "error",
"code": "InsertTimeNull",
"message": "Encountered a null timestamp in the __time field during INSERT or REPLACE."
},
{
"timestamp": "2023-07-19T21:10:42.635Z",
"healthStatus": "warn",
"code": "CannotParseExternalData",
"message": "A worker task could not parse data from an external datasource."
}
]
}
Remediation action: Verify that the data is valid newline-delimited JSON: a sequence of JSON objects, one per line, rather than a single large JSON object or array.
More info: The JSON format that Polaris supports is newline-delimited JSON. In newline-delimited JSON, each line of the source data must be a completely valid JSON object, separated by newline characters and not commas. Each object must be on its own line and not split across multiple lines.
The following example is valid newline-delimited JSON:
{"ts": "2023-06-16 00:00:00", "item_id": 123}
{"ts": "2023-06-16 01:00:00", "item_id": 234}
You cannot ingest array-based JSON data, that is, JSON objects listed as comma-separated elements of an array. The following example is unsupported JSON for ingestion:
[
{
"ts": "2023-06-16 00:00:00",
"item_id": 123
},
{
"ts": "2023-06-16 01:00:00",
"item_id": 234
}
]
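If your source data is array-based JSON, you can convert it to newline-delimited JSON before ingestion. For example, the following command is a minimal sketch using the third-party jq tool, assuming the array is the top-level value in a file named data.json:
jq -c '.[]' data.json > data.ndjson
The -c flag emits each array element as a compact, single-line JSON object, producing exactly one object per line.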
Data replacement time interval
Error: A SQL-defined job to replace data fails to start with an error similar to the following:
OVERWRITE WHERE clause identified interval [2020-01-01T00:00:00.000Z/2020-01-31T00:00:00.001Z] which is not aligned with PARTITIONED BY granularity [{type=period, period=P1M, timeZone=UTC, origin=null}]
Remediation action: Review your OVERWRITE WHERE time interval and ensure that the boundaries align with the time partitioning set in the job. In most cases, time intervals should use < (less than) for the upper time boundary rather than <= (less than or equal to).
More info: Consider a data replacement example in which time partitioning is set to MONTH. Polaris evaluates the following OVERWRITE WHERE time intervals accordingly:
- The following interval raises an error because its boundaries do not match the time partitioning:
"__time" >= TIMESTAMP '2023-11-01' AND "__time" <= TIMESTAMP '2023-11-30'
The interval translates to 2023-11-01T00:00:00.000Z/2023-11-30T00:00:00.001Z. The upper boundary includes only the first millisecond of November 30, so the interval falls short of a full month.
- The following interval aligns correctly with the time partitioning:
"__time" >= TIMESTAMP '2023-11-01' AND "__time" < TIMESTAMP '2023-12-01'
The interval correctly translates to 2023-11-01T00:00:00.000Z/2023-12-01T00:00:00.000Z.
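For reference, the following sketch shows a replace job whose OVERWRITE WHERE interval aligns with MONTH partitioning. The table name is a placeholder and the SELECT list is elided:
REPLACE INTO "example_table"
OVERWRITE WHERE "__time" >= TIMESTAMP '2023-11-01' AND "__time" < TIMESTAMP '2023-12-01'
SELECT ...
PARTITIONED BY MONTH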
For more information, see SQL ingestion reference.
Event data not ingested
Unexpected behavior: A streaming ingestion job does not ingest any events.
Remediation action: To troubleshoot a running job not ingesting any events, try the following:
- Verify that there is data actively in the topic or stream and that it hasn't been deleted according to the retention policy of the topic or stream.
- Test the connection to ensure the correct topic or stream name and valid credentials.
- Confirm that the timestamp values mapped to the __time column are within 30 days of ingestion time. For more information, see Late arriving event data.
- Reset the reading checkpoint for the stream.
More info: A streaming ingestion job may fail to ingest when the reading offset for the topic or stream has expired. This issue applies to a given table name and topic name, regardless of connection name. Possible causes include the following:
- The topic was deleted and recreated, so that the reading offset restarted.
- Data that was previously read from the topic expired or got deleted.
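To confirm that a Kafka topic still contains readable events, you can consume a few records directly. The following command is a sketch using the third-party kcat tool; BROKER_ADDRESS and TOPIC_NAME are placeholders, and your cluster may require additional authentication options:
kcat -b BROKER_ADDRESS -t TOPIC_NAME -C -o beginning -c 5 -e
The command reads up to five messages from the beginning of the topic and then exits. No output suggests that the topic is empty or that its data has expired.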
Schema registry with data streams
Error: A streaming ingestion job is running, but the job logs, available from the job details page or the API, list row processing errors due to a failure to fetch the schema.
Remediation action: To address this error, verify the following:
- The ingestion job lists the correct Schema Registry connection in source.formatSettings.parseSchemaProvider.connectionName.
- The ingestion job lists the correct streaming connection in source.connectionName.
- The Schema Registry connection has the correct source URL.
- The Schema Registry connection has valid authentication credentials.
- The schema associated with the message exists and is valid for the data sent.
More info: The job logs may show an error such as the following:
{
"logs": [
{
"timestamp": "2023-07-21T23:26:16.553Z",
"healthStatus": "error",
"code": "CannotProcessRow",
"message": "Failed to fetch Avro schema id[100006] from registry. Check if the schema exists in the registry. Otherwise it could mean that there is malformed data in the stream or data that doesn't conform to the schema specified."
}
]
}
In this example, 100006 is the schema ID associated with the message sent to the Kafka topic. If your ingestion job spec, streaming connection details, and Schema Registry connection details are correct, verify that the schema still exists in the registry and is valid for your data. One possible cause is that the schema associated with the message was deleted, leaving a message that Polaris cannot decode.
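To check whether a schema ID still resolves, you can query the registry directly. The following request is a sketch that assumes a Confluent-compatible Schema Registry; REGISTRY_HOST is a placeholder, and your registry may require authentication options:
curl https://REGISTRY_HOST/schemas/ids/100006
A 404 response indicates that the schema no longer exists in the registry.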
Request body structure
Error: The ingestion job fails with an HTTP 400 Bad Request status code and the response code "BadArgument".
Remediation action: Verify the following aspects of the request body of the POST request:
- The request body is correctly formatted JSON.
- All required properties are included.
- Each property contains the correct type of JSON value (string, number, array, object).
- Each enum property is assigned an allowable value.
More info: The following sections provide more information on possible errors for the BadArgument code.
Incorrectly formatted JSON
Ensure the request body is properly formatted JSON and does not have missing or extra commas, curly braces, or square brackets. Polaris highlights the object with the incorrect syntax. For example, consider the case when source.formatSettings has an extra comma in the request body:
"formatSettings": {
"format": "nd-json",
}
Polaris returns the following response body:
{
"error": {
"code": "BadArgument",
"message": "The request contained invalid JSON: Unexpected character ('}' (code 125)): was expecting double-quote to start field name",
"target": "source.formatSettings"
}
}
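To catch syntax errors like this before sending a request, you can validate the request body locally. For example, assuming the body is saved in a file named request.json, jq prints the first syntax error it encounters:
jq . request.json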
Missing required properties
If you did not include a required property, Polaris includes the missing property in the response body. For example, if you did not supply target.tableName, Polaris returns the following response:
{
"error": {
"code": "BadArgument",
"message": "createJob.arg1.target.tableName: must not be null",
"details": [
{
"code": "BadArgument",
"message": "must not be null"
}
]
}
}
See the Jobs API documentation for a description of all required parameters.
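For orientation only, the following sketch shows the general shape of a batch ingestion request body that supplies target.tableName. The property names and values here are illustrative assumptions, not a complete or authoritative spec; consult the Jobs API documentation for the actual schema:
{
  "type": "batch",
  "source": {
    "type": "uploaded",
    "fileList": ["example-file.ndjson"],
    "formatSettings": {
      "format": "nd-json"
    }
  },
  "target": {
    "type": "table",
    "tableName": "example_table"
  }
}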
Incorrect JSON type
You may receive the BadArgument error when a property is not supplied with the correct JSON type. For example, the property intervals takes an array of strings, so the following JSON snippet raises an error:
"intervals": "2000-01-03/2020-01-20"
The following example shows the correct syntax for this property:
"intervals": ["2000-01-03/2020-01-20"]
Unsupported enum value
Also ensure that you supply an allowable value for enum properties. For example, the data format set in source.formatSettings.format accepts nd-json and not json. The response body shows the known types accepted in this property:
{
"error": {
"code": "BadArgument",
"message": "The request contained invalid JSON: Could not resolve type id 'json' as a subtype of `io.imply.services.tables.gen.dtos.DataFormatSettings`: known type ids = [DataFormatSettings, avro_ocf, avro_stream, csv, kafka, nd-json, orc, parquet, protobuf] (for POJO property 'formatSettings')",
"target": "source.formatSettings"
}
}
Function arguments
Unexpected behavior: The ingestion job succeeds but the ingested data shows null values or does not match the expected outcome of the input expression.
Remediation action: Check that function arguments used in input expressions have the correct syntax.
More info: For example, the following input expression using JSON_VALUE contains improper syntax for the second parameter, which must be a JSONPath expression. The ingestion job may succeed, but Polaris populates the column with null values.
JSON_VALUE("campaign",'terms')
The following example shows correct usage of the function:
JSON_VALUE("campaign",'$.terms')
HTTP content header
Error: The ingestion job fails with the HTTP 415 Unsupported Media Type status code.
Remediation action: Set the HTTP header Content-Type to application/json.
More info: The request body for creating an ingestion job by API is a JSON object that you send with your request. Your API client should automatically set the correct Content-Type header. However, if you receive this error, set this header to application/json. For example, in a curl request, include the command-line option -H 'Content-Type: application/json'.
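For example, the following curl sketch sets both headers explicitly; the URL components, project ID, and API key variable are placeholders that depend on your deployment:
curl -X POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/jobs" \
  -H "Authorization: Basic $POLARIS_API_KEY" \
  -H "Content-Type: application/json" \
  -d @request.json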
Contact Polaris support
For additional assistance, contact Polaris customer support.