
Troubleshoot data ingestion

This topic describes common issues and resolution strategies regarding data ingestion in Imply Polaris. For information on creating an ingestion job, see Create an ingestion job. Also see the known limitations of Polaris.

Missing data from ingestion job

Unexpected behavior: An ingestion job succeeds with an HTTP 200 OK status code but ingests no rows or fewer rows than expected.

Remediation action: Verify data quality for unparseable rows, and update any filter expressions.

More info: Polaris skips ingesting rows that are unparseable or filtered out by a given filter expression. Also note the following points:

  • An aggregate table may show fewer rows than the input data due to rollup. Rollup aggregates rows with the same timestamp and dimension values into a single row, as illustrated in the example after this list.
  • SQL-based ingestion jobs fail when all rows are filtered out.
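
As an illustration of rollup, consider a hypothetical aggregate table that defines a count measure. The following input data contains two rows with the same timestamp and item_id:

{"ts": "2023-06-16 00:00:00", "item_id": 123}
{"ts": "2023-06-16 00:00:00", "item_id": 123}
{"ts": "2023-06-16 01:00:00", "item_id": 234}

With rollup, Polaris stores only two rows: one for item_id 123 with a count of 2, and one for item_id 234 with a count of 1.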

To view row ingestion metrics and any parse exceptions from your ingestion job, see Job health reference.

Data formats

Incorrectly formatted data can lead to errors when Polaris attempts to sample and ingest data. Supported data and file formats lists the supported source data formats as well as file compression formats.

Polaris automatically detects the data format or compression format based on the file extension. If you specify a format that does not match the automatically detected type, Polaris attempts to ingest based on the value you specify.

JSON data

Error: Polaris reports the following errors for incorrectly formatted data:

  • In the UI, Polaris displays the error, "No rows could be sampled (there are unparseable rows)".
  • In the API, Polaris creates the job successfully with an HTTP 201 Created status code, but the job fails with an error similar to the following example.

The following job logs response shows an ingestion job that was created successfully but failed with an error due to incorrectly formatted JSON.

{
  "logs": [
    {
      "timestamp": "2023-07-19T21:10:42.635Z",
      "healthStatus": "error",
      "code": "InsertTimeNull",
      "message": "Encountered a null timestamp in the __time field during INSERT or REPLACE."
    },
    {
      "timestamp": "2023-07-19T21:10:42.635Z",
      "healthStatus": "warn",
      "code": "CannotParseExternalData",
      "message": "A worker task could not parse data from an external datasource."
    }
  ]
}

Remediation action: Verify that the data is proper newline-delimited JSON: a collection of JSON objects separated by newlines, not a single large JSON object or array.

More info: The JSON format that Polaris supports is newline-delimited JSON. In newline-delimited JSON, each line of the source data must be a completely valid JSON object, separated by newline characters and not commas. Each object must be on its own line and not split across multiple lines.

The following example is valid newline-delimited JSON:

{"ts": "2023-06-16 00:00:00", "item_id": 123}
{"ts": "2023-06-16 01:00:00", "item_id": 234}

You cannot ingest array-based JSON data, that is, a JSON array of comma-separated objects. The following example is unsupported JSON for ingestion:

[
  {
    "ts": "2023-06-16 00:00:00",
    "item_id": 123
  },
  {
    "ts": "2023-06-16 01:00:00",
    "item_id": 234
  }
]
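
If your source data is array-based, one way to convert it to newline-delimited JSON before ingestion is the jq command-line tool (the file names here are illustrative). The -c flag prints each array element as a compact, single-line JSON object:

jq -c '.[]' array.json > data.ndjson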

Data replacement time interval

Error: A SQL-defined job to replace data fails to start with an error similar to the following:

OVERWRITE WHERE clause identified interval [2020-01-01T00:00:00.000Z/2020-01-31T00:00:00.001Z] which is not aligned with PARTITIONED BY granularity [{type=period, period=P1M, timeZone=UTC, origin=null}]

Remediation action: Review your OVERWRITE WHERE time interval and ensure that the boundaries align with time partitioning. In most cases, time intervals should use < (less than) for the upper time boundary rather than <= (less than or equal to).

More info: Consider a replacement data example for which time partitioning is set to MONTH. Polaris evaluates the following OVERWRITE WHERE time intervals accordingly:

  • The following interval raises an error since the borders do not match the time partitioning:

    "__time" >= TIMESTAMP '2023-11-01' AND
    "__time" <= TIMESTAMP '2023-11-30'

    The interval translates to 2023-11-01T00:00:00.000Z/2023-11-30T00:00:00.001Z. Note that the upper border does not include the full final day, so the interval falls short of a full month.

  • The following interval aligns correctly with the time partitioning:

    "__time" >= TIMESTAMP '2023-11-01' AND
    "__time" < TIMESTAMP '2023-12-01'

    The interval correctly translates to 2023-11-01T00:00:00.000Z/2023-12-01T00:00:00.000Z.
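
Putting this together, a minimal sketch of a replacement query aligned with MONTH partitioning might look like the following. The table and column names are illustrative; see the SQL ingestion reference for the exact syntax that Polaris supports.

REPLACE INTO "events"
OVERWRITE WHERE
  "__time" >= TIMESTAMP '2023-11-01'
  AND "__time" < TIMESTAMP '2023-12-01'
SELECT "__time", "item_id"
FROM "events_staging"
WHERE "__time" >= TIMESTAMP '2023-11-01'
  AND "__time" < TIMESTAMP '2023-12-01'
PARTITIONED BY MONTH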

For more information, see SQL ingestion reference.

Event data not ingested

Unexpected behavior: A streaming ingestion job does not ingest any events.

Remediation action: Verify that the reading offset for the topic or stream has not expired.

More info: A streaming ingestion job may fail to ingest when the reading offset for the topic or stream has expired. This issue applies to a given table name and topic name, regardless of connection name. Possible causes include the following:

  • The topic was deleted and recreated, so that the reading offset restarted.
  • Data that was previously read from the topic expired or was deleted.

Schema registry with data streams

Error: A streaming ingestion job is running but the job logs from the job details page or API list row processing errors due to failure to fetch the schema.

Remediation action: To address this error, verify the following:

  • The ingestion job lists the correct Schema Registry connection in source.formatSettings.parseSchemaProvider.connectionName.
  • The ingestion job lists the correct streaming connection in source.connectionName.
  • The Schema Registry connection has the correct source URL.
  • The Schema Registry connection has valid authentication credentials.
  • The schema associated with the message exists and matches the data sent.

More info: The job logs may show an error such as the following:

{
  "logs": [
    {
      "timestamp": "2023-07-21T23:26:16.553Z",
      "healthStatus": "error",
      "code": "CannotProcessRow",
      "message": "Failed to fetch Avro schema id[100006] from registry. Check if the schema exists in the registry. Otherwise it could mean that there is malformed data in the stream or data that doesn't conform to the schema specified."
    }
  ]
}

In this example, 100006 is the schema ID associated with the message sent to the Kafka topic. If your ingestion job spec, streaming connection details, and Schema Registry connection details are correct, verify that the schema still exists in the registry and matches your data. One possible cause is that the schema associated with the message was deleted, leaving messages that Polaris cannot decode.
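
If your registry is Confluent Schema Registry, you can check whether a schema ID still exists through its REST API. The registry URL here is a placeholder:

curl https://schema-registry.example.com/schemas/ids/100006

If the registry returns an error instead of the schema definition, the schema was deleted or never registered, and messages referencing that ID cannot be decoded.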

Request body structure

Error: The ingestion job fails with an HTTP 400 Bad Request status code with the response code "BadArgument".

Remediation action: Verify the request body of the POST request in the following aspects:

  • The request body is correctly formatted JSON.
  • All required properties are included.
  • Each property contains the correct type of JSON value (string, number, array, object).
  • Each enum property is assigned an allowable value.
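
For reference, the following is a simplified sketch of a batch ingestion request body. The table name and file name are illustrative, and the sketch omits mappings and other optional properties; see the Jobs API documentation for the authoritative schema.

{
  "type": "batch",
  "target": {
    "type": "table",
    "tableName": "example_table"
  },
  "source": {
    "type": "uploaded",
    "fileList": ["data.ndjson"],
    "formatSettings": {
      "format": "nd-json"
    }
  }
}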

More info: The following sections provide more information on possible errors for the BadArgument code.

Incorrectly formatted JSON

Ensure the request body is properly formatted JSON and does not have missing or extra commas, curly braces, or square brackets. Polaris highlights the object with the incorrect syntax. For example, consider the case when source.formatSettings has an extra comma in the request body:

        "formatSettings": {
"format": "nd-json",
}

Polaris returns the following response body:

{
  "error": {
    "code": "BadArgument",
    "message": "The request contained invalid JSON: Unexpected character ('}' (code 125)): was expecting double-quote to start field name",
    "target": "source.formatSettings"
  }
}
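
Removing the trailing comma makes the object valid:

"formatSettings": {
  "format": "nd-json"
}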

Missing required properties

If you did not include a required property, Polaris includes the missing property in the response body. For example, if you did not supply target.tableName, Polaris returns the following response:

{
  "error": {
    "code": "BadArgument",
    "message": "createJob.arg1.target.tableName: must not be null",
    "details": [
      {
        "code": "BadArgument",
        "message": "must not be null"
      }
    ]
  }
}

See the Jobs API documentation for a description of all required parameters.

Incorrect JSON type

You may receive the BadArgument error when a property is not supplied with the correct JSON type. For example, the intervals property takes an array of strings, so the following JSON snippet raises an error:

"intervals": "2000-01-03/2020-01-20"

The following example shows the correct syntax for this property:

"intervals": ["2000-01-03/2020-01-20"]

Unsupported enum value

Also ensure that you supply an allowable value for enum properties. For example, the data format set in source.formatSettings.format accepts nd-json and not json. The response body shows the known types accepted in this property:

{
  "error": {
    "code": "BadArgument",
    "message": "The request contained invalid JSON: Could not resolve type id 'json' as a subtype of `io.imply.services.tables.gen.dtos.DataFormatSettings`: known type ids = [DataFormatSettings, avro_ocf, avro_stream, csv, kafka, nd-json, orc, parquet, protobuf] (for POJO property 'formatSettings')",
    "target": "source.formatSettings"
  }
}

Function arguments

Unexpected behavior: The ingestion job succeeds but the ingested data shows null values or does not match the expected outcome of the input expression.

Remediation action: Check that function arguments used in input expressions have the correct syntax.

More info: For example, the following input expression using JSON_VALUE contains improper syntax for the second parameter. The ingestion job may succeed, but Polaris populates the column with null values.

JSON_VALUE("campaign",'terms')

The following example shows correct usage of the function:

JSON_VALUE("campaign",'$.terms')

HTTP content header

Error: The ingestion job fails with the HTTP 415 Unsupported Media Type status code.

Remediation action: Set the HTTP header Content-Type to application/json.

More info: The request body for creating an ingestion job by API is a JSON object that you send with your request. Your API client should automatically set the correct Content-Type header. However, if you receive this error, set this header to application/json. For example, in a curl request, include the command-line option -H 'Content-Type: application/json'.
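
For example, the following curl request creates an ingestion job with the header set. The endpoint and credentials are placeholders; substitute the values for your organization and project as described in the Jobs API documentation:

curl -X POST "https://ORGANIZATION_NAME.REGION.CLOUD_PROVIDER.api.imply.io/v1/projects/PROJECT_ID/jobs" \
  -H "Authorization: Basic $POLARIS_API_KEY" \
  -H "Content-Type: application/json" \
  --data @request-body.json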

Contact Polaris support

For additional assistance, contact Polaris customer support.