ORC extension (early access)

Imply 2.9 includes a revamped Druid ORC extension, which replaces the contrib ORC extension that is part of Apache Druid 0.14.x. Starting with Druid 0.15.0, the revamped extension will be part of the community Apache Druid release as well. For full details on differences between the old and new ORC extension, please refer to the Imply release notes.

ORC Hadoop Parser

The inputFormat of inputSpec in ioConfig must be set to "org.apache.orc.mapreduce.OrcInputFormat".

Field Type Description Required
type String This should say orc yes
parseSpec JSON Object Specifies the timestamp and dimensions of the data (timeAndDims and orc format) and a flattenSpec (orc format) yes

The parser supports two parseSpec formats: orc and timeAndDims.

orc supports auto field discovery and flattening, if specified with a flattenSpec. If no flattenSpec is specified, useFieldDiscovery will be enabled by default. Specifying a dimensionSpec is optional if useFieldDiscovery is enabled: if a dimensionSpec is supplied, the list of dimensions it defines will be the set of ingested dimensions, if missing the discovered fields will make up the list.

timeAndDims parse spec must specify which fields will be extracted as dimensions through the dimensionSpec.

All column types are supported, with the exception of union types. Columns of list type, if filled with primitives, may be used as a multi-value dimension, or specific elements can be extracted with flattenSpec expressions. Likewise, primitive fields may be extracted from map and struct types in the same manner. Auto field discovery will automatically create a string dimension for every (non-timestamp) primitive or list of primitives, as well as any flatten expressions defined in the flattenSpec.

Hadoop Job Properties

Like most Hadoop jobs, the best outcomes will add "mapreduce.job.user.classpath.first": "true" or "mapreduce.job.classloader": "true" to the jobProperties section of tuningConfig. Note that it is likely if using "mapreduce.job.classloader": "true" that you will need to set mapreduce.job.classloader.system.classes to include -org.apache.hadoop.hive. to instruct Hadoop to load org.apache.hadoop.hive classes from the application jars instead of system jars, e.g.

...
    "mapreduce.job.classloader": "true",
    "mapreduce.job.classloader.system.classes" : "java., javax.accessibility., javax.activation., javax.activity., javax.annotation., javax.annotation.processing., javax.crypto., javax.imageio., javax.jws., javax.lang.model., -javax.management.j2ee., javax.management., javax.naming., javax.net., javax.print., javax.rmi., javax.script., -javax.security.auth.message., javax.security.auth., javax.security.cert., javax.security.sasl., javax.sound., javax.sql., javax.swing., javax.tools., javax.transaction., -javax.xml.registry., -javax.xml.rpc., javax.xml., org.w3c.dom., org.xml.sax., org.apache.commons.logging., org.apache.log4j., -org.apache.hadoop.hbase., -org.apache.hadoop.hive., org.apache.hadoop., core-default.xml, hdfs-default.xml, mapred-default.xml, yarn-default.xml",
...

This is due to the hive-storage-api dependency of the orc-mapreduce library, which provides some classes under the org.apache.hadoop.hive package. If instead using the setting "mapreduce.job.user.classpath.first": "true", then this will not be an issue.

Examples

orc parser, orc parseSpec, auto field discovery, flatten expressions

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "useFieldDiscovery": true,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

orc parser, orc parseSpec, field discovery with no flattenSpec or dimensionSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

orc parser, orc parseSpec, no autodiscovery

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "orc",
          "flattenSpec": {
            "useFieldDiscovery": false,
            "fields": [
              {
                "type": "path",
                "name": "nestedDim",
                "expr": "$.nestedData.dim1"
              },
              {
                "type": "path",
                "name": "listDimFirstItem",
                "expr": "$.listDim[1]"
              }
            ]
          },
          "timestampSpec": {
            "column": "timestamp",
            "format": "millis"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim3",
              "nestedDim",
              "listDimFirstItem"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
    }
  }
}

orc parser, timeAndDims parseSpec

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "inputFormat": "org.apache.orc.mapreduce.OrcInputFormat",
        "paths": "path/to/file.orc"
      },
      ...
    },
    "dataSchema": {
      "dataSource": "example",
      "parser": {
        "type": "orc",
        "parseSpec": {
          "format": "timeAndDims",
          "timestampSpec": {
            "column": "timestamp",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": [
              "dim1",
              "dim2",
              "dim3",
              "listDim"
            ],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      ...
    },
    "tuningConfig": <hadoop-tuning-config>
  }
}

Migration from 'contrib' extension

This extension, first available in version 0.15.0, replaces the previous 'contrib' extension which was available until 0.14.0-incubating. While this extension can index any data the 'contrib' extension could, the json spec for the ingestion task is incompatible, and will need modified to work with the newer 'core' extension.

To migrate to 0.15.0+:

Overview

Deploy

Manage Data

Query Data

Visualize

Configure

Misc