EMR setup

This document describes the general steps needed to support Hadoop batch indexing via Amazon Elastic MapReduce (EMR).

Set up your EMR cluster

If you do not already have an EMR cluster, create one through the AWS EMR console, making sure that Hadoop is included in the cluster's applications.

If you already have an EMR cluster, confirm that the Druid services in your Imply Cloud cluster have network access to the EMR master and core nodes.

Configure your Imply Cloud cluster

Typically, Druid clusters use the core-site.xml, hdfs-site.xml, mapred-site.xml, and yarn-site.xml files to handle configuration of the Hadoop client and to set job properties. Because these files are not present in Imply Cloud clusters, we will use alternate methods of passing the important Hadoop configurations to Druid.

Aside from using the *-site.xml files, there are two other common ways of providing Hadoop configurations:

- Adding hadoop.-prefixed properties to the Middle Manager's runtime.properties file.
- Setting job properties in the jobProperties section of the Hadoop indexing spec.

You can use either of the above methods to pass down your Hadoop configurations, with one exception: the hadoop.fs.defaultFS configuration must be set in the runtime.properties file, otherwise the indexing job will not be able to locate the final segments in HDFS after the MapReduce jobs have completed. You can use the configurations below as a starting point and modify them as appropriate.

Add the following properties to the Middle Manager's runtime.properties file:

hadoop.fs.defaultFS=hdfs://ip-xxx-xx-xx-xxx.ec2.internal:8020
hadoop.yarn.resourcemanager.hostname=ip-xxx-xx-xx-xxx.ec2.internal
hadoop.yarn.application.classpath=$HADOOP_CONF_DIR,$HADOOP_COMMON_HOME/*,$HADOOP_COMMON_HOME/lib/*,$HADOOP_HDFS_HOME/*,$HADOOP_HDFS_HOME/lib/*,$HADOOP_MAPRED_HOME/*,$HADOOP_MAPRED_HOME/lib/*,$HADOOP_YARN_HOME/*,$HADOOP_YARN_HOME/lib/*,/usr/lib/hadoop-lzo/lib/*,/usr/share/aws/emr/emrfs/conf,/usr/share/aws/emr/emrfs/lib/*,/usr/share/aws/emr/emrfs/auxlib/*,/usr/share/aws/emr/lib/*,/usr/share/aws/emr/ddb/lib/emr-ddb-hadoop.jar,/usr/share/aws/emr/goodies/lib/emr-hadoop-goodies.jar,/usr/share/aws/emr/kinesis/lib/emr-kinesis-hadoop.jar,/usr/share/aws/emr/cloudwatch-sink/lib/*
hadoop.mapreduce.framework.name=yarn
hadoop.fs.s3n.awsAccessKeyId={accessKey}
hadoop.fs.s3n.awsSecretAccessKey={secretAccessKey}
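
If your EMR cluster has many non-default settings, copying them over by hand is error-prone. The following is a minimal sketch of a hypothetical helper (not part of Imply Cloud or Druid) that reads a Hadoop *-site.xml file from the EMR master and prints each property with the hadoop. prefix used above; the /etc/hadoop/conf path is an assumption about where EMR places client configs.

import sys
import xml.etree.ElementTree as ET

def site_xml_to_runtime_properties(path):
    """Convert a Hadoop *-site.xml file into hadoop.-prefixed
    lines suitable for the Middle Manager's runtime.properties."""
    tree = ET.parse(path)
    for prop in tree.getroot().iter("property"):
        name = prop.findtext("name")
        value = prop.findtext("value", default="")
        if name:
            print("hadoop.{}={}".format(name, value))

if __name__ == "__main__":
    # Default path is an assumption; on EMR the client configs
    # usually live under /etc/hadoop/conf on the master node.
    for path in sys.argv[1:] or ["/etc/hadoop/conf/core-site.xml"]:
        site_xml_to_runtime_properties(path)

Review the output before pasting it into runtime.properties; you typically only need the connection-related settings shown above, not every property defined on the cluster.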

Submit a Hadoop indexing job

As previously noted, you can specify additional Hadoop configurations in the jobProperties section of the indexing spec. Here is a sample spec that you can use as a template.

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "s3n://{bucket}/{path}"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "maxRowsInMemory": 500000,
      "buildV9Directly": true,
      "jobProperties": {
        "mapreduce.job.user.classpath.first": "true",
        "mapreduce.map.memory.mb": 1536,
        "mapreduce.map.java.opts": "-server -Xms1152m -Xmx1152m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.reduce.memory.mb": 6144,
        "mapreduce.reduce.java.opts": "-server -Xms2688m -Xmx2688m -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+PrintGCDetails -XX:+PrintGCTimeStamps",
        "mapreduce.task.timeout": 1800000
      }
    },
    "dataSchema": {
      "dataSource": "MY_DATASOURCE",
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "DAY",
        "queryGranularity": "NONE",
        "intervals": [
          "MY_INTERVAL_EXAMPLE_2017-01-01/2018-01-01"
        ]
      },
      "parser" : {
          "type" : "hadoopyString",
           "parseSpec":{
               "format" : "json",
               "timestampSpec" : {
                   "column" : "time",
                   "format" : "auto"
               },
               "dimensionsSpec" : {
                   "dimensions" : []
               }
           }
      },
      "metricsSpec": [
        {
          "name": "count",
          "type": "count"
        }
      ]
    }
  }
}

Click the Send button to post the indexing task to Imply Cloud.
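
If you prefer to submit the task outside the UI, you can also POST the spec directly to the Druid Overlord's task endpoint. The sketch below assumes the Overlord is reachable at http://{overlord-host}:8090 ({overlord-host} is a placeholder, and 8090 is Druid's default Overlord port; adjust both for your deployment and add authentication if your cluster requires it).

import json
import urllib.request

# Placeholder host; replace with your Overlord address.
OVERLORD_URL = "http://{overlord-host}:8090/druid/indexer/v1/task"

def submit_task(spec_path):
    """POST a Hadoop indexing spec to the Druid Overlord and
    return the task ID from the response."""
    with open(spec_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        OVERLORD_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["task"]

if __name__ == "__main__":
    # The spec filename is an example; use wherever you saved
    # the JSON spec shown above.
    print(submit_task("hadoop-index-spec.json"))

Once submitted, you can poll /druid/indexer/v1/task/{taskId}/status on the Overlord to track the job's progress.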
