Monitoring

To monitor an Imply cluster, you can use Clarity and the Druid status APIs, described in the following sections.

Clarity

When deploying Imply on-prem, there are two ways you can use Clarity:

Clarity is bundled with Pivot, so if you run Clarity locally (self-hosted), no extra installation is required. However, you need to perform a few extra configuration steps to enable Clarity, as described next.

If you are using Imply-hosted Clarity, you only need to enable metric emission on your Druid cluster, as described in step 2 below.

Configuring self-hosted Clarity

Under the covers, Clarity uses Druid to store metrics. For a production installation, you should install a separate Druid cluster (the collection cluster) to receive performance data from the monitored Druid cluster (the monitored cluster).

In evaluation settings, it is possible to have a single cluster act as both the monitored and collection cluster. However, in a production setting this is strongly discouraged; the collection cluster should run independently from the cluster being monitored so that monitoring functions, such as alerting, continue working if the availability or performance of the production cluster is degraded. Separating the clusters also prevents Clarity operations from impacting production cluster performance.

Similarly, running Pivot from the secondary (collection) Imply instance provides the same advantage.

Enabling Clarity on-prem involves these steps:

  1. Set up a metrics collection cluster.
  2. Configure your monitored cluster to emit metrics to Kafka.
  3. Configure the Kafka topic to which the monitored cluster emits metrics as a data source on your metrics collection cluster.
  4. Enable the embedded Clarity UI in your Pivot configuration.

Clarity architecture

Step 1: Set up a metrics collection cluster

You can skip this step if you plan to use the same cluster for metrics emitting and metrics collection. However, note that in production, you should use separate clusters.

The metrics collection cluster can be a single machine, as illustrated by the quickstart config, or multiple machines. A multi-machine cluster scales better to larger volumes of metrics and longer retention.

To minimize sizing requirements of the metrics cluster, use load and drop rules to set a retention window on your data.
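
For example, assuming one month of metrics history is enough for your needs, you could post retention rules like the following to the Coordinator of the collection cluster; the datasource name matches the druid-metrics topic configured below, and the period and replication values are illustrative:

curl -XPOST -H'Content-Type: application/json' \
  -d'[{"type":"loadByPeriod","period":"P1M","tieredReplicants":{"_default_tier":1}},{"type":"dropForever"}]' \
  http://<coordinator_address>:8081/druid/coordinator/v1/rules/druid-metrics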

Most metrics are query telemetry events, which are emitted once per query per segment. Since it's common for clusters to have thousands of segments or more, there can be quite a lot of these events! If you have high query concurrency and wish to limit the amount of telemetry emitted, use the druid.emitter.clarity.samplingRate property documented in the table below. This property should be set on the metrics emitting cluster, not the metrics collection cluster.
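
For example, to emit roughly a quarter of the sampled telemetry events, you might set the following on the emitting cluster; the value is illustrative, so tune it to your query volume:

# In common.runtime.properties on the metrics-emitting cluster
druid.emitter.clarity.samplingRate=25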

For a large metrics cluster, you will need to increase taskCount in your Kafka supervisor spec; this value controls the degree of parallelism used to process metrics. You may need to increase the number of data servers as well. Also ensure that the druid-histogram extension is in the druid.extensions.loadList in the druid/_common/common.runtime.properties config file; this extension is used to compute 98th percentile latency metrics.
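
For example, the relevant fragment of the downloaded supervisor spec might look like the following sketch, which runs three ingestion tasks in parallel; treat the values as illustrative and keep the rest of the spec unchanged:

"ioConfig": {
  "topic": "druid-metrics",
  "consumerProperties": { "bootstrap.servers": "kafka1.example.com:9092" },
  "taskCount": 3
}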

Step 2: Enable the metric emitter on the monitored cluster

For every cluster that you want to monitor, configure the Clarity emitter by following these steps:

  1. Ensure that the clarity-emitter-kafka extension is in the druid.extensions.loadList in the druid/_common/common.runtime.properties file for the emitting cluster.

  2. Remove or comment out existing druid.emitter and druid.emitter.* configs in druid/_common/common.runtime.properties and replace them with the following:

    druid.emitter=clarity-kafka
    druid.emitter.clarity.topic=druid-metrics
    druid.emitter.clarity.producer.bootstrap.servers=kafka1.example.com:9092
    druid.emitter.clarity.clusterName=clarity-collection-cluster
  3. Replace kafka1.example.com:9092 with a comma-delimited list of Kafka brokers in your environment.

  4. The "clarity-collection-cluster" string can be anything you want, but it is intended to be used to help Clarity users tell different clusters apart in the Clarity UI.

The Clarity emitter writes to the druid-metrics topic. Start up Druid and verify that druid-metrics shows up as a datasource in the collection cluster (it appears once you configure ingestion in the next step).
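
If the standard Kafka command-line tools are available, you can also confirm that metric events are arriving on the topic before you configure ingestion, for example:

bin/kafka-console-consumer.sh --bootstrap-server kafka1.example.com:9092 \
  --topic druid-metrics --max-messages 5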

druid-metrics datasource

Step 3: Configure Kafka ingestion on your metrics collection cluster

Ensure that the Druid Kafka indexing service extension is loaded on the metrics collection cluster. See extensions for information on loading Druid extensions.
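
For example, the load list on the collection cluster might include both the Kafka indexing service extension and the druid-histogram extension mentioned in step 1 (your list will likely contain additional extensions):

druid.extensions.loadList=["druid-kafka-indexing-service", "druid-histogram"]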

Download the Clarity Kafka supervisor spec from https://static.imply.io/support/clarity-kafka-supervisor.json. Apply the spec by running the following command from the directory to which you downloaded the spec:

curl -XPOST -H'Content-Type: application/json' -d@clarity-kafka-supervisor.json http://<overlord_address>:8090/druid/indexer/v1/supervisor

Replace overlord_address with the IP address of the machine running the overlord process in your Imply cluster. This is typically the Master server in the Druid cluster.
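
To confirm that the supervisor was created, you can list the running supervisors on the same Overlord; the response should include an entry for the metrics datasource:

curl http://<overlord_address>:8090/druid/indexer/v1/supervisor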

Step 4: Configure Clarity-specific settings

Keep in mind that Clarity maintains its own connection to the Druid collection cluster, separate from Pivot's connection to Druid. Accordingly, you need to configure this connection separately.

Add the following minimum configuration settings to your Pivot configuration file. You can find the Pivot configuration file at conf/pivot/config.yaml, or at conf-quickstart/pivot/config.yaml for a quickstart instance, in your Imply installation home.

# Specify the metrics cluster to connect to
metricsCluster:
  host: localhost:8082 # Enter the IP of your metrics collecting broker node here

# Enter the name of your clarity data source
metricsDataSource: druid-metrics

# Instead of relying on auto-detection you can explicitly specify which clusters should be available from the cluster dropdown
clientClusters: ["default"]

# If your metrics data source does not have a histogram (approxHistogram) metric column then take it out of the UI by suppressing it
#suppressQuantiles: true

# If your metrics data source does have a histogram you can specify a tuning config here
#quantileTuning: "resolution=40"

As noted in the comment, replace localhost in the metricsCluster configuration with the address of the Broker on your metrics collection cluster.

You need to provide at least one cluster name in the clientClusters parameter, or Pivot may fail to start up. The name should match the value of druid.emitter.clarity.clusterName in the emitting cluster's common.runtime.properties configuration file.
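
For example, using the cluster name from step 2, the two settings line up as follows (they live in different files on different clusters):

# Emitting cluster: druid/_common/common.runtime.properties
druid.emitter.clarity.clusterName=clarity-collection-cluster

# Collection side: Pivot config.yaml
clientClusters: ["clarity-collection-cluster"]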

Depending on the configuration of the Druid collection cluster, you may need additional settings. For example, if authentication is enabled in Druid, add the defaultDbAuthToken property with the auth type, username, and password to the metricsCluster configuration, as follows:

metricsCluster:
  host: <broker_host>:<broker_port>
  ...
  defaultDbAuthToken:
    type: 'basic-auth'
    username: <auth_user_name>
    password: <auth_password>

If TLS is enabled, add the protocol property and provide the certificate information to the metricsCluster configuration:

metricsCluster:
  host: <MetricClusterBrokerHost>:<BrokerPort>
  protocol: tls
  ca: <certificate>

For a self-signed certificate, you can use tls-loose as the protocol:

metricsCluster:
  host: <MetricClusterBrokerHost>:<BrokerPort>
  protocol: tls-loose

Likewise, any connection parameter available for connecting Pivot to Druid can also be used in the metricsCluster configuration to connect Clarity to the metrics collection cluster. See metricsCluster connection optional parameters below for more information on those settings.

Access Clarity

If Pivot is running, restart it for the configuration changes to take effect. After restarting Pivot, you can open Clarity at the following address:

http://<pivot_address>:9095/clarity

Pivot users need the AccessClarity permission to access the Clarity UI. Of the built-in roles, only Super Admins have this permission, so assign it to other users and roles as appropriate for your system.

Navigating the Clarity UI

By default, Clarity opens in the Visuals pane:

Clarity home

From the Clarity home page, you can access various views.

Clarity alerts

It's a good practice to open the Clarity UI regularly to inspect the performance of your Imply cluster. In addition, by configuring alerts, you can have Clarity notify you when a condition is met. You can configure conditions to evaluate query times, exception counts, and more.

You can configure alerts from the Alerts tab:

Clarity alerts

Clarity alerts are configured in the same way as other Pivot alerts. For more information, see Pivot Alerts.

Clarity emitter configurations

You can control the way Druid emits metrics by adding the following properties to the Druid properties file, common.runtime.properties, of the metrics emitting cluster.

Add druid.emitter.clarity. as a prefix to the field names shown, for example, druid.emitter.clarity.topic and druid.emitter.clarity.producer.bootstrap.servers.
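
For example, to tag events with a cluster name, scrub hostnames, and pass an arbitrary Kafka producer property through the producer.* prefix, you might add the following (the values are illustrative):

druid.emitter.clarity.clusterName=clarity-collection-cluster
druid.emitter.clarity.anonymous=true
druid.emitter.clarity.producer.compression.type=snappy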

Field | Type | Description | Default | Required
----- | ---- | ----------- | ------- | --------
topic | String | Kafka topic that events will be sent to, for example druid-metrics | [required] | yes
producer.bootstrap.servers | String | Kafka "bootstrap.servers" configuration (a list of brokers) | [required] | yes
producer.* | String | Can be used to specify any other Kafka producer property | empty | no
clusterName | String | Cluster name used to tag events | null | no
anonymous | Boolean | Should hostnames be scrubbed from events? | false | no
maxBufferSize | Integer | Maximum size of the event buffer | min(250MB, 10% of heap) | no
samplingRate | Integer | For sampled metrics, the percentage of metrics that will be emitted | 100 | no
sampledMetrics | List | Which event types are sampled | ["query/wait/time", "query/segment/time", "query/segmentAndCache/time"] | no
sampledNodeTypes | List | Which node types are sampled | ["druid/historical", "druid/peon", "druid/realtime"] | no

metricsCluster connection optional parameters

Step 4 above describes the basic connection settings to connect Clarity to the metrics collection cluster.

The following connection settings are optional, or are required only when the configuration of your metrics collection Druid cluster calls for them.

These settings are equivalent to the corresponding settings in the Pivot configuration, but they are configured separately for Clarity.

Field | Description
----- | -----------
timeout | The timeout for metric queries. Default is 40000.
protocol | The connection protocol: plain (the default), tls-loose, or tls. If tls, also specify ca, cert, key, and passphrase.
ca | If connecting via TLS, a trusted certificate of the certificate authority, needed when using self-signed certificates. Should be PEM-formatted text.
cert | If connecting via TLS, the client-side certificate to present. Should be PEM-formatted text.
key | If connecting via TLS, the private key. Should be PEM-formatted text.
passphrase | If connecting via TLS, a passphrase for the private key, if needed.
defaultDbAuthToken | If Druid authentication is enabled, the default token used to authenticate against this connection.
socksHost | If Clarity needs to connect to Druid via a SOCKS5 proxy, the hostname of the proxy host.
socksUsername | The user for the SOCKS proxy, if needed.
socksPassword | The password for proxy authentication, if needed.
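
As an illustration, a metricsCluster block combining several of these optional settings might look like the following; the placeholders and values are examples, not defaults:

metricsCluster:
  host: <MetricClusterBrokerHost>:<BrokerPort>
  timeout: 60000
  protocol: tls
  ca: <certificate>
  defaultDbAuthToken:
    type: 'basic-auth'
    username: <auth_user_name>
    password: <auth_password>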

Druid Status APIs

Druid includes status APIs that return metrics that help you gauge the health of the system. The following APIs are especially useful for monitoring.

  1. Unavailable segments: On the Coordinator, check /druid/coordinator/v1/loadstatus?simple and verify that each datasource reports 0 (see the example requests after this list). This is the number of unavailable segments. It may briefly be non-zero when new segments are added, but if this value stays high for a prolonged period of time, it indicates a problem with segment availability in your cluster. In this case, check your data nodes to confirm that they are healthy, have spare disk space to load data, and have access to deep storage, such as the S3 bucket where your data is stored.

  2. Data freshness: Run a "dataSourceMetadata" query to get the "maxIngestedEventTime" and verify that it's recent enough for your needs. For example, alert if it's more than a few minutes old. This is an inexpensive Druid query, since it only hits the most recent segments and it only looks at the last row of data. In addition to verifying ingestion time, this also verifies that Druid is responsive to queries. If this value is staler than you expect, it can indicate that real-time data is not being loaded properly. In this case, use the Imply Manager to verify that your data ingestion is healthy, that there have not been any errors loading data, and that you have enough capacity to load the amount of data that you're trying to load.
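
For example, both checks can be run against the standard Druid APIs; the ports shown are the defaults, and <your_datasource> is a placeholder for the datasource you want to check:

# Unavailable segments: each datasource should report 0
curl "http://<coordinator_address>:8081/druid/coordinator/v1/loadstatus?simple"

# Data freshness: returns maxIngestedEventTime for the datasource
curl -XPOST -H'Content-Type: application/json' \
  -d'{"queryType":"dataSourceMetadata","dataSource":"<your_datasource>"}' \
  http://<broker_address>:8082/druid/v2/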

See Druid API reference for more information.
