Monitoring

To monitor an Imply cluster, you can use the following monitoring solutions:

- Clarity, a visual Application Performance Monitoring (APM) tool that lets you identify and solve performance issues of an Imply cluster.
- Status APIs, endpoints served by Apache Druid that return system state representing the health and performance of the cluster.
In cloud deployments, you can further enable monitoring of the metadata store (Amazon RDS, by default) using CloudWatch. See Monitoring the metadata store with CloudWatch for more information.

Clarity

On Imply Hybrid (formerly Imply Cloud), you can access Clarity from the Imply Manager console. To do so, click Monitor from the left menu of the cluster overview page.

[Image: accessing Clarity]

When deploying Imply Enterprise (formerly Imply Private) on-prem or to your own cloud provider account, you can use Clarity in one of the following ways:

- (Recommended) SaaS hosted by Imply. Access to SaaS Clarity is included with your Imply subscription, at clarity.imply.io.
- A local instance that you set up and host yourself.

Because Clarity is bundled with Pivot, it does not require additional installation when running locally. However, you need to perform a few extra configuration steps to enable Clarity, as described next.

If you are using Imply-hosted Clarity, you need to enable metric emission on your Druid cluster. For more information, see Clarity emitter configurations.

Navigating the Clarity UI

By default, Clarity opens in the Visuals pane:

[Image: Clarity home]

From the Clarity home page, you can access various views, including:

- Queries views:
  - Broker Queries shows telemetry from your query servers and gives you an overview of query performance, including average, 98th percentile, and maximum query times. You can break down performance by data source, server, properties of the query (for example, query type or number of dimensions queried), and even individual query IDs.
  - Historical Queries lets you drill into the performance of the individual data servers involved in each of your queries. A single Druid query accesses many data nodes, each handling a different slice of your data, and this view shows how that fan-out affects query performance. Like the Broker Queries view, it can be broken down by data source, server, properties of the query (for example, query type or number of dimensions queried), and individual query IDs.
- Ingestion shows telemetry from real-time ingestion.
- Server shows information about JVM memory allocation, including garbage collection.
- Exceptions shows error conditions reported by your servers.

Clarity alerts

It is a good practice to open the Clarity UI regularly to inspect the performance of your Imply cluster. In addition, by configuring alerts, you can have Clarity notify you when a condition is met. You can configure conditions to evaluate query times, exception counts, and more.

You can configure alerts from the Alerts tab:

[Image: Clarity alerts]

Clarity alerts are configured in the same way as other Pivot alerts. For more information, see Pivot Alerts.

Clarity emitter configurations

You can control how Druid emits metrics by adding the following properties to common.runtime.properties, the Druid properties file, on the cluster that emits the metrics.

All properties must start with the druid.emitter.clarity. prefix followed by the field name. For example: druid.emitter.clarity.recipientBaseUrl and druid.emitter.clarity.basicAuthentication.
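
As a minimal sketch of the naming rule above, the following Python helper prepends the prefix to each field name and renders lines that could be pasted into common.runtime.properties. The helper name `render_clarity_properties` and every sample value (collector host, port, credentials, cluster name) are placeholders chosen for illustration, not values supplied by Imply.

```python
CLARITY_PREFIX = "druid.emitter.clarity."


def render_clarity_properties(fields):
    """Return properties-file lines for the given Clarity emitter fields.

    `fields` maps bare field names (for example "clusterName") to values.
    Booleans are written in lowercase, matching the true/false values
    used in the field table.
    """
    lines = []
    for name, value in fields.items():
        if isinstance(value, bool):
            value = "true" if value else "false"
        lines.append(f"{CLARITY_PREFIX}{name}={value}")
    return lines


# Placeholder values for illustration only; substitute your own.
example_lines = render_clarity_properties({
    "recipientBaseUrl": "http://clarity-collector.example.com:8080/d/myuser",
    "basicAuthentication": "myuser:mypassword",
    "clusterName": "staging",
    "anonymous": True,
})
print("\n".join(example_lines))
```

Only `recipientBaseUrl` is marked as required in the table that follows; the other fields in the example are optional and shown only to illustrate the prefix.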

Field	Type	Description	Default	Required
`recipientBaseUrl`	String	The HTTP endpoint events are posted to. For example, `http://<clarity collector host>:<port>/d/<username>`.	Imply provided URL	yes
`basicAuthentication`	String	Basic auth credentials, typically `<username>:<password>`.	null	no
`clusterName`	String	Cluster name used to tag events.	null	no
`anonymous`	Boolean	If true, Clarity removes hostnames from events. Clarity also anonymizes the `identity` and `remoteAddress` event fields as well as the `implyUser` and `implyUserEmail` metric dimensions using a salted SHA-256 hash. This way, Clarity can distinguish multiple high-cost queries from a single user or many different users to help troubleshoot performance issues.	false	no
`maxBufferSize`	Integer	Maximum size of event buffer.	min(250 MB, 10% of heap)	no
`maxBatchSize`	Integer	Maximum size of HTTP event payload.	5 MB	no
`flushCount`	Integer	Number of events before a flush is triggered.	500	no
`flushBufferPercentFull`	Integer	Percentage of buffer fill that will trigger a flush (byte-based).	25	no
`flushMillis`	Integer	Period between flushes if not triggered by `flushCount` or `flushBufferPercentFull`.	60s	no
`flushTimeOut`	Integer	Flush timeout.	`Long.MAX_VALUE`	no
`timeOut`	ISO8601 Period	HTTP client response timeout.	PT1M	no
`batchingStrategy`	String [ARRAY, NEWLINES]	How events are batched together in the payload.	ARRAY	no
`compression`	String [NONE, LZ4, GZIP]	Compression algorithm used.	LZ4	no
`lz4BufferSize`	Integer	Block size for the LZ4 compressor in bytes.	65536	no
`samplingRate`	Integer	Percentage of sampled metrics to be emitted.	100	no
`sampledMetrics`	List	Sampled event types.	[`query/wait/time`, `query/segment/time`, `query/segmentAndCache/time`]	no
`sampledNodeTypes`	List	Sampled node types.	[`druid/historical`, `druid/peon`, `druid/realtime`]	no
`customQueryDimensions`	List	Context dimensions that will be extracted and emitted. Emitted context dimensions are prepended with the string `context:`.	[]	no
`metricsFactoryChoice`	Boolean	If true, registers Clarity's custom query metrics modules as options rather than defaults. This is useful when you need to define your own query metrics factories through your own extensions.	false	no
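
The flush-related fields in the table can be read as three independent triggers. The sketch below is one interpretation of the table descriptions, not the emitter's actual implementation. The function names are hypothetical, and it assumes that 1 MB means 1024 * 1024 bytes and that the `60s` default for `flushMillis` corresponds to 60,000 milliseconds.

```python
MB = 1024 * 1024  # assumption: the table's "MB" means mebibytes


def default_max_buffer_size(heap_bytes):
    """Default maxBufferSize from the table: min(250 MB, 10% of heap)."""
    return min(250 * MB, heap_bytes // 10)


def flush_trigger(event_count, buffered_bytes, millis_since_flush,
                  max_buffer_size, flush_count=500,
                  flush_buffer_percent_full=25, flush_millis=60_000):
    """Return the name of the field whose threshold is reached, or None.

    Defaults mirror the table: 500 events, 25 percent of the buffer,
    and 60 seconds between flushes.
    """
    if event_count >= flush_count:
        return "flushCount"
    # Byte-based fill check, written without floating point.
    if buffered_bytes * 100 >= max_buffer_size * flush_buffer_percent_full:
        return "flushBufferPercentFull"
    if millis_since_flush >= flush_millis:
        return "flushMillis"
    return None


# Example: a process with a 4 GiB heap that has buffered 120 events (2 MB)
# over the last 15 seconds.
buffer_size = default_max_buffer_size(4 * 1024 * MB)
print(buffer_size, flush_trigger(120, 2 * MB, 15_000, buffer_size))
```

Reading the settings this way suggests that lowering `flushCount` or `flushBufferPercentFull` makes flushes more frequent under load, while `flushMillis` bounds how long events wait when traffic is light.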

Druid Status APIs

Druid includes status APIs that return metrics that help you gauge the health of the system. The following APIs are especially useful for monitoring.

- Unavailable segments: On the Coordinator, check `/druid/coordinator/v1/loadstatus?simple` and verify that each datasource registers 0. The value is the number of unavailable segments. It may briefly be non-zero when new segments are added, but if it stays high for a prolonged period, it indicates a problem with segment availability in your cluster. In this case, check your data nodes to confirm that they are healthy, have spare disk space to load data, and have access to the S3 bucket where your data is stored.
- Data freshness: Run a `dataSourceMetadata` query to get the `maxIngestedEventTime` and verify that it's recent enough for your needs. For example, alert if it's more than a few minutes old. This is an inexpensive Druid query, since it only hits the most recent segments and only looks at the last row of data. In addition to verifying ingestion time, it also verifies that Druid is responsive to queries. If the value is older than you expect, it can indicate that real-time data is not being loaded properly. In this case, use the Imply Manager to verify that your data ingestion is healthy, that there have not been any errors loading data, and that you have enough capacity to load the amount of data that you're trying to load.
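
The two checks above could be scripted roughly as follows. This is a sketch under stated assumptions, not an official client: the function names are hypothetical, the query is assumed to be posted to the Broker's native query endpoint `/druid/v2/`, the `loadstatus?simple` response is assumed to be a JSON map of datasource name to segment count, and the `dataSourceMetadata` response is assumed to be a one-element list whose `result` holds `maxIngestedEventTime`. Authentication and TLS settings are omitted.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def fetch_json(url, payload=None):
    """GET `url`, or POST `payload` as JSON when given, and decode the reply."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def datasources_with_unavailable_segments(load_status):
    """Keep only the datasources whose unavailable-segment count is non-zero."""
    return {name: count for name, count in load_status.items() if count > 0}


def freshness_query(datasource):
    """Build the dataSourceMetadata query for one datasource."""
    return {"queryType": "dataSourceMetadata", "dataSource": datasource}


def ingestion_lag(query_result, now):
    """Return how far maxIngestedEventTime lags behind `now` (timezone-aware)."""
    if not query_result:
        raise ValueError("empty dataSourceMetadata result")
    raw = query_result[0]["result"]["maxIngestedEventTime"]
    # Older Python versions do not parse a trailing "Z", so normalize it.
    latest = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    return now - latest


def check_cluster(coordinator_url, broker_url, datasource,
                  max_lag=timedelta(minutes=5)):
    """Run both checks against a live cluster; return a list of problems."""
    problems = []
    load_status = fetch_json(
        f"{coordinator_url}/druid/coordinator/v1/loadstatus?simple")
    unavailable = datasources_with_unavailable_segments(load_status)
    for name, count in unavailable.items():
        problems.append(f"{name}: {count} unavailable segments")
    # Assumption: native queries are posted to /druid/v2/ on the Broker.
    result = fetch_json(f"{broker_url}/druid/v2/", freshness_query(datasource))
    lag = ingestion_lag(result, datetime.now(timezone.utc))
    if lag > max_lag:
        problems.append(f"{datasource}: newest event is {lag} old")
    return problems
```

A scheduler or alerting tool could call `check_cluster` with your own Coordinator and Broker URLs and raise an alert only when the same problem appears on several consecutive runs, since the segment count may briefly be non-zero while new segments load.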

See Druid API reference for more information.
