There are two main ways of understanding and monitoring an Imply cluster: analyzing performance telemetry with Clarity and monitoring high-level status via our APIs.
Your Imply subscription includes Clarity, a performance analysis service that gives you visibility into the performance of your cluster and can help you pinpoint and solve performance issues. Access it by going to clarity.imply.io.
Once inside, you have access to the following views:
Broker Queries shows telemetry from your query servers and gives you an overview of query performance, including average, 98th percentile, and maximum query times. It allows you to break down performance by data source, server, properties of the query (e.g. query type, number of dimensions queried), and even individual query IDs.
Compute Queries lets you drill into the performance of individual data servers involved in each of your queries. A single Druid query will access many data nodes — each one handling a different slice of your data — and this view allows you to see how that fan-out affects query performance. Like the Broker Queries view, it can be broken down by data source, server, properties of the query (e.g. query type, number of dimensions queried), and individual query IDs.
Ingestion shows telemetry from real-time ingestion.
JVM Memory shows information about JVM memory allocation, including garbage collection.
Exceptions shows any error conditions reported by your servers.
Druid has a number of status APIs that provide visibility into the system. The following APIs are especially useful for monitoring.
Unavailable segments: On the Coordinator, check /druid/coordinator/v1/loadstatus?simple and verify that each datasource registers "0". This is the number of unavailable segments. It may briefly be non-zero when new segments are added, but if it stays high for a prolonged period of time, that indicates a problem with segment availability in your cluster. In this case, check your data nodes to confirm they are healthy, have spare disk space to load data, and have access to your S3 bucket where data is stored.
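This check is easy to script. The sketch below is one possible approach, using only the Python standard library; the Coordinator address is an assumption, and the helper names are illustrative rather than part of any Druid client library.

```python
import json
import urllib.request


def unavailable_segments(coordinator_url="http://localhost:8081"):
    """Fetch the Coordinator's simple load status: a mapping of
    datasource name -> number of segments not yet available.
    The default URL is an assumption; point it at your Coordinator."""
    url = coordinator_url + "/druid/coordinator/v1/loadstatus?simple"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)


def problem_datasources(status):
    """Return only the datasources whose unavailable-segment count
    is non-zero; a healthy, fully loaded cluster returns {}."""
    return {ds: n for ds, n in status.items() if n != 0}


# Example usage against a live cluster (alert if anything is returned):
# status = unavailable_segments("http://coordinator.example.com:8081")
# if problem_datasources(status):
#     raise RuntimeError("segments unavailable: %s" % problem_datasources(status))
```

Transient non-zero counts after new segments are published are normal; in practice you would alert only when a datasource stays non-zero across several consecutive polls.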
Data freshness: Run a "dataSourceMetadata" query to get the "maxIngestedEventTime" and verify that it's recent enough for your needs. For example, alert if it's more than a few minutes old. This is an inexpensive Druid query, since it only hits the most recent segments and only looks at the last row of data. In addition to verifying ingestion time, this also verifies that Druid is responsive to queries. If this value is staler than you expect, it can indicate that real-time data is not being loaded properly. In this case, use the Imply Manager to verify that your data ingestion is healthy, that there have not been any errors loading data, and that you have enough capacity to load the amount of data that you're trying to load.
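A freshness check along these lines can be sketched as follows. This is a minimal illustration, not a supported client: the Broker address and datasource name are assumptions, and the five-minute lag threshold is just the example alerting rule from above.

```python
import json
import urllib.request
from datetime import datetime, timedelta, timezone


def max_ingested_event_time(broker_url, datasource):
    """POST a dataSourceMetadata query to the Broker and return the
    reported maxIngestedEventTime as a timezone-aware datetime.
    broker_url (e.g. "http://localhost:8082") is an assumption."""
    query = {"queryType": "dataSourceMetadata", "dataSource": datasource}
    req = urllib.request.Request(
        broker_url + "/druid/v2/",
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        rows = json.load(resp)
    # Result is a one-element list; the timestamp is ISO 8601 with a "Z" suffix.
    ts = rows[0]["result"]["maxIngestedEventTime"]
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def is_stale(event_time, max_lag=timedelta(minutes=5), now=None):
    """True if the newest ingested event is older than max_lag."""
    now = now or datetime.now(timezone.utc)
    return now - event_time > max_lag


# Example usage (hypothetical Broker and datasource):
# latest = max_ingested_event_time("http://broker.example.com:8082", "pageviews")
# if is_stale(latest):
#     raise RuntimeError("real-time ingestion is lagging: last event at %s" % latest)
```

Because the query itself must succeed before the timestamp can be checked, a single probe like this covers both freshness and basic query responsiveness, as noted above.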