Skip to main content

Tutorial: Querying data

After you have ingested data into Imply, you can query the data using Druid SQL.

The Quickstart takes you through the steps to get started with Imply, add sample data, and run a simple query against it. This tutorial builds upon the quickstart introduction by showing you how to build more complex queries in the Druid console and run them in Pivot.

Before starting, make sure you have access to a running Imply instance with data loaded. This tutorial uses the Wikipedia data set introduced in the Quickstart.

Query SQL from the Druid console

The Druid console serves as the data management interface for an Imply cluster. It's where you can load and manage data. While its primary purpose is data management, it also gives data administrators an interface for building data queries, as you'll explore in the following steps.

You'll build the query in two steps, starting simple and then extending the query.

Build a simple query

To build a simple query in Druid console, follow these steps:

  1. Open the Druid console in your web browser. From Pivot, you can open the console from the Data tab by clicking the Load Data button. For on-prem, navigate to http://localhost:8888.

  2. Click the Query button in the top menu to open the Query page:

    Query view

    In the Query page, you can you enter Druid SQL queries directly in the query editor. Instead, however, you'll build a query using query builder tools.

  3. Expand the wikipedia node in the data sources tree.

  4. Scroll down until you see the page node and click it to show the query menu.

  5. Click Show:page:

    Query select page

    A SELECT query appears in the query editor and runs automatically. In this case, the query should return no data, since the query filters for data from the last day, but our data is considerably older than that. Let's remove the filter.

  6. In the datasource tree, click __time and Remove Filter.

    Clear WHERE filter

  7. Click Run to run the query.

    You should now see two columns of data, a page name and the count:

    Query results

    The console limits the results to about a hundred by default because of the Smart query limit feature. This limit helps you avoid inadvertently overwhelming the system by preventing queries from returning an excessive amounts of data.

Extend the query

Now you'll edit the query directly, and explore a few more query building features of the editor:

  1. Click in the query edit pane and add a line after the first column, "page", and start typing the name of a new column, "countryName".

    Notice that the autocomplete menu suggests column names, functions, keywords, and more.

  2. Choose "countryName" and add the new column to the GROUP BY clause as well, either by name or by reference to its position, 2.

  3. For readability, replace the Count column name with Edits, since the COUNT() function actually returns the number of edits for the page. Make the same column name change in the ORDER BY clause as well.

      The `COUNT()` function is one of many functions available for use in Druid SQL queries. You can mouse over a function name
    in the autocomplete menu to see a brief description of a function. Also, you can find more information in the Druid
    documentation. For example, the `COUNT()` function is documented in
    [Aggregation functions](/latest/druid/querying/sql-aggregations).

    The query should now be:
    SELECT
    "page",
    "countryName",
    COUNT(*) AS "Edits"
    FROM "wikipedia"
    GROUP BY 1, 2
    ORDER BY "Edits" DESC

    When you run the query again, notice that you're getting the new dimension,countryName, but for most of the rows, its value is null. Show only rows with a countryName value as follows.

  4. Click the countryName dimension and choose the first filtering option, countryName IS NOT NULL. The following clause should be added to the query:

    WHERE "countryName" IS NOT NULL

    Run the query again. You should now see the top edits by country:

    Finished query

  5. Under the covers, every Druid SQL query is translated into a query in the JSON-based Druid native query format before it runs on data nodes. You can view the native query for this query by clicking ... and Explain SQL Query.

    While you can use Druid SQL for most purposes, familiarity with native query is useful for composing complex queries and for troubleshooting performance issues. For more information, see Native queries.

    Explain query

    > Another way to view the explain plan is by adding EXPLAIN PLAN FOR to the front of your query, as follows:
    >
    >```sql
    >EXPLAIN PLAN FOR
    >SELECT
    > "page",
    > "countryName",
    > COUNT(*) AS "Edits"
    >FROM "wikipedia"
    >WHERE "countryName" IS NOT NULL
    >GROUP BY 1, 2
    >ORDER BY "Edits" DESC
    >```
    >This is particularly useful when running queries
    from the command line or over HTTP.
  1. Finally, click ... and Edit context to see how you can add additional parameters controlling the execution of the query execution. In the field, enter query context options as JSON key-value pairs, as described in Context flags.

That's it! you've built a simple query using some of the query builder features built into the Druid Console. The following sections provide a few more example queries you can try. Also, see Other ways to invoke SQL queries to learn how to run Druid SQL from the command line or over HTTP.

Other ways to run SQL queries

The following sections describe a few other ways you can run queries with Imply.

Querying in Pivot

Any query you build in the Druid Console can be run in Pivot as well. The SQL page lets you build and run Druid SQL queries against data sources:

Finished query

From there, you can download results and check query execution time, and save a named query (as shown above), making it available to other Pivot users.

Querying over HTTP

You can also submit queries directly to the Druid Broker over HTTP.

Given a file quickstart/tutorial/wikipedia-top-pages-sql.json with the query above, run the query as follows:

curl -X 'POST' -H 'Content-Type:application/json' -d @quickstart/tutorial/wikipedia-top-pages-sql.json http://localhost:8888/druid/v2/sql

The results are:

[
{
"page": "Wikipedia:Vandalismusmeldung",
"Edits": 33
},
{
"page": "User:Cyde/List of candidates for speedy deletion/Subpage",
"Edits": 28
},
{
"page": "Jeremy Corbyn",
"Edits": 27
},
{
"page": "Wikipedia:Administrators' noticeboard/Incidents",
"Edits": 21
},
{
"page": "Flavia Pennetta",
"Edits": 20
},
{
"page": "Total Drama Presents: The Ridonculous Race",
"Edits": 18
},
{
"page": "User talk:Dudeperson176123",
"Edits": 18
},
{
"page": "Wikipédia:Le Bistro/12 septembre 2015",
"Edits": 18
},
{
"page": "Wikipedia:In the news/Candidates",
"Edits": 17
},
{
"page": "Wikipedia:Requests for page protection",
"Edits": 17
}
]

More information

See the Druid SQL documentation for more information on using Druid SQL queries.