Fading Coder

One Final Commit for the Last Sprint

Mastering Elasticsearch Queries: From DSL to Java API Implementation

Tech · May 10

Today, we'll explore Elasticsearch's data search capabilities. Elasticsearch provides a JSON-based DSL (Domain Specific Language) for defining query conditions, and its Java API essentially organizes these DSL conditions.

Therefore, we'll first learn the DSL query syntax, then use that as a foundation to understand the Java API, which will make the learning process more efficient.

  1. DSL Queries

Elasticsearch queries can be categorized into two main types:

  • Leaf query clauses: Typically search for specific values in particular fields. These are simple queries rarely used independently.
  • Compound query clauses: Logically combine multiple leaf queries or modify leaf query behavior.

1.1. Quick Start

We'll continue using Kibana's DevTools to learn the DSL query syntax. First, let's examine the basic query structure:

GET /{index-name}/_search
{
  "query": {
    "query_type": {
      // .. query conditions
    }
  }
}

Explanation:

  • GET /{index-name}/_search: The _search path is fixed and cannot be modified

For example, let's look at the simplest unconditional query. The query type for unconditional queries is match_all, so the query statement would be:

GET /products/_search
{
  "query": {
    "match_all": {
      
    }
  }
}

Since match_all has no conditions, we can leave the condition section empty.

The execution results show that, although we used match_all, the response contains only 10 documents rather than every document in the index. This is because Elasticsearch applies a default page size of 10 to avoid returning an unbounded result set in a single response.

1.2. Leaf Queries

Leaf query types can be further subdivided. For detailed information, you can refer to the official documentation: Elasticsearch Query DSL

Here are some common examples:

  • Full Text Queries: Use analyzers to first tokenize user input into terms, then search for these terms using the inverted index. Examples:
    • match
    • multi_match
  • Term-level queries: Don't tokenize user input but match exact values against field content. These can only search keyword, numeric, date, or boolean type fields. Examples:
    • ids
    • term
    • range
  • Geo-coordinate queries: Used for searching geographic locations with various methods:
    • geo_bounding_box: Search within a rectangle
  • geo_distance: Search within a given radius of a center point
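
As a quick illustration of a geo query, here is a minimal geo_distance sketch. It assumes the index has a geo_point field named location (a hypothetical name):

```json
GET /products/_search
{
  "query": {
    "geo_distance": {
      "distance": "10km",
      "location": {
        "lat": 31.21,
        "lon": 121.5
      }
    }
  }
}
```

This returns documents whose location falls within 10 km of the given center point.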

1.2.1. Full Text Queries

There are many types of full-text queries. For details, refer to the official documentation: Full text queries

Let's take the match query as an example. The syntax is as follows:

GET /{index-name}/_search
{
  "query": {
    "match": {
      "field_name": "search_term"
    }
  }
}

Similar to match, there's also multi_match, which searches multiple fields simultaneously; a document matches if any of the listed fields match. The syntax example:

GET /{index-name}/_search
{
  "query": {
    "multi_match": {
      "query": "search_term",
      "fields": ["field1", "field2"]
    }
  }
}

1.2.2. Term-level Queries

Term-level queries, as the name suggests, operate at the term level. They don't tokenize user input but treat it as a single term, matching against exact field values. Therefore, they're recommended for keyword, numeric, date, or boolean fields, that is, fields whose values are only meaningful as a whole. Examples include:

  • id
  • price
  • city
  • place names
  • personal names

For detailed information, see the official documentation: Term-level queries

Let's take the term query as an example. Its syntax is as follows:

GET /{index-name}/_search
{
  "query": {
    "term": {
      "field_name": {
        "value": "search_term"
      }
    }
  }
}

Now let's look at the range query, with syntax as follows:

GET /{index-name}/_search
{
  "query": {
    "range": {
      "field_name": {
        "gte": {min_value},
        "lte": {max_value}
      }
    }
  }
}

range is a range query, with the following keywords for range filtering:

  • gte: Greater than or equal to
  • gt: Greater than
  • lte: Less than or equal to
  • lt: Less than

1.3. Compound Queries

Compound queries can be roughly divided into two categories:

  • First category: Combine leaf queries based on logical operations to implement combined conditions, such as:
    • bool
  • Second category: Modify document relevance scoring during queries based on certain algorithms, thereby changing document rankings. Examples:
    • function_score
    • dis_max

For other compound queries and related syntax, refer to the official documentation: Compound queries

1.3.1. Function Score Query

When we use match queries, document results are scored based on relevance (_score) to the search terms, and results are returned in descending order of scores.

Starting from Elasticsearch 5.x, the default relevance scoring algorithm is BM25. Based on this algorithm, the relevance between a document and the user's search keywords can be determined quite accurately. However, real business requirements often call for paid ranking, where higher-paying results rank above more relevant ones.

To manually control relevance scoring, we need to use the function score query in Elasticsearch.

Basic syntax:

The function score query contains four main components:

  • Original query conditions: The query part, which searches documents based on these conditions and scores documents using the BM25 algorithm, resulting in original scores (query score)
  • Filter conditions: The filter part, where only documents meeting these conditions will be rescored
  • Scoring functions: Documents meeting the filter conditions will be calculated using these functions, resulting in function scores (function score). There are four types of functions:
    • weight: The function result is a constant
    • field_value_factor: Uses a field value from the document as the function result
    • random_score: Uses a random number as the function result
    • script_score: Custom scoring function algorithm
  • Operation mode: How the function score results and original query relevance scores are combined, including:
    • multiply: Multiply
    • replace: Replace query score with function score
    • Others, such as: sum, avg, max, min

The function score operation flow is as follows:

  1. Search documents based on original conditions and calculate relevance scores, called original scores (query score)
  2. Filter documents based on filter conditions
  3. Documents meeting filter conditions are calculated based on scoring functions, resulting in function scores (function score)
  4. Combine the original scores (query score) and function scores (function score) based on the operation mode to get the final result as the relevance score

Therefore, the key points are:

  • Filter conditions: Determine which documents' scores are modified
  • Scoring functions: Determine the algorithm for function scores
  • Operation mode: Determine the final score result

Example: Increase the score of iPhones by 10 times. Analysis:

  • Filter condition: Brand must be iPhone
  • Scoring function: Constant weight, value of 10
  • Scoring mode: Multiply

Corresponding code:

GET /products/_search
{
  "query": {
    "function_score": {
      "query": {  .... }, // Original query, can be any condition
      "functions": [ // Scoring functions
        {
          "filter": { // Conditions to meet, brand must be iPhone
            "term": {
              "brand": "iPhone"
            }
          },
          "weight": 10 // Scoring weight is 10
        }
      ],
      "boost_mode": "multiply" // Weighting mode, multiply
    }
  }
}

1.3.2. Bool Query

The bool query, or Boolean query, uses logical operations to combine one or more query clauses. The logical operations supported by bool queries are:

  • must: Must match each subquery, similar to "AND"
  • should: Optionally match subqueries, similar to "OR"
  • must_not: Must not match, doesn't participate in scoring, similar to "NOT"
  • filter: Must match, doesn't participate in scoring

The syntax for bool queries is as follows:

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"name": "mobile phone"}}
      ],
      "should": [
        {"term": {"brand": { "value": "vivo" }}},
        {"term": {"brand": { "value": "Xiaomi" }}}
      ],
      "must_not": [
        {"range": {"price": {"gte": 2500}}}
      ],
      "filter": [
        {"range": {"price": {"lte": 1000}}}
      ]
    }
  }
}

For performance reasons, queries unrelated to search keywords should use must_not or filter logical operations to avoid participating in relevance scoring.

For example, in an e-commerce search page:

The search box conditions should participate in relevance scoring and can use match. However, price-range, brand, and category filters should use filter wherever possible so they don't participate in relevance scoring.

For example, suppose we want to search for "mobile phone", with the brand fixed to "Huawei" and the price between 900 and 1599 (the index here stores prices in cents, hence 90000~159900 in the query). We could write:

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {"match": {"name": "mobile phone"}}
      ],
      "filter": [
        {"term": {"brand": { "value": "Huawei" }}},
        {"range": {"price": {"gte": 90000, "lt": 159900}}}
      ]
    }
  }
}

1.4. Sorting

By default, Elasticsearch sorts by relevance score (_score), but it also supports custom sorting of search results. However, tokenized fields cannot be sorted. Field types that can participate in sorting include: keyword types, numeric types, geo-coordinate types, date types, etc.

For detailed information, refer to the official documentation: Sort search results

Syntax explanation:

GET /indexName/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "sort_field": {
        "order": "sort_order_asc_or_desc"
      }
    }
  ]
}

Example, sorting by product price:

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

1.5. Pagination

By default, Elasticsearch only returns the top 10 data. To query more data, you need to modify the pagination parameters.

1.5.1. Basic Pagination

In Elasticsearch, pagination results are controlled by modifying the from and size parameters:

  • from: Starting from which document
  • size: Total number of documents to query

Similar to LIMIT ?, ? in MySQL.

The syntax is as follows:

GET /products/_search
{
  "query": {
    "match_all": {}
  },
  "from": 0, // Starting position of pagination, default is 0
  "size": 10,  // Number of documents per page, default is 10
  "sort": [
    {
      "price": {
        "order": "desc"
      }
    }
  ]
}

1.5.2. Deep Pagination

Elasticsearch data is typically stored in shards, meaning data in an index is divided into N parts and stored on different nodes. This storage method is beneficial for data expansion but brings some challenges to pagination.

For example, suppose an index contains 100,000 documents stored in 4 shards, with 25,000 documents per shard, and we want to query page 100 with 10 documents per page. The pagination query conditions would be:

GET /products/_search
{
  "from": 990, // Start from the 990th document
  "size": 10, // Query 10 documents per page
  "sort": [
    {
      "price": "asc"
    }
  ]
}

Analyzing the statement, we want the documents ranked 991~1000.

From an implementation perspective, we would need to sort all the data, find the top 1000, and then extract entries 991~1000. But the problem is: how do we find the true top 1000 across all the data?

Since each shard holds different data, a document ranked 900~1000 within one shard may not be ranked 900~1000 across the whole index. Therefore, each shard must return its own top 1000 documents; these are aggregated and re-sorted to find the true top 1000 for the entire index, from which entries 991~1000 can be extracted.

Now imagine querying page 1000, i.e., documents 9991~10000: each shard would have to return its top 10,000 documents for aggregation and sorting in memory. The deeper the page, the more data must be fetched at once.

Therefore, when the query pagination depth is large, the aggregated data is too much, which puts significant pressure on memory and CPU.

Thus, by default Elasticsearch rejects requests where from + size exceeds 10,000 (controlled by the index.max_result_window setting).

For deep pagination, Elasticsearch provides two solutions:

  • search after: Requires sorting during pagination, with the principle of querying the next page of data starting from the last sort value. This is the officially recommended method.
  • scroll: The principle is to create a snapshot of sorted document IDs and save them, then paginate based on the snapshot. This is no longer officially recommended.
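
To make search after concrete, here is a minimal sketch. It assumes we sort by price with _id as a tiebreaker, and that the search_after values are the sort values of the last hit from the previous page (the numbers below are placeholders):

```json
GET /products/_search
{
  "size": 10,
  "sort": [
    { "price": "asc" },
    { "_id": "asc" }
  ],
  "search_after": [189900, "100001"]
}
```

Each subsequent page re-issues the same query with the sort values of its own last hit, so no page ever needs a large from value.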

1.6. Highlighting

1.6.1. Highlighting Principle

What is highlighting?

When we search on Baidu or JD, keywords become red and more prominent, which is called highlighting:

Observing the page source code, you'll notice two things:

  • Highlighted terms are wrapped in <em> tags
  • The <em> tags have red styles added

CSS styles are definitely written by the frontend when implementing the page, but when the frontend writes the page, they don't know what data will be displayed, so they can't add tags to the data. The backend implements the search functionality, and if using Elasticsearch for tokenized search, it knows which terms need to be highlighted.

Therefore, the highlighting tags must be added by the backend when providing data.

Thus, the implementation idea for highlighting is:

  1. User enters search keywords to search for data
  2. The backend searches in Elasticsearch based on the keywords and adds HTML tags to the keyword terms in the search results
  3. The frontend adds CSS styles to the pre-agreed HTML tags

1.6.2. Implementing Highlighting

In fact, Elasticsearch already provides syntax for adding tags to search keywords, so we don't need to code it ourselves.

The basic syntax is as follows:

GET /{index-name}/_search
{
  "query": {
    "match": {
      "search_field": "search_keyword"
    }
  },
  "highlight": {
    "fields": {
      "highlight_field_name": {
        "pre_tags": "<em>",
        "post_tags": "</em>"
      }
    }
  }
}

Note:

  • The search must have query conditions, and they must be full-text retrieval type query conditions, such as match
  • Fields participating in highlighting must be text type fields
  • By default, fields participating in highlighting must match the search fields, unless you set "require_field_match": false
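
As a sketch of that last point, suppose we search a description field but also want the name field highlighted (field names here are illustrative); require_field_match must then be disabled for that field:

```json
GET /products/_search
{
  "query": {
    "match": { "description": "milk" }
  },
  "highlight": {
    "fields": {
      "name": {
        "require_field_match": "false",
        "pre_tags": "<em>",
        "post_tags": "</em>"
      }
    }
  }
}
```
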

  2. RestClient Queries

Document queries still use the RestHighLevelClient object we learned previously. The basic steps for querying are as follows:

  1. Create a request object, this time for searching, so it's SearchRequest
  2. Prepare request parameters, which are the JSON parameters corresponding to the query DSL
  3. Send the request
  4. Parse the response, which is relatively complex and requires layer-by-layer parsing

2.1. Quick Start

As mentioned before, since the interfaces exposed by Elasticsearch are all RESTful, a Java API call is essentially sending an HTTP request. Our core task is to organize request parameters with Java code and parse the response results.

The format of these parameters completely references the JSON structure of the DSL query statement, so in our learning process, we'll constantly compare the Java API with the DSL statement. You should also learn by comparing in this way.

2.1.1. Sending Requests

First, let's take match_all query as an example. The DSL and Java API comparison is as follows:

Code interpretation:

  • First step, create a SearchRequest object, specifying the index name
  • Second step, use request.source() to build the DSL, which can include query, pagination, sorting, highlighting, etc.
    • query(): Represents query conditions, using QueryBuilders.matchAllQuery() to build a match_all query DSL
  • Third step, use client.search() to send the request and get the response

There are two key APIs here. One is request.source(), which builds the complete JSON parameters in the DSL. It includes all functions such as query, sort, from, size, highlight, etc.

The other is QueryBuilders, which contains the various leaf queries, compound queries, etc., that we've learned:

2.1.2. Parsing Response Results

After sending the request, we get the response result SearchResponse. The structure of this class is completely consistent with the JSON structure of the response result we see in Kibana:

{
    "took" : 0,
    "timed_out" : false,
    "hits" : {
        "total" : {
            "value" : 2,
            "relation" : "eq"
        },
        "max_score" : 1.0,
        "hits" : [
            {
                "_index" : "heima",
                "_type" : "_doc",
                "_id" : "1",
                "_score" : 1.0,
                "_source" : {
                "info" : "Java Instructor",
                "name" : "Zhao Yun"
                }
            }
        ]
    }
}

Therefore, our code to parse SearchResponse is parsing this JSON result, with the following comparison:

Code interpretation:

The result returned by Elasticsearch is a JSON string, with a structure containing:

  • hits: Hit results
    • total: Total count, where value is the specific total count value
    • max_score: The highest relevance score among all results
    • hits: Array of search result documents, where each document is a JSON object
      • _source: Original data in the document, also a JSON object

Therefore, parsing the response result is to parse the JSON string layer by layer, with the following process:

  • SearchHits: Obtained through response.getHits(), this is the outermost hits in the JSON, representing hit results
    • SearchHits#getTotalHits().value: Get total count information
    • SearchHits#getHits(): Get SearchHit array, which is the document array
      • SearchHit#getSourceAsString(): Get the _source in the document result, which is the original JSON document data

2.1.3. Summary

The basic steps for document search are:

  1. Create a SearchRequest object
  2. Prepare request.source(), which is the DSL.
    1. Use QueryBuilders to build query conditions
    2. Pass to the query() method of request.source()
  3. Send the request and get the result
  4. Parse the result (refer to the JSON result, parse from outside to inside, layer by layer)

Complete code:

@Test
void testMatchAll() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    request.source().query(QueryBuilders.matchAllQuery());
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

private void parseResponse(SearchResponse response) {
    SearchHits searchHits = response.getHits();
    // 1. Get total count
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " records in total");
    // 2. Iterate through result array
    SearchHit[] hits = searchHits.getHits();
    for (SearchHit hit : hits) {
        // 3. Get _source, which is the original JSON document
        String source = hit.getSourceAsString();
        // 4. Deserialize and print
        ProductDoc product = JSONUtil.toBean(source, ProductDoc.class);
        System.out.println(product);
    }
}

2.2. Leaf Queries

All query conditions are built by QueryBuilders, and leaf queries are no exception. Therefore, in the entire code, only the query condition construction method changes, while everything else remains the same.

For example, match query:

@Test
void testMatch() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    request.source().query(QueryBuilders.matchQuery("name", "skim milk"));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

Another example, multi_match query:

@Test
void testMultiMatch() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    request.source().query(QueryBuilders.multiMatchQuery("skim milk", "name", "category"));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

And range query:

@Test
void testRange() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    request.source().query(QueryBuilders.rangeQuery("price").gte(10000).lte(30000));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

And term query:

@Test
void testTerm() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    request.source().query(QueryBuilders.termQuery("brand", "Huawei"));
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

2.3. Compound Queries

Compound queries are also built by QueryBuilders. Let's take bool query as an example. The DSL and Java API comparison is as follows:

Complete code:

@Test
void testBool() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    // 2.1. Prepare bool query
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery();
    // 2.2. Keyword search
    boolQuery.must(QueryBuilders.matchQuery("name", "skim milk"));
    // 2.3. Brand filter
    boolQuery.filter(QueryBuilders.termQuery("brand", "Muller"));
    // 2.4. Price filter
    boolQuery.filter(QueryBuilders.rangeQuery("price").lte(30000));
    request.source().query(boolQuery);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

2.4. Sorting and Pagination

As mentioned before, request.source() is the entire request JSON parameter, so sorting and pagination are both set based on this. The DSL and Java API comparison is as follows:

Complete code:

@Test
void testPageAndSort() throws IOException {
    int pageNum = 1, pageSize = 5;

    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    // 2.1. Search condition parameters
    request.source().query(QueryBuilders.matchQuery("name", "skim milk"));
    // 2.2. Sort parameters
    request.source().sort("price", SortOrder.ASC);
    // 2.3. Pagination parameters
    request.source().from((pageNum - 1) * pageSize).size(pageSize);
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

2.5. Highlighting

Highlighting queries differ from previous queries in two ways:

  • Conditions are also specified in request.source(), but highlighting conditions need to be constructed based on HighlightBuilder
  • Highlighting response results are not together with search document results and need to be parsed separately

First, let's look at the highlighting condition construction. The DSL and Java API comparison is as follows:

Example code:

@Test
void testHighlight() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Organize request parameters
    // 2.1. Query conditions
    request.source().query(QueryBuilders.matchQuery("name", "skim milk"));
    // 2.2. Highlight conditions
    request.source().highlighter(
            SearchSourceBuilder.highlight()
                    .field("name")
                    .preTags("<em>")
                    .postTags("</em>")
    );
    // 3. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 4. Parse response
    parseResponse(response);
}

Now let's look at result parsing. The document parsing part remains unchanged, but the highlighting content needs to be parsed separately. The DSL and Java API comparison is as follows:

Code interpretation:

  • Steps 3,4: Get _source from the result. hit.getSourceAsString(), this part is the non-highlighted result, a JSON string. It also needs to be deserialized into a ProductDoc object
  • Step 5: Get highlighting results. hit.getHighlightFields(), the return value is a Map, where the key is the highlighted field name, and the value is the HighlightField object, representing the highlighted value
  • Step 5.1: Get the highlighted field value object HighlightField from the Map based on the highlighted field name
  • Step 5.2: Get Fragments from HighlightField and convert to string. This is the actual highlighted string
  • Finally: Replace the non-highlighted result in ProductDoc with the highlighted result

Complete code:

private void parseResponse(SearchResponse response) {
    SearchHits searchHits = response.getHits();
    // 1. Get total count
    long total = searchHits.getTotalHits().value;
    System.out.println("Found " + total + " records in total");
    // 2. Iterate through result array
    SearchHit[] hits = searchHits.getHits();
    for (SearchHit hit : hits) {
        // 3. Get _source, which is the original JSON document
        String source = hit.getSourceAsString();
        // 4. Deserialize
        ProductDoc product = JSONUtil.toBean(source, ProductDoc.class);
        // 5. Get highlighting results
        Map<String, HighlightField> highlightFields = hit.getHighlightFields();
        if (CollUtils.isNotEmpty(highlightFields)) {
            // 5.1. There are highlighting results, get the highlighting result for name
            HighlightField nameHighlight = highlightFields.get("name");
            if (nameHighlight != null) {
                // 5.2. Get the first highlighting result fragment, which is the highlighted value of the product name
                String highlightedName = nameHighlight.getFragments()[0].string();
                product.setName(highlightedName);
            }
        }
        System.out.println(product);
    }
}

  3. Data Aggregation

Aggregation (aggregations) allows us to easily implement statistics, analysis, and calculations on data. For example:

  • Which mobile phone brands are most popular?
  • What are the average, highest, and lowest prices of these phones?
  • How are the monthly sales of these phones?

Implementing these statistical functions is much more convenient than database SQL, and the query speed is very fast, achieving near real-time search effects.

Official documentation: Aggregations

There are three common types of aggregations:

  • Bucket aggregations: Used to group documents
    • TermAggregation: Group by document field values, such as grouping by brand values, grouping by country
    • Date Histogram: Group by date steps, such as one week per group, or one month per group
  • Metric aggregations: Used to calculate some values, such as maximum, minimum, average, etc.
    • Avg: Calculate average
    • Max: Calculate maximum
    • Min: Calculate minimum
    • Stats: Calculate max, min, avg, sum simultaneously
  • Pipeline aggregations: Further calculations based on the results of other aggregations

Note: Fields participating in aggregation must be of keyword, date, numeric, or boolean types
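
As a sketch of the Date Histogram bucket aggregation mentioned above, assuming the index has a date field named sold_date (a hypothetical name), monthly buckets would look like this:

```json
GET /products/_search
{
  "size": 0,
  "aggs": {
    "sales_by_month": {
      "date_histogram": {
        "field": "sold_date",
        "calendar_interval": "month"
      }
    }
  }
}
```
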

3.1. DSL Implementation of Aggregation

Similar to the search functions we learned earlier, we'll first learn the DSL syntax, then the Java API.

3.1.1. Bucket Aggregation

For example, if we want to count all product categories in the products index, we're essentially grouping data by the category field. Documents with the same category value fall into the same bucket; this is the Term aggregation, a type of Bucket aggregation.

Basic syntax:

GET /products/_search
{
  "size": 0, 
  "aggs": {
    "category_agg": {
      "terms": {
        "field": "category",
        "size": 20
      }
    }
  }
}

Syntax explanation:

  • size: Set size to 0, meaning query 0 documents per page, so the result won't contain documents, only aggregations
  • aggs: Define aggregation
    • category_agg: Aggregation name, customizable, but cannot be duplicated
      • terms: Aggregation type, using term for category aggregation
        • field: Name of the field participating in aggregation
        • size: Maximum number of aggregation results you want to return

3.1.2. Conditional Aggregation

By default, Bucket aggregation aggregates all documents in the index. For example, when we count all product brands, the result is as follows:

We can see that there are many brands counted.

But in real scenarios, users will input search conditions, so aggregation must be on search results. Therefore, aggregation must add limiting conditions.

For example, if I want to know which mobile phone brands have prices above 3000, how should I count them?

We need to analyze the search query conditions and aggregation goals from the requirements:

  • Search query conditions:
    • Price above 3000
    • Must be a mobile phone
  • Aggregation goal: We're counting brands, so it must be a term aggregation on the brand field

Syntax:

GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "mobile phone"
          }
        },
        {
          "range": {
            "price": {
              "gte": 300000
            }
          }
        }
      ]
    }
  }, 
  "size": 0, 
  "aggs": {
    "brand_agg": {
      "terms": {
        "field": "brand",
        "size": 20
      }
    }
  }
}

The aggregation results show only 3 brands now.

3.1.3. Metric Aggregation

In the previous section, we counted mobile phone brands with prices above 3000, forming buckets. Now we need to perform calculations on the products within these buckets to get the minimum, maximum, and average prices for each brand.

This requires using Metric aggregation, such as stat aggregation, which can simultaneously get min, max, avg, and other results.

Syntax:

GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "category": "mobile phone"
          }
        },
        {
          "range": {
            "price": {
              "gte": 300000
            }
          }
        }
      ]
    }
  }, 
  "size": 0, 
  "aggs": {
    "brand_agg": {
      "terms": {
        "field": "brand",
        "size": 20
      },
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "price"
          }
        }
      }
    }
  }
}

We won't discuss the query part here. Let's focus on interpreting the aggregation part syntax.

We can see that inside the brand_agg aggregation, we've added a new aggs parameter. This aggregation is a sub-aggregation of brand_agg and will perform statistics separately on documents in each bucket formed by brand_agg.

  • price_stats: Aggregation name
    • stats: Aggregation type, stats is one type of metric aggregation
      • field: The field to aggregate on; here we choose price to compute price statistics

Since stats performs statistics separately on documents within each brand bucket formed by brand_agg, each brand will have its own minimum, maximum, and average prices calculated.

Additionally, the aggregation results can be sorted, for example by each brand's average price, using the order property of the terms aggregation.
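
A sketch of such ordering, building on the stats sub-aggregation above: the terms aggregation's order property can reference the avg value of the price_stats sub-aggregation:

```json
GET /products/_search
{
  "size": 0,
  "aggs": {
    "brand_agg": {
      "terms": {
        "field": "brand",
        "size": 20,
        "order": {
          "price_stats.avg": "desc"
        }
      },
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "price"
          }
        }
      }
    }
  }
}
```
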

3.1.4. Summary

aggs represents aggregation, at the same level as query. At this point, what is the role of query?

  • Limit the document range for aggregation

The three essential elements of aggregation:

  • Aggregation name
  • Aggregation type
  • Aggregation field

Configurable properties for aggregation include:

  • size: Specify the number of aggregation results
  • order: Specify the sorting method of aggregation results
  • field: Specify the aggregation field

3.2. RestClient Implementation of Aggregation

In the DSL, aggs aggregation conditions are at the same level as query conditions, both belonging to the query JSON parameters. Therefore, we still use the request.source() method to set them.

However, the aggregation conditions need to be constructed using the AggregationBuilders utility class. The DSL and Java API syntax comparison is as follows:

Aggregation results are at the same level as search documents and need to be retrieved and parsed separately. The specific parsing syntax is as follows:

Complete code:

@Test
void testAggregation() throws IOException {
    // 1. Create Request
    SearchRequest request = new SearchRequest("products");
    // 2. Prepare request parameters
    BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
            .filter(QueryBuilders.termQuery("category", "mobile phone"))
            .filter(QueryBuilders.rangeQuery("price").gte(300000));
    request.source().query(boolQuery).size(0);
    // 3. Aggregation parameters
    request.source().aggregation(
            AggregationBuilders.terms("brand_agg").field("brand").size(5)
    );
    // 4. Send request
    SearchResponse response = client.search(request, RequestOptions.DEFAULT);
    // 5. Parse aggregation results
    Aggregations aggregations = response.getAggregations();
    // 5.1. Get brand aggregation
    Terms brandTerms = aggregations.get("brand_agg");
    // 5.2. Get buckets in the aggregation
    List<? extends Terms.Bucket> buckets = brandTerms.getBuckets();
    // 5.3. Iterate through data in buckets
    for (Terms.Bucket bucket : buckets) {
        // 5.4. Get key in the bucket
        String brand = bucket.getKeyAsString();
        System.out.print("brand = " + brand);
        long count = bucket.getDocCount();
        System.out.println("; count = " + count);
    }
}
