Fading Coder

One Final Commit for the Last Sprint

Home > Tech > Content

Elasticsearch Advanced Aggregation Analysis

Tech May 10 2

Bucket & Metric Aggregation and Nested Aggregation

Aggregations are part of a search request. It is generally recommended to set the size parameter to 0 when performing aggregations to focus only on the aggregated results.

Metric Aggregation

Example: Salary statistics

Insert sample data

DELETE /employees
PUT /employees/
{
  "mappings": {
    "properties": {
      "age": {
        "type": "integer"
      },
      "gender": {
        "type": "keyword"
      },
      "job": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 50
          }
        }
      },
      "name": {
        "type": "keyword"
      },
      "salary": {
        "type": "integer"
      }
    }
  }
}

PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
{ "index" : {  "_id" : "3" } }
{ "name" : "Tran","age":25,"job":"Web Designer","gender":"male","salary":18000 }
{ "index" : {  "_id" : "4" } }
{ "name" : "Rivera","age":26,"job":"Web Designer","gender":"female","salary": 22000}
{ "index" : {  "_id" : "5" } }
{ "name" : "Rose","age":25,"job":"QA","gender":"female","salary":18000 }
{ "index" : {  "_id" : "6" } }
{ "name" : "Lucy","age":31,"job":"QA","gender":"female","salary": 25000}
{ "index" : {  "_id" : "7" } }
{ "name" : "Byrd","age":27,"job":"QA","gender":"male","salary":20000 }
{ "index" : {  "_id" : "8" } }
{ "name" : "Foster","age":27,"job":"Java Programmer","gender":"male","salary": 20000}
{ "index" : {  "_id" : "9" } }
{ "name" : "Gregory","age":32,"job":"Java Programmer","gender":"male","salary":22000 }
{ "index" : {  "_id" : "10" } }
{ "name" : "Bryant","age":20,"job":"Java Programmer","gender":"male","salary": 9000}
{ "index" : {  "_id" : "11" } }
{ "name" : "Jenny","age":36,"job":"Java Programmer","gender":"female","salary":38000 }
{ "index" : {  "_id" : "12" } }
{ "name" : "Mcdonald","age":31,"job":"Java Programmer","gender":"male","salary": 32000}
{ "index" : {  "_id" : "13" } }
{ "name" : "Jonthna","age":30,"job":"Java Programmer","gender":"female","salary":30000 }
{ "index" : {  "_id" : "14" } }
{ "name" : "Marshall","age":32,"job":"Javascript Programmer","gender":"male","salary": 25000}
{ "index" : {  "_id" : "15" } }
{ "name" : "King","age":33,"job":"Java Programmer","gender":"male","salary":28000 }
{ "index" : {  "_id" : "16" } }
{ "name" : "Mccarthy","age":21,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "17" } }
{ "name" : "Goodwin","age":25,"job":"Javascript Programmer","gender":"male","salary": 16000}
{ "index" : {  "_id" : "18" } }
{ "name" : "Catherine","age":29,"job":"Javascript Programmer","gender":"female","salary": 20000}
{ "index" : {  "_id" : "19" } }
{ "name" : "Boone","age":30,"job":"DBA","gender":"male","salary": 30000}
{ "index" : {  "_id" : "20" } }
{ "name" : "Kathy","age":29,"job":"DBA","gender":"female","salary": 20000}

Find the minimum salary:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "min_salary": {
      "min": {
        "field": "salary"
      }
    }
  }
}

Find the maximum salary:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    }
  }
}

Find multiple metrics (min, max, average) concurrently:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "max_salary": {
      "max": {
        "field": "salary"
      }
    },
    "min_salary": {
      "min": {
        "field": "salary"
      }
    },
    "avg_salary": {
      "avg": {
        "field": "salary"
      }
    }
  }
}

A single stats aggregation returns multiple statistics:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "stats_salary": {
      "stats": {
        "field": "salary"
      }
    }
  }
}

Bucket Aggregation

Terms Aggregation

  • For keyword fields, doc_values are enabled by default, so terms aggregation works directly.
  • For text fields, you must enable fielddata in the mapping. The aggregation will be applied on the analyzed tokens.

Aggregate on job.keyword:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}

Attempting terms aggregation on a text field will fail unless fielddata is enabled:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job"
      }
    }
  }
}

Enable fielddata on the job field:

PUT employees/_mapping
{
  "properties": {
    "job": {
      "type": "text",
      "fielddata": true
    }
  }
}

Now the aggregation on job will bucket by the analyzed tokens.

Cardinality Aggregation

Similar to SQL COUNT(DISTINCT):

POST employees/_search
{
  "size": 0,
  "aggs": {
    "cardinate": {
      "cardinality": {
        "field": "job.keyword"
      }
    }
  }
}

Bucket Size and Top Hits

Use Case: After bucketing, retrieve the top documents within each bucket.

Size: Limit the number of buckets returned. Example: bucket by age, return only the top 3 buckets.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "ages_5": {
      "terms": {
        "field": "age",
        "size": 3
      }
    }
  }
}

Top Hits: For each job type, find the 3 oldest employees.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "old_employee": {
          "top_hits": {
            "size": 3,
            "_source": ["name", "job"], 
            "sort": [
              {
                "age": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

Range and Histogram Aggregation

Bucket numeric values by defined ranges. Custom keys can be specified.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_range": {
      "range": {
        "field": "salary",
        "ranges": [
          {
            "to": 10000
          },
          {
            "from": 10000,
            "to": 20000
          },
          {
            "key": ">20000",
            "from": 20000
          }
        ]
      }
    }
  }
}

Histogram aggregation with a fixed interval (e.g., 5000):

POST employees/_search
{
  "size": 0,
  "aggs": {
    "salary_histrogram": {
      "histogram": {
        "field": "salary",
        "interval": 5000,
        "extended_bounds": {
          "min": 0,
          "max": 100000
        }
      }
    }
  }
}

Bucket + Metric Aggregation (Nested Aggregations)

Sub-aggregations can be added within a bucket aggregation. Sub-aggregations can be either bucket or metric aggregations.

Example 1: Bucket by job type and calculate salary statistics within each bucket.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_salary_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "salary": {
          "stats": {
            "field": "salary"
          }
        }
      }
    }
  }
}

Example 2: Bucket by job type, then by gender within each job, and calculate salary statistics.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "Job_gender_stats": {
      "terms": {
        "field": "job.keyword"
      },
      "aggs": {
        "gender_stats": {
          "terms": {
            "field": "gender"
          },
          "aggs": {
            "salary_stats": {
              "stats": {
                "field": "salary"
              }
            }
          }
        }
      }
    }
  }
}

Pipeline Aggregation

Pipeline aggregations work on the outputs of other aggregations. They can be placed at the same level (sibling) or nested within (parent) the source aggregation.

  • Sibling: max_bucket, min_bucket, avg_bucket, sum_bucket, stats_bucket, percentiles_bucket
  • Parant: cumulative_sum, derivative, moving_function

Re-index sample data (after deleting the index):

DELETE employees
PUT /employees/_bulk
{ "index" : {  "_id" : "1" } }
{ "name" : "Emma","age":32,"job":"Product Manager","gender":"female","salary":35000 }
{ "index" : {  "_id" : "2" } }
{ "name" : "Underwood","age":41,"job":"Dev Manager","gender":"male","salary": 50000}
... (same data as before)

Sibling Pipeline Examples

Find the job type with the lowest average salary:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "min_salary_by_job": {
      "min_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Find the job type with highest average salary:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "max_salary_by_job": {
      "max_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Calculate the average of job-specific average salaries:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "avg_salary_by_job": {
      "avg_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Statistics on the bucket-level average salaries:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "stats_salary_by_job": {
      "stats_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Percentiles of the bucket-level average salaries:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "size": 10
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        }
      }
    },
    "percentiles_salary_by_job":{
      "percentiles_bucket": {
        "buckets_path": "jobs>avg_salary"
      }
    }
  }
}

Parent Pipeline Example (Derivative)

Calculate the derivative (rate of change) of the average salary as age increases.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "age": {
      "histogram": {
        "field": "age",
        "min_doc_count": 1,
        "interval": 1
      },
      "aggs": {
        "avg_salary": {
          "avg": {
            "field": "salary"
          }
        },
        "derivative_avg_salary": {
          "derivative": {
            "buckets_path": "avg_salary"
          }
        }
      }
    }
  }
}

Aggregation Scope and Sorting

Aggregation Scope

The default scope of an aggregation is the result set of the query. You can change the scope using:

  • filter
  • post_filter
  • global

Sample data (re-create index if needed):

DELETE /employees
PUT /employees/
... (same mapping and data as before)

Query with default scope: Filter by age >= 20, then bucket by job type.

POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}

Filter Aggregation: Apply a filter within an aggregation to restrict the documents considered.

POST employees/_search
{
  "size": 0,
  "aggs": {
    "older_person": {
      "filter": {
        "range": {
          "age": {
            "from": 35
          }
        }
      },
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword"
          }
        }
      }
    },
    "all_jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  }
}

Post Filter: Filter the documents returned after aggregation. This allows you to retrieve both aggregated results and a specific subset of documents in one request.

POST employees/_search
{
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    }
  },
  "post_filter": {
    "match": {
      "job.keyword": "Dev Manager"
    }
  }
}

Global Aggregation: Ignores the query scope and aggregates over all documents in the index.

POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 40
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword"
      }
    },
    "all": {
      "global": {},
      "aggs": {
        "salary_avg": {
          "avg": {
            "field": "salary"
          }
        }
      }
    }
  }
}

Sorting Buckets

By default, terms are sorted by document count in descending order. You can specify order to sort by _count, _key, or by a sub-aggregation value.

Sort by count ascending, then by key descending:

POST employees/_search
{
  "size": 0,
  "query": {
    "range": {
      "age": {
        "gte": 20
      }
    }
  },
  "aggs": {
    "jobs": {
      "terms": {
        "field": "job.keyword",
        "order": [
          {
            "_count": "asc"
          },
          {
            "_key": "desc"
          }
        ]
      }
    }
  }
}

Sort by a sub-aggregation value:

POST employees/_search
{
  "size": 0,
  "aggs": {
    "jobs": {
      "terms": {
        "field":"job.keyword",
        "order":[
          {
            "stats_salary.min":"desc"
          }
        ]
      },
      "aggs": {
        "stats_salary": {
          "stats": {
            "field":"salary"
          }
        }
      }
    }
  }
}

Aggregation Precision

Terms aggregation results can be approximate due to data being distributed across multiple shards. The coordinating node doesn't have the full picture.

Solutions:

  1. Set the number of primary shards to 1 for exact results (only suitable for small datasets).
  2. Increase the shard_size parameter to fetch more terms from each shard, improving accuracy at the cost of performance.

Default shard_size: shard_size = size * 1.5 + 10

Example with shard_size:

GET my_flights/_search
{
  "size": 0,
  "aggs": {
    "weather": {
      "terms": {
        "field":"OriginWeather",
        "size":1,
        "shard_size":10,
        "show_term_doc_count_error":true
      }
    }
  }
}

Composite Aggregation for Paginated Multi-conditional Buckets

Use the composite aggregation to paginate through combinations of multiple bucket keys.

First page:

POST legislation/_search
{
  "size": 0,
  "query": {
    "term": {
      "source_type": {
        "value": "migrate"
      }
    }
  },
  "aggs": {
    "legislation": {
      "composite": {
        "sources": [
          {
            "norm_citation": {
              "terms": {
                "field": "norm_citation"
              }
            }
          },
          {
            "designator": {
              "terms": {
                "field": "designator.keyword"
              }
            }
          }
        ],
        "size": 10
      }
    }
  }
}

Next page using the after parameter:

POST legislation/_search
{
  "size": 0,
  "query": {
    "term": {
      "source_type": {
        "value": "migrate"
      }
    }
  },
  "aggs": {
    "legislation": {
      "composite": {
        "sources": [
          {
            "norm_citation": {
              "terms": {
                "field": "norm_citation"
              }
            }
          },
          {
            "designator": {
              "terms": {
                "field": "designator.keyword"
              }
            }
          }
        ],
        "size": 10, 
        "after": {
          "norm_citation": "1890_39a",
          "designator": "section-15"
        }
      }
    }
  }
}

Related Articles

Understanding Strong and Weak References in Java

Strong References Strong reference are the most prevalent type of object referencing in Java. When an object has a strong reference pointing to it, the garbage collector will not reclaim its memory. F...

Comprehensive Guide to SSTI Explained with Payload Bypass Techniques

Introduction Server-Side Template Injection (SSTI) is a vulnerability in web applications where user input is improper handled within the template engine and executed on the server. This exploit can r...

Implement Image Upload Functionality for Django Integrated TinyMCE Editor

Django’s Admin panel is highly user-friendly, and pairing it with TinyMCE, an effective rich text editor, simplifies content management significantly. Combining the two is particular useful for bloggi...

Leave a Comment

Anonymous

◎Feel free to join the discussion and share your thoughts.