
Data Modeling in Elasticsearch: Objects, Relationships, and Pipelines

Objects and Nested Objects

Relational Data in the Real World

Many real-world scenarios involve complex relationships between entities:

  • Blog posts linked to authors and comments
  • Bank accounts with multiple transaction records
  • Customers owning multiple bank accounts
  • Directories containing files and subdirectories

Denormalization vs Normalization

Denormalization involves flattening data structures by storing redundant copies directly within documents rather than using traditional joins.

Advantages:

  • Eliminates expensive join operations
  • Improves read performance significantly
  • Elasticsearch compresses the _source field to reduce disk overhead

Disadvantages:

  • Not ideal for frequently updated data
  • Modifying a single value (like a username) may require updating numerous documents

Handling Relationships in Elasticsearch

Relational databases favor normalization, while Elasticsearch typically works better with denormalized data:

  • Faster read operations
  • No table joins required
  • No row-level locks needed

Elasticsearch doesn't handle relationships efficiently by default. Four common approaches exist:

  • Object types
  • Nested objects
  • Parent/child relationships
  • Application-level joins

Example 1: Blog Posts with Author Information

Object Type Approach:

Store author details directly within each blog document. If author information changes, update all related blog documents.

DELETE blog

PUT /blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "posted_at": {
        "type": "date"
      },
      "author": {
        "properties": {
          "location": {
            "type": "text"
          },
          "author_id": {
            "type": "long"
          },
          "display_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT blog/_doc/1
{
  "content": "I like Elasticsearch",
  "posted_at": "2019-01-01T00:00:00",
  "author": {
    "author_id": 1,
    "display_name": "Jack",
    "location": "Shanghai"
  }
}

Retrieve both blog and author information with a single query:

POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "Elasticsearch"
          }
        },
        {
          "match": {
            "author.display_name": "Jack"
          }
        }
      ]
    }
  }
}

Example 2: Documents with Object Arrays

DELETE my_films

PUT my_films
{
  "mappings": {
    "properties": {
      "cast": {
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST my_films/_doc/1
{
  "title": "Speed",
  "cast": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

Searching for first name "Keanu" combined with last name "Hopper", a pairing that exists in no single cast member, still returns the document:

POST my_films/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "cast.first_name": "Keanu"
          }
        },
        {
          "match": {
            "cast.last_name": "Hopper"
          }
        }
      ]
    }
  }
}

Why this produces unexpected results:

  • Internally, Lucene has no notion of inner objects, so Elasticsearch flattens the object array into per-field value arrays, losing the association between values that belonged to the same object (see the sketch below)
  • When a query combines several such fields, values from different array elements can satisfy different clauses, producing false positive matches
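
Conceptually the document is indexed as if it were flattened like this (an illustrative sketch, not actual API output):

{
  "title": "Speed",
  "cast.first_name": ["Keanu", "Dennis"],
  "cast.last_name": ["Reeves", "Hopper"]
}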

The Nested Data Type Solution

What the nested data type provides:

  • Allows objects within arrays to be indexed independently
  • Uses nested type combined with properties to index all actors into separate Lucene documents
  • Internal joins are performed during queries

Creating nested object mappings:

DELETE my_films

PUT my_films
{
  "mappings": {
    "properties": {
      "cast": {
        "type": "nested",
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST my_films/_doc/1
{
  "title": "Speed",
  "cast": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

Nested query:

POST my_films/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Speed"
          }
        },
        {
          "nested": {
            "path": "cast",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "cast.first_name": "Keanu"
                    }
                  },
                  {
                    "match": {
                      "cast.last_name": "Hopper"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
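
Nested queries can also return the matching inner objects via inner_hits; a sketch using a real cast member:

POST my_films/_search
{
  "query": {
    "nested": {
      "path": "cast",
      "query": {
        "bool": {
          "must": [
            { "match": { "cast.first_name": "Keanu" } },
            { "match": { "cast.last_name": "Reeves" } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}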

Nested aggregation:

POST my_films/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": {
        "path": "cast"
      },
      "aggs": {
        "actor_name": {
          "terms": {
            "field": "cast.first_name",
            "size": 10
          }
        }
      }
    }
  }
}

Regular aggregations won't work on nested objects without the nested aggregation wrapper; this query comes back with no buckets:

POST my_films/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "cast.first_name",
        "size": 10
      }
    }
  }
}

Parent/Child Relationships

Limitations of Objects and Nested Objects

  • Every update requires reindexing the entire document, including the root object and all nested objects

Elasticsearch provides a Join datatype similar to relational database joins:

  • Parent and child documents are independent
  • Updating a parent document doesn't require reindexing child documents
  • Adding, updating, or deleting child documents doesn't affect parents or other siblings

Defining Parent/Child Relationships

Steps:

  1. Configure index mapping
  2. Index parent documents
  3. Index child documents
  4. Query as needed

Setting up mappings:

DELETE my_posts

PUT my_posts
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "post_comment_relation": {
        "type": "join",
        "relations": {
          "post": "comment"
        }
      },
      "content": {
        "type": "text"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}

Indexing parent documents:

PUT my_posts/_doc/post1
{
  "title": "Learning Elasticsearch",
  "content": "learning ELK @ geektime",
  "post_comment_relation": {
    "name": "post"
  }
}

PUT my_posts/_doc/post2
{
  "title": "Learning Hadoop",
  "content": "learning Hadoop",
  "post_comment_relation": {
    "name": "post"
  }
}

Indexing child documents (the routing parameter must equal the parent ID so that parent and children land on the same shard):

PUT my_posts/_doc/reply1?routing=post1
{
  "reply_text": "I am learning ELK",
  "username": "Jack",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post1"
  }
}

PUT my_posts/_doc/reply2?routing=post2
{
  "reply_text": "I like Hadoop!!!!!",
  "username": "Jack",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

PUT my_posts/_doc/reply3?routing=post2
{
  "reply_text": "Hello Hadoop",
  "username": "Bob",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

Parent/Child Query Types

Query all documents:

POST my_posts/_search
{}

Parent ID query: Returns all related children for a given parent:

POST my_posts/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "post2"
    }
  }
}

Has Child query: Returns parent documents whose child documents match the query:

POST my_posts/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "username": "Jack"
        }
      }
    }
  }
}

Has Parent query: Returns child documents whose parent matches the query:

POST my_posts/_search
{
  "query": {
    "has_parent": {
      "parent_type": "post",
      "query": {
        "match": {
          "title": "Learning Hadoop"
        }
      }
    }
  }
}

Accessing child documents (the first request comes back empty, because child documents are routed by parent ID and the lookup needs the routing value):

GET my_posts/_doc/reply3
GET my_posts/_doc/reply3?routing=post2

Updating child documents:

PUT my_posts/_doc/reply3?routing=post2
{
  "reply_text": "Hello Hadoop??",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

Nested Objects vs Parent/Child
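
A quick summary of the trade-offs:

  • Nested objects: parent and children live in the same Lucene block, so reads and joins are cheap, but any change reindexes the whole document. Best for read-heavy data that rarely changes.
  • Parent/child: parent and children are separate documents on the same shard, so each can be added, updated, or deleted independently, but queries pay for the join at search time and the join field consumes extra memory. Best when child documents change frequently.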

Update By Query and Reindex API

Common Use Cases

Reindexing becomes necessary when:

  • Index mappings change: field type modifications, analyzer updates
  • Index settings change: primary shard count adjustments
  • Data migration: within or across clusters

Elasticsearch provides two APIs:

  • Update By Query: Rebuilds documents in the existing index
  • Reindex: Rebuilds documents into a different index

Example 1: Adding Sub-fields to an Index

Indexing initial documents:

DELETE articles/

PUT articles/_doc/1
{
  "body": "Hadoop is cool",
  "category": "hadoop"
}

Modifying mappings to add sub-fields with English analyzer:

PUT articles/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

PUT articles/_doc/2
{
  "body": "Elasticsearch rocks",
  "category": "elasticsearch"
}

Query newly indexed documents:

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Elasticsearch"
    }
  }
}

Query pre-mapping-change documents (returns no hits, because document 1 was indexed before the english sub-field existed):

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Hadoop"
    }
  }
}

Execute Update By Query to rebuild every existing document in place; afterwards the query matches:

POST articles/_update_by_query
{}

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Hadoop"
    }
  }
}

Example 2: Changing Existing Field Types

Elasticsearch doesn't allow modifying field types on existing mappings once data exists. The solution requires:

  1. Creating a new index with correct field types
  2. Reimporting the data

Inspect the current mapping:

GET articles/_mapping

Attempting to change the field type in place fails with an error:

PUT articles/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    },
    "category": {
      "type": "keyword"
    }
  }
}

Reindex API:

Copies documents from one index to another. Use cases include:

  • Modifying primary shard count
  • Changing field types
  • Migrating data within or across clusters

DELETE articles_v2

PUT articles_v2/
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}

Migrate data from old index:

POST _reindex
{
  "source": {
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2"
  }
}

Verify the data and run a terms aggregation (which now works, since category is a keyword):

GET articles_v2/_doc/1

POST articles_v2/_search
{
  "size": 0,
  "aggs": {
    "blog_category": {
      "terms": {
        "field": "category",
        "size": 10
      }
    }
  }
}

OP Type: Setting op_type to create makes the reindex copy only documents missing from the destination; IDs that already exist there cause version conflicts:

POST _reindex
{
  "source": {
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2",
    "op_type": "create"
  }
}

Cross-cluster Reindex
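
Reindex can also pull documents from a remote cluster. The remote host must first be whitelisted via reindex.remote.whitelist in elasticsearch.yml; the host below is a placeholder:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2"
  }
}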

Ingest Pipeline and Painless Script

Requirements

Common preprocessing needs:

  • Converting comma-separated tags from strings to arrays
  • Supporting aggregation on tag fields

Ingest Node

Ingest nodes were introduced in Elasticsearch 5.0, and every node is an ingest node by default:

  • Intercepts Index or Bulk API requests for preprocessing
  • Transforms data and returns it to the indexing pipeline

Preprocessing capabilities without Logstash:

  • Setting default field values
  • Renaming fields
  • Splitting field values
  • Custom Painless scripts for complex transformations

Pipeline and Processor:

A pipeline applies a chain of processors to each document in order; each processor encapsulates a single transformation operation.

Elasticsearch includes numerous built-in Processors and supports custom ones via plugins.

Splitting strings with a pipeline (the _simulate endpoint tests a pipeline definition without storing it):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "body": "You know, for cloud"
      }
    }
  ]
}

Adding fields to documents:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "split and enhance blog data",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    }
  ]
}

Pipeline API usage:

DELETE tech_articles

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data"
}

Creating a pipeline:

PUT _ingest/pipeline/article_pipeline
{
  "description": "an article pipeline",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}

GET _ingest/pipeline/article_pipeline

Testing the pipeline:

POST _ingest/pipeline/article_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "body": "You know, for cloud"
      }
    }
  ]
}

Indexing without the pipeline (document 1 keeps tags as a single string) and with it (document 2 gets a tags array plus a views field):

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data"
}

PUT tech_articles/_doc/2?pipeline=article_pipeline
{
  "title": "Introducing cloud computing",
  "tags": "openstack,k8s",
  "body": "You know, for cloud"
}

Query results:

POST tech_articles/_search
{}

Rebuilding existing documents with pipeline:

POST tech_articles/_update_by_query?pipeline=article_pipeline
{}

Adding a query condition so documents that already went through the pipeline are skipped (re-splitting an already-split tags array would make the split processor fail):

POST tech_articles/_update_by_query?pipeline=article_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}

Common built-in Processors:

  • Split Processor: Splits field values into arrays
  • Remove/Rename Processor: Removes or renames fields
  • Append: Adds new values to fields
  • Convert: Changes field types (e.g., string to float)
  • Date/JSON: Date format conversion, string to JSON
  • Date Index Name Processor: Routes documents to time-based indices
  • Fail Processor: Returns custom error messages
  • Foreach Processor: Applies processors to array elements
  • Grok Processor: Parses log formats
  • Gsub/Join/Split: String replacement, array conversions
  • Lowercase/Uppercase: Case transformations
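
A quick _simulate run exercising a few of the processors above; the field names and values are invented for illustration:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "rename": { "field": "cat", "target_field": "category" } },
      { "convert": { "field": "price", "type": "float" } },
      { "lowercase": { "field": "category" } }
    ]
  },
  "docs": [
    { "_source": { "cat": "Hadoop", "price": "9.99" } }
  ]
}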

Painless Scripting

Introduced in Elasticsearch 5.x, Painless is purpose-built for Elasticsearch and extends a subset of Java syntax. From 6.0 onwards it is the default scripting language, and earlier options such as Groovy, JavaScript, and Python are no longer supported.

Painless characteristics:

  • High performance with security features
  • Supports both explicit and dynamic typing
  • Compatible with Java data types and API subsets

Painless use cases:

  • Updating or removing fields
  • Data aggregation operations
  • Script Fields: Pre-computing returned fields
  • Function Score: Modifying document relevance scoring (see the sketch after this list)
  • Ingest Pipeline transformations
  • Reindex and Update By Query operations
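
As a sketch of the Function Score use case, a script_score query that boosts by view count (assuming an index like tech_articles below, where every document carries a numeric views field):

POST tech_articles/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "_score * (doc['views'].value + 1)"
      }
    }
  }
}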

Example 1: Script Processor:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "process blog data",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if (ctx.containsKey("body")) {
            ctx.body_length = ctx.body.length();
          } else {
            ctx.body_length = 0;
          }
"""
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    }
  ]
}

Example 2: Document Update Counter:

DELETE tech_articles

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data",
  "views": 0
}

POST tech_articles/_update/1
{
  "script": {
    "source": "ctx._source.views += params.count",
    "params": {
      "count": 100
    }
  }
}

POST tech_articles/_search
{}

Example 3: Search Script Fields:

GET tech_articles/_search
{
  "script_fields": {
    "random_views": {
      "script": {
        "lang": "painless",
        "source": """
          Random rnd = new Random();
          doc['views'].value + rnd.nextInt(1000);
"""
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

Storing scripts in Cluster State:

POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.count"
  }
}

POST tech_articles/_update/1
{
  "script": {
    "id": "update_views",
    "params": {
      "count": 1000
    }
  }
}

Elasticsearch Data Modeling Examples

What is Data Modeling?

Data modeling is the process of creating data models that abstractly describe the real world:

  • Blog posts / authors / comments
  • Mapping real-world entities to searchable documents

Three stages: Conceptual Model → Logical Model → Data Model

  • Data model: Determines final field definitions based on specific database capabilities
  • Balances functional requirements with performance needs

Field Type Selection

Text vs Keyword:

  • Text: Full-text fields that are analyzed. They don't support aggregation or sorting unless fielddata is enabled.

  • Keyword: For IDs, enumerations, and text that doesn't need tokenization. Ideal for filtering, sorting, and aggregations.

Multi-field types:

  • Dynamically mapped text fields automatically include a keyword sub-field
  • Additional analyzers (English, pinyin, standard) improve search results for human language

Numeric types:

  • Choose the smallest appropriate type (use byte instead of long when sufficient)

Enumerated types:

  • Always use keyword type for better performance, even for numeric values

Other types: Date, boolean, geo information

Search Configuration

If a field needs neither search nor aggregation:

  • Set enabled to false (the field is kept in _source but not indexed at all)

If search isn't needed but aggregation is:

  • Set index to false (drops the inverted index while keeping doc_values)

Configuring search granularity:

  • index_options controls how much detail the inverted index records (docs, frequencies, positions, offsets); disable norms when relevance scoring isn't needed

Aggregation and Sorting

If sorting and aggregation aren't needed:

  • Set doc_values to false (or fielddata to false for text fields)

Frequently updated fields with heavy aggregation use:

  • Set eager_global_ordinals to true, so global ordinals are built at refresh time instead of at query time (the sketch below pulls these options together)
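
A mapping sketch combining the options above; the index and field names are hypothetical:

PUT field_options_demo
{
  "mappings": {
    "properties": {
      "session_data": { "type": "object", "enabled": false },
      "cover_url": { "type": "keyword", "index": false },
      "body": { "type": "text", "norms": false },
      "internal_id": { "type": "keyword", "doc_values": false },
      "status": { "type": "keyword", "eager_global_ordinals": true }
    }
  }
}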

Storage Options

Store field data separately:

  • Set store to true to store raw field content
  • Typically used when _source is disabled

Disable _source: Saves disk space but limits functionality:

  • No visibility into original document
  • Reindex and Update operations become unavailable
  • Kibana discovery features won't work

Practical Data Modeling Example

Optimizing field definitions:

PUT books/_doc/1
{
  "title":"Mastering Elasticsearch 5.0",
  "description":"Master the searching, indexing, and aggregation features in Elasticsearch",
  "author":"Bharvi Dixit",
  "published_date":"2017",
  "cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}

GET books/_mapping

DELETE books

PUT books
{
  "mappings": {
    "properties": {
      "author": {"type": "keyword"},
      "cover_url": {"type": "keyword","index": false},
      "description": {"type": "text"},
      "published_date": {"type": "date"},
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      }
    }
  }
}

A term query against cover_url fails, because the field is mapped with index set to false:

POST books/_search
{
  "query": {
    "term": {
      "cover_url": {
        "value": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
      }
    }
  }
}

Aggregating on cover_url still works, since keyword fields keep doc_values even when they aren't indexed:

POST books/_search
{
  "aggs": {
    "covers": {
      "terms": {
        "field": "cover_url",
        "size": 10
      }
    }
  }
}

Handling large content fields:

DELETE books

PUT books
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "author": {
        "type": "keyword",
        "store": true
      },
      "cover_url": {
        "type": "keyword",
        "index": false,
        "store": true
      },
      "description": {
        "type": "text",
        "store": true
      },
      "content": {
        "type": "text",
        "store": true
      },
      "published_date": {
        "type": "date",
        "store": true
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        },
        "store": true
      }
    }
  }
}

PUT books/_doc/1
{
  "title": "Mastering Elasticsearch 5.0",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch",
  "content": "The content of the book......Indexing data, aggregation, searching.",
  "author": "Bharvi Dixit",
  "published_date": "2017",
  "cover_url": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}

POST books/_search
{}

POST books/_search
{
  "stored_fields": ["title","author","published_date"],
  "query": {
    "match": {
      "content": "searching"
    }
  },
  "highlight": {
    "fields": {
      "content":{}
    }
  }
}

Mapping Parameters Reference

  • enabled: Set to false for storage-only fields without search or aggregation support
  • index: Set to false to disable search while preserving aggregation capability
  • norms: Disable when relevance scoring isn't needed, e.g. for fields used only for filtering and aggregation
  • doc_values: Enable for sorting and aggregation
  • fielddata: Enable for text field sorting and aggregation
  • store: Default false; store raw field content separately from _source
  • coerce: Default true; enables automatic type conversion (e.g., string to number)
  • dynamic: true/false/strict controls automatic mapping updates (example below)
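
For example, with dynamic set to strict, a document containing an unmapped field is rejected (the index name here is hypothetical):

PUT strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

PUT strict_demo/_doc/1
{
  "title": "ok",
  "new_field": "this write fails with strict_dynamic_mapping_exception"
}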

Script-based field updates:

POST legislation/_update_by_query
{
  "track_total_hits": true,
  "query": {
    "term": {
      "source_type": {
        "value": "migrate"
      }
    }
  },
  "script": {
    "source": "ctx._source.norm_citation = ctx._source.enactment_citation"
  }
}

Modifying document IDs during reindex:

POST _reindex
{
  "source": {
    "index": "legislation_clean_dev"
  },
  "dest": {
    "index": "legislation_clean_dev_test"
  },
  "script": {
    "inline": "ctx._id = ctx._source['object_id']",
    "lang": "painless"
  }
}
