
Data Modeling in Elasticsearch: Objects, Relationships, and Pipelines

Objects and Nested Objects

Relational Data in the Real World

Many real-world scenarios involve complex relationships between entities:

  • Blog posts linked to authors and comments
  • Bank accounts with multiple transaction records
  • Customers owning multiple bank accounts
  • Directories containing files and subdirectories

Denormalization vs Normalization

Denormalization involves flattening data structures by storing redundant copies directly within documents rather than using traditional joins.

Advantages:

  • Eliminates expensive join operations
  • Improves read performance significantly
  • Elasticsearch compresses the _source field to reduce disk overhead

Disadvantages:

  • Not ideal for frequently updated data
  • Modifying a single value (like a username) may require updating numerous documents

Handling Relationships in Elasticsearch

Relational databases favor normalization, while Elasticsearch typically works better with denormalized data:

  • Faster read operations
  • No table joins required
  • No row-level locks needed

Elasticsearch doesn't handle relationships efficiently by default. Four common approaches exist:

  • Object types
  • Nested objects
  • Parent/child relationships
  • Application-level joins

Example 1: Blog Posts with Author Information

Object Type Approach:

Store author details directly within each blog document. If author information changes, update all related blog documents.

DELETE blog

PUT /blog
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text"
      },
      "posted_at": {
        "type": "date"
      },
      "author": {
        "properties": {
          "location": {
            "type": "text"
          },
          "author_id": {
            "type": "long"
          },
          "display_name": {
            "type": "keyword"
          }
        }
      }
    }
  }
}

PUT blog/_doc/1
{
  "content": "I like Elasticsearch",
  "posted_at": "2019-01-01T00:00:00",
  "author": {
    "author_id": 1,
    "display_name": "Jack",
    "location": "Shanghai"
  }
}

Retrieve both blog and author information with a single query:

POST blog/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "content": "Elasticsearch"
          }
        },
        {
          "match": {
            "author.display_name": "Jack"
          }
        }
      ]
    }
  }
}

Example 2: Documents with Object Arrays

DELETE my_films

PUT my_films
{
  "mappings": {
    "properties": {
      "cast": {
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST my_films/_doc/1
{
  "title": "Speed",
  "cast": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

Searching for first name "Keanu" combined with last name "Hopper", a pairing that exists in no single cast member, still returns the document:

POST my_films/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "cast.first_name": "Keanu"
          }
        },
        {
          "match": {
            "cast.last_name": "Hopper"
          }
        }
      ]
    }
  }
}

Why this produces unexpected results:

  • Internally, Lucene has no notion of inner objects, so Elasticsearch flattens the object array into per-field value arrays, losing the association between values that belonged to the same object (see the sketch below)
  • When a query combines several such fields, values from different array elements can satisfy different clauses, producing false positive matches
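
Conceptually the document is indexed as if it were flattened like this (an illustrative sketch, not actual API output):

{
  "title": "Speed",
  "cast.first_name": ["Keanu", "Dennis"],
  "cast.last_name": ["Reeves", "Hopper"]
}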

The Nested Data Type Solution

What the nested data type provides:

  • Allows objects within arrays to be indexed independently
  • Uses nested type combined with properties to index all actors into separate Lucene documents
  • Internal joins are performed during queries

Creating nested object mappings:

DELETE my_films

PUT my_films
{
  "mappings": {
    "properties": {
      "cast": {
        "type": "nested",
        "properties": {
          "first_name": {
            "type": "keyword"
          },
          "last_name": {
            "type": "keyword"
          }
        }
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST my_films/_doc/1
{
  "title": "Speed",
  "cast": [
    {
      "first_name": "Keanu",
      "last_name": "Reeves"
    },
    {
      "first_name": "Dennis",
      "last_name": "Hopper"
    }
  ]
}

Nested query:

POST my_films/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "Speed"
          }
        },
        {
          "nested": {
            "path": "cast",
            "query": {
              "bool": {
                "must": [
                  {
                    "match": {
                      "cast.first_name": "Keanu"
                    }
                  },
                  {
                    "match": {
                      "cast.last_name": "Hopper"
                    }
                  }
                ]
              }
            }
          }
        }
      ]
    }
  }
}
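
Nested queries can also return the matching inner objects via inner_hits; a sketch using a real cast member:

POST my_films/_search
{
  "query": {
    "nested": {
      "path": "cast",
      "query": {
        "bool": {
          "must": [
            { "match": { "cast.first_name": "Keanu" } },
            { "match": { "cast.last_name": "Reeves" } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}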

Nested aggregation:

POST my_films/_search
{
  "size": 0,
  "aggs": {
    "actors": {
      "nested": {
        "path": "cast"
      },
      "aggs": {
        "actor_name": {
          "terms": {
            "field": "cast.first_name",
            "size": 10
          }
        }
      }
    }
  }
}

Regular aggregations won't work on nested objects without the nested aggregation wrapper; this query comes back with no buckets:

POST my_films/_search
{
  "size": 0,
  "aggs": {
    "NAME": {
      "terms": {
        "field": "cast.first_name",
        "size": 10
      }
    }
  }
}

Parent/Child Relationships

Limitations of Objects and Nested Objects

  • Every update requires reindexing the entire document, including the root object and all nested objects

Elasticsearch provides a Join datatype similar to relational database joins:

  • Parent and child documents are independent
  • Updating a parent document doesn't require reindexing child documents
  • Adding, updating, or deleting child documents doesn't affect parents or other siblings

Defining Parent/Child Relationships

Steps:

  1. Configure index mapping
  2. Index parent documents
  3. Index child documents
  4. Query as needed

Setting up mappings:

DELETE my_posts

PUT my_posts
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "properties": {
      "post_comment_relation": {
        "type": "join",
        "relations": {
          "post": "comment"
        }
      },
      "content": {
        "type": "text"
      },
      "title": {
        "type": "keyword"
      }
    }
  }
}

Indexing parent documents:

PUT my_posts/_doc/post1
{
  "title": "Learning Elasticsearch",
  "content": "learning ELK @ geektime",
  "post_comment_relation": {
    "name": "post"
  }
}

PUT my_posts/_doc/post2
{
  "title": "Learning Hadoop",
  "content": "learning Hadoop",
  "post_comment_relation": {
    "name": "post"
  }
}

Indexing child documents (the routing parameter must equal the parent ID so that parent and children land on the same shard):

PUT my_posts/_doc/reply1?routing=post1
{
  "reply_text": "I am learning ELK",
  "username": "Jack",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post1"
  }
}

PUT my_posts/_doc/reply2?routing=post2
{
  "reply_text": "I like Hadoop!!!!!",
  "username": "Jack",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

PUT my_posts/_doc/reply3?routing=post2
{
  "reply_text": "Hello Hadoop",
  "username": "Bob",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

Parent/Child Query Types

Query all documents:

POST my_posts/_search
{}

Parent ID query: Returns all related children for a given parent:

POST my_posts/_search
{
  "query": {
    "parent_id": {
      "type": "comment",
      "id": "post2"
    }
  }
}

Has Child query: Returns parent documents whose child documents match the query:

POST my_posts/_search
{
  "query": {
    "has_child": {
      "type": "comment",
      "query": {
        "match": {
          "username": "Jack"
        }
      }
    }
  }
}

Has Parent query: Returns child documents whose parent matches the query:

POST my_posts/_search
{
  "query": {
    "has_parent": {
      "parent_type": "post",
      "query": {
        "match": {
          "title": "Learning Hadoop"
        }
      }
    }
  }
}

Accessing child documents (the first request comes back empty, because child documents are routed by parent ID and the lookup needs the routing value):

GET my_posts/_doc/reply3
GET my_posts/_doc/reply3?routing=post2

Updating child documents:

PUT my_posts/_doc/reply3?routing=post2
{
  "reply_text": "Hello Hadoop??",
  "post_comment_relation": {
    "name": "comment",
    "parent": "post2"
  }
}

Nested Objects vs Parent/Child
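
A quick summary of the trade-offs:

  • Nested objects: parent and children live in the same Lucene block, so reads and joins are cheap, but any change reindexes the whole document. Best for read-heavy data that rarely changes.
  • Parent/child: parent and children are separate documents on the same shard, so each can be added, updated, or deleted independently, but queries pay for the join at search time and the join field consumes extra memory. Best when child documents change frequently.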

Update By Query and Reindex API

Common Use Cases

Reindexing becomes necessary when:

  • Index mappings change: field type modifications, analyzer updates
  • Index settings change: primary shard count adjustments
  • Data migration: within or across clusters

Elasticsearch provides two APIs:

  • Update By Query: Rebuilds documents in the existing index
  • Reindex: Rebuilds documents into a different index

Example 1: Adding Sub-fields to an Index

Indexing initial documents:

DELETE articles/

PUT articles/_doc/1
{
  "body": "Hadoop is cool",
  "category": "hadoop"
}

Modifying mappings to add sub-fields with English analyzer:

PUT articles/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    }
  }
}

PUT articles/_doc/2
{
  "body": "Elasticsearch rocks",
  "category": "elasticsearch"
}

Query newly indexed documents:

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Elasticsearch"
    }
  }
}

Query pre-mapping-change documents (returns no hits, because document 1 was indexed before the english sub-field existed):

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Hadoop"
    }
  }
}

Execute Update By Query to rebuild every existing document in place; afterwards the query matches:

POST articles/_update_by_query
{}

POST articles/_search
{
  "query": {
    "match": {
      "body.english": "Hadoop"
    }
  }
}

Example 2: Changing Existing Field Types

Elasticsearch doesn't allow modifying field types on existing mappings once data exists. The solution requires:

  1. Creating a new index with correct field types
  2. Reimporting the data

Inspect the current mapping:

GET articles/_mapping

Attempting to change the field type in place fails with an error:

PUT articles/_mapping
{
  "properties": {
    "body": {
      "type": "text",
      "fields": {
        "english": {
          "type": "text",
          "analyzer": "english"
        }
      }
    },
    "category": {
      "type": "keyword"
    }
  }
}

Reindex API:

Copies documents from one index to another. Use cases include:

  • Modifying primary shard count
  • Changing field types
  • Migrating data within or across clusters

DELETE articles_v2

PUT articles_v2/
{
  "mappings": {
    "properties": {
      "body": {
        "type": "text",
        "fields": {
          "english": {
            "type": "text",
            "analyzer": "english"
          }
        }
      },
      "category": {
        "type": "keyword"
      }
    }
  }
}

Migrate data from old index:

POST _reindex
{
  "source": {
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2"
  }
}

Verify the data and run a terms aggregation (which now works, since category is a keyword):

GET articles_v2/_doc/1

POST articles_v2/_search
{
  "size": 0,
  "aggs": {
    "blog_category": {
      "terms": {
        "field": "category",
        "size": 10
      }
    }
  }
}

OP Type: Setting op_type to create makes the reindex copy only documents missing from the destination; IDs that already exist there cause version conflicts:

POST _reindex
{
  "source": {
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2",
    "op_type": "create"
  }
}

Cross-cluster Reindex
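
Reindex can also pull documents from a remote cluster. The remote host must first be whitelisted via reindex.remote.whitelist in elasticsearch.yml; the host below is a placeholder:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "articles"
  },
  "dest": {
    "index": "articles_v2"
  }
}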

Ingest Pipeline and Painless Script

Requirements

Common preprocessing needs:

  • Converting comma-separated tags from strings to arrays
  • Supporting aggregation on tag fields

Ingest Node

Ingest nodes were introduced in Elasticsearch 5.0, and every node is an ingest node by default:

  • Intercepts Index or Bulk API requests for preprocessing
  • Transforms data and returns it to the indexing pipeline

Preprocessing capabilities without Logstash:

  • Setting default field values
  • Renaming fields
  • Splitting field values
  • Custom Painless scripts for complex transformations

Pipeline and Processor:

A pipeline applies a chain of processors to each document in order; each processor encapsulates a single transformation operation.

Elasticsearch includes numerous built-in Processors and supports custom ones via plugins.

Splitting strings with a pipeline (the _simulate endpoint tests a pipeline definition without storing it):

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "split blog tags",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    },
    {
      "_index": "index",
      "_id": "idxx",
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "body": "You know, for cloud"
      }
    }
  ]
}

Adding fields to documents:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "split and enhance blog data",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    }
  ]
}

Pipeline API usage:

DELETE tech_articles

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data"
}

Creating a pipeline:

PUT _ingest/pipeline/article_pipeline
{
  "description": "an article pipeline",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ","
      }
    },
    {
      "set": {
        "field": "views",
        "value": 0
      }
    }
  ]
}

GET _ingest/pipeline/article_pipeline

Testing the pipeline:

POST _ingest/pipeline/article_pipeline/_simulate
{
  "docs": [
    {
      "_source": {
        "title": "Introducing cloud computing",
        "tags": "openstack,k8s",
        "body": "You know, for cloud"
      }
    }
  ]
}

Indexing without the pipeline (document 1 keeps tags as a single string) and with it (document 2 gets a tags array plus a views field):

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data"
}

PUT tech_articles/_doc/2?pipeline=article_pipeline
{
  "title": "Introducing cloud computing",
  "tags": "openstack,k8s",
  "body": "You know, for cloud"
}

Query results:

POST tech_articles/_search
{}

Rebuilding existing documents with pipeline:

POST tech_articles/_update_by_query?pipeline=article_pipeline
{}

Adding a query condition so documents that already went through the pipeline are skipped (re-splitting an already-split tags array would make the split processor fail):

POST tech_articles/_update_by_query?pipeline=article_pipeline
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {
          "field": "views"
        }
      }
    }
  }
}

Common built-in Processors:

  • Split Processor: Splits field values into arrays
  • Remove/Rename Processor: Removes or renames fields
  • Append: Adds new values to fields
  • Convert: Changes field types (e.g., string to float)
  • Date/JSON: Date format conversion, string to JSON
  • Date Index Name Processor: Routes documents to time-based indices
  • Fail Processor: Returns custom error messages
  • Foreach Processor: Applies processors to array elements
  • Grok Processor: Parses log formats
  • Gsub/Join/Split: String replacement, array conversions
  • Lowercase/Uppercase: Case transformations
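
A quick _simulate run exercising a few of the processors above; the field names and values are invented for illustration:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      { "rename": { "field": "cat", "target_field": "category" } },
      { "convert": { "field": "price", "type": "float" } },
      { "lowercase": { "field": "category" } }
    ]
  },
  "docs": [
    { "_source": { "cat": "Hadoop", "price": "9.99" } }
  ]
}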

Painless Scripting

Introduced in Elasticsearch 5.x, Painless is purpose-built for Elasticsearch and extends a subset of Java syntax. From 6.0 onwards it is the default scripting language, and earlier options such as Groovy, JavaScript, and Python are no longer supported.

Painless characteristics:

  • High performance with security features
  • Supports both explicit and dynamic typing
  • Compatible with Java data types and API subsets

Painless use cases:

  • Updating or removing fields
  • Data aggregation operations
  • Script Fields: Pre-computing returned fields
  • Function Score: Modifying document relevance scoring (see the sketch after this list)
  • Ingest Pipeline transformations
  • Reindex and Update By Query operations
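
As a sketch of the Function Score use case, a script_score query that boosts by view count (assuming an index like tech_articles below, where every document carries a numeric views field):

POST tech_articles/_search
{
  "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "_score * (doc['views'].value + 1)"
      }
    }
  }
}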

Example 1: Script Processor:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "process blog data",
    "processors": [
      {
        "split": {
          "field": "tags",
          "separator": ","
        }
      },
      {
        "script": {
          "source": """
          if (ctx.containsKey("body")) {
            ctx.body_length = ctx.body.length();
          } else {
            ctx.body_length = 0;
          }
"""
        }
      },
      {
        "set": {
          "field": "views",
          "value": 0
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "title": "Introducing big data......",
        "tags": "hadoop,elasticsearch,spark",
        "body": "You know, for big data"
      }
    }
  ]
}

Example 2: Document Update Counter:

DELETE tech_articles

PUT tech_articles/_doc/1
{
  "title": "Introducing big data......",
  "tags": "hadoop,elasticsearch,spark",
  "body": "You know, for big data",
  "views": 0
}

POST tech_articles/_update/1
{
  "script": {
    "source": "ctx._source.views += params.count",
    "params": {
      "count": 100
    }
  }
}

POST tech_articles/_search
{}

Example 3: Search Script Fields:

GET tech_articles/_search
{
  "script_fields": {
    "random_views": {
      "script": {
        "lang": "painless",
        "source": """
          Random rnd = new Random();
          doc['views'].value + rnd.nextInt(1000);
"""
      }
    }
  },
  "query": {
    "match_all": {}
  }
}

Storing scripts in Cluster State:

POST _scripts/update_views
{
  "script": {
    "lang": "painless",
    "source": "ctx._source.views += params.count"
  }
}

POST tech_articles/_update/1
{
  "script": {
    "id": "update_views",
    "params": {
      "count": 1000
    }
  }
}

Elasticsearch Data Modeling Examples

What is Data Modeling?

Data modeling is the process of creating data models that abstractly describe the real world:

  • Blog posts / authors / comments
  • Mapping real-world entities to searchable documents

Three stages: Conceptual Model → Logical Model → Data Model

  • Data model: Determines final field definitions based on specific database capabilities
  • Balances functional requirements with performance needs

Field Type Selection

Text vs Keyword:

  • Text: Full-text fields that are analyzed. They don't support aggregation or sorting unless fielddata is enabled.

  • Keyword: For IDs, enumerations, and text that doesn't need tokenization. Ideal for filtering, sorting, and aggregations.

Multi-field types:

  • Dynamically mapped text fields automatically include a keyword sub-field
  • Additional analyzers (English, pinyin, standard) improve search results for human language

Numeric types:

  • Choose the smallest appropriate type (use byte instead of long when sufficient)

Enumerated types:

  • Always use keyword type for better performance, even for numeric values

Other types: Date, boolean, geo information

Search Configuration

If a field needs neither search nor aggregation:

  • Set enabled to false (the field is kept in _source but not indexed at all)

If search isn't needed but aggregation is:

  • Set index to false (drops the inverted index while keeping doc_values)

Configuring search granularity:

  • index_options controls how much detail the inverted index records (docs, frequencies, positions, offsets); disable norms when relevance scoring isn't needed

Aggregation and Sorting

If sorting and aggregation aren't needed:

  • Set doc_values to false (or fielddata to false for text fields)

Frequently updated fields with heavy aggregation use:

  • Set eager_global_ordinals to true, so global ordinals are built at refresh time instead of at query time (the sketch below pulls these options together)
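
A mapping sketch combining the options above; the index and field names are hypothetical:

PUT field_options_demo
{
  "mappings": {
    "properties": {
      "session_data": { "type": "object", "enabled": false },
      "cover_url": { "type": "keyword", "index": false },
      "body": { "type": "text", "norms": false },
      "internal_id": { "type": "keyword", "doc_values": false },
      "status": { "type": "keyword", "eager_global_ordinals": true }
    }
  }
}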

Storage Options

Store field data separately:

  • Set store to true to store raw field content
  • Typically used when _source is disabled

Disable _source: Saves disk space but limits functionality:

  • No visibility into original document
  • Reindex and Update operations become unavailable
  • Kibana discovery features won't work

Practical Data Modeling Example

Optimizing field definitions:

PUT books/_doc/1
{
  "title":"Mastering Elasticsearch 5.0",
  "description":"Master the searching, indexing, and aggregation features in Elasticsearch",
  "author":"Bharvi Dixit",
  "published_date":"2017",
  "cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}

GET books/_mapping

DELETE books

PUT books
{
  "mappings": {
    "properties": {
      "author": {"type": "keyword"},
      "cover_url": {"type": "keyword","index": false},
      "description": {"type": "text"},
      "published_date": {"type": "date"},
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        }
      }
    }
  }
}

A term query against cover_url fails, because the field is mapped with index set to false:

POST books/_search
{
  "query": {
    "term": {
      "cover_url": {
        "value": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
      }
    }
  }
}

Aggregating on cover_url still works, since keyword fields keep doc_values even when they aren't indexed:

POST books/_search
{
  "aggs": {
    "covers": {
      "terms": {
        "field": "cover_url",
        "size": 10
      }
    }
  }
}

Handling large content fields:

DELETE books

PUT books
{
  "mappings": {
    "_source": {
      "enabled": false
    },
    "properties": {
      "author": {
        "type": "keyword",
        "store": true
      },
      "cover_url": {
        "type": "keyword",
        "index": false,
        "store": true
      },
      "description": {
        "type": "text",
        "store": true
      },
      "content": {
        "type": "text",
        "store": true
      },
      "published_date": {
        "type": "date",
        "store": true
      },
      "title": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 100
          }
        },
        "store": true
      }
    }
  }
}

PUT books/_doc/1
{
  "title": "Mastering Elasticsearch 5.0",
  "description": "Master the searching, indexing, and aggregation features in Elasticsearch",
  "content": "The content of the book......Indexing data, aggregation, searching.",
  "author": "Bharvi Dixit",
  "published_date": "2017",
  "cover_url": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}

POST books/_search
{}

POST books/_search
{
  "stored_fields": ["title","author","published_date"],
  "query": {
    "match": {
      "content": "searching"
    }
  },
  "highlight": {
    "fields": {
      "content":{}
    }
  }
}

Mapping Parameters Reference

  • enabled: Set to false for storage-only fields without search or aggregation support
  • index: Set to false to disable search while preserving aggregation capability
  • norms: Disable when relevance scoring isn't needed, e.g. for fields used only for filtering and aggregation
  • doc_values: Enable for sorting and aggregation
  • fielddata: Enable for text field sorting and aggregation
  • store: Default false; store raw field content separately from _source
  • coerce: Default true; enables automatic type conversion (e.g., string to number)
  • dynamic: true/false/strict controls automatic mapping updates (example below)
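
For example, with dynamic set to strict, a document containing an unmapped field is rejected (the index name here is hypothetical):

PUT strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

PUT strict_demo/_doc/1
{
  "title": "ok",
  "new_field": "this write fails with strict_dynamic_mapping_exception"
}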

Script-based field updates:

POST legislation/_update_by_query
{
  "track_total_hits": true,
  "query": {
    "term": {
      "source_type": {
        "value": "migrate"
      }
    }
  },
  "script": {
    "source": "ctx._source.norm_citation = ctx._source.enactment_citation"
  }
}

Modifying document IDs during reindex:

POST _reindex
{
  "source": {
    "index": "legislation_clean_dev"
  },
  "dest": {
    "index": "legislation_clean_dev_test"
  },
  "script": {
    "inline": "ctx._id = ctx._source['object_id']",
    "lang": "painless"
  }
}
