Data Modeling in Elasticsearch: Objects, Relationships, and Pipelines
Objects and Nested Objects
Relational Data in the Real World
Many real-world scenarios involve complex relationships between entities:
- Blog posts linked to authors and comments
- Bank accounts with multiple transaction records
- Customers owning multiple bank accounts
- Directories containing files and subdirectories
Denormalization vs Normalization
Denormalization involves flattening data structures by storing redundant copies directly within documents rather than using traditional joins.
Advantages:
- Eliminates expensive join operations
- Improves read performance significantly
- Elasticsearch compresses the _source field to reduce disk overhead
Disadvantages:
- Not ideal for frequently updated data
- Modifying a single value (like a username) may require updating numerous documents
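The update cost shows up concretely with Update By Query: renaming a denormalized author means rewriting every document that embeds the copy. A sketch against the blog index used later in this section (the new display name is made up):

```
POST blog/_update_by_query
{
  "query": {
    "term": {
      "author.author_id": 1
    }
  },
  "script": {
    "source": "ctx._source.author.display_name = params.name",
    "params": {
      "name": "Jack Chen"
    }
  }
}
```

Every matching document gets reindexed, which is why denormalization suits read-heavy, rarely updated data.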
Handling Relationships in Elasticsearch
Relational databases favor normalization, while Elasticsearch typically works better with denormalized data:
- Faster read operations
- No table joins required
- No row-level locks needed
Elasticsearch doesn't handle relationships efficiently by default. Four common approaches exist:
- Object types
- Nested objects
- Parent/child relationships
- Application-level joins
Example 1: Blog Posts with Author Information
Object Type Approach:
Store author details directly within each blog document. If author information changes, update all related blog documents.
DELETE blog
PUT /blog
{
"mappings": {
"properties": {
"content": {
"type": "text"
},
"posted_at": {
"type": "date"
},
"author": {
"properties": {
"location": {
"type": "text"
},
"author_id": {
"type": "long"
},
"display_name": {
"type": "keyword"
}
}
}
}
}
}
PUT blog/_doc/1
{
"content": "I like Elasticsearch",
"posted_at": "2019-01-01T00:00:00",
"author": {
"author_id": 1,
"display_name": "Jack",
"location": "Shanghai"
}
}
Retrieve both blog and author information with a single query:
POST blog/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "Elasticsearch"
}
},
{
"match": {
"author.display_name": "Jack"
}
}
]
}
}
}
Example 2: Documents with Object Arrays
DELETE my_films
PUT my_films
{
"mappings": {
"properties": {
"cast": {
"properties": {
"first_name": {
"type": "keyword"
},
"last_name": {
"type": "keyword"
}
}
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
POST my_films/_doc/1
{
"title": "Speed",
"cast": [
{
"first_name": "Keanu",
"last_name": "Reeves"
},
{
"first_name": "Dennis",
"last_name": "Hopper"
}
]
}
Searching documents with object arrays:
POST my_films/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"cast.first_name": "Keanu"
}
},
{
"match": {
"cast.last_name": "Hopper"
}
}
]
}
}
}
Why this produces unexpected results:
- Internally, Elasticsearch flattens an array of objects into arrays of values per field, losing the boundaries between the individual objects
- A query across multiple fields can then match values drawn from different objects, producing false positives
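Conceptually, Lucene stores the document above as flat arrays of values per field, roughly like this (an illustration of the internal view, not an actual API response):

```
{
  "title": "Speed",
  "cast.first_name": ["Keanu", "Dennis"],
  "cast.last_name": ["Reeves", "Hopper"]
}
```

The pairing between Keanu and Reeves is gone, so the query above matches: "Keanu" satisfies one clause and "Hopper" the other, even though no single actor has that name.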
The Nested Data Type Solution
What is Nested Data Type:
- Allows objects within arrays to be indexed independently
- Uses the nested type combined with properties to index each actor as a separate hidden Lucene document
- Internal joins are performed at query time
Creating nested object mappings:
DELETE my_films
PUT my_films
{
"mappings": {
"properties": {
"cast": {
"type": "nested",
"properties": {
"first_name": {
"type": "keyword"
},
"last_name": {
"type": "keyword"
}
}
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
POST my_films/_doc/1
{
"title": "Speed",
"cast": [
{
"first_name": "Keanu",
"last_name": "Reeves"
},
{
"first_name": "Dennis",
"last_name": "Hopper"
}
]
}
Nested query:
POST my_films/_search
{
"query": {
"bool": {
"must": [
{
"match": {
"title": "Speed"
}
},
{
"nested": {
"path": "cast",
"query": {
"bool": {
"must": [
{
"match": {
"cast.first_name": "Keanu"
}
},
{
"match": {
"cast.last_name": "Hopper"
}
}
]
}
}
}
}
]
}
}
}
Nested aggregation:
POST my_films/_search
{
"size": 0,
"aggs": {
"actors": {
"nested": {
"path": "cast"
},
"aggs": {
"actor_name": {
"terms": {
"field": "cast.first_name",
"size": 10
}
}
}
}
}
}
A regular terms aggregation on a nested field returns empty buckets unless it is wrapped in a nested aggregation:
POST my_films/_search
{
"size": 0,
"aggs": {
"NAME": {
"terms": {
"field": "cast.first_name",
"size": 10
}
}
}
}
Parent/Child Relationships
Limitations of Objects and Nested Objects
- Any update requires reindexing the entire document, root and nested objects included
Elasticsearch provides a Join datatype similar to relational database joins:
- Parent and child documents are independent
- Updating a parent document doesn't require reindexing child documents
- Adding, updating, or deleting child documents doesn't affect parents or other siblings
Defining Parent/Child Relationships
Steps:
- Configure index mapping
- Index parent documents
- Index child documents
- Query as needed
Setting up mappings:
DELETE my_posts
PUT my_posts
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"properties": {
"post_comment_relation": {
"type": "join",
"relations": {
"post": "comment"
}
},
"content": {
"type": "text"
},
"title": {
"type": "keyword"
}
}
}
}
Indexing parent documents:
PUT my_posts/_doc/post1
{
"title": "Learning Elasticsearch",
"content": "learning ELK @ geektime",
"post_comment_relation": {
"name": "post"
}
}
PUT my_posts/_doc/post2
{
"title": "Learning Hadoop",
"content": "learning Hadoop",
"post_comment_relation": {
"name": "post"
}
}
Indexing child documents:
PUT my_posts/_doc/reply1?routing=post1
{
"reply_text": "I am learning ELK",
"username": "Jack",
"post_comment_relation": {
"name": "comment",
"parent": "post1"
}
}
PUT my_posts/_doc/reply2?routing=post2
{
"reply_text": "I like Hadoop!!!!!",
"username": "Jack",
"post_comment_relation": {
"name": "comment",
"parent": "post2"
}
}
PUT my_posts/_doc/reply3?routing=post2
{
"reply_text": "Hello Hadoop",
"username": "Bob",
"post_comment_relation": {
"name": "comment",
"parent": "post2"
}
}
Parent/Child Query Types
Query all documents:
POST my_posts/_search
{}
Parent ID query: Returns all related children for a given parent:
POST my_posts/_search
{
"query": {
"parent_id": {
"type": "comment",
"id": "post2"
}
}
}
Has Child query: Returns parent documents whose child documents match the query:
POST my_posts/_search
{
"query": {
"has_child": {
"type": "comment",
"query": {
"match": {
"username": "Jack"
}
}
}
}
}
Has Parent query: Returns child documents whose parent matches the query:
POST my_posts/_search
{
"query": {
"has_parent": {
"parent_type": "post",
"query": {
"match": {
"title": "Learning Hadoop"
}
}
}
}
}
Accessing child documents (the first GET, issued without routing, cannot find the document, because child documents are stored on the parent's shard):
GET my_posts/_doc/reply3
GET my_posts/_doc/reply3?routing=post2
Updating child documents:
PUT my_posts/_doc/reply3?routing=post2
{
"reply_text": "Hello Hadoop??",
"post_comment_relation": {
"name": "comment",
"parent": "post2"
}
}
Nested Objects vs Parent/Child
- Nested objects: parent and children live in the same document, so reads are fast, but any change reindexes the whole document. Suited to data that is queried often and updated rarely.
- Parent/child: parent and children are separate documents, so each side can be updated independently, but queries pay the cost of a join. Suited to relationships with frequent updates.
Update By Query and Reindex API
Common Use Cases
Reindexing becomes necessary when:
- Index mappings change: field type modifications, analyzer updates
- Index settings change: primary shard count adjustments
- Data migration: within or across clusters
Elasticsearch provides two APIs:
- Update By Query: Rebuilds documents in the existing index
- Reindex: Rebuilds documents into a different index
Example 1: Adding Sub-fields to an Index
Indexing initial documents:
DELETE articles/
PUT articles/_doc/1
{
"body": "Hadoop is cool",
"category": "hadoop"
}
Modifying mappings to add sub-fields with English analyzer:
PUT articles/_mapping
{
"properties": {
"body": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
}
}
}
PUT articles/_doc/2
{
"body": "Elasticsearch rocks",
"category": "elasticsearch"
}
Query newly indexed documents:
POST articles/_search
{
"query": {
"match": {
"body.english": "Elasticsearch"
}
}
}
Querying documents indexed before the mapping change returns no hits, because document 1 was never indexed into the new body.english sub-field:
POST articles/_search
{
"query": {
"match": {
"body.english": "Hadoop"
}
}
}
Execute Update By Query to resolve the issue:
POST articles/_update_by_query
{}
POST articles/_search
{
"query": {
"match": {
"body.english": "Hadoop"
}
}
}
Example 2: Changing Existing Field Types
Elasticsearch doesn't allow modifying field types on existing mappings once data exists. The solution requires:
- Creating a new index with correct field types
- Reimporting the data
Attempting to change category (dynamically mapped as text) to keyword on the existing index fails with a mapper conflict:
GET articles/_mapping
PUT articles/_mapping
{
"properties": {
"body": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
},
"category": {
"type": "keyword"
}
}
}
Reindex API:
Copies documents from one index to another. Use cases include:
- Modifying primary shard count
- Changing field types
- Migrating data within or across clusters
DELETE articles_v2
PUT articles_v2/
{
"mappings": {
"properties": {
"body": {
"type": "text",
"fields": {
"english": {
"type": "text",
"analyzer": "english"
}
}
},
"category": {
"type": "keyword"
}
}
}
}
Migrate data from old index:
POST _reindex
{
"source": {
"index": "articles"
},
"dest": {
"index": "articles_v2"
}
}
Verify term aggregation:
GET articles_v2/_doc/1
POST articles_v2/_search
{
"size": 0,
"aggs": {
"blog_category": {
"terms": {
"field": "category",
"size": 10
}
}
}
}
op_type: Setting op_type to create indexes only documents missing from the destination; IDs that already exist return version conflicts:
POST _reindex
{
"source": {
"index": "articles"
},
"dest": {
"index": "articles_v2",
"op_type": "create"
}
}
Cross-cluster Reindex
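Reindex can also pull documents from another cluster. The remote host must first be added to reindex.remote.whitelist in elasticsearch.yml on the destination cluster; the host, credentials, and index names below are placeholders:

```
# elasticsearch.yml on the destination cluster:
# reindex.remote.whitelist: "otherhost:9200"

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "changeme"
    },
    "index": "articles"
  },
  "dest": {
    "index": "articles_from_remote"
  }
}
```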
Ingest Pipeline and Painless Script
Requirements
Common preprocessing needs:
- Converting comma-separated tags from strings to arrays
- Supporting aggregation on tag fields
Ingest Node
Introduced in Elasticsearch 5.0, each node is an Ingest Node by default:
- Intercepts Index or Bulk API requests for preprocessing
- Transforms data and returns it to the indexing pipeline
Preprocessing capabilities without Logstash:
- Setting default field values
- Renaming fields
- Splitting field values
- Custom Painless scripts for complex transformations
Pipeline and Processor:
A Pipeline applies a chain of Processors to each document in order; each Processor wraps a single transformation step.
Elasticsearch includes numerous built-in Processors and supports custom ones via plugins.
Splitting strings with Pipeline:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split blog tags",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data"
}
},
{
"_index": "index",
"_id": "idxx",
"_source": {
"title": "Introducing cloud computing",
"tags": "openstack,k8s",
"body": "You know, for cloud"
}
}
]
}
Adding fields to documents:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "split and enhance blog data",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set": {
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data"
}
}
]
}
Pipeline API usage:
DELETE tech_articles
PUT tech_articles/_doc/1
{
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data"
}
Creating a pipeline:
PUT _ingest/pipeline/article_pipeline
{
"description": "an article pipeline",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"set": {
"field": "views",
"value": 0
}
}
]
}
GET _ingest/pipeline/article_pipeline
Testing the pipeline:
POST _ingest/pipeline/article_pipeline/_simulate
{
"docs": [
{
"_source": {
"title": "Introducing cloud computing",
"tags": "openstack,k8s",
"body": "You know, for cloud"
}
}
]
}
Indexing with and without pipeline:
PUT tech_articles/_doc/1
{
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data"
}
PUT tech_articles/_doc/2?pipeline=article_pipeline
{
"title": "Introducing cloud computing",
"tags": "openstack,k8s",
"body": "You know, for cloud"
}
Query results:
POST tech_articles/_search
{}
Rebuilding existing documents with pipeline:
POST tech_articles/_update_by_query?pipeline=article_pipeline
{}
Adding query conditions:
POST tech_articles/_update_by_query?pipeline=article_pipeline
{
"query": {
"bool": {
"must_not": {
"exists": {
"field": "views"
}
}
}
}
}
Common built-in Processors:
- Split Processor: Splits field values into arrays
- Remove/Rename Processor: Removes or renames fields
- Append: Adds new values to fields
- Convert: Changes field types (e.g., string to float)
- Date/JSON: Date format conversion, string to JSON
- Date Index Name Processor: Routes documents to time-based indices
- Fail Processor: Returns custom error messages
- Foreach Processor: Applies processors to array elements
- Grok Processor: Parses log formats
- Gsub/Join/Split: String replacement, array conversions
- Lowercase/Uppercase: Case transformations
Painless Scripting
Introduced in Elasticsearch 5.x, Painless is purpose-built for Elasticsearch, extending Java syntax. From 6.0 onwards, Painless is the only supported scripting language.
Painless characteristics:
- High performance with security features
- Supports both explicit and dynamic typing
- Compatible with Java data types and API subsets
Painless use cases:
- Updating or removing fields
- Data aggregation operations
- Script Fields: Pre-computing returned fields
- Function Score: Modifying document relevance scoring
- Ingest Pipeline transformations
- Reindex and Update By Query operations
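Function Score is the one use case above without an example in this section. A minimal sketch, reusing the tech_articles index and its views counter from the examples that follow, folds popularity into relevance:

```
POST tech_articles/_search
{
  "query": {
    "function_score": {
      "query": {
        "match": {
          "body": "data"
        }
      },
      "script_score": {
        "script": {
          "source": "_score * (doc['views'].value + 1)"
        }
      }
    }
  }
}
```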
Example 1: Script Processor:
POST _ingest/pipeline/_simulate
{
"pipeline": {
"description": "process blog data",
"processors": [
{
"split": {
"field": "tags",
"separator": ","
}
},
{
"script": {
"source": """
if (ctx.containsKey("body")) {
  ctx.body_length = ctx.body.length();
} else {
  ctx.body_length = 0;
}
"""
}
},
{
"set": {
"field": "views",
"value": 0
}
}
]
},
"docs": [
{
"_index": "index",
"_id": "id",
"_source": {
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data"
}
}
]
}
Example 2: Document Update Counter:
DELETE tech_articles
PUT tech_articles/_doc/1
{
"title": "Introducing big data......",
"tags": "hadoop,elasticsearch,spark",
"body": "You know, for big data",
"views": 0
}
POST tech_articles/_update/1
{
"script": {
"source": "ctx._source.views += params.count",
"params": {
"count": 100
}
}
}
POST tech_articles/_search
{}
Example 3: Search Script Fields:
GET tech_articles/_search
{
"script_fields": {
"random_views": {
"script": {
"lang": "painless",
"source": """
java.util.Random rnd = new Random();
doc['views'].value+rnd.nextInt(1000);
"""
}
}
},
"query": {
"match_all": {}
}
}
Storing scripts in Cluster State:
POST _scripts/update_views
{
"script": {
"lang": "painless",
"source": "ctx._source.views += params.count"
}
}
POST tech_articles/_update/1
{
"script": {
"id": "update_views",
"params": {
"count": 1000
}
}
}
Elasticsearch Data Modeling Examples
What is Data Modeling?
Data modeling is the process of creating data models that abstractly describe the real world:
- Blog posts / authors / comments
- Mapping real-world entities to searchable documents
Three stages: Conceptual Model → Logical Model → Data Model
- Data model: Determines final field definitions based on specific database capabilities
- Balances functional requirements with performance needs
Field Type Selection
Text vs Keyword:
- Text: Full-text fields that get analyzed. Doesn't support aggregation or sorting by default; that requires enabling fielddata
- Keyword: For IDs, enumerations, and text that doesn't need tokenization. Ideal for filtering, sorting, and aggregations
Multi-field types:
- Dynamically mapped text fields automatically include a keyword sub-field
- Additional analyzers (English, pinyin, standard) improve search results for human language
Numeric types:
- Choose the smallest appropriate type (use byte instead of long when sufficient)
Enumerated types:
- Always use keyword type for better performance, even for numeric values
Other types: Date, boolean, geo information
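Putting those rules together, a hypothetical product catalog mapping might look like this (index and field names are illustrative):

```
PUT products
{
  "mappings": {
    "properties": {
      "product_id":  { "type": "keyword" },
      "status":      { "type": "keyword" },
      "rating":      { "type": "byte" },
      "on_sale":     { "type": "boolean" },
      "released":    { "type": "date" },
      "description": {
        "type": "text",
        "fields": {
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
}
```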
Search Configuration
If a field needs neither search nor aggregation:
- Set enabled to false
If search isn't needed but aggregation still is:
- Set index to false
Configuring search granularity:
- index_options: Controls how much detail (docs, frequencies, positions, offsets) the inverted index records
- norms: Disable when relevance scoring isn't needed
Aggregation and Sorting
If sorting and aggregation aren't needed:
- Set doc_values (or fielddata, for text fields) to false
Frequently updated fields with heavy aggregation:
- Set eager_global_ordinals to true
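The switches above are all mapping parameters; a sketch combining them in one index (index and field names are illustrative):

```
PUT my_logs
{
  "mappings": {
    "properties": {
      "raw_payload": { "type": "object", "enabled": false },
      "trace_id":    { "type": "keyword", "index": false },
      "message":     { "type": "text", "norms": false },
      "host":        { "type": "keyword", "doc_values": false },
      "status":      { "type": "keyword", "eager_global_ordinals": true }
    }
  }
}
```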
Storage Options
Store field data separately:
- Set store to true to keep the raw field content
- Typically used when _source is disabled
Disable _source: Saves disk space but limits functionality:
- No visibility into original document
- Reindex and Update operations become unavailable
- Kibana discovery features won't work
Practical Data Modeling Example
Optimizing field definitions:
PUT books/_doc/1
{
"title":"Mastering Elasticsearch 5.0",
"description":"Master the searching, indexing, and aggregation features in Elasticsearch",
"author":"Bharvi Dixit",
"published_date":"2017",
"cover_url":"https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}
GET books/_mapping
DELETE books
PUT books
{
"mappings": {
"properties": {
"author": {"type": "keyword"},
"cover_url": {"type": "keyword","index": false},
"description": {"type": "text"},
"published_date": {"type": "date"},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 100
}
}
}
}
}
}
The term query below fails because cover_url is not indexed (index: false); the aggregation that follows still succeeds, since doc_values remain enabled:
POST books/_search
{
"query": {
"term": {
"cover_url": {
"value": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}
}
}
}
POST books/_search
{
"aggs": {
"covers": {
"terms": {
"field": "cover_url",
"size": 10
}
}
}
}
Handling large content fields:
DELETE books
PUT books
{
"mappings": {
"_source": {
"enabled": false
},
"properties": {
"author": {
"type": "keyword",
"store": true
},
"cover_url": {
"type": "keyword",
"index": false,
"store": true
},
"description": {
"type": "text",
"store": true
},
"content": {
"type": "text",
"store": true
},
"published_date": {
"type": "date",
"store": true
},
"title": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 100
}
},
"store": true
}
}
}
}
PUT books/_doc/1
{
"title": "Mastering Elasticsearch 5.0",
"description": "Master the searching, indexing, and aggregation features in Elasticsearch",
"content": "The content of the book......Indexing data, aggregation, searching.",
"author": "Bharvi Dixit",
"published_date": "2017",
"cover_url": "https://images-na.ssl-images-amazon.com/images/I/51OeaMFxcML.jpg"
}
POST books/_search
{}
POST books/_search
{
"stored_fields": ["title","author","published_date"],
"query": {
"match": {
"content": "searching"
}
},
"highlight": {
"fields": {
"content":{}
}
}
}
Mapping Parameters Reference
- enabled: Set to false for storage-only fields without search or aggregation support
- index: Set to false to disable search while preserving aggregation capability
- norms: Disable when filtering and aggregation are the primary use cases
- doc_values: Enable for sorting and aggregation
- fielddata: Enable for text field sorting and aggregation
- store: Default false; stores raw field content separately from _source
- coerce: Default true; enables automatic type conversion (e.g., string to number)
- dynamic: true/false/strict controls automatic mapping updates
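For example, with dynamic set to strict, a document containing an unmapped field is rejected outright (index name is illustrative):

```
PUT strict_demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}

PUT strict_demo/_doc/1
{
  "title": "ok",
  "extra_field": "this write fails with strict_dynamic_mapping_exception"
}
```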
Script-based field updates:
POST legislation/_update_by_query
{
"track_total_hits": true,
"query": {
"term": {
"source_type": {
"value": "migrate"
}
}
},
"script": {
"source": "ctx._source.norm_citation = ctx._source.enactment_citation"
}
}
Modifying document IDs during reindex:
POST _reindex
{
"source": {
"index": "legislation_clean_dev"
},
"dest": {
"index": "legislation_clean_dev_test"
},
"script": {
"inline": "ctx._id = ctx._source['object_id']",
"lang": "painless"
}
}