Elasticsearch Document Operations, Mapping Configuration, and Search APIs

Document Lifecycle and Batch Operations

Index a document with auto-generated identifier:

POST inventory/_doc
{
  "item_code": "SKU-8842",
  "timestamp": "2023-08-21T14:35:22Z",
  "notes": "Initial stock entry"
}

Create with explicit ID, failing if document exists:

PUT inventory/_doc/8842?op_type=create
{
  "item_code": "SKU-8842",
  "timestamp": "2023-08-21T15:00:00Z",
  "notes": "Reserved inventory"
}

Alternative syntax for conditional creation:

PUT inventory/_create/8842
{
  "item_code": "SKU-8842",
  "timestamp": "2023-08-21T15:00:00Z",
  "notes": "Reserved inventory"
}

Retrieve by identifier:

GET inventory/_doc/8842

Full document replacement:

PUT inventory/_doc/8842
{
  "item_code": "SKU-8842",
  "status": "active"
}

Partial update with doc merging:

POST inventory/_update/8842
{
  "doc": {
    "last_updated": "2023-08-21T16:00:00Z",
    "warehouse": "East-01"
  }
}

Remove document:

DELETE inventory/_doc/8842

Bulk Processing

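Execute multiple operations in a single request. A bulk request is not atomic: each action succeeds or fails independently, and the response reports a result per item. The create action fails if the ID already exists, while index creates the document or replaces an existing one: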

POST _bulk
{"index":{"_index":"transactions","_id":"txn-001"}}
{"amount":150.00,"currency":"USD"}
{"delete":{"_index":"transactions","_id":"txn-099"}}
{"create":{"_index":"archive","_id":"txn-001"}}
{"amount":150.00,"archived":true}
{"update":{"_id":"txn-001","_index":"transactions"}}
{"doc":{"status":"processed"}}

On first execution, the index action creates txn-001 with version 1. Re-running the same bulk request makes index overwrite the existing document and increment its version, makes create fail with a version conflict (HTTP 409) because the ID already exists, and makes delete report not_found when the target document does not exist.
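
The bulk body is newline-delimited JSON: one action line, followed by a source line for actions that carry a payload, and the body must end with a newline. The following is a minimal Python sketch of assembling such a body and scanning a response for per-item failures. The helper names build_bulk_body and failed_items are hypothetical, and sample_response is hand-written to mirror the shape of a bulk response (an items array keyed by action type), not captured from a cluster.

```python
import json

def build_bulk_body(actions):
    """Serialize (action, source) pairs into an NDJSON bulk body.

    `source` is None for actions without a payload line, such as delete.
    """
    lines = []
    for action, source in actions:
        lines.append(json.dumps(action))
        if source is not None:
            lines.append(json.dumps(source))
    # The bulk endpoint requires the body to end with a newline.
    return "\n".join(lines) + "\n"

def failed_items(response):
    """Return (action_type, _id, status) for every item that carries an error."""
    failures = []
    for item in response.get("items", []):
        for action_type, result in item.items():
            if "error" in result:
                failures.append((action_type, result.get("_id"), result.get("status")))
    return failures

body = build_bulk_body([
    ({"index": {"_index": "transactions", "_id": "txn-001"}},
     {"amount": 150.0, "currency": "USD"}),
    ({"delete": {"_index": "transactions", "_id": "txn-099"}}, None),
    ({"create": {"_index": "archive", "_id": "txn-001"}},
     {"amount": 150.0, "archived": True}),
])

# Hand-written sample shaped like a bulk response (illustrative only).
sample_response = {
    "errors": True,
    "items": [
        {"index": {"_id": "txn-001", "status": 200}},
        {"delete": {"_id": "txn-099", "status": 404, "result": "not_found"}},
        {"create": {"_id": "txn-001", "status": 409,
                    "error": {"type": "version_conflict_engine_exception"}}},
    ],
}
print(body, end="")
print(failed_items(sample_response))
```

Because one failed action does not roll back the others, client code generally needs to inspect every item rather than rely on the HTTP status of the request as a whole.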

Multi-Document Retrieval

Fetch multiple documents across indices:

GET /_mget
{
  "docs": [
    {"_index": "transactions", "_id": "txn-001"},
    {"_index": "transactions", "_id": "txn-002"}
  ]
}

With implicit index context:

GET /transactions/_mget
{
  "docs": [
    {"_id": "txn-001"},
    {"_id": "txn-002"}
  ]
}

Control source field inclusion:

GET /_mget
{
  "docs": [
    {"_index": "transactions", "_id": "txn-001", "_source": false},
    {"_index": "transactions", "_id": "txn-002", "_source": ["amount", "currency"]},
    {"_index": "transactions", "_id": "txn-003", "_source": {"include": ["metadata"], "exclude": ["metadata.internal"]}}
  ]
}

Inverted Index Fundamentals

Elasticsearch utilizes inverted indices for high-performance full-text retrieval. This structure maintains a vocabulary of unique terms extracted from the document corpus, with each term mapping to a posting list containing document references and positional metadata.
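
The structure described above can be sketched in a few lines of Python. This is a toy model, assuming naive lowercase whitespace tokenization rather than a real analyzer; the function name build_inverted_index and the sample documents are illustrative only.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} (a posting list with positions)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "wireless noise canceling headphones",
    2: "wireless keyboard",
    3: "noise machine",
}
index = build_inverted_index(docs)

print(sorted(index["wireless"]))  # IDs of documents containing "wireless"
print(index["noise"])             # posting list including term positions
```

Looking up a term is a dictionary access instead of a scan over every document, which is why this layout suits full-text retrieval; the stored positions are what make phrase queries possible.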

Text Analysis and Tokenization

Built-in analyzers process text during indexing and search:

  • Standard: Grammar-aware tokenization with lowercase normalization
  • Simple: Non-letter character delimiting with lowercase conversion
  • Stop: Simple analyzer with stop word removal (articles, prepositions)
  • Whitespace: Space-delimited splitting preserving original case
  • Keyword: No-op analyzer treating input as single token
  • Pattern: Regular expression-based splitting (default: non-word characters)
  • Language: Language-specific tokenization with stemming (30+ languages available)

Standard Analyzer Behavior:

GET _analyze
{
  "analyzer": "standard",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

Produces tokens: the, quick, brown, fox, jumps, over, 3, lazy, dogs

Simple Analyzer Behavior:

GET _analyze
{
  "analyzer": "simple",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

Produces tokens: the, quick, brown, fox, jumps, over, lazy, dogs (numeric tokens excluded)

Stop Analyzer Behavior:

GET _analyze
{
  "analyzer": "stop",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

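Produces tokens: quick, brown, fox, jumps, over, lazy, dogs (removes the; over is not in the default English stop word list, and 3 is dropped because the underlying tokenizer keeps letters only)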

Whitespace Analyzer Behavior:

GET _analyze
{
  "analyzer": "whitespace",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

Preserves case and punctuation: The, Quick, Brown-Fox, jumps, over, 3, lazy, dogs!

Keyword Analyzer Behavior:

GET _analyze
{
  "analyzer": "keyword",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

Single token: The Quick Brown-Fox jumps over 3 lazy dogs!

English Analyzer with Stemming:

GET _analyze
{
  "analyzer": "english",
  "text": "The Quick Brown-Fox jumps over 3 lazy dogs!"
}

Produces stemmed tokens: quick, brown, fox, jump, over, 3, lazi, dog
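
The whitespace, simple, and stop behaviors shown earlier can be approximated in Python to make the differences concrete. This is a rough sketch, not the Lucene implementation: the function names are hypothetical, the tokenization is regex-based, and the stop word set is the default English list used by Lucene.

```python
import re

# Default English stop words used by Lucene's stop filter.
ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}

def whitespace_analyze(text):
    # Split on whitespace only; case and punctuation are preserved.
    return text.split()

def simple_analyze(text):
    # Keep runs of letters; digits and punctuation act as separators.
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]

def stop_analyze(text):
    # Simple analysis followed by stop word removal.
    return [t for t in simple_analyze(text) if t not in ENGLISH_STOP_WORDS]

text = "The Quick Brown-Fox jumps over 3 lazy dogs!"
print(whitespace_analyze(text))
print(simple_analyze(text))
print(stop_analyze(text))
```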

CJK Text Processing (icu_analyzer requires the analysis-icu plugin):

POST _analyze
{
  "analyzer": "icu_analyzer",
  "text": "搜索引擎技术"
}

Segments text into meaningful units: 搜索, 引擎, 技术

By comparison, the standard analyzer splits the same CJK text into single-character tokens: 搜, 索, 引, 擎, 技, 术

Search APIs

URI-based Search:

GET ecommerce/_search?q=customer_name:Alice
GET ecommerce*/_search?q=status:pending
GET /_all/_search?q=amount:>100

Request Body Search:

POST sales/_search
{
  "query": {
    "match": {
      "description": "laptop computer"
    }
  }
}

Relevance Metrics:

  • Precision: Fraction of retrieved documents that are relevant
  • Recall: Fraction of all relevant documents that are retrieved
  • Ranking: Ordering by relevance score
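
As a worked example of the first two metrics, the following Python sketch computes precision and recall for one query. The document IDs and the relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query's result set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Four documents returned, three documents actually relevant, two in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d7"],
)
print(precision)  # 2 of 4 retrieved documents are relevant
print(recall)     # 2 of 3 relevant documents were retrieved
```

Broadening a query usually raises recall at the cost of precision, which is why analyzer and operator choices in the following sections matter for result quality.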

URI Search Syntax

Query string parameters:

GET /products/_search?q=wireless&df=name&sort=price:asc&from=0&size=20&timeout=1s
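  • q: Query expression using Query String Syntax
  • df: Default field used for terms that name no field (all fields are searched if omitted)
  • sort: Sort order, such as price:asc
  • from, size: Pagination offset and page size
  • timeout: Maximum time to wait for the search to complete
  • profile: Include query execution timing details in the response (not used in the example above)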

Field Specification vs Generic Search:

GET /products/_search?q=name:headphones
GET /products/_search?q=headphones

Term vs Phrase Queries:

GET /products/_search?q=name:wireless headphones

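Executes as name:wireless OR headphones. Only the first term is bound to the name field; headphones is searched against the default field. Group the terms as name:(wireless headphones) to apply the field to both.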

GET /products/_search?q=name:"wireless headphones"

Executes as phrase match requiring exact word adjacency and order.

Boolean Logic:

GET /products/_search?q=name:(wireless AND headphones)
GET /products/_search?q=name:(wireless NOT bluetooth)
GET /products/_search?q=name:(+wireless +noise-canceling)

Range Queries:

GET /products/_search?q=price:>100
GET /products/_search?q=created:[2023-01-01 TO 2023-12-31}
GET /products/_search?q=stock:[10 TO *]

Wildcard and Fuzzy Matching:

GET /products/_search?q=name:head*
GET /products/_search?q=name:headphon~1
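
When these requests are sent from code rather than typed into a console, the q value needs percent-encoding: in a query string a literal + is typically decoded as a space, so operators such as +wireless would be lost. A minimal Python sketch using only the standard library follows; the helper name uri_search_path is hypothetical.

```python
from urllib.parse import urlencode, parse_qs

def uri_search_path(index, query, **params):
    """Build a URI search path with a percent-encoded query string."""
    return f"/{index}/_search?" + urlencode({"q": query, **params})

path = uri_search_path("products", "name:(+wireless +noise-canceling)", size=20)
print(path)

# Decoding the query string again recovers the original expression.
decoded = parse_qs(path.split("?", 1)[1])
print(decoded["q"][0])
```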

Query DSL

Match query with OR logic (default):

POST articles/_search
{
  "query": {
    "match": {
      "content": "machine learning"
    }
  }
}

Match query with AND logic:

POST articles/_search
{
  "query": {
    "match": {
      "content": {
        "query": "machine learning",
        "operator": "and"
      }
    }
  }
}
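
The difference between the two operators can be sketched as set logic over analyzed terms. This toy Python model covers matching only, not relevance scoring, and assumes lowercase whitespace tokenization in place of a real analyzer; the function names are illustrative.

```python
def analyze(text):
    # Stand-in for an analyzer: lowercase and split on whitespace.
    return text.lower().split()

def match(doc_text, query_text, operator="or"):
    """Return True if the document satisfies the match query."""
    doc_terms = set(analyze(doc_text))
    query_terms = analyze(query_text)
    if operator == "and":
        return all(term in doc_terms for term in query_terms)
    return any(term in doc_terms for term in query_terms)

doc = "Intro to machine translation"
print(match(doc, "machine learning"))                  # "machine" alone is enough
print(match(doc, "machine learning", operator="and"))  # "learning" is missing
```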

Phrase matching:

POST articles/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "artificial intelligence"
      }
    }
  }
}

Phrase with slop (word distance tolerance):

POST articles/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "artificial general",
        "slop": 2
      }
    }
  }
}

Multi-field phrase match:

POST articles/_search
{
  "query": {
    "multi_match": {
      "query": "neural networks",
      "type": "phrase",
      "fields": ["title", "abstract", "content"],
      "slop": 1
    }
  }
}

Query String Syntax:

POST users/_search
{
  "query": {
    "query_string": {
      "default_field": "bio",
      "query": "(Java AND Spring) OR (Python AND Django)"
    }
  }
}

Simple Query String:

POST users/_search
{
  "query": {
    "simple_query_string": {
      "query": "developer engineer",
      "fields": ["title", "skills"],
      "default_operator": "AND"
    }
  }
}

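Note: simple_query_string supports a limited operator set (+, |, -, quotes, *, parentheses, and ~N) and never returns a syntax error; invalid parts of the query are ignored. Keywords such as AND, OR, and NOT are not operators here and are treated as ordinary search terms.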

Mapping Configuration

Dynamic Mapping Behavior:

PUT sensor_data/_doc/1
{
  "device_id": "dev-001",
  "temperature": 23.5,
  "is_active": true,
  "metadata": {
    "location": "building-a"
  }
}

Elasticsearch infers:

  • device_id: text with keyword subfield
  • temperature: float
  • is_active: boolean
  • metadata: object
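
The inference rules listed above can be summarized as a small lookup on the JSON value type. This Python sketch is a simplification: the function name infer_field_type is hypothetical, and it ignores date and numeric detection inside strings, arrays, and null values.

```python
def infer_field_type(value):
    """Return the field type dynamic mapping would pick for a JSON value."""
    # bool must be tested before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "text with keyword subfield"
    if isinstance(value, dict):
        return "object"
    raise ValueError(f"unsupported value in this sketch: {value!r}")

document = {
    "device_id": "dev-001",
    "temperature": 23.5,
    "is_active": True,
    "metadata": {"location": "building-a"},
}
inferred = {field: infer_field_type(value) for field, value in document.items()}
print(inferred)
```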

Dynamic Mapping Controls:

  • dynamic: true (default) - New fields are indexed and searchable
  • dynamic: false - New fields are ignored in indexing but stored in _source
  • dynamic: strict - Documents with unrecognized fields are rejected (HTTP 400)

PUT strict_index
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}
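
The three dynamic settings can be modeled as follows. This is a behavioral sketch only, with a hypothetical function name and a plain ValueError standing in for the error Elasticsearch returns; it tracks top-level field names and nothing else.

```python
def apply_dynamic_setting(mapped_fields, document, dynamic="true"):
    """Return (indexed_fields, source) for a document under a dynamic setting."""
    unknown = [field for field in document if field not in mapped_fields]
    if dynamic == "strict" and unknown:
        # Stand-in for the HTTP 400 rejection.
        raise ValueError(f"unmapped fields not allowed: {unknown}")
    if dynamic == "true":
        indexed = set(document)                       # new fields get mapped
    else:
        indexed = set(document) & set(mapped_fields)  # "false": known fields only
    # _source always keeps the full original document.
    return indexed, dict(document)

doc = {"title": "Sensor report", "extra": 1}
print(apply_dynamic_setting({"title"}, doc, dynamic="true"))
print(apply_dynamic_setting({"title"}, doc, dynamic="false"))
```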

Explicit Mapping Definition:

PUT customer_profiles
{
  "mappings": {
    "properties": {
      "email": {
        "type": "keyword",
        "index": false
      },
      "full_name": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword"
          }
        }
      },
      "phone": {
        "type": "keyword",
        "null_value": "N/A"
      },
      "join_date": {
        "type": "date",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Copy_to for Composite Fields:

PUT contact_directory
{
  "mappings": {
    "properties": {
      "first_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "last_name": {
        "type": "text",
        "copy_to": "full_name"
      },
      "full_name": {
        "type": "text"
      }
    }
  }
}

Search across both fields:

GET contact_directory/_search?q=full_name:(John Smith)

Array Handling:

Arrays require no special mapping configuration. Any field accepting single values accepts multiple values:

PUT tags/_doc/1
{
  "category": "electronics",
  "labels": ["gadget", "mobile", "wireless"]
}

Custom Analysis Pipelines

Character Filters:

Strip HTML markup:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": ["html_strip"],
  "text": "<p>Device configuration</p>"
}

Mapping character filter for replacing character sequences before tokenization:

POST _analyze
{
  "tokenizer": "standard",
  "char_filter": [
    {
      "type": "mapping",
      "mappings": ["- => _", ":) => positive", ":( => negative"]
    }
  ],
  "text": "State-of-the-art :)"
}

Tokenizers:

Path hierarchy tokenizer:

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/var/log/elasticsearch/cluster/nodes"
}

Produces: /var, /var/log, /var/log/elasticsearch, etc.
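
The full token list can be reproduced with a short Python sketch. It assumes the default / delimiter and emits one token per path prefix; the function name path_hierarchy_tokens is illustrative.

```python
def path_hierarchy_tokens(path, delimiter="/"):
    """Emit one token for each ancestor prefix of the path, shortest first."""
    parts = [part for part in path.split(delimiter) if part]
    prefix = delimiter if path.startswith(delimiter) else ""
    return [prefix + delimiter.join(parts[: i + 1]) for i in range(len(parts))]

for token in path_hierarchy_tokens("/var/log/elasticsearch/cluster/nodes"):
    print(token)
```

Indexing every prefix is what lets a query for /var/log match documents stored anywhere below that directory.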

Token Filters:

Snowball stemmer with stop word removal:

GET _analyze
{
  "tokenizer": "whitespace",
  "filter": ["lowercase", "stop", "snowball"],
  "text": "The computers are computing computational problems"
}

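Produces: comput, comput, comput, problem (the and are are removed as stop words; computers, computing, and computational all stem to comput)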

Custom Analyzer Definition:

PUT log_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "log_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["html_strip"],
          "filter": ["lowercase", "stop"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "message": {
        "type": "text",
        "analyzer": "log_analyzer"
      }
    }
  }
}
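
An analyzer always runs its stages in the same order: character filters, then the tokenizer, then token filters. The sketch below imitates log_analyzer in Python to show that ordering. It is an approximation under stated assumptions: regexes stand in for html_strip and the standard tokenizer, and STOP_WORDS is a small illustrative subset of the English stop list.

```python
import re

# Illustrative subset of English stop words, not the full default list.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to"}

def html_strip(text):
    # Character filter stage: replace tags with spaces before tokenizing.
    return re.sub(r"<[^>]+>", " ", text)

def tokenize(text):
    # Tokenizer stage: rough stand-in for the standard tokenizer.
    return re.findall(r"\w+", text)

def log_analyzer(text):
    tokens = tokenize(html_strip(text))
    tokens = [token.lower() for token in tokens]             # lowercase filter
    return [token for token in tokens if token not in STOP_WORDS]  # stop filter

print(log_analyzer("<p>The Device configuration is ready</p>"))
```

Filter order matters: lowercase runs before stop so that capitalized words such as The are still recognized as stop words.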

Index Templates

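Index templates apply settings and mappings to new indices whose names match the template's patterns. The examples here use the legacy _template API with order; since Elasticsearch 7.8 the composable _index_template API, which uses priority instead, is preferred: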

PUT _template/base_configuration
{
  "index_patterns": ["logs-*", "metrics-*"],
  "order": 0,
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic_templates": [
      {
        "strings_as_keywords": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "keyword"
          }
        }
      }
    ]
  }
}

Template inheritance by order (higher values override lower):

PUT _template/priority_configuration
{
  "index_patterns": ["logs-critical-*"],
  "order": 1,
  "settings": {
    "number_of_replicas": 2,
    "index.refresh_interval": "1s"
  }
}
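
The override behavior can be illustrated by merging the settings of the two templates in ascending order. This Python sketch models only flat settings with * wildcards; the function name resolve_settings and the index names are hypothetical.

```python
from fnmatch import fnmatchcase

templates = [
    {"index_patterns": ["logs-*", "metrics-*"], "order": 0,
     "settings": {"number_of_shards": 2, "number_of_replicas": 1}},
    {"index_patterns": ["logs-critical-*"], "order": 1,
     "settings": {"number_of_replicas": 2, "index.refresh_interval": "1s"}},
]

def resolve_settings(templates, index_name):
    """Merge settings of all matching templates; higher order wins on conflict."""
    matching = [
        template for template in templates
        if any(fnmatchcase(index_name, pattern) for pattern in template["index_patterns"])
    ]
    merged = {}
    for template in sorted(matching, key=lambda template: template["order"]):
        merged.update(template["settings"])
    return merged

print(resolve_settings(templates, "logs-critical-2023"))
print(resolve_settings(templates, "metrics-cpu"))
```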

Dynamic Templates

Dynamic templates control how newly seen fields are mapped, based on the detected data type, the field name, or the full field path:

PUT dynamic_content
{
  "mappings": {
    "dynamic_templates": [
      {
        "boolean_flags": {
          "match_mapping_type": "string",
          "match": "is_*",
          "unmatch": "*_text",
          "mapping": {
            "type": "boolean"
          }
        }
      },
      {
        "text_content": {
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    ]
  }
}
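
Templates are evaluated in order and the first one whose conditions all hold is applied. The following Python sketch mirrors that selection for the two templates above, assuming simple * wildcards; the function name pick_template is illustrative.

```python
from fnmatch import fnmatchcase

dynamic_templates = [
    ("boolean_flags", {"match_mapping_type": "string", "match": "is_*",
                       "unmatch": "*_text", "mapping": {"type": "boolean"}}),
    ("text_content", {"match_mapping_type": "string",
                      "mapping": {"type": "text"}}),
]

def pick_template(templates, field_name, detected_type):
    """Return (name, mapping) of the first matching template, or None."""
    for name, rule in templates:
        if rule.get("match_mapping_type", detected_type) != detected_type:
            continue
        if "match" in rule and not fnmatchcase(field_name, rule["match"]):
            continue
        if "unmatch" in rule and fnmatchcase(field_name, rule["unmatch"]):
            continue
        return name, rule["mapping"]
    return None  # fall back to default dynamic mapping

print(pick_template(dynamic_templates, "is_enabled", "string"))
print(pick_template(dynamic_templates, "is_note_text", "string"))
print(pick_template(dynamic_templates, "count", "long"))
```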

Path-based matching:

PUT hierarchical_data
{
  "mappings": {
    "dynamic_templates": [
      {
        "path_based_copy": {
          "path_match": "user.*",
          "path_unmatch": "user.password",
          "mapping": {
            "type": "text",
            "copy_to": "user_profile"
          }
        }
      }
    ]
  }
}

Aggregation Framework

Aggregations enable data summarization and analytics:

Bucket Aggregation:

GET orders/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword"
      }
    }
  }
}

Metric Aggregations within Buckets:

GET orders/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": {
        "field": "category.keyword"
      },
      "aggs": {
        "avg_price": {
          "avg": {
            "field": "price"
          }
        },
        "max_price": {
          "max": {
            "field": "price"
          }
        },
        "min_price": {
          "min": {
            "field": "price"
          }
        }
      }
    }
  }
}
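
To make the bucket and metric relationship concrete, the same computation can be written in plain Python over an in-memory list. The sample orders are made up, and the function name terms_with_metrics is hypothetical; the result keys are named after the aggregations in the request above.

```python
from collections import defaultdict

orders = [
    {"category": "audio", "price": 100.0},
    {"category": "audio", "price": 300.0},
    {"category": "video", "price": 50.0},
]

def terms_with_metrics(docs, bucket_field, metric_field):
    """Group documents into buckets, then compute metrics inside each bucket."""
    buckets = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc[metric_field])
    results = [
        {"key": key, "doc_count": len(values),
         "avg_price": sum(values) / len(values),
         "max_price": max(values), "min_price": min(values)}
        for key, values in buckets.items()
    ]
    # A terms aggregation orders buckets by doc_count, largest first, by default.
    return sorted(results, key=lambda bucket: bucket["doc_count"], reverse=True)

for bucket in terms_with_metrics(orders, "category", "price"):
    print(bucket)
```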

Nested Sub-aggregations:

GET orders/_search
{
  "size": 0,
  "aggs": {
    "by_region": {
      "terms": {
        "field": "region"
      },
      "aggs": {
        "price_stats": {
          "stats": {
            "field": "total_amount"
          }
        },
        "by_payment_method": {
          "terms": {
            "field": "payment_type",
            "size": 5
          }
        }
      }
    }
  }
}

Schema Evolution

Add fields to existing mappings:

PUT existing_index/_mapping
{
  "properties": {
    "processed_date": {
      "type": "date",
      "format": "yyyy-MM-dd"
    },
    "processing_time": {
      "type": "integer"
    }
  }
}

Performance Tuning

Script Compilation Limits:

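If scripted bulk updates fail with max_compilations_rate errors, the compilation rate limit can be raised. Passing changing values through script params, so that the script source stays constant and compiles once, avoids the problem at its source: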

PUT _cluster/settings
{
  "transient": {
    "script.max_compilations_rate": "1000/1m"
  }
}

Result Window Expansion:

For deep pagination requirements:

PUT large_dataset/_settings
{
  "index": {
    "max_result_window": 100000
  }
}

Alternatively, use search_after or scroll APIs for deep pagination instead of expanding max_result_window.
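
The idea behind search_after is to resume from the sort value of the last hit rather than skip an ever-growing offset. The Python sketch below imitates this over an in-memory list; it assumes unique sort values (a real query needs a tiebreaker field for that), and the function names are hypothetical.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window

def within_result_window(from_, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """from + size must not exceed the result window for offset pagination."""
    return from_ + size <= max_result_window

def search_after_pages(sorted_docs, page_size, sort_key):
    """Yield pages by resuming after the last seen sort value, not by offset."""
    last_value = None
    while True:
        page = [
            doc for doc in sorted_docs
            if last_value is None or sort_key(doc) > last_value
        ][:page_size]
        if not page:
            return
        yield page
        last_value = sort_key(page[-1])

docs = [{"id": n} for n in range(1, 8)]
for page in search_after_pages(docs, page_size=3, sort_key=lambda doc: doc["id"]):
    print([doc["id"] for doc in page])

print(within_result_window(9_990, 20))  # offset request that crosses the window
```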

Total Hit Count Accuracy:

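By default, the reported total hit count is exact only up to 10,000; beyond that it is a lower bound. Set track_total_hits to true for an exact count: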

POST large_dataset/_search
{
  "track_total_hits": true,
  "query": {
    "match_all": {}
  }
}
